Atanu Maity

Text Mining and Analysis in R - Use Case (Part-1)


TEXT ANALYTICS OF FLIPKART REVIEWS FOR SONY XPERIA Z

Contents:

  • Problem Definition

  • Methodology

  • Analysis

  • Recommendation

Problem Definition:

Detailed analysis of Flipkart reviews of the Sony Xperia Z.


Methodology


Data Collection:

Flipkart reviews of the product written by various users.

Each review is treated as independent of the others.

Sampling Design:

A total of 110 reviews of the product were collected from Flipkart for the analysis.

Of the 458 reviews available, the first 110 were used.


Text Mining Methods used for Analysis:

Word Cloud

Latent Semantic Analysis (LSA)

Support Vector Machine

Sentiment Analysis


Tool Used:

RStudio


Analysis


Review Extraction from Flipkart:

Packages Needed: RCurl, XML, rvest, tm, wordcloud


First we build an anchorlist and a doclist: the anchorlist holds the links to the pages from which the reviews will be extracted, and the doclist holds the contents of those pages.


code:
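A minimal sketch of this step in base R, not necessarily the code that produced the output below. It assumes that each review page shows 10 reviews and that pages are addressed through the `start` offset visible in the links; the host name and the helper `fetch_pages` are assumptions made for illustration, and the site's current markup and URLs may differ.

```r
# Review path taken from the links in the output, with the stray space removed.
review_path <- paste0(
  "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS",
  "?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all",
  "&type=all&sort=most_helpful&start="
)

# 11 pages x 10 reviews per page = 110 reviews (offsets 0, 10, ..., 100).
starts <- seq(0, 100, by = 10)
anchorlist <- as.list(paste0(review_path, starts))

# Hypothetical helper: download every page and keep its raw HTML as one string.
# The default host is an assumption, not taken from the article.
fetch_pages <- function(paths, host = "https://www.flipkart.com") {
  lapply(paths, function(p) {
    con <- url(paste0(host, p))
    on.exit(close(con))
    paste(readLines(con, warn = FALSE), collapse = "\n")
  })
}

# doclist <- fetch_pages(anchorlist)  # needs network access, so not run here
```

Generating the offsets directly keeps the pages in order; collecting the pagination links from the first page, as the printed anchorlist suggests, would be an alternative.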


Output:

anchorlist

[[1]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=10"

[[2]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=20"

[[3]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=30"

[[4]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=40"

[[5]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=50"

[[6]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=60"

[[7]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=0"

[[8]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=70"

[[9]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=80"

[[10]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=90"

[[11]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=100"

[[12]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=110"

[[13]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=120"

[[14]]
[1] "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=130"

Doclist:


The extracted doclist can be viewed from the following link https://drive.google.com/file/d/0B_Y-QpyzAwhSUkZ1Wmg0bWNWRWM/view?usp=sharing


Now from this doclist we have extracted the ‘reviews’ on which the analysis will be done.


Review Extraction:




Output:


A total of 110 reviews have been extracted from 11 docs (10 reviews per page). The review list can be found at the following link:

https://drive.google.com/file/d/0B_Y-QpyzAwhSU0Y3UHFHeFdmYVk/view?usp=sharing


Filtering of the review contents:

We removed numbers, punctuation and stopwords from the reviews and converted all letters to lowercase.


code:
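A minimal sketch of the filtering step in base R, given as an illustration rather than as the original listing. The function name `clean_review`, the sample reviews and the very short stopword list are all assumptions; the tm package listed above provides `removeNumbers`, `removePunctuation`, `removeWords` and `stopwords("english")` for the same job on a corpus.

```r
# Hypothetical helper: lowercase, drop digits and punctuation, drop stopwords.
clean_review <- function(text, stop_words) {
  text  <- tolower(text)
  text  <- gsub("[[:digit:]]+", " ", text)   # remove numbers
  text  <- gsub("[[:punct:]]+", " ", text)   # remove punctuation
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]
  paste(words, collapse = " ")
}

# Tiny illustrative stopword list (a real list is much longer).
stop_words <- c("the", "is", "a", "an", "and", "of", "it", "this", "but", "in")

# Made-up sample reviews, not taken from the Flipkart data.
reviews <- c("The camera is GREAT, 5 stars!", "Battery died in 2 days... bad.")

cleaned <- vapply(reviews, clean_review, character(1),
                  stop_words = stop_words, USE.NAMES = FALSE)
print(cleaned)
```

Replacing digits and punctuation with a space, rather than deleting them, avoids gluing neighbouring words together.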


Making of WordCloud:

As a first step we built a term-document matrix from the reviews, with words in rows and documents in columns. Using that matrix we then made the word cloud.


code:
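A small sketch of how such a matrix and the word frequencies behind a word cloud can be built, assuming the reviews are already cleaned. The three sample documents are made up, and the matrix is built by hand in base R to show its shape; in tm, `TermDocumentMatrix` does this on a corpus. The plot is drawn only if the wordcloud package is installed.

```r
# Made-up, already cleaned documents.
docs <- c("camera great stars",
          "battery died days bad",
          "great battery great display")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per word, one column per document.
tdm <- vapply(tokens,
              function(tok) tabulate(match(tok, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms
colnames(tdm) <- paste0("doc", seq_along(docs))

# Total frequency of each word across all documents, most frequent first.
freq <- sort(rowSums(tdm), decreasing = TRUE)
print(freq)

# Frequent words are drawn bigger in the cloud.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(freq), freq = freq, min.freq = 1)
}
```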


Output:


Inference from the WordCloud output:

  1. Words such as Phone, Sony, Xperia, Water, Good, Camera, Battery, Service, Display, Screen and Quality are important: they appear bigger and bolder than the other words, which means they occur more often in the reviews.

  2. Reviewers repeatedly mention screen, battery, camera and display, so these are the features they talk about most; together with the prominence of "good" and "quality", this suggests that the quality and overall performance of the phone may have been satisfactory.

  3. There might have been some issue with water for this Sony model.

  4. There might also have been some comparison of Sony with Samsung and HTC handsets.

Rating Extraction and Analysis of Ratings


We extracted the rating given to the Sony Xperia Z in each of the first 110 reviews, checked that every reviewer had given a rating, and then carried out the analysis on those ratings.

Rating Extraction:


code:


Output:

missingRating>



























From this output we can see that no rating is missing on any page, i.e. every reviewer has given a rating. The 110 extracted ratings are as follows:

Ratings>

  [1] "5 stars" "1 star" "5 stars" "5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars"
 [10] "5 stars" "3 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "1 star"
 [19] "4 stars" "5 stars" "4 stars" "1 star" "1 star" "5 stars" "1 star" "2 stars" "5 stars"
 [28] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars"
 [37] "5 stars" "5 stars" "1 star" "1 star" "3 stars" "5 stars" "5 stars" "5 stars" "1 star"
 [46] "2 stars" "1 star" "5 stars" "4 stars" "4 stars" "2 stars" "4 stars" "5 stars" "3 stars"
 [55] "1 star" "5 stars" "4 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "5 stars"
 [64] "5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars" "5 stars" "5 stars" "5 stars"
 [73] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "4 stars" "5 stars" "5 stars" "1 star"
 [82] "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars"
 [91] "1 star" "5 stars" "5 stars" "5 stars" "4 stars" "4 stars" "1 star" "1 star" "1 star"
[100] "1 star" "4 stars" "5 stars" "5 stars" "1 star" "1 star" "5 stars" "1 star" "1 star"
[109] "5 stars" "5 stars"

The extracted ratings are character strings, but for our calculations we need them as numbers, so we converted them as follows:


code:
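One way to do this conversion, shown as a sketch rather than as the original listing: strip the trailing " star" or " stars" and convert what is left to a number. The input below is just the first nine ratings from the output above.

```r
# First nine extracted ratings, as character strings.
Ratings <- c("5 stars", "1 star", "5 stars", "5 stars", "4 stars",
             "5 stars", "1 star", "1 star", "5 stars")

# Drop the " star" / " stars" suffix, then convert to numeric.
finalRating <- as.numeric(sub(" stars?$", "", Ratings))
print(finalRating)
```

The pattern `" stars?$"` matches both the singular and the plural form, so "1 star" and "5 stars" are handled by the same call.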

Output:

finalRating

 [1] 5 1 5 5 4 5 1 1 5 5 3 5 5 5 5 5 1 1 4 5 4 1 1 5 1 2 5 5 5 5 5 5 5 5 5 5 5 5 1 1 3 5 5 5 1 2 1
[48] 5 4 4 2 4 5 3 1 5 4 5 5 5 5 1 5 5 4 5 1 1 5 5 5 5 5 5 5 5 5 4 5 5 1 5 5 5 5 5 5 5 5 5 1 5 5 5
[95] 4 4 1 1 1 1 4 5 5 1 1 5 1 1 5 5

We then defined the satisfaction level of each user as follows: if the user gave a rating above 3, he/she is satisfied with the product; if the rating is 3 or below, he/she is dissatisfied.


code:
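The rule can be written in one line with `ifelse`. This is a sketch of the rule as stated, using the first twelve numeric ratings from the output above as input.

```r
# First twelve numeric ratings.
finalRating <- c(5, 1, 5, 5, 4, 5, 1, 1, 5, 5, 3, 5)

# Rating above 3 -> satisfied; rating of 3 or below -> dissatisfied.
satisfaction <- ifelse(finalRating > 3, "satisfied", "dissatisfied")
print(satisfaction)

# Count of reviews at each satisfaction level.
print(table(satisfaction))
```

Note that a rating of exactly 3 falls on the dissatisfied side, as the definition requires.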


Output:

  [1] "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
  [7] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied" "satisfied"
 [13] "satisfied" "satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied"
 [19] "satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied" "satisfied"
 [25] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [31] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [37] "satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied"
 [43] "satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied"
 [49] "satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied"
 [55] "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [61] "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [67] "dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [73] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [79] "satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied"
 [85] "satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [91] "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
 [97] "dissatisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" "satisfied"
[103] "satisfied" "dissatisfied" "dissatisfied" "satisfied" "dissatisfied" "dissatisfied"
[109] "satisfied" "satisfied"

Next we built a data frame combining the document-term matrix with the satisfaction level, so that each document (i.e. each review) has a satisfaction level attached. On this data frame we fitted a classification model using the SVM (Support Vector Machine) technique, so that from the words in a review we can predict whether or not the user is satisfied with the product.


code:
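A rough sketch of this step on a toy example. The four-row document-term matrix and its three words are made up, and both the use of the e1071 package and the linear kernel are assumptions, since the article does not say which SVM implementation or kernel was used. The model is fitted only if e1071 is installed.

```r
# Toy document-term matrix: one row per review, one column per word (made-up counts).
dtm <- matrix(c(2, 0, 1,
                0, 2, 0,
                1, 0, 2,
                0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("great", "bad", "battery")))

# Attach the satisfaction level of each review as the class label.
reviews_df <- data.frame(
  dtm,
  satisfaction = factor(c("satisfied", "dissatisfied",
                          "satisfied", "dissatisfied"))
)

# Fit an SVM classifier on the word counts (assumed: e1071, linear kernel).
if (requireNamespace("e1071", quietly = TRUE)) {
  model <- e1071::svm(satisfaction ~ ., data = reviews_df,
                      kernel = "linear", scale = FALSE)
  predicted <- predict(model, reviews_df)
  print(table(predicted, actual = reviews_df$satisfaction))
}
```

Predicting on the training rows, as here, only checks that the model runs; a held-out set of reviews would be needed to judge how well it generalises.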


Now we want to know the variables (i.e the words) which are most and least important in predicting the satisfaction level.


code:

We obtained an importance matrix of order 2923x2: one column holds the words and the other holds the importance of the word in that row. The words are arranged in increasing order of importance.
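Once such a table exists, reading off the least and most important words is a matter of sorting. The sketch below uses placeholder words and made-up scores purely to show the ordering; it does not reproduce the article's importance values or the way they were computed.

```r
# Placeholder importance table (hypothetical words and scores).
importance <- data.frame(
  word       = c("word_a", "word_b", "word_c", "word_d", "word_e", "word_f"),
  importance = c(0.9, 0.1, 0.7, 0.2, 0.8, 0.3),
  stringsAsFactors = FALSE
)

# Arrange the words in increasing order of importance.
ordered <- importance[order(importance$importance), ]

least_important <- head(ordered, 3)  # top of the table: least important
most_important  <- tail(ordered, 3)  # bottom of the table: most important

print(least_important)
print(most_important)
```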


The 5 least important words:









Warranty, said, centre, cost, customers and working are the least important words; their importance in predicting the satisfaction level is shown in the table above.

The 5 most important words:










Battery, quality, great, best, camera and awesome are the most important words; their importance in predicting the satisfaction level is shown in the table above.


Sentiment Analysis


Now we want to know the overall sentiment of the reviewers/users towards this particular product. Is it positive or negative? What is the average sentiment score?

For this purpose we computed a polarity score for each review, which tells us whether the review is polar or neutral, and then calculated the average polarity score over all 110 reviews. If the average is positive, the users' sentiment towards the product is positive; if it is negative, the sentiment is negative.


code:
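A simplified sketch of a lexicon-based polarity score, to make the idea concrete. It is not the routine that produced the output below: the two word lists are tiny illustrative assumptions, the sample reviews are made up, and dividing by the square root of the review length is one normalisation choice among several.

```r
# Tiny illustrative lexicons (real sentiment lexicons hold thousands of words).
positive_words <- c("happy", "glad", "well", "like", "good", "great")
negative_words <- c("bad", "poor", "worst", "problem")

# Polarity = (positive hits - negative hits) / sqrt(number of words).
polarity_score <- function(text) {
  text  <- tolower(gsub("[^[:alpha:][:space:]]", " ", text))
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)
  (sum(words %in% positive_words) - sum(words %in% negative_words)) /
    sqrt(length(words))
}

# Made-up sample reviews.
reviews <- c("Great phone, very happy with the camera",
             "Battery is bad",
             "Delivered on time")

scores <- vapply(reviews, polarity_score, numeric(1), USE.NAMES = FALSE)
avg_polarity <- mean(scores)

print(scores)        # > 0 positive, < 0 negative, 0 neutral
print(avg_polarity)  # sign gives the overall sentiment
```

A review with no lexicon words scores 0 and counts as neutral; the sign of the average then gives the overall sentiment, as described above.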

Output:

Here we can see that the 11 documents contain 12830 words and 1807 sentences. Positive words such as happy, glad, well, like and willing appear in the documents. The average polarity over all the documents is 0.111, which is positive. So we can infer that the users' overall sentiment towards the product is positive.


Polarity plot:

From this polarity plot we can see that the average polarity, shown as a cross, is above 0 and that most of the dense part of the plot also lies above 0. So the plot also suggests that the sentiment is positive on average.


Summary:


In this Part-1 we have covered: 1) Text Extraction 2) Building a Word Cloud 3) Rating Extraction and Classification 4) Sentiment Analysis.


In Part-2 we will cover: 5) Latent Semantic Analysis 6) Topic Modelling 7) Text Clustering, and a few more related and essential topics.


Till then Happy Reading. :)


#DataScienceinR #DataScience #TextMining
