Atanu Maity
Text Mining and Analysis in R - Use Case (Part-1)
TEXT ANALYTICS OF FLIPKART REVIEWS FOR SONY XPERIA Z
Contents:
Problem Definition
Methodology
Analysis
Recommendation
Problem Definition:
Detailed analysis of Flipkart reviews of the Sony Xperia Z.
Methodology
Data Collection:
Flipkart reviews of the product written by various users.
Each review is independent of the others.
Sampling Design:
A total of 110 reviews of the product were collected from Flipkart for the analysis.
Of the 458 reviews available, the first 110 were taken for our purpose.
Text Mining Methods used for Analysis:
Word Cloud
Latent Semantic Analysis (LSA)
Support Vector Machine
Sentiment Analysis
Tool Used:
RStudio
Analysis
Review Extraction from Flipkart:
Packages Needed: RCurl, XML, rvest, tm, wordcloud
First we build an anchorlist and a doclist. The anchorlist contains the links to the pages from which the reviews are extracted, and the doclist contains the contents of those pages.
code:
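A minimal sketch of this step, not the original listing. It assumes the review pages are paginated in steps of 10 through the `start` parameter, as the anchorlist output suggests, and it constructs the links directly instead of scraping them from the page. The helper names `build_review_urls` and `fetch_doclist` and the host `https://www.flipkart.com` are assumptions made for illustration.

```r
# Hypothetical helper: builds paginated review URLs like those in the anchorlist.
# The host and the page size of 10 are assumptions based on the output shown.
build_review_urls <- function(n_pages, page_size = 10,
                              host = "https://www.flipkart.com") {
  path  <- "/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS"
  query <- paste0("pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all",
                  "&type=all&sort=most_helpful&start=")
  starts <- (seq_len(n_pages) - 1) * page_size   # 0, 10, 20, ...
  paste0(host, path, "?", query, starts)
}

# 11 pages x 10 reviews per page = 110 reviews
anchorlist <- build_review_urls(11)

# Hypothetical helper: downloads and parses each page. It is defined but not
# called here, because it needs network access and the rvest package.
fetch_doclist <- function(urls) {
  lapply(urls, function(u) rvest::read_html(u))
}
```

Calling `doclist <- fetch_doclist(anchorlist)` would then give one parsed page per link, provided the site still serves these URLs.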

Output:
anchorlist
[[1]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=10"
[[2]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=20"
[[3]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=30"
[[4]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=40"
[[5]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=50"
[[6]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=60"
[[7]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=0"
[[8]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=70"
[[9]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=80"
[[10]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=90"
[[11]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=100"
[[12]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=110"
[[13]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=120"
[[14]]
"/sony-xperia-z/product-reviews/ITME3H7SHFCU6TFS?pid=MOBDGPK4XSZPTDZY&rating=1,2,3,4,5&reviewers=all& type=all&sort=most_helpful&start=130"
Doclist:
The extracted doclist can be viewed from the following link https://drive.google.com/file/d/0B_Y-QpyzAwhSUkZ1Wmg0bWNWRWM/view?usp=sharing
From this doclist we then extracted the reviews on which the analysis is done.
Review Extraction:
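A minimal sketch of how the review text could be pulled out of each parsed page with rvest. The inline HTML stands in for one element of the doclist, and the `review-text` class is a hypothetical selector, not Flipkart's actual markup.

```r
library(rvest)

# Stand-in for one element of doclist; the markup and the "review-text"
# class are hypothetical, not the real page structure.
doc <- read_html('
  <div class="review"><p class="review-text"> Great camera and display. </p></div>
  <div class="review"><p class="review-text">Battery drains quickly.</p></div>')

# Hypothetical helper: returns the trimmed text of every node matching the selector.
extract_reviews <- function(page, selector = ".review-text") {
  nodes <- html_nodes(page, selector)
  html_text(nodes, trim = TRUE)
}

doclist <- list(doc)
reviews <- unlist(lapply(doclist, extract_reviews))
```

With the real doclist, the selector would have to be replaced by whatever element wraps the review body on the page.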

Output:
A total of 110 reviews were extracted from 11 docs. The review list is available at the following link:
https://drive.google.com/file/d/0B_Y-QpyzAwhSU0Y3UHFHeFdmYVk/view?usp=sharing
Filtering of the review contents:
We removed numbers, punctuation and stopwords from the reviews and converted all letters to lowercase.
code:
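A minimal sketch of this cleaning step using the tm package. The two sample reviews are made up for illustration, and the choice of the English stopword list is an assumption.

```r
library(tm)

# Made-up sample reviews standing in for the extracted review list
reviews <- c("The Camera is GREAT, 10/10!",
             "Battery died after 2 days...")

corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase first
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # assumed stopword list
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a character vector
cleaned <- vapply(corpus, function(doc) paste(content(doc), collapse = " "),
                  character(1))
```

Lowercasing is done before stopword removal because the stopword list is itself in lowercase.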

Making of WordCloud:
As a first step we built a term-document matrix from the reviews, with words in rows and documents in columns. Using the word frequencies from that matrix we then drew the word cloud.
code:
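A minimal sketch of this step, assuming the tm and wordcloud packages. The three short texts are made up and stand in for the cleaned reviews. `min.freq = 1` is chosen only because the sample is tiny.

```r
library(tm)
library(wordcloud)

# Made-up, already-cleaned texts standing in for the filtered reviews
texts  <- c("good phone good camera",
            "phone battery drains",
            "great phone display")
corpus <- VCorpus(VectorSource(texts))

# Words in rows, documents in columns
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

set.seed(42)  # makes the layout reproducible
wordcloud(names(freq), freq, min.freq = 1, random.order = FALSE)
```

The size of each word in the plot is driven by its total frequency across all documents, which is why frequent words look big and bold.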

Output:

Inference from the WordCloud output:
Words such as Phone, Sony, Xperia, Water, Good, Camera, Battery, Service, Display, Screen and Quality are important: they appear bigger and bolder than the other words, which means they occur most often in the reviews.
The quality and overall performance of the phone may have been satisfactory, since words such as screen, battery, camera and display were used repeatedly by the consumers.
There may have been some issue related to water with this Sony model.
There may also have been some comparison of Sony with Samsung and HTC handsets.
Rating Extraction and Analysis of Ratings
We extracted the rating attached to each of the first 110 reviews of the Sony Xperia Z, confirmed that every reviewer had given a rating, and then carried out the analysis accordingly.
Rating Extraction:
code:
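A minimal sketch of the rating extraction and the missing-rating check with rvest. The inline HTML, the `review` and `rating` classes and the `title` attribute are hypothetical stand-ins for the real page markup. The names `missingRating` and `Ratings` follow the output labels used in this post.

```r
library(rvest)

# Stand-in for one parsed review page; markup and class names are hypothetical
page <- read_html('
  <div class="review"><div class="rating" title="5 stars"></div><p>Great phone.</p></div>
  <div class="review"><div class="rating" title="1 star"></div><p>Stopped working.</p></div>')

review_nodes <- html_nodes(page, ".review")

# html_node() returns exactly one result per review, so a review without a
# rating yields NA instead of silently shifting the vector
rating_nodes <- html_node(review_nodes, ".rating")
Ratings      <- html_attr(rating_nodes, "title")

missingRating <- sum(is.na(Ratings))   # 0 means every review has a rating
```

Looking up the rating inside each review node, instead of collecting all rating nodes at once, keeps ratings aligned with their reviews.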

Output:
missingRating>

From this output we can see that no rating is missing on any page, i.e. every reviewer has given a rating. The 110 extracted ratings are as follows:
Ratings>
"5 stars" "1 star" "5 stars" "5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars"
"5 stars" "3 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "1 star"
"4 stars" "5 stars" "4 stars" "1 star" "1 star" "5 stars" "1 star" "2 stars" "5 stars"
"5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars"
"5 stars" "5 stars" "1 star" "1 star" "3 stars" "5 stars" "5 stars" "5 stars" "1 star"
"2 stars" "1 star" "5 stars" "4 stars" "4 stars" "2 stars" "4 stars" "5 stars" "3 stars"
"1 star" "5 stars" "4 stars" "5 stars" "5 stars" "5 stars" "5 stars" "1 star" "5 stars"
"5 stars" "4 stars" "5 stars" "1 star" "1 star" "5 stars" "5 stars" "5 stars" "5 stars"
"5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "4 stars" "5 stars" "5 stars" "1 star"
"5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars" "5 stars"
"1 star" "5 stars" "5 stars" "5 stars" "4 stars" "4 stars" "1 star" "1 star" "1 star"
"1 star" "4 stars" "5 stars" "5 stars" "1 star" "1 star" "5 stars" "1 star" "1 star"
"5 stars" "5 stars"
The extracted ratings are character strings, but for the calculations we need them as numbers, so we converted them as follows:
code:
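A minimal sketch of the conversion in base R, assuming every rating has the form shown above ("1 star", "5 stars"). Only a few sample values are used here.

```r
# A few sample values in the same format as the extracted ratings
Ratings <- c("5 stars", "1 star", "3 stars", "4 stars")

# Drop the trailing " star" / " stars" and convert what is left to a number
finalRating <- as.numeric(sub(" stars?$", "", Ratings))
```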

Output:
finalRating
5 1 5 5 4 5 1 1 5 5 3 5 5 5 5 5 1 1 4 5 4 1 1 5 1 2 5 5 5 5 5 5 5 5 5 5 5 5 1 1 3 5 5 5 1 2 1
5 4 4 2 4 5 3 1 5 4 5 5 5 5 1 5 5 4 5 1 1 5 5 5 5 5 5 5 5 5 4 5 5 1 5 5 5 5 5 5 5 5 5 1 5 5 5
4 4 1 1 1 1 4 5 5 1 1 5 1 1 5 5
We define the satisfaction level of a user as follows: if the rating is above 3, the user is satisfied with the product; if the rating is 3 or below, the user is dissatisfied.
code:
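A minimal sketch of this rule in base R, using a few sample ratings. The variable name `satisfaction` is an assumption.

```r
# Sample numeric ratings
finalRating <- c(5, 1, 3, 4, 2)

# Above 3 is satisfied; 3 or below is dissatisfied
satisfaction <- ifelse(finalRating > 3, "satisfied", "dissatisfied")

# Quick count of each level
table(satisfaction)
```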

Output:
[1] "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"dissatisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied" "satisfied"
"satisfied" "satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied"
"satisfied" "satisfied" "satisfied" "dissatisfied" "dissatisfied" "satisfied"
"dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied"
"satisfied" "satisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied"
"satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "dissatisfied"
"dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"dissatisfied" "dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"satisfied" "satisfied" "dissatisfied" "satisfied" "satisfied" "satisfied"
"satisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"dissatisfied" "satisfied" "satisfied" "satisfied" "satisfied" "satisfied"
"dissatisfied" "dissatisfied" "dissatisfied" "dissatisfied" "satisfied" "satisfied"
"satisfied" "dissatisfied" "dissatisfied" "satisfied" "dissatisfied" "dissatisfied"
"satisfied" "satisfied"
Next we built a data frame combining the document-term matrix with the satisfaction level, so that each document (i.e. each review) has a satisfaction level. We then fitted a classification model using the SVM (Support Vector Machine) technique, so that the words in a review can be used to predict whether the user is satisfied with the product.
code:
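A minimal sketch of this modeling step, assuming the tm and e1071 packages. The eight short reviews and their labels are made up, and the linear kernel is an assumption, since the kernel actually used is not stated. The column name `satisfaction_level` is also an assumption.

```r
library(tm)
library(e1071)

# Made-up reviews and labels standing in for the 110 real ones
reviews <- c("great camera and awesome battery",
             "best phone great quality",
             "awesome display best camera",
             "great battery great quality",
             "service centre refused warranty",
             "phone stopped working warranty denied",
             "poor service high repair cost",
             "stopped working service centre useless")
satisfaction <- factor(rep(c("satisfied", "dissatisfied"), each = 4))

# One row per review, one column per word
dtm <- DocumentTermMatrix(VCorpus(VectorSource(reviews)))
df  <- data.frame(as.matrix(dtm))
df$satisfaction_level <- satisfaction

# scale = FALSE avoids warnings for words that are constant across reviews
model <- svm(satisfaction_level ~ ., data = df, kernel = "linear", scale = FALSE)

# Predictions on the training data, compared with the true labels
pred <- predict(model, df)
table(predicted = pred, actual = df$satisfaction_level)
```

Predicting on the training data only shows that the model fits; a held-out set of reviews would be needed to judge how well it generalizes.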

Now we want to know which variables (i.e. the words) are most and least important in predicting the satisfaction level.
code:
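One possible way to get a word importance ranking, sketched under the assumption of a linear-kernel SVM from e1071: the absolute value of each word's weight in the separating hyperplane is used as its importance. This may differ from the method actually used to produce the importance matrix. The small count matrix is made up.

```r
library(e1071)

# Made-up word counts: rows are reviews, columns are words
x <- matrix(c(2, 1, 0,
              1, 2, 0,
              2, 2, 1,
              1, 1, 0,
              0, 0, 2,
              0, 0, 1,
              1, 0, 2,
              0, 0, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("battery", "great", "warranty")))
y <- factor(rep(c("satisfied", "dissatisfied"), each = 4))

model <- svm(x, y, kernel = "linear", scale = FALSE)

# For a linear kernel the weight vector is t(coefs) %*% SV
w <- drop(t(model$coefs) %*% model$SV)

importance <- data.frame(word = colnames(x), importance = abs(as.numeric(w)))

# Arrange the words in increasing order of importance
importance <- importance[order(importance$importance), ]

head(importance, 5)   # least important words
tail(importance, 5)   # most important words
```

This weight-based ranking only makes sense for a linear kernel; with a non-linear kernel the weights are not available in the original word space.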

We obtained an importance matrix of order 2923 x 2: one column holds the words and the other holds the importance of the word in that row. The words are arranged in increasing order of importance.
The 5 least important words:

Warranty, said, centre, cost, customers and working are the least important words; their importance in predicting the satisfaction level is shown in the table above.
The 5 most important words:

Battery, quality, great, best, camera and awesome are the most important words; their importance in predicting the satisfaction level is shown in the table above.
Sentiment Analysis
Now we want to know the overall sentiment of the reviewers/users towards this product. Is it positive or negative? What is the average sentiment score?
For this purpose we determined whether each review is polar or neutral by computing its polarity score, and then calculated the average polarity score over all 110 reviews. If the average is positive, the users have a positive sentiment towards the product; if it is negative, the sentiment is negative.
code:
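A simplified stand-in for the polarity computation, written in base R and not the original listing. It scores each review as (positive words minus negative words) divided by the square root of the word count, then averages the scores. The tiny word lists and the helper name `polarity_score` are assumptions; dedicated packages such as qdap provide a `polarity()` function with a full dictionary and handling of negation.

```r
# Tiny illustrative lexicons; a real analysis would use a full dictionary
positive <- c("happy", "glad", "well", "like", "good", "great")
negative <- c("bad", "poor", "worst", "terrible")

# Hypothetical helper: (positive hits - negative hits) / sqrt(number of words)
polarity_score <- function(text) {
  clean <- gsub("[^a-z ]", " ", tolower(text))
  words <- strsplit(clean, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)
  (sum(words %in% positive) - sum(words %in% negative)) / sqrt(length(words))
}

# Made-up sample reviews
reviews <- c("Happy with the phone, works well",
             "Bad battery and poor service",
             "The phone arrived on Monday")

scores <- vapply(reviews, polarity_score, numeric(1), USE.NAMES = FALSE)

# A score of 0 marks a neutral review; the sign of the mean gives the
# overall sentiment
average_polarity <- mean(scores)
```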

Output:

Here we can see that the 11 documents contain 12830 words and 1807 sentences. Positive words such as happy, glad, well, like and willing appear in the documents. The average polarity calculated over all the documents is 0.111, which is positive. So we can infer that the users have an overall positive sentiment towards the product.
Polarity plot:

From this polarity plot we can see that the average polarity, shown as a cross, is above 0 and that most of the dense region also lies above 0. So the plot also indicates that the sentiment is positive on average.
Summary:
In this Part-1 we have covered: 1) Text Extraction 2) Building a Word Cloud 3) Rating Extraction and Classification 4) Sentiment Analysis.
In Part-2 we will cover: 5) Latent Semantic Analysis 6) Topic Modelling 7) Text Clustering and a few more related and essential topics.
Till then Happy Reading. :)