Atanu Maity
Text Mining and Analysis in R - Use Case (Part-2)
Continuing from Text Mining and Analysis in R Part-1, we start this part with Latent Semantic Analysis (LSA).
Latent Semantic Analysis
We used Latent Semantic Analysis to uncover semantic relationships among pieces of text and words. Using the LSA technique, we tried to find the topic spaces defined by the words/terms, and their contextual synonyms, used in the reviews of the Sony Xperia Z.
First, we created a document-term matrix from the reviews.
code:

Output:

Here we can see that the first three columns are -, “red” and ‘. These symbols carry no significance for our analysis, so we removed these first three columns.

Output Sample:

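The original code for this step is not reproduced here, so the following is only a minimal base-R sketch of the idea on a made-up three-review corpus. The `reviews` vector and the pattern-based junk filter are assumptions for illustration; the text above removes the stray columns by position, which would be `dtm[, -(1:3)]`.

```r
# Toy reviews standing in for the Sony Xperia Z reviews (hypothetical data).
reviews <- c("great display - and camera",
             "battery 'bad' but display great",
             "camera is good")

# Split each review on whitespace into lower-case tokens.
tokens <- strsplit(tolower(reviews), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are raw counts.
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(reviews))

# Drop columns whose name contains no letter or digit (stray symbols such as "-").
is_junk <- !grepl("[[:alnum:]]", colnames(dtm))
m <- dtm[, !is_junk, drop = FALSE]
```

Filtering by pattern rather than by position keeps working if the vocabulary order changes, but it is a different rule from the positional removal described above.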
Similarly, we calculated a weighted TF-IDF score for each word in each document and stored the scores in a matrix called m_tfidf, which is of order 110x2923.
code:

Output:

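As an illustration of what a TF-IDF weighting does, here is a small base-R sketch. It assumes one common variant (term frequency normalised by document length, multiplied by log2 of the inverse document frequency); the exact weighting used for m_tfidf is not shown in this post, and the toy counts are hypothetical.

```r
# Hypothetical 3-document x 3-term count matrix.
m <- matrix(c(2, 1, 0,
              0, 1, 1,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3),
                            c("camera", "phone", "service")))

tf  <- m / rowSums(m)                  # term frequency relative to document length
idf <- log2(nrow(m) / colSums(m > 0))  # rarer terms get a larger weight
m_tfidf <- sweep(tf, 2, idf, "*")      # scale each term column by its idf
```

A term that occurs in every document ("phone" here) gets an idf of zero, so it contributes nothing to the weighted matrix.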
Next, we performed SVD (Singular Value Decomposition) on the ‘m’ and ‘m_tfidf’ matrices through the LSA technique, choosing a dimension share of 0.6.
code:

Note: the ‘t()’ operator transposes a matrix, so the analysis is actually done on the term-document matrix.
SVD splits each matrix into three component matrices: a term-dimension matrix, a diagonal matrix containing the singular values, and a document-dimension matrix.
For lsa_m, these are lsa_m$tk (2923x34), lsa_m$sk (34x1) and lsa_m$dk (110x34) respectively.
For lsa_mtfidf, these are lsa_mtfidf$tk (2923x45), lsa_mtfidf$sk (45x1) and lsa_mtfidf$dk (110x45) respectively.
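The decomposition can be sketched with base R's `svd()` on a small random matrix. This is not the code used in the post: the toy matrix is hypothetical, and the rule for turning the 0.6 share into a number of dimensions (keep the leading singular values until their cumulative sum reaches 60% of the total) is an assumption about how the share is applied.

```r
set.seed(1)
# Hypothetical 6-document x 8-term count matrix standing in for 'm'.
m <- matrix(rpois(48, lambda = 1), nrow = 6,
            dimnames = list(paste0("doc", 1:6), paste0("term", 1:8)))

dec   <- svd(t(m))   # decompose the term-document matrix
share <- 0.6
k <- which(cumsum(dec$d) / sum(dec$d) >= share)[1]   # number of dimensions kept

tk <- dec$u[, 1:k, drop = FALSE]   # term-dimension matrix (terms x k)
sk <- dec$d[1:k]                   # the k largest singular values
dk <- dec$v[, 1:k, drop = FALSE]   # document-dimension matrix (documents x k)

# Rank-k approximation of the term-document matrix.
approx_tdm <- tk %*% diag(sk, nrow = k) %*% t(dk)
```

The shapes mirror those listed above: tk has one row per term, dk one row per document, and sk one value per retained dimension.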
Inspecting lsa_m$sk, we have:
89.49235 43.53660 38.06285 33.57073 31.87552 28.48354 25.34172 23.19148 21.81091 21.29905 20.44049
20.27423 19.45968 19.06065 18.46173 17.47600 17.17047 16.70915 15.97285 15.91400 15.25898 14.62148
14.45512 14.29820 14.23971 13.38578 13.16881 13.02947 12.81555 12.56187 12.38814 12.23705 11.99354
11.91981
The first 4 singular values account for the largest share of the variance; after them the values decline only gradually, with no further significant drops.
We therefore keep the first 4 dimensions, drop the others, and carry on with these 4 dimensions.
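One way to check this reading is to look at the squared singular values, which are proportional to the variance carried by each dimension. The sketch below uses the 34 values printed above; note that the shares it computes are relative to these 34 retained dimensions only, not to the full decomposition.

```r
# The singular values reported for lsa_m$sk above.
sk <- c(89.49235, 43.53660, 38.06285, 33.57073, 31.87552, 28.48354,
        25.34172, 23.19148, 21.81091, 21.29905, 20.44049, 20.27423,
        19.45968, 19.06065, 18.46173, 17.47600, 17.17047, 16.70915,
        15.97285, 15.91400, 15.25898, 14.62148, 14.45512, 14.29820,
        14.23971, 13.38578, 13.16881, 13.02947, 12.81555, 12.56187,
        12.38814, 12.23705, 11.99354, 11.91981)

var_share <- sk^2 / sum(sk^2)   # share of variance per retained dimension
cum_share <- cumsum(var_share)  # cumulative share
drops     <- -diff(sk)          # gap between consecutive singular values

round(cum_share[1:6], 3)        # how quickly the leading dimensions add up
# plot(sk, type = "b")          # scree plot to look for the elbow
```

Where exactly to cut remains a judgment call; the gaps in `drops` shrink steadily, so the elbow is not sharp.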
Next, we group similar terms into clusters. We do this in a two-step process:
First, k-means clustering
Second, hierarchical clustering to find the optimal number of clusters
code:

Output:

Step 2:
Here we used the ‘ward.D’ method for hierarchical clustering and then, by coloring the plot differently at different heights, tried to find the optimal number of clusters.
code:

Output:

From the dendrogram and the color scaling, the optimal number of clusters appears to be 3, 4 or 7.
Now we check the cluster sizes for each of these to decide which is best, again using the k-means clustering technique.
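For readers who want to try the hierarchical step, here is a minimal sketch using `hclust()` with the ‘ward.D’ method. The random `term_coords` matrix is a hypothetical stand-in for the term coordinates on the 4 retained dimensions, and the color-scaled plot from the post is not reproduced.

```r
set.seed(42)
# Hypothetical term coordinates: 30 terms x 4 retained LSA dimensions.
term_coords <- matrix(rnorm(120), ncol = 4,
                      dimnames = list(paste0("term", 1:30),
                                      paste0("dim", 1:4)))

d  <- dist(term_coords)              # Euclidean distances between terms
hc <- hclust(d, method = "ward.D")   # Ward linkage, as named in the text

# Cluster labels for each candidate number of clusters (one column per k).
cuts <- cutree(hc, k = c(3, 4, 7))

# plot(hc); rect.hclust(hc, k = 4)   # dendrogram with the k = 4 clusters boxed
```

Each column of `cuts` assigns every term to a cluster for one candidate k, which makes it easy to tabulate and compare the solutions.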

Output:



Looking at these figures, we decided that k=4 gives the best solution for our purpose, i.e., we take 4 clusters.
Next we look at the cluster memberships, as follows:
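The size comparison can be sketched as follows. The data are random and purely illustrative, and `nstart = 20` is an assumed setting to make the k-means result less dependent on the random starting centers.

```r
set.seed(42)
# Hypothetical term coordinates: 60 terms x 4 retained LSA dimensions.
term_coords <- matrix(rnorm(240), ncol = 4)

# Run k-means for each candidate k and keep only the cluster sizes.
sizes <- lapply(c(3, 4, 7), function(k) {
  kmeans(term_coords, centers = k, nstart = 20)$size
})
names(sizes) <- paste0("k", c(3, 4, 7))

sizes   # one vector of cluster sizes per candidate k
```

Very small clusters at a given k can be a sign of noise terms being split off rather than of a real topic.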

Output:
K4_1, i.e., the cluster 1 memberships, look like this:

Looking at this, we can say that cluster 1 reflects the good things about the Sony Xperia phone, such as display, camera and battery; in other words, this topic space speaks of the phone's strengths.
K4_2, i.e., the cluster 2 memberships, look like this:

Looking at this, we can infer that this topic space discusses the bad things about the phone. Words like ‘bad’, ‘issues’, ‘complaining’, ‘useless’ and ‘ill’ give this topic space a negative tone, which suggests that the processor and the charging system of this phone may not be very good, or may not be up to consumers' expectations.
K4_3, i.e., the cluster 3 membership, is:

There is only one word here, and it has a frequency of 1. We can't say anything from this; it could be noise.
K4_4, i.e., the cluster 4 memberships, look like this:

Looking at the highest-frequency words in this cluster, we can say that this cluster/topic space also discusses service issues with the phone. There may be many problems with the service for this phone.
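Membership listings like the ones above can be produced by splitting the terms by cluster label and sorting each group by frequency. In this sketch the terms are taken from the discussion above, but the frequencies and the `cluster` vector (standing in for the `$cluster` component of a k-means fit) are made up for illustration.

```r
# Terms mentioned above, with hypothetical frequencies and cluster labels.
terms   <- c("display", "camera", "battery", "bad", "issues", "useless", "service")
freq    <- c(40, 35, 30, 12, 9, 4, 20)
cluster <- c(1, 1, 1, 2, 2, 2, 4)

term_df <- data.frame(term = terms, freq = freq, stringsAsFactors = FALSE)

# One data frame per cluster, most frequent terms first.
members <- split(term_df, cluster)
members <- lapply(members, function(df) df[order(-df$freq), ])

members[["1"]]   # memberships of cluster 1
```

Sorting by frequency puts the terms that dominate each topic space at the top, which is what the interpretation of each cluster relies on.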
Next, we created ‘lsa_tk4’, whose columns are the words and the 4 dimensions we chose from ‘lsa_m$sk’. ‘lsa_tk4’ is of order 2923x5.

Output Sample:

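A table of this shape can be assembled as sketched below. The small random `tk` matrix is a hypothetical stand-in for lsa_m$tk, and a data frame is used here, rather than a plain matrix, so that the word column can stay character while the dimension columns stay numeric.

```r
set.seed(3)
# Hypothetical term-dimension matrix: 5 terms x 6 dimensions.
tk <- matrix(rnorm(30), nrow = 5,
             dimnames = list(c("display", "camera", "battery",
                               "service", "charging"), NULL))

# Keep the word plus its coordinates on the first 4 dimensions.
lsa_tk4 <- data.frame(word = rownames(tk), tk[, 1:4],
                      stringsAsFactors = FALSE)
names(lsa_tk4)[-1] <- paste0("dim", 1:4)
rownames(lsa_tk4) <- NULL

head(lsa_tk4)
```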
Recommendations
A few recommendations that we can draw from our analysis:
The company should concentrate on its customer service to improve customer relationships and to retain its loyal customers.
Sony is known for sound and image quality and has long satisfied its customers with these qualities, so in future it should not compromise on the quality of these features when competing with other brands in its category.