Blog Post 2 - Exploratory Data Analysis (EDA)
Alex Berry, Jason Chan, Hyunjoon Lee, Sayan Samanta, Christina Ye
Brown University Data Science Initiative
DATA 2040: Deep Learning
April 26th, 2020
The Data Set
To recap from our Blog Post 1, the data we are using is a subset of the “Amazon Review Data (2018)” dataset created by Jianmo Ni at UCSD, which contains a total of 233.1 million reviews. Our project-specific dataset is composed of 11,285,464 reviews in the “Clothing, Shoes and Jewelry” category. The dataset is in JSON format; a sample review looks like:
{
  "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"],
  "overall": 5.0,
  "vote": "2",
  "verified": true,
  "reviewTime": "01 1, 2018",
  "reviewerID": "AUI6WTTT0QZYS",
  "asin": "5120053084",
  "style": {
    "Size:": "Large",
    "Color:": "Charcoal"
  },
  "reviewerName": "Abbey",
  "reviewText": "I now have 4 of the 5 available colors of this shirt... ",
  "summary": "Comfy, flattering, discreet--highly recommended!",
  "unixReviewTime": 1514764800
}
Preprocessing
Some preprocessing was necessary before we proceeded to our EDA. First, we gave each review an ID and dropped all features except the rating and the review text, which removed unnecessary information and shrank the dataset considerably. The goal of this project is to study the possibility of generating product reviews from a given set of context/keywords, so we also generated a keyword for each review. For keyword generation, we adapted code from Ng Wai Foong’s tutorial, which uses the open-source NLP library spaCy. Lastly, we converted the file format from JSON to tab-separated values (TSV), which is compact and straightforward to read from Python and to feed into TensorFlow pipelines.
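As a rough sketch, keyword extraction along the lines of that tutorial can look like the following. This is only illustrative: the function name extract_keyword, the POS tags used, and the choice of the single most frequent candidate are assumptions, not an exact reproduction of our preprocessing script.

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

def extract_keyword(text, pos_tags=("PROPN", "ADJ", "NOUN")):
    """Return the most frequent content word in a review, or None if there is none."""
    doc = nlp(str(text).lower())
    candidates = [token.text for token in doc
                  if token.pos_ in pos_tags
                  and not token.is_stop
                  and not token.is_punct]
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]

# Example on the sample review above (the exact keyword returned depends on the tagger)
extract_keyword("I now have 4 of the 5 available colors of this shirt...")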
Exploratory Data Analysis (EDA)
We used a randomly sampled subset of 278,646 reviews for EDA, keeping four fields per review: ID, rating, keyword, and review text.
import os
import pandas as pd

# Load the preprocessed TSV (ID, rating, keyword, review text)
parent_directory = os.path.dirname(os.getcwd())
df = pd.read_csv(parent_directory + "/data/Clothing_Shoes_and_Jewelry_5.tsv", sep="\t")
df.head()
For EDA purposes, we also generated a feature called “polarity” using TextBlob. The purpose of the polarity feature is to examine whether the rating is a valid representation of the sentiment of the review. This matters because ratings also provide useful context when generating text in our project. Polarity ranges from -1.0 to 1.0, where -1.0 is extremely negative, 1.0 is extremely positive, and 0.0 is neutral. We referred to Susan Li’s tutorial.
import textblob as tb

# Sentiment polarity of each review text (-1.0 = very negative, 1.0 = very positive)
df['polarity'] = df['text'].map(lambda text: tb.TextBlob(str(text)).sentiment.polarity)
# Force the text column to a unicode string dtype for downstream processing
df['text'] = df['text'].astype("U")
Distribution of Sentiment Polarity and Ratings
The distribution of the sentiment polarity is close to a normal distribution, ranging from about -0.5 to 1.0. According to the distribution, people generally give neutral to moderately positive reviews. The plot also suggests that our dataset is a good representative sample, covering a wide range of review sentiments.
Figure 1. Distribution of review sentiment polarity
The distribution of ratings is left-skewed (negatively skewed). We can see that people generally tend to give 4- and 5-star reviews.
Figure 2. Distribution of review ratings
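For reference, the two distribution plots can be reproduced roughly as follows. Our actual figures follow Susan Li’s plotly-based tutorial, so this matplotlib version is only a sketch, and the 'rating' column name is an assumption about our TSV schema.

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

# Sentiment polarity is continuous in [-1, 1], so use a fine-grained histogram
ax1.hist(df['polarity'], bins=50)
ax1.set_title('Distribution of review sentiment polarity')
ax1.set_xlabel('polarity')

# Ratings are discrete (1-5 stars), so use one bar per star value
df['rating'].value_counts().sort_index().plot(kind='bar', ax=ax2)
ax2.set_title('Distribution of review ratings')
ax2.set_xlabel('rating')

plt.tight_layout()
plt.show()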
Words and Bigrams
Then, we looked at the 20 most frequent words and bigrams in our reviews.
Figure 3. 20 most frequent words in reviews
Figure 4. 20 most frequent bigrams in reviews
The most frequently used bigrams are combinations of verbs and adjectives that describe the quality of the product and how well it fits the customer. Most of these bigrams are positive remarks about the product, and similarly, the most frequently used single words are positive adjectives.
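The word and bigram counts behind Figures 3 and 4 can be computed along these lines with scikit-learn’s CountVectorizer, similar to the approach in Susan Li’s tutorial; the helper name and vectorizer settings here are illustrative rather than our exact code.

from sklearn.feature_extraction.text import CountVectorizer

def top_n_grams(corpus, ngram_range=(1, 1), n=20):
    """Return the n most frequent n-grams in the corpus (ignoring English stop words)."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)          # document-term matrix
    word_counts = bag_of_words.sum(axis=0)        # total count of each n-gram
    freqs = [(word, word_counts[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda x: x[1], reverse=True)[:n]

top_words = top_n_grams(df['text'], ngram_range=(1, 1))    # Figure 3
top_bigrams = top_n_grams(df['text'], ngram_range=(2, 2))  # Figure 4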
Word Cloud by Ratings
Next, using the WordCloud library, we looked at word clouds of the reviews. First, we plotted the word cloud of all the reviews, and then we plotted word clouds by rating (rating = 1 and rating = 5) to see whether the choice of words in review texts differs by rating. We removed the stopwords before plotting. Below is the code we implemented to plot our word clouds.
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

def plot_wordcloud(text, image_name, mask=None, max_words=200, max_font_size=100,
                   figure_size=(24.0, 16.0), title=None, title_size=40, image_color=False):
    # Extend the default stop word list with a few noise tokens seen in our corpus
    stopwords = set(STOPWORDS)
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color='black',
                          stopwords=stopwords,
                          max_words=max_words,
                          max_font_size=max_font_size,
                          random_state=42,
                          width=800,
                          height=400,
                          mask=mask)
    wordcloud.generate(str(text))

    plt.figure(figsize=figure_size)
    if image_color:
        # Recolor the cloud using the colors of the mask image
        image_colors = ImageColorGenerator(mask)
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
        plt.title(title, fontdict={'size': title_size, 'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud)
        plt.title(title, fontdict={'size': title_size, 'color': 'black',
                                   'verticalalignment': 'bottom'})
    plt.axis('off')
    plt.tight_layout()
    plt.savefig(image_name + ".png", transparent=True)
plot_wordcloud(df["text"],"reviews",title="Word Cloud of reviews")
Figure 5. Word cloud of text of all reviews
The most commonly used words across all reviews include “great”, “well”, “product”, “daughter”, “bought”, “tutu”, “nice”, etc.
Figure 6. Word cloud of text of reviews with ratings of 5.0
Figure 7. Word cloud of text of reviews with ratings of 1.0
The word cloud for rating = 5 is very similar to that of all the reviews. However, in the word cloud for rating = 1, the dominant words are “size”, “fit”, “large”, “big”, “tight”, and “cheap”. We can infer that most 1-star ratings come with dissatisfaction about sizing (too big or too small).
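For reference, the rating-specific word clouds in Figures 6 and 7 can be produced with the plot_wordcloud helper defined above; the 'rating' column name is again an assumption about our TSV schema.

plot_wordcloud(df[df['rating'] == 5.0]['text'], "reviews_5_star",
               title="Word Cloud of 5-star reviews")
plot_wordcloud(df[df['rating'] == 1.0]['text'], "reviews_1_star",
               title="Word Cloud of 1-star reviews")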
So far, the ratings, the polarity, and the word clouds all indicate that most of our reviews are positive. Looking at the distribution of the sentiment polarity, however, this general positivity does not appear to be a bias introduced by our sampling; it seems instead to reflect the fact that people are generally satisfied with the products they purchase on Amazon.
Top 20 Keywords
Lastly, we looked at the top 20 keywords from our dataset.
Figure 8. 20 most frequent keywords
The top 20 keywords include positive adjectives like “great”, “nice”, “good”, “comfortable” and products like “shoes (shoe)”, “shirt”, “bra”.
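The keyword counts in Figure 8 can be tabulated with a simple value_counts call; the 'keyword' column name is an assumption about our TSV schema.

# 20 most frequent keywords (Figure 8)
top_keywords = df['keyword'].value_counts().head(20)
print(top_keywords)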
Ratings vs Sentiment Polarity for Top Keywords
Next, we looked at the average rating and average sentiment polarity for each of the top keywords. The purpose of this exploration is to examine whether the ratings fairly represent the sentiment polarity of the review text (or at least have a reasonable association with it). According to our plots, the keywords “great”, “nice”, “good”, and “comfortable” have average ratings above 4 and higher average sentiment polarities (above 0.3). For the keyword “size”, the average rating is below 4, which suggests that people mostly post reviews about size when they are not satisfied (size too big, size too small); “size” was also the only keyword with an average sentiment polarity below 0.2. Other keywords tell a similar story, so the ratings represent the sentiment polarity of the reviews fairly well.
Figure 9. Average rating of 20 most frequent keywords
Figure 10. Average polarity of 20 most frequent keywords
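The aggregation behind Figures 9 and 10 is a groupby-mean over the top keywords; a sketch is below, with the 'keyword' and 'rating' column names assumed as before.

top20 = df['keyword'].value_counts().head(20).index
keyword_stats = (df[df['keyword'].isin(top20)]
                 .groupby('keyword')[['rating', 'polarity']]
                 .mean()
                 .sort_values('rating', ascending=False))
print(keyword_stats)  # mean rating (Figure 9) and mean polarity (Figure 10) per keyword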
Lastly, we looked at boxplots of the ratings and sentiment polarity for the top keywords. Again, the ratings and the sentiment polarity are clearly related.
Figure 11. Boxplots of rating score by 6 most frequent keywords
Figure 12. Boxplots of polarity by 6 most frequent keywords
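Boxplots like those in Figures 11 and 12 can be sketched with seaborn, restricted to the six most frequent keywords; column names are assumptions as above.

import matplotlib.pyplot as plt
import seaborn as sns

top6 = df['keyword'].value_counts().head(6).index
subset = df[df['keyword'].isin(top6)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
sns.boxplot(x='keyword', y='rating', data=subset, ax=ax1)    # Figure 11
sns.boxplot(x='keyword', y='polarity', data=subset, ax=ax2)  # Figure 12
plt.tight_layout()
plt.show()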
Next Steps
Now that we have preprocessed the data and performed a thorough EDA, we will proceed to implementing the initial model. Refer to Blog Post 1 for the models we explored and plan to use in our implementation.
References
- Foong, Ng Wai. “Extract Keywords Using SpaCy in Python.” Medium, Better Programming, 21 Jan. 2020, medium.com/better-programming/extract-keywords-using-spacy-in-python-4a8415478fbf. (This resource was used to learn how to use the spaCy package to extract keywords. This was key because we preprocessed all of our data to create the keyword variable, since this is what we would be training on.)
- Li, Susan. “A Complete Exploratory Data Analysis and Visualization for Text Data.” Medium, Towards Data Science, 27 Apr. 2019, towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a. (This resource gave a broad overview of how to do EDA on text data using plotly. This was very helpful in making our figures.)
- Ni, Jianmo. “Amazon Review Data (2018).” Amazon Review Data, 2018, nijianmo.github.io/amazon/index.html. (This is the original source of our data.)