Blog Post 6 - Executive Summary

Alex Berry, Jason Chan, Hyunjoon Lee, Sayan Samanta, Christina Ye

Brown University Data Science Initiative
DATA 2040: Deep Learning
May 14th, 2020

Executive Summary

Introduction

The objective of the current project was to develop a generative model for topic-driven text generation. More specifically, we studied the possibility of building a model that can generate human-like texts while controlling the content of the output based on user-defined context (keyword and sentiment). This project is important because the fact that one can generate plausible (human-like) texts about a user-defined topic with user-defined sentiment opens door to numerous negative and positive applications. One negative application (and a major one) is deceiving the public by generating a body of plausible (human-like) yet fake news articles about a company or stock or product reviews that falsely influence the market. A positive application of this project is its application as responsive chat-bots which can steer the conversation proactively in a desired direction. To train the model, we used 11,285,464 reviews under “Clothes, Shoes, and Jewelry” category from “Amazon Review Data (2018)” dataset created by Jianmo Ni at UCSD to train our model. The large sample size as well as the availability of product ratings (substitute for sentiment) made the dataset appropriate for the objective of this project. During EDA, we also determined that the ratings of the reviews represented the sentiment of the texts fairly well.

Baseline Model

The baseline model we built was a fine-tuned GPT-2 model. While GPT-2 was designed to predict the next word based on the previous words in a text, it has been shown to be capable of producing convincing synthetic texts that garnered a high credibility score. The original model (with 1.5 billion parameters) was trained on a dataset of roughly 8 million text-documents from the internet and is curated for diversity to represent multiple domains. However, one of GPT-2’s limitations is that it was designed to generate long texts and was not built to incorporate user-defined keywords. Therefore, we fine-tuned the model such that it could generate shorter texts (by introducing tokens) and contain specified keywords (see Blog Post 3). In a nutshell, we were able to build a working baseline model that could generate texts based on keywords and ratings. As a result, we were able to successfully add an extra layer of complexity through the sentiment discriminator that allowed us to have greater control over the sentiment of our generated Amazon reviews.

(Examples)

Rating: 1, Keyword: Belt

Very thin fabric. Material is super thin. Not much give in the waist. The belt is pretty thin. I am a size 6-8, and the 6-8 fit me perfectly. I wear a size 7 dress, but the 6-8 was too tight and the length was too short for me. I was really disappointed with the purchase.

Rating: 3, Keyword: Belt

I purchased a belt with a buckle, which I thought was a nice addition. Unfortunately, the buckle broke. I had to glue it back together. If you have a small wrist, you might be able to use it, but if you have a larger wrist, you might not be able to use it.

Rating: 5, Keyword: Belt

Belt fit perfectly. Very happy with this purchase. Very comfortable. The fit is perfect.

Rating: 1, Keyword: Boot

Boots are very heavy and have no traction on the bottom of the foot. The soles are thin and hard, they are not able to hold up to the very high price. I have ordered two different sizes for my son and they have the same problem. He can not wear them. We cannot find ones that have the sole.

Rating: 3, Keyword: Boot

Comfortable. But the leather has a different color to it, and I know that the black will fade, so I am not sure if the leather is the same color. I do not know if this was the only issue.

Rating: 5, Keyword: Boot

Nice boots and very comfortable. The color is very nice and the fit is very flattering.

Final Model

PPLM

The drawback of the fine-tuned GPT-2 was that it was far too large and inaccessible to implement and re-train to the particular specifications of a new task. To address this issue, we incorporated the Plug and Play Language Model (PPLM) to our GPT-2 model. Incorporating PPLM allowed the model to complete more specified tasks, such as discussing a particular topic or containing a specific sentiment (positive or negative). For instance, despite GTP-2’s remarkable ability to encode information about speech patterns, grammar, and spelling, it could not generate a positive text given a negative starting text. However, with PPLM, we were able to force our model to generate a positive text given a negative starting text (e.g. starting text: “The food is awful”, PPLM: “The food is awful, but there is also the music, the story and the magic!…”). For this project, we specifically implemented PPLM-Discrim, where the plug-in attribute model on top of GPT-2 is a single layer discriminator with the sole purpose of discriminating between positive and negative sentiments. We trained the model with the training data for GPT-2.

(Examples)

The baseline review output from the pretrained GPT-2 model before any sentiment control was:

= Unperturbed generated text = <|endoftext|>This food is delicious! It tastes like real chicken, but has an amazing flavor! It's so good that I don't even have to eat it in a restaurant. It tastes like rice and the texture is so delicious! It's also a great option to

The review text output corresponding to a rating label of 1 is:

= Perturbed generated text = <|endoftext|>This food-safety inspector warned this year that the industry warned of health hazards of the tainted rice paste. The rice paste paste made by B.C. rice producer Bao Dong was used in a fatal accident in Surrey, B.C.,

The review text output corresponding to a rating label of 3 is:

= Perturbed generated text 2 = <|endoftext|>This food safety report found a "significant increase" in food safety concerns. The report will be considered in the House committee hearing on the committee's report at 3:30pm on Friday, June 8.

The review text output corresponding to a rating label of 5 is:

= Perturbed generated text 3 = <|endoftext|>This food is fantastic as a great breakfast, an easy meal, and will keep you full for up to 3 weeks. I've also found this recipe makes very tasty meals for 2 weeks or more after eating without refrigeration. It's delicious and nutritious too.

CTRL

Along with PPLM, we concurrently aimed to improve our baseline model with CTRL. We used a pretrained CTRL model from Salesforce that allows for fine-tuning and creation of new control codes. We planned to create control codes for different training sets with different sizes and determine which combination enabled for the most realistic review generation. We the pretrained “Reviews” control code with prompts for the review ratings:

Rating: 1

This book is just another example of how to make money off of people who are gullible enough to believe in psychics. Rating: 1.0

I found this book to be poorly written and full of errors. Some sentences don't even make sense. Don't waste your time or money! Rating: 1.0

Not what I expected. Very short read. No real information provided. Would like more info on the subject. Rating: 1.0

Rating: 5

This item works as advertised. My wife uses it to listen to music on her phone while she works. Shes has had no problems with it, so far. Rating: 4.0.

Good quality sound from such small speakers. Easy to use and set up. Rating: 4.0

I love these. I can take them anywhere and put them wherever I want. And they don't cost much. Rating: 4.0

It is interesting to note that in the case of the Rating: 5 prompt, all of the generated reviews actually included ratings of 4.0. This might be due to an imbalanced lack of 5.0 ratings in the training set or just random variation. CTRL’s generated reviews were well-formed and stayed on topic, especially when compared to the more primitive results from the GPT-2 model.

Outlook

We noticed limitations in our PPLM model in that it worked better in generating positive sentiments rather than negative sentiments. This is likely due to the class imbalance - there were many more high rating data points than low rating ones. We could solve this problem by performing stratified sampling to ensure the classes are equally balanced. Lastly, the next step for us would be to combine the advantages of the three models. GPT-2 is remarkable at encoding speech patterns, grammar, and spelling. PPLM does not require language model retraining and allows stronger control over the sentiment. CTRL enforces the texts to stay on topic. Addressing the limitation and integrating the advantages would generate a powerful key-word based generative language model.