Blog Post 4 - Fine-Tuning a PPLM Discriminator Model to Generate Sentiment-Focused Reviews

Alex Berry, Jason Chan, Hyunjoon Lee, Sayan Samanta, Christina Ye

Brown University Data Science Initiative
DATA 2040: Deep Learning
May 10th, 2020

For all the strength of the baseline GPT-2 model, a state-of-the-art model of this size is far too large and inaccessible to retrain to the particular specifications of a new task. In order to create reviews that reliably express a particular sentiment, we decided to take the next step and build on GPT-2 using Uber’s PPLM algorithm.

As mentioned in the previous blog post, PPLM, short for Plug and Play Language Model, is a model developed to steer and control successful language models to complete more specialized tasks, such as discussing a particular topic or expressing a specific sentiment (positive or negative). For example, while GPT-2 is fantastic at encoding information about speech patterns, grammar, and spelling, we would be unable to force GPT-2 to generate a positive sentence given a negative starting text, such as “the food is awful”. PPLM, on the other hand, is able to build on top of GPT-2 in order to create a positive conclusion to a negative sentence, such as: “The food is awful, but there is also the music, the story and the magic! The “Avenged Sevenfold” is a masterfully performed rock musical that will have a strong presence all over the world.”

PPLM gives us the flexibility to choose simple attribute models that represent what we want to control, and plug them into a large, unconditional language model. What makes PPLM so special is that no training or fine-tuning of the large language model is required — this allows users to utilize top-of-the-line language models even if they do not have the resources required to train them. The largest and most successful publicly available language models contain up to a billion parameters, take unfathomable amounts of money and resources to train, and often do not provide the training data publicly. The plug-in attribute models, on the other hand, may be many orders of magnitude smaller. Uber uses the metaphor that these huge language models are like a woolly mammoth that lumbers around aimlessly, while the plug-in attribute model acts as a tiny mouse that sits on the mammoth and guides it.

In our project of Amazon text review generation, we specifically implemented PPLM-Discrim, where the plug-in attribute model on top of GPT-2 is a single-layer discriminator whose sole purpose is to discriminate between sentiment ratings. In other words, this discriminator layer takes the mean of the embedded representation output by the original GPT-2 model and predicts a rating label. In our scenario, sentiment is encoded by the rating associated with each review, ranging from most negative at 1 to most positive at 5.
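To make this concrete, below is a minimal PyTorch sketch of what such a single-layer classifier head could look like, assuming GPT-2’s 768-dimensional hidden states and five rating classes. The class and argument names are our own illustration, not code from the PPLM repo.

import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Hypothetical single-layer discriminator: one linear map from the
    mean-pooled GPT-2 hidden state to a logit per rating class (1-5)."""

    def __init__(self, embed_size=768, num_classes=5):
        super().__init__()
        self.linear = nn.Linear(embed_size, num_classes)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, embed_size) from GPT-2.
        # Average over the sequence dimension, then classify.
        mean_hidden = hidden_states.mean(dim=1)
        return self.linear(mean_hidden)

Because the attribute model is just one linear layer over a mean-pooled embedding, it is many orders of magnitude smaller than GPT-2 itself and correspondingly cheap to train.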

Results of PPLM-Discrim

In order to stay consistent with our GPT-2 model, we trained our PPLM discriminator on the same subset of Amazon review data. Specifically, we used the “Clothing, Shoes and Jewelry” dataset, which originally contains over 9 million data points with corresponding ratings and review texts. Due to the large size of the set, we trained our discriminator on a random 10% subsample, lowering training time from roughly 60 hours to 6 hours. It is worth noting that the original data has relatively imbalanced classes, with many more positive ratings (4s and 5s) than negative ratings (1s and 2s). Our random subsample had the class imbalance shown in Figure 1: 62% of data points had a 5 rating, 18% had a 4 rating, and the remaining ratings combined made up about 19% of the data. As the results below show, despite the class imbalance we were still largely able to produce reviews with effective sentiments for both positive and negative rating labels.
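As a rough illustration, the subsampling step can be as simple as the pandas sketch below. The file name is a placeholder, and we assume the dataset’s usual JSON-lines layout with the star rating stored in an 'overall' field.

import pandas as pd

# Placeholder path; the Amazon review data ships as JSON lines.
reviews = pd.read_json("Clothing_Shoes_and_Jewelry.json", lines=True)

# Keep a random 10% of the ~9 million reviews to cut discriminator
# training time (roughly 60 hours down to 6 in our setup).
subsample = reviews.sample(frac=0.10, random_state=42)

# Inspect the rating class balance of the subsample.
print(subsample["overall"].value_counts(normalize=True))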

Figure 1. Boxplot of review rating class balances. Here we can see that in our training data we have many more positive (5 rating) reviews than negative reviews. This may affect our text generation model’s ability to effectively generate negative reviews as opposed to positive reviews.

Training

Given relatively limited resources in both time and money, we decided to use Google Cloud Platform’s AI Hub Notebooks service, and we had to determine the virtual machine setup that would maximize our training efficiency. After running into multiple memory issues, we settled on a machine type with 16 vCPUs and 104 GB of RAM, plus a single NVIDIA Tesla T4 GPU. Furthermore, we used a variety of methods to ensure our code was utilizing the GPU: running torch.cuda device commands to verify GPU installation and availability, running nvidia-smi to check GPU usage, and running nvidia-smi dmon to evaluate GPU memory usage efficiency while training (a minimal sketch of the in-notebook check follows below). Our final successful training of the PPLM discriminator on our Amazon review data produced the training loss history shown in Figure 2, with a sharp initial decrease in loss followed by a more gradual decrease as the remaining data is run through.
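Here is a minimal sketch of the in-notebook GPU check described above; the nvidia-smi and nvidia-smi dmon commands are run separately in a terminal.

import torch

# Verify that PyTorch was built with CUDA support and can see the GPU.
assert torch.cuda.is_available(), "CUDA not available; check the driver install"
print(torch.cuda.device_count())      # expect 1 for our single T4
print(torch.cuda.get_device_name(0))  # expect something like "Tesla T4"

# Send work to the GPU explicitly so nothing silently runs on the CPU.
device = torch.device("cuda:0")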

Figure 2. Line plot tracing the loss over iterations during discriminator training. The discriminator is trained for a single epoch covering over nine hundred thousand data points. We see a sharp drop in loss early on, followed by a gradual decrease as training moves forward.
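For readers following along with the PPLM repo, discriminator training is driven by the repo’s run_pplm_discrim_train.py script. The invocation below is a sketch based on our reading of the repo’s README (exact flags may differ), with a placeholder path for our prepared review data; training with --dataset generic for one epoch produces weight files with the generic_classifier_head_epoch_<n>.pt naming that appears in the generation code below.

python run_pplm_discrim_train.py \
    --dataset generic \
    --dataset_fp amazon_reviews_subsample.tsv \
    --pretrained_model gpt2 \
    --epochs 1 \
    --save_model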

PPLM-Discrim Generated Reviews

After training, we could use the PPLM source code to run an example based on the pretrained GPT-2 model and our trained sentiment rating discriminator. Below is an example of the code that we run to generate the review text. In terms of hyperparameter tuning, there were two parameters we had to adjust in order to produce the desired results. Firstly, stepsize specifies the strength of attribute control: the higher the stepsize, the more the control is intensified, whereas the lower it is, the softer the control. Secondly, we set class_label to specify which sentiment we want for the output review text, corresponding to the “rating” label from the data.

from run_pplm import run_pplm_example  # assumes the uber-research/PPLM repo's run_pplm.py is on the path

run_pplm_example(
    cond_text="This food",       # prompt the generation is conditioned on
    num_samples=3,               # number of samples to generate
    pretrained_model='gpt2',
    discrim_weights='generic_classifier_head_epoch_1.pt',
    discrim_meta='generic_classifier_head_meta.json',
    discrim='generic',           # use our custom-trained discriminator
    class_label='1',             # target rating label (1 = most negative)
    length=50,                   # number of tokens to generate
    stepsize=0.90,               # strength of the attribute control
    sample=True,
    num_iterations=1,
    gamma=1,
    gm_scale=0.9,
    kl_scale=0.02,
    verbosity='quiet',
)

Example Reviews for Different Sentiments

Here are hand-picked examples of the generated review text. We show these as our results because this type of qualitative, human judgment of the text is the best way to evaluate its effectiveness: ultimately, we want our reviews to be able to convince a human that another human wrote them. With the initial start-of-sentence text as:
= Prefix of sentence = <|endoftext|>This food

The baseline review output from the pretrained GPT-2 model before any sentiment control was:

= Unperturbed generated text = <|endoftext|>This food is delicious! It tastes like real chicken, but has an amazing flavor! It's so good that I don't even have to eat it in a restaurant. It tastes like rice and the texture is so delicious! It's also a great option to

The review text output corresponding to a rating label of 1 is:

= Perturbed generated text = <|endoftext|>This food-safety inspector warned this year that the industry warned of health hazards of the tainted rice paste. The rice paste paste made by B.C. rice producer Bao Dong was used in a fatal accident in Surrey, B.C.,

The review text output corresponding to a rating label of 3 is:

= Perturbed generated text 2 = <|endoftext|>This food safety report found a "significant increase" in food safety concerns. The report will be considered in the House committee hearing on the committee's report at 3:30pm on Friday, June 8.

The review text output corresponding to a rating label of 5 is:

= Perturbed generated text 3 = <|endoftext|>This food is fantastic as a great breakfast, an easy meal, and will keep you full for up to 3 weeks. I've also found this recipe makes very tasty meals for 2 weeks or more after eating without refrigeration. It's delicious and nutritious too.

From multiple trials with a variety of starting sentence tokens, we were excited to find that we would consistently see clearly negative sentiment associated with a class label of 1, and clearly positive sentiment associated with a class label of 5. We did notice that our PPLM-Discrim model had an easier time producing accurate class 5 reviews, and would sometimes produce fairly positive reviews even for the lower-rated classes. This is likely because of the class imbalance in the data, and could be addressed by taking a stratified sample so that all classes are equally represented in the training data. Looking at examples of class label 3 text was also interesting, because it is relatively hard even for humans to specify exactly what “neutral” text is.

Rating 3 Texts Gave Interesting Neutral Results

A lot of class 3 text came out neutral in the sense that it discussed inherent properties or details of the subject, like the size specifications of a piece of clothing, or the chemical composition and names of the materials involved. For example:

= Perturbed generated text 1 = <|endoftext|>Potato chips are usually boiled in water in boiling water in an orange coloration apparatus (a color of food coloring oil or vegetable oil). The color of the coloring of potato chips are usually orange or brown depending upon whether their colors are yellow, purple or yellowish.

= Perturbed generated text 2 = <|endoftext|>These shoes are made of polypropylylyleyl acetate (PPAR) and polypropylene poly polyethylene polypropyl acetate (PUPA). The sole was then hand washed with polyethylene.

Controlling Sentiment

Furthermore, we can confirm that we are able to control the sentiment of the generated sentence even if we provide the model with a start of sentence carrying the opposite sentiment. For example, for the input:

= Prefix of sentence = <|endoftext|>This dress is tight

We get the review text output corresponding to a rating label of 5:

= Perturbed generated text = <|endoftext|>This dress is tight, but flattering and cozy, and flattering. This dress is perfect for weddings or other events that require a little attention to detail.

On the other hand, we can give the model a positive sentiment start of sentence and ask for a rating label of 1 text. For the input:

= Prefix of sentence = <|endoftext|>The shirt is nice

We get the review text output:

= Perturbed generated text = <|endoftext|>The shirt is nice, but it does not have a collar or a small picture of the logo on the back. I am not sure what is on the back. It looks too expensive shirt and the sleeves are too thin but my boyfriend and the mother who owns it looks awesome.

PPLM Outlook

Using the PPLM-Discrim algorithm to build on top of our pretrained GPT-2 model, we were able to successfully add an extra layer of complexity through the sentiment discriminator, which gave us greater control over the sentiment of our generated Amazon reviews. Although the results that we displayed in this blog post were convincing, further exploration with a variety of parameters and starting sentences shows that much can still be improved in our trained model. Our PPLM model still struggles once in a while to produce coherent text that could be passed off as human writing, and in particular it falls relatively short in generating obviously negative reviews as compared to positive ones.

In order to improve on what we’ve been able to do in this short time, we could devote more resources to training our discriminator, given that we are only using a small fraction of a specific subset of Amazon reviews (about 900,000 review texts). There are plenty of other datasets that we could combine to cover a much larger variety of human speech patterns, which simply requires more training time and a larger dataset. On the other hand, the key to improving our PPLM model’s outputs of negative sentiment reviews is to ensure the class balance in the training data gives the discriminator enough examples to accurately represent negative sentiment texts. This can be achieved by injecting more low-rating reviews into our dataset, or by performing a stratified sampling of the data to ensure equal class balance, as sketched below.
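As a sketch of the stratified balancing we have in mind (placeholder file name, and the same assumed JSON-lines layout with the rating in an 'overall' field), each rating class could be downsampled to the size of the rarest class:

import pandas as pd

# Placeholder path; same Amazon review data as before.
reviews = pd.read_json("Clothing_Shoes_and_Jewelry.json", lines=True)

# Downsample every rating class to the size of the rarest class so the
# discriminator trains on equally represented labels.
smallest = reviews["overall"].value_counts().min()
balanced = (
    reviews.groupby("overall", group_keys=False)
           .sample(n=smallest, random_state=42)
)
print(balanced["overall"].value_counts())  # equal counts per rating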

Finally, we have now been able to generate text reviews based on a keyword topic, and based on a particular sentiment. The next step is to combine the two, so that we can provide our review generation model with both a general subject/keyword and a sentiment, and have it produce a convincing review that satisfies both. We attempt to do this using Salesforce’s CTRL algorithm.

Annotated Bibliography

  1. Dathathri, Sumanth, et al. “Plug and Play Language Models: A Simple Approach to Controlled Text Generation.”, International Conference on Learning Representations 2020, 3 Mar. 2020, arxiv.org/abs/1912.02164. (This is the public conference paper by Uber engineers detailing the specific math and theory behind PPLM. This resource was used to understand how PPLM works (specifically the math) and to evaluate and compare their examples to ours.)

  2. Liu, Rosanne, et al. “Controlling Text Generation with Plug and Play Language Models.”, Uber Engineering, 11 Dec. 2019, eng.uber.com/pplm/. (This reference is the Uber Engineering blog post for PPLM, describing the general details of PPLM and how it works. This resource was key for understanding the big picture of PPLM and its processes.)

  3. “Uber-Research/PPLM.”, Uber Research PPLM Github Repo, 12 Apr. 2020, github.com/uber-research/PPLM. (This reference is the GitHub repo containing the source code for PPLM. All baseline code for our implementation of PPLM was taken from this source.)