Blog Post 1 - Project description and Literature Survey
Alex Berry, Jason Chan, Hyunjoon Lee, Sayan Samanta, Christina Ye
Brown University Data Science Initiative
DATA 2040: Deep Learning
April 12th, 2020
Introduction
Natural Language Processing (NLP) is the study of representing and analysing human language in a computer-readable form, typically by encoding language elements as multi-dimensional vectors of real numbers.
Natural language synthesis is a sub-field of NLP in which the challenge is to develop algorithms that stitch these machine-represented words together so that, when translated back into language, the result is coherent. One mark of a 'good' generated sentence is the ability to stay in context over long stretches of text (and even across paragraphs). A good model should also be able to initiate new context without being restricted to the past flow of the conversation.
While generating 'human-like' text is largely a solved problem, controlling the content of the output based on a user-defined context is still an open challenge and an active research area. Our goal in this project is to study the possibility of generating product reviews from a given set of keywords.
The benefits of this project include applications such as responsive chat-bots that can proactively steer a conversation in a desired direction. Such models can also be used to generate text in a particular style, controlled through vocabulary and sentence structure.
A bit of Maths
To understand the basic mechanism behind steering the context, it is useful to review standard language modelling and then discuss the modifications built on top of it.
Given a sequence of tokens $x = (x_1, x_2, \ldots, x_n)$, where each $x_i$ comes from a fixed set of symbols (a sentence built from a fixed vocabulary of words), the goal of a language model is to estimate $p(x)$. Since each $x_i$ is part of a sequence, we can factorize the distribution as the product

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i}).$$
This factorization intuitively converts the model into a ‘next-word’ prediction model.
To train the model, we search for the set of network parameters $\theta$ that minimises the negative log-likelihood over the dataset $D = \{x^1, \ldots, x^{|D|}\}$:

$$\mathcal{L}(D) = -\sum_{k=1}^{|D|} \log p_\theta(x^k) = -\sum_{k=1}^{|D|} \sum_{i} \log p_\theta(x_i^k \mid x_{<i}^k).$$
From this we can then generate text sequentially by sampling from $p_\theta(x_i \mid x_{<i})$.
The task at hand is to learn the conditional language model $p(x \mid c)$, where $c$ is a control code that sets the context of the sentence to be generated. The distribution can still be decomposed as before,

$$p(x \mid c) = \prod_{i=1}^{n} p(x_i \mid x_{<i}, c),$$

and the loss function is given by

$$\mathcal{L}(D) = -\sum_{k=1}^{|D|} \log p_\theta(x^k \mid c^k).$$
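To make the loss concrete, here is a minimal PyTorch sketch of the next-token negative log-likelihood, with random tensors standing in for a real model's outputs; the names `token_ids` and `logits` are illustrative placeholders, and for the conditional (CTRL-style) case the control code tokens would simply be prepended to `token_ids` before computing the same loss.

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
token_ids = torch.randint(0, vocab_size, (1, 12))   # a toy tokenised sentence
logits = torch.randn(1, 12, vocab_size)             # stand-in for the model's outputs

# Shift so that the prediction at position i is scored against token i + 1.
pred_logits = logits[:, :-1, :].reshape(-1, vocab_size)
targets = token_ids[:, 1:].reshape(-1)

# cross_entropy averages -log p_theta(x_i | x_<i) over positions.
nll = F.cross_entropy(pred_logits, targets)
print(nll.item())
```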
CTRL and PPLM
For this project we intend to use two models to generate topic-driven reviews for various products. In this post we only discuss the two models and the dataset; the implementation, fine-tuning, and results will be covered in upcoming posts.
Conditional Transformer Language Model (CTRL)
In this technique, the context keyword (the control code) is simply prepended to the raw input text, and the combined sequence is used to train a transformer model in which each input representation is the sum of a learned token embedding and a sinusoidal positional embedding, just as in the original transformer architecture.
Figure 1. The transformer model (from Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arXiv:1706.03762).
Generation of the next token is also modified by a simple trick: candidate tokens that do not follow the context are penalised during sampling.
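As a rough illustration of how control codes are used in practice, the sketch below uses the Hugging Face transformers implementation of CTRL; the checkpoint name "ctrl", the "Reviews" control code, and the sampling parameters are our assumptions for illustration, not necessarily the setup we will use when fine-tuning.

```python
from transformers import CTRLTokenizer, CTRLLMHeadModel

# Load the pretrained CTRL model and tokenizer (the checkpoint is large, ~1.6B parameters).
tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl")

# The control code is simply prepended to the raw prompt text.
prompt = "Reviews Rating: 5.0 I bought this shirt for the summer and"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# repetition_penalty implements CTRL's penalised sampling scheme.
output_ids = model.generate(
    input_ids,
    max_length=80,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,
)
print(tokenizer.decode(output_ids[0]))
```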
Plug and Play Language Model (PPLM)
A drawback of CTRL is the need to retrain the entire model whenever a new keyword (control code) is added to the dataset. Ideally, we would instead be able to recover the context-conditioned distribution from the unmodified distribution. In principle this should be possible, since by Bayes' rule

$$p(x \mid c) \propto p(x)\, p(c \mid x).$$
Thus, to control the context of the output, we shift the model's latent representations along the gradient that maximises the log-likelihood of both $p(x)$ and $p(c \mid x)$. This also gives us a degree of control over the strength of the enforcement, thereby producing better outputs.
The process can be visualised in three steps:
1. A forward pass through the language model and the attribute model to compute the likelihood of the desired attribute, $p(c \mid x)$.
2. A backward pass that updates the language model's internal latent representations using gradients from the attribute model.
3. A new forward pass from the updated latents to recompute the next-token distribution and sample the next token.
Figure 2. PPLM algorithm in short, (from the PPLM paper, cited below).
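Below is a deliberately simplified, conceptual sketch of the perturbation step (not the authors' implementation): a hidden state `h` from a frozen language model is nudged along the gradient of an attribute classifier's log-likelihood, and the next token is then re-decoded from the shifted state. The linear layers `lm_head` and `attribute_head`, and all the sizes, are toy stand-ins we introduce for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, vocab_size, num_attributes = 16, 50, 2

lm_head = torch.nn.Linear(hidden_dim, vocab_size)             # h -> next-token logits
attribute_head = torch.nn.Linear(hidden_dim, num_attributes)  # h -> logits over attributes c

h = torch.randn(1, hidden_dim, requires_grad=True)  # latent from the (frozen) language model
target_attribute = torch.tensor([1])                # the desired context / control code c

step_size, n_steps = 0.05, 5
for _ in range(n_steps):
    attr_log_probs = F.log_softmax(attribute_head(h), dim=-1)
    loss = F.nll_loss(attr_log_probs, target_attribute)   # -log p(c | x)
    loss.backward()
    with torch.no_grad():
        # Take a small gradient-ascent step on log p(c | x).
        h -= step_size * h.grad / (h.grad.norm() + 1e-8)
    h.grad.zero_()

# Re-decode the next-token distribution from the shifted latent.
next_token = torch.argmax(lm_head(h), dim=-1)
print(next_token.item())
```

The full PPLM algorithm perturbs the transformer's key-value history rather than a single vector and also keeps the perturbed distribution close to the original language model's; the sketch only conveys the gradient-steering idea.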
The Dataset
The data we are using is a subset of the “Amazon Review Data (2018)” dataset created by Jianmo Ni at UCSD. This larger data set is an updated version of the original Amazon review data set from 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). The 2018 version contains a total of 233.1 million reviews.
Our Subset
Our project-specific dataset is composed of reviews limited to the "Clothing, Shoes and Jewelry" category, of which there are 11,285,464 reviews. The dataset is organized as one review per line in JSON format. A sample review looks like:
{
  "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"],
  "overall": 5.0,
  "vote": "2",
  "verified": true,
  "reviewTime": "01 1, 2018",
  "reviewerID": "AUI6WTTT0QZYS",
  "asin": "5120053084",
  "style": {
    "Size:": "Large",
    "Color:": "Charcoal"
  },
  "reviewerName": "Abbey",
  "reviewText": "I now have 4 of the 5 available colors of this shirt... ",
  "summary": "Comfy, flattering, discreet--highly recommended!",
  "unixReviewTime": 1514764800
}
For our purposes, we are interested in the asin (the unique product ID, used to match the product name from the metadata), reviewText, and perhaps the summary fields.
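As a short sketch of how these fields might be pulled out of the raw file, the snippet below assumes the category file is distributed as a gzipped JSON-lines file; the filename shown is our assumption.

```python
import gzip
import json

fields = ("asin", "reviewText", "summary")
reviews = []

# One review per line, in JSON format, as in the sample above.
with gzip.open("Clothing_Shoes_and_Jewelry.json.gz", "rt") as f:
    for line in f:
        review = json.loads(line)
        reviews.append({k: review.get(k, "") for k in fields})

print(len(reviews))
print(reviews[0]["summary"])
```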
Coming up next
In our next post, we shall further explore the dataset (EDA). Stay tuned!
References
- Jianmo Ni, Jiacheng Li, Julian McAuley, "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects", Empirical Methods in Natural Language Processing (EMNLP), 2019, https://www.aclweb.org/anthology/D19-1018/ (Citation for the dataset used in this project.)
- Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu, "Plug and Play Language Models: A Simple Approach to Controlled Text Generation", arXiv Computation and Language, 2019, https://arxiv.org/abs/1912.02164 (Uber's controlled text generation approach, which steers a pretrained language model with a small attribute discriminator, without retraining the language model.)
- Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher, "CTRL: A Conditional Transformer Language Model for Controllable Generation", Salesforce, 2019, https://arxiv.org/abs/1909.05858 (Trains a conditional language model from scratch, conditioned on control codes.)