Generating Words from Embeddings

cover

In this post, I’m going to go over a recent project of mine in which I investigate how well a neural network could learn to generate words one character at a time, conditioned on an embedding vector representing its meaning.

The Model

I used a simple generative character level recurrent decoder model to generate the words character by character. This means that the output of the model is a probability over characters, and the input is the previously generated character. I condition the model on the meaning of the word by initializing the hidden state of the RNN based on the word embedding. In particular, I pass the word vector through a simple fully connected layer and use that to initialize the hidden state of the RNN.

model
The model architecture of the character level decoder

This model therefore uses the initial hidden state as a sort of input as opposed to the actual inputs to the RNN.

Training

To train the RNN, I used teacher forcing. This means that, during training, I don’t use the actual outputs of the RNN to feed the next timestep, and instead I just use the actual word as the input to the RNN.

I trained a simple GRU model on the GloVe word vectors, and a sufficiently large model (hidden size of ~1500) could overfit pretty easily. However, the loss on a small validation set would increase pretty quickly and not match the steady decrease of the training loss. This makes sense because it would be impractical to expect the network to learn the exact textual representation of a word vector it has never seen before. It would only base it upon previously seen words, so if the dataset contained words which had some meaning that could not be directly discerned from its character level representation, then the network would probably have no way of generating it. This means that looking at validation loss may not be a good metric for ‘originality’ of the generated words.

To make the network generate more ‘original’ words, I tried to add regularization schemes like dropout and $L2$ weight decay, however they seemed to make the network generate more words from the vocabulary as opposed to new words. Another attempt was to add noise to the embeddings while training, but this also didn’t lead to much improvement (validation loss still increased, but a bit slower than without regularization). It seemed like the models which overfit more actually ended up generating more ‘novel’ words. However, this is all a qualitative assessment, as I had no quantitative way of measuring the originality of generated words.

Sampling using Beam Search

To sample words from the decoder, we need to feed in the output of a previous time step as the input to the next timestep. In particular, we need to find the word with the highest probability given the input embedding. The probability of a word being generated is obtained by multiplying the probabilities of generating each of its characters. This is made clear in the diagram below:

model
The probability of generating the word cat is found by feeding in the characters of the word as input and multiplying the corresponding output probabilities.

We need to maximize this probability over all possible words which can be generated. We can see that this is a search problem on a graph generated by the RNN. At each time step, the nodes in the graph split for each character, so the number of possible generated words grows exponentially with the length of the word.

It is intractable to search through all possible words to find the one with the highest probability. Instead, we can use a greedy approximation to find a word with a high probability, but which may not be the one with the highest probability.

  • One simple way to do this is to just take the character with the highest probability in each time step and feed it as the input for the next time step.
  • We can do much better with a minimal increase in effort, by generalizing it to take the $k$ most likely characters at each step and keeping a track of only the top $k$ words with highest probabilities. These $k$ words form a so-called beam of samples, so this algorithm is called beam search.
model
The beam search algorithm with an example input and output and a beam size of $3$. The word vector for musical is passed through the model, with some noise. The words in each column are sorted in descending order of probabilities. Only the top $k$ words are used to expand the beam. The word with the highest probability at the end is melodynamic.
  • In the diagram shown, each column of words represents a timestep in sampling. The words are sorted from top to bottom in descending order of probabilities.
  • Only the top $k=3$ words are expanded at each time step.
  • Of the resulting words at the next time step, only the top $k$ words are kept.

To make the network generate new words, I added some noise to the embedding of the input word. In the example diagram, some noise is added to the embedding vector of the word musical, and then passed through the model. Performing beam search generated the word melodynamic, which seems to sound similar in meaning to the word musical, but is also an original word (it isn’t present in the vocabulary for GloVe vectors).

Passing the same embedding with a beam size of 20 yields the following words:

melodynamic mercenation melody merced merceant melionic meanical melical meldan melding mercant melidic melium melodylam melodylation magnage melodian mealing melodimentary melodynamics

It seems like the different words in the beam are quite similar in structure. Some of them are also words which exist in the GloVe vocabulary (melody for instance), and these words do have similar meanings to the input word.

Examples

Here are some cherry picked examples of generated words which look pretty cool, sampled with some noise added to the input word embedding:

Input word Generated words
musical melodynamic, melodimentary, songrishment
war demutualization, armision
intelligence technicativeness, intelimetry
intensity miltrality, amphasticity
harmony symphthism, ordenity, whistlery, hightonial
conceptual stemanological, mathedrophobic
mathematics tempologistics, mathdom
research scienting
befuddled badmanished, stummied, stumpingly
dogmatic doctivistic, ordionic, prescribitious, prefactional, pastological

These results indicate that the network has learned to link certain aspects of the embedding representation with the character level representation. However, we saw in the full beam of samples for the word musical that the network also recreates words which are present in the vocabulary, so originality of the generated words is not guaranteed.

You can find a huge list of generated words from which I cherry picked the above samples here.

Fixing initial characters

An interesting modification to the sampling process is to fix the initial few characters, and let the network complete the rest of the word. This is again done just by passing the fixed initial characters through the decoder and then using beam search to fill in the rest of the word.

For example, by passing in the word vector for appropriate, and fixing the first characters as prop, we get the following beam of outputs:

Input word: appropriate, start = prop
proper properate proport propore properatory proprient properative propriety propate propare proparation proparate proparative propriate properator propriet proportion proprie proportate proprietive

Similarly, for the input word gas, these are the outputs:

Input word: gas, start = prop
propel propellate propelly propellant prope propellation propelle propage propher propea prophetic propell propesting propous propely propeline propellat propesing propeling propoint

This further shows that the network has learned to map embedding representations, or meanings, to parts of words.

What next?

I have experimented with using FastText embeddings, and they seem to converge much faster than the GloVe vectors, as they were designed by keeping the character level structures of words in mind. However, I could not make out any qualitative differences in the generated words.

Another important thing to look into would be developing a good metric for measuring the ‘originality’ of the generated words. This would allow the different embedding and regularization schemes to be compared quantitatively.

Try it out yourself

I’ve uploaded the code for this project on github here. It has the weights of one particular GRU model trained on Glove word vectors. Go ahead and sample with different noise levels and initial characters and tweet me @rajat_vd if you find some cool looking words!


I’d like to thank my brother Anjan for our discussions about the ideas in this post. I’d also like to thank the Computer Vision and Intelligence Group of IIT Madras for generously providing me with the compute resources for this project.

Written on September 30, 2018