Generating Words from Embeddings
In this post, I’m going to go over a recent project of mine, in which I investigate how well a neural network can learn to generate words one character at a time, conditioned on an embedding vector representing their meaning.
I used a simple generative character level recurrent decoder model to generate the words character by character. This means that the output of the model is a probability over characters, and the input is the previously generated character. I condition the model on the meaning of the word by initializing the hidden state of the RNN based on the word embedding. In particular, I pass the word vector through a simple fully connected layer and use that to initialize the hidden state of the RNN.
This model therefore uses the initial hidden state as a sort of input as opposed to the actual inputs to the RNN.
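A minimal sketch of this architecture in PyTorch might look like the following. The layer names and sizes here are my own assumptions for illustration, not the exact code from the project (the post only mentions a hidden size of around 1500 and GloVe-sized embeddings):

```python
# Sketch: a character-level GRU decoder whose hidden state is initialized
# from a word embedding via a fully connected layer. Names and dimensions
# are assumptions, not the project's actual code.
import torch
import torch.nn as nn

class CharDecoder(nn.Module):
    def __init__(self, n_chars, embed_dim=300, char_dim=64, hidden_dim=1500):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)  # input characters
        self.init_fc = nn.Linear(embed_dim, hidden_dim)    # word vector -> h_0
        self.gru = nn.GRU(char_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)          # logits over next character

    def forward(self, word_vec, char_ids):
        # word_vec: (batch, embed_dim), char_ids: (batch, seq_len)
        h0 = torch.tanh(self.init_fc(word_vec)).unsqueeze(0)  # (1, batch, hidden)
        x = self.char_embed(char_ids)
        out, _ = self.gru(x, h0)
        return self.out(out)  # (batch, seq_len, n_chars)
```

The only unusual part relative to a standard character RNN is `init_fc`, which injects the word’s meaning through the initial hidden state rather than through the inputs.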
To train the RNN, I used teacher forcing. This means that, during training, I don’t feed the RNN’s own predictions back in as inputs to the next timestep; instead, I feed in the characters of the actual target word.
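One teacher-forced training step could be sketched as below, under the assumption that the model maps a word vector and a character sequence to per-timestep logits (the function and argument names are mine, for illustration):

```python
# Sketch of one teacher-forced training step (names assumed): the ground-truth
# characters, shifted by one position, are the inputs; the model's own
# predictions are never fed back in during training.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, word_vec, char_ids):
    # char_ids: (batch, seq_len), including start and end tokens
    inputs, targets = char_ids[:, :-1], char_ids[:, 1:]
    logits = model(word_vec, inputs)  # (batch, seq_len - 1, n_chars)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```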
I trained a simple GRU model on the GloVe word vectors, and a sufficiently large model (hidden size of ~1500) could overfit pretty easily. However, the loss on a small validation set increased pretty quickly instead of matching the steady decrease of the training loss. This makes sense: it would be impractical to expect the network to learn the exact spelling of a word whose vector it has never seen before. It can only generalize from previously seen words, so if the dataset contains words whose meaning cannot be discerned from their character-level representation, then the network would probably have no way of generating them. This means that validation loss may not be a good metric for the ‘originality’ of the generated words.
To make the network generate more ‘original’ words, I tried adding regularization schemes like dropout and $L_2$ weight decay; however, they seemed to make the network generate more words from the vocabulary, as opposed to new words. Another attempt was to add noise to the embeddings during training, but this also didn’t lead to much improvement (validation loss still increased, just a bit slower than without regularization). In fact, the models which overfit more actually ended up generating more ‘novel’ words. However, this is all a qualitative assessment, as I had no quantitative way of measuring the originality of generated words.
Sampling using Beam Search
To sample words from the decoder, we need to feed the output of each timestep back in as the input to the next timestep. In particular, we want to find the word with the highest probability given the input embedding. The probability of generating a word is the product of the probabilities of generating each of its characters. This is made clear in the diagram below:
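In symbols, for a word $w$ made up of characters $c_1, c_2, \ldots, c_T$ and an input embedding $e$, the decoder defines

$$P(w \mid e) = \prod_{t=1}^{T} P(c_t \mid c_1, \ldots, c_{t-1}, e),$$

where each factor is the softmax output of the RNN at timestep $t$.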
We need to maximize this probability over all possible words which can be generated. We can see that this is a search problem on a graph generated by the RNN. At each time step, the nodes in the graph split for each character, so the number of possible generated words grows exponentially with the length of the word.
It is intractable to search through all possible words to find the one with the highest probability. Instead, we can use a greedy approximation to find a word with high probability, though not necessarily the highest.
- One simple way to do this is to just take the character with the highest probability in each time step and feed it as the input for the next time step.
- We can do much better with a minimal increase in effort by generalizing this: take the $k$ most likely characters at each step and keep track of only the top $k$ words with the highest probabilities. These $k$ words form a so-called beam of samples, so this algorithm is called beam search.
- In the diagram shown, each column of words represents a timestep in sampling. The words are sorted from top to bottom in descending order of probabilities.
- Only the top $k=3$ words are expanded at each time step.
- Of the resulting words at the next time step, only the top $k$ words are kept.
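The steps above can be sketched as a short function. This is a generic beam search over any character model, not the project’s actual code; I assume a `step` function that, given the sequence so far, returns candidate next characters with their log-probabilities (working in log space avoids underflow from multiplying many small probabilities):

```python
# Minimal beam search sketch. Assumed interface: step(seq) yields
# (character, log_prob) pairs for the next character given the prefix seq.
import math

def beam_search(step, start, end, k=3, max_len=20):
    # Each beam entry is (total log-probability, character sequence).
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for c, c_logp in step(seq):
                candidates.append((logp + c_logp, seq + [c]))
        # Keep only the k most probable sequences overall.
        candidates.sort(key=lambda b: b[0], reverse=True)
        beams = []
        for logp, seq in candidates[:k]:
            (finished if seq[-1] == end else beams).append((logp, seq))
        if not beams:
            break
    return sorted(finished + beams, key=lambda b: b[0], reverse=True)
```

Sequences that emit the end-of-word token are moved to a `finished` list so they stop being expanded, and the final output is all candidates sorted by probability.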
To make the network generate new words, I added some noise to the embedding of the input word. In the example diagram, some noise is added to the embedding vector of the word musical, which is then passed through the model. Performing beam search generated the word melodynamic, which feels close in meaning to musical but is also an original word (it isn’t present in the vocabulary for GloVe vectors).
Passing the same embedding with a beam size of 20 yields the following words:
melodynamic mercenation melody merced merceant melionic meanical melical meldan melding mercant melidic melium melodylam melodylation magnage melodian mealing melodimentary melodynamics
It seems like the different words in the beam are quite similar in structure. Some of them are also words which exist in the GloVe vocabulary (melody for instance), and these words do have similar meanings to the input word.
Here are some cherry picked examples of generated words which look pretty cool, sampled with some noise added to the input word embedding:
| Input word | Generated words |
| --- | --- |
| musical | melodynamic, melodimentary, songrishment |
| harmony | symphthism, ordenity, whistlery, hightonial |
| befuddled | badmanished, stummied, stumpingly |
| dogmatic | doctivistic, ordionic, prescribitious, prefactional, pastological |
These results indicate that the network has learned to link certain aspects of the embedding representation with the character level representation. However, we saw in the full beam of samples for the word musical that the network also recreates words which are present in the vocabulary, so originality of the generated words is not guaranteed.
You can find a huge list of generated words from which I cherry picked the above samples here.
Fixing initial characters
An interesting modification to the sampling process is to fix the initial few characters, and let the network complete the rest of the word. This is again done just by passing the fixed initial characters through the decoder and then using beam search to fill in the rest of the word.
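This prefix-forcing can be sketched as follows, assuming a decoder like the one described earlier (with a hidden-state-initializing layer, a character embedding, and a GRU; all names here are my assumptions for illustration):

```python
# Sketch of prefix-forced sampling: teacher-force the fixed initial
# characters through the decoder to advance its hidden state, then hand
# that state to beam search to complete the word. Attribute names
# (init_fc, char_embed, gru) are assumed, not the project's actual code.
import torch

def complete_word(model, word_vec, prefix_ids, beam_search_fn, k=20):
    with torch.no_grad():
        # Initialize the hidden state from the (possibly noisy) word vector.
        h = torch.tanh(model.init_fc(word_vec)).unsqueeze(0)
        # Feed the fixed prefix characters; keep only the final hidden state.
        x = model.char_embed(prefix_ids)
        _, h = model.gru(x, h)
    # Beam search continues generation from this conditioned state.
    return beam_search_fn(h, k)
```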
For example, by passing in the word vector for appropriate, and fixing the first characters as prop, we get the following beam of outputs:
Input word: appropriate, start = prop
proper properate proport propore properatory proprient properative propriety propate propare proparation proparate proparative propriate properator propriet proportion proprie proportate proprietive
Similarly, for the input word gas, these are the outputs:
Input word: gas, start = prop
propel propellate propelly propellant prope propellation propelle propage propher propea prophetic propell propesting propous propely propeline propellat propesing propeling propoint
This further shows that the network has learned to map embedding representations, or meanings, to parts of words.
I have experimented with using FastText embeddings, and they seem to converge much faster than the GloVe vectors, as they were designed with the character-level structure of words in mind. However, I could not make out any qualitative differences in the generated words.
Another important thing to look into would be developing a good metric for measuring the ‘originality’ of the generated words. This would allow the different embedding and regularization schemes to be compared quantitatively.
Try it out yourself
I’ve uploaded the code for this project on github here. It has the weights of one particular GRU model trained on GloVe word vectors. Go ahead and sample with different noise levels and initial characters, and tweet me @rajat_vd if you find some cool looking words!
I’d like to thank my brother Anjan for our discussions about the ideas in this post. I’d also like to thank the Computer Vision and Intelligence Group of IIT Madras for generously providing me with the compute resources for this project.