DAT450/DIT247: Programming Assignment 2: Generating text from a language model
In this assignment, we extend the models we investigated in the previous assignment in two different ways:
- In the previous assignment, we used a model that takes a fixed number of previous words into account. Now, we will use a model capable of considering a variable number of previous words: a recurrent neural network. (Optionally, you can also investigate Transformers.)
- In this assignment, we will also use our language model to generate texts.
Pedagogical purposes of this assignment
- Investigating more capable neural network architectures for language modeling.
- Understanding text-generating algorithms.
Requirements
Please submit your solution in Canvas. Submission deadline: November 18.
Submit a notebook containing your solution to the programming tasks described below. This is a pure programming assignment and you do not have to write a technical report or explain details of your solution in the notebook: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.
Step 0: Preliminaries
Make sure you have access to your solution for Programming Assignment 1 since you will reuse some parts.
Copy the tokenization and integer encoding part into a new notebook.
Step 1: Adapting your code for RNNs
Adapting the preprocessing
In the previous assignment, you developed preprocessing tools that extracted fixed-length sequences from the training data. You will now adapt the preprocessing so that you can deal with inputs of variable length.
Splitting: While we will deal with longer sequences than in the previous assignment, we’ll still have to control the maximal sequence length (or we’ll run out of GPU memory). Define a hyperparameter max_sequence_length and split your sequences into pieces that are at most of that length. (Side note: in RNN training, limiting the sequence length is called truncated backpropagation through time.)
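For illustration, a splitting step along these lines might look as follows; the function name and the example values are only illustrative, not part of the required solution:

    # Hypothetical helper: split one integer-encoded paragraph into pieces
    # of at most max_sequence_length tokens.
    def split_sequence(token_ids, max_sequence_length):
        return [token_ids[i:i + max_sequence_length]
                for i in range(0, len(token_ids), max_sequence_length)]

    # Example: a 7-token paragraph split with max_sequence_length = 3.
    print(split_sequence([5, 8, 2, 9, 4, 7, 1], 3))   # [[5, 8, 2], [9, 4, 7], [1]]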
Padding: In the previous assignment, you developed a tool that finds the most frequent words in order to build a vocabulary. In this vocabulary, you defined special symbols to cover a number of corner cases: the beginning and end of text passages, and when a word is previously unseen or too infrequent. Now, change your vocabulary builder to include a new special symbol that we will call padding: this will be used when our batches contain texts of different lengths.
After these changes, preprocess the text and build the vocabulary as in the previous assignment. Store the integer-encoded paragraphs in two lists, corresponding to the training and validation sets.
Sanity check: You should have around 147,000 training paragraphs and 18,000 validation paragraphs. However, since you split the sequences, you will in the end get a larger number of training and validation instances. (The exact numbers depend on max_sequence_length.)
Adapting the batcher
In the previous assignment, you implemented some function to create training batches: that is, to put some number of training instances into a PyTorch tensor.
Now, change your batching function so that it can deal with sequences of variable lengths. Since the batching function must output rectangular tensors, you need to pad the sequences so that they are all of the same length: for each instance that is shorter than the longest instance in the batch, append the padding symbol until it has the right length.
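As a sketch of what this could look like, assuming pad_id is the integer code of your padding symbol (all names here are examples):

    import torch

    # Hypothetical batcher: pad each sequence to the length of the longest
    # sequence in the batch, then stack them into a (B, N) integer tensor.
    def make_batch(sequences, pad_id):
        max_len = max(len(seq) for seq in sequences)
        padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
        return torch.tensor(padded, dtype=torch.long)

    batch = make_batch([[4, 9, 2], [7, 3], [5]], pad_id=0)
    print(batch)   # padding (0) appears only at the end of each row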
Sanity check: Inspect a few batches. Make sure that they are 2-dimensional integer tensors with B rows, where B is the batch size you defined. The number of columns probably varies from batch to batch, but should never be longer than the max_sequence_length you defined previously. The integer-encoded padding symbol should only occur at the end of sequences.
Step 2: Designing a language model using a recurrent neural network
Setting up the neural network structure
Define a neural network that implements an RNN-based language model. It should include the following layers:
- an embedding layer that maps token integers to floating-point vectors,
- a recurrent layer implementing some RNN variant (we suggest nn.LSTM or nn.GRU),
- an output layer that computes (the logits of) a probability distribution over the vocabulary.
You will have to define some hyperparameters such as the embedding size (as in the previous assignment) and the size of the RNN’s hidden state.
Hint: If you are doing the batching as recommended above, you should set batch_first=True when declaring the RNN.

If you set batch_first=True, the RNN assumes that the input tensor is arranged as (B, N, E), where B is the batch size, N is the sequence length, and E the embedding dimensionality. In this case, the RNN "walks" along the second dimension: that is, over the sequence of tokens. If on the other hand you set batch_first=False, then the RNN walks along the first dimension of the input tensor, which is assumed to be arranged as (N, B, E).

Hint: How to apply RNNs in PyTorch.
Take a look at the documentation of one of the RNN types in PyTorch. For instance, here is the documentation of nn.LSTM. In particular, look at the section called Outputs. It is important to note that all types of RNNs return two outputs when you call them in the forward pass. In this assignment, you will need the first of these outputs, which corresponds to the RNN's output at each token position. (The second output contains the final hidden states for each layer, which you will not need here.)
As we discussed in the previous assignment, PyTorch allows users to set up neural networks in different ways: the more compact approach using nn.Sequential, and the more powerful approach of inheriting from nn.Module.
If you implement your language model by inheriting from nn.Module, just remember that the RNN gives two outputs in the forward pass, and that you just need the first of them.
    class MyRNNBasedLanguageModel(nn.Module):
        def __init__(self, ...):
            super().__init__()
            # ... initialize model components here ...

        def forward(self, batch):
            embedded = ...  # apply the embedding layer
            rnn_out, _ = self.rnn(embedded)
            # ... do the rest ...
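For reference, here is one possible way the skeleton could be filled in; this is only a sketch, and the hyperparameter names and the choice of nn.LSTM are examples rather than requirements:

    import torch.nn as nn

    class MyRNNBasedLanguageModel(nn.Module):
        def __init__(self, vocab_size, emb_dim, hidden_dim):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.output = nn.Linear(hidden_dim, vocab_size)

        def forward(self, batch):
            embedded = self.embedding(batch)   # (B, N) -> (B, N, E)
            rnn_out, _ = self.rnn(embedded)    # keep only the per-token outputs
            return self.output(rnn_out)        # (B, N, H) -> (B, N, V)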
If you define your model using nn.Sequential, we need a workaround to deal with the complication that the RNN returns two outputs. Here is one way to do it.
    class RNNOutputExtractor(nn.Module):
        def __init__(self):
            super().__init__()

        def forward(self, rnn_out):
            return rnn_out[0]
The RNNOutputExtractor can then be put after the RNN in your list of layers.
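For instance, a sketch of such a layer list might look as follows (the hyperparameter values are placeholders):

    import torch.nn as nn

    vocab_size, emb_dim, hidden_dim = 10000, 64, 128   # example values only

    model = nn.Sequential(
        nn.Embedding(vocab_size, emb_dim),
        nn.LSTM(emb_dim, hidden_dim, batch_first=True),
        RNNOutputExtractor(),                  # drops the LSTM's second output
        nn.Linear(hidden_dim, vocab_size),
    )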
Sanity check: carry out the following steps:
- Create an integer tensor of shape 1xN where N is the length of the sequence. It doesn’t matter what the integers are except that they should be less than the vocabulary size. (Alternatively, take one instance from your training set.)
- Apply the model to this input tensor. It shouldn’t crash here.
- Make sure that the shape of the returned output tensor is 1xNxV where V is the size of the vocabulary. This output corresponds to the logits of the next-token probability distribution, but it is useless at this point because we haven’t yet trained the model.
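A minimal version of this sanity check, assuming your network is called model and your vocabulary size vocab_size, could look roughly like this:

    import torch

    N = 12                                               # any sequence length
    dummy_input = torch.randint(0, vocab_size, (1, N))   # integers below the vocabulary size
    with torch.no_grad():
        out = model(dummy_input)
    print(out.shape)                                     # expected: (1, N, vocab_size)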
Training the model
Adapt your training loop from the previous assignment, with the following changes.

Hint: the output tensor contains the same tokens as the input tensor, shifted by one position: at each position, the output token is the token that comes after the corresponding input token.
    input_tokens = batch[:, :-1]
    output_tokens = batch[:, 1:]
Hint: how to apply the loss function when training a language model.

The loss function (CrossEntropyLoss) expects two input tensors:
- the logits (that is: the unnormalized log probabilities) of the predictions,
- the targets, that is, the true output values we want the model to predict.

One simple way to make the shapes compatible is to flatten the batch and sequence dimensions before applying the loss:
    targets = targets.view(-1)                   # 2-dimensional -> 1-dimensional
    logits = logits.view(-1, logits.shape[-1])   # 3-dimensional -> 2-dimensional
Hint: take padding into account when defining the loss.

CrossEntropyLoss has a parameter ignore_index that you can set to the integer you use to represent the padding tokens.

Run the training function and compute the perplexity on the validation set as in the previous assignment.
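To show how these hints fit together, here is a rough sketch of a single training step and of the perplexity computation; the variable names and the assumed padding code pad_id are illustrative, not prescriptive:

    import math
    import torch
    import torch.nn as nn

    pad_id = 0                                            # assumed code of the padding symbol; use your own
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)    # padded positions are ignored by the loss

    def training_step(model, batch, optimizer):
        input_tokens = batch[:, :-1]
        output_tokens = batch[:, 1:]
        logits = model(input_tokens)                          # (B, N-1, V)
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]),  # flatten batch and position dimensions
                       output_tokens.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Validation perplexity: exponentiate the mean cross-entropy over the validation set,
    # e.g. math.exp(sum(validation_losses) / len(validation_losses)).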
Step 3: Generating text
Predicting the next word
As a starting point, we’ll repeat the exercise from the first assignment where we see what the model predicts as the next word of a given sequence. For instance, for the sequence he lives in san, a well-trained model will typically predict the word francisco. The steps will typically be something like the following:
- Apply the model to the integer-encoded input text.
- Take the model’s output at the last position.
- Use argmax to find the index of the highest-scoring item.
- Apply the inverse vocabulary encoder (that you created when you built the vocabulary) so that you can understand what words the model thinks are the most likely in this context.
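These steps might be coded roughly as follows; encode and decode stand in for your own vocabulary encoder and its inverse, whatever you called them:

    import torch

    def predict_next_word(model, text):
        token_ids = encode(text)                    # your integer encoder (placeholder name)
        input_tensor = torch.tensor([token_ids])    # shape (1, N)
        with torch.no_grad():
            logits = model(input_tensor)            # shape (1, N, V)
        best_index = logits[0, -1].argmax().item()  # highest-scoring token at the last position
        return decode(best_index)                   # inverse vocabulary encoder (placeholder name)

    # predict_next_word(model, 'he lives in san')   # ideally something like 'francisco'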
Generating texts
Implement a random sampling algorithm as described in the recording (video, pdf). The function should take the following inputs:
- model: the language model that we use to predict the next token.
- prompt: the prompt that initializes the text generation.
- max_length: the maximal number of steps before terminating.
- temperature: controls the degree of randomness by scaling the predicted logits.
- topk: to implement top-K sampling, i.e. the next-token distribution is truncated so that it only includes the topk most probable tokens.
The text generation should proceed until an end-of-text symbol has been generated, or for at most max_length steps.
Hint: How to sample from the next-token distribution.
The easiest option is probably to use torch.distributions.Categorical. A Categorical is a probability distribution over a set of choices, each of which has its own probability. So this is equivalent to the case where we have a set of possible next tokens, with different probabilities.
The following code shows an example of how Categorical can be used. In your code, you will replace example_logits with the next-token distribution predicted by your language model.
    import torch
    from torch.distributions import Categorical

    # Logits of the probabilities of 5 different choices.
    example_logits = torch.tensor([0.0, 0.5, -0.2, 0.1, 0.05])
    example_distr = Categorical(logits=example_logits)
    sampled = example_distr.sample()
Hint: The topk function will be useful when you implement top-K sampling.
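Putting the hints together, a sampling-based generation function might be sketched as follows; encode, decode, and end_of_text_id are placeholders for your own vocabulary tools and end-of-text code:

    import torch
    from torch.distributions import Categorical

    def generate(model, prompt, max_length=50, temperature=1.0, topk=None):
        token_ids = encode(prompt)                                # your integer encoder (placeholder)
        for _ in range(max_length):
            with torch.no_grad():
                logits = model(torch.tensor([token_ids]))[0, -1]  # next-token logits, shape (V,)
            logits = logits / temperature                         # temperature scaling
            if topk is not None:
                values, indices = logits.topk(topk)               # keep the topk highest logits
                next_id = indices[Categorical(logits=values).sample()].item()
            else:
                next_id = Categorical(logits=logits).sample().item()
            token_ids.append(next_id)
            if next_id == end_of_text_id:                         # stop at the end-of-text symbol
                break
        return decode(token_ids)                                  # back to readable text (placeholder)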
Run your generation algorithm with some different prompts and input parameters, and try to investigate the effects. In the reflection questions, you will be asked to summarize your impression of how texts are generated with different prompts and input parameters.
Sanity check: There are two ways to make this random sampling algorithm behave like greedy decoding (that is: there is no randomness, and the most likely next word is selected in each step). Run the function in these two ways and make sure you get the same output in both cases.
Optional tasks
These tasks can be done if you are curious but will not affect your score.
Dealing with repetition
As you might have observed, it is a common problem when generating from an autoregressive language model that some words or phrases are repeated over and over, in particular if you use greedy decoding (or beam search) or random sampling with a low temperature.
Implement some trick to try to reduce the amount of repetition, for instance by penalizing the generation algorithm if it wants to generate words that it has already generated.
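As one illustrative (and simplistic) option, you could lower the logits of tokens that have already been generated before sampling; the penalty factor below is an arbitrary example:

    import torch

    # Sketch of a simple repetition penalty: make already-generated tokens less likely.
    def penalize_repetitions(logits, generated_ids, penalty=2.0):
        logits = logits.clone()
        for token_id in set(generated_ids):
            if logits[token_id] > 0:
                logits[token_id] = logits[token_id] / penalty   # positive logits shrink toward zero
            else:
                logits[token_id] = logits[token_id] * penalty   # negative logits become more negative
        return logits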
Transformer language models
Compare the RNN-based language model to an autoregressive Transformer. See the PyTorch tutorial for an example of how to set up a Transformer-based language model using PyTorch’s Transformer implementation.