DAT450/DIT247: Programming Assignment 1: Introduction to language modeling
[Still under construction as of Oct. 29]
Language modeling is the foundation that recent advances in NLP technologies build on. In essence, language modeling means that we learn to imitate the language we observe in the wild. More formally, we want to train a system that models the statistical distribution of natural language. Solving this task is exactly what the famous commercial large language models do (with some additional post-hoc tweaking to make the systems more interactive and to avoid generating provocative outputs).
In the course, we will cover a variety of technical solutions to this fundamental task (in most cases, various types of Transformers). In this first assignment of the course, we are going to build a neural network-based language model that uses recurrent neural networks (RNNs) to model the interaction between words.
However, setting up the neural network itself is a small part of this assignment, and the main focus is on all the other steps we have to carry out in order to train a language model. That is: we need to process the text files, manage the vocabulary, run the training loop, and evaluate the trained models.
Pedagogical purposes of this assignment
- Introducing the task of language modeling,
- Getting experience of preprocessing text,
- Understanding the concept of word embeddings,
- Refreshing basic skills in how to set up and train a neural network,
- Introducing some parts of the HuggingFace ecosystem.
Prerequisites
We expect that you can program in Python and that you have some knowledge of basic object-oriented programming. We will use terms such as “classes”, “methods”, “attributes”, “functions” and so on.
On the theoretical side, you will need to remember fundamental concepts related to neural networks such as forward and backward passes, batches, initialization, optimization.
On the practical side, you will need to understand the basics of PyTorch such as tensors, models, optimizers, loss functions and how to write the training loop. (If you need a refresher, there are plenty of tutorials available, for instance on the PyTorch website.) In particular, the Optimizing Model Parameters tutorial contains more or less everything you need to know for this assignment about PyTorch training loops.
Submission requirements
Please submit your solution in Canvas. Submission deadline: November XX.
Submit a XXXX containing your solution to the programming tasks described below. This is a pure programming assignment and you do not have to write a technical report or explain details of your solution in the XXX: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.
Part 0: Preliminaries
Installing libraries
If you are working on your own machine, make sure that the following libraries are installed:
- NLTK or SpaCy for tokenization,
- PyTorch for building and training the models,
- Transformers and Datasets from HuggingFace,
- Optional: Matplotlib and scikit-learn for the embedding visualization in the last step.
If you are using a Colab notebook, these libraries are already installed.
Downloading the files
TODO DESCRIBE HOW TO DOWNLOAD SKELETON
Download and extract this archive, which contains three text files. The files have been created from Wikipedia articles converted into raw text, with all Wiki markup removed. (We’ll actually just use the training and validation sets, and you can ignore the test file.)
Accessing the compute cluster
TODO DESCRIBE HOW TO ACCESS MINERVA VENV
Part 1: Tokenization
Terminological note: It can be useful to keep in mind that people in NLP use the word tokenization in a couple of different ways. Traditionally, tokenization referred to the process of splitting texts into separate words. More recently, tokenization has come to mean all the preprocessing steps we carry out to convert text into a numerical format suitable for neural networks. To avoid confusion, in this assignment we will use the term tokenization in the modern sense, and use the term word splitting otherwise.
Using NLTK or SpaCy for word splitting
In this assignment, you will just use an existing library to split texts into words. Popular NLP libraries such as SpaCy and NLTK come with built-in functions for this purpose. We recommend NLTK in this assignment since it is somewhat faster than SpaCy and somewhat easier to use.
Hint: How to use NLTK's English word splitter.
Use the function word_tokenize from the nltk library. If you are running this on your own machine, you will first need to install NLTK with pip or conda. In Colab, NLTK is already installed. For instance, word_tokenize("Let's test!!") should give the result ["Let", "'s", "test", "!", "!"].
Building the vocabulary
Each nonempty line in the text files corresponds to one paragraph in Wikipedia. Apply the word splitter to all paragraphs in the training and validation datasets, and convert all words into lowercase.
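For instance, a paragraph could be split and lowercased along the following lines (a sketch; the helper name split_paragraph is just for illustration, and depending on your NLTK version the download step may ask for 'punkt_tab' instead of 'punkt'):
import nltk
from nltk import word_tokenize

nltk.download('punkt')   # tokenizer data used by word_tokenize; only needed once

def split_paragraph(paragraph):
    # Split the paragraph into word tokens and lowercase them.
    return [token.lower() for token in word_tokenize(paragraph)]

split_paragraph("Let's test!!")   # ["let", "'s", "test", "!", "!"]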
Create a function that goes through the training text and creates a vocabulary: a mapping from token strings to integers.
In addition, the vocabulary should contain 4 special symbols:
- a symbol for previously unseen or low-frequency tokens,
- a symbol we will put at the beginning of each paragraph,
- a symbol we will put at the end of each paragraph,
- a symbol we will use for padding so that we can make input tensors rectangular.
The total size of the vocabulary (including the 4 special symbols) should be at most max_voc_size, which is a user-specified hyperparameter. If the number of unique tokens in the text is greater than max_voc_size, then use the most frequent ones.
Hint: A Counter can be convenient when computing the frequencies.
Counter is like a regular Python dictionary, with some additional functionality for computing frequencies. For instance, you can go through each paragraph and call update. After building the Counter on your dataset, most_common gives the most frequent items.
Also create some utility that allows you to go back from the integer to the original word token. This will only be used in the final part of the assignment, where we look at model outputs and word embedding neighbors.
Example: you might end up with something like this:
str_to_int = { 'BEGINNING':0, 'END':1, 'UNKNOWN':2, 'PAD': 3, 'the':4, 'and':5, ... }
int_to_str = { 0:'BEGINNING', 1:'END', 2:'UNKNOWN', 3:'PAD', 4:'the', 5:'and', ... }
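For instance, the mappings could be built along the following lines (a sketch, assuming the split_paragraph helper from above; the special-symbol strings are up to you):
from collections import Counter

def build_vocabulary(paragraphs, max_voc_size):
    # Count token frequencies over the whole training set.
    counter = Counter()
    for paragraph in paragraphs:
        counter.update(split_paragraph(paragraph))
    # Reserve the first integers for the 4 special symbols.
    str_to_int = {'BEGINNING': 0, 'END': 1, 'UNKNOWN': 2, 'PAD': 3}
    # Add the most frequent tokens until we reach the maximal vocabulary size.
    for token, _freq in counter.most_common(max_voc_size - len(str_to_int)):
        str_to_int[token] = len(str_to_int)
    int_to_str = {i: s for s, i in str_to_int.items()}
    return str_to_int, int_to_str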
Sanity check: after creating the vocabulary, make sure that
- the size of your vocabulary is not greater than the max vocabulary size you specified,
- the 4 special symbols exist in the vocabulary and that they don’t coincide with any real words,
- some highly frequent example words (e.g. “the”, “and”) are included in the vocabulary but that some rare words (e.g. “cuboidal”, “epiglottis”) are not,
- if you take some test word, you can map it to an integer and then back to the original test word using the inverse mapping.
Implementing a HuggingFace-like Tokenizer
Now, we turn to the task of implementing the utility that will turn a text into a numerical format that can be provided to neural networks as an input. Our implementation will be functionally similar to the tokenizers provided by the HuggingFace library.
Write code for the missing parts in the A1Tokenizer class in the skeleton Python file. You will need to implement the three methods __init__, __call__, and __len__. Most of the work will be done in __call__: __init__ is simply where you pass the information you need to set up the tokenizer, and __len__ should just return the size of the vocabulary.
Hint: The weird-looking method __call__ is a special method that allows an object to be called like a function. That is, tokenizer(some_texts) and tokenizer.__call__(some_texts) are equivalent.
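To illustrate with a toy example (unrelated to the actual tokenizer):
class Doubler:
    def __init__(self, values):
        self.values = values
    def __call__(self, x):
        # Runs when we write doubler(x).
        return 2 * x
    def __len__(self):
        # Runs when we write len(doubler).
        return len(self.values)

doubler = Doubler([1, 2, 3])
print(doubler(5))      # 10
print(len(doubler))    # 3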
Sanity check: Apply your tokenizer to an input consisting of a few texts and make sure that it seems to work. In particular, verify that the tokenizer can create a tensor output in a situation where the input texts do not contain the same number of words: in these cases, the shorter texts should be “padded” on the right side. For instance:
tokenizer = (... create your tokenizer...)
test_texts = ['This is a test.', 'Another test.']
tokenizer(test_texts, return_tensors='pt', padding=True,
truncation=True)
The result should be something similar to the following example output (assuming that the integer 0 corresponds to the padding dummy token):
{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0]]),
'input_ids': tensor([[2, 35, 14, 11, 965, 6, 3],
[2, 153, 965, 6, 3, 0, 0]])}
Verify that at least the input_ids tensor corresponds to what you expect. (As mentioned in the skeleton code, the attention_mask is optional for this assignment.)
Part 2: Loading the text files and creating batches
Loading the texts. We will use the HuggingFace Datasets library to load the texts from the training and validation text files. (You may feel that we are overdoing it, since these are simple text files, but once again we want to introduce you to the standard ecosystem used in NLP.)
from datasets import load_dataset
dataset = load_dataset('text', data_files={'train': TRAIN_FILE, 'val': VAL_FILE})
The training and validation sections can now be accessed as dataset['train'] and dataset['val'], respectively. The datasets internally use the Arrow format for efficiency; in practice, they can be accessed as if they were regular Python lists. That is, you can write dataset['train'][8] to access the text at position 8 in the training set.
Each instance in the training and validation sets corresponds to one Wikipedia paragraph. Now, remove empty lines from the data:
dataset = dataset.filter(lambda x: x['text'].strip() != '')
Sanity check: after loading the datasets and removing empty lines, you should have around 147,000 training and 18,000 validation instances.
Optionally, it can be useful in the development phase to work with smaller datasets. The following is one way of achieving that:
from torch.utils.data import Subset
for sec in ['train', 'val']:
    dataset[sec] = Subset(dataset[sec], range(1000))
Iterating through the datasets. When training and running neural networks, we typically use batching: that is, to improve computational efficiency, we process several instances in parallel. We will use the DataLoader utility from PyTorch. Data loaders help users iterate through a dataset and create batches.
Hint: More information about DataLoader.
We use PyTorch's DataLoader to help us create batches. It can work on a variety of underlying data structures, but in this assignment, we'll just apply it to the datasets you prepared previously:
dl = DataLoader(your_dataset, batch_size=..., shuffle=...)
The arguments here are as follows:
- batch_size: the number of instances in each batch,
- shuffle: whether or not we rearrange the instances randomly. It is common to shuffle instances while training.
Once you have created the DataLoader, you can iterate through the dataset batch by batch:
for batch in dl:
    ... do something with each batch ...
Sanity check: create a DataLoader, look at the first batch, and confirm that it corresponds to your expectations.
for batch in dl:
print(batch)
break
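Note that when the DataLoader is built on the text datasets loaded above, each batch is a dictionary whose 'text' entry is a list of strings (one per instance). These strings can be passed to the tokenizer you built in Part 1, for instance along these lines (a sketch):
for batch in dl:
    encoded = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True)
    print(encoded['input_ids'].shape)   # (batch size, length of the longest sequence in the batch)
    break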
Part 3: Defining the language model neural network
Define a neural network that implements an RNN-based language model. Use the skeleton provided in the class A1RNNModel. It should include the following layers:
- an embedding layer that maps token integers to floating-point vectors,
- a recurrent layer implementing some RNN variant (we suggest nn.LSTM or nn.GRU; it is best to avoid the “basic” nn.RNN),
- an output layer that computes (the logits of) a probability distribution over the vocabulary.
Once again, we base our implementation on the HuggingFace Transformers library, to exemplify how models are defined when we use this library. Specifically, note that
- The model hyperparameters are stored in a configuration object A1RNNModelConfig that inherits from HuggingFace’s PretrainedConfig;
- The neural network class inherits from HuggingFace’s PreTrainedModel rather than PyTorch’s nn.Module.
When you set up your model, you should use the hyperparameters stored in the A1RNNModelConfig.
Hint: If you are doing the batching as recommended above, you should set batch_first=True when declaring the RNN.
If you set batch_first=True, then we assume that the input tensor is arranged as (B, N, E) where B is the batch size, N is the sequence length, and E the embedding dimensionality. In this case, the RNN "walks" along the second dimension: that is, over the sequence of tokens. If on the other hand you set batch_first=False, then the RNN walks along the first dimension of the input tensor, which is then assumed to be arranged as (N, B, E).
Hint: How to apply RNNs in PyTorch.
Take a look at the documentation of one of the RNN types in PyTorch. For instance, here is the documentation of nn.LSTM. In particular, look at the section called Outputs. It is important to note that all types of RNNs return two outputs when you call them in the forward pass. In this assignment, you will need the first of these outputs, which corresponds to the RNN's output at each token position. (The second output contains the final hidden states for each layer, which we do not need here.)
# Simplified illustration using plain nn.Module; the actual skeleton uses the
# HuggingFace base classes, and the hyperparameter names below are just illustrative.
class MyRNNBasedLanguageModel(nn.Module):
    def __init__(self, voc_size, emb_dim, hidden_dim):
        super().__init__()
        # Initialize the model components: embedding, RNN, and output layer.
        self.embedding = nn.Embedding(voc_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, voc_size)

    def forward(self, batch):
        embedded = self.embedding(batch)     # (B, N, E)
        rnn_out, _ = self.rnn(embedded)      # (B, N, hidden size)
        return self.output_layer(rnn_out)    # (B, N, V): next-token logits
Sanity check: carry out the following steps:
- Create an integer tensor of shape 1xN where N is the length of the sequence. It doesn’t matter what the integers are except that they should be less than the vocabulary size. (Alternatively, take one instance from your training set.)
- Apply the model to this input tensor. It shouldn’t crash here.
- Make sure that the shape of the returned output tensor is 1xNxV where V is the size of the vocabulary. This output corresponds to the logits of the next-token probability distribution, but it is useless at this point because we haven’t yet trained the model.
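For example, assuming model is an instance of your language model and voc_size is the size of your vocabulary, the check could look like this:
import torch

dummy_input = torch.randint(0, voc_size, (1, 12))   # batch of 1, sequence length 12
logits = model(dummy_input)
print(logits.shape)                                  # expected: torch.Size([1, 12, voc_size])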
Part 4: Training the model
We will now put all the pieces together and implement the code to train the language model.
Similarly to Part 1, we will mimic the functionality of the HuggingFace Transformers library. The Trainer is the main utility the Transformers library provides to handle model training; it offers a variety of complex functionality including multi-GPU training and many other bells and whistles. In our case, we will just implement a basic training loop.
Starting from the skeleton Python code, your task now is to complete the missing parts in the method train in the class A1Trainer.
The missing parts you need to provide are:
- Setting up the optimizer, which is the PyTorch utility that updates model parameters during the training loop. The optimizer typically implements some variant of stochastic gradient descent. We recommend AdamW, which is used to train most LLMs.
- Setting up the DataLoaders for the training and validation sets. The datasets are provided as inputs, and you can simply create the DataLoaders as in Part 2.
- The training loop itself, which is where most of your work will be done.
Hyperparameters that control the training should be stored in a TrainingArguments object. HuggingFace defines a large number of such hyperparameters but you only need to consider a few of them. The skeleton code includes a hint that lists the relevant hyperparameters.
The training loop should look more or less like a regular PyTorch training loop (see the hint in the code). There are a few non-trivial things to keep in mind when training an autoregressive language model (as opposed to training e.g. classifiers or regression models). We will discuss these points in the following three hints:
Hint: the target tensor is the input tensor, shifted by one step.
At each position, the language model should predict the token that appears at the next position in the sequence. Given a batch of token IDs input_ids, you can therefore construct the inputs and targets as follows:
input_tokens = input_ids[:, :-1]
output_tokens = input_ids[:, 1:]
Hint: how to apply the loss function when training a language model.
The loss function (CrossEntropyLoss) expects two input tensors:
- the logits (that is: the unnormalized log probabilities) of the predictions,
- the targets, that is the true output values we want the model to predict.
Since the logits here are 3-dimensional (batch, position, vocabulary) while the targets are 2-dimensional (batch, position), it is easiest to flatten the batch and position dimensions before applying the loss:
targets = targets.reshape(-1)                  # 2-dimensional -> 1-dimensional
logits = logits.reshape(-1, logits.shape[-1])  # 3-dimensional -> 2-dimensional
Hint: take padding into account when defining the loss.
CrossEntropyLoss has a parameter ignore_index that you can set to the integer you use to represent the padding tokens.
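Putting these hints together, the core of the training loop could look roughly like the sketch below. It assumes that tokenizer, model and optimizer have been set up as described above, that train_dl is the DataLoader over the training set, and that pad_id is the integer you use for the padding symbol; device placement and loss reporting are left out for brevity.
from torch import nn

loss_func = nn.CrossEntropyLoss(ignore_index=pad_id)

for batch in train_dl:
    # Turn the raw texts in the batch into a padded integer tensor.
    encoded = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True)
    input_ids = encoded['input_ids']

    # Inputs and targets: at each position, the target is the next token.
    input_tokens = input_ids[:, :-1]
    output_tokens = input_ids[:, 1:]

    # Forward pass: next-token logits and cross-entropy loss (padding positions are ignored).
    logits = model(input_tokens)
    loss = loss_func(logits.reshape(-1, logits.shape[-1]), output_tokens.reshape(-1))

    # Backward pass and parameter update.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()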
While developing the code, we advise you to work with very small datasets until you know it doesn’t crash, and then use the full training set. Monitor the cross-entropy loss (and/or the perplexity) over the training: if the loss does not decrease while you are training, there is probably an error. For instance, if the learning rate is set to a value that is too large, the loss values may be unstable or increase.
Part 5: Evaluation and analysis
Predicting the next word
Take some example context window and use the model to predict the next word.
- Apply the model to the integer-encoded context window. As usual, this gives you (the logits of) a probability distribution over your vocabulary.
- Use argmax to find the index of the highest-scoring item, or topk to find the indices and scores of the k highest-scoring items.
- Apply the inverse vocabulary encoder (that you created in Part 1) so that you can understand what words the model thinks are the most likely in this context.
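For example, the prediction step could be sketched as follows, assuming tokenizer, model and the integer-to-string mapping int_to_str from the earlier parts. (Depending on whether your tokenizer appends the end-of-paragraph symbol, you may want to look at a different position than the last one.)
context = 'the capital city of sweden is'
input_ids = tokenizer([context], return_tensors='pt', padding=True)['input_ids']
logits = model(input_ids)          # shape (1, N, V)
top = logits[0, -1].topk(5)        # the 5 highest-scoring candidates for the next token
print([int_to_str[ix.item()] for ix in top.indices])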
Quantitative evaluation
The most common way to evaluate language models quantitatively is the perplexity score on a test dataset. The better the model is at predicting the actually occurring words, the lower the perplexity. This quantity is formally defined as follows:
\[\text{perplexity} = 2^{-\frac{1}{m}\sum_{i=1}^m \log_2 P(w_i \mid c_i)}\]
In this formula, m is the number of words in the dataset, P is the probability assigned by our model, and w_i and c_i are the word and context window at position i.
Compute the perplexity of your model on the validation set. The exact value will depend on various implementation choices you have made, how much of the training data you have been able to use, etc. Roughly speaking, if you get perplexity scores around 700 or more, there are probably problems. Carefully implemented and well-trained models will probably have perplexity scores in the range of 200–300.
Hint: An easy way to compute the perplexity in PyTorch.
Apply exp to the mean of the cross-entropy loss over your batches in the validation set.
If you have time for exploration, investigate the effect of the context window size N (and possibly other hyperparameters such as embedding dimensionality) on the model’s perplexity.
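Concretely, this could look like the following sketch, assuming the same names as in the training loop sketch above and a DataLoader val_dl over the validation set. (Averaging the per-batch losses is only exact if every batch contains the same number of non-padding tokens; for this assignment the approximation is good enough.)
import torch

model.eval()
batch_losses = []
with torch.no_grad():
    for batch in val_dl:
        encoded = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True)
        input_ids = encoded['input_ids']
        logits = model(input_ids[:, :-1])
        targets = input_ids[:, 1:]
        batch_losses.append(loss_func(logits.reshape(-1, logits.shape[-1]), targets.reshape(-1)))
perplexity = torch.exp(torch.stack(batch_losses).mean())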
Inspecting the word embeddings
It is common to say that neural networks are “black boxes” and that we cannot fully understand their internal mechanics, especially as they grow larger and structurally more complex. The research area of model interpretability aims to develop methods to help us reason about the high-level functions the models implement.
In this assignment, we will briefly investigate the embeddings that your model learned while you trained it. If we have successfully trained a word embedding model, an embedding vector stores a crude representation of “word meaning”, so we can reason about the learned meaning representations by investigating the geometry of the vector space of word embeddings. The most common way to do this is to look at nearest neighbors in the vector space: intuitively, if we look at some example word, its neighbors should correspond to words that have a similar meaning.
Select some example words (e.g. "sweden") and look at their nearest neighbors in the vector space of word embeddings. Does it seem that the nearest neighbors make sense?
Hint: Example code for computing nearest neighbors.
In this code, emb is the nn.Embedding module of your language model, while voc and inv_voc are the string-to-integer and integer-to-string mappings you created in Part 1.
def nearest_neighbors(emb, voc, inv_voc, word, n_neighbors=5):
    # Look up the embedding for the test word.
    test_emb = emb.weight[voc[word]]
    # We'll use a cosine similarity function to find the most similar words.
    sim_func = nn.CosineSimilarity(dim=1)
    cosine_scores = sim_func(test_emb, emb.weight)
    # Find the positions of the highest cosine values.
    near_nbr = cosine_scores.topk(n_neighbors+1)
    topk_cos = near_nbr.values[1:]
    topk_indices = near_nbr.indices[1:]
    # NB: the first word in the top-k list is the query word itself!
    # That's why we skip the first position in the code above.
    # Finally, map word indices back to strings, and put the result in a list.
    return [ (inv_voc[ix.item()], cos.item()) for ix, cos in zip(topk_indices, topk_cos) ]
Optionally, you may visualize some word embeddings in a two-dimensional plot.
Hint: Example code for PCA-based embedding scatterplot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

def plot_embeddings_pca(emb, voc, words):
    # Stack the embedding vectors of the selected words into a matrix.
    vectors = np.vstack([emb.weight[voc[w]].cpu().detach().numpy() for w in words])
    # Center the vectors and project them onto the two main directions of variation.
    vectors -= vectors.mean(axis=0)
    twodim = TruncatedSVD(n_components=2).fit_transform(vectors)
    plt.figure(figsize=(5,5))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x, y) in zip(words, twodim):
        plt.text(x+0.02, y, word)
    plt.axis('off')

plot_embeddings_pca(emb, voc, ['sweden', 'denmark', 'europe', 'africa', 'london', 'stockholm', 'large', 'small', 'great', 'black', '3', '7', '10', 'seven', 'three', 'ten', '1984', '2005', '2010'])