DAT450/DIT247: Programming Assignment 4: Comparing fine-tuning methods

In this assignment, you will fine-tune a pre-trained Transformer model for a classification task: sentiment analysis of movie reviews.

Pedagogical purposes of this assignment

  • You will see how accuracy and computational efficiency are affected by different fine-tuning methods.
  • You will learn about the LoRA method for parameter-efficient fine-tuning.
  • You will get some practical experience of working with HuggingFace libraries, which provide useful utilities for preprocessing and training.

Requirements

Please submit your solution in Canvas. Submission deadline: December 6.

Submit a notebook containing your solution to the programming tasks described below. This is a pure programming assignment and you do not have to write a technical report or explain details of your solution in the notebook: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.

Acknowledgement

This assignment is a lightly modified version of a similar assignment by Marco Kuhlmann.

Step 0: Preliminaries

Libraries

In this assignment, we will rely on a set of libraries from the HuggingFace community:

  • Transformers: pre-trained models and training utilities;
  • Datasets: utilities for loading and processing datasets;
  • Evaluate: utilities for computing evaluation metrics.

Make sure all libraries are installed in your environment. If you use Colab, you will need to install Datasets and Evaluate, while Transformers is included in the pre-installed environment.

Getting the files

The data we use in this assignment is a subset of the Large Movie Review Dataset. The full dataset consists of 50,000 highly polar movie reviews collected from the Internet Movie Database (IMDB). We use a random sample consisting of 2,000 reviews for training and 500 reviews for evaluation.

Download this zip file, which contains the training and evaluation CSV files.

Step 1: Full fine-tuning

In this assignment, we will use a compressed version of BERT called DistilBERT. We’ll use the uncased version: that is, the tokenizer will not distinguish uppercase and lowercase.

In the HuggingFace utilities that require you to specify a model name, you should use distilbert-base-uncased.

Preprocessing

Create a Dataset by loading the training and evaluation CSV files you previously downloaded.

Hint: Creating a Dataset.
from datasets import load_dataset
imdb_dataset = load_dataset('csv', data_files = {'train': 'path/to/train.csv', 'eval': 'path/to/eval.csv'})

Load the pre-trained tokenizer using AutoTokenizer and apply it to the Dataset.

Hint: Applying a tokenizer to a Dataset.
It's easiest if we create a helper function that applies the tokenizer to the right Dataset columns and with the right parameters.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
def tokenize_helper(batch):
    return tokenizer(batch['review'], padding=True, truncation=True)
tokenized_imdb_dataset = imdb_dataset.map(tokenize_helper, batched=True)
This step will create new Dataset columns `input_ids` and `attention_mask`.

Note: you may receive some warnings caused by parallelism in the tokenizer. To get rid of the warnings, you can use the following workaround.

import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

Creating your classification model for fine-tuning

Use the HuggingFace utility AutoModelForSequenceClassification to set up a model that you can fine-tune. Use the from_pretrained method with the model name set as above, and num_labels=2 (because this is a two-class classification task). This method carries out the following steps:

  • It loads the pre-trained DistilBERT model from the HuggingFace repository (or from a cached file, if you have used the model before).
  • It sets up untrained layers to map from the DistilBERT output to the two class labels. They will be trained during the fine-tuning process below.
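
For instance, setting up the model could look like this:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2)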

Sanity check: Print the model in a notebook cell. You should see a visual representation of the layers the model consists of: the DistilBERT model including embedding layers and Transformer layers. At the bottom of the list of layers, you should see two layers called pre_classifier and classifier, which are the newly created classification layers.

Counting the number of trainable parameters

Define a function count_trainable_parameters that computes the number of floating-point values that a given model will update during training.

  • The methods .parameters() and .named_parameters() return the model's parameter tensors; the latter returns them as (name, tensor) pairs.
  • When counting the trainable parameters, you should only include those tensors where requires_grad is True. That is: we want to exclude tensors containing parameters we will not update during training.
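
One possible implementation, sketched here using .numel() to count the elements of each parameter tensor:

def count_trainable_parameters(model):
    # Only count tensors that will receive gradient updates during training.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)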

Sanity check: The number of trainable parameters for the model above should be 66955010.

Preparing for training

The class TrainingArguments defines some parameters controlling the training process. We’ll mostly use default values here. You only need to set the following parameters:

  • output_dir: the name of some directory where the Trainer will keep its files.
  • num_train_epochs: the number of training epochs.
  • eval_strategy: set this to 'epoch' to see evaluation scores after each epoch.
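
For example (a sketch; the output directory and number of epochs are placeholders for your own choices):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='finetuning-output',  # placeholder: any directory name works
    num_train_epochs=3,              # placeholder: choose your own value
    eval_strategy='epoch',
)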

In addition, we need to define a helper function that will be used for evaluation after each epoch. We use a utility from the Evaluate library for this:

import evaluate

accuracy_scorer = evaluate.load('accuracy')

def evaluation_helper(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy_scorer.compute(predictions=predictions, references=labels)

Training the model

Import Trainer from the transformers library. Create a Trainer using the following arguments:

  • model: the model that you are fine-tuning;
  • args: the training arguments you defined above;
  • train_dataset: the train section of your tokenized Dataset;
  • eval_dataset: the eval section of your tokenized Dataset;
  • compute_metrics: the evaluation helper function you defined above.

Run the fine-tuning process by calling train() on your Trainer. This will train for the specified number of epochs, computing loss and accuracy after each epoch.
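
Putting the pieces together, the training setup might look like this sketch (the variable names refer to the objects defined earlier):

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb_dataset['train'],
    eval_dataset=tokenized_imdb_dataset['eval'],
    compute_metrics=evaluation_helper,
)
trainer.train()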

After training, you may call save_model on the Trainer to save the model’s parameters. In this way, you can reload it later without having to retrain it.
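
For instance (the directory name is just an example):

trainer.save_model('finetuned-distilbert')
# Later, reload the saved model without retraining:
model = AutoModelForSequenceClassification.from_pretrained('finetuned-distilbert')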

Hint: Avoiding accidental model reuse.
It is probably a good idea to re-create the model (using AutoModelForSequenceClassification.from_pretrained) before each time you train it. Otherwise, you may accidentally train a model that has already been trained.

Step 2: Tuning the final layers only

Even with a minimal model such as DistilBERT, fine-tuning the full model is rather time-consuming. We will now consider fine-tuning approaches where we only work with a subset of the model’s parameters.

Set up the model once again. Disable gradient computation for all parameter tensors except those that are trained from scratch. That is: the two layers in the classification head will be updated during training, while the DistilBERT model will be kept fixed.

Sanity check: The number of trainable parameters for this model should be 592130.

Hint: Avoiding accidental model reuse, again!
Once again, we recommend that you re-create the model using AutoModelForSequenceClassification.from_pretrained before this step, so that you don't accidentally work with the model that you fine-tuned in Step 1.
Hint: How to disable gradient computation for a parameter tensor.

For a parameter tensor in a model, we can set the attribute requires_grad to False, which means that during backpropagation, gradients will not be computed with respect to these parameters. So the training process will not change these parameters.

To find the parameter tensors to switch off, you can either 1) go into the distilbert component and iterate through its parameters, or 2) go through all the model's named parameters and switch off all parameter tensors except those in classifier and pre_classifier.
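
For instance, the second option could be sketched as follows (relying on the parameter names you saw when printing the model):

for name, param in model.named_parameters():
    # Keep the classification head trainable; freeze everything else.
    if 'classifier' not in name:  # matches both classifier and pre_classifier
        param.requires_grad = False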

Train this model and compare the training speed and classification accuracy to the results from Step 1.

Step 3: Fine-tuning with LoRA

Utilities for modifying models

Define a function extract_qv_layers that extracts the query and value linear layers from all Transformer blocks in a DistilBERT model. Return a dictionary that maps the component name to the corresponding linear layer.

Hint: How to access the query and value linear layers.

As we saw earlier, the DistilBERT model consists of a hierarchy of nested submodules. Each of these can be addressed by a fully-qualified string name.

You can use get_submodule() to retrieve a layer by a string name. For instance, 'distilbert.transformer.layer.0.attention.q_lin' refers to the Q part of Transformer layer 0.

It's OK to hard-code this part, so that you just enumerate the Q and V parts of all layers here.
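
For instance, the hard-coded version could look like this sketch (it assumes the six Transformer blocks of distilbert-base-uncased and the naming scheme shown above):

def extract_qv_layers(model):
    layers = {}
    for i in range(6):  # distilbert-base-uncased has 6 Transformer blocks
        for part in ['q_lin', 'v_lin']:
            name = f'distilbert.transformer.layer.{i}.attention.{part}'
            layers[name] = model.get_submodule(name)
    return layers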

Sanity check: If you apply this on a DistilBERT model, the result should contain 12 named linear layers.

We also need a convenience function that puts layers back into a model. The following function does the trick. The named_layers argument uses the same format as returned by extract_qv_layers.

def replace_layers(model, named_layers):
    for name, layer in named_layers.items():
        # Walk down the module hierarchy to the parent of the target layer...
        components = name.split('.')
        submodule = model
        for component in components[:-1]:
            submodule = getattr(submodule, component)
        # ...and replace the leaf attribute with the new layer.
        setattr(submodule, components[-1], layer)

Implementing the LoRA layer

To implement the LoRA approach, we define a new type of layer that will be used as a drop-in replacement for a regular linear layer.

In the paper by Hu et al. (2021), the structure is presented visually in Figure 1, and equation (3) shows the same idea.

Start from the following skeleton and fill in the missing pieces:

import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, W, r, alpha):
        super().__init__()
        # TODO: Add your code here

    def forward(self, x):
        # TODO: Replace the next line with your own code
        raise NotImplementedError

Here, W is the linear layer we are fine-tuning, while r and alpha are hyperparameters described in Section 4.1 of the paper. The r parameter (the rank) controls the parameter efficiency: by setting it to a low value, we save memory but make a rougher approximation. The alpha parameter is a scaling factor.

Hint: How to initialize A and B.

To follow the description closely, we should use the parameter initialization approach recommended in the paper (see Figure 1).

You can use nn.init.normal_ and nn.init.zeros_ here.
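
For reference, one possible way of filling in the skeleton could look like the following sketch. It assumes that W is an nn.Linear layer (as extracted above) and follows the initialization from Figure 1 of the paper:

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, W, r, alpha):
        super().__init__()
        self.W = W
        # Freeze the pre-trained weights: only A and B will be trained.
        for param in self.W.parameters():
            param.requires_grad = False
        self.scaling = alpha / r
        # Low-rank factors: the product B A has the same shape as W's weight matrix.
        self.A = nn.Parameter(torch.empty(r, W.in_features))
        self.B = nn.Parameter(torch.empty(W.out_features, r))
        nn.init.normal_(self.A)  # A: Gaussian initialization, as in the paper
        nn.init.zeros_(self.B)   # B: zeros, so that B A x = 0 before training

    def forward(self, x):
        # h = W x + (alpha / r) B A x, cf. equation (3) in Hu et al. (2021)
        return self.W(x) + self.scaling * (x @ self.A.t() @ self.B.t())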

Fine-tuning with LoRA

Set up a model where you replace the query and value linear layers with LoRA layers. Use the following steps:

  • First use extract_qv_layers to get the relevant linear layers.
  • Each of the linear layers in the returned dictionary should be wrapped inside a LoRA layer.
  • Then use replace_layers to put them back into the model.
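
In code, these steps could look like the following sketch (the values of r and alpha are arbitrary examples, not recommendations):

qv_layers = extract_qv_layers(model)
lora_layers = {name: LoRALayer(layer, r=8, alpha=16)
               for name, layer in qv_layers.items()}
replace_layers(model, lora_layers)

Note that the LoRALayer sketch above freezes the wrapped layer's original weights; whether you also freeze the remaining parts of the model will affect the count reported by count_trainable_parameters.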

Sanity check: Use your function count_trainable_parameters. The number of trainable parameters should be less than in Step 1 but more than in Step 2. The exact number will depend on the rank.

Train this model and compare the training speed and classification accuracy to the results from Steps 1 and 2.