DAT450/DIT247: Programming Assignment 4: Supervised Fine-Tuning (SFT) with LoRA

In this assignment, you will perform supervised fine-tuning (SFT) of a small open LLM (preferably OLMo-2 1B) on Alpaca, a dataset of 52k instructions generated by OpenAI’s text-davinci-003 engine. You will convert this dataset into instruction-response pairs, fine-tune a causal language model using LoRA (Low-Rank Adaptation), and evaluate it through prompted inference and comparison with other methods.

Pedagogical purposes of this assignment

  • You will see how evaluation metrics and computational efficiency are affected by different fine-tuning methods.
  • You will learn and apply LoRA for parameter-efficient tuning of causal LMs.
  • You will learn what instruction tuning is and how a model is adapted to it during training.
  • You will get additional practical experience working with HuggingFace libraries, which provide useful utilities for preprocessing and training.

Requirements

Please submit your solution in Canvas. Submission deadline: December 1.

Submit Python files containing your solution to the programming tasks described below. In addition, to save time for the people who grade your submission, please submit a text file containing the outputs printed by your Python program; read the instructions carefully so that the right outputs are included. The most important outputs are already printed by the provided code.

This is a pure programming assignment and you do not have to write a technical report or explain details of your solution: there will be a separate individual assignment where you will answer some conceptual questions about what you have been doing here.

Acknowledgement

This assignment is adapted from a previous version by Marco Kuhlmann and updated to SFT with LoRA for a small LLM.

Step 0: Preliminaries

Libraries

As in the previous assignments, you can use the pre-set environment by running: source /data/courses/2025_dat450_dit247/venvs/dat450_venv/bin/activate.

Alternatively, if you are working on your own machine or some cloud-based service, install the libraries used in this assignment (at least torch, transformers, datasets, and evaluate) with a package manager such as pip or uv.

Getting the files

The Alpaca dataset is a collection of 52k instruction-response pairs (in JSON format) designed for SFT of LLMs for instruction following. You can load it using the HF datasets library as follows:

from datasets import load_dataset

alpaca_dataset = load_dataset("/data/courses/2025_dat450_dit247/datasets/alpaca-cleaned")
Hint: Output...
DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 51760
    })
})

Since fine-tuning on all 52k data points takes a long time, for simplicity in this course we only consider 2K + 200 samples for training and testing. While developing, you can of course use an even smaller amount to make sure your implementation works properly. For the final submission, use the full 2K + 200 samples and the same seed as the code scaffold, SEED=101, so that the TAs can reproduce and evaluate your outputs.

To get a clear idea of how to complete the assignment, you can start with the skeleton code available here: /data/courses/2025_dat450_dit247/assignments/a4. It looks like this:

.
├── data_utils.py
├── inference.sh
├── lora.py
├── main.py
├── predict.py
├── run.sh
└── utils.py

In short, you need to fill in the incomplete parts of data_utils.py and lora.py; the other files contain helper functions for running the assignment. It is highly recommended to read the documented code to understand the structure of the project. To check that your code works correctly, follow these instructions and run the code using the pre-built environment:

python3 main.py

Step 1: Preprocessing

Create a Dataset by loading the Alpaca training set, which has already been downloaded for you.

from datasets import load_dataset

alpaca_dataset = load_dataset("/data/courses/2025_dat450_dit247/datasets/alpaca-cleaned")
Hint: Output...
DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 51760
    })
})

Then we need to create our training and test sets from that dataset. To do this, we use two methods provided by HF datasets, select and train_test_split, to pick the training and test subsets. Although we provide a cleaned version of Alpaca, we still filter out all rows where the output is empty, so that every example has a reference output. To keep the mix of instruction types similar across the splits, we also stratify the rows based on the presence or absence of an input.

from datasets import load_dataset
from utils import create_stratification_label

alpaca_dataset = load_dataset("/data/courses/2025_dat450_dit247/datasets/alpaca-cleaned")

alpaca_dataset["train"] = alpaca_dataset["train"].filter(
    lambda x: x["output"] is not None and x["output"].strip() != ""
)

ds = alpaca_dataset["train"].map(
    lambda x: create_stratification_label(x, columns_to_check=["input"])
)
# Turn strat_label into a ClassLabel so we can stratify
ds = ds.class_encode_column("strat_label")

ds = (
    ds.shuffle(seed=SEED)
    .select(range(MAX_TRAIN_SAMPLES + MAX_TEST_SAMPLES))
    .train_test_split(
        train_size=MAX_TRAIN_SAMPLES,
        test_size=MAX_TEST_SAMPLES,
        stratify_by_column="strat_label",
        seed=SEED,
    )
)
Hint: Output...
ALPACA DATASET:
DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction'],
        num_rows: 51760
    })
})

ALPACA + SUBSAMPLE:
DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction', 'strat_label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['output', 'input', 'instruction', 'strat_label'],
        num_rows: 400
    })
})
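The function create_stratification_label is provided in utils.py; conceptually it just adds a binary column recording whether the checked columns are non-empty, roughly like the following sketch (the provided implementation may differ in its details):

def create_stratification_label(example, columns_to_check):
    """Sketch: strat_label is 1 if all checked columns are non-empty, else 0."""
    non_empty = all(
        example[col] is not None and str(example[col]).strip() != ""
        for col in columns_to_check
    )
    return {"strat_label": 1 if non_empty else 0}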

Finally, we construct the LLM inputs (prompts) and the outputs that will serve as labels.

from data_utils import build_prompt

PROMPT_NO_INPUT = """
Below is an instruction that describes a task. 
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
""".strip()

PROMPT_WITH_INPUT = """
Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
""".strip()

ds_sft = ds.map(lambda x: build_prompt(x, PROMPT_NO_INPUT, PROMPT_WITH_INPUT))
Hint: Output...
Sample with prompt:
{
  "output": "D) \"You're great.\"",
  "input": "What do you think of me?\n\nA) \"You're annoying.\"\nB) \"I don't know you.\"\nC) \"You're cool.\"\nD) \"You're great.\"",
  "instruction": "Select the most optimal response.",
  "strat_label": 1,
  "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request.\n\n### Instruction:\nSelect the most optimal response.\n\n### Input:\nWhat do you think of me?\n\nA) \"You're annoying.\"\nB) \"I don't know you.\"\nC) \"You're cool.\"\nD) \"You're great.\"\n\n### Response:",
  "answer": "D) \"You're great.\""
}
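The function build_prompt is one of the pieces you complete in data_utils.py. A minimal sketch of what it could look like, assuming it only needs to choose a template based on whether the input field is empty and to keep the reference answer (the scaffold may expect slightly different details):

def build_prompt(example, prompt_no_input, prompt_with_input):
    """Sketch: format the prompt and keep the reference answer for later evaluation."""
    if example["input"] is not None and example["input"].strip():
        prompt = prompt_with_input.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        prompt = prompt_no_input.format(instruction=example["instruction"])
    return {"prompt": prompt, "answer": example["output"]}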

Pre-trained LLMs are simply autoregressive models (next-token predictors); they learn patterns in text, not how to follow instructions. SFT therefore enhances LLMs by teaching them to answer tasks directly, structure their outputs, respond helpfully, and so on. In real commercial systems such as ChatGPT and Claude, instruction tuning followed by reinforcement learning is a crucial step that makes the models practically useful. As mentioned before, Alpaca serves as a starting point to help our small LLM (OLMo) acquire similar capabilities; the templates above, chosen based on the presence or absence of an input field, define how each Alpaca example is presented to the model.

from transformers import AutoTokenizer
from datasets import DatasetDict
from data_utils import tokenize_helper, create_data_collator  # assumed to live in the scaffold's data_utils.py

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token or tokenizer.unk_token

tokenized_ds_sft = DatasetDict(
    {
        "train": ds_sft["train"].map(lambda x: tokenize_helper(x, tokenizer, MAX_LENGTH)),
        "test": ds_sft["test"].map(lambda x: tokenize_helper(x, tokenizer, MAX_LENGTH)),
    }
)
data_collator = create_data_collator(tokenizer)
Hint: Output...
TOKENIZED DATASET:
DatasetDict({
    train: Dataset({
        features: ['output', 'input', 'instruction', 'strat_label', 'prompt', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['output', 'input', 'instruction', 'strat_label', 'prompt', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 400
    })
})
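The function tokenize_helper is also part of what you fill in. A common recipe for SFT is to tokenize the prompt and the answer together and to mask the prompt positions in labels with -100, so that the loss is only computed on the response tokens. A sketch under these assumptions (the scaffold's exact conventions may differ):

def tokenize_helper(example, tokenizer, max_length):
    """Sketch: tokenize prompt + answer; mask prompt tokens in the labels with -100."""
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(
        example["answer"] + tokenizer.eos_token, add_special_tokens=False
    )["input_ids"]
    input_ids = (prompt_ids + answer_ids)[:max_length]
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }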

Step 2: Baseline zero-shot and prompt format

Set MODEL_NAME_OR_PATH (default suggested: /data/courses/2025_dat450_dit247/models/OLMo-2-0425-1B). Load the tokenizer and model in causal LM form.

Then we will see how OLMo, a plain pre-trained LLM with no notion of how instructions work, responds to instruction prompts, and we will compute the ROUGE-L metric for this and all subsequent models. Since Alpaca provides an output for each pair of instruction and input, we can use that output as the reference when computing ROUGE-L, which measures whether the model produces the same key ideas and structure as the reference, even if the exact wording differs.

import os
import time

import torch
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from utils import make_trainer, RougeMetricComputer


compute_metrics = RougeMetricComputer(tokenizer)

pretrained_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME_OR_PATH).to(DEVICE)

pretrained_eval_args = TrainingArguments(
    output_dir=os.path.join(OUTPUT_DIR, "pretrained"),
    eval_strategy="no",
    per_device_eval_batch_size=1,
    fp16=torch.cuda.is_available(),
    report_to="none",
    batch_eval_metrics=True,
    eval_accumulation_steps=1,
)

pretrained_trainer = make_trainer(
    pretrained_model,
    pretrained_eval_args,
    tokenized_ds_sft,
    compute_metrics,
    data_collator,
)
t0 = time.perf_counter()
pretrained_eval_metrics = pretrained_trainer.evaluate()
pretrained_eval_time = time.perf_counter() - t0

pretrained_eval_loss = float(pretrained_eval_metrics["eval_loss"])
pretrained_rougeL = pretrained_eval_metrics.get("eval_rougeL", None)
Hint: Output...
PRETRAINED EVAL METRICS:
{
  "eval_loss": 1.4962108135223389,
  "eval_model_preparation_time": 0.0023,
  "eval_rougeL": 0.6029353987530564,
  "eval_runtime": 36.4135,
  "eval_samples_per_second": 10.985,
  "eval_steps_per_second": 10.985
}
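The Trainer-based evaluation above only gives aggregate numbers. To also look at individual zero-shot answers, you can run prompted generation directly; a small sketch, assuming tokenizer, pretrained_model, ds_sft, and DEVICE are defined as above (the generation settings here are only examples):

sample = ds_sft["test"][0]
inputs = tokenizer(sample["prompt"], return_tensors="pt").to(DEVICE)
with torch.no_grad():
    generated = pretrained_model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
# Decode only the newly generated tokens (strip the prompt).
response = tokenizer.decode(
    generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print("PROMPT:\n", sample["prompt"])
print("MODEL RESPONSE:\n", response)
print("REFERENCE:\n", sample["answer"])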

Counting the number of trainable parameters

Define a function num_trainable_parameters that computes the number of floating-point numbers that a given model will update during training.

def num_trainable_parameters(model):
    """Count number of trainable parameters (requires_grad=True)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

  • The methods .parameters() and .named_parameters() iterate over the model's parameter tensors (.named_parameters() also yields their names).
  • When counting the trainable parameters, you should only include tensors where requires_grad is True; that is, we exclude parameters that will not be updated during training.

Sanity check: The number of trainable parameters for the model above should be 1484916736.

Preparing for training

The class TrainingArguments defines some parameters controlling the training process. We’ll mostly use default values here. You only need to set the following parameters:

  • output_dir: the name of a directory where the Trainer will keep its files.
  • num_train_epochs: the number of training epochs.
  • eval_strategy: set this to "epoch" to see evaluation scores after each epoch.

from transformers import TrainingArguments


training_args = TrainingArguments(
    output_dir=os.path.join(OUTPUT_DIR, "trainer_sft_baseline"),
    eval_strategy="epoch",
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    fp16=torch.cuda.is_available(),
    report_to="none",
    batch_eval_metrics=True,
    eval_accumulation_steps=1,
)

In addition, we need a helper that computes the evaluation metric after each epoch. The scaffold provides RougeMetricComputer in utils.py, which builds on the Evaluate library:

from utils import RougeMetricComputer

compute_metrics = RougeMetricComputer(tokenizer)
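The details of RougeMetricComputer are in utils.py. If you want to see what the underlying metric does, the Evaluate library can also be used directly; a tiny standalone example (illustrative only):

import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(scores["rougeL"])  # ROUGE-L: F-measure based on the longest common subsequence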

Training the model

Import Trainer from the transformers library. Create a Trainer using the following arguments:

  • model: the model that you are fine-tuning;
  • args: the training arguments you defined above;
  • train_dataset: the train section of your tokenized Dataset;
  • eval_dataset: the test section of your tokenized Dataset;
  • compute_metrics: the evaluation helper function you defined above.

Run the fine-tuning process by calling train() on your Trainer. This will train for the specified number of epochs, computing the loss and ROUGE-L after each epoch.

After training, you may call save_model on the Trainer to save the model’s parameters. In this way, you can reload it later without having to retrain it.

from transformers import Trainer


def make_trainer(
    model, training_args, tokenized_ds_sft, compute_metrics, data_collator
):
    """Create a Trainer for SFT on tokenized_ds_sft."""

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_ds_sft["train"],
        eval_dataset=tokenized_ds_sft["test"],
        compute_metrics=compute_metrics,
        data_collator=data_collator,
    )
    return trainer

Step 3: Full fine-tuning (SFT dataset)

Next, we train the pre-trained model using SFT (over all the parameters), then calculate the metrics and outputs to evaluate how well it follows instructions.

from utils import num_trainable_parameters

baseline_training_args = TrainingArguments(
    output_dir=os.path.join(OUTPUT_DIR, "trainer_sft_baseline"),
    eval_strategy="epoch",
    logging_steps=500,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    fp16=torch.cuda.is_available(),
    report_to="none",
    batch_eval_metrics=True,
    eval_accumulation_steps=1,
)

base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME_OR_PATH).to(DEVICE)
print(f"Full SFT trainable params: {num_trainable_parameters(base_model)}")

baseline_trainer = make_trainer(base_model, baseline_training_args, tokenized_ds_sft, compute_metrics, data_collator)

t0 = time.perf_counter()
baseline_trainer.train()
baseline_train_time = time.perf_counter() - t0

# Save and reload baseline model
baseline_trainer.save_model(os.path.join(OUTPUT_DIR, "trainer_sft_baseline", "finetuned_sft_baseline.model"))
base_model = AutoModelForCausalLM.from_pretrained(os.path.join(OUTPUT_DIR, "trainer_sft_baseline", "finetuned_sft_baseline.model")).to(DEVICE)

t0 = time.perf_counter()
baseline_eval_metrics = baseline_trainer.evaluate()
baseline_eval_time = time.perf_counter() - t0

baseline_eval_loss = float(baseline_eval_metrics["eval_loss"])
baseline_rougeL = baseline_eval_metrics.get("eval_rougeL", None)
Hint: Output...
================================================================================
TRAINING BASELINE MODEL (FULL SFT)
================================================================================
Full SFT trainable params: 1484916736
{
  "eval_loss": 1.8790407180786133,
  "eval_rougeL": 0.5532589329689496,
  "eval_runtime": 33.9881,
  "eval_samples_per_second": 11.769,
  "eval_steps_per_second": 11.769,
  "epoch": 2.0
}

We also measure how long it takes to fine-tune all the parameters, because the next step is to see whether LoRA can reach a similar level of instruction following in less time and with far fewer trainable parameters.

Step 4: Fine-tuning with LoRA

Utilities for modifying models

Define a function extract_lora_targets that extracts the query and value linear layers from all Transformer blocks in an OLMo model. Return a dictionary that maps the component name to the corresponding linear layer.

Hint: How to access the query and value linear layers.

As we saw earlier, the OLMo model consists of a hierarchy of nested submodules. Each of these can be addressed by a fully-qualified string name.

You can use get_submodule() to retrieve a layer by a string name. For instance, 'model.layers.0.self_attn.q_proj' refers to the Q part of Transformer layer 0.

It's OK to hard-code this part, so that you just enumerate the Q and V parts of all layers here.

Sanity check: If you apply this on an OLMo model, the result should contain 16 named linear layers.
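A possible sketch of extract_lora_targets, hard-coding the projection names as the hint suggests (the exact set of projections, and hence the resulting count, should match the instructions and the sanity check above):

def extract_lora_targets(model):
    """Sketch: collect the Q and V projection layers of every Transformer block."""
    targets = {}
    for i in range(model.config.num_hidden_layers):
        for proj in ("q_proj", "v_proj"):
            name = f"model.layers.{i}.self_attn.{proj}"
            targets[name] = model.get_submodule(name)
    return targets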

We also need a convenience function that puts layers back into a model. The following function does the trick. The named_layers argument uses the same format as returned by extract_lora_targets.

def replace_layers(model, named_layers):
    """
    Replace submodules in `model` by name.
    """
    for name, layer in named_layers.items():
        components = name.split(".")
        submodule = model
        for comp in components[:-1]:
            submodule = getattr(submodule, comp)
        setattr(submodule, components[-1], layer)
    return model

Implementing the LoRA layer

To implement the LoRA approach, we define a new type of layer that will be used as a drop-in replacement for a regular linear layer.

In the paper by Hu et al. (2021), the structure is presented visually in Figure 1, and equation (3) shows the same idea.

Start from the following skeleton and fill in the missing pieces:

import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, W, r, alpha):
        super().__init__()
        # TODO: Add your code here

    def forward(self, x):
        # TODO: Replace the next line with your own code
        raise NotImplementedError

Here, W is the linear layer we are fine-tuning, while r and alpha are hyperparameters described in Section 4.1 of the paper. The layer computes h = W x + (alpha / r) * B A x, where W stays frozen and only the low-rank matrices A and B are trained. The r parameter controls the parameter efficiency: by setting it to a low value, we save memory but make a rougher approximation. The alpha parameter is a scaling factor.

Hint: How to initialize A and B.

To follow the description closely, we should use the parameter initialization approach recommended in the paper (see Figure 1): A is initialized from a Gaussian distribution and B with zeros, so that BA is zero at the start of training.

You can use nn.init.normal_ and nn.init.zeros_ here.
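For reference, a minimal sketch of such a layer, assuming the standard LoRA formulation h = W x + (alpha / r) * B A x, with A drawn from a Gaussian and B initialized to zero (so the model is unchanged at the start of training). The initialization standard deviation below is an arbitrary choice, not something prescribed by the assignment:

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, W, r, alpha):
        super().__init__()
        self.W = W  # the original (frozen) linear layer
        for p in self.W.parameters():
            p.requires_grad = False
        d_out, d_in = W.weight.shape  # nn.Linear stores weight as (out_features, in_features)
        self.A = nn.Parameter(torch.empty(r, d_in))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        nn.init.normal_(self.A, std=0.02)  # the std is a design choice
        self.scaling = alpha / r

    def forward(self, x):
        # h = W x + (alpha / r) * B A x
        return self.W(x) + self.scaling * (x @ self.A.T @ self.B.T)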

Fine-tuning with LoRA

Set up a model where you replace the four linear layers in attention blocks (query, key, value, and output) with LoRA layers. Use the following steps:

  • First use extract_lora_targets to get the relevant linear layers.
  • Each of the linear layers in the returned dictionary should be wrapped inside a LoRA layer.
  • Then use replace_layers to put them back into the model.

Sanity check: Use your function num_trainable_parameters. The number of trainable parameters should be less than in Step 3. The exact number will depend on the rank.
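One possible way to wire these steps together (the values of r and alpha below are only examples, and the code assumes the LoRALayer sketch above or your own implementation):

lora_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME_OR_PATH).to(DEVICE)

# Freeze all original weights; only the newly added LoRA parameters remain trainable.
for p in lora_model.parameters():
    p.requires_grad = False

targets = extract_lora_targets(lora_model)
wrapped = {name: LoRALayer(layer, r=8, alpha=16) for name, layer in targets.items()}
lora_model = replace_layers(lora_model, wrapped).to(DEVICE)

print(f"LoRA trainable params: {num_trainable_parameters(lora_model)}")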

Train this model and compare the training speed, metrics, and outputs to the results from Step 3.

Correction (Nov. 28): We fixed a couple of typos here, in particular a mistake in the instructions about which layers you should apply LoRA to.

Side notes:

Running training on Minerva: When you are ready to perform a full fine-tuning run, submit the provided training job as:

sbatch run.sh --num-epochs 2 --output-dir /path/to/runs/baseline

Running inference on Minerva: To test your fine-tuned or LoRA-adapted checkpoints without re-running training, you can run the inference job (omit --adapter-path to use the base model).

sbatch inference.sh --adapter-path /path/to/lora_state_dict.pt "Summarize ..." --input "..."
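If you later want to reload a LoRA run with --adapter-path, one option (a sketch only; check predict.py for the exact format it expects) is to save just the trainable LoRA parameters after training:

import torch

lora_state = {
    name: p.detach().cpu()
    for name, p in lora_model.named_parameters()
    if p.requires_grad  # only the LoRA matrices are trainable at this point
}
torch.save(lora_state, os.path.join(OUTPUT_DIR, "lora_state_dict.pt"))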