CodeLlama FineTuning for Classification

A Challenging Coding Task

Natan Katz
Oct 28, 2023

In recent months, LLMs have undoubtedly been the hottest topic in the AI world. The novel idea is that off-the-shelf language models are available to everyone, helping build stronger-than-ever AI tools and opening up domains that were not common targets for AI engines before. Traditionally, AI tools were used mainly for solving R&D tasks or boosting marketing: sentences such as “We use AI and LLMs in our models” are guests of honor in nearly every marketing or customer-success pitch. The LLM revolution offers AI-based solutions for operations, HR, help desk, and additional departments across the industry. There are plenty of wrapper tools, such as LangChain, that assist in using LLMs and allow simple integration of their capabilities into an organization’s engines. However, these shelf models have been trained on general language and often require fine-tuning to provide proper results for a given organization. In this post, we discuss a method for performing this process.

What is Fine-tuning?

Fine-tuning is simply about taking an existing model (we consider mainly language models in this post) as a base model and reusing the language knowledge already encoded in its weights as the starting point for further training.

Fine-tuning reduces the amount of training data required compared to training our own model from scratch. Nevertheless, we still need to provide training data that reflects our needs (namely, the particular text lexicon of our product).

This data is required since the base model has profound English knowledge but is less familiar with our specific terminology. One can think of fine-tuning as teaching the base model to speak our language. There are challenges in fine-tuning, such as:

  • The base models are often huge memory-wise and require plenty of resources to train them.
  • Preparing relevant data can be tedious.

However, fine-tuning is often cardinal for optimizing AI usage.

In the next part of the post, we will describe some technical details of this training. Our problem won’t be text generation but text classification: we use the text as input to a classifier rather than generating more text.

LoRA

Researchers published the LoRA algorithm in 2021, and it became the standard for fine-tuning Llama models. Shortly after the paper’s publication, one could find fine-tuning code. This code uses special Python packages such as bitsandbytes, PEFT, and TRL’s SFTTrainer, which support the algorithm and allow optimizing such huge models. However, this class of models (Llama, CodeLlama, and others) contains mainly generative language models. These algorithms focus on outputting text for a given input; they follow a next-word(s) mechanism. Our objective is a classification problem: given an input text, we wish to classify it rather than generate an additional one. For this purpose, we need to modify the standard fine-tuning code. The following sections focus on these changes.

The DL Differences

DL-wise, the difference between text generation and text classification is manifested in the upper layers, those close to the output layer. In text generation, we aim to predict the next word, so we output a vector of the size of the vocabulary. In classification, we map the embedding dimension to a layer whose size is the number of classes. When we fine-tune a text-generation model such as Llama for classification, this layer (the one that maps embeddings to classes) is not a trained matrix but a randomly initialized one that we optimize during fine-tuning.
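To make this difference concrete, here is a minimal sketch of the two output heads (illustrative only; the hidden size and vocabulary size follow CodeLlama-7b, as seen in the debugger dumps later in the post):

import torch.nn as nn

hidden_size = 4096   # CodeLlama-7b embedding dimension
vocab_size = 32016   # CodeLlama-7b vocabulary size
num_labels = 2       # our classification task

# Text generation: the lm_head maps every token embedding to vocabulary logits.
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Classification: the score head maps the embedding to class logits and starts
# from random weights when we wrap a generative base model.
score = nn.Linear(hidden_size, num_labels, bias=False)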

Classification Vs Text Generation

As I mentioned in the previous section, there are some changes one needs to make in order to use Llama for classification. The first one is data preparation. We assume that our data is comprised of tuples: (text, label). If we have a list vv in which every element is such a tuple, we can use the following:

import datasets

def get_data(vv):
    new_dic = {"text": [i[0] for i in vv], "labels": [i[1] for i in vv]}
    data0 = datasets.Dataset.from_dict(new_dic)
    return data0

data_set = get_data(vv)
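For illustration, the input list vv could look like this (a hypothetical example; the binary labels match the num_labels=2 we use later):

vv = [
    ("def add(a, b):\n    return a + b", 1),
    ("SELECT * FROM users;", 0),
]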

We now have a dataset object with two columns: the text and the labels. Next, we need to load the base model and the tokenizer. In our example, we will use CodeLlama:

base_model = "codellama/CodeLlama-7b-hf"
tokenizer = get_tokenizer(base_model)

Where get_tokenizer is the following:

from transformers import AutoTokenizer

def get_tokenizer(model_name):
    # Load the Llama tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    # Note: this assignment overrides the '[PAD]' token added above,
    # so padding effectively reuses the EOS token
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    return tokenizer
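As a quick sanity check (not part of the original flow), one can verify which token ends up being used for padding:

print(tokenizer.pad_token, tokenizer.pad_token_id)  # mirrors the EOS token after the assignment above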

We now need to tokenize the text. This can be performed using the following code:

import torch

def prepare_obj(tokenizer, dataset):
    inputs = tokenizer([i for i in dataset["text"]], padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(dataset["labels"])
    # TextClassificationDataset is a custom wrapper around the tokenized tensors
    text_obj = TextClassificationDataset(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], labels=labels)
    return text_obj

# We call this method:
text_loader = prepare_obj(tokenizer, data_set)
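TextClassificationDataset is a custom wrapper that is not shown in the post; a minimal sketch of such a wrapper, assuming a plain torch Dataset that returns one dictionary per sample (the format the Hugging Face Trainer expects), could be:

import torch
from torch.utils.data import Dataset


class TextClassificationDataset(Dataset):
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # One sample: token ids, attention mask, and the class label
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.labels[idx],
        }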

We now have data organized in a form that allows Hugging Face to feed it to its Trainer object. Next, we need to build the model to be fine-tuned. For loading the base model (recall that in this post it is CodeLlama), we first need to define the bnb_config:

import torch
from transformers import BitsAndBytesConfig


def get_bnb_config():
    compute_dtype = getattr(torch, "float16")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )

    # Check GPU compatibility with bfloat16
    if compute_dtype == torch.float16:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16: accelerate training with bf16=True")
            print("=" * 80)
    return bnb_config

The values here were taken from the original fine-tuning post. We can now load the base model:

from transformers import AutoModelForSequenceClassification

bnb_config = get_bnb_config()
model = AutoModelForSequenceClassification.from_pretrained(
    base_model,            # "codellama/CodeLlama-7b-hf"
    num_labels=2,
    quantization_config=bnb_config,
    device_map=my_c.device_map,
    output_scores=True,
)
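As a hypothetical quick check (not part of the original flow), printing the classification head shows the freshly initialized linear layer discussed next:

print(model.score)
# Expected: a small head mapping 4096 features to 2 classes
# (the exact repr may differ when the model is loaded in 4-bit)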

Note that we load the model with AutoModelForSequenceClassification. This means that the upper matrix is random, not trained: CodeLlama is a generative model, so it doesn't come with a mapping from embeddings to classes. This issue exists in every generative LLM, so when one wishes to turn such a model into a classification model, one needs to pay attention to it. In addition, we pass the number of classes (num_labels) to define the model properly. We now present the PEFT object. It is cardinal to observe the change that we use for classification:

from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)

The numerical values were taken from the original post; clearly, for different applications we may need to change them. However, the last variable, task_type, is cardinal. In the original post, the PEFT object is defined as follows:

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

If we don't change "CAUSAL_LM" to TaskType.SEQ_CLS, the model won't save the matrix that maps the embedding dimension to the classes (I will present this in the next section). The next steps are defining the Trainer object and simply training; they are similar to the original post:

from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir=my_c.output_dir,
    num_train_epochs=my_c.num_train_epochs,
    per_device_train_batch_size=my_c.per_device_train_batch_size,
    gradient_accumulation_steps=my_c.gradient_accumulation_steps,
    optim=my_c.optim,
    save_steps=my_c.save_steps,
    logging_steps=my_c.logging_steps,
    learning_rate=my_c.learning_rate,
    weight_decay=my_c.weight_decay,
    fp16=my_c.fp16,
    bf16=my_c.bf16,
    max_grad_norm=my_c.max_grad_norm,
    max_steps=my_c.max_steps,
    warmup_ratio=my_c.warmup_ratio,
    group_by_length=my_c.group_by_length,
    lr_scheduler_type=my_c.lr_scheduler_type,
    report_to="tensorboard",
)

I used the same values that Hugging Face recommends for text generation. However, as with every other set of hyperparameters, we need to test these values carefully for every application.
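my_c is a configuration object holding these hyperparameters; it is not shown in the post. A minimal illustrative sketch (the values below are assumptions based on common Llama fine-tuning guides, not necessarily the author's) could be:

from types import SimpleNamespace

my_c = SimpleNamespace(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    device_map={"": 0},
    max_seq_length=512,
)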

model.config.pad_token_id = model.config.eos_token_id
trainer = MyTrainer(
    model=model,
    args=training_arguments,
    max_seq_length=my_c.max_seq_length,
    tokenizer=tokenizer,
    train_dataset=text_loader,
    # eval_dataset=tdataset,
    dataset_text_field="text",
    peft_config=peft_config,
    data_collator=data_collator,  # a collator defined elsewhere in the project
)

In the code snippet above, we need to notice two issues:

  • We need to define pad_token_id on the model config (I have no idea why, but the definition in the tokenizer is insufficient).
  • The trainer is often an object of type SFTTrainer; for flexibility (such as providing a customized loss), we override it with our own trainer, shown below:
class MyTrainer(SFTTrainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # print(" in mytrainer ", model.base_model.model.score.weight)
        outputs = model(**inputs)
        logits = outputs.logits
        labels = inputs["labels"]

        loss_fn = torch.nn.CrossEntropyLoss()
        loss = loss_fn(logits, labels)

        return (loss, outputs) if return_outputs else loss

Our loss function here is nn.CrossEntropyLoss(), but one can replace it with any other loss. We can now run the training process and save the model:

trainer.train()
trainer.save_model(new_model)
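Before pushing the model, a quick way to sanity-check its classification output on a single text could look like this (a sketch; the example string and the argmax over the logits are my own additions):

example = "def add(a, b):\n    return a + b"
inputs = tokenizer(example, return_tensors="pt").to(trainer.model.device)
with torch.no_grad():
    logits = trainer.model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted class index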

We now have a model that is saved locally. We may prefer pushing it to the Hugging Face Hub so we can load it in the future as a pre-trained model:

 
from peft import PeftModel

base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,              # the base checkpoint, e.g. "codellama/CodeLlama-7b-hf"
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

In the last three lines, the code connects to the Hugging Face Hub; one will need an HF token to complete this task.
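Once pushed, the fine-tuned model can later be reloaded directly from the Hub (the repository id below is a placeholder):

model = AutoModelForSequenceClassification.from_pretrained(
    "your-username/new_model",  # placeholder repository id
    num_labels=2,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/new_model")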

The Score Matrix Bug

As I mentioned, CodeLlama was trained as a text-generation model, and such models don't train a matrix from the embedding space to the classes.

This led to some issues that have probably been solved by now. Since this topic isn't well covered on the web, I will describe what it looks like.

We start by reaching this line:

trainer.train()

We dive into the trainer's inner structure to show what stands behind this issue and to verify that it has been fixed:

trainer = {MyTrainer} <__main__.MyTrainer object at 0x7fde4f68e650>
.
.
model = {PeftModelForSequenceClassification} PeftModelForSequenceClassification(\n (base_model): LoraModel(\n (model): LlamaForSequenceClassification(\n (model): Lla...s_to_save): ModuleDict(\n (default): Linear(in_features=4096, out_features=2, bias=False)\n )\n )\n )\n )\n)
base_model = {LoraModel} LoraModel(\n (model): LlamaForSequenceClassification(\n (model): LlamaModel(\n (embed_tokens): Embedding(32016, 4096)\n ... (modules_to_save): ModuleDict(\n (default): Linear(in_features=4096, out_features=2, bias=False)\n )\n )\n )\n)
model = {LlamaForSequenceClassification} LlamaForSequenceClassification(\n (model): LlamaModel(\n (embed_tokens): Embedding(32016, 4096)\n (layers): ModuleList(\n ...ias=False)\n (modules_to_save): ModuleDict(\n (default): Linear(in_features=4096, out_features=2, bias=False)\n )\n )\n)
.
.
.
score = {ModulesToSaveWrapper} ModulesToSaveWrapper(\n (original_module): Linear(in_features=4096, out_features=2, bias=False)\n (modules_to_save): ModuleDict(\n (default): Linear(in_features=4096, out_features=2, bias=False)\n )\n)

When we observe trainer.model.base_model.model, we can see the object score. This object has two inner objects:

modules_to_save = {ModuleDict: 1} ModuleDict(\n  (default): Linear(in_features=4096, out_features=2, bias=False)\n)
original_module = {Linear} Linear(in_features=4096, out_features=2, bias=False)
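One way to inspect these two copies programmatically during training (a sketch; the attribute path follows the debugger dump above) is:

score = trainer.model.base_model.model.score
print(score.original_module.weight)              # the original_module copy
print(score.modules_to_save["default"].weight)   # the modules_to_save copy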

At the beginning of the process (namely, the start of training), these matrices are equal:

modules_to_save.default

H = {Tensor: (4096, 2)} tensor([[ 0.0161, -0.0099],\n [ 0.0083, -0.0161],\n [-0.0531, -0.0077],\n ...,\n [ 0.0104, -0.0052],\n [ 0.0074, 0.0074],\n
T = {Tensor: (4096, 2)} tensor([[ 0.0161, -0.0099],\n [ 0.0083, -0.0161],\n [-0.0531, -0.0077],\n ...,\n [ 0.0104, -0.0052],\n [ 0.0074, 0.0074],\

original_module
H = {Tensor: (4096, 2)} tensor([[ 0.0161, -0.0099],\n [ 0.0083, -0.0161],\n [-0.0531, -0.0077],\n ...,\n [ 0.0104, -0.0052],\n [ 0.0074, 0.0074],\
T = {Tensor: (4096, 2)} tensor([[ 0.0161, -0.0099],\n [ 0.0083, -0.0161],\n [-0.0531, -0.0077],\n ...,\n [ 0.0104, -0.0052],\n [ 0.0074, 0.0074],\n

Now we can track these two objects (original_module and modules_to_save). After several iterations, we see the following picture:

original_module
H = {Tensor: (4096, 2)} tensor([[ 0.0161, -0.0099],\n [ 0.0083, -0.0161],\n [-0.0531, -0.0077],\n ...,\n [ 0.0104, -0.0052],\n [ 0.0074, 0.0074],\
T = {Tensor: (4096, 2)} tensor([[ 0.0161, -0.0099],\n [ 0.0083, -0.0161],\n [-0.0531, -0.0077],\n ...,\n [ 0.0104, -0.0052],\n [ 0.0074, 0.0074],\n

modules_to_save.default

H = {Tensor: (4096, 2)} tensor([[ 0.0010, 0.0052],\n [ 0.0101, -0.0042],\n [ 0.0067, -0.0037],\n
T = {Tensor: (4096, 2)} tensor([[ 0.0010, 0.0052],\n [ 0.0101, -0.0042],\n

Hence, the original weights stay fixed while the modules_to_save weights are modified. Recall that to achieve this, you must set the peft_config task_type to TaskType.SEQ_CLS.

As a test, it is preferable to view the model's score weights after it is loaded and compare them to the weights obtained during training.
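A sketch of such a check, assuming the merged model was saved or pushed under new_model, could be:

reloaded = AutoModelForSequenceClassification.from_pretrained(new_model, num_labels=2)
print(reloaded.score.weight)
# Compare against the trained head observed during training:
print(trainer.model.base_model.model.score.modules_to_save["default"].weight)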

Summary

In this post, I discussed the development steps required to modify a Llama fine-tuning flow from text generation to classification. I focused on the differences in the special packages such as bitsandbytes and TRL's SFTTrainer. I believe that LLM-based classification will become a more common task in the future, which makes this post beneficial. One can find the source code of this project here.

Acknowledgments

I wish to acknowledge Raphael Gozlan and Avner Duchovni for their massive assistance in this project.
