Enhancing Model Performance: The Impact of Fine-tuning with LoRA & QLoRA

👋 Introduction

As businesses increasingly turn to AI for value generation, the applications of neural network-based techniques and Large Language Models (LLMs) have become pivotal. Advancements in natural language processing have empowered organizations to address various text-related challenges, including classification, summarization, and controlled text creation. While third-party APIs may be convenient, fine-tuning models with proprietary data yields cost-effective and domain-specific solutions that can be securely deployed across various environments. However, choosing the right strategy for fine-tuning is critical.

This blog explores one of the most popular and effective methods for parameter-efficient fine-tuning: Low-Rank Adaptation (LoRA), with a specific focus on QLoRA, an even more efficient variant. The goal is to take an open large language model and fine-tune it to generate a proper response based on the given message. For this exercise, we have chosen the TheBloke/Mistral-7B-Instruct-v0.2-GPTQ model, which is open-source with a permissive license (Apache 2.0), and the Samhita/slack-data-long-responses dataset, both available for download from the HuggingFace Hub.

What is PEFT Finetuning?

Parameter-Efficient Fine-tuning (PEFT) is a set of techniques designed to make model training more efficient. Traditional fine-tuning updates all of a model's parameters, which is resource-intensive and time-consuming. PEFT techniques, such as Prefix Tuning, P-tuning, and LoRA, drastically reduce the number of trainable parameters, leading to faster and more cost-effective training while retaining most of the quality of full fine-tuning.

Benefits of PEFT Fine-tuning

Parameter Efficient Fine-tuning (PEFT) offers several compelling benefits, particularly for enterprises and large businesses seeking to fine-tune Large Language Models (LLMs) like BERT and GPT. These benefits include:

  1. Time Savings: By reducing the number of trainable parameters, PEFT accelerates both the training and testing processes. This efficiency allows for quicker exploration of different models, datasets, and techniques, ultimately saving valuable time in model development and deployment.
  2. Cost Efficiency: PEFT's memory optimizations enable training on less powerful computational resources, resulting in reduced costs associated with training on large datasets. This cost-effectiveness makes PEFT particularly attractive for businesses looking to scale their AI capabilities without significantly increasing their infrastructure costs.
  3. Improved Model Performance: Despite the reduction in trainable parameters, PEFT techniques often lead to improved model performance. By focusing on optimizing the most important parameters, PEFT can enhance the overall effectiveness and accuracy of LLMs, leading to better results in various natural language processing tasks.

What are LoRA & QLoRA?

LoRA, short for Low-Rank Adaptation, is a breakthrough technique for fine-tuning LLMs. Traditionally, fine-tuning such models involved adjusting all the weights in the pre-trained model's weight matrices. LoRA takes a different approach: it trains two much smaller matrices whose product approximates the update to the original weight matrix. These small matrices form the LoRA adapter, which is loaded alongside the pre-trained model for use during inference.
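To make the parameter savings concrete, here is a minimal sketch (our illustration, not from the original post) using a hypothetical 4096 × 4096 attention weight matrix, which matches the hidden size of Mistral-7B, and rank r = 8:


import numpy as np

# Hypothetical dimensions: a 4096x4096 attention weight matrix and LoRA rank r = 8
d, r = 4096, 8

A = np.random.randn(r, d) * 0.01   # LoRA matrix A (trainable), small random init
B = np.zeros((d, r))               # LoRA matrix B (trainable), initialized to zero
alpha = 16                         # lora_alpha scaling factor (an assumed value)

# The frozen pre-trained weight W receives the low-rank update at inference time:
# W_adapted = W + (alpha / r) * (B @ A)
delta_W = (alpha / r) * (B @ A)

full_params = d * d                # 16,777,216 parameters if W were fine-tuned directly
lora_params = A.size + B.size      # 65,536 trainable parameters with LoRA
print(f"Trainable fraction: {lora_params / full_params:.4%}")  # ~0.39%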

QLoRA takes this efficiency further by leveraging quantization. It loads the pre-trained model's weights into GPU memory as quantized 4-bit values, rather than the 8-bit or 16-bit weights typically used with LoRA. Despite this reduction in memory usage, QLoRA maintains a similar level of effectiveness to LoRA. By probing and comparing these methods, and optimizing QLoRA's hyperparameters, we aim to achieve optimal performance with the quickest training time.

LoRA is implemented in the Hugging Face Parameter-Efficient Fine-Tuning (PEFT) library and is easy to use, and QLoRA can be utilized by combining bitsandbytes with PEFT. Additionally, the Hugging Face Transformer Reinforcement Learning (TRL) library provides a convenient trainer for supervised fine-tuning with seamless LoRA integration. Together, these libraries offer the tools needed to fine-tune the chosen pre-trained model to generate coherent and helpful responses to Slack messages.
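As a quick illustration of that TRL integration, here is a minimal sketch assuming a loaded model and a dataset with a "text" column; the walkthrough below uses the standard transformers Trainer instead, and exact keyword arguments vary between trl versions:


from trl import SFTTrainer
from peft import LoraConfig

# Minimal supervised fine-tuning setup: SFTTrainer applies the LoRA config internally
peft_config = LoraConfig(r=8, lora_alpha=32, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,                      # a pre-trained causal LM loaded beforehand
    train_dataset=dataset["train"],   # dataset whose "text" column holds the formatted prompts
    peft_config=peft_config,
    dataset_text_field="text",        # column to read the training text from
)
trainer.train()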

Setting up the environment

Here, we install and import all required Python libraries and modules. They help with training efficiency (accelerate), enable low-rank adaptation (peft), facilitate quantized training (bitsandbytes), give access to pre-trained models and tools (transformers), and load data from the Hub (datasets).


!pip install accelerate peft transformers datasets trl auto-gptq optimum bitsandbytes

import pandas as pd                  # data frame manipulation for formatting the dataset
import torch                         # needed for dtype settings in the quantization config
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

Preparing Data for Supervised Fine-tuning

Before we can effectively utilize PEFT for fine-tuning a model for following instructions, it's crucial to prepare the data in a format suitable for supervised fine-tuning. Supervised fine-tuning involves training a pre-trained model to generate text based on a given prompt. This process is supervised because the model is fine-tuned on a dataset containing prompt-response pairs formatted consistently.

Each example in our chosen dataset on the Hugging Face Hub pairs an input message with an output response.

As useful as this dataset is, it is not formatted for instruction-following fine-tuning in the manner described above.

The following code snippet loads the dataset from the Hugging Face Hub into memory and transforms the necessary fields into a consistently formatted string containing the instruction, input message, and response.


# load dataset
data = load_dataset("Samhita/slack-data-long-responses")
df = pd.DataFrame(data['train'])

def format_data(row):
    input_content = row['input']
    output_content = row['output']
    system_prompt = f"""
    [INST]
    Below is an instruction that describes a task, paired with a message that provides further context. Write a response that appropriately completes the request. 
    
    SlackGPT, functioning as a Project Manager in an AI SAAS-based company, communicates in clear, accessible language, escalating to technical depth upon request.
    It reacts to feedback aptly and ends responses with its signature '–SlackGPT'. 
    It should always acknowledge messages, even if it cannot provide a direct answer, and should seek further assistance or clarification only when necessary
    thus keeping the interaction natural and engaging. Please respond to the following message."""

    message = f""" 
    ###Instruction: \n
    {system_prompt} 
     \n
    ###Message: \n
    {input_content} \n
    ### Response: \n
    {output_content}
     [/INST]
    """

    return message

# Apply the format_data function to each row of the data frame
df['formatted_data'] = df.apply(format_data, axis=1)

The resulting prompts are added to the formatted_data column of the data frame, which is then loaded into a Hugging Face dataset for supervised fine-tuning. In the prompt, we use the [INST] special token to leverage instruction fine-tuning. You can read more about this special token here.
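For completeness, here is a minimal sketch (our assumption of how the split was done, not shown in the original post) of turning the formatted data frame into a Hugging Face Dataset with train and test splits:


from datasets import Dataset

# Build a Dataset from the formatted prompts created above
dataset = Dataset.from_pandas(df[['formatted_data']], preserve_index=False)

# Hold out a small test split for evaluation during training
dataset = dataset.train_test_split(test_size=0.1, seed=42)
print(dataset)  # DatasetDict with "train" and "test" splits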

Testing model performance before fine-tuning

It's useful to evaluate the pre-trained model without any modifications to establish a baseline before fine-tuning.
The model is loaded and prompted according to the format specified in its model card on Hugging Face for an initial performance assessment; it can optionally be loaded in 8-bit to save GPU memory.
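The baseline code below loads the model in its default precision. If you do want the 8-bit option, a minimal sketch using bitsandbytes (our illustration, not part of the original walkthrough) looks like this:


# Load the pre-trained model with 8-bit weights to reduce GPU memory usage
bnb_8bit_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_8bit_config,
    device_map="auto",
)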


model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name,
		device_map="auto", # automatically figures out how to best use CPU + GPU for loading model
		trust_remote_code=False, # prevents running custom model files on your machine
		revision="main") # which version of model to use in repo
                                             
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", use_fast=True, local_files_only=False, trust_remote_code=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.eval()


prompt = f""" [INST]
Can you please help me out with two things? I'd like to update our web flow pricing page with our new plans (Go-Getter, Unstoppable, Ultimate) I'd like to move all of the active campaign zaps over to our dedicated Supplier HQ Zapier account. I just added you to the zapier account.  [/INST]"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generation_output = model.generate(
input_ids=input_ids, max_new_tokens=128
)

print(tokenizer.decode(generation_output[0]))


### OUTPUT
I'd be happy to help you update your web flow pricing page with your new plans, Go-Getter, Unstoppable, and Ultimate. Here's a suggested layout for your pricing page:
1. Header: Include your logo, navigation menu, and a brief tagline about your pricing plans.
2. Introduction: Write a brief introduction about your pricing plans and their benefits.
3. Pricing Plans: Create separate sections for each pricing plan, including a plan name, description, features, and pricing.
4. Comparison Table: Create a comparison table that shows the differences between each pricing plan.
5. Call-to-Action: Include a clear call-to-action button for users to sign up or start a free trial.
6. Footer: Include your contact information, social media links, and any other relevant information.
As for moving all of your active campaign zaps over to your dedicated Supplier HQ Zapier account, here are the steps you can follow:


As shown in the output, the base model's response is very lengthy and it hallucinates a lot. This emphasizes the importance of fine-tuning the model to improve its performance on specific tasks. Let's now fine-tune the model using LoRA and QLoRA and see how the performance improves for the same query.

Configure LoRA parameters

When using PEFT to train a model with LoRA or QLoRA, the hyperparameters of the low-rank adaptation process can be defined in a LoRA config. Two key hyperparameters, "r" and "target_modules," significantly affect adaptation quality and are the focus of the tests that follow.

  1. Rank (r): This parameter determines the rank of the low-rank matrices learned during the fine-tuning process. A lower rank may lead to quicker but potentially lower-quality model training. Increasing r beyond a certain value may not yield significant quality improvements. The value of r and its effect on adaptation quality will be tested.
  2. Target Modules: This parameter specifies which modules in the model architecture to target during LoRA adaptation. While it's common practice to target only the attention blocks of the transformer to reduce training time and compute resources, recent work suggests that targeting all linear layers may improve adaptation quality.

# LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj","k_proj"], # We are only targetting the query and key attention layer
    lora_dropout=0.01,
    bias="none",
    task_type="CAUSAL_LM"
)

# LoRA trainable version of the model
model = get_peft_model(model, config)

# trainable parameter count
model.print_trainable_parameters()
## trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.04703666202518836


Additionally, for QLoRA specifically, the pretrained model's weights are frozen in 4-bit during the fine-tuning process, and the QLoRA paper typically uses a rank of 64 for the low-rank approximation. Adjusting these parameters, along with the dropout rate and lora_alpha, lets you trade off performance against resource consumption and find the optimal setup for the task at hand.

Setting up the training parameters

Next, we define the training arguments and create a Trainer instance. To perform fine-tuning, the following steps are required:

  1. Define the LoRA Configuration: Set up the LoRA configuration as discussed earlier, specifying parameters like rank and target modules.
  2. Prepare the Data: Split the prepared instruction-following data into train and test sets and convert them into Hugging Face Dataset objects (the tokenization step is sketched in the code below).
  3. Define Training Arguments: Set the number of epochs, batch size, and other relevant training hyperparameters. These parameters will remain constant throughout the training process.

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", use_fast=True)
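
# The Trainer below expects tokenized_data and a data_collator, which the original
# post does not show being created. A minimal sketch (assuming the train/test dataset
# built from the formatted_data column earlier) could look like this:
from transformers import DataCollatorForLanguageModeling

def tokenize_fn(examples):
    # Tokenize the formatted prompt/response strings, truncating to a manageable length
    return tokenizer(examples["formatted_data"], truncation=True, max_length=512)

tokenized_data = dataset.map(tokenize_fn, batched=True, remove_columns=dataset["train"].column_names)

# Collator for causal language modeling (mlm=False); it pads each batch dynamically
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)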

# hyperparameters
lr = 2e-4
batch_size = 10
num_epochs = 2

# define training arguments
training_args = transformers.TrainingArguments(
    output_dir= "SlackGPT-ft",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=10,
    warmup_steps=2,
    fp16=True,
    optim="paged_adamw_8bit",
)

# configure trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    args=training_args,
    data_collator=data_collator
)

# train model
model.config.use_cache = False
trainer.train()

# re-enable the KV cache for faster generation at inference time
model.config.use_cache = True


Let's evaluate the performance of the model after fine-tuning it. We saved the refined model after training and will load it for testing. We will use the same query as before to test the response of the fine-tuned model.
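The save/load step is not shown in the original post; a minimal sketch (the adapter path here is hypothetical) of saving the LoRA adapter and loading it back onto the base model for testing looks like this:


# Save only the small LoRA adapter weights (the base model itself is unchanged)
trainer.model.save_pretrained("SlackGPT-ft-adapter")  # hypothetical output path

# Reload the base model and attach the fine-tuned adapter for inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
ft_model = PeftModel.from_pretrained(base_model, "SlackGPT-ft-adapter")
ft_model.eval()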

Query :


Can you please help me out with two things? I'd like to update our web flow pricing page with our new plans (Go-Getter, Unstoppable, Ultimate) I'd like to move all of the active campaign zaps over to our dedicated Supplier HQ Zapier account. I just added you to the zapier account.

Response :


Of course! I'd be happy to help with both tasks.

For updating the web flow pricing page with the new plans, could you please provide me with the details for each new plan (Go-Getter, Unstoppable, Ultimate)? This will help me ensure that the information is accurate and up to date.

Regarding the active campaign zaps, I'll need access to the dedicated Supplier HQ Zapier account. I see that you've added me to the account, so I'll take a look and start moving the zaps over. If there's anything specific you'd like me to keep in mind during this process, please let me know.

–SlackGPT


The model has clearly been adapted to generate more consistent responses. During training, fewer than 0.05% of the total parameters were trainable (as reported by print_trainable_parameters above), and the process completed in an impressive two hours, even though only the query and key projection matrices were adapted. By adjusting the target_modules and r parameters, you can observe changes in the model's performance and resource consumption and find the optimal configuration for your specific task.

QLoRA Fine-tuning using HuggingFace

To perform QLoRA fine-tuning with Hugging Face, you'll need to install the bitsandbytes library. It handles the 4-bit quantization and manages low-precision storage alongside high-precision compute operations. You can learn more about quantizing a model here.
To load the model with 4-bit quantization and set up the tokenizer, follow these steps:


# Quantization Config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)
# Model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1
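
The tokenizer setup mentioned above is not shown in the original snippet; a minimal sketch, reusing the EOS token for padding (a common choice, and an assumption on our part), might look like this:


# Tokenizer set up for the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token for causal LM training
tokenizer.padding_side = "right"           # pad on the right to avoid issues with fp16 training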


After setting up the model and tokenizer, you can proceed with training as before using the HF Trainer.
For QLoRA, the hyperparameters we tried are r=8, targeting only the attention blocks, namely "q_proj" and "v_proj", for adaptation, as sketched below.
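A minimal sketch of this QLoRA configuration (the values other than r=8 and the two target modules are assumptions on our part):


# Prepare the 4-bit quantized model for training (gradient checkpointing, casting norms to fp32, etc.)
model = prepare_model_for_kbit_training(model)

qlora_config = LoraConfig(
    r=8,
    lora_alpha=16,                        # assumed scaling value
    target_modules=["q_proj", "v_proj"],  # only the query and value projections, as described above
    lora_dropout=0.01,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, qlora_config)
model.print_trainable_parameters()


With this configuration in place, we evaluate the fine-tuned model on the same query as before and observe the response.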

Response :


Absolutely! For updating the pricing page, I recommend adding the new plans under a clearly labeled section,
highlighting their features and benefits. This will help users quickly understand the options available.
Regarding the Zapier account, I see you've added me. To move the active campaign zaps, 
I'll need access to the current account to review and migrate them. Could you please provide the necessary details or steps to proceed?
–SlackGPT


The model's response is now very concise and well-structured. It's logical and relevant. As a reminder, these relatively high-quality results are achieved by fine-tuning less than 1% of the model's weights. According to research, it has been observed that modifying the value of "r" does not lead to any significant improvement in the quality of adaptation beyond a certain threshold. The most substantial enhancement is observed when all linear layers are targeted during the adaptation process, rather than just the attention blocks.

Best Practices for Fine-Tuning with LoRA and QLoRA

  1. Optimal Scaling Coefficient: According to the original LoRA paper, choosing the scaling coefficient alpha as two times the rank parameter r often results in the best outcomes. This means setting alpha = 2 * r. However, it can be beneficial to experiment with different ratios to find the optimal configuration for your specific task and model.
  2. Enable LoRA for More Layers: While experiments often focus on enabling LoRA for select weight matrices, such as the Key and Value matrices in each transformer layer, consider enabling LoRA for additional layers, such as Query matrices, linear layers between multihead attention blocks, and the linear output layer. This can increase the number of trainable parameters and memory requirements but may improve modeling performance noticeably. Learn more about this here.
  3. Optimize Adapter Usage: The LoRA adapter obtained through fine-tuning is typically tiny compared to the pre-trained base model. During inference, both the adapter and the pretrained model need to be loaded, so memory requirements stay similar; keeping them separate adds a small amount of inference latency. Merging the weights of the pre-trained model and the adapter removes this overhead and can be done efficiently using the PEFT library with a single line of code (see the sketch after this list).
  4. Consider the Trade-offs of Merging Adapters: While merging adapters can reduce inference latency, it eliminates the ability to efficiently serve a single large pre-trained model with multiple task-specific adapters. The decision to merge weights depends on the specific use case and acceptable inference latency.
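A minimal sketch of the merge mentioned in point 3, using the PEFT API (assuming ft_model is a PeftModel with the trained adapter attached, as in the earlier loading sketch):


# Merge the LoRA adapter weights into the base model and drop the adapter wrappers.
# The result is a plain transformers model that can be saved and served without PEFT.
merged_model = ft_model.merge_and_unload()
merged_model.save_pretrained("SlackGPT-ft-merged")  # hypothetical output path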

Conclusion

In conclusion, fine-tuning a model for your business can significantly improve its performance and customization. This process allows you to tailor the model to your specific needs and knowledge, providing a level of customization that standard models may not offer. Whether you're complementing a RAG pipeline or need more in-depth customization, fine-tuning can help you achieve your goals effectively.

Elevate Your Business with Generative AI

The field of artificial intelligence is constantly changing, and many businesses are automating their workflows and making their lives easier by integrating AI solutions into their products.

Do you have an idea waiting to be realized? Book a call to explore the possibilities of generative AI for your business.

Thanks for reading 😄.

Book an AI consultation

Looking to build AI solutions? Let's chat.

Schedule your consultation today - this is not a sales call, so feel free to come prepared with your technical queries.

You'll be meeting Rohan Sawant, the Founder.
Behind the Blog 👀
Pinak Faldu
Writer

Software Engineer 🚀| Learning new things everyday.

Pranav Patel
Editor

Good boi. He is a good boi & does ML/AI. AI Lead.