Fine-Tuning LLMs for Enterprise Applications: A Practical Guide
April 12, 2025
19 min read
By ML Engineering Team
End-to-end guide to fine-tuning large language models for domain-specific tasks, including data preparation, evaluation metrics, and deployment strategies.
When and Why to Fine-Tune LLMs
Fine-tuning adapts a pre-trained LLM to your specific use case, improving accuracy on domain tasks and reducing serving costs. But it's not always necessary; prompt engineering often suffices.
Fine-tune when:
- Domain-specific terminology matters (legal, medical, technical)
- A consistent output format is required
- Reduced latency is needed (a smaller model responds faster)
- Cost reduction matters (a fine-tuned 7B model is cheaper to run than GPT-4)
- Data privacy requires on-premise deployment

Skip fine-tuning when:
- Training data is limited (<1,000 examples)
- The use case is generic (Q&A, summarization)
- Requirements change rapidly
- Prompt engineering already achieves 90%+ accuracy

Typical cost-benefit of fine-tuning (see the break-even sketch below):
- Fine-tuning cost: $100-10,000 (one-time)
- API cost savings: 50-90% reduction
- Latency improvement: 2-10x faster
- Accuracy improvement: 5-20% higher
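To make the trade-off concrete, here is a minimal break-even sketch. The input figures are illustrative assumptions drawn from the ranges above, not benchmarks:

def months_to_break_even(fine_tune_cost: float,
                         monthly_api_cost: float,
                         savings_rate: float) -> float:
    """Months until the one-time fine-tuning cost is recouped by API savings."""
    monthly_savings = monthly_api_cost * savings_rate
    return fine_tune_cost / monthly_savings

# Assumed inputs: $5,000 one-time tuning cost, $1,800/month API spend,
# 80% cost reduction after switching to the fine-tuned model.
print(round(months_to_break_even(5_000, 1_800, 0.80), 1))  # 3.5 months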
LoRA Fine-Tuning Implementation
# Fine-Tune Llama 2 with LoRA for Customer Support
# Uses 8-bit quantization to fit on single GPU
import torch
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
BitsAndBytesConfig,
TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# 1. Load base model with 8-bit quantization
model_name = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # LLM.int8() weight quantization; compute stays in fp16
)
# Note: compute-dtype and double-quantization options only exist for 4-bit
# (bnb_4bit_*); 8-bit loading needs no extra configuration.
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# 2. Prepare model for LoRA training
model = prepare_model_for_kbit_training(model)
# 3. Configure LoRA
lora_config = LoraConfig(
r=16, # Low-rank dimension
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12
# Only training ~0.12% of parameters = much faster!
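# Where that number comes from (assuming Llama 2 7B's hidden size of 4096
# and 32 layers): each adapted projection adds A (16 x 4096) and B
# (4096 x 16) = 131,072 params; 2 target modules x 32 layers x 131,072
# ≈ 8.4M trainable parameters.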
# 4. Load and format training data
def format_instruction(example):
"""Format data as instruction-following"""
instruction = example['question']
response = example['answer']
return f"""### Instruction:
{instruction}
### Response:
{response}"""
dataset = load_dataset("your-company/customer-support-qa")
dataset = dataset.map(lambda x: {
"text": format_instruction(x)
})
# 5. Training configuration
training_args = TrainingArguments(
output_dir="./llama2-customer-support",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 16
learning_rate=2e-4,
fp16=True,
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
warmup_steps=100,
lr_scheduler_type="cosine",
)
# 6. Train model
trainer = SFTTrainer(
model=model,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
peft_config=lora_config,
dataset_text_field="text",
max_seq_length=512,
tokenizer=tokenizer,
args=training_args,
)
trainer.train()
# 7. Save fine-tuned model
model.save_pretrained("./llama2-customer-support-final")
tokenizer.save_pretrained("./llama2-customer-support-final")
# 8. Inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto"
)
fine_tuned_model = PeftModel.from_pretrained(
base_model,
"./llama2-customer-support-final"
)
# Test inference
prompt = """### Instruction:
How do I reset my password?
### Response:"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=200)  # cap generated tokens, not total length
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Performance:
# - Training time: 4-8 hours on single A100
# - Memory: 24GB VRAM
# - Inference: 50-100 tokens/sec
# - Accuracy improvement: 15-25% over base model
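Before serving, it's common to merge the LoRA adapters back into the base weights so deployment tools load a single standard checkpoint. A minimal sketch using PEFT's merge_and_unload; note that merging should start from a full-precision reload, not the 8-bit model, and the merged output path is a placeholder:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in fp16 (merging into 8-bit weights is not supported)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./llama2-customer-support-final")
merged = merged.merge_and_unload()  # fold LoRA deltas into the base weights
merged.save_pretrained("./llama2-customer-support-merged")
tokenizer.save_pretrained("./llama2-customer-support-merged")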
Evaluation and Production Deployment
Evaluation Metrics:

1. Automatic metrics: score generated answers against references on a held-out set (e.g., exact match, ROUGE, perplexity); see the sketch below.
2. Human evaluation: have domain experts rate sampled responses for correctness, tone, and policy compliance.
3. A/B testing: route a slice of live traffic to the fine-tuned model and compare resolution and escalation rates against the baseline.
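A minimal sketch of automatic evaluation using the Hugging Face evaluate library. The generate_answer helper is hypothetical and assumes the question/answer fields used during training:

import evaluate

rouge = evaluate.load("rouge")

def generate_answer(question: str) -> str:
    """Hypothetical helper: prompt the fine-tuned model, strip the echoed prompt."""
    prompt = f"### Instruction:\n{question}\n### Response:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    out = fine_tuned_model.generate(**inputs, max_new_tokens=200)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return text.split("### Response:")[-1].strip()

preds = [generate_answer(ex["question"]) for ex in dataset["test"]]
refs = [ex["answer"] for ex in dataset["test"]]
print(rouge.compute(predictions=preds, references=refs))  # rouge1/rouge2/rougeL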
Production checklist (a serving sketch follows this list):
- Use vLLM or TGI for serving
- Quantize to INT8 for a ~2x speedup
- Monitor latency, throughput, and cost per request
- Implement fallback to the base model on errors
- Track drift: if accuracy degrades, retrain
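For example, a minimal offline-serving sketch with vLLM, loading the merged checkpoint produced above (the sampling settings are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(model="./llama2-customer-support-merged", dtype="float16")
params = SamplingParams(temperature=0.2, max_tokens=200)

prompt = "### Instruction:\nHow do I reset my password?\n### Response:"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)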
Example cost comparison (illustrative workload):
- GPT-4 API: $60/day (~$1,800/month)
- Fine-tuned Llama 2 7B: $10/day (~$300/month)
- Savings: $1,500/month, or about $18K/year