To Fine-Tune or Not to Fine-Tune: Customising Your LLM

2. The Most Expensive Misconception in AI

When business leaders hear about fine-tuning large language models, many picture something like uploading a company handbook and expecting the AI to memorise every policy, procedure, and product specification. This mental model feels intuitive. After all, that is how we humans learn new information.

Here is the reality that can save your organisation significant time and money: fine-tuning does not teach facts. It teaches behaviour. Think of it like the difference between hiring someone with industry knowledge versus training an employee to communicate in your company’s voice. They are fundamentally different objectives requiring different approaches.

This distinction matters enormously because it determines which tool you should reach for. Retrieval Augmented Generation, commonly called RAG, excels at injecting current, factual knowledge into your AI’s responses. Fine-tuning excels at changing how the model expresses itself, follows instructions, or maintains a consistent persona. Confusing these two approaches leads to wasted budgets, frustrated teams, and underwhelming results. 

3. RAG Versus Fine-Tuning: Choosing the Right Tool

The simplest way to understand when to use each approach is to ask yourself one question: are you trying to give the model new information, or are you trying to change how it behaves? 

| Objective | Best Approach | Example |
| --- | --- | --- |
| Answer questions about your product catalogue | RAG | Customer queries about specifications, pricing, availability |
| Respond in your brand voice consistently | Fine-Tuning | All outputs match your tone guidelines |
| Reference recent company announcements | RAG | Press releases, policy updates, news |
| Format outputs in a specific structure | Fine-Tuning | Always produce JSON in your schema |
| Cite internal documentation accurately | RAG | Technical manuals, compliance documents |
| Handle domain-specific instructions | Fine-Tuning | Industry terminology, workflow patterns |

RAG works by retrieving relevant documents at query time and providing them as context. The model reads your information fresh with every request. Fine-tuning works by adjusting the model’s underlying parameters through training on examples, changing its default behaviours permanently.
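To make the mechanics concrete, here is a minimal sketch of the RAG loop. A toy keyword-overlap scorer over an in-memory document list stands in for embedding search in a real vector store, and `call_llm` is a hypothetical placeholder for whichever chat-completion client you use.

```python
# Toy RAG loop: keyword overlap stands in for vector similarity.
# `call_llm` is a hypothetical placeholder for your LLM client.
DOCS = [
    "Model X supports a 128k-token context window.",
    "Model X costs 2 pounds per million input tokens.",
    "Model X was released in March and replaces Model W.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    # Score each document by how many query words it shares.
    q = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def answer_with_rag(query: str) -> str:
    # The model reads retrieved context fresh with every request, so
    # updating knowledge means updating documents, not model weights.
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # hypothetical LLM call
```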

4. The Power of Hybrid Approaches

The most sophisticated implementations combine both techniques. Consider a financial services firm that needs an AI assistant for client communications. The assistant must cite current market data and regulatory information accurately, which requires RAG. However, it must also maintain a precise, compliant communication style with specific disclosure formatting, which requires fine-tuning. The resulting pipeline runs in three stages.

Stage One: Retrieval

The system fetches relevant documents, market data, and compliance requirements based on the query.

Stage Two: Generation with Tuned Behaviour

A fine-tuned model processes this retrieved context, automatically applying the correct tone, structure, and required disclaimers.

Stage Three: Validation

Output passes through compliance checks before delivery.

This hybrid pattern delivers both accuracy and consistency. The RAG component ensures information remains current without retraining. The fine-tuned model ensures every response meets your standards without manual editing. 
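A sketch of that three-stage flow in code. Every function name here (`retrieve_documents`, `call_finetuned_model`, `passes_compliance_checks`) is a hypothetical placeholder for the corresponding system in your stack.

```python
def handle_client_query(query: str) -> str:
    # Stage one: retrieval keeps facts current without retraining.
    context = retrieve_documents(query)  # market data, regulations (hypothetical)

    # Stage two: the fine-tuned model contributes tone, structure, and
    # disclaimers learned in training; the retrieved context contributes facts.
    draft = call_finetuned_model(context=context, query=query)  # hypothetical

    # Stage three: nothing reaches a client without passing validation.
    if not passes_compliance_checks(draft):  # hypothetical rule engine
        raise ValueError("Response failed compliance review")
    return draft
```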

5. Data Requirements: Quality Over Quantity

One of the most common questions about fine-tuning concerns data volume. How many examples do you actually need? The answer depends entirely on quality.

The Volume Myth

Some teams assume they need tens of thousands of examples. They scrape logs, export chat histories, and compile massive datasets of mediocre quality. These efforts typically produce mediocre results.

The Quality Reality

Research consistently shows that 500 to 1,000 carefully crafted, high-quality example pairs often outperform 100,000 raw training lines. Each example should demonstrate exactly the behaviour you want the model to learn.

What Makes a High-Quality Example

A high-quality example pairs a clear input that represents a real use case with an output that perfectly demonstrates your desired response, formatted consistently with every other example, and your dataset should deliberately include edge cases and variations.
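As a concrete illustration, here is one such pair written in the JSONL chat format that OpenAI-style fine-tuning APIs accept, one conversation per line. The brand, wording, and file name are invented for the example.

```python
import json

# One training pair: a consistent system prompt, a realistic input, and
# an output demonstrating exactly the behaviour you want. Content is invented.
example = {
    "messages": [
        {"role": "system", "content": "You are Acme's support assistant. Reply warmly, in British English, and end with a next step."},
        {"role": "user", "content": "My order hasn't arrived and it's been ten days."},
        {"role": "assistant", "content": "I'm really sorry about the delay. I've flagged your order for review, and you'll hear from us within one working day. Next step: please confirm your order number so we can trace the parcel."},
    ]
}

# Fine-tuning datasets are usually JSONL: one example per line.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```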

The Investment Calculation

Creating 500 excellent examples might require 40 to 60 hours of expert time. This investment typically yields better results than months spent cleaning and processing massive low-quality datasets.

6. Evaluating Success: Beyond the Vibe Check

“If you cannot measure it, you cannot improve it.” This principle applies directly to fine-tuning, yet many teams rely solely on subjective assessment.

"If you cannot measure it, you cannot improve it." This principle applies directly to fine-tuning, yet many teams rely solely on subjective assessment.

Rigorous evaluation requires multiple approaches. Automated metrics like BLEU and ROUGE scores measure surface-level similarity between generated outputs and reference texts. These work well for translation or summarisation tasks but struggle with open-ended generation where multiple correct answers exist. 
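As a quick illustration, the open-source rouge-score package computes these overlap metrics in a few lines. The strings below are invented, and a high score indicates shared wording, not factual correctness.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Surface-level similarity between a model output and a reference text.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Refunds are processed within five working days.",  # reference
    "We process refunds within five working days.",     # model output
)
print(scores["rougeL"].fmeasure)  # 1.0 = identical word sequences, 0.0 = none shared
```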

Human evaluation remains essential for nuanced quality assessment. Create rubrics that score specific attributes: accuracy, tone consistency, instruction following, and format compliance. Use multiple evaluators to reduce bias. Track scores across iterations to measure genuine improvement. 

LLM-as-a-Judge represents an emerging middle ground. You prompt a capable model, often a different one than you are fine-tuning, to evaluate outputs against defined criteria. This scales better than human evaluation while capturing nuance that automated metrics miss. However, it requires careful prompt engineering and validation against human judgement. 
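A minimal sketch of the pattern, assuming a hypothetical `call_judge_model` function wrapping whichever capable model you trust as the judge:

```python
import json

RUBRIC = """Score the RESPONSE from 1 to 5 on each criterion:
- accuracy: faithful to the SOURCE material
- tone: matches the brand voice guidelines
- format: follows the required output structure
Reply with JSON only, e.g. {"accuracy": 4, "tone": 5, "format": 3}."""

def judge(source: str, response: str) -> dict:
    # Ask a separate, capable model to grade against the rubric.
    prompt = f"{RUBRIC}\n\nSOURCE:\n{source}\n\nRESPONSE:\n{response}"
    raw = call_judge_model(prompt)  # hypothetical judge-model API call
    return json.loads(raw)

# Spot-check judge scores against human ratings on a sample before
# trusting them across the whole evaluation set.
```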

7. Iteration Cycles: Timeline and Cost Realities

The first iteration rarely produces production-quality results. You will identify gaps in your training data, discover edge cases you missed, and recognise behaviours you want to adjust. Budget for this iteration cycle from the start. 

Cost structures vary significantly between approaches. Fine-tuning through services like OpenAI’s API involves per-token training costs plus ongoing inference costs for your custom model. Fine-tuning open models like Llama requires compute infrastructure, whether cloud GPU instances or on-premises hardware, plus engineering time for deployment and maintenance. 

A realistic first-project budget should cover data preparation, which often consumes 40 to 50 percent of total effort; training compute, typically hundreds to thousands of pounds per iteration depending on model size; evaluation time, both human and automated; and infrastructure costs if you deploy your own model.
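For a feel of the arithmetic, here is a back-of-envelope estimate using the illustrative figures from this section. Every number is an assumption to replace with your own.

```python
# Back-of-envelope first-project budget. All figures are assumptions.
iterations = 4                   # plan for three to five cycles
training_cost_per_run = 800      # GBP, mid-range of "hundreds to thousands"
data_prep_hours = 50             # 40-60 expert hours for ~500 quality examples
expert_hourly_rate = 90          # GBP, assumed
eval_hours_per_cycle = 10        # human review of each iteration

total = (
    iterations * training_cost_per_run
    + data_prep_hours * expert_hourly_rate
    + iterations * eval_hours_per_cycle * expert_hourly_rate
)
print(f"Estimated first-project budget: £{total:,}")  # → £11,300
```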

8. Open Versus Closed: Strategic Considerations

The choice between fine-tuning an open model like Llama and fine-tuning through a service like OpenAI’s GPT platform involves tradeoffs beyond pure capability. 

Service-based fine-tuning offers simplicity. You upload training data, configure parameters through an interface, and receive a hosted endpoint. No infrastructure management, no deployment complexity. The tradeoffs include ongoing per-request costs, limited customisation options, and dependency on the provider’s roadmap and policies. 
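In practice, the whole workflow can be a few API calls. The sketch below follows OpenAI's fine-tuning API as one example; the file name and base model identifier are illustrative, and model availability changes, so check the current documentation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file, then start a fine-tuning job.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative base model; verify availability
)
print(job.id)  # poll the job; on success it returns your custom model name
```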

Open model fine-tuning provides control. You own the weights, choose your infrastructure, and can modify anything. You can deploy on-premises for data sovereignty, provided you have selected the right Hardware for On-Premise Inference to handle the training and inference load. The tradeoffs include significant engineering overhead, responsibility for security and scaling, and the need to stay current with rapidly evolving tooling. 

For most organisations without dedicated machine learning infrastructure, service-based fine-tuning provides the faster path to value. For those with specific compliance requirements, cost optimisation at scale, or strategic reasons to own their AI capabilities, open models warrant serious consideration. 

9. Making the Decision

Is your goal factual knowledge or behavioural change? If knowledge, start with RAG. If behaviour, fine-tuning may be appropriate. 

Do you have 500+ high-quality examples of your desired behaviour? If not, invest in data creation before considering fine-tuning. 

Have you maximised what prompt engineering can achieve? Often, well-crafted system prompts and few-shot examples deliver 80 percent of the desired improvement at 10 percent of the cost. Before committing to training runs, ensure you have exhausted prompting, modern Agentic IDEs, and systematic workflow design.

Can you define measurable success criteria? Without clear metrics, you cannot know if fine-tuning actually improved anything. 

Do you have budget for iteration? Plan for three to five cycles, not one. 
