To Fine-Tune or Not to Fine-Tune: Customising Your LLM
2. The Most Expensive Misconception in AI
When business leaders hear about fine-tuning large language models, many picture something like uploading a company handbook and expecting the AI to memorise every policy, procedure, and product specification. This mental model feels intuitive. After all, that is how we humans learn new information.
Here is the reality that can save your organisation significant time and money: fine-tuning does not teach facts. It teaches behaviour. Think of it as the difference between hiring someone for their industry knowledge and training an employee to communicate in your company’s voice. They are fundamentally different objectives requiring different approaches.
This distinction matters enormously because it determines which tool you should reach for. Retrieval Augmented Generation, commonly called RAG, excels at injecting current, factual knowledge into your AI’s responses. Fine-tuning excels at changing how the model expresses itself, follows instructions, or maintains a consistent persona. Confusing these two approaches leads to wasted budgets, frustrated teams, and underwhelming results.
3. RAG Versus Fine-Tuning: Choosing the Right Tool
The simplest way to understand when to use each approach is to ask yourself one question: are you trying to give the model new information, or are you trying to change how it behaves?
| Objective | Best Approach |
| --- | --- |
| Respond in your brand voice consistently | Fine-Tuning |
| Inject current, factual knowledge (press releases, policy updates, news) | RAG |
| Always produce JSON in your schema | Fine-Tuning |
| Handle domain-specific instructions (industry terminology, workflow patterns) | Fine-Tuning |
RAG works by retrieving relevant documents at query time and providing them as context. The model reads your information fresh with every request. Fine-tuning works by adjusting the model’s underlying parameters through training on examples, changing its default behaviours permanently.
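To make the mechanical difference concrete, here is a minimal sketch of the RAG pattern in Python. The `search_index` and `llm` objects are hypothetical stand-ins for whatever retrieval layer and model client you use, not a specific vendor’s API.

```python
def answer_with_rag(query: str, search_index, llm) -> str:
    # Step 1: fetch the most relevant documents for this query
    documents = search_index.top_k(query, k=3)
    context = "\n\n".join(doc.text for doc in documents)

    # Step 2: the model reads the retrieved context fresh on every request;
    # its weights never change, so updating knowledge means updating the index
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm.complete(prompt)
```

Fine-tuning, by contrast, leaves the prompt alone and changes the weights themselves through training, which is why its effects persist across every request.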
4. The Power of Hybrid Approaches
In practice, the strongest production systems combine both techniques in a single pipeline that works in three stages.
Stage One: Retrieval
The system fetches relevant documents, market data, and compliance requirements based on the query.
Stage Two: Generation with Tuned Behaviour
A fine-tuned model processes this retrieved context, automatically applying the correct tone, structure, and required disclaimers.
Stage Three: Validation
Output passes through compliance checks before delivery.
This hybrid pattern delivers both accuracy and consistency. The RAG component ensures information remains current without retraining. The fine-tuned model ensures every response meets your standards without manual editing.
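Sketched in code, the pipeline might look like the following; the retriever, tuned model, and compliance checker are placeholder interfaces for your own components, not a specific framework.

```python
def hybrid_respond(query: str, retriever, tuned_model, compliance_check) -> str:
    # Stage one: retrieval keeps facts current without any retraining
    context = retriever.fetch(query)  # documents, market data, requirements

    # Stage two: the fine-tuned model applies tone, structure, disclaimers
    draft = tuned_model.generate(query=query, context=context)

    # Stage three: validation before delivery
    ok, issues = compliance_check(draft)
    if not ok:
        raise ValueError(f"Compliance check failed: {issues}")
    return draft
```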
5. Data Requirements: Quality Over Quantity
The Volume Myth
Some teams assume they need tens of thousands of examples. They scrape logs, export chat histories, and compile massive datasets of mediocre quality. These efforts typically produce mediocre results.
The Quality Reality
Research consistently shows that 500 to 1,000 carefully crafted, high-quality example pairs often outperform 100,000 raw training lines. Each example should demonstrate exactly the behaviour you want the model to learn.
What Makes a High-Quality Example
- Clear input that represents real use cases.
- Output that perfectly demonstrates your desired response.
- Consistent formatting across all examples.
- Deliberate inclusion of edge cases and variations.
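For illustration, here is what one such example pair might look like in the chat-style JSONL format most fine-tuning services accept. The company, policy, and URL are invented for the sketch; the point is the consistent structure repeated across every example.

```python
import json

# One hypothetical training pair: realistic input, ideal output, brand voice
example = {
    "messages": [
        {"role": "system", "content": "You are AcmeCo's support assistant. "
                                      "Reply in a warm, concise brand voice."},
        {"role": "user", "content": "Can I return a product after 30 days?"},
        {"role": "assistant", "content": "Thanks for asking! Returns are "
            "accepted up to 60 days with proof of purchase. Pop into any "
            "store or start the process at acmeco.example/returns."},
    ]
}

# One JSON object per line; 500 to 1,000 of these, consistently formatted
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```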
The Investment Calculation
Creating 500 excellent examples might require 40 to 60 hours of expert time. This investment typically yields better results than months spent cleaning and processing massive low-quality datasets.
6. Evaluating Success: Beyond the Vibe Check
“If you cannot measure it, you cannot improve it.” This principle applies directly to fine-tuning, yet many teams rely solely on subjective assessment.
"If you cannot measure it, you cannot improve it." This principle applies directly to fine-tuning, yet many teams rely solely on subjective assessment.
Rigorous evaluation requires multiple approaches. Automated metrics like BLEU and ROUGE scores measure surface-level similarity between generated outputs and reference texts. These work well for translation or summarisation tasks but struggle with open-ended generation where multiple correct answers exist.
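As a quick illustration, the open-source `rouge-score` package computes these overlap metrics in a few lines; the reference and candidate sentences below are toy examples.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The policy allows returns within 60 days with proof of purchase."
candidate = "Returns are accepted for 60 days if you have proof of purchase."

# Scores measure n-gram and longest-common-subsequence overlap only;
# a fluent paraphrase with different wording can score poorly
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```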
Human evaluation remains essential for nuanced quality assessment. Create rubrics that score specific attributes: accuracy, tone consistency, instruction following, and format compliance. Use multiple evaluators to reduce bias. Track scores across iterations to measure genuine improvement.
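A simple way to operationalise such a rubric is to record each evaluator’s scores per attribute and average them; a minimal sketch follows, where the attribute names mirror the rubric above and the 1 to 5 scale is an assumption.

```python
from statistics import mean

# Attribute names mirror the rubric above; the 1-5 scale is an assumption
ATTRIBUTES = ["accuracy", "tone_consistency",
              "instruction_following", "format_compliance"]

def aggregate(ratings):
    """Average each attribute across evaluators to dampen individual bias.

    `ratings` is a list of dicts, one per evaluator, mapping attribute
    name to a 1-5 score for a single model output.
    """
    return {attr: mean(r[attr] for r in ratings) for attr in ATTRIBUTES}

# Two evaluators scoring the same output; track these across iterations
print(aggregate([
    {"accuracy": 5, "tone_consistency": 4,
     "instruction_following": 5, "format_compliance": 4},
    {"accuracy": 4, "tone_consistency": 4,
     "instruction_following": 5, "format_compliance": 5},
]))
```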
LLM-as-a-Judge represents an emerging middle ground. You prompt a capable model, often a different one than you are fine-tuning, to evaluate outputs against defined criteria. This scales better than human evaluation while capturing nuance that automated metrics miss. However, it requires careful prompt engineering and validation against human judgement.
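A hedged sketch of the pattern using the OpenAI Python SDK appears below; the judge model, criteria, and scoring scale are illustrative choices rather than a fixed recipe, and the scores should be validated against human ratings before you trust them.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate the RESPONSE against each criterion from 1 to 5:
1. Accuracy relative to the REFERENCE
2. Tone consistency with the style guide
3. Format compliance (valid JSON where required)
Return one line per criterion: <name>: <score> - <one-sentence reason>.

REFERENCE: {reference}
RESPONSE: {response}"""

def judge(reference: str, response: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o",  # ideally a different model from the one being tuned
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(reference=reference,
                                                  response=response)}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return result.choices[0].message.content
```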
7. Iteration Cycles: Timeline and Cost Realities
The first iteration rarely produces production-quality results. You will identify gaps in your training data, discover edge cases you missed, and recognise behaviours you want to adjust. Budget for this iteration cycle from the start.
Cost structures vary significantly between approaches. Fine-tuning through services like OpenAI’s API involves per-token training costs plus ongoing inference costs for your custom model. Fine-tuning open models like Llama requires compute infrastructure, whether cloud GPU instances or on-premises hardware, plus engineering time for deployment and maintenance.
A realistic first-project budget should include: data preparation, which often consumes 40 to 50 percent of total effort; training compute, typically hundreds to thousands of pounds per iteration depending on model size; evaluation time, both human and automated; and infrastructure costs if you deploy your own model.
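To see how the compute line item behaves, here is a back-of-envelope calculation in which every rate is a deliberate placeholder; substitute your provider’s current pricing before relying on it.

```python
# Every rate below is a placeholder; substitute your provider's pricing.
TOKENS_PER_EXAMPLE = 600        # assumed average prompt + completion length
EXAMPLES = 750                  # mid-point of the 500-1,000 guideline
EPOCHS = 3                      # passes over the training set
PRICE_PER_M_TOKENS = 20.0       # hypothetical GBP per million training tokens

training_tokens = TOKENS_PER_EXAMPLE * EXAMPLES * EPOCHS
compute_cost = training_tokens / 1_000_000 * PRICE_PER_M_TOKENS

DATA_PREP_HOURS = 50            # the 40-60 expert hours noted earlier
HOURLY_RATE = 60.0              # assumed cost of expert time in GBP

print(f"Training tokens:      {training_tokens:,}")
print(f"Compute / iteration:  £{compute_cost:,.2f}")   # small hosted model
print(f"Data preparation:     £{DATA_PREP_HOURS * HOURLY_RATE:,.2f}")
# Larger models or self-hosted GPU runs push compute per iteration into
# the hundreds-to-thousands range quoted above.
```

Even with placeholder rates, the shape of the result holds: data preparation, not compute, usually dominates a first project.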
8. Open Versus Closed: Strategic Considerations
The choice between fine-tuning an open model like Llama and fine-tuning through a service like OpenAI’s GPT platform involves tradeoffs beyond pure capability.
Service-based fine-tuning offers simplicity. You upload training data, configure parameters through an interface, and receive a hosted endpoint. No infrastructure management, no deployment complexity. The tradeoffs include ongoing per-request costs, limited customisation options, and dependency on the provider’s roadmap and policies.
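The whole service-based flow reduces to a few SDK calls, sketched below with the OpenAI Python client; the base model name is illustrative, so check the provider’s documentation for currently supported models.

```python
from openai import OpenAI

client = OpenAI()

# 1. Upload the JSONL training file
training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)

# 2. Launch the fine-tuning job against a supported base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative; check current support
)
print(f"Job started: {job.id}")      # poll until the job status succeeds

# 3. Once complete, job.fine_tuned_model names a hosted endpoint you call
#    like any other model, paying per request
```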
Open model fine-tuning provides control. You own the weights, choose your infrastructure, and can modify anything. You can deploy on-premises for data sovereignty, provided you have selected the right Hardware for On-Premise Inference to handle the training and inference load. The tradeoffs include significant engineering overhead, responsibility for security and scaling, and the need to stay current with rapidly evolving tooling.
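For the open-model route, parameter-efficient methods such as LoRA keep the engineering overhead manageable. The sketch below uses Hugging Face `transformers` and `peft`; the model name and hyperparameters are illustrative, and a real run also needs a tokenised dataset and suitable GPU hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative open model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small adapter matrices instead of all weights, cutting
# memory and compute requirements dramatically
config = LoraConfig(r=16, lora_alpha=32,
                    target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the total

# From here: wrap in a transformers Trainer (or trl's SFTTrainer) with your
# tokenised examples, train, then merge or serve the adapter weights.
```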
For most organisations without dedicated machine learning infrastructure, service-based fine-tuning provides the faster path to value. For those with specific compliance requirements, cost optimisation at scale, or strategic reasons to own their AI capabilities, open models warrant serious consideration.
9. Making the Decision
Is your goal factual knowledge or behavioural change? If knowledge, start with RAG. If behaviour, fine-tuning may be appropriate.
Do you have 500+ high-quality examples of your desired behaviour? If not, invest in data creation before considering fine-tuning.
Have you maximised what prompt engineering can achieve? Well-crafted system prompts and few-shot examples often deliver 80 percent of the desired improvement at 10 percent of the cost, and pairing them with systematic workflow design in modern Agentic IDEs can make training runs unnecessary (see the sketch after this checklist).
Do you have budget for iteration? Plan for three to five cycles, not one.
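Before committing to any training run, it is worth seeing how far the zero-training baseline gets you. The sketch below shows a system prompt plus one few-shot pair using the OpenAI SDK; all content is invented for illustration.

```python
from openai import OpenAI

client = OpenAI()

messages = [
    # System prompt carries the brand voice instructions
    {"role": "system", "content": "You are AcmeCo's assistant. Warm, concise, "
                                  "British English. Always end with a next step."},
    # Few-shot pair demonstrating the target behaviour
    {"role": "user", "content": "Do you ship to Ireland?"},
    {"role": "assistant", "content": "We do! Delivery takes 3-5 working days. "
                                     "Shall I check the cost for your order?"},
    # The live query
    {"role": "user", "content": "Can I change my delivery address?"},
]
reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(reply.choices[0].message.content)
```

If this baseline already meets your quality bar, fine-tuning may be solving a problem you no longer have.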