The Hybrid Strategy: On-Premise GPUs vs. Cloud APIs
The Case for Starting in the Cloud
The argument for cloud-first AI infrastructure is straightforward: you pay only for what you use, you avoid capital expenditure, and you can scale instantly when demand spikes. For organisations experimenting with large language models or building their first AI-powered features, cloud APIs from providers like OpenAI, Anthropic, and Google represent the rational choice. There is no hardware to procure, no data centre space to negotiate, and no MLOps team to hire before you can send your first inference request.
This flexibility comes at a price, however. Cloud inference costs scale linearly with usage. A summarisation endpoint that costs £200 per month during development can balloon to £15,000 monthly once deployed to production traffic. The economics that made cloud attractive during experimentation become punitive at scale.
The hybrid strategy acknowledges both realities. Cloud remains unbeatable for getting started, for handling unpredictable traffic bursts, and for accessing the most advanced proprietary models. On-premise infrastructure wins when workloads stabilise, volumes grow large, and the same model runs thousands of times daily on predictable inputs. Most mature AI operations will inevitably arrive at a hybrid architecture. The question is not whether to adopt this approach, but when and how.
Choosing Your On-Premise Hardware: A100, H100, or L40S
A100
The A100 remains the workhorse of enterprise AI infrastructure. With 80GB of HBM2e memory and mature software support, it handles most inference tasks efficiently. Prices have dropped significantly since the H100's release, making refurbished and secondary market A100s attractive for cost-conscious deployments. For models up to 70 billion parameters running standard inference, the A100 delivers excellent value.
H100
The H100 represents the current performance leader. Its Transformer Engine and fourth-generation Tensor Cores deliver roughly two to three times the inference throughput of an A100 for transformer-based models. The premium price, typically three to four times that of an A100, makes sense only when inference volume justifies the investment or when latency requirements demand maximum performance. Organisations running real-time inference at scale, particularly those serving customer-facing applications with strict response time SLAs, find the H100's performance premium worthwhile.
L40S
The L40S offers an intriguing middle path. Built on the Ada Lovelace architecture, it provides strong inference performance at a significantly lower price point than the H100. Its 48GB of GDDR6 memory suits models up to approximately 40 billion parameters. For inference-heavy workloads that do not require the absolute fastest response times, the L40S delivers compelling economics.
| Specification | A100 (80GB) | H100 (80GB) | L40S (48GB) |
|---|---|---|---|
| Memory | 80GB HBM2e | 80GB HBM3 | 48GB GDDR6 |
| Memory Bandwidth | 2.0 TB/s | 3.35 TB/s | 864 GB/s |
| Inference Throughput | Baseline | 2-3x A100 | 1.2-1.5x A100 |
| Relative Price | £ | £££ | ££ |
| Best For | General inference, cost optimisation | High-volume, low-latency production | Mid-tier inference, budget constraints |
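One way to sanity-check the parameter-count guidance above: model weights dominate VRAM, at roughly one gigabyte per billion parameters per byte of precision, plus headroom for the KV cache and runtime buffers. The sketch below is a back-of-the-envelope estimator under that assumption; the 20% overhead factor is illustrative, not a measured figure.

```python
def fits_in_vram(params_billion: float, bytes_per_param: float,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Rough check: do the weights, plus ~20% assumed headroom for the KV cache
    and runtime buffers, fit in a single GPU's memory?"""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * overhead <= vram_gb

print(fits_in_vram(70, 2.0, 80))   # False: 70B at FP16 needs ~140GB -> multi-GPU or quantisation
print(fits_in_vram(70, 0.5, 80))   # True: 70B quantised to 4-bit is ~35GB, comfortable on an A100
print(fits_in_vram(40, 0.5, 48))   # True: a ~40B model at 4-bit sits well within the L40S's 48GB
```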
The True Cost of On-Premise: Beyond Hardware
Hardware acquisition represents only a fraction of total on-premise AI infrastructure costs. A realistic total cost of ownership calculation must account for power consumption, cooling requirements, physical space, and the human expertise required to operate the system reliably.
Power Consumption forms the largest ongoing operational cost.
A single H100 draws approximately 700 watts under load. A modest four-GPU server running continuously consumes roughly 4kW, translating to roughly £4,500 to £9,000 annually at UK commercial electricity rates, depending on the tariff negotiated. This figure excludes the power required for cooling, networking equipment, and supporting infrastructure. Organisations must also consider power delivery infrastructure: high-density GPU deployments often require electrical upgrades to provide adequate amperage.
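The arithmetic is easy to reproduce, and the tariff is the sensitive variable, so the sketch below leaves it as an explicit parameter; the 13p and 25p per kWh figures are illustrative assumptions, not quoted rates.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation

def annual_power_cost(load_kw: float, tariff_gbp_per_kwh: float) -> float:
    """Annual electricity cost for a continuously loaded server,
    excluding cooling and supporting infrastructure."""
    return load_kw * HOURS_PER_YEAR * tariff_gbp_per_kwh

# Four H100s at ~700W each plus host overhead is roughly 4kW under sustained load.
for tariff in (0.13, 0.25):  # assumed tariffs of 13p and 25p per kWh; actual contracts vary
    print(f"{tariff:.2f} £/kWh -> £{annual_power_cost(4.0, tariff):,.0f} per year")
# 0.13 £/kWh -> £4,555 per year
# 0.25 £/kWh -> £8,760 per year
```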
Cooling Requirements scale with power consumption.
Every watt of electricity consumed by computing equipment eventually becomes heat that must be removed. Traditional air cooling struggles with GPU-dense deployments, pushing many organisations toward liquid cooling solutions. Retrofit liquid cooling can add £5,000 to £15,000 per rack, while purpose-built liquid-cooled enclosures cost significantly more. Even with efficient cooling, organisations in warmer climates or those lacking modern data centre facilities face elevated costs.
Physical Space carries both direct and opportunity costs.
A single 42U rack can accommodate two to four high-density GPU servers depending on configuration. Co-location fees in UK data centres range from £500 to £2,000 monthly per rack, varying by location, power density, and connectivity options.
Organisations with existing data centre space must weigh the opportunity cost of dedicating that space to AI infrastructure versus other uses.
MLOps Staffing often represents the largest hidden cost.
On-premise AI infrastructure requires expertise in hardware maintenance, driver management, model deployment, monitoring, and optimisation. A competent MLOps engineer in the UK commands £70,000 to £120,000 annually. Most organisations require at least partial dedicated headcount, even when leveraging automation and managed Kubernetes platforms. Underestimating this cost leads to reliability problems, suboptimal utilisation, and eventual migration back to managed cloud services.
Case Study: 60% Cost Reduction Through Local Inference
A mid-sized legal technology company approached their AI infrastructure decision in 2023 with a clear methodology. Their primary workload involved document summarisation: extracting key points from contracts, briefs, and correspondence. At peak, their system processed 50,000 documents daily, each requiring between one and three API calls to their cloud LLM provider.
Monthly cloud API costs had reached £45,000. The workload exhibited two characteristics that made it an ideal candidate for on-premise migration: high volume and predictable patterns. Document processing peaked during business hours and dropped to near zero overnight. The summarisation task itself was well-defined, with consistent input lengths and output requirements.
As the company's team described it: "We tested Llama 3 70B against our existing cloud provider on 10,000 representative documents. Quality scores were within 3% on our evaluation framework. The decision became purely economic."
The migration involved procuring two servers, each equipped with four A100 GPUs, deployed in existing co-located rack space. Total capital expenditure, including installation and initial configuration, reached approximately £180,000. Monthly operational costs, covering power, cooling, co-location fees, and allocated MLOps time, settled at approximately £8,000.
The arithmetic proved compelling. Previous monthly spend of £45,000 dropped to £8,000 in operational costs plus amortised hardware costs of approximately £7,500 monthly over a two-year depreciation period. Total monthly cost fell to roughly £15,500, representing a 65% reduction. The payback period on hardware investment came in under eight months.
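The same figures can be laid out explicitly; the two-year straight-line depreciation is the assumption stated above, and the payback calculation here nets new operating costs against the cloud spend avoided.

```python
capex = 180_000           # servers, installation, initial configuration (GBP)
cloud_monthly = 45_000    # previous cloud API spend (GBP per month)
opex_monthly = 8_000      # power, cooling, co-location, allocated MLOps time
depreciation_months = 24  # straight-line two-year depreciation, as stated

amortised = capex / depreciation_months        # £7,500 per month
total_monthly = opex_monthly + amortised       # £15,500 per month
reduction = 1 - total_monthly / cloud_monthly  # roughly 65%

# Payback on cash savings (cloud spend avoided minus new operating costs),
# comfortably inside the sub-eight-month figure quoted in the case study.
payback_months = capex / (cloud_monthly - opex_monthly)

print(f"Total monthly cost: £{total_monthly:,.0f} ({reduction:.1%} reduction)")
print(f"Payback period: ~{payback_months:.1f} months")
```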
Critically, the company retained cloud API access for overflow capacity and for tasks requiring the most advanced models. Approximately 5% of their inference volume, primarily complex legal reasoning tasks, continued routing to cloud endpoints. This hybrid approach provided both cost optimisation and capability assurance.
Workload Segmentation: The Decision Framework
Effective hybrid architecture requires systematic workload analysis. Not every inference task suits on-premise deployment, and forcing unsuitable workloads onto local infrastructure creates operational complexity without corresponding savings.
Route to the cloud when traffic patterns are unpredictable or highly variable. Marketing campaign launches, viral content responses, and seasonal business spikes create demand patterns that would require massive over-provisioning to handle on-premise. Cloud APIs absorb these bursts gracefully, charging only for actual usage. Similarly, workloads requiring the most advanced proprietary models, particularly those involving complex reasoning, code generation, or multimodal understanding, often perform best on cloud endpoints where providers deploy their latest architectures.
Route to on-premise when workloads are repetitive, high-volume, and stable. Document processing, content moderation, embedding generation, and standardised customer service responses typically exhibit these characteristics. If you can predict tomorrow's inference volume within 20% based on today's numbers, and that volume justifies dedicated hardware, on-premise deployment likely offers superior economics. Tasks that process sensitive data also benefit from on-premise deployment, helping you meet strict data sovereignty requirements in regulated environments.
The segmentation decision ultimately reduces to a simple question: does this workload’s volume and predictability justify dedicated infrastructure, or does the flexibility of per-request pricing better match its characteristics?
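As a concrete illustration of that question, the policy can be reduced to a few explicit checks. The thresholds and field names below are hypothetical, chosen to mirror the criteria discussed above rather than any production rule set.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    daily_volume: int            # average inference requests per day
    volume_variability: float    # e.g. 0.2 means tomorrow is predictable within +/-20%
    needs_frontier_model: bool   # complex reasoning, code generation, multimodal
    sensitive_data: bool         # data sovereignty or regulatory constraints

def route(workload: Workload, onprem_breakeven_volume: int = 20_000) -> str:
    """Hypothetical segmentation heuristic mirroring the framework above."""
    if workload.sensitive_data:
        return "on-premise"      # sovereignty requirements trump pure economics
    if workload.needs_frontier_model:
        return "cloud"           # latest proprietary architectures live in the cloud
    if (workload.volume_variability <= 0.2
            and workload.daily_volume >= onprem_breakeven_volume):
        return "on-premise"      # stable, high-volume, repetitive work
    return "cloud"               # bursty or low-volume work

print(route(Workload(50_000, 0.1, False, False)))  # on-premise
print(route(Workload(2_000, 0.8, True, False)))    # cloud
```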
Building Your Hybrid Architecture
Standardise Your Inference Interface.
Abstract your AI capabilities behind a unified API layer that routes requests based on workload type, current capacity, and cost optimisation rules. Tools like LiteLLM, OpenRouter, or custom gateway implementations enable applications to remain agnostic to whether inference runs locally or in the cloud. This abstraction simplifies future migrations and enables real-time routing decisions. To manage this abstraction layer effectively, we recommend adopting a model router architecture that dynamically switches between local and cloud endpoints based on traffic.
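A minimal version of such a gateway is sketched below, assuming both the local server (for example a vLLM deployment) and the cloud provider expose OpenAI-compatible chat endpoints; the URLs, model names, and environment variable are placeholders.

```python
import os
from openai import OpenAI  # any OpenAI-compatible client works here

# Placeholder endpoints: a local OpenAI-compatible server and a cloud provider.
LOCAL = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
CLOUD = OpenAI(api_key=os.environ["CLOUD_API_KEY"])

ROUTES = {
    "summarisation": (LOCAL, "llama-3-70b-instruct"),  # high-volume, stable workload
    "complex_reasoning": (CLOUD, "gpt-4o"),            # frontier-model tasks
}

def complete(workload: str, prompt: str) -> str:
    """Applications call this one function; routing stays invisible to them."""
    client, model = ROUTES[workload]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```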
Implement Robust Monitoring.
Track latency, throughput, error rates, and costs across both environments. Establish alerting thresholds that trigger automatic failover when on-premise capacity saturates or experiences degradation. Cost monitoring should provide daily visibility into spend by workload category, enabling rapid identification of optimisation opportunities.
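A lightweight starting point is to record cost and latency per request and check them against alert thresholds; the per-token prices and the two-second threshold below are illustrative assumptions, not benchmarks.

```python
import time
from collections import defaultdict

# Illustrative per-1K-token costs; substitute your actual contract and TCO figures.
COST_PER_1K_TOKENS = {"cloud": 0.01, "on-premise": 0.002}
LATENCY_ALERT_SECONDS = 2.0

daily_spend = defaultdict(float)

def record(route: str, tokens: int, started_at: float) -> None:
    """Accumulate spend per route and flag slow requests."""
    daily_spend[route] += tokens / 1000 * COST_PER_1K_TOKENS[route]
    latency = time.monotonic() - started_at
    if latency > LATENCY_ALERT_SECONDS:
        print(f"ALERT: {route} request took {latency:.1f}s")

# Example: a 1,200-token on-premise request
start = time.monotonic()
record("on-premise", 1200, start)
print(dict(daily_spend))  # {'on-premise': 0.0024}
```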
Plan for Failure.
On-premise infrastructure fails. Hardware degrades, software updates introduce regressions, and cooling systems malfunction. Design your routing layer to automatically redirect traffic to cloud endpoints when local inference becomes unavailable. This resilience transforms on-premise deployment from a single point of failure into a cost optimisation layer with graceful degradation characteristics.
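Building on the gateway sketch above (reusing its `complete` function and `CLOUD` client, both placeholders), failover can be as simple as catching local errors and retrying the request against a cloud route; the broad exception handler here is deliberately coarse for illustration.

```python
def complete_with_failover(workload: str, prompt: str) -> str:
    """Try the configured route first; fail over to the cloud if local inference errors out."""
    try:
        return complete(workload, prompt)
    except Exception as exc:  # in practice, narrow this to connection and timeout errors
        print(f"Local route failed ({exc!r}); failing over to cloud")
        response = CLOUD.chat.completions.create(
            model="gpt-4o",  # placeholder fallback model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```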
Start Small and Expand.
Begin hybrid deployment with a single, well-understood workload. Validate operational procedures, confirm cost projections, and build team expertise before expanding scope. The legal technology case study succeeded partly because the team focused exclusively on summarisation before considering other workloads.