LLMs in Production: Lessons from the First Wave
After deploying large language models for enterprise clients, here's what we've learned about prompt engineering, retrieval-augmented generation, and the true cost of running these systems.
The hype cycle for large language models has been extraordinary. Every company wants an AI assistant, a document processor, a content generator. The technology is genuinely impressive. But the gap between "impressive demo" and "reliable production system" remains substantial.
Over the past 18 months, we've deployed LLM-based systems for document analysis, customer support, code generation, and knowledge management. Here are the lessons that have cost real money to learn.
Lesson 1: Prompts Are Code, So Treat Them That Way
In the early days, prompts lived in application code as string literals. When they needed to change, developers would edit the code, test locally, and deploy. This works for simple applications. It falls apart at scale.
We now treat prompts as versioned artifacts separate from application code:
- Prompts live in a prompt registry with version history
- Changes go through review, just like code
- A/B testing infrastructure allows comparing prompt variants
- Rollback is instant—no code deployment required
- Prompt performance metrics are tracked over time
The overhead is real, but so are the benefits. When an LLM provider silently updates their model and your carefully tuned prompts stop working, you need to iterate fast. That's much easier when prompts are decoupled from application releases.
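The core of this setup is small. Here's a minimal sketch of a file-backed registry; the class and method names are ours for illustration, not any particular product's API:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

class PromptRegistry:
    """File-backed prompt store with version history and instant rollback."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def publish(self, name: str, template: str) -> int:
        """Write a new immutable version and mark it active."""
        version = max(self._versions(name), default=0) + 1
        record = {"template": template,
                  "published_at": datetime.now(timezone.utc).isoformat()}
        (self.root / f"{name}.v{version}.json").write_text(json.dumps(record))
        # The active version is just a pointer file, so switching it is
        # instant and needs no code deployment.
        (self.root / f"{name}.active").write_text(str(version))
        return version

    def rollback(self, name: str, version: int) -> None:
        """Point the active version back at an earlier one."""
        if version not in self._versions(name):
            raise ValueError(f"unknown version {version} for {name!r}")
        (self.root / f"{name}.active").write_text(str(version))

    def get(self, name: str) -> str:
        """Fetch the currently active template."""
        version = int((self.root / f"{name}.active").read_text())
        record = json.loads((self.root / f"{name}.v{version}.json").read_text())
        return record["template"]

    def _versions(self, name: str) -> list[int]:
        return [int(p.stem.rsplit(".v", 1)[-1])
                for p in self.root.glob(f"{name}.v*.json")]
```

From here, A/B testing is a matter of serving two published versions to different traffic slices and comparing the metrics you already track.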
Lesson 2: RAG Is Harder Than It Looks
Retrieval-augmented generation—grounding LLM responses in your own documents—sounds simple. Embed your documents, find relevant chunks, stuff them into context, generate. The demo works beautifully. Production is another story.
The problems we've encountered:
- Chunking matters enormously. Chunk too small and you lose context; chunk too large and you exceed context windows or retrieve irrelevant content. The optimal strategy varies by document type (see the minimal chunker after this list).
- Embedding models have blind spots. A query about "revenue" might not retrieve documents that say "sales" or "income." Synonymy, acronyms, and domain-specific terminology all cause retrieval failures.
- Relevance scoring is unreliable. Cosine similarity doesn't correspond to human intuition about relevance. We've moved to hybrid retrieval that combines embeddings with keyword search and re-ranking (sketched below, after the chunking example).
- Freshness is complex. When documents are updated, embeddings need to be recomputed. But incremental updates are tricky when chunks overlap.
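To make the chunking tradeoff concrete, here's the simplest possible sliding-window chunker, assuming word-based splitting. Production chunkers split on document structure such as headings and paragraphs, but the parameters to fight over are the same:

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Sliding-window chunker over words. The defaults are starting points,
    not recommendations; the right size genuinely varies by document type,
    and the overlap is exactly what makes incremental re-embedding tricky."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```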
We now budget 2-3x the expected time for RAG implementations. The "naive" version gets you 70% of the way there. The last 30% is where the actual work lives.
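Much of that last 30%, in our experience, is retrieval work. Here is a stripped-down sketch of the hybrid scoring idea, assuming you already have an embedding for each chunk; the blend weight and helper names are illustrative:

```python
import math
from collections import Counter

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def keyword_overlap(query: str, text: str) -> float:
    """Crude lexical signal: fraction of query terms present in the chunk.
    Production systems use BM25 here, but the role is the same."""
    terms = Counter(query.lower().split())
    chunk_words = set(text.lower().split())
    hits = sum(n for term, n in terms.items() if term in chunk_words)
    total = sum(terms.values())
    return hits / total if total else 0.0

def hybrid_search(query: str, query_vec: list[float], chunks,
                  alpha: float = 0.6, top_k: int = 5):
    """Blend dense and lexical scores, then take the top_k as re-rank input.
    chunks is a list of (text, embedding) pairs; alpha is tuned per corpus."""
    scored = [(alpha * cosine(query_vec, vec)
               + (1 - alpha) * keyword_overlap(query, text), text)
              for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In production the lexical side is BM25 rather than raw term overlap, and the top results go through a cross-encoder re-ranker, but the shape is the same: no single relevance signal is trusted on its own.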
Lesson 3: The True Cost of LLMs
Token pricing is deceptively simple. OpenAI charges X per 1K tokens, Anthropic charges Y, etc. But the true cost of LLM operations includes much more:
- Context engineering: Larger contexts improve quality but increase cost linearly. A RAG system that retrieves 10 chunks instead of 5 roughly doubles its input costs.
- Retry handling: Rate limits, timeouts, and transient failures require retries. Budget for 10-15% overhead from retries alone (see the backoff sketch after this list).
- Model selection: Using GPT-4 when GPT-3.5 suffices is pure waste. But many teams default to the most capable model for every use case.
- Development and testing: Prompt iteration consumes tokens. Evaluation pipelines consume tokens. Regression testing consumes tokens.
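Retry overhead is easy to underestimate because every retry re-sends the full input. A minimal backoff wrapper, where TransientError stands in for whatever rate-limit and timeout exceptions your client library actually raises:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for the rate-limit and timeout errors your client raises."""

def with_retries(call, max_attempts: int = 4, base_delay: float = 1.0):
    """Exponential backoff with jitter. Every retry re-spends the input
    tokens, so this overhead shows up on the invoice, not just in latency."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))

# Usage, with call_model standing in for your own client wrapper:
# with_retries(lambda: call_model(prompt))
```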
For one client, we reduced LLM costs by 60% simply by implementing a routing layer that sent simple queries to cheaper models and reserved expensive models for complex tasks. No quality degradation—just smarter resource allocation.
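The router doesn't need to be sophisticated to pay for itself. A deliberately naive sketch; the model identifiers are placeholders, and a production version would replace the heuristic with a cheap classifier trained on labeled traffic:

```python
def estimate_complexity(query: str) -> float:
    """Placeholder heuristic: longer, multi-part questions score higher.
    A real router learns this from labeled production queries."""
    score = min(len(query) / 500, 1.0)
    if any(m in query.lower() for m in ("compare", "why", "step by step")):
        score += 0.3
    return min(score, 1.0)

def route(query: str) -> str:
    """Send cheap traffic to a cheap model. The names are placeholders for
    whatever model tiers you actually run."""
    return "capable-model" if estimate_complexity(query) > 0.5 else "cheap-model"
```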
Lesson 4: Evaluation Is the Hard Part
How do you know if your LLM system is working? Traditional ML has clear metrics: accuracy, precision, recall, AUC. LLM outputs are free-form text. "Good" is subjective and context-dependent.
We've settled on a multi-layer evaluation approach:
- Automated checks: Format validation, safety filters, factual consistency where verifiable. Catches obvious failures.
- LLM-as-judge: Using a separate LLM to evaluate outputs against criteria. Surprisingly effective when criteria are well-defined.
- Human evaluation: Random sampling with expert review. Expensive but necessary for high-stakes applications.
- Implicit feedback: User behavior signals. Do they accept the response? Edit it? Regenerate? Complain?
No single approach is sufficient. We build evaluation pipelines that combine all four, with dashboards that surface problems before users complain.
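A skeleton of how the layers compose, in cost order. Every name here is illustrative; evaluate takes a judge_call function you supply, since the rubric prompt behind it is where the real work lives:

```python
import random
from dataclasses import dataclass

@dataclass
class EvalResult:
    format_ok: bool
    judge_score: float | None = None
    needs_human_review: bool = False

def passes_format_checks(output: str) -> bool:
    """Layer 1: cheap automated checks catch the obvious failures for free."""
    return bool(output.strip()) and len(output) < 4000

def evaluate(output: str, judge_call, sample_rate: float = 0.02) -> EvalResult:
    """Run the layers in cost order. Layer 4 (implicit feedback) arrives
    later from user behavior logs, outside this synchronous path."""
    result = EvalResult(format_ok=passes_format_checks(output))
    if not result.format_ok:
        return result  # don't spend judge tokens on obvious failures
    result.judge_score = judge_call(output)                     # layer 2: LLM-as-judge
    result.needs_human_review = random.random() < sample_rate  # layer 3: sampling
    return result
```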
Lesson 5: The Model Is the Least Interesting Part
Teams obsess over which model to use. GPT-4 vs. Claude vs. Gemini. The debates are endless. And they largely miss the point.
In our experience, the choice of foundation model explains maybe 20% of system quality. The other 80% comes from:
- Quality and organization of source documents
- Retrieval pipeline effectiveness
- Prompt design and context construction
- Post-processing and validation
- Error handling and fallback behavior
A well-engineered system using GPT-3.5 will outperform a naive implementation using GPT-4. Spend your optimization budget on the system, not the model.
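One example of what that system-level 80% looks like: a fallback chain where a validator, not the model choice, determines output quality. Here call and validate are placeholders for whatever your stack provides; the escalation logic is the point:

```python
def generate_with_fallback(prompt: str, models: list[str], call, validate):
    """Try models cheapest-first; return the first output that passes
    validation. call(model, prompt) and validate(text) are supplied by
    the caller."""
    output = None
    for model in models:  # e.g. ["cheap-model", "capable-model"]
        output = call(model, prompt)
        if validate(output):
            return output
    return output  # nothing passed validation; the caller decides what to do
```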
The Path Forward
LLMs are genuinely transformative. They enable applications that were impossible two years ago. But they also require new engineering practices, new operational capabilities, and new ways of thinking about system reliability.
Organizations that will succeed are those that:
- Treat LLM applications as ML systems requiring MLOps infrastructure
- Invest in evaluation before scaling
- Build for observability and rapid iteration
- Accept that the technology is immature and design for change
The first wave of LLM deployments has taught us what doesn't work. The next wave will be built on those lessons.
Building LLM applications?
We help organizations deploy LLMs reliably and cost-effectively. From RAG pipelines to evaluation infrastructure, we've seen what works.
Schedule a Conversation