Evaluation: The Gatekeeper to Production
Before deploying any language model-powered application, rigorous evaluation is non-negotiable. Unlike traditional rule-based systems, LLMs produce probabilistic outputs — meaning errors are inevitable. The goal isn't perfection, but consistent alignment with user expectations under real-world conditions.
A representative evaluation dataset is critical. It must reflect real user inputs and be entirely separate from training data. The size of this set typically ranges from 10% to 30% of total data, depending on volume and diversity. Crucially, input preprocessing — including tokenization, truncation, and formatting — must mirror the training pipeline exactly.
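One way to keep the evaluation set separate from training data is to deduplicate before splitting. The sketch below is a minimal illustration under that assumption; the function name `split_holdout`, the 20% fraction, and the sample queries are made up for the example.

```python
import random

def split_holdout(examples, eval_fraction=0.2, seed=0):
    """Return (train, eval) lists with no example shared between them."""
    unique = list(dict.fromkeys(examples))  # drop exact duplicates, keep order
    random.Random(seed).shuffle(unique)     # seeded so the split is reproducible
    n_eval = max(1, round(len(unique) * eval_fraction))
    return unique[n_eval:], unique[:n_eval]

# Hypothetical data: ten distinct queries plus one exact duplicate.
queries = [f"query {i}" for i in range(10)] + ["query 0"]
train_set, eval_set = split_holdout(queries)
```

This only catches exact duplicates. Near-duplicates (paraphrases, templated inputs) would need a similarity check on top.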
Evaluation Metrics for NLU Tasks
For classification and extraction tasks, the standard metrics are derived from the confusion matrix, where TP, FP, and FN denote true positives, false positives, and false negatives:
- Precision: P = TP / (TP + FP)
- Recall: R = TP / (TP + FN)
- F1-score: F1 = 2 * (P * R) / (P + R)
Consider a spam filter with 20 spam emails (positive) and 80 legitimate ones (negative). If the model predicts 30 spam emails — 15 correctly and 15 falsely — then:
- Precision = 15 / 30 = 0.5
- Recall = 15 / 20 = 0.75
- F1 = 0.6
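The arithmetic above can be reproduced in a few lines. This is a minimal sketch; `precision_recall_f1` is an illustrative helper name, not a library function.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Spam example: 30 emails flagged, 15 correctly (TP) and 15 wrongly (FP).
# There are 20 spam emails in total, so 5 were missed (FN).
p, r, f1 = precision_recall_f1(tp=15, fp=15, fn=5)
```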
Adjusting the decision threshold trades precision for recall. A high threshold (e.g., 95% confidence) minimizes false positives but increases false negatives. Conversely, a low threshold boosts recall at the cost of precision.
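To see the trade-off concretely, the sketch below evaluates the same predictions at two thresholds. The scores and labels are invented for illustration, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Precision is undefined when nothing is predicted positive; report 0.0 here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up model confidences and ground-truth labels (1 = spam).
scores = [0.99, 0.97, 0.90, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0]

strict = precision_recall_at(scores, labels, threshold=0.95)
lenient = precision_recall_at(scores, labels, threshold=0.50)
```

On this toy data the strict threshold produces no false positives but misses half of the spam, while the lenient one catches all of it at the cost of one false alarm.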
For multi-class problems:
- Macro-F1: Unweighted mean of the per-class F1 scores. Use when every class matters equally, regardless of how many samples it has.
- Micro-F1: Aggregate all TP/FP/FN globally. Use when sample count matters more than class balance.
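The difference between the two averages shows up on imbalanced data. The following sketch computes both from scratch; the labels are invented and the helper names are illustrative.

```python
from collections import Counter

def f1_from_counts(tp, fp, fn):
    # 2TP / (2TP + FP + FN) is algebraically the harmonic mean of P and R.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    classes = sorted(set(y_true) | set(y_pred))
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Imbalanced toy data: 8 samples of class "a", 2 of class "b", one "b" misclassified.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
macro, micro = macro_micro_f1(y_true, y_pred)
```

Here micro-F1 is 0.9 while macro-F1 is lower, because the single error halves the recall of the rare class and macro averaging gives that class equal weight.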
Evaluation Metrics for NLG Tasks
Generative tasks lack single correct answers. Two approaches dominate:
With Reference Texts (e.g., summarization, translation):
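- BERTScore: Uses contextual embeddings from BERT to compute token-level similarity. For reference $x$ and hypothesis $\hat{y}$:
  $$P = \frac{1}{|\hat{y}|} \sum_{\hat{y}_j \in \hat{y}} \max_{x_i \in x} \mathrm{sim}(\hat{y}_j, x_i), \qquad R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{y}_j \in \hat{y}} \mathrm{sim}(x_i, \hat{y}_j)$$
  Then compute F1 from P and R.
  Example:
      import torch
      import torch.nn.functional as F

      ref_tokens = ["我", "爱", "伟大", "祖", "国"]  # "I love (my) great motherland"
      hyp_tokens = ["我", "爱", "祖", "国"]          # "I love (my) motherland"
      # Placeholder unit-norm embeddings, shape (seq_len, dim); in practice they come from a BERT encoder
      embed_ref = F.normalize(torch.randn(len(ref_tokens), 768), dim=-1)
      embed_hyp = F.normalize(torch.randn(len(hyp_tokens), 768), dim=-1)
      sim_matrix = embed_hyp @ embed_ref.T  # cosine similarities, shape: (4, 5)
      p = sim_matrix.max(dim=1).values.mean()  # best reference match per hypothesis token
      r = sim_matrix.max(dim=0).values.mean()  # best hypothesis match per reference token
      f1 = 2 * p * r / (p + r)
- BLEU/ROUGE: N-gram overlap metrics. BLEU measures precision of generated n-grams against the reference; ROUGE measures recall. ROUGE-L (longest common subsequence) is common in summarization; BLEU-4 is standard in translation.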
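To make the longest-common-subsequence idea behind ROUGE-L concrete, here is a simplified sketch. It reports a plain F1 rather than the weighted F-measure that some ROUGE implementations use, and the sentences are invented; treat it as an illustration rather than a reference implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, hypothesis):
    """Return (precision, recall, f1) based on the LCS of two token lists."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat".split()
hypothesis = "the cat lay on the mat".split()
rl_p, rl_r, rl_f1 = rouge_l(reference, hypothesis)
```

The two sentences share the subsequence "the cat on the mat" (five of six tokens), so precision and recall are both 5/6.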
Without Reference Texts (e.g., marketing copy generation):
Human judgment is unavoidable. Define scoring criteria:
- Accuracy: Are product name, price, features correctly stated?
- Fluency: Is the text grammatical and logically coherent?
- Persuasiveness: Does it evoke desire or action?
Use multi-rater scoring (e.g., 3–5 annotators) and average scores. Alternatively, use a stronger LLM as an evaluator with structured prompts like:
"Rate the following ad on accuracy (1–5), fluency (1–5), and persuasiveness (1–5). Only return the three numbers."
Security: Non-Negotiable for Production
A model that generates harmful, biased, or off-topic content is a liability — not a feature.
Pre- and Post-Processing Filters
- Pre-filtering: Scan user input before sending to the model. Block known toxic patterns, injection attempts, or policy violations using rule-based or ML classifiers.
- Post-filtering: Inspect model output. If unsafe content is detected, replace it with a neutral fallback (e.g., "I can’t assist with that.").
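The two filter stages can be wrapped around any model call. The following is a minimal rule-based sketch: the regex patterns are toy examples, `guarded_call` is a hypothetical name, and the model is passed in as a plain callable so that no specific API is assumed.

```python
import re
from typing import Callable

FALLBACK = "I can't assist with that."

# Illustrative patterns only; a real deployment would use a maintained
# rule set or a trained classifier.
BLOCKED_INPUT = [re.compile(r"ignore (all )?previous instructions", re.I)]
BLOCKED_OUTPUT = [re.compile(r"\b(ssn|password)\s*[:=]", re.I)]

def is_flagged(text: str, patterns) -> bool:
    return any(p.search(text) for p in patterns)

def guarded_call(user_input: str, model: Callable[[str], str]) -> str:
    """Run pre-filter, model, then post-filter; return FALLBACK on any hit."""
    if is_flagged(user_input, BLOCKED_INPUT):   # pre-filter: never reaches the model
        return FALLBACK
    output = model(user_input)
    if is_flagged(output, BLOCKED_OUTPUT):      # post-filter: hide unsafe output
        return FALLBACK
    return output

# Stub models standing in for a real LLM call.
def echo_model(text: str) -> str:
    return f"You said: {text}"

def leaky_model(text: str) -> str:
    return "password: hunter2"
```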
Use both layers for high-risk applications. Be mindful of streaming responses — scan complete utterances, not individual tokens.
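For streaming, one option is to buffer tokens until a sentence boundary and scan each completed sentence before releasing it. The sketch below assumes sentences end in ".", "!" or "?" followed by whitespace, and takes the safety check as a callable; `scan_stream` and the sample tokens are hypothetical.

```python
import re
from typing import Callable, Iterable, Iterator

STREAM_FALLBACK = "I can't assist with that."
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def scan_stream(tokens: Iterable[str], is_unsafe: Callable[[str], bool]) -> Iterator[str]:
    """Yield complete, checked sentences; stop with a fallback on the first unsafe one."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:          # everything but the last part is complete
            if is_unsafe(sentence):
                yield STREAM_FALLBACK
                return
            yield sentence + " "             # whitespace between sentences is normalised
        buffer = parts[-1]                   # keep the unfinished tail for the next token
    if buffer:
        yield STREAM_FALLBACK if is_unsafe(buffer) else buffer

def contains_forbidden(text: str) -> bool:   # stand-in for a real classifier
    return "forbidden" in text.lower()

safe_tokens = ["Hel", "lo there. ", "How ", "are you?"]
unsafe_tokens = ["Safe start. ", "This is forb", "idden text. ", "More."]
```

In the second token list the flagged word is split across two tokens, which a per-token check would miss.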
Third-party moderation APIs (e.g., Azure Content Moderator, Google's Perspective API) can reduce development overhead.
Instruction Tuning and Prompt Engineering
Embed safety constraints directly into prompts:
"You are a customer service assistant. Do not generate harmful, illegal, or biased content. If unsure, respond: 'I cannot provide that information.'"
But beware:
- The context itself may contain harmful content.
- The model may misinterpret "risk" based on training data bias.
- Adversarial prompts (e.g., "Ignore previous instructions") can bypass safeguards.
Even with perfect prompts, high temperature settings increase risk. Always pair prompt controls with output filtering.
Controlled Generation Techniques
Advanced methods for fine-grained control:
- Control Tokens: Prefix input with tokens like <SAFE> or <FORMAL>. Effective when models are fine-tuned with such signals.
- Classifier-Guided Generation: Use a secondary classifier to adjust token probabilities during decoding. For example, penalize tokens likely to be toxic.
- Feedback-Based Alignment: Techniques like RLHF (Reinforcement Learning from Human Feedback) optimize for safety, helpfulness, and truthfulness — as in InstructGPT.
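As a toy illustration of classifier-guided generation, the sketch below subtracts a penalty from the logits of tokens a classifier considers toxic, then renormalises. Real systems apply this to tensor logits at every decoding step; the dictionaries, scores, and the `penalty` value here are assumptions made for the example.

```python
import math

def guided_probs(logits: dict[str, float], toxicity: dict[str, float], penalty: float = 5.0):
    """Lower each token's logit by penalty * toxicity score, then apply softmax."""
    adjusted = {tok: logit - penalty * toxicity.get(tok, 0.0) for tok, logit in logits.items()}
    peak = max(adjusted.values())                       # subtract max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in adjusted.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# Hypothetical next-token logits and classifier scores in [0, 1].
logits = {"great": 2.0, "awful": 2.0, "fine": 1.0}
toxicity = {"awful": 0.9}
probs = guided_probs(logits, toxicity)
```

Before the penalty, "great" and "awful" are equally likely; afterwards "awful" ranks below "fine".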
Operational Best Practices:
- Implement message recall for accidental harmful outputs.
- Log all inputs and outputs for audit trails.
- Enforce strict user account controls: ban users who repeatedly trigger unsafe outputs.
Never expose an LLM directly to end users without multiple safety layers.
Network Resilience: Handling Real-World API Calls
LLM APIs are external services — subject to network instability, rate limits, and outages.
Retry and Circuit Breaker Patterns
Not all failures warrant retry:
- ❌ Do not retry on authentication errors or invalid parameters; the same request will fail again.
- ✅ Retry on timeouts, 5xx server errors, and network disconnects.
Use exponential backoff:
- 1st retry: 2s
- 2nd retry: 4s
- 3rd retry: 8s
- Stop after the 3rd retry.
If failures persist, activate a circuit breaker: temporarily disable the failing service and return cached or default responses. This prevents cascading failures during outages.
Example:
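    import time

    import requests

    API_URL = "https://api.example.com/v1/generate"  # placeholder endpoint
    MAX_RETRIES = 3

    def call_llm_api(prompt):
        for attempt in range(MAX_RETRIES + 1):  # one initial call plus up to 3 retries
            try:
                response = requests.post(API_URL, json=prompt, timeout=10)
                if response.ok:
                    return response.json()
                if response.status_code < 500:
                    # 4xx (bad credentials, invalid parameters): retrying cannot help
                    response.raise_for_status()
            except (requests.Timeout, requests.ConnectionError):
                pass  # transient network failure: fall through to the backoff
            if attempt < MAX_RETRIES:
                time.sleep(2 ** (attempt + 1))  # 2s, 4s, 8s
        # Retries exhausted: return a fallback so callers can degrade gracefully
        return {"error": "Service temporarily unavailable", "fallback": True}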
Enable circuit breaker auto-recovery: after a cooldown period, send a single test request. If it succeeds, close the circuit and resume normal traffic.
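The breaker itself can be a small state holder that sits in front of the API call. This is a minimal, single-threaded sketch; the class name, the thresholds, and the injectable `clock` parameter (there so the cooldown can be tested without waiting) are all assumptions of the example.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (traffic flows)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: let one probe through and restart the timer,
            # so other callers keep getting the fallback until the probe reports back.
            self.opened_at = self.clock()
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit
```

A caller would check `allow_request()` before each API call, return the cached or default response when it is `False`, and report the outcome with `record_success()` or `record_failure()`.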
Reducing Latency
- Choose the right model: Smaller models (e.g., GPT-3.5-turbo vs. GPT-4) respond faster and cost less.
- Limit output length: Use max_tokens and stop_sequences to cap generation. Shorter outputs = lower latency + lower cost.
- Trim context: Only include relevant prior context. Use semantic retrieval (e.g., vector search) to fetch the most pertinent snippets.
- Use streaming: For chat or long-form generation, stream responses via SSE or WebSocket. Users perceive lower latency even if total time is unchanged.
- Cache frequent queries: If inputs are static (e.g., FAQs), cache responses. Invalidate cache when underlying knowledge changes.
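Caching can be as simple as an in-memory dictionary with a time-to-live. The sketch below is one possible shape: `ResponseCache`, the one-hour default, and the whitespace/case normalisation are illustrative choices, and the LLM call is passed in as a callable.

```python
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, response)

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalise case and whitespace so trivially different inputs share an entry.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, compute):
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]                      # fresh hit: skip the API call
        response = compute(prompt)               # miss or expired: call the model
        self._store[key] = (self.clock() + self.ttl, response)
        return response

    def invalidate_all(self) -> None:
        """Call this when the underlying knowledge changes."""
        self._store.clear()
```

Exact-match caching like this suits FAQ-style inputs. Free-form questions rarely repeat verbatim, so the hit rate there would be low.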
Scaling for High Concurrency
When demand exceeds a single API key’s quota:
1. Resource Pooling
- Maintain a pool of API keys from one or multiple providers.
- Assign keys dynamically per request.
- Track usage, error rates, and costs per key.
- Build a custom identity layer: map internal service accounts to external provider credentials.
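A key pool along these lines might look like the sketch below. It picks the least-used key among those with an acceptable error rate; the class names, the 0.5 error-rate cutoff, and the placeholder key strings are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class KeyStats:
    key: str
    requests: int = 0
    errors: int = 0
    cost: float = 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

class KeyPool:
    def __init__(self, keys, max_error_rate=0.5):
        self.stats = [KeyStats(k) for k in keys]
        self.max_error_rate = max_error_rate

    def acquire(self) -> KeyStats:
        """Pick the least-used key whose error rate is acceptable."""
        healthy = [s for s in self.stats if s.error_rate <= self.max_error_rate]
        # If every key looks unhealthy, fall back to the full pool rather than fail.
        return min(healthy or self.stats, key=lambda s: s.requests)

    def record(self, stats: KeyStats, ok: bool, cost: float = 0.0) -> None:
        """Track usage, errors, and spend per key after each call."""
        stats.requests += 1
        stats.errors += 0 if ok else 1
        stats.cost += cost

pool = KeyPool(["key-A", "key-B"])  # placeholder strings, not real credentials
first = pool.acquire()
pool.record(first, ok=True, cost=0.002)
second = pool.acquire()
```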
2. Batch Processing
- Group multiple user requests into single API calls.
- Use a request queue with a fixed window (e.g., 16 requests every 2 seconds).
- Batch sizes: powers of two (2, 4, 8, 16) are a common choice because they tend to align with GPU memory layout.
Example batch handler:
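    import asyncio
    import time
    from collections import deque

    class BatchProcessor:
        def __init__(self, batch_size=16, interval=2):
            self.queue = deque()
            self.batch_size = batch_size
            self.interval = interval
            self.running = False
            self._task = None  # keep a reference so the worker task is not garbage-collected

        async def add_request(self, request):
            self.queue.append(request)
            if not self.running:
                # Set the flag before the task starts so that concurrent
                # callers do not spawn a second worker
                self.running = True
                self._task = asyncio.create_task(self.process_batch())

        async def process_batch(self):
            try:
                while self.queue:
                    start = time.monotonic()
                    batch = []
                    while len(batch) < self.batch_size and self.queue:
                        batch.append(self.queue.popleft())
                    # bulk_llm_call is assumed to be defined elsewhere: it sends
                    # one API call for the batch and returns responses in order
                    responses = await bulk_llm_call(batch)
                    for req, resp in zip(batch, responses):
                        req.callback(resp)
                    elapsed = time.monotonic() - start
                    await asyncio.sleep(max(0, self.interval - elapsed))
            finally:
                self.running = False  # allow a new worker even if a batch failed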
3. Queue-Based Architecture
- Accept all requests into a queue.
- Process them at a sustainable rate.
- Trade immediate response for cost efficiency.
- Ideal when users tolerate slight delays (e.g., document summarization).
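A queue-based setup can be sketched with `asyncio.Queue` and a single throttled worker. The handler below is a stub standing in for a real summarization call, and `run_jobs`, `worker`, and `min_interval` are names chosen for this example.

```python
import asyncio

async def worker(queue, handler, results, min_interval):
    """Process jobs one at a time, pausing between them to cap the request rate."""
    while True:
        job = await queue.get()
        try:
            results.append(await handler(job))
        finally:
            queue.task_done()
        await asyncio.sleep(min_interval)

async def run_jobs(jobs, handler, min_interval=0.0):
    queue = asyncio.Queue()
    results = []
    for job in jobs:
        queue.put_nowait(job)              # accept everything up front
    task = asyncio.create_task(worker(queue, handler, results, min_interval))
    await queue.join()                     # wait until every job is processed
    task.cancel()                          # the worker loops forever, so stop it
    await asyncio.gather(task, return_exceptions=True)
    return results

async def fake_summarize(document: str) -> str:
    return document[:10]                   # stub: a real handler would call the LLM

summaries = asyncio.run(
    run_jobs(["first document text", "second document text"], fake_summarize)
)
```

Raising `min_interval` is the knob that trades response time for a lower, steadier request rate.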
Monitoring & Alerting
- Dashboard for: request volume, error rate, latency percentiles, cost per call.
- Alerts when error rate exceeds 5% or cost spikes unexpectedly.
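The error-rate alert can be approximated with a sliding window over recent calls. This is a rough sketch: the window size of 100 is arbitrary, and `ErrorRateMonitor` is a hypothetical class rather than part of any monitoring product.

```python
from collections import deque

class ErrorRateMonitor:
    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True marks a failed call
        self.threshold = threshold

    def record(self, is_error: bool) -> None:
        self.outcomes.append(is_error)        # oldest outcome drops out automatically

    @property
    def error_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def should_alert(self) -> bool:
        """Alert only when the rate exceeds the threshold, not when it equals it."""
        return self.error_rate > self.threshold
```

In production these counters would normally live in a metrics system, with the same threshold logic expressed as an alert rule.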
Conclusion
Transitioning from prototype to production demands more than model accuracy. It requires systematic evaluation, layered security, and resilient infrastructure. Real-world systems must handle edge cases, adversarial inputs, intermittent connectivity, and scaling demands — often under tight cost constraints. The most successful deployments balance performance, safety, and efficiency through deliberate design — not just better models.