Practical Strategies for Optimizing Gemini API Calls
Generating tokens is expensive and slow. Here are some techniques to reduce latency and costs when working with Gemini.
1. Explicit Context Caching
Cache your system prompts and static context with the provider.
Benefits:
- Dramatically lower TTFT (time to first token). I’ve seen 3s drop to 600ms.
- 90% cost reduction on cached tokens
- Enables longer, richer prompts without the cost and latency penalty
Gemini supports both implicit and explicit caching, but implicit caching can be unreliable. Use explicit caching for consistent results.
When to use: Any prompt where substantial static content (system instructions, few-shot examples, tool definitions) would improve accuracy. Hint: that’s always the case.
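A minimal sketch of creating and using an explicit cache with the @google/genai SDK. The model name, TTL, system instruction, and example content are placeholders, and the static content has to meet the model’s minimum cacheable token count:

import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

// Placeholder for your large static context: few-shot examples, tool docs, etc.
const STATIC_CONTEXT = '...long few-shot examples and reference material...'

// Create the cache once; reuse it until the TTL expires or the content changes.
const cache = await ai.caches.create({
  model: 'gemini-2.5-flash',
  config: {
    systemInstruction: 'You are a nutrition assistant for the Welling app.',
    contents: [{ role: 'user', parts: [{ text: STATIC_CONTEXT }] }],
    ttl: '3600s', // keep the cache alive for an hour
  },
})

// Each request now sends only the new user turn at the full token price.
const response = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: 'I had a bowl of oatmeal with blueberries',
  config: { cachedContent: cache.name },
})
console.log(response.text)

Create the cache at deploy time or on first use, and recreate it whenever the static content changes.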
2. Batching
If real-time isn’t required, use Gemini’s batch inference. It’s 50% cheaper, has implicit context caching enabled by default, and supports up to 200,000 requests per job. Note: the 90% cache-hit discount takes precedence over the batch discount; the two don’t stack.
Use it for pre-computation. If you can predict what users will need, batch-process it ahead of time and cache the results. Saves both cost and real-time latency.
Timing: In my experience, a 20k-sample job takes 15-40 minutes depending on load, not the 24-hour worst case.
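A rough sketch of submitting inline requests to batch mode with the @google/genai SDK. The request shapes, job-state strings, and the predictedQueries input are assumptions based on my reading of the batch-mode docs, so verify the exact field names against the current reference:

import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

// Pre-compute answers for predictable queries overnight instead of at request time.
const predictedQueries = ['Summarize my week', 'Suggest tomorrow’s meals'] // placeholder
const inlinedRequests = predictedQueries.map((q) => ({
  contents: [{ role: 'user', parts: [{ text: q }] }],
}))

const job = await ai.batches.create({
  model: 'gemini-2.5-flash',
  src: inlinedRequests,
  config: { displayName: 'nightly-precompute' },
})

// Poll until the job finishes, then store the responses in your own cache.
const jobName = job.name
if (!jobName) throw new Error('batch job has no name')
let status = await ai.batches.get({ name: jobName })
while (status.state === 'JOB_STATE_PENDING' || status.state === 'JOB_STATE_RUNNING') {
  await new Promise((resolve) => setTimeout(resolve, 60_000))
  status = await ai.batches.get({ name: jobName })
}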
3. Skip Structured Output (When You Can)
Structured output (JSON mode) is convenient but expensive. The model has to emit all the JSON syntax (braces, quotes, key names) on top of the actual content, which usually means more output tokens, and that means higher cost and slower responses.
This doesn’t apply to tool calls. Tool calls always return JSON. When using function calling, return thought signatures from the model response along with your function results so the model can resume its reasoning. Gemini 3 enforces this.
Note: Only Gemini 3 models support both structured output and tool calls together.
When you’re not using tools: Design a token-efficient output format and parse it yourself. Every token costs money and time.
// Instead of:
{ "sentiment": "positive", "confidence": 0.92, "aspects": ["price", "quality"] }
// Use CSV, semicolons, or key-value pairs:
positive;0.92;price,quality
// Or:
sentiment: positive
confidence: 0.92
aspects: price, quality
With context caching, you can provide a thorough list of examples in your prompt for almost nothing. The model learns your format reliably.
Make your parser extremely robust. Handle variations in whitespace, casing, punctuation, and ordering. Parsing is cheap, so do it all in code. Use AI to generate test cases with weird edge cases and variations, then make sure your parser handles them all.
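Here is a sketch of such a parser for the key-value format above. The field names, defaults, and normalization rules are illustrative:

interface SentimentResult {
  sentiment: 'positive' | 'negative' | 'neutral'
  confidence: number
  aspects: string[]
}

// Tolerant parser: ignores ordering, extra whitespace, casing, and stray
// punctuation, and falls back to safe defaults when a field is missing.
function parseSentiment(raw: string): SentimentResult {
  const result: SentimentResult = { sentiment: 'neutral', confidence: 0, aspects: [] }
  for (const line of raw.split(/\r?\n/)) {
    const match = line.match(/^\s*([a-z_ ]+?)\s*[:=]\s*(.+?)\s*$/i)
    if (!match) continue
    const key = match[1].toLowerCase().trim()
    const value = match[2].trim()
    if (key.startsWith('sentiment')) {
      const v = value.toLowerCase()
      if (v.includes('pos')) result.sentiment = 'positive'
      else if (v.includes('neg')) result.sentiment = 'negative'
    } else if (key.startsWith('confidence')) {
      // Accept "0.92", "92%", "0.92." and similar variations.
      const num = parseFloat(value.replace(/[^\d.]/g, ''))
      if (!Number.isNaN(num)) result.confidence = num > 1 ? num / 100 : num
    } else if (key.startsWith('aspect')) {
      result.aspects = value
        .split(/[,;]/)
        .map((a) => a.trim().toLowerCase())
        .filter(Boolean)
    }
  }
  return result
}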
4. Dynamic Model Selection Based on Complexity
Not every query needs your most expensive model. Instead of using a separate LLM to classify complexity (expensive, slow), calculate it deterministically from input features.
Build a complexity calculator: Score inputs based on domain-specific signals. Use regular NLP algorithms (tokenization, POS tagging, sentiment analysis, entity extraction) and simple heuristics. These are fast and free compared to an LLM call.
// Example: simple message-complexity scoring
function calculateComplexity(input: { message: string; historySize: number }) {
  const { message, historySize } = input
  const wordCount = message.trim().split(/\s+/).length
  const questionCount = (message.match(/\?/g) || []).length
  const hasCorrection = /\b(actually|wait|i mean|meant to say)\b/i.test(message)
  const hasNegation = /\b(not|no|don't|didn't|won't|can't)\b/i.test(message)
  const hasUncertainty = /\b(maybe|might|probably|not sure)\b/i.test(message)
  return (
    wordCount * 1.0 +
    questionCount * 15 +
    historySize * 2 +
    (hasCorrection ? 20 : 0) +
    (hasNegation ? 10 : 0) +
    (hasUncertainty ? 10 : 0)
  )
}

// Route to the appropriate model tier
const score = calculateComplexity(input)
const model =
  score < 30 ? 'gemini-2.5-flash-lite' : score < 80 ? 'gemini-2.5-flash' : 'gemini-2.5-pro'
Domain-specific signals matter. For meal logging (as in Welling): commas indicate multiple items, and measurements indicate precision is needed. For chat: corrections, negations, and uncertainty indicate nuance that calls for a smarter model.
// Meal complexity: different domain, different signals
const commaCount = (description.match(/,/g) || []).length
const measurementCount = (description.match(/\d+\s*(g|oz|cup|tbsp)/gi) || []).length
const hasCompoundDish = /bowl with|sandwich with|served with/i.test(description)
Why this beats LLM-based routing: Zero latency, zero cost, deterministic, debuggable. You can tune weights based on observed accuracy.
Caveat: Getting the signals and thresholds right takes trial and error. Start simple, log everything, and iterate based on real failures.
5. Dynamic Thinking Budget Selection
Gemini 2.5 models charge for “thinking tokens”. Don’t let the model decide how much to think.
Scale thinking budget with complexity: Use the same complexity score to set the thinking budget. Simple acknowledgments (“ok”, “thanks”) get zero thinking tokens (or the 128-token minimum on Gemini 2.5 Pro). Complex queries get a larger budget sized for your domain.
Output token estimation: Complex inputs produce complex outputs. Estimate maxOutputTokens based on input features to avoid paying for unused capacity.
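A sketch of tying both knobs to the same score, reusing calculateComplexity from the previous section. The scaling factors, caps, and the message/historySize placeholders are illustrative values to tune against your own logs:

import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

// Placeholder request data; in practice this comes from your handler.
const message = 'How do I hit 120g of protein on a vegetarian diet?'
const historySize = 6

const score = calculateComplexity({ message, historySize })

// Simple messages get no thinking at all; harder ones get a capped budget.
const thinkingBudget = score < 30 ? 0 : Math.min(2048, Math.round(score * 8))
// Rough output estimate; leave headroom so real answers aren't truncated.
const maxOutputTokens = thinkingBudget + 256 + Math.round(score * 4)

const response = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: message,
  config: {
    thinkingConfig: { thinkingBudget },
    maxOutputTokens,
  },
})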
Caveat: Same as model selection. Expect trial and error. Log your inputs, outputs, and budgets. Look for cases where you under- or over-allocated, then adjust.
6. Parallel Calls and Speculative Execution
Don’t wait for sequential LLM calls when you can fire them in parallel.
Speculative execution for classifiers: If you have a classifier that routes to different prompts, run all branches simultaneously and discard the unused results. You pay more in tokens but win big on latency.
When to use: When latency matters more than cost, or when combined with context caching (which makes the extra calls cheap).
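A sketch of the pattern: fire every candidate branch and the router at the same time, then keep only the chosen result. The branch prompts and the classify() stub are placeholders for your own routing logic:

import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

// Candidate branches, each with its own (ideally cached) prompt prefix.
const branchPrompts: Record<string, string> = {
  smalltalk: 'Reply briefly and casually to: ',
  support: 'You are a support agent. Resolve this issue: ',
  sales: 'You are a sales assistant. Answer this product question: ',
}

// Placeholder router: swap in your LLM classifier or the deterministic scoring above.
async function classify(message: string): Promise<string> {
  return /refund|error|broken/i.test(message) ? 'support' : 'smalltalk'
}

async function answerSpeculatively(userMessage: string): Promise<string> {
  // Fire every branch immediately instead of waiting for the router to finish first.
  const branchCalls = Object.entries(branchPrompts).map(async ([branch, prefix]) => {
    const res = await ai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents: prefix + userMessage,
    })
    return [branch, res.text ?? ''] as const
  })

  const [chosenBranch, results] = await Promise.all([
    classify(userMessage),
    Promise.all(branchCalls),
  ])

  // Keep only the branch the router selected; the rest are discarded.
  const winner = results.find(([branch]) => branch === chosenBranch)
  return winner ? winner[1] : results[0][1]
}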
7. Streaming for Perceived Latency
Streaming doesn’t reduce actual latency, but it dramatically improves perceived latency.
Why it works: Users see progress immediately instead of staring at a spinner. The first token arrives fast; the full response streams in while they read.
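A minimal streaming sketch with the @google/genai SDK. The prompt and output sink are placeholders; in a web app you’d forward chunks over SSE or a WebSocket instead of stdout:

import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

// Stream tokens as they are generated instead of waiting for the full response.
const stream = await ai.models.generateContentStream({
  model: 'gemini-2.5-flash',
  contents: 'Summarize my meals for today',
})

for await (const chunk of stream) {
  // Forward each chunk to the client as soon as it arrives.
  process.stdout.write(chunk.text ?? '')
}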
8. Fine-tuning (Maybe)
Fine-tuning sounds appealing but rarely makes sense for small organizations.
The hidden costs:
- Requires high-quality training data
- Needs training infrastructure and iteration cycles
- Locks you into a specific model version with no easy upgrade path
- Gives up the flexibility to adjust behavior via prompts
When it might make sense: Large organizations with well-defined problem spaces and automation to continuously fine-tune as models and requirements evolve.
Better alternative for most: Use context caching with detailed few-shot examples. You get similar quality gains while retaining flexibility to iterate on prompts and upgrade models instantly.
Combining Techniques
These techniques compound. A well-optimized pipeline might:
- Calculate complexity score from input features (zero-cost routing)
- Select model tier and thinking budget based on score
- Speculatively execute multiple branches with cached prompts
- Stream the response
- Use batching for non-real-time workloads
The result: 10x faster, 5x cheaper, same quality.
Pick the techniques that match your constraints. Latency-sensitive? Focus on caching and parallelism. Cost-sensitive? Focus on caching, complexity-based routing, and batching.