When your p95 response time is killing user experience and your GPU bill is spiraling, you don't have the luxury of experimenting. Here's exactly what I did, what failed first, and what finally worked.
When I joined MetLife's AI team, the LLM inference stack was already in production. It was serving real users, real queries, real time. And it was slow. Not unusably slow — but the kind of slow that quietly kills adoption. Our p95 latency was sitting at roughly 2.8 seconds. For a conversational AI interface, that's an eternity.
The first thing I did was not write code. I sat with the logs for three days. I watched where time was being spent. What I found surprised me — the model itself wasn't the bottleneck. The problem was how we were serving it.
"The model isn't slow. The way you're running it is slow." — what I told myself after day three of log analysis.
We were running a 3B-parameter model on NVIDIA A100s using vanilla PyTorch inference. No batching strategy. No KV-cache optimization. Every request was treated as a standalone inference job. On paper, the model was fast. In practice, we were leaving 70% of our GPU compute on the floor.
The specific issues I identified were: no continuous batching (requests waited in queue instead of being grouped), memory fragmentation from inefficient KV-cache allocation, and no tensor parallelism across our 8-GPU nodes. Any one of these alone would hurt. All three together meant we were running a sports car with the handbrake on.
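To make the cost of unbatched serving concrete, here's a back-of-the-envelope throughput model. The numbers are illustrative, not our production figures — the point is that each decode step pays a roughly fixed per-step cost (kernel launches, streaming weights through the GPU) regardless of how many sequences share it, so batch size 1 wastes most of that cost:

```python
# Illustrative throughput model for batched vs. unbatched decoding.
# All constants are made up for illustration; they are not measured figures.

def tokens_per_second(batch_size, fixed_overhead_ms=20.0, per_seq_ms=0.5, tokens=128):
    """Decode `tokens` tokens for `batch_size` concurrent sequences.

    Each decode step pays a fixed cost regardless of batch size, plus a
    small per-sequence cost, so batching amortizes the fixed cost across
    many requests at once.
    """
    step_ms = fixed_overhead_ms + per_seq_ms * batch_size
    total_ms = step_ms * tokens
    return (batch_size * tokens) / (total_ms / 1000.0)

unbatched = tokens_per_second(1)
batched = tokens_per_second(16)
print(f"batch=1:  {unbatched:,.0f} tok/s")
print(f"batch=16: {batched:,.0f} tok/s ({batched / unbatched:.1f}x)")
```

Under these toy constants, batching 16 requests yields roughly an order of magnitude more throughput — which is why "no batching strategy" was the first item on the list.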
My first move was migrating to vLLM. If you haven't used it — it implements PagedAttention, which handles KV-cache like virtual memory in an OS. Instead of allocating a fixed block of memory per sequence, it allocates in pages, dramatically reducing fragmentation and wasted memory. The switch alone dropped our p95 by around 18%.
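The paging idea is easier to see in a toy sketch than in vLLM's CUDA internals. This is not vLLM's code — just the bookkeeping pattern: sequences pull fixed-size pages from a shared pool as they grow, instead of reserving worst-case `max_seq_len` memory up front:

```python
# Toy sketch of PagedAttention-style KV-cache paging (not vLLM's actual code).
# Each sequence grabs fixed-size pages from a shared free list on demand,
# so a short sequence never strands a worst-case-sized reservation.

class PagedKVCache:
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # shared pool of physical pages
        self.page_tables = {}                     # seq_id -> [page ids]
        self.lengths = {}                         # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve KV-cache space for one new token, paging in on demand."""
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:               # current page full (or first token)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            self.page_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's pages to the shared pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(20):
    cache.append_token("req-1")
print(len(cache.page_tables["req-1"]))  # -> 2 pages for a 20-token sequence
```

A 20-token sequence holds exactly two 16-token pages; everything else stays in the pool for other requests. That's the fragmentation win in miniature.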
But I wasn't done. The next layer was TensorRT-LLM from NVIDIA. This required converting our fine-tuned model weights into an optimized TRT engine. The compilation step takes time — about 40 minutes the first run — but the runtime gains are significant. TensorRT fuses operations, eliminates redundant memory transfers, and generates CUDA kernels specifically tuned for our A100 hardware.
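To show what "fusing operations" buys, here's an illustrative NumPy example — NumPy can't truly fuse kernels, so treat this as a diagram of the idea TensorRT implements at the CUDA level, not TensorRT code:

```python
import numpy as np

# Illustrative op fusion (not TensorRT itself). The unfused path runs the
# bias add and activation as separate passes, each round-tripping an
# intermediate array through memory; a fused kernel applies them while the
# matmul result is still in registers.

def gelu(x):
    # tanh approximation of GELU, common in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def unfused(x, w, b):
    h = x @ w          # "kernel" 1: matmul, writes h to memory
    h = h + b          # "kernel" 2: reads h back, writes a new intermediate
    return gelu(h)     # "kernel" 3: reads again, writes the final output

def fused(x, w, b):
    # One logical kernel: same math, no intermediates materialized between ops.
    # (NumPy still allocates temporaries; TensorRT genuinely emits one kernel.)
    return gelu(x @ w + b)

rng = np.random.default_rng(0)
x, w, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), rng.normal(size=8)
assert np.allclose(unfused(x, w, b), fused(x, w, b))  # identical outputs
```

The outputs are bit-for-bit the same math; the savings are entirely in memory traffic and kernel-launch overhead, which is exactly where GPU inference time hides.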
This is where most people get nervous, and honestly I was too. Quantizing a model from FP16 to INT8 means trading some numerical precision for speed and memory. In a regulated finance environment, "trading precision" sounds like a compliance nightmare.
We used GPTQ quantization with careful calibration data. The key insight: we didn't quantize blindly. We ran an extensive evaluation suite — the same one used to validate the model for production — and compared outputs across 10,000 query samples. The degradation was statistically negligible, and output consistency held across every model variant we tested. We got sign-off from the model risk team.
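For intuition, here's the round-trip at the heart of weight quantization. This is a simplified per-channel symmetric INT8 scheme, not GPTQ itself — GPTQ additionally uses calibration activations and second-order error compensation — but it shows where the precision/memory trade lives and how you'd measure the error:

```python
import numpy as np

# Simplified per-channel symmetric INT8 weight quantization.
# NOT GPTQ: real GPTQ compensates quantization error using calibration
# data and second-order (Hessian) information. This only shows the
# quantize/dequantize round-trip and how to bound the error.

def quantize_int8(w):
    """Quantize each output channel (row) of `w` to INT8 with its own scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(42)
w = rng.normal(scale=0.02, size=(256, 512)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
fp16_bytes = w.astype(np.float16).nbytes
print(f"INT8 is {q.nbytes / fp16_bytes:.0%} of FP16 bytes; max abs error {err:.1e}")
```

Per-channel scales keep the worst-case element error bounded by half a quantization step per row — which is why the evaluation-suite comparison, not the raw error number, is what the model risk team actually signed off on.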
The result: 35% reduction in GPU memory footprint, which meant we could run 2x the model concurrency on the same hardware. That deferred an estimated $500K in GPU procurement for the year.
The final piece was deploying behind NVIDIA's Triton Inference Server with dynamic batching enabled. This groups incoming requests within a configurable time window and processes them together. Before, GPU utilization hovered around 58% under peak load. After enabling dynamic batching with a 50ms window, it reached 83% — effectively extracting the compute value of three additional A100s from existing hardware.
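The batching semantics are simple to sketch. This toy Python version mimics what Triton's scheduler does internally (the real logic lives inside Triton, not in user code): wait until either the batch is full or the queue-delay window expires, then run whatever has arrived as one batch:

```python
import time
from queue import Queue, Empty

# Toy sketch of dynamic-batching semantics (the real scheduler lives inside
# Triton): collect requests until the batch fills up or the delay window
# expires, whichever comes first, then process them together.

def collect_batch(request_queue, max_batch_size=32, max_delay_s=0.050):
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # 50ms window expired: ship a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch

q = Queue()
for i in range(5):
    q.put(f"req-{i}")
print(collect_batch(q, max_batch_size=4))  # -> ['req-0', 'req-1', 'req-2', 'req-3']
```

The window is the knob: too short and batches stay small under bursty traffic; too long and you tax p95 latency on quiet seconds. 50ms was where our throughput gain stopped outpacing the added queue delay.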
Start with profiling, not optimization. I spent too long on intuition-based fixes in the first week before I brought in Nsight Systems to actually see where compute time was going. Once I had real data, every decision became obvious. The 42% latency win didn't come from one clever trick — it came from systematically eliminating every inefficiency the profiler revealed.
If you're working on LLM inference at scale, the stack matters as much as the model. PagedAttention + TRT-LLM + dynamic batching is a combination I'd reach for again on day one.