Engineering
Scaling LLMs with Golang: How we serve millions of LLM requests


John Wang
Co-Founder and CTO

While the LLM ecosystem is overwhelmingly Python-first, we've found Go to be exceptionally well-suited for production deployments. Our Go-based infrastructure handles millions of monthly LLM requests with minimal performance tuning. Beyond Go's well-documented advantages (see Rob Pike’s excellent distillation of Go's benefits), three capabilities have proven particularly valuable for LLM workloads: static type checking for handling model outputs, goroutines for managing concurrent API calls, and interfaces for building composable response pipelines. Here's how we've implemented each of these in our production stack.

Type safety and structured outputs

One of the main challenges with LLMs is handling their unstructured outputs. OpenAI's structured output support has been a significant advancement for us, and Go's type system makes it particularly elegant to implement. Rather than writing separate schema definitions, we can leverage Go's struct tags and reflection to generate well-defined schemas. Here's an example where we automatically convert a SupportResponse into OpenAI's JSON schema format using the go-openai library:
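A minimal sketch of that flow, using go-openai's jsonschema helper to derive the schema from struct tags; the SupportResponse fields shown here are illustrative, and exact option names can vary slightly between library versions:

```go
package llm

import (
	"context"
	"encoding/json"

	openai "github.com/sashabaranov/go-openai"
	"github.com/sashabaranov/go-openai/jsonschema"
)

// SupportResponse is the shape we want the model to return.
// The struct tags double as the schema definition.
type SupportResponse struct {
	Answer      string   `json:"answer"`
	RelatedDocs []string `json:"related_docs"`
}

func GetSupportResponse(ctx context.Context, client *openai.Client, question string) (*SupportResponse, error) {
	// Derive a JSON schema from the struct via reflection.
	schema, err := jsonschema.GenerateSchemaForType(SupportResponse{})
	if err != nil {
		return nil, err
	}

	resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
		Model: openai.GPT4o,
		Messages: []openai.ChatCompletionMessage{
			{Role: openai.ChatMessageRoleUser, Content: question},
		},
		ResponseFormat: &openai.ChatCompletionResponseFormat{
			Type: openai.ChatCompletionResponseFormatTypeJSONSchema,
			JSONSchema: &openai.ChatCompletionResponseFormatJSONSchema{
				Name:   "support_response",
				Schema: schema,
				Strict: true,
			},
		},
	})
	if err != nil {
		return nil, err
	}

	// The reply is constrained to the schema, so it unmarshals
	// directly into our struct.
	var out SupportResponse
	if err := json.Unmarshal([]byte(resp.Choices[0].Message.Content), &out); err != nil {
		return nil, err
	}
	return &out, nil
}
```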

The above code will provide us with Answer and RelatedDocs populated directly from an LLM call. Now, the SupportResponse can be easily passed to our frontend or saved in our database.

Notice that because Go is statically typed, you don't have to spend extra time defining the object structure separately (as you would in Python); it's already available via reflection, so you can spend more of your time on prompting and on the inputs and outputs of the LLM.

Parallel processing and latency

LLM applications often require concurrent API calls and complex orchestration. Go's goroutines and channels make this remarkably straightforward.

For instance, suppose we're running a Retrieval Augmented Generation (RAG) pipeline and want to perform hybrid search across three different search backends (see our article on Better RAG Results with Reciprocal Rank Fusion and Hybrid Search). Running these searches serially would add up their individual latencies, resulting in slower responses. With Go we can relatively easily parallelize searches across multiple backends:
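Here's a simplified version of that fan-out, assuming a generic SearchBackend interface and a flat Document result type (the real backends and fusion logic are richer than this sketch):

```go
package search

import (
	"context"
	"time"
)

// Document is a single retrieved result.
type Document struct {
	ID    string
	Score float64
}

// SearchBackend is any retrieval source we can query
// (e.g. keyword, vector, or graph search).
type SearchBackend interface {
	Search(ctx context.Context, query string) ([]Document, error)
}

// HybridSearch fans the query out to every backend in parallel and
// collects whatever results arrive before the timeout.
func HybridSearch(ctx context.Context, query string, backends []SearchBackend, timeout time.Duration) []Document {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	// Buffered so late goroutines never block or leak after a timeout.
	results := make(chan []Document, len(backends))
	for _, b := range backends {
		go func(b SearchBackend) {
			docs, err := b.Search(ctx, query)
			if err != nil {
				results <- nil // a failed or timed-out backend contributes nothing
				return
			}
			results <- docs
		}(b)
	}

	// Gather one response per backend; the deadline bounds how long
	// any single slow backend can hold up the whole request.
	var combined []Document
	for range backends {
		select {
		case docs := <-results:
			combined = append(combined, docs...)
		case <-ctx.Done():
			return combined
		}
	}
	return combined
}
```

In practice the combined results are then re-ranked (e.g. with reciprocal rank fusion) before being handed to the LLM.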

This pattern reduces our total latency to that of the slowest backend, with a configurable timeout to prevent any single slow backend from bottlenecking the entire system. The results are collected via a Go channel and combined after all the goroutines have completed or timed out.

Response processing pipeline

LLM outputs often need multiple transformations before they're ready for end users. For example, if you're using an LLM provider that has great reasoning ability but doesn't yet support structured outputs (e.g. Claude 3.5 Sonnet), you'll likely want to specify the output structure in your prompt and parse the response before passing it to an end user.

We've built a composable pipeline that makes these transformations both maintainable and testable:
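Here's a sketch of the pipeline's core, with the Cleaner interface and Response struct as assumed names for the pattern:

```go
package pipeline

import "context"

// Response carries the LLM's text plus any structured details
// extracted along the way (e.g. sources for the frontend).
type Response struct {
	Text    string
	Details map[string]any
}

// Cleaner is a single, testable transformation applied to a response.
type Cleaner interface {
	Clean(ctx context.Context, resp *Response) (*Response, error)
}

// Pipeline runs a sequence of cleaners in order.
type Pipeline struct {
	cleaners []Cleaner
}

func NewPipeline(cleaners ...Cleaner) *Pipeline {
	return &Pipeline{cleaners: cleaners}
}

func (p *Pipeline) Run(ctx context.Context, resp *Response) (*Response, error) {
	var err error
	for _, c := range p.cleaners {
		if resp, err = c.Clean(ctx, resp); err != nil {
			return nil, err
		}
	}
	return resp, nil
}
```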

Each cleaner is a discrete unit that handles one specific transformation. This separation of concerns makes testing straightforward and allows us to modify individual transformations without touching the rest of the pipeline. Here's how we handle source citations:
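A simplified version of such a cleaner, reusing the Response type from the sketch above (the regex and the "sources" detail key are illustrative):

```go
package pipeline

import (
	"context"
	"fmt"
	"regexp"
)

// sourcePattern matches inline citations of the form [Source: path/to/doc].
var sourcePattern = regexp.MustCompile(`\[Source:\s*([^\]]+)\]`)

// SourceCitationCleaner replaces [Source: ...] markers with numbered
// references and records the cited documents in the response details.
type SourceCitationCleaner struct{}

func (SourceCitationCleaner) Clean(ctx context.Context, resp *Response) (*Response, error) {
	var sources []string
	index := map[string]int{} // path -> citation number, in order of first appearance

	resp.Text = sourcePattern.ReplaceAllStringFunc(resp.Text, func(match string) string {
		path := sourcePattern.FindStringSubmatch(match)[1]
		n, seen := index[path]
		if !seen {
			sources = append(sources, path)
			n = len(sources)
			index[path] = n
		}
		return fmt.Sprintf("[%d]", n)
	})

	if resp.Details == nil {
		resp.Details = map[string]any{}
	}
	resp.Details["sources"] = sources
	return resp, nil
}
```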

Using the above cleaner, when an LLM responds with:

According to [Source: docs/onboarding.pdf] and [Source: kb/troubleshooting.md], the API rate limit is 100 requests per minute for [Source: pricing.pdf] premium accounts.

The cleaner will parse the sources and pass them to the frontend as response details. It will also transform the raw LLM output into:

According to [1] and [2], the API rate limit is 100 requests per minute for [3] premium accounts.

Complementing with Python

While Go powers our production infrastructure, Python remains essential for ML experimentation and rapid prototyping. The Python ecosystem excels at tasks like generating embeddings, clustering, and other exploratory ML work.

These tasks would be significantly more complex in Go, where ML libraries are either non-existent or far less mature.

To bridge the Go / Python gap, we maintain a lightweight Python service that our Go infrastructure calls. This service handles computationally intensive ML tasks (like generating embeddings or clustering) while keeping our core infrastructure in Go. In practice, we often prototype features entirely in Python, then gradually port performance-critical components to Go once they're proven. This approach lets us ship improvements incrementally without waiting for a complete Go implementation.
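As a sketch of that boundary, here's a Go helper that calls a hypothetical /embed endpoint on the Python service; the endpoint path and payload shape are illustrative rather than the service's actual API:

```go
package ml

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// EmbedRequest and EmbedResponse mirror a hypothetical /embed endpoint
// exposed by the internal Python service.
type EmbedRequest struct {
	Texts []string `json:"texts"`
}

type EmbedResponse struct {
	Embeddings [][]float32 `json:"embeddings"`
}

// Embed asks the Python service to compute embeddings for a batch of texts.
func Embed(ctx context.Context, client *http.Client, baseURL string, texts []string) ([][]float32, error) {
	body, err := json.Marshal(EmbedRequest{Texts: texts})
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/embed", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("embedding service returned %s", resp.Status)
	}

	var out EmbedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Embeddings, nil
}
```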

Conclusion

Go's strengths in type safety, concurrency, and composable interfaces have made it an excellent choice for our LLM infrastructure. While Python remains our go-to language for ML development, Go provides the performance and reliability we need in production. The combination of both languages lets us move fast while maintaining a robust, scalable system.