
Components of a RAG System in Production

By David Maxson
March 26, 2024

Large language models (LLMs) have introduced a whole new world of opportunities. Suddenly, the barrier between machines and natural human communication has broken down, and ideas that were recently untenable have become almost trivially simple. It’s an exciting time, for sure. Real success, however, depends on wrapping this new technology with software that makes it truly useful. One pattern that has arisen to make that happen is Retrieval Augmented Generation (RAG).

This post will examine the RAG architecture, why this leads to an efficient system, and what production-ready capabilities exist today to deploy a RAG system within an enterprise.

To guide our discussion, we’ll work our way backward through the design of a RAG system. We’ll start with our end goal, and show how the various elements fall into place as we attempt to achieve that goal.

Diagram of a RAG system: a data preparation pipeline feeds a vector index; an orchestrator combines information from the vector index with a foundation LLM; finally, a wrapping application uses that orchestrator to add AI to the user experience.

The End-User Application

RAGs and LLMs are about more than just chatbots; they bridge the gap between human communication and automation, allowing us to build systems that work naturally with the ways humans already communicate with each other.

Let’s say we’re building a support ticketing system and want to auto-generate suggested responses. There are a lot of components to such an application. It should integrate tightly with our website, as well as with the internal management tools our support specialists use. It should adapt as the conversation with the customer continues. We should consider how the system’s suggestions are presented on the user interface, and we should be measuring how they impact our specialists’ response times and our customers’ overall experience. Much of this is fairly typical software development. The special sauce, of course, is generating useful suggestions that improve the experience for both our customers and our staff. That’s where RAG comes in.

Our overall application consists of many pieces. There may be databases, caches, frontend components, backend services, APIs, logging, dashboarding, etc. Somewhere, buried deep in our application, is a very small interface that accepts the ticket’s conversation thus far and returns some possible responses along with, perhaps, some links to useful references.

The logic behind this interface is very AI-heavy. It’s aware of LLMs, vector databases, and how they fit together. It’s probably best implemented in Python, and it probably uses common tools in the AI community to abstract away complexity. This code is heavily focused on AI, provides a simple interface to that AI, and can be safely ignorant of much of the rest of the application. In short, this block is probably best implemented as a microservice.
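To make that concrete, here is a minimal sketch of what such a microservice interface might look like, assuming FastAPI and a hypothetical generate_suggestions() helper that wraps the RAG logic described in the rest of this post.

```python
# A minimal sketch of the suggestion microservice, assuming FastAPI.
# generate_suggestions() is a hypothetical helper that wraps the RAG
# pipeline described in the rest of this post.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TicketThread(BaseModel):
    messages: list[str]  # the conversation so far, oldest message first

class SuggestionResponse(BaseModel):
    suggestions: list[str]  # candidate replies for the support specialist
    references: list[str]   # links to useful supporting documents

@app.post("/suggestions", response_model=SuggestionResponse)
def suggest(thread: TicketThread) -> SuggestionResponse:
    # All of the AI-heavy logic lives behind this one call.
    suggestions, references = generate_suggestions(thread.messages)  # hypothetical
    return SuggestionResponse(suggestions=suggestions, references=references)
```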

LLM Orchestration

If we make this mental leap and commit to keeping the AI-heavy logic somewhat isolated, we gain many benefits.

For starters, we may hardly need to write any code at all. The need to encapsulate AI orchestration logic is becoming so prevalent that a pre-packaged offering, such as LangServe, may suffice. The reasons for this are pretty simple: this part of an application usually presents a simple interface to the outside world, and several pesky requirements (heavy AI-specific dependencies, distinct scaling characteristics, rapid iteration on prompts and models) tend to promote isolating this code.

Whether we execute this logic in a microservice or not, it is likely to be such a distinct and isolatable portion of our software that we can think of it as an independent component.

In a RAG context specifically, this service has a couple of primary functions. There may be many other steps involved for the sake of safety, privacy, efficacy, or functionality, but there are at least two main steps somewhere in that pipeline: context retrieval, and augmented response generation.
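As a rough sketch of those two steps, with retrieve_context() and call_llm() as hypothetical stand-ins for your vector index client and LLM service:

```python
# A sketch of the two core orchestration steps. retrieve_context() and
# call_llm() are hypothetical stand-ins for your vector index client and
# your LLM service client.
def generate_answer(question: str) -> str:
    # Step 1: context retrieval. Fetch the most relevant knowledge nuggets.
    context_docs = retrieve_context(question, top_k=5)

    # Step 2: augmented response generation. Fold instructions, retrieved
    # context, and the user's question into a single prompt for the LLM.
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_docs) + "\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```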

Common umbrella solutions for this orchestration layer include frameworks such as LangChain (with its LangServe deployment tooling) and LlamaIndex.

Context Retrieval

Context retrieval addresses a few limitations of today’s AI models: their knowledge is frozen at training time, they know nothing about your private or domain-specific data, and their context windows are too small to carry everything they might need.

The RAG solution to these problems is to use the model’s limited context window to supply concise, task-specific knowledge. We encapsulate compact nuggets of potentially relevant knowledge in a database, use a search engine to fetch the most relevant nuggets at query time, and add that knowledge to the prompt provided to the LLM for its use in generating a response.

Data Index

Typically, in a RAG context, we use AI-powered vector embeddings and similarity search to identify documents that may be useful to our LLM. There are many such embedding algorithms, but the more important decision here is which search engine to use. Ultimately, the point of this system is to surface relevant knowledge so your LLM can generate the best possible responses.

When choosing a system for your search index, you need to consider its speed, scaling behavior, update patterns, ease of deployment, and many other factors. As a result, there are many options available, targeting the most common places you might already be storing your data.
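Whichever engine you choose, the core operation is the same: embed the query and find the stored vectors nearest to it. The sketch below illustrates that idea with a tiny in-memory index and cosine similarity, using a hypothetical embed() function in place of your chosen embedding model.

```python
# Illustration of what a vector index does under the hood: cosine similarity
# between a query embedding and stored document embeddings. embed() is a
# hypothetical wrapper around whatever embedding model you choose.
import numpy as np

documents = [
    "How to reset your password",
    "Troubleshooting failed payments",
    "Shipping times and tracking",
]
doc_vectors = np.array([embed(doc) for doc in documents])  # shape: (n_docs, dim)

def top_k(query: str, k: int = 2) -> list[str]:
    q = np.asarray(embed(query))
    # Cosine similarity is the dot product of L2-normalized vectors.
    doc_norms = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q_norm = q / np.linalg.norm(q)
    scores = doc_norms @ q_norm
    best = np.argsort(scores)[::-1][:k]  # indices of the k most similar documents
    return [documents[i] for i in best]
```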

Common solutions include dedicated vector databases (such as Pinecone, Weaviate, or Milvus) as well as vector search capabilities built into engines you may already run (such as OpenSearch, Elasticsearch, or PostgreSQL with pgvector).

Data Preparation

A critical part of the RAG pattern is getting data ingested into your search system so it’s available for the LLM later. There are a host of challenges here, some of which may or may not be addressed by your choice of search engine.

In a simple vector database, you may need to think about things like splitting documents into appropriately sized chunks, computing an embedding for each chunk, attaching metadata for filtering, keeping the index in sync as source documents change or are deleted, and avoiding duplicate entries.

More managed search engines may handle any number of these concerns for you.

Your specific needs will vary depending on where you store your data and what your RAG’s needs are. There is a large variety of tools that may help with building and maintaining your critical data pipelines to ensure your RAG has access to updated, relevant knowledge.
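As an illustration, a bare-bones ingestion pipeline might look like the sketch below, where chunk(), embed(), and index.upsert() are hypothetical stand-ins for your chunking strategy, embedding model, and search engine client.

```python
# A bare-bones data preparation pipeline. chunk(), embed(), and index.upsert()
# are hypothetical stand-ins for your chunking strategy, embedding model, and
# search engine client; real pipelines add retries, deduplication, and
# incremental updates.
def ingest_document(doc_id: str, text: str, metadata: dict) -> None:
    for i, piece in enumerate(chunk(text, max_tokens=500)):
        index.upsert(
            id=f"{doc_id}-{i}",                   # stable IDs make re-ingestion idempotent
            vector=embed(piece),                  # embedding used for similarity search
            payload={"text": piece, **metadata},  # keep raw text and filterable metadata
        )
```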

Augmented Response Generation

LLMs are, by their nature, stateless. They have knowledge trained into them, but otherwise they are simply functions that accept inputs and generate outputs without memory or side effects. (LLM Agents or Assistants are LLM-like applications that can use tools with side effects, but these are stateful services wrapped around stateless LLMs.)

To generate our response, we must now pass our instructions, context, and prompt to the LLM. This combined prompt provides the entirety of the task we wish the LLM to accomplish for us. Any state, such as previous messages in a thread, the topic of conversation, or the desired tone of the response, must be combined into this single prompt.
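Because the model is stateless, every call must carry everything it needs. Here is a minimal sketch of assembling such a request for a chat-style model; the exact message format depends on the model and client library you use.

```python
# Assembling a single, self-contained request for a stateless chat-style LLM.
# The message format here is illustrative; it varies by model and client.
def build_messages(instructions: str, context_docs: list[str],
                   history: list[dict], question: str) -> list[dict]:
    system = (
        instructions
        + "\n\nUse the following context when answering:\n"
        + "\n---\n".join(context_docs)
    )
    return (
        [{"role": "system", "content": system}]
        + history                                  # previous turns in the thread
        + [{"role": "user", "content": question}]  # the new question
    )
```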

Executing the LLM is a very complex process in its own right, and often requires specialized hardware and complex software to be efficient. As such, LLMs are often served as dedicated services apart from the rest of the application.

The LLM being used may be generic (such as OpenAI’s GPT-4 or Anthropic’s Claude) or it may be a customized or fine-tuned model. Custom models offer a variety of tradeoffs that are worth understanding and which may offer indispensable benefits in some contexts.

LLM Service

Serving an LLM is hard, but fortunately there are a lot of excellent options available. Given all the challenges around hosting and scaling LLMs, in an enterprise context it’s virtually always best to use a pre-packaged LLM serving solution, whether self-hosted or external.
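Many serving stacks, hosted or self-hosted, expose an OpenAI-compatible HTTP API, so from the orchestrator’s perspective the LLM is often just another web service. A hedged sketch, assuming such an endpoint at a hypothetical internal URL:

```python
# Calling an LLM served behind an OpenAI-compatible HTTP endpoint. The URL and
# model name here are hypothetical placeholders for your own deployment.
import requests

def call_llm(prompt: str) -> str:
    resp = requests.post(
        "http://llm.internal.example/v1/chat/completions",
        json={
            "model": "my-served-model",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```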

LLM Preparation

LLMs are trained like any other machine learning model, generalizing their training data into “knowledge” that is useful for solving future tasks.

LLMs are large and rely on some of the largest datasets in existence to achieve their incredible results. Training state-of-the-art models from scratch can cost many millions of dollars, but fine-tuning pre-trained models can cost pennies and yield meaningful improvement. However, doing so also introduces technical debt, maintenance overhead, and security considerations, so fine-tuning’s benefits should be weighed against its holistic costs.

In general, the service you use to host your LLM will probably also offer a service for fine-tuning that same LLM; for example, OpenAI provides a fine-tuning API for its hosted models, and managed platforms such as Amazon SageMaker support fine-tuning open models you host yourself.
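Whichever service you choose, fine-tuning usually starts with assembling example conversations into a simple training file, commonly JSON Lines. A sketch of preparing such a dataset from historical tickets (the field names and schema here are illustrative and vary by provider):

```python
# Preparing a fine-tuning dataset from historical tickets as JSON Lines.
# The "messages" schema shown here is a common convention, but the exact
# format required varies by provider; the ticket fields are hypothetical.
import json

def write_training_file(tickets: list[dict], path: str) -> None:
    with open(path, "w") as f:
        for ticket in tickets:
            record = {
                "messages": [
                    {"role": "system", "content": "You are a helpful support agent."},
                    {"role": "user", "content": ticket["question"]},
                    {"role": "assistant", "content": ticket["approved_response"]},
                ]
            }
            f.write(json.dumps(record) + "\n")
```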

Other Considerations

Prompt Sanitization

We may want to introduce checks to ensure users don’t submit abusive queries, such as ones designed to get our LLM to produce inappropriate outputs. This kind of filter can be readily added to our LLM Orchestration layer as a pre-processing step.

Response Sanitization

We may also want to check that our model doesn’t generate inappropriate outputs. This can be checked by the LLM Orchestration layer as a post-processing step.
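Both checks can be thin wrappers around the core pipeline. A sketch, where is_abusive() and is_inappropriate() are hypothetical classifiers (a keyword list, a moderation API, or a separate model):

```python
# Pre- and post-processing guards around the core RAG call. is_abusive() and
# is_inappropriate() are hypothetical checks (a keyword filter, a moderation
# API, or a separate classifier model).
def guarded_answer(question: str) -> str:
    if is_abusive(question):              # prompt sanitization (pre-processing)
        return "Sorry, I can't help with that request."

    answer = generate_answer(question)    # the core RAG pipeline sketched earlier

    if is_inappropriate(answer):          # response sanitization (post-processing)
        return "Sorry, I couldn't produce a suitable answer. Please contact support."
    return answer
```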

Data Access Controls

Especially in enterprise contexts, the user making the query almost always has limits on what knowledge they should have access to. These limits affect both which context we should be allowed to fetch for the LLM and, if some LLMs were trained with protected data, which LLMs we should be allowed to use to generate our response. The LLM Orchestration layer should be built with these limitations in mind, applying appropriate pre-filters to the vector database and selecting the most appropriate LLM for each query.
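In practice, this often means attaching a metadata filter to every retrieval call and routing each request to a permitted model. A sketch, with user_permissions(), index.search(), choose_model(), and call_llm_with() as hypothetical pieces of your own stack:

```python
# Enforcing data access controls in the orchestration layer. user_permissions(),
# index.search(), choose_model(), and call_llm_with() are hypothetical pieces
# of your own stack; the filter syntax depends on your search engine.
def retrieve_for_user(user_id: str, query: str, top_k: int = 5) -> list[str]:
    allowed_groups = user_permissions(user_id)  # e.g. ["public", "support-team"]
    # Pre-filter: only return documents tagged with a group the user belongs to.
    return index.search(
        vector=embed(query),
        top_k=top_k,
        filter={"access_group": allowed_groups},
    )

def answer_for_user(user_id: str, question: str) -> str:
    context = retrieve_for_user(user_id, question)
    # Route to a model this user is permitted to use (e.g. avoid models
    # fine-tuned on data the user cannot see).
    model = choose_model(user_id)
    return call_llm_with(model, context, question)
```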

Diagnostics

LLMs generally give non-deterministic responses (i.e. they may respond differently even when given the same prompt). For the sake of diagnosing mistakes and improving system performance over time, it’s useful to record the model’s inputs and outputs. There may be several stages worth recording, such as the context consolidated before the final answer is generated, or changes made to prompt templates over time.
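A lightweight way to start is to log one structured record per request. A sketch using only the standard library; the fields captured are illustrative:

```python
# Recording each request's inputs and outputs for later diagnosis. The fields
# captured here are illustrative; record whatever your pipeline produces.
import json
import logging
import time
import uuid

logger = logging.getLogger("rag.diagnostics")

def log_interaction(question: str, context_docs: list[str],
                    prompt: str, answer: str, prompt_version: str) -> None:
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt_version": prompt_version,   # track prompt-template changes over time
        "question": question,
        "retrieved_context": context_docs,  # the context consolidated before answering
        "final_prompt": prompt,
        "answer": answer,
    }))
```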

Budgeting and Throughput Management

While scaling web services is a solved problem in many cases, the extremely specialized requirements of LLMs mean that, in the short term, throughput and rate limits may be a serious issue for large-scale usage. In an enterprise context, when developing a RAG system, or any other LLM-powered system for that matter, it’s important to carefully plan and monitor the system as you scale up. Make sure you’re tracking your cost budget, available infrastructure, API rate limits, and token limits to ensure your system will continue functioning at production scale.
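Even a simple in-process tracker can surface problems before they become outages. A sketch of tracking token usage against a daily budget, with placeholder numbers for the limit and per-token cost:

```python
# A simple token budget tracker. The limit and per-token cost below are
# placeholder numbers; use your provider's actual pricing and rate limits.
import time

class TokenBudget:
    def __init__(self, daily_token_limit: int = 5_000_000,
                 cost_per_1k_tokens: float = 0.01):
        self.daily_token_limit = daily_token_limit
        self.cost_per_1k_tokens = cost_per_1k_tokens
        self.window_start = time.time()
        self.tokens_used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        if time.time() - self.window_start > 86_400:  # reset the window each day
            self.window_start = time.time()
            self.tokens_used = 0
        self.tokens_used += prompt_tokens + completion_tokens

    def estimated_cost(self) -> float:
        return self.tokens_used / 1000 * self.cost_per_1k_tokens

    def near_limit(self, threshold: float = 0.9) -> bool:
        # Alert before the daily budget is exhausted.
        return self.tokens_used >= threshold * self.daily_token_limit
```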

Conclusion

In this article, we’ve reviewed the core design elements of a RAG system and noted what solutions, regardless of cloud, are available to address these needs. We’ve explored the role that each system plays and how each element may encapsulate different project requirements.

Rearc provides services to satisfy bespoke LLM, AI, and MLOps requirements in complicated enterprise contexts like financial services and healthcare. We bring a strong Cloud and DevOps background, so you can trust that your solutions are scalable and maintainable. If you have any enterprise AI requirements you need help with, just reach out to us at ai@rearc.io for consultation.