
Components of a RAG System in Production

By David Maxson
March 26, 2024

Large language models (LLMs) have introduced a whole new world of opportunities. Suddenly, the barrier between machines and natural human communication has broken down, and ideas that were recently untenable have become almost trivially simple. It’s an exciting time, for sure. Real success, however, depends on wrapping this new technology with software that makes it truly useful. One pattern that has arisen to make that happen is Retrieval Augmented Generation (RAG).

This post will examine the RAG architecture, why this leads to an efficient system, and what production-ready capabilities exist today to deploy a RAG system within an enterprise.

To guide our discussion, we’ll work our way backward through the design of a RAG system. We’ll start with our end goal, and show how the various elements fall into place as we attempt to achieve that goal.

Diagram of a RAG system: a data preparation pipeline feeds a vector index; an orchestrator combines information from the vector index with a foundation LLM; finally, a wrapping application uses that orchestrator to add AI to the user experience.

The End-User Application

RAGs and LLMs are about more than just chatbots; they bridge the gap between human communication and automation, allowing us to build systems that work naturally with the ways humans already communicate with each other.

Let’s say we’re building a support ticketing system and want to auto-generate suggested responses. There are a lot of components to such an application. It should integrate tightly with our website, as well as with the internal management tools our support specialists use. It should adapt as the conversation with the customer continues. We should consider how the system’s suggestions are presented on the user interface, and we should be measuring how they impact our specialists’ response times and our customers’ overall experience. Much of this is fairly typical software development. The special sauce, of course, is generating useful suggestions that improve the experience for both our customers and our staff. That’s where RAG comes in.

Our overall application consists of many pieces. There may be databases, caches, frontend components, backend services, APIs, logging, dashboarding, etc. Somewhere, buried deep in our application, is a very small interface that accepts the ticket’s conversation thus far and returns some possible responses along with, perhaps, some links to useful references.

The logic behind this interface is very AI-heavy. It’s aware of LLMs, vector databases, and how they fit together. It’s probably best implemented in Python, and it probably uses common tools in the AI community to abstract away complexity. This code is heavily focused on AI, provides a simple interface to that AI, and can be safely ignorant of much of the rest of the application. In short, this block is probably best implemented as a microservice.
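To make that concrete, here is a minimal sketch of what such a microservice interface might look like, assuming FastAPI and a hypothetical generate_suggestions() helper that wraps the RAG logic described in the rest of this post.

```python
# A minimal sketch of the suggestion microservice, assuming FastAPI.
# generate_suggestions() is a hypothetical helper that wraps the RAG
# pipeline described in the rest of this post.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TicketThread(BaseModel):
    messages: list[str]  # the conversation so far, oldest message first

class SuggestionResponse(BaseModel):
    suggestions: list[str]  # candidate replies for the support specialist
    references: list[str]   # links to useful supporting documents

@app.post("/suggestions", response_model=SuggestionResponse)
def suggest(thread: TicketThread) -> SuggestionResponse:
    # All of the AI-heavy logic lives behind this one call.
    suggestions, references = generate_suggestions(thread.messages)  # hypothetical
    return SuggestionResponse(suggestions=suggestions, references=references)
```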

LLM Orchestration

If we make this mental leap and commit to keeping the AI-heavy logic somewhat isolated, we gain many benefits.

For starters, we may hardly need to write any code at all. The need to encapsulate AI orchestration logic is becoming so prevalent that a pre-packaged offering, such as LangServe, may suffice. The reasons for this are pretty simple: this part of an application usually presents a simple interface to the outside world, and several pesky requirements (heavy AI-specific dependencies, distinct scaling characteristics, rapid iteration on prompts and models) tend to promote isolating this code.

Whether we execute this logic in a microservice or not, it is likely to be such a distinct and isolatable portion of our software that we can think of it as an independent component.

In a RAG context specifically, this service has a couple of primary functions. There may be many other steps involved for the sake of safety, privacy, efficacy, or functionality, but there are at least two main steps somewhere in that pipeline: context retrieval, and augmented response generation.
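As a rough sketch of those two steps, with retrieve_context() and call_llm() as hypothetical stand-ins for your vector index client and LLM service:

```python
# A sketch of the two core orchestration steps. retrieve_context() and
# call_llm() are hypothetical stand-ins for your vector index client and
# your LLM service client.
def generate_answer(question: str) -> str:
    # Step 1: context retrieval. Fetch the most relevant knowledge nuggets.
    context_docs = retrieve_context(question, top_k=5)

    # Step 2: augmented response generation. Fold instructions, retrieved
    # context, and the user's question into a single prompt for the LLM.
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_docs) + "\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```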

Common umbrella solutions for this orchestration layer include frameworks such as LangChain (with its LangServe deployment tooling) and LlamaIndex.

Context Retrieval

Context retrieval addresses a few limitations of today’s AI models: their knowledge is frozen at training time, they know nothing about your private or domain-specific data, and their context windows are too small to carry everything they might need.

The RAG solution to these problems is to use the model’s limited context window to supply concise, task-specific knowledge. We encapsulate compact nuggets of potentially relevant knowledge in a database, use a search engine to fetch the most relevant nuggets at query time, and add that knowledge to the prompt provided to the LLM for its use in generating a response.

Data Index

Typically, in a RAG context, we use AI-powered vector embeddings and similarity search to identify documents that may be useful to our LLM. There are many such embedding algorithms, but the more important decision here is which search engine to use. Ultimately, the point of this system is to surface relevant knowledge so your LLM can generate the best possible responses.

When choosing a system for your search index, you need to consider its speed, scaling behavior, update patterns, ease of deployment, and many other factors. As a result, there are many options available, targeting the most common places you might already be storing your data.
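Whichever engine you choose, the core operation is the same: embed the query and find the stored vectors nearest to it. The sketch below illustrates that idea with a tiny in-memory index and cosine similarity, using a hypothetical embed() function in place of your chosen embedding model.

```python
# Illustration of what a vector index does under the hood: cosine similarity
# between a query embedding and stored document embeddings. embed() is a
# hypothetical wrapper around whatever embedding model you choose.
import numpy as np

documents = [
    "How to reset your password",
    "Troubleshooting failed payments",
    "Shipping times and tracking",
]
doc_vectors = np.array([embed(doc) for doc in documents])  # shape: (n_docs, dim)

def top_k(query: str, k: int = 2) -> list[str]:
    q = np.asarray(embed(query))
    # Cosine similarity is the dot product of L2-normalized vectors.
    doc_norms = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q_norm = q / np.linalg.norm(q)
    scores = doc_norms @ q_norm
    best = np.argsort(scores)[::-1][:k]  # indices of the k most similar documents
    return [documents[i] for i in best]
```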

Common solutions include dedicated vector databases (such as Pinecone, Weaviate, or Milvus) as well as vector search capabilities built into engines you may already run (such as OpenSearch, Elasticsearch, or PostgreSQL with pgvector).

Data Preparation

A critical part of the RAG pattern is getting data ingested into your search system so it’s available for the LLM later. There are a host of challenges here, some of which may or may not be addressed by your choice of search engine.

In a simple vector database, you may need to think about things like splitting documents into appropriately sized chunks, computing an embedding for each chunk, attaching metadata for filtering, keeping the index in sync as source documents change or are deleted, and avoiding duplicate entries.

More managed search engines may handle any number of these concerns for you.

Your specific needs will vary depending on where you store your data and what your RAG’s needs are. There is a large variety of tools that may help with building and maintaining your critical data pipelines to ensure your RAG has access to updated, relevant knowledge.
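As an illustration, a bare-bones ingestion pipeline might look like the sketch below, where chunk(), embed(), and index.upsert() are hypothetical stand-ins for your chunking strategy, embedding model, and search engine client.

```python
# A bare-bones data preparation pipeline. chunk(), embed(), and index.upsert()
# are hypothetical stand-ins for your chunking strategy, embedding model, and
# search engine client; real pipelines add retries, deduplication, and
# incremental updates.
def ingest_document(doc_id: str, text: str, metadata: dict) -> None:
    for i, piece in enumerate(chunk(text, max_tokens=500)):
        index.upsert(
            id=f"{doc_id}-{i}",                   # stable IDs make re-ingestion idempotent
            vector=embed(piece),                  # embedding used for similarity search
            payload={"text": piece, **metadata},  # keep raw text and filterable metadata
        )
```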

Augmented Response Generation

LLMs are, by their nature, stateless. They have knowledge trained into them, but otherwise they are simply functions that accept inputs and generate outputs without memory or side effects. (LLM Agents or Assistants are LLM-like applications that can use tools with side effects, but these are stateful services wrapped around stateless LLMs.)

To generate our response, we must now pass our instructions, context, and prompt to the LLM. This combined prompt provides the entirety of the task we wish the LLM to accomplish for us. Any state, such as previous messages in a thread, the topic of conversation, or the desired tone of the response, must be combined into this single prompt.
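Because the model is stateless, every call must carry everything it needs. Here is a minimal sketch of assembling such a request for a chat-style model; the exact message format depends on the model and client library you use.

```python
# Assembling a single, self-contained request for a stateless chat-style LLM.
# The message format here is illustrative; it varies by model and client.
def build_messages(instructions: str, context_docs: list[str],
                   history: list[dict], question: str) -> list[dict]:
    system = (
        instructions
        + "\n\nUse the following context when answering:\n"
        + "\n---\n".join(context_docs)
    )
    return (
        [{"role": "system", "content": system}]
        + history                                  # previous turns in the thread
        + [{"role": "user", "content": question}]  # the new question
    )
```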

Executing the LLM is a very complex process in its own right, and often requires specialized hardware and complex software to be efficient. As such, LLMs are often served as dedicated services apart from the rest of the application.

The LLM being used may be generic (such as OpenAI’s GPT-4 or Anthropic’s Claude) or it may be a customized or fine-tuned model. Custom models offer a variety of tradeoffs that are worth understanding and which may offer indispensable benefits in some contexts.

LLM Service

Serving an LLM is hard, but fortunately there are a lot of excellent options available. Given all the challenges around hosting and scaling LLMs, in an enterprise context it’s virtually always best to use a pre-packaged LLM serving solution, whether self-hosted or external.
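Many serving stacks, hosted or self-hosted, expose an OpenAI-compatible HTTP API, so from the orchestrator’s perspective the LLM is often just another web service. A hedged sketch, assuming such an endpoint at a hypothetical internal URL:

```python
# Calling an LLM served behind an OpenAI-compatible HTTP endpoint. The URL and
# model name here are hypothetical placeholders for your own deployment.
import requests

def call_llm(prompt: str) -> str:
    resp = requests.post(
        "http://llm.internal.example/v1/chat/completions",
        json={
            "model": "my-served-model",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```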

LLM Preparation

LLMs are trained like any other machine learning model, generalizing their training data into “knowledge” that is useful for solving future tasks.

LLMs are large and rely on some of the largest datasets in existence to achieve their incredible results. Training state-of-the-art models from scratch can cost many millions of dollars, but fine-tuning pre-trained models can cost pennies and yield meaningful improvement. However, doing so also introduces technical debt, maintenance overhead, and security considerations, so fine-tuning’s benefits should be weighed against its holistic costs.

In general, the service you use to host your LLM will probably also offer a service for fine-tuning that same LLM; for example, OpenAI provides a fine-tuning API for its hosted models, and managed platforms such as Amazon SageMaker support fine-tuning open models you host yourself.
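Whichever service you choose, fine-tuning usually starts with assembling example conversations into a simple training file, commonly JSON Lines. A sketch of preparing such a dataset from historical tickets (the field names and schema here are illustrative and vary by provider):

```python
# Preparing a fine-tuning dataset from historical tickets as JSON Lines.
# The "messages" schema shown here is a common convention, but the exact
# format required varies by provider; the ticket fields are hypothetical.
import json

def write_training_file(tickets: list[dict], path: str) -> None:
    with open(path, "w") as f:
        for ticket in tickets:
            record = {
                "messages": [
                    {"role": "system", "content": "You are a helpful support agent."},
                    {"role": "user", "content": ticket["question"]},
                    {"role": "assistant", "content": ticket["approved_response"]},
                ]
            }
            f.write(json.dumps(record) + "\n")
```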

Other Considerations

Prompt Sanitization

We may want to introduce checks to ensure users don’t submit abusive queries, such as ones designed to get our LLM to produce inappropriate outputs. This kind of filter can be readily added to our LLM Orchestration layer as a pre-processing step.

Response Sanitization

We may also want to check that our model doesn’t generate inappropriate outputs. This can be checked by the LLM Orchestration layer as a post-processing step.
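Both checks can be thin wrappers around the core pipeline. A sketch, where is_abusive() and is_inappropriate() are hypothetical classifiers (a keyword list, a moderation API, or a separate model):

```python
# Pre- and post-processing guards around the core RAG call. is_abusive() and
# is_inappropriate() are hypothetical checks (a keyword filter, a moderation
# API, or a separate classifier model).
def guarded_answer(question: str) -> str:
    if is_abusive(question):              # prompt sanitization (pre-processing)
        return "Sorry, I can't help with that request."

    answer = generate_answer(question)    # the core RAG pipeline sketched earlier

    if is_inappropriate(answer):          # response sanitization (post-processing)
        return "Sorry, I couldn't produce a suitable answer. Please contact support."
    return answer
```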

Data Access Controls

Especially in enterprise contexts, the user making the query almost always has limits on what knowledge they should have access to. These limits affect both which context we should be allowed to fetch for the LLM and, if some LLMs were trained with protected data, which LLMs we should be allowed to use to generate our response. The LLM Orchestration layer should be built with these limitations in mind, applying appropriate pre-filters to the vector database and selecting the most appropriate LLM for each query.
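In practice, this often means attaching a metadata filter to every retrieval call and routing each request to a permitted model. A sketch, with user_permissions(), index.search(), choose_model(), and call_llm_with() as hypothetical pieces of your own stack:

```python
# Enforcing data access controls in the orchestration layer. user_permissions(),
# index.search(), choose_model(), and call_llm_with() are hypothetical pieces
# of your own stack; the filter syntax depends on your search engine.
def retrieve_for_user(user_id: str, query: str, top_k: int = 5) -> list[str]:
    allowed_groups = user_permissions(user_id)  # e.g. ["public", "support-team"]
    # Pre-filter: only return documents tagged with a group the user belongs to.
    return index.search(
        vector=embed(query),
        top_k=top_k,
        filter={"access_group": allowed_groups},
    )

def answer_for_user(user_id: str, question: str) -> str:
    context = retrieve_for_user(user_id, question)
    # Route to a model this user is permitted to use (e.g. avoid models
    # fine-tuned on data the user cannot see).
    model = choose_model(user_id)
    return call_llm_with(model, context, question)
```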

Diagnostics

LLMs generally give non-deterministic responses (i.e. they may respond differently even when given the same prompt). For the sake of diagnosing mistakes and improving system performance over time, it’s useful to record the model’s inputs and outputs. There may be several stages worth recording, such as the context consolidated before the final answer is generated, or changes made to prompt templates over time.
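A lightweight way to start is to log one structured record per request. A sketch using only the standard library; the fields captured are illustrative:

```python
# Recording each request's inputs and outputs for later diagnosis. The fields
# captured here are illustrative; record whatever your pipeline produces.
import json
import logging
import time
import uuid

logger = logging.getLogger("rag.diagnostics")

def log_interaction(question: str, context_docs: list[str],
                    prompt: str, answer: str, prompt_version: str) -> None:
    logger.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt_version": prompt_version,   # track prompt-template changes over time
        "question": question,
        "retrieved_context": context_docs,  # the context consolidated before answering
        "final_prompt": prompt,
        "answer": answer,
    }))
```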

Budgeting and Throughput Management

While scaling web services is a solved problem in many cases, the extremely specialized requirements of LLMs mean that, in the short term, throughput and rate limits may be a serious issue for large-scale usage. In an enterprise context, when developing a RAG system, or any other LLM-powered system for that matter, it’s important to carefully plan and monitor the system as you scale up. Make sure you’re tracking your cost budget, available infrastructure, API rate limits, and token limits to ensure your system will continue functioning at production scale.
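Even a simple in-process tracker can surface problems before they become outages. A sketch of tracking token usage against a daily budget, with placeholder numbers for the limit and per-token cost:

```python
# A simple token budget tracker. The limit and per-token cost below are
# placeholder numbers; use your provider's actual pricing and rate limits.
import time

class TokenBudget:
    def __init__(self, daily_token_limit: int = 5_000_000,
                 cost_per_1k_tokens: float = 0.01):
        self.daily_token_limit = daily_token_limit
        self.cost_per_1k_tokens = cost_per_1k_tokens
        self.window_start = time.time()
        self.tokens_used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        if time.time() - self.window_start > 86_400:  # reset the window each day
            self.window_start = time.time()
            self.tokens_used = 0
        self.tokens_used += prompt_tokens + completion_tokens

    def estimated_cost(self) -> float:
        return self.tokens_used / 1000 * self.cost_per_1k_tokens

    def near_limit(self, threshold: float = 0.9) -> bool:
        # Alert before the daily budget is exhausted.
        return self.tokens_used >= threshold * self.daily_token_limit
```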

Conclusion

In this article, we’ve reviewed the core design elements of a RAG system and noted what solutions, regardless of cloud, are available to address these needs. We’ve explored the role that each system plays and how each element may encapsulate different project requirements.

Rearc provides services to satisfy bespoke LLM, AI, and MLOps requirements in complicated enterprise contexts like financial services and healthcare. We bring a strong Cloud and DevOps background, so you can trust that your solutions are scalable and maintainable. If you have any enterprise AI requirements you need help with, just reach out to us at ai@rearc.io for consultation.