All-in-One RAGs with Databricks

By David Maxson

May 31, 2024

Previously, we saw the various components required to build a RAG LLM system. In that article, we discussed both how a RAG system comes together and the most common services available to fill each component of that system.

In this article, we’ll focus solely on using Databricks to build out RAG. Why Databricks?

We’re fans of the platform. We find it one of the most complete, powerful, and useful data platforms available in the cloud. We recommend Databricks frequently to our consulting clients because we believe it solves their needs with performance and price that will make their lives better.
Databricks has been expanding its excellent foundational data capabilities into the AI domain. While they’ve made acquisitions and strategic shifts into the GenAI domain in recent years, they had been integrating more traditional MLOps tooling into their platform for years before that, investing in broadly useful tools like experiment management and model lifecycle management.
Databricks has been pushing hard in the last couple of years to be a one-stop shop for LLMs. That includes model hosting, embeddings, vector search, model monitoring, and more. They do all this while maintaining their ability to run on all three of the major clouds (AWS, GCP, and Azure).

Some of Databricks’ features are newer (e.g. their recently released DBRX LLM, and their Vector Search public preview) and have yet to stand the test of time. But in all fairness, most other offerings in these domains are also new and iterating rapidly, and no obvious winners have emerged yet (as noted in the previous post). While I’ll call out what’s new and what’s established in the Databricks ecosystem, these components are mature enough to show how to put all these pieces together.

As an aside, while I may write similar articles about each cloud provider (AWS, etc.) in the future, Databricks stands out because of its cross-platform emphasis and open-source roots. While not everything they’ve done is open source, most of it is. As such, many of the tools I’ll be talking about could be run on-premises. This makes it much easier for me to recommend this ecosystem without fear of vendor lock-in.

Architecture Review

As a review, RAG is a system of services built around modern LLMs to provide several advantages over using those LLMs directly:

Domain-specific knowledge
Knowledge and query distillation
A lightweight wrapper service that can add security, monitoring, and other features

Diagram of a RAG system. A data preparation pipeline feeds a vector index. An orchestrator combines information from the vector index with a foundation LLM. Finally, a wrapping application uses that orchestrator to add AI to the user experience.

In short, the core logic of RAG is as follows:

Receive natural language input
Retrieve contextual knowledge
Augment our input with that knowledge
Generate our response
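The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the retriever and the generator are passed in as plain functions, and the two stand-ins at the bottom (fake_retrieve and fake_generate) are hypothetical placeholders for a real vector index and a real LLM.

```python
from typing import Callable, List


def build_prompt(question: str, passages: List[str]) -> str:
    """Augment the user's question with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


def rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Receive input, retrieve knowledge, augment the input, generate a response."""
    passages = retrieve(question)              # retrieve contextual knowledge
    prompt = build_prompt(question, passages)  # augment our input
    return generate(prompt)                    # generate our response


# Hypothetical stand-ins for a real vector index and a real LLM.
def fake_retrieve(question: str) -> List[str]:
    return ["Delta tables support rewinding to earlier versions."]


def fake_generate(prompt: str) -> str:
    return f"[model output for a prompt of {len(prompt)} characters]"


answer = rag_answer("Can I roll back a Delta table?", fake_retrieve, fake_generate)
```

Because the retriever and generator are just parameters, each can be swapped for a hosted service without touching the core loop.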

Knowledge Base

Knowledge Preparation

Preparing your knowledge base is probably where Databricks shines most since this is exactly the kind of work the platform has been built for from day one. Between Spark’s vast feature set and all the additions Databricks brings, you have a slew of options available for converting your domain-specific knowledge from raw data into useful chunks.

Spark, of course, is the star of the show here. Whether you use it via Java, Scala, Python, SQL, or any of a host of other languages, Spark is an industry standard for processing big data with support for structured and semi-structured data, batch or streaming flows, and much more. Making Spark easy to use is Databricks’ bread and butter; there are few platforms out there that make it easier to process data of any size from whatever structure it starts with into whatever structure you want.
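At its core, turning raw documents into “useful chunks” is a text-splitting function applied across a dataset. Here is a minimal sketch of one such function, assuming a simple word-count chunk size with overlap; the function name and default sizes are my own illustrative choices, and real pipelines often split on sentences, headings, or tokens instead. In Spark, a function like this would typically be applied as a UDF, which is omitted here to keep the example self-contained.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_text(sample, chunk_size=4, overlap=1)
```

The overlap means neighboring chunks share a few words, which helps avoid cutting a relevant sentence in half at a chunk boundary.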

Beyond bare Spark, Databricks also offers a concept of “live tables” (Delta Live Tables), which are effectively smart materialized views where Databricks handles the infrastructure and Spark handles the compute. Because live pipelines accept both SQL and Python code and are built on top of Spark Structured Streaming, this is an incredibly potent solution for maintaining downstream derivatives of upstream data. It is powered in part by Databricks’ open-source Delta table format, which adds features like rewinding table versions, change logging, cloud-agnostic storage, efficient merging of new data, and much more.

For observability, management, and security, we have Databricks’ Unity Catalog. This, and its associated security model, bridges raw data and compute with the security and oversight realities of managing data and its derivatives in enterprise contexts. The combination of Unity Catalog with Delta tables also makes cross-platform no-copy data sharing a practical reality.

Databricks also offers a variety of other features to encourage you to use Spark in their environment.

Photon runtime for higher performance.
AI functions for easy interactions with common AI tasks.
And, just generally, the fact that the Databricks platform brings all of the above tools together with little hassle.

There are innumerable other discussions about data processing with Spark and Databricks which I don’t need to duplicate; suffice it to say, Databricks is one of the easiest and most potent places to manage your data.

But there’s much more we’ll need for RAG.

Knowledge Indexing and Search

To integrate our knowledge data with our RAG system, we’ll need a way to index our data and retrieve it efficiently. As discussed in the previous post, what we need is a search engine. In a RAG context, however, it’s common practice to piece this together ourselves as a simple semantic search engine, which requires two components: a vector database, and an embedding function.

The vector database must support adding new records (indexing) and searching for similar records (retrieval). Databricks offers this functionality with its Vector Search offering. The docs note that the indexing and retrieval steps are two independent components. The index is a derivative of a Delta table and is managed in Unity Catalog like any other table, with an automatic update mechanism similar to Delta Live Tables. We then attach this index to a search endpoint, which provides the compute and API needed to retrieve documents from this index.
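To make the two operations concrete, here is a toy in-memory index that supports indexing (upsert) and retrieval (search) by cosine similarity. To be clear, this is not the Databricks Vector Search API; the class and method names are hypothetical, and a managed service adds persistence, approximate search, and automatic syncing from the source table.

```python
import math
from typing import Dict, List, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def upsert(self, doc_id: str, vector: List[float]) -> None:
        """Indexing: add a record, or overwrite it if the id already exists."""
        self._vectors[doc_id] = vector

    def search(self, query: List[float], k: int = 3) -> List[Tuple[str, float]]:
        """Retrieval: return the k records most similar to the query vector."""
        scored = [(doc_id, _cosine(query, vec)) for doc_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = TinyVectorIndex()
index.upsert("spark", [1.0, 0.0, 0.0])
index.upsert("delta", [0.0, 1.0, 0.0])
index.upsert("mlflow", [0.0, 0.0, 1.0])
top = index.search([0.9, 0.1, 0.0], k=2)
```

This brute-force search compares the query against every record, which is fine for a demonstration but is exactly the part a real vector database replaces with an approximate nearest-neighbor structure.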

For generating the vectors, you can either bring your own solution (e.g. by relying on a third-party API) or, preferably, use Databricks to host the embedding model, which is easy to automate with the rest of this system. This embedding service is responsible for loading and executing the embedding model which converts text into vectors. Importantly, this service must be available both at indexing time (to generate vectors for your searchable documents) and at search time (to generate a vector for your query text).
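The point about needing the same embedding at indexing time and at search time is easy to see in code. The sketch below uses a toy hashed bag-of-words function in place of a real embedding model (it captures shared words, not meaning), and the documents and query are made up for illustration. What matters is that a single embed function produces both the stored vectors and the query vector.

```python
import hashlib
import math
from typing import Dict, List

DIM = 1024  # toy dimensionality, chosen arbitrarily for this sketch


def embed(text: str) -> List[float]:
    """Toy embedding: hashed bag of words, L2-normalized. Not a real model."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def similarity(a: List[float], b: List[float]) -> float:
    """Dot product, which equals cosine similarity for normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


# Indexing time: embed every document once and store the vectors.
documents: Dict[str, str] = {
    "doc1": "delta tables support time travel",
    "doc2": "photon speeds up spark queries",
}
doc_vectors = {doc_id: embed(text) for doc_id, text in documents.items()}

# Search time: embed the query with the SAME function, then compare.
query_vector = embed("how do delta tables work")
best = max(doc_vectors, key=lambda doc_id: similarity(query_vector, doc_vectors[doc_id]))
```

If the query were embedded with a different model than the documents, the two sets of vectors would live in unrelated spaces and the similarity scores would be meaningless.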

Alright, with all these systems in place, we have everything we need for a near-real-time searchable index of our knowledge. The next component we’ll need to put our RAG together will be LLMs for generating outputs.

LLM

In our previous post, we discussed the wide variety of LLMs available. That list is likely to need updating in the next few months as the various well-funded LLM companies strive to outperform each other. What doesn’t change as fast is the interface to those LLMs.

LLMs have historically been unimodal: they accepted text and generated text. With the expansion of Generative AI, however, this is changing rapidly. Models are becoming multi-modal, integrating the ability to both ingest and generate images, audio, and video within a single model. This domain is still developing rapidly, and new standards will arise with time. In the meantime, the patterns around text are sufficiently stable that most LLMs can be swapped for each other with minimal difficulty.

What this means for our situation is that, for integrating LLMs into real-world applications, we care more about choosing an interface that will remain stable and grow easily over time than about which models we have access to at this instant. Better models will come and go, but we don’t want to rewrite our application every time a better LLM comes out somewhere. Databricks, therefore, has focused on providing a platform and interface to LLMs and then partnering with various external companies to ensure their models are available behind that interface.
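In application code, choosing a stable interface usually means depending on a small abstraction rather than on any one model. Here is a minimal sketch of that idea using a Python Protocol. The interface name, its single complete method, and both backends are hypothetical placeholders; a real backend would call a hosted model behind the same method.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only thing the application needs to know about an LLM."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Placeholder backend that returns the prompt with a prefix."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ShoutingModel:
    """A second placeholder backend with different behavior."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize(model: TextModel, document: str) -> str:
    """Application logic written against the interface, not a specific model."""
    return model.complete(f"Summarize in one sentence: {document}")


first = summarize(EchoModel(), "Databricks hosts LLMs.")
second = summarize(ShoutingModel(), "Databricks hosts LLMs.")
```

Swapping models then becomes a one-line change where the backend is constructed, and summarize never needs to be rewritten.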

Let’s dive in and see what Databricks makes readily available from within their ecosystem.

Accessing Common Models

For many applications–RAG especially–we can do just fine with general-purpose LLMs. Building an initial LLM takes tens or even hundreds of millions of dollars of compute resources, not even including the expertise required to design and manage the models and their training. The resulting models generalize extraordinarily well, and many of their remaining shortcomings can be well resolved by wrapping them in well-designed software, such as the RAG architecture. Thus, let’s first look at how to access general-purpose LLMs within Databricks.

These core models are generally called “foundation models”. In Databricks, we have two ways to access foundation models:

Bill-by-usage (what Databricks calls pay-per-token): truly serverless hosting, where Databricks ensures model availability for popular models and all Databricks customers share those same model servers. You are billed by the number of tokens you send and receive, so your cost scales to zero when you send no traffic.
Bill-by-throughput (what Databricks calls provisioned throughput): managed hosting, where Databricks provides you dedicated compute power for a particular model, effectively as a private model server. You are billed by the amount of compute you reserve (size × duration) whether you use it or not.

The bill-by-usage model is excellent for getting started or for smaller or bursty applications, but has some limitations:

There must be sufficient demand for a model to warrant Databricks providing this kind of hosting, so this is only available for the most popular models.
The model being hosted must be cross-customer, and thus this billing method isn’t available for private or customized LLMs.
For large or sustained token throughputs, this probably won’t give you the best price-per-token or latency.
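The trade-off between the two billing modes comes down to simple arithmetic: per-token billing grows with usage, while reserved compute is a flat cost. The sketch below computes the break-even point. All prices are hypothetical round numbers chosen to make the math easy to follow, not actual Databricks pricing, and the comparison ignores the throughput ceiling of whatever capacity you reserve.

```python
def pay_per_token_cost(tokens: int, price_per_million: float) -> float:
    """Cost when billed by usage: proportional to tokens processed."""
    return tokens / 1_000_000 * price_per_million


def reserved_cost(hours: float, price_per_hour: float) -> float:
    """Cost when billed by reserved compute: independent of usage."""
    return hours * price_per_hour


def break_even_tokens(hours: float, price_per_hour: float, price_per_million: float) -> float:
    """Token volume above which reserving compute is cheaper than paying per token."""
    return reserved_cost(hours, price_per_hour) / price_per_million * 1_000_000


# Hypothetical prices in dollars, for illustration only.
PRICE_PER_MILLION_TOKENS = 2.00
PRICE_PER_HOUR = 10.00
HOURS_PER_MONTH = 720  # 30 days x 24 hours

monthly_reserved = reserved_cost(HOURS_PER_MONTH, PRICE_PER_HOUR)
threshold = break_even_tokens(HOURS_PER_MONTH, PRICE_PER_HOUR, PRICE_PER_MILLION_TOKENS)
```

With these made-up numbers, the reserved option costs 7,200 dollars per month, so it only wins above 3.6 billion tokens per month. Below that volume, per-token billing is cheaper.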

Using foundation models on Databricks is a flexible, self-contained solution. However, there are many other models that are not open source and cannot be hosted by Databricks, such as those from OpenAI or Anthropic. For these, Databricks provides an adapter layer that offers a consistent interface across a variety of external providers.

Building Custom Models

In some cases, you may decide you need a custom model. This model might be an LLM fine-tuned for your particular application, or it may be a non-LLM capability. Either way, Databricks has a managed offering for hosting your own private ML models too.

Databricks has a variety of offerings to help with custom ML:

AutoML to automate the boring parts of building high-performance ML solutions (not yet applicable to training LLMs specifically, but likely to be in the future).
MLflow to monitor and manage the lifecycle of your ML models.
Model Serving to wrap your ML models as scalable APIs.
LLM Fine-Tuning and Pre-Training to simplify the process of modifying LLMs, which are especially difficult to train relative to other ML models.

LLM Orchestration

Alright, so we have our knowledge base and we have access to our LLM of choice. The final step in putting together our RAG architecture is to tie everything together. For the sake of argument, let’s assume we want to do so as an API (public or private), so we can use our complete RAG from any other application we may be building.

Note that what I’m discussing here is the hosting of the RAG driver, not the LLM itself. This RAG system incorporates prompt engineering, retrieval, moderation, logging, and at least one call to an LLM (hosted elsewhere) for generating the output. This code is usually lightweight glue that binds other systems together (fetching data from an external database and using that to augment what gets generated by an external LLM), and often does not require specialized hardware (GPUs) or direct access to model files (model weights).

The main solution Databricks offers here is Model Serving, which makes it easy to register Python functions, R crates, LangChain chains, or a whole host of other options as hosted models. LangChain, in particular, is a popular library for piecing together LLM-powered operations and offers easy recipes for building a RAG application specifically. In the Databricks context, this looks like the following process:

Build the model, function, or chain you want to host in a Databricks notebook
Log the model with MLflow
Register that logged model with Unity Catalog
Host that model with a Databricks model serving endpoint

This done, you now have the model tagged and tracked, access to it secured by UC, and an API for it provisioned by your model serving endpoint. The result is scalable, securable, and trackable.
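As a rough sketch of the first three steps, the code below builds a RAG function, wraps it as an MLflow pyfunc model, and registers it in Unity Catalog. Several things here are my own assumptions rather than anything prescribed: the "question" input column, the "rag_driver" artifact path, and the three-level model name you pass in. The MLflow import is deferred so the RAG function itself can be tried without MLflow installed, and log_and_register is only expected to work inside a Databricks workspace. Depending on your MLflow version, the artifact path argument may go by a different name, so check the documentation for the release you use. The fourth step, creating the serving endpoint, is done through the Databricks UI or API and is not shown.

```python
from typing import Callable, List


def make_rag_fn(
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Step 1: build the thing to host (here, a plain Python function)."""

    def rag_fn(question: str) -> str:
        context = "\n".join(retrieve(question))
        return generate(f"Context:\n{context}\n\nQuestion: {question}")

    return rag_fn


def log_and_register(rag_fn: Callable[[str], str], registered_name: str) -> None:
    """Steps 2 and 3: log the model with MLflow and register it in Unity Catalog."""
    # Imported lazily so the rest of this sketch runs without MLflow installed.
    import mlflow.pyfunc
    import pandas as pd
    from mlflow.models import infer_signature

    class RagModel(mlflow.pyfunc.PythonModel):
        def predict(self, context, model_input):
            # Assumes a "question" column in the input (hypothetical schema).
            return [rag_fn(q) for q in model_input["question"]]

    # Unity Catalog requires a model signature, so infer one from examples.
    signature = infer_signature(
        pd.DataFrame({"question": ["example question"]}),
        ["example answer"],
    )

    mlflow.set_registry_uri("databricks-uc")  # use Unity Catalog as the registry
    with mlflow.start_run():
        mlflow.pyfunc.log_model(
            artifact_path="rag_driver",
            python_model=RagModel(),
            signature=signature,
            registered_model_name=registered_name,  # e.g. "catalog.schema.model"
        )


# Local check of step 1 with stand-in retriever and generator.
demo_fn = make_rag_fn(lambda q: ["fact one", "fact two"], lambda prompt: prompt)
demo_output = demo_fn("What?")
```

Keeping the RAG logic in a plain function and the packaging in a separate step makes it easy to unit test the driver before anything is logged or deployed.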

Conclusion

In this post, we’ve shown how the necessary elements for building a RAG-like system, or virtually any other compound AI system, are all ready at hand in the Databricks ecosystem. Tight integration with MLflow, in-house APIs for interacting with LLMs, and best-in-class data pipeline systems that can be directly embedded and served as vector stores make Databricks a one-stop shop for building out production-ready AI systems.

Rearc provides services to satisfy bespoke LLM, AI, and MLOps requirements in complicated enterprise contexts like financial services and healthcare. We bring a strong Cloud and DevOps background, so you can trust that your solutions are scalable and maintainable. If you have any enterprise AI requirements you need help with, just reach out to us at ai@rearc.io for consultation.