# Deploying Open Source LLMs for RAG with SageMaker

> Learn how to build Retrieval Augmented Generation (RAG) pipelines with open source LLMs like Flan T5 and Llama 2. We deploy LLMs using AWS SageMaker and implement RAG with sentence transformers and the Pinecone vector database.

James Briggs, Vedant Jain · 2023-08-23

We are developing the future with **L**arge **L**anguage **M**odels (LLMs). Companies worldwide are moving fast to integrate LLMs into existing products and even creating entirely new products using LLMs.

Despite the seemingly unstoppable adoption of LLMs across industries, they're just one component of a broader technology ecosystem that is powering the new AI wave. An LLM alone is not all that useful.

Would you use Google if it was 10x slower and only allowed you to search through data from September 2021 or earlier? A slow, outdated Google would not be as compelling as the Google we use today.

LLMs alone are like our slow, outdated Google. Without the correct hardware, some of the latest LLMs, like Meta's Llama 2, will either _not run_ or take minutes to generate a paragraph of text. Without an external knowledge base, even GPT 4 (v0316) can only tell us about the world pre-September 2021.

These limitations are significant. Yet, we can overcome these issues by augmenting our LLM with the right components. In this article, we'll explore how to deploy open-access LLMs using AWS SageMaker and keep our LLMs up to date with relevant information using the Pinecone vector database.

---

Many conversational AI use cases require LLMs like Llama 2, Flan T5, and Bloom to respond to user queries. These models rely on _parametric knowledge_ to answer questions.

[Video](https://www.youtube.com/watch?v=0xyXYHMrAP0)


Parametric knowledge is powerful but limited. The model learns this knowledge during training and encodes it into the model parameters. We must retrain the LLM to update this knowledge — which takes a lot of time and money.

Fortunately, we can also use _source knowledge_ to inform our LLMs. Source knowledge refers to information fed into the LLM via an input prompt.

[Retrieval Augmented Generation (RAG)](https://www.pinecone.io/learn/retrieval-augmented-generation/) is a popular approach to providing our LLMs with relevant source knowledge. Using RAG, we retrieve relevant information from an external data source and feed that information into the LLM.

![RAG allows us to augment our prompts with relevant context, improving LLM performance.](https://cdn.sanity.io/images/vr8gru94/production/12a67c2d233c2097273eb7c7a5db65c066594c3c-2282x1145.png)


## Deploying LLMs in SageMaker

Pinecone will handle the retrieval component of RAG for us, but we still need two more critical components: somewhere to run our LLM inference and somewhere to run our embedding model.

SageMaker provides inference hardware, easily deployable images for LLMs like Llama 2, and integrations with popular model providers like Hugging Face.

That makes it a convenient environment for developing RAG-enabled LLM pipelines. To get started, create a SageMaker domain and open a Jupyter Studio notebook, then install the prerequisite libraries:

```text
!pip install -qU \
  sagemaker \
  pinecone-client==2.2.1 \
  ipywidgets==7.0.0
```

### Deploying an LLM

There are two approaches to deploying an LLM that we will discuss here. The first is via the `HuggingFaceModel` object. We use this when deploying LLMs (and embedding models) directly from the Hugging Face model hub.

For example, we could create a deployable config for the [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl) or meta-llama/Llama-2-7b models like so:

```
import sagemaker
from sagemaker.huggingface import (
    HuggingFaceModel,
    get_huggingface_llm_image_uri
)

role = sagemaker.get_execution_role()

hub_config = {
    'HF_MODEL_ID':'meta-llama/Llama-2-7b', # model_id from hf.co/models
    'HF_TASK':'text-generation' # NLP task you want to use for predictions
}

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.8.2"
)

my_model = HuggingFaceModel(
    env=hub_config,
    role=role, # iam role with permissions to create an Endpoint
    image_uri=llm_image
)
```

When deploying models directly from Hugging Face like this, we must initialize the configuration `my_model` with:

- An `env` config that specifies which model we want to use and for what task.
- Our SageMaker execution `role`, which grants the permissions needed to deploy our model.
- An `image_uri`, an image config specifically for deploying LLMs from Hugging Face.

Alternatively, SageMaker has a set of models directly compatible with a simpler `JumpStartModel` object. Many popular LLMs like Llama 2 are supported by this, which we initialize like so:

```
from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b-f"
)
```

For both versions of `my_model`, we can go ahead and deploy them like so:

```
llm = my_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    endpoint_name="llama-2-demo"
)
```

### Querying the LLM

With our initialized LLM endpoint, we can begin querying it. The format of our queries may vary (particularly between conversational and non-conversational LLMs), but the process is generally the same. For the Hugging Face model, we do the following:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/6d9b4acbc99906fd5f4d47fa26c457b8ff930a61.json)
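As a rough sketch of what such a query can look like, the snippet below builds a request body for a Hugging Face text-generation endpoint. The question is paraphrased from the Managed Spot Training example used in this section, and the `build_payload` helper and the `max_new_tokens` value are illustrative assumptions rather than the notebook's exact code. The call to the live endpoint is left commented out because it needs the deployed `llm` predictor.

```python
from typing import Any, Dict

# Paraphrase of the question used in this section (assumed wording).
question = "Which instances can I use with Managed Spot Training in SageMaker?"


def build_payload(prompt: str, max_new_tokens: int = 128) -> Dict[str, Any]:
    """Build a request body for a Hugging Face text-generation endpoint."""
    return {
        "inputs": prompt,
        # Generation settings are illustrative; tune them for your model.
        "parameters": {"max_new_tokens": max_new_tokens},
    }


payload = build_payload(question)

# With a live endpoint (the `llm` predictor returned by `deploy`):
# out = llm.predict(payload)
# print(out[0]["generated_text"])
```

Keeping the payload construction in one place makes it easy to adjust generation settings without touching the rest of the pipeline.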


_(See [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/generation/aws/sagemaker/sagemaker-llama-2-rag.ipynb) for an example with the Llama 2 conversational LLM)_

The generated answer we're receiving here doesn't make much sense — it is a [hallucination](https://www.pinecone.io/learn/ai-hallucinations/).

#### Adding Context

Llama 2 attempts to answer our question based solely on internal parametric knowledge. Clearly, the model parameters do not store knowledge of which instances we can use with managed spot training in SageMaker.

To answer this question correctly, we must use _source knowledge_. That is, we give additional information to the LLM via the prompt. Let's add that information directly as additional context for the model.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/a8530497b2720639fed0d33c0f37b8a11ae28a5a.json)
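One way to add context is a prompt template with `{context}` and `{question}` placeholders, filled in with simple string replacement. The template wording below is an assumption made for illustration (the notebook's exact text may differ), and the context string is a placeholder rather than real FAQ content.

```python
# Hypothetical template wording; only the two placeholders matter here.
prompt_template = """Answer the following QUESTION based on the CONTEXT given.
If the CONTEXT does not contain the answer, say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""


def build_prompt(context: str, question: str) -> str:
    """Insert the context and question into the template."""
    return prompt_template.replace("{context}", context).replace(
        "{question}", question
    )


# Placeholder context; in practice this is the relevant passage of text.
context = "<relevant passage from the knowledge base>"
question = "Which instances can I use with Managed Spot Training in SageMaker?"
text_input = build_prompt(context, question)

# With a live endpoint:
# out = llm.predict({"inputs": text_input})
```

Telling the model to admit when the context lacks the answer is what later lets the pipeline decline out-of-scope questions instead of guessing.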


We now see the correct answer to our question; that was easy! However, a user is unlikely to insert contexts into their prompts — in that case, they would already know the answer to their question.

Rather than manually inserting a single context, we need to automatically identify relevant information from a more extensive database of information. For that, we need **RAG**.

## Retrieval Augmented Generation

With RAG, we will encode our database of information into a vector space where the proximity between vectors represents their relevance / semantic similarity to one another. Using this vector space as a "knowledge base", we can take a new user query, encode it into the same vector space, and retrieve the most relevant records previously indexed.

After retrieving these relevant records, we take a few of them and insert them into our LLM prompt as additional context — giving our LLM highly relevant source knowledge.

We can break these components down into two steps:

1. **Indexing** is where we populate our vector index with information from our dataset.
2. **Retrieval** happens at query time and is where we retrieve relevant information from the vector index.
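To make the two steps concrete, here is a toy in-memory stand-in for a vector index. It is not how Pinecone works internally; the `ToyIndex` class, the three-dimensional vectors, and the document texts are all made up for illustration. Real embeddings would come from the embedding model.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyIndex:
    """Brute-force vector index used only to illustrate the idea."""

    def __init__(self) -> None:
        self._records: Dict[str, Tuple[List[float], str]] = {}

    def upsert(self, doc_id: str, vector: List[float], text: str) -> None:
        # Indexing: store the vector alongside its source text.
        self._records[doc_id] = (vector, text)

    def query(self, vector: List[float], top_k: int = 2) -> List[Tuple[str, float]]:
        # Retrieval: score every record and keep the most similar ones.
        scored = [(text, cosine(vector, vec)) for vec, text in self._records.values()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


index = ToyIndex()
index.upsert("0", [0.9, 0.1, 0.0], "doc about spot training")
index.upsert("1", [0.1, 0.9, 0.0], "doc about pricing")
index.upsert("2", [0.0, 0.2, 0.9], "doc about notebooks")

# A made-up query vector that points mostly along the first dimension.
results = index.query([0.8, 0.2, 0.1], top_k=2)
```

A vector database does the same job at scale, replacing the brute-force scan with approximate nearest-neighbor search.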

Both steps require an embedding model to translate our human-readable plain text into semantic vector space. We will use the highly efficient MiniLM sentence transformer from Hugging Face. This model is _not_ an LLM and therefore is not initialized in the same way as our Llama 2 model.

```
hub_config = {
    'HF_MODEL_ID': 'sentence-transformers/all-MiniLM-L6-v2', # model_id from hf.co/models
    'HF_TASK': 'feature-extraction'
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
    py_version="py36", # python version of the DLC
)
```

In the `hub_config`, we specify the model ID as before, but for the task, we use `'feature-extraction'` because we are generating vector embeddings — not text like our LLM. Following this, we initialize the model config with `HuggingFaceModel` as before, but this time _without_ the LLM image and _with_ some version parameters.

```
encoder = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.large",
    endpoint_name="minilm-demo"
)
```

We deploy the model again with `deploy`, using the smaller (CPU only) instance of `ml.t2.large`. The reason for this is that the MiniLM model is tiny, so it does not require a lot of memory. MiniLM also doesn't need a GPU because it can quickly create embeddings even on a CPU. If preferred, we could run the model faster on GPU.

To create our embeddings, we use the `predict` method and pass a list of contexts to encode via the `'inputs'` key like so:


```
out = encoder.predict({
    "inputs": ["some text here", "some more text goes here too"]
})
```

We passed two input contexts here, returning two context vector embeddings:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/f68f493eb92311506b99186e8ef73572bc1897ab.ipynb)


The embedding dimensionality of the MiniLM model is `384`. That means each vector embedding MiniLM outputs should have a dimensionality of `384`. However, if we take a look at the length of our embeddings, we will see something strange:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/7efc0e3018154aa4fc9fdae0b9e37e7eed5da4de.ipynb)


We see two lists containing _eight_ items each; what is happening?

MiniLM first processes text in a _tokenization_ step. This tokenization transforms our human-readable plain text into a list of model-readable token IDs. These token IDs each represent a word or sub-word from our input text, like this:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/8319930597ae021ef6e9410ee872d083f58e4655.ipynb)


In the output features of the model, we get the _token-level_ embeddings. If we look at one of these embeddings, we'll find the expected dimensionality of `384`:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/0536f7afb5e5036456abde69826ee22272997932.ipynb)


Perfect! Now we need to transform these token-level embeddings into _document-level_ embeddings. To do this, we take the mean values across each vector dimension.

![Mean pooling operation to get a single 384-dimensional vector.](https://cdn.sanity.io/images/vr8gru94/production/d4bf831f3df861c3910792e3bf1387ae815f6cbf-1220x910.png)


[Colab File](https://cdn.sanity.io/files/vr8gru94/production/78df85de2a65c1483bfd0e29e5fc7446963a7938.ipynb)
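The pooling step can be sketched in plain Python for a single document. The token vectors below are made-up four-dimensional values standing in for the real 384-dimensional ones, and `mean_pool` is a hypothetical helper name. It computes the same thing as taking `np.mean` over the token axis.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Average token-level vectors into one document-level vector."""
    n_tokens = len(token_embeddings)
    # zip(*...) groups the values of each dimension across all tokens.
    return [sum(dim_values) / n_tokens for dim_values in zip(*token_embeddings)]


# Three tokens, each with a toy 4-dimensional embedding.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
]
doc_embedding = mean_pool(tokens)  # [3.0, 4.0, 5.0, 6.0]
```

The output has one value per embedding dimension regardless of how many tokens the input contained, which is what lets us compare documents of different lengths.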


Now we have two 384-dimensional vector embeddings, one for each input text. To make our lives easier, we will wrap the encoding process into a single function:

```
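from typing import List

import numpy as np

def embed_docs(docs: List[str]) -> List[List[float]]:
    # the endpoint returns token-level embeddings for each document
    out = encoder.predict({'inputs': docs})
    # mean pool across the token axis to get one vector per document
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()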

### Downloading the Dataset

We download the [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as our knowledge base. The data contains both question and answer columns.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/85d95242a5c5814fae32a82ace087774080d84e6.ipynb)


When performing our search, we're looking for answers, so we can drop the `Question` column.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/1f2cfbb511467180e2a45d3630b1fb8e5b76b8c2.ipynb)


Our dataset and the embedding pipeline are ready. Now all we need is somewhere to store those embeddings.

### Indexing

We'll be using the Pinecone vector database to store our vector embeddings and search through them efficiently at scale. To create a database, we need [a free API key](https://app.pinecone.io/).

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/20174f0c8698800a9f3ccaf10f3495b552558dc7.ipynb)


With that, we have connected to our Pinecone vector database. We must create a single vector index (similar to a table in traditional DBs). We will name the index `retrieval-augmentation-aws`. We must align the index `dimension` and `metric` parameters with those required by our embedding model (MiniLM in this case).

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/8eebf5ec44b9954abf1c234e296f0c6bedb230ae.ipynb)
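A minimal sketch of this step with the `pinecone-client` 2.x API pinned earlier might look like the following. The environment variable names and the choice of `cosine` as the metric are assumptions, and the `get_or_create_index` helper is hypothetical. The import is placed inside the function so the settings can be inspected without the client installed.

```python
import os

INDEX_NAME = "retrieval-augmentation-aws"
EMBEDDING_DIM = 384  # output dimensionality of MiniLM
METRIC = "cosine"    # assumed similarity metric for MiniLM embeddings


def get_or_create_index(
    name: str = INDEX_NAME,
    dimension: int = EMBEDDING_DIM,
    metric: str = METRIC,
):
    """Connect to Pinecone and return the index, creating it if missing."""
    import pinecone  # pinecone-client 2.x

    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],          # assumed variable name
        environment=os.environ["PINECONE_ENVIRONMENT"],  # assumed variable name
    )
    if name not in pinecone.list_indexes():
        pinecone.create_index(name, dimension=dimension, metric=metric)
    return pinecone.Index(name)


# Usage with valid credentials:
# index = get_or_create_index()
```

If the index dimension does not match the embedding model's output size, inserts will be rejected, so it is worth deriving `EMBEDDING_DIM` from a real embedding where possible.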


We are ready to begin inserting our data — to do this, we run the following:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/0f5da38b22172bb8b0840c17df06e088bd8102e9.ipynb)
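Inserting is usually done in batches. The sketch below shows one way to group texts into `(id, vector, metadata)` tuples, storing the original text under a `text` metadata key so it can be read back at query time. The `make_batches` helper, the batch size, and the `fake_embed` stand-in are assumptions for illustration. In the real pipeline `embed_docs` would take the place of `fake_embed`.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Record = Tuple[str, List[float], Dict[str, str]]


def make_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> Iterator[List[Record]]:
    """Yield batches of (id, vector, metadata) tuples ready for upsert."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_fn(batch)
        yield [
            # The running position in the dataset doubles as a string ID.
            (str(start + offset), vector, {"text": text})
            for offset, (text, vector) in enumerate(zip(batch, vectors))
        ]


def fake_embed(docs: List[str]) -> List[List[float]]:
    """Stand-in for the real embedding call; returns toy 2-d vectors."""
    return [[float(len(doc)), 0.0] for doc in docs]


answers = ["first answer", "second answer", "third answer"]
batches = list(make_batches(answers, fake_embed, batch_size=2))

# With a live index and the real encoder:
# for batch in make_batches(answers, embed_docs, batch_size=2):
#     index.upsert(vectors=batch)
```

Batching keeps each request small and means a failure only affects one batch rather than the whole dataset.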


With that, we populated our index and can begin querying it. Let's try again with our earlier question.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/af9c1b5c660c9eee8a9afd93a2e298be9fd28e5c.ipynb)


Looks great; we're returning relevant contexts to help us answer our question.

### Augmenting the Prompt

We can use the retrieved contexts to augment our prompt. To do this, we should decide on a maximum amount of context to feed into our LLM. We will use `1000` characters. With this limit set, we can iteratively add each returned context to our prompt until adding another would exceed the limit.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/5dc4f47f16dda68c045b1a1767c5723c93ea4908.ipynb)
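A possible shape for this logic is sketched below. The function name matches the `construct_context` call used later in the pipeline, but the body, the newline separator, and the choice to stop at the first context that does not fit are assumptions rather than the notebook's exact implementation.

```python
from typing import List

MAX_CONTEXT_CHARS = 1000


def construct_context(contexts: List[str], max_chars: int = MAX_CONTEXT_CHARS) -> str:
    """Join retrieved contexts in order until the character budget is used up."""
    selected: List[str] = []
    total = 0
    for text in contexts:
        # Stop as soon as the next context would push us over the budget.
        if total + len(text) > max_chars:
            break
        selected.append(text)
        total += len(text)
    # Separators are not counted against the budget in this sketch.
    return "\n".join(selected)


# Toy contexts of 400, 500 and 200 characters, ordered by relevance.
contexts = ["a" * 400, "b" * 500, "c" * 200]
context_str = construct_context(contexts)
```

Because contexts arrive ordered by relevance, truncating from the end drops the least relevant material first. Note that with this sketch a first context longer than the limit yields an empty string, so a production version may want to truncate it instead.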


Now to feed the `context_str` into our LLM prompt:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/b8ec0aa80ca19a60dfc0f94c2f332a76974cd404.ipynb)


This answer looks excellent! The logic works, so let's wrap it up into a single function to keep things clean.

```
def rag_query(question: str) -> str:
    # create query vec (embed_docs expects a list of strings)
    query_vec = embed_docs([question])[0]
    # query pinecone
    res = index.query(query_vec, top_k=5, include_metadata=True)
    # get contexts
    contexts = [match.metadata['text'] for match in res.matches]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # create our retrieval augmented prompt
    text_input = prompt_template.replace(
        "{context}", context_str
    ).replace(
        "{question}", question
    )
    # make prediction
    out = llm.predict({"inputs": text_input})
    return out[0]["generated_text"]
```

We can now ask questions like so:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/95206b4d4a04fb8364beac06bfa8518acc3f8520.ipynb)


Let's ask questions about things that are out of context and not contained within the dataset. We will find that thanks to our prompt specifying to use the context provided, the model will _not_ hallucinate and instead honestly tell us that it does _not_ know the answer.

---

That's it for our introduction to RAG with open-access LLMs on SageMaker. We've seen how to deploy SageMaker's JumpStart models with Llama 2, Hugging Face LLMs, and even embedding models with MiniLM.

We implemented a complete end-to-end RAG pipeline using our open-access models and a Pinecone vector index. Using this, we minimize hallucinations, keep our LLM knowledge up to date, and ultimately enhance the user experience and trust in our systems.

---

## Resources

[Hugging Face RAG Example](https://github.com/pinecone-io/examples/blob/master/learn/generation/aws/sagemaker/sagemaker-huggingface-rag.ipynb), Pinecone Examples Repo

[JumpStart Llama 2 RAG Example](https://github.com/pinecone-io/examples/blob/master/learn/generation/aws/sagemaker/sagemaker-llama-2-rag.ipynb), Pinecone Examples Repo