# Metrics-Driven Agent Development

> Evaluation frameworks like RAGAS let us take a metrics-driven approach to developing AI applications. Here we see how to integrate RAGAS with LangChain agents and Pinecone.

**R**etrieval **A**ugmented **G**eneration **As**sessment (RAGAS) is an evaluation framework for quantifying the performance of our agent and RAG pipelines. By adding evaluation to our workflow, we can iterate on both agent and RAG performance more reliably. In this article, we will see how to use RAGAS to quantify the performance of a RAG-enabled conversational agent in LangChain, enabling metrics-driven development of our agent.

Because we need an agent and RAG pipeline to evaluate with RAGAS, the first part of this article will cover the creation of a RAG-enabled XML Agent. You can skip the XML agent section and jump ahead to **Integrating RAGAS** if preferred.

[Video](https://www.youtube.com/watch?v=-_52DIIOsCE)


To begin, let's install the prerequisites:

```text
!pip install -qU \
    langchain==0.1.1 \
    langchain-community==0.0.13 \
    langchainhub==0.1.14 \
    anthropic==0.14.0 \
    cohere==4.45 \
    pinecone-client==3.0.2 \
    datasets==2.16.1 \
    ragas==0.1.0
```

## Finding Knowledge

The first thing we need for an agent using RAG is a source of knowledge. We will use v2 of the AI ArXiv dataset, available on Hugging Face Datasets at [`jamescalam/ai-arxiv2-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-chunks).


_Note: we're using the prechunked dataset. For the raw version see [jamescalam/ai-arxiv2](https://huggingface.co/datasets/jamescalam/ai-arxiv2)._

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/06f0dd0eecbc695928e7bf9be495094b4adee856.ipynb)


### Building the Knowledge Base

To build our knowledge base, we need _two things_:

1. **Embeddings**: we will use LangChain's `CohereEmbeddings` with Cohere's embedding models, which requires an [API key](https://dashboard.cohere.com/api-keys).
2. **Vector database**: used to store and query our embeddings. We use Pinecone, which requires a [free API key](https://app.pinecone.io/).

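First, we initialize our connection to Cohere and create an `embed` object for generating embeddings:

```
import os
from getpass import getpass
from langchain_community.embeddings import CohereEmbeddings

# CohereEmbeddings reads the key from the COHERE_API_KEY environment variable
os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY") or getpass(
    "Cohere API key: "
)
embed = CohereEmbeddings(model="embed-english-v3.0")
```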

Before creating an index, we need the dimensionality of our Cohere embedding model, which we can find easily by creating an embedding and checking the length:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/b15e7fd0d04e59837475ff3e0d7dd631a5f893c8.ipynb)
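
As a rough sketch of that check, the snippet below embeds a single probe string and reads off the vector length. `StubEmbeddings` and `embedding_dimension` are hypothetical names introduced here so the example runs offline. With the real model you would pass the `embed` object instead, and the `vec` variable used when creating the index is assumed to come from a call like this one.

```python
from typing import List


class StubEmbeddings:
    """Hypothetical offline stand-in that mimics a 1024-dimensional model."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # return one zero vector per input text
        return [[0.0] * 1024 for _ in texts]


def embedding_dimension(embedder, probe: str = "ello") -> int:
    """Embed one probe string and return the length of the resulting vector."""
    vec = embedder.embed_documents([probe])
    return len(vec[0])


# with the real model: embedding_dimension(embed)
print(embedding_dimension(StubEmbeddings()))
```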


Our embedding model outputs 1024-dimensional vector embeddings — we will use this number when initializing our vector index. We first need to initialize our Pinecone client (using the Pinecone API key) to set up our index.

```
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
```

Now, we set up our index specification, which allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

```
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)
```

Now we create the index using our embedding dimensionality and a metric also compatible with the model (this can be either `cosine` or `dotproduct`). We also pass our spec to index initialization.

```
import time

index_name = "xml-agent"

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=len(vec[0]),  # dimensionality of cohere v3
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
```

### Populating our Index

Now our knowledge base is ready to be populated with our data. We will use the `embed` object to embed our documents and add them to our index. To simplify things, we can include our chunk's text in the metadata field of each record.

```
from tqdm.auto import tqdm

# easier to work with dataset as pandas dataframe
data = dataset.to_pandas()

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # get the unique id of each chunk
    ids = [x["id"] for _, x in batch.iterrows()]
    # get text to embed
    texts = [x["chunk"] for _, x in batch.iterrows()]
    # embed text
    embeds = embed.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {"text": x["chunk"],
         "source": x["source"],
         "title": x["title"]} for _, x in batch.iterrows()
    ]
    # add to Pinecone as (id, vector, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```

## Defining an XML Agent

Anthropic trained their LLMs to use XML tags such as `<input>{some input}</input>`. When calling a tool, they use:

```text
<tool>{tool name}</tool>
<tool_input>{tool input}</tool_input>
```

This format is very different from the format produced by typical ReAct agents. Because of this, ReAct agents tend to perform worse than XML agents when using Anthropic LLMs.

To create an XML agent, we need a `prompt`, `llm`, and a list of `tools`. We can download a prebuilt prompt for conversational XML agents from the LangChain hub.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/61b7d07228e70103dcc0c8ac701e6f198cad4286.ipynb)


We can see the XML format being used throughout the prompt when explaining to the LLM how it should use tools. Next, we initialize the `llm`:

```
from langchain_community.chat_models import ChatAnthropic

# chat completion llm
llm = ChatAnthropic(
    anthropic_api_key=os.environ["ANTHROPIC_API_KEY"],
    model_name='claude-2.1',
    temperature=0.0
)
```

Now, we need to initialize our `tools` list. The agent will use just _one_ tool — the ArXiv search tool that will search through our vector index and return relevant contexts. We define the tool using the `@tool` decorator on a function that will consume a query string (i.e., the search query) and return a string containing all retrieved contexts.

```
from langchain.agents import tool

@tool
def arxiv_search(query: str) -> str:
    """Use this tool when answering questions about AI, machine learning, data
    science, or other technical questions that may be answered using arXiv
    papers.
    """
    # create query vector
    xq = embed.embed_query(query)
    # perform search
    out = index.query(vector=xq, top_k=5, include_metadata=True)
    # reformat results into string
    results_str = "\n---\n".join(
        [x["metadata"]["text"] for x in out["matches"]]
    )
    return results_str

tools = [arxiv_search]
```

When our agent uses this tool, it executes (and returns output) as shown below:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/c8c932c39d07453fe537f3e1da22e658235a8731.ipynb)


### Creating the Agent Executor

An agent executor is a linear "execution path" we run whenever the agent receives input. Within the agent executor, we combine data preparation steps, our agent prompt, the LLM, tools, and output parsing.

When we execute the agent, we provide it with a single `input` — the input text from a user. However, within the agent logic, an `agent_scratchpad` object will be passed, too, including tool information. To feed this information into our LLM, we must transform it into the XML format we described earlier; we define a `convert_intermediate_steps` function to handle that.

```
def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log
```
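
To make the resulting format concrete, here is a small offline sketch that runs the same function on one fabricated step. The `SimpleNamespace` object is a hypothetical stand-in for the action object LangChain passes in (only its `tool` and `tool_input` attributes are used), and the observation text is invented for illustration.

```python
from types import SimpleNamespace


def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log


# hypothetical stand-in for a real agent action and its tool output
action = SimpleNamespace(tool="arxiv_search", tool_input="llama 2")
steps = [(action, "Llama 2 is a collection of pretrained LLMs.")]

scratchpad = convert_intermediate_steps(steps)
print(scratchpad)
```

Each step becomes one `<tool>`, `<tool_input>`, `<observation>` triple, and multiple steps are simply concatenated in order.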

We must also parse the tools into a string containing `tool_name: tool_description` — we handle that with the `convert_tools` function.

```
def convert_tools(tools):
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])
```

With everything ready, we can go ahead and initialize our agent object using [LangChain Expression Language (LCEL)](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/). We add instructions for when the LLM should _stop_ generating with `llm.bind(stop=[...])`, and finally, we parse the output from the agent using an `XMLAgentOutputParser` object.

```
from langchain.agents.output_parsers import XMLAgentOutputParser

agent = (
    {
        "input": lambda x: x["input"],
        # without "chat_history", tool usage has no context of prev interactions
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: convert_intermediate_steps(
            x["intermediate_steps"]
        ),
    }
    | prompt.partial(tools=convert_tools(tools))
    | llm.bind(stop=["</tool_input>", "</final_answer>"])
    | XMLAgentOutputParser()
)
```

With our `agent` object initialized, we pass it to an `AgentExecutor` object alongside our original `tools` list:

```
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent, tools=tools, return_intermediate_steps=True
)
```

Now, we can use the agent via the `invoke` method. Note that we have no `"chat_history"` so we will pass an empty string to that argument. We can create a helper function called `chat` to help us handle the chat — if you need conversational history in your use-case, you can add it here.
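
A minimal sketch of such a helper might look like the following. It is an assumption about the helper's shape rather than the exact code from the notebook: here the executor is passed in as an argument, and `StubExecutor` is a hypothetical stand-in that mimics the dictionary returned by `invoke` so the example runs without API keys.

```python
class StubExecutor:
    """Hypothetical stand-in for AgentExecutor so the sketch runs offline."""

    def invoke(self, inputs: dict) -> dict:
        # the real executor returns the inputs plus "output" (and, when
        # return_intermediate_steps=True, "intermediate_steps")
        return {
            **inputs,
            "output": f"echo: {inputs['input']}",
            "intermediate_steps": [],
        }


def chat(text: str, executor) -> dict:
    """Send one user message to the agent with an empty chat history."""
    out = executor.invoke({"input": text, "chat_history": ""})
    print(out["output"])
    return out


# with the real agent: chat("can you tell me about llama 2?", agent_executor)
out = chat("can you tell me about llama 2?", StubExecutor())
```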

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/be003ebbbd885b5d9cbc82959eb0972cbf555ef4.ipynb)


The answer looks good. Let's try another:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/b6740b4ff69ee12fc322c4e5bba96e0902ddf18e.ipynb)


We get more good answers. Now that we have our agent defined, we can move on to evaluating it with RAGAS.

## Integrating RAGAS

We need two things to integrate RAGAS evaluation: the retrieved contexts and the generated output from our pipeline.

We already have the generated output, which we're printing above. The retrieved contexts are also being logged, but we have yet to see how to extract them programmatically. Let's take a look at what we return in `out`:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/84d067c24e72c0e739ce23c05440f1a607b41482.ipynb)


When initializing our `AgentExecutor` object, we included `return_intermediate_steps=True`. This (unsurprisingly) returns the intermediate steps that the agent took to generate the final answer. Those steps include the response from our `arxiv_search` tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.

We extract the contexts themselves like so:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/c78480f5f6a04edb45b35e6cd4ea3453ceff8a04.ipynb)
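
The same extraction can be sketched offline. Each intermediate step is an `(action, observation)` pair, and our `arxiv_search` tool joins contexts with `"\n---\n"`, so splitting the observation on that separator recovers the individual chunks. The `extract_contexts` name and the fabricated `out` dictionary are illustrative assumptions, not the notebook's exact code.

```python
def extract_contexts(out: dict, sep: str = "\n---\n") -> list:
    """Collect the retrieved chunks from every tool call in the agent output."""
    contexts = []
    for _action, observation in out.get("intermediate_steps", []):
        contexts.extend(observation.split(sep))
    return contexts


# fabricated agent output with a single tool call
out = {
    "output": "a made-up final answer",
    "intermediate_steps": [
        ("action placeholder", "first chunk\n---\nsecond chunk"),
    ],
}
print(extract_contexts(out))
```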


### Evaluation

To evaluate our agent using RAGAS, we need a dataset containing questions, ideal contexts, and the _ground truth_ answers to those questions. RAGAS does provide utilities for automatically generating these, but these are out of the scope of this article. Nonetheless, we will be using a prebuilt evaluation dataset created using RAGAS.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/10da160e30e0ccb7f15bb3dfe502ebb612e60ddb.ipynb)


This dataset includes 51 questions, their most relevant contexts (according to GPT-4), and their truthful answers (again, according to GPT-4).

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/d443deab5758a69fc5941e76b0233251a8d8f7bf.ipynb)


We first iterate through the questions in this evaluation dataset and ask our agent these questions.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/8a624e38c1c507dc461b9c92b4d6555494ed1396.ipynb)
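
The loop can be sketched as follows. This is an outline under stated assumptions: the column names (`question`, `answer`, `contexts`, `ground_truth`) are what RAGAS is assumed to expect here and should be checked against your installed version, and `stub_ask` is a hypothetical stand-in for the real agent call so the example runs offline.

```python
SEP = "\n---\n"  # separator used by the arxiv_search tool when joining contexts


def collect_eval_rows(examples, ask):
    """Ask the agent every question and gather the columns for evaluation."""
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for example in examples:
        out = ask(example["question"])
        steps = out.get("intermediate_steps", [])
        # if the agent answered without calling the tool, there are no contexts
        contexts = steps[0][1].split(SEP) if steps else []
        rows["question"].append(example["question"])
        rows["answer"].append(out["output"])
        rows["contexts"].append(contexts)
        rows["ground_truth"].append(example["ground_truth"])
    return rows


def stub_ask(question):
    """Hypothetical stand-in for the real agent call."""
    return {
        "output": f"answer to: {question}",
        "intermediate_steps": [("action", f"context A{SEP}context B")],
    }


examples = [
    {"question": "What is RAG?", "ground_truth": "Retrieval augmented generation."}
]
rows = collect_eval_rows(examples, stub_ask)
print(rows["contexts"])
```

Handling the empty-steps case matters because the agent may answer some questions without calling the search tool at all, in which case there is nothing to score for retrieval.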


We transform this into a `Dataset` object like so:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/5fec01073a4a5a638aa39d9b5d9937f15aa36ecc.ipynb)


Now we can run evaluation across a suite of metrics provided by RAGAS using `evaluate`.

```
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

result = evaluate(
    dataset=eval_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
result = result.to_pandas()
```

With that, we have generated our evaluation, but we must understand what these metrics mean before looking at the results.

### Retrieval Metrics

Retrieval is the first step in every RAG pipeline, so we will focus on metrics that assess retrieval first. We primarily want to focus on `context_recall` and `context_precision`, but we must understand what they measure before diving into these metrics.

#### Actual vs. Predicted

When evaluating the performance of retrieval systems, we tend to compare the _actual_ (ground truth) to _predicted_ results. We define these as:

- **Actual condition** is the true label of every context in the dataset. These are _positive_ ($p$) if the context is relevant to our query or _negative_ ($n$) if the context is _ir_relevant to our query.
- **Predicted condition** is the _predicted_ label determined by our retrieval system. Every context our pipeline returns is a predicted _positive_, i.e., $\hat{p}$. If our pipeline does not return a context, it is a predicted _negative_, i.e., $\hat{n}$.

Given these conditions, we can say the following:

- $p\hat{p}$ is a **true positive**, meaning a relevant result has been returned.
- $n\hat{n}$ is a **true negative**, meaning an irrelevant result was not returned.
- $n\hat{p}$ is a **false positive**, meaning an irrelevant result has been returned.
- $p\hat{n}$ is a **false negative**, meaning a relevant result has _not_ been returned.
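
These four outcomes can be counted directly when relevance labels are known. The sketch below uses plain Python sets with made-up document IDs. `confusion_counts` is a hypothetical helper for illustration and is not part of RAGAS.

```python
def confusion_counts(all_ids, relevant, retrieved):
    """Count true/false positives and negatives for one query."""
    all_ids, relevant, retrieved = set(all_ids), set(relevant), set(retrieved)
    return {
        "tp": len(relevant & retrieved),            # relevant and returned
        "fp": len(retrieved - relevant),            # irrelevant but returned
        "fn": len(relevant - retrieved),            # relevant but not returned
        "tn": len(all_ids - relevant - retrieved),  # irrelevant and not returned
    }


counts = confusion_counts(
    all_ids=range(10),
    relevant={0, 1, 2, 3, 4},
    retrieved={0, 1, 2, 7, 8},
)
print(counts)  # {'tp': 3, 'fp': 2, 'fn': 2, 'tn': 3}
```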

Let's see how these apply to our metrics in RAGAS.

#### Context Recall

Context recall (or just _recall_) measures how many relevant records in a dataset have been retrieved by the pipeline. We calculate it as follows:

$$
Recall@K = \frac{p\hat{p}}{p\hat{p}+ p\hat{n}} = \frac{Relevant \: contexts \: retrieved}{Total \: number \: of \: relevant \: contexts}
$$

RAGAS calculates context recall using _Recall@K_, where the _@K_ represents the number of contexts returned. If we increase the @K value, the recall scores will improve (as the capture size of the retrieval step increases). At its extreme, we could set @K equal to the dataset size to guarantee perfect recall — although this negates the point of RAG in the first place.

By default, RAGAS uses a _@K_ value of `5`.
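
To illustrate the formula itself, here is a small ID-based sketch of Recall@K with made-up document IDs. Note the assumption: RAGAS estimates relevance with an LLM rather than with labelled IDs, so this shows the arithmetic only, not the library's internals.

```python
def recall_at_k(relevant, ranked, k=5):
    """Fraction of all relevant items that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(relevant & set(ranked[:k]))
    return hits / len(relevant)


relevant = {"a", "b", "c", "d", "e"}
ranked = ["a", "x", "b", "y", "c", "d", "e"]  # retrieval order, best first

print(recall_at_k(relevant, ranked, k=5))  # 0.6
print(recall_at_k(relevant, ranked, k=7))  # 1.0
```

Raising `k` from 5 to 7 lifts recall from 0.6 to 1.0 without the retriever getting any better, which is exactly the weakness described above.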

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/a43567dcc5f4d677680322f7e3966925d87b15ec.ipynb)


We can see all but the second set of results returned all relevant contexts. The score in this second set of results is `0.6` meaning that the pipeline returned 3/5 (60%) of relevant contexts.

All other results returned `1.0` (100%), meaning our pipeline retrieved all relevant contexts.

Recall is a useful metric, but it is easily fooled by simply returning more records, i.e., increasing the _@K_ value. Because of that, it is typically paired with _precision_.

#### Context Precision

Context precision (or just _precision_) is another popular retrieval metric. We typically see both recall and precision paired together when evaluating retrieval systems.

As with recall, the actual metric here is called _Precision@K_, where @K represents the number of contexts returned. However, unlike recall, precision focuses on the number of relevant results returned compared to the total results returned, whether relevant or not — this is equal to our chosen _@K_ value.

$$
Precision@K = \frac{p\hat{p}}{p\hat{p}+ n\hat{p}} = \frac{Relevant \: contexts \: retrieved}{Number \: of \: contexts \: retrieved}
$$
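
The formula can be sketched in the same ID-based style, again with made-up document IDs. As before, this illustrates the arithmetic under the assumption that relevance labels are known, and it is not how RAGAS computes its score internally.

```python
def precision_at_k(relevant, ranked, k=5):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant)
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


relevant = {"a", "b", "c", "d", "e"}
ranked = ["a", "x", "b", "y", "c", "d", "e"]  # retrieval order, best first

print(precision_at_k(relevant, ranked, k=5))  # 0.6
```

Unlike recall, returning more results does not help here: every irrelevant context added to the top-k lowers the score.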

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/ead901c6f4047178921c7124cf9b6f040fbdbc7f.ipynb)


Our precision@K scores are equal to our recall scores (this can happen when there are _5_ relevant contexts for each query and we set _@K = 5_). This result means every query produced 100% precision except our 60% precision result, where only 3/5 returned contexts were relevant.


### Generation Metrics

#### Faithfulness

The _faithfulness_ metric measures (from _0_ to _1_) the factual consistency of an answer when compared to the retrieved context. A score of _1_ means we can find all answer claims in the context. A score of _0_ would indicate that we find _no_ answer claims in the context.

We calculate the faithfulness like so:

$$
Faithfulness = \frac{Number \: of \: claims \: in \: answer \: also \: found \: in \: context}{Number \: of \: claims \: in \: answer}
$$
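
As a worked example of the ratio, the sketch below takes one verdict per answer claim (whether it was found in the context) and returns the score. The claim extraction and verification steps are assumed to have already happened, and `faithfulness_score` is a hypothetical helper for illustration.

```python
def faithfulness_score(claim_supported):
    """Share of answer claims that are supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# four claims in the answer, three of them found in the context
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```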

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/d52c25ca79483ff0eb8d651e567703822a7faeae.ipynb)


When calculating faithfulness, RAGAS uses OpenAI LLMs to decide which claims are in the answer and whether they exist in the context. Because of this approach's "generative" nature, we won't always get accurate scores.

We can see that we get perfect scores for all but our fourth result, which scores `0.0`. Some claims in that answer are related to the context, so a score of zero is arguably harsh. Nonetheless, the fourth answer is less grounded in our context than the other responses, which gives some justification for the low score.

#### Answer Relevancy

Answer relevancy is our final metric. It focuses on the generation component and is similar to our "context precision" metric as it measures how much of the returned information is relevant to our original question.

We return a low answer relevancy score when:

- Answers are incomplete.
- Answers contain redundant information.

A high answer relevancy score indicates that an answer is concise and does not contain "fluff" (i.e., irrelevant information).

The score is calculated by asking an LLM to generate multiple questions for a generated answer and then calculating the cosine similarity between the original and generated questions. Naturally, if we have a concise answer that answers a specific question, we should find that the generated question will have a high cosine similarity to the original question.
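
The similarity step can be sketched with toy vectors. The assumptions here are that the questions have already been embedded and that the per-question similarities are averaged into one score. `answer_relevancy_score` is a hypothetical helper, and the two-dimensional vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_vecs):
    """Mean similarity between the original and the generated questions."""
    if not generated_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0]                  # embedding of the original question
generated = [[1.0, 0.0], [0.0, 1.0]]   # embeddings of two generated questions
print(answer_relevancy_score(original, generated))  # 0.5
```

One generated question matches the original exactly and the other is unrelated, so the score lands halfway between the two extremes.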

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/b502044cc0d042f7e3a7e3bfaacdd43dfa97e0a3.ipynb)


Again, we can see poorer performance from our fourth answer, but the remainder (particularly the answer with similarity greater than `0.9`) perform well.

---

That's all for our introduction to using RAGAS with real LangChain agents. We've seen how to set up an XML agent with RAG, extract its retrieved contexts and answers, and use RAGAS to evaluate our XML agent performance.

The ability to measure retrieval and generation performance with a framework like RAGAS allows us to modify and improve the performance of our agents reliably. We should integrate it as part of a metrics-driven optimization process for any AI agent use case.