
Metrics-Driven Agent Development

Retrieval Augmented Generation Assessment (RAGAS) is an evaluation framework for quantifying the performance of our agent and RAG pipelines. By adding evaluation to our workflow, we can iterate on both agent and RAG performance more reliably. In this chapter, we will see how to use RAGAS to quantify the performance of a RAG-enabled conversational agent in LangChain, enabling metrics-driven development of our agent.

Because we need an agent and RAG pipeline to evaluate with RAGAS, the first part of this article covers the creation of a RAG-enabled XML agent. You can skip the XML agent section and jump ahead to Integrating RAGAS if preferred.


To begin, let's install the prerequisites:

!pip install -qU \
    langchain==0.1.1 \
    langchain-community==0.0.13 \
    langchainhub==0.1.14 \
    anthropic==0.14.0 \
    cohere==4.45 \
    pinecone-client==3.0.2 \
    datasets==2.16.1

Finding Knowledge

The first thing we need for an agent using RAG is a source of knowledge to pull from. We will use v2 of the AI ArXiv dataset, available on Hugging Face Datasets at `jamescalam/ai-arxiv2-chunks`.

Note: we're using the prechunked dataset. For the raw version see jamescalam/ai-arxiv2.
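Loading the dataset might look like the following minimal sketch, assuming the standard `load_dataset` function from Hugging Face `datasets`. The `load_chunks` helper, its `loader` hook, and the default `train` split are illustrative assumptions rather than part of the walkthrough itself.

```python
DATASET_ID = "jamescalam/ai-arxiv2-chunks"


def load_chunks(split: str = "train", loader=None):
    """Load the prechunked AI ArXiv dataset.

    `loader` is a hypothetical hook for swapping in a stand-in during tests;
    by default the Hugging Face `datasets.load_dataset` function is used.
    """
    if loader is None:
        # imported lazily so the helper can be defined without `datasets`
        from datasets import load_dataset as loader
    return loader(DATASET_ID, split=split)
```

Calling `dataset = load_chunks()` would then give a `dataset` object like the one we populate the index with later. A slice such as `split="train[:20000]"` keeps the download and embedding cost down while experimenting.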

Building the Knowledge Base

To build our knowledge base, we need two things:

  1. Embeddings: we will use CohereEmbeddings with Cohere's embedding models, which require an API key.
  2. Vector database: used to store and query our embeddings. We use Pinecone, which requires a free API key.

First, we initialize our connection to Cohere and define an embed helper function:

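import os
from getpass import getpass
from langchain_community.embeddings import CohereEmbeddings

# CohereEmbeddings reads the key from the COHERE_API_KEY environment variable
os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY") or getpass(
    "Cohere API key: "
)
embed = CohereEmbeddings(model="embed-english-v3.0")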

Before creating an index, we need the dimensionality of our Cohere embedding model, which we can find easily by creating an embedding and checking the length:
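As a minimal sketch of that check, the helper below embeds one sample text and returns the vector length. `embedding_dim` and `FakeEmbedder` are hypothetical names; the fake class stands in for `CohereEmbeddings` so the example runs without an API key, and its 1024-length vectors are hard-coded to mirror the model used here.

```python
def embedding_dim(embedder, sample_text: str = "hello") -> int:
    """Return the length of one embedding produced by `embedder`."""
    vec = embedder.embed_documents([sample_text])  # list holding one vector
    return len(vec[0])


class FakeEmbedder:
    """Stand-in for CohereEmbeddings so the sketch runs offline."""

    def embed_documents(self, texts):
        return [[0.0] * 1024 for _ in texts]


print(embedding_dim(FakeEmbedder()))  # prints 1024
```

With the real model, the equivalent call would be `embedding_dim(embed)`.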

Our embedding model outputs 1024-dimensional vector embeddings — we will use this number when initializing our vector index. We first need to initialize our Pinecone client (using the Pinecone API key) to set up our index.

from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

Now, we set up our index specification, which allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all available providers and regions in the Pinecone documentation.

from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

Now we create the index using our embedding dimensionality and a metric also compatible with the model (this can be either cosine or dotproduct). We also pass our spec to index initialization.

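import time

index_name = "xml-agent"

# embed one sample text so we can read off the embedding dimensionality
vec = embed.embed_documents(["hello"])

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=len(vec[0]),  # dimensionality of cohere v3
        metric="dotproduct",  # cosine would also work with this model
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)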

Populating our Index

Now our knowledge base is ready to be populated with our data. We will use the embed helper function to embed our documents and add them to our index. To simplify things, we can include our chunk's text in the metadata field of each record.

from tqdm.auto import tqdm

# easier to work with dataset as pandas dataframe
data = dataset.to_pandas()

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [x["id"] for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['chunk'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone as a list of (id, vector, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

Defining an XML Agent

Anthropic trained their LLMs to use XML tags such as <input>{some input}</input>. When using a tool, they produce:

<tool>{tool name}</tool>
<tool_input>{tool input}</tool_input>

This format is very different from the format produced by typical ReAct agents. Because of this, ReAct agents tend to perform worse than XML agents when using Anthropic LLMs.

To create an XML agent, we need a prompt, llm, and a list of tools. We can download a prebuilt prompt for conversational XML agents from the LangChain hub.
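A minimal sketch of the download step follows, assuming the conversational XML agent prompt published on the LangChain hub as `hwchase17/xml-agent-convo`. The identifier is an assumption, since the text does not name the prompt, and `load_xml_prompt` with its `pull` hook is a hypothetical helper.

```python
PROMPT_ID = "hwchase17/xml-agent-convo"  # assumed hub identifier


def load_xml_prompt(pull=None):
    """Fetch the prebuilt XML agent prompt.

    `pull` is a hypothetical hook for testing; by default `hub.pull` from
    LangChain is used, which needs network access and `langchainhub`.
    """
    if pull is None:
        from langchain import hub  # lazy import keeps the sketch importable
        pull = hub.pull
    return pull(PROMPT_ID)
```

Calling `prompt = load_xml_prompt()` would give the `prompt` object that the agent is built from.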

We can see the XML format being used throughout the prompt when explaining to the LLM how it should use tools. Next, we initialize the `llm`:

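from langchain_community.chat_models import ChatAnthropic

# chat completion llm; the API key is read from the ANTHROPIC_API_KEY
# environment variable and the library's default model settings are used
llm = ChatAnthropic()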

Now, we need to initialize our tools list. The agent will use just one tool — the ArXiv search tool that will search through our vector index and return relevant contexts. We define the tool using the @tool decorator on a function that will consume a query string (i.e., the search query) and return a string containing all retrieved contexts.

from langchain.agents import tool

@tool
def arxiv_search(query: str) -> str:
    """Use this tool when answering questions about AI, machine learning, data
    science, or other technical questions that may be answered using arXiv
    papers.
    """
    # create query vector
    xq = embed.embed_query(query)
    # perform search
    out = index.query(vector=xq, top_k=5, include_metadata=True)
    # reformat results into string
    results_str = "\n---\n".join(
        [x["metadata"]["text"] for x in out["matches"]]
    )
    return results_str

tools = [arxiv_search]

When our agent uses this tool, it executes (and returns output) as shown below:
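The real output depends on what is stored in the index, so the sketch below only illustrates the shape of the string the tool returns, using made-up matches in place of a Pinecone response. `format_matches` and `fake_matches` are hypothetical names.

```python
def format_matches(matches) -> str:
    """Join the text of each match with the same separator the tool uses."""
    return "\n---\n".join(m["metadata"]["text"] for m in matches)


# made-up matches mimicking the structure of a Pinecone query response
fake_matches = [
    {"metadata": {"text": "First retrieved chunk."}},
    {"metadata": {"text": "Second retrieved chunk."}},
]

print(format_matches(fake_matches))
```

Each retrieved chunk appears on its own line, separated by a `---` line, which is what makes the contexts easy to split apart again later.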

Creating the Agent Executor

An agent executor is a linear "execution path" we run whenever the agent receives input. Within the agent executor, we combine data preparation steps, our agent prompt, the LLM, tools, and output parsing.

When we execute the agent, we provide it with a single input — the input text from a user. However, within the agent logic, an agent_scratchpad object will be passed, too, including tool information. To feed this information into our LLM, we must transform it into the XML format we described earlier; we define a convert_intermediate_steps function to handle that.

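def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        # wrap each tool call and its result in the XML tags described earlier
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log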

We must also parse the tools into a string containing tool_name: tool_description — we handle that with the convert_tools function.

def convert_tools(tools):
    return "\n".join([f"{tool.name}: {tool.description}" for tool in tools])
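To see what these two helpers produce, here is a small sketch that runs them on stand-in objects. The `SimpleNamespace` instances mimic only the attributes the helpers read (`tool`, `tool_input`, `name`, `description`), and the helper bodies are an assumption based on the XML format described earlier.

```python
from types import SimpleNamespace


def convert_intermediate_steps(intermediate_steps):
    log = ""
    for action, observation in intermediate_steps:
        log += (
            f"<tool>{action.tool}</tool><tool_input>{action.tool_input}"
            f"</tool_input><observation>{observation}</observation>"
        )
    return log


def convert_tools(tools):
    return "\n".join(f"{tool.name}: {tool.description}" for tool in tools)


# stand-ins for a LangChain agent action and a tool object
action = SimpleNamespace(tool="arxiv_search", tool_input="llama 2")
fake_tool = SimpleNamespace(name="arxiv_search", description="Search arXiv chunks.")

steps_xml = convert_intermediate_steps([(action, "Llama 2 is ...")])
tools_text = convert_tools([fake_tool])
print(steps_xml)
print(tools_text)
```

The first string is what fills the agent scratchpad in the prompt, and the second is what fills its tool list.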

With everything ready, we can go ahead and initialize our agent object using LangChain Expression Language (LCEL). We add instructions for when the LLM should stop generating with llm.bind(stop=[...]), and finally, we parse the output from the agent using an XMLAgentOutputParser object.

from langchain.agents.output_parsers import XMLAgentOutputParser

agent = (
    {
        "input": lambda x: x["input"],
        # without "chat_history", tool usage has no context of prev interactions
        "chat_history": lambda x: x["chat_history"],
        "agent_scratchpad": lambda x: convert_intermediate_steps(
            x["intermediate_steps"]
        ),
    }
    | prompt.partial(tools=convert_tools(tools))
    | llm.bind(stop=["</tool_input>", "</final_answer>"])
    | XMLAgentOutputParser()
)

With our agent object initialized, we pass it to an AgentExecutor object alongside our original tools list:

from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent, tools=tools, return_intermediate_steps=True
)

Now, we can use the agent via the invoke method. Note that we have no "chat_history" so we will pass an empty string to that argument. We can create a helper function called chat to help us handle the chat — if you need conversational history in your use-case, you can add it here.
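One possible shape for that helper is sketched below. `make_chat` and `EchoExecutor` are hypothetical names: the factory wraps any object with an `invoke` method, and the echo class stands in for the real `agent_executor` so the example runs without API keys.

```python
def make_chat(executor):
    """Build a chat helper around any object exposing `invoke`."""

    def chat(text: str, chat_history: str = "") -> dict:
        # an empty string means "no previous conversation"
        return executor.invoke({"input": text, "chat_history": chat_history})

    return chat


class EchoExecutor:
    """Stand-in for AgentExecutor that simply echoes the input."""

    def invoke(self, inputs: dict) -> dict:
        return {
            **inputs,
            "output": f"echo: {inputs['input']}",
            "intermediate_steps": [],
        }


chat = make_chat(EchoExecutor())
out = chat("can you tell me about llama 2?")
print(out["output"])  # echo: can you tell me about llama 2?
```

With the real agent, `chat = make_chat(agent_executor)` would return the same kind of dictionary, including the `output` and `intermediate_steps` keys.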

The answer looks good. Let's try another:

We get more good answers. Now that we have our agent defined, we can move on to evaluating it with RAGAS.

Integrating RAGAS

We need a few things to integrate RAGAS evaluation: the retrieved contexts and the generated output from our pipeline.

We already have the generated output, which we're printing above. The retrieved contexts are also being logged, but we have yet to see how to extract them programmatically. Let's take a look at what we return in out:

When initializing our AgentExecutor object, we included return_intermediate_steps=True. This (unsurprisingly) returns the intermediate steps that the agent took to generate the final answer. Those steps include the response from our arxiv_search tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.

We extract the contexts themselves like so:
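A minimal sketch of the extraction, assuming each intermediate step is an `(action, observation)` pair and that the observation is the `---`-separated string returned by the search tool. `extract_contexts` and `fake_out` are hypothetical names.

```python
def extract_contexts(out: dict) -> list:
    """Collect the retrieved chunks from every tool call in `out`."""
    contexts = []
    for _action, observation in out["intermediate_steps"]:
        # the tool joins chunks with "\n---\n", so split on the same separator
        contexts.extend(observation.split("\n---\n"))
    return contexts


# made-up executor output with a single tool call
fake_out = {
    "output": "...",
    "intermediate_steps": [("<action>", "chunk one\n---\nchunk two")],
}

print(extract_contexts(fake_out))  # ['chunk one', 'chunk two']
```

If the agent answers without calling the tool, the list is simply empty, which is worth checking for before scoring retrieval.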


To evaluate our agent using RAGAS, we need a dataset containing questions, ideal contexts, and the ground truth answers to those questions. RAGAS does provide utilities for automatically generating these, but these are out of the scope of this article. Nonetheless, we will be using a prebuilt evaluation dataset created using RAGAS.

This dataset includes 51 questions, their most relevant contexts (according to GPT-4), and their truthful answers (again, according to GPT-4).

We first iterate through the questions in this evaluation dataset and ask our agent these questions.
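That loop can be sketched as follows. The function name, the injected `chat_fn` and `extract_fn` callables, and the column names (`question`, `answer`, `contexts`, `ground_truth`) are assumptions; the column names in particular should be matched to whatever the installed RAGAS version expects.

```python
def run_eval_questions(questions, ground_truths, chat_fn, extract_fn):
    """Ask each question and collect everything needed for evaluation."""
    records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for question, truth in zip(questions, ground_truths):
        out = chat_fn(question)
        records["question"].append(question)
        records["answer"].append(out["output"])
        records["contexts"].append(extract_fn(out))
        records["ground_truth"].append(truth)
    return records


# stand-ins so the sketch runs without the real agent
def fake_chat(text):
    return {
        "output": f"answer to {text}",
        "intermediate_steps": [(None, "ctx a\n---\nctx b")],
    }


def split_contexts(out):
    return [c for _, obs in out["intermediate_steps"] for c in obs.split("\n---\n")]


records = run_eval_questions(["q1"], ["truth 1"], fake_chat, split_contexts)
print(records)
```

Keeping the agent call and the context extraction as parameters makes it easy to rerun the same loop after changing the agent, which is the point of metrics-driven iteration.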

We transform this into a Dataset object like so:

Now we can run evaluation across a suite of metrics provided by RAGAS using evaluate.

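from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness)

# eval_data: the Dataset built in the previous step (variable name assumed)
metrics = [faithfulness, answer_relevancy, context_precision, context_recall]
result = evaluate(eval_data, metrics=metrics)
result = result.to_pandas()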

With that, we have generated our evaluation, but we must understand what these metrics mean before looking at the results.

Retrieval Metrics

Retrieval is the first step in every RAG pipeline, so we will focus on metrics that assess retrieval first. We primarily want to focus on context_recall and context_precision, but we must understand what they measure before diving into these metrics.

Actual vs. Predicted

When evaluating the performance of retrieval systems, we tend to compare the actual (ground truth) to predicted results. We define these as:

  • Actual condition is the true label of every context in the dataset. These are positive ($p$) if the context is relevant to our query or negative ($n$) if the context is irrelevant to our query.
  • Predicted condition is the predicted label determined by our retrieval system. Every context our pipeline returns is a predicted positive, i.e., $\hat{p}$. If our pipeline does not return a context, it is a predicted negative, i.e., $\hat{n}$.

Given these conditions, we can say the following:

  • $p\hat{p}$ is a true positive, meaning a relevant result has been returned.
  • $n\hat{n}$ is a true negative, meaning an irrelevant result was not returned.
  • $n\hat{p}$ is a false positive, meaning an irrelevant result has been returned.
  • $p\hat{n}$ is a false negative, meaning a relevant result has not been returned.

Let's see how these apply to our metrics in RAGAS.

Context Recall

Context recall (or just recall) measures how many relevant records in a dataset have been retrieved by the pipeline. We calculate it as follows:
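Written in terms of true positives (TP) and false negatives (FN), the standard definition of recall, which matches the description here, is:

```latex
\text{Recall@K} = \frac{TP}{TP + FN}
                = \frac{\text{relevant contexts among the top } K \text{ returned}}{\text{all relevant contexts in the dataset}}
```

This is the textbook formula rather than a statement of how RAGAS computes the score internally.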

RAGAS calculates context recall using Recall@K, where the @K represents the number of contexts returned. If we increase the @K value, the recall scores will improve (as the capture size of the retrieval step increases). At its extreme, we could set @K equal to the dataset size to guarantee perfect recall — although this negates the point of RAG in the first place.

By default, RAGAS uses a @K value of 5.

We can see all but the second set of results returned all relevant contexts. The score in this second set of results is `0.6` meaning that the pipeline returned 3/5 (60%) of relevant contexts.

All other results returned 1.0 (100%), meaning our pipeline retrieved all relevant contexts.

The recall is a useful metric but easily fooled by simply returning more records, i.e., increasing the @K value. Because of that, it is typically paired with precision.

Context Precision

Context precision (or just precision) is another popular retrieval metric. We typically see both recall and precision paired together when evaluating retrieval systems.

As with recall, the actual metric here is called Precision@K, where @K represents the number of contexts returned. However, unlike recall, precision focuses on the number of relevant results returned compared to the total results returned, whether relevant or not — this is equal to our chosen @K value.
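To make the two definitions concrete, here is a small sketch that computes both from a ranked list of retrieved context IDs and a set of relevant IDs. It follows the simplified Precision@K and Recall@K definitions used in this section, not RAGAS's internal LLM-based implementation, and the function name and example IDs are made up for illustration.

```python
def precision_recall_at_k(retrieved, relevant, k=5):
    """Return (precision@k, recall@k) for unique context IDs."""
    top_k = set(retrieved[:k])
    relevant = set(relevant)
    tp = len(top_k & relevant)  # relevant contexts that were returned
    fp = len(top_k - relevant)  # irrelevant contexts that were returned
    fn = len(relevant - top_k)  # relevant contexts that were missed
    precision = tp / (tp + fp) if top_k else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


retrieved = ["c1", "c2", "c3", "x1", "x2"]  # 3 relevant, 2 irrelevant
relevant = ["c1", "c2", "c3", "c4", "c5"]   # 5 relevant contexts exist

print(precision_recall_at_k(retrieved, relevant, k=5))  # (0.6, 0.6)
```

With five relevant contexts and K = 5, the denominators of both metrics are 5, which is why the two scores coincide in this example.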

Our precision@K scores are equal to our recall scores (this can happen when there are 5 relevant contexts for each query and we set @K = 5). This result means every query produced 100% precision except our 60% precision result, where only 3/5 returned contexts were relevant.

Generation Metrics


Faithfulness

The faithfulness metric measures (from 0 to 1) the factual consistency of an answer when compared to the retrieved context. A score of 1 means we can find all answer claims in the context. A score of 0 would indicate that we find no answer claims in the context.

We calculate the faithfulness like so:
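Based on the description above, the score can be written as the fraction of answer claims that the context supports:

```latex
\text{Faithfulness} = \frac{\text{number of claims in the answer supported by the retrieved context}}{\text{total number of claims in the answer}}
```

An answer with four claims, three of which appear in the context, would score 0.75 under this definition.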

When calculating faithfulness, RAGAS uses OpenAI LLMs to decide which claims are in the answer and whether they exist in the context. Because of this approach's "generative" nature, we won't always get accurate scores.

We get perfect scores for all but our fourth result, which scores 0.0 even though it contains some claims related to the context. Still, the fourth answer is less grounded in our context than the other responses, which justifies a low score.

Answer Relevancy

Answer relevancy is our final metric. It focuses on the generation component and is similar to our "context precision" metric as it measures how much of the returned information is relevant to our original question.

We return a low answer relevancy score when:

  • Answers are incomplete.
  • Answers contain redundant information.

A high answer relevancy score indicates that an answer is concise and does not contain "fluff" (i.e., irrelevant information).

The score is calculated by asking an LLM to generate multiple questions for a generated answer and then calculating the cosine similarity between the original and generated questions. Naturally, if we have a concise answer that answers a specific question, we should find that the generated question will have a high cosine similarity to the original question.
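The similarity step can be sketched in plain Python as below. The vectors are toy stand-ins for real question embeddings, the function names are hypothetical, and averaging the similarities is an assumption about how the per-question scores are combined.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(original_q_vec, generated_q_vecs):
    """Mean similarity between the original and the generated questions."""
    sims = [cosine_similarity(original_q_vec, g) for g in generated_q_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0]                  # toy embedding of the original question
generated = [[1.0, 0.0], [0.0, 1.0]]   # one identical, one unrelated question

score = answer_relevancy_score(original, generated)
print(score)  # 0.5
```

A focused answer leads the LLM to regenerate questions close to the original, pushing every similarity (and so the mean) toward 1.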

Again, we can see poorer performance from our fourth answer, but the remainder (particularly the answer with similarity greater than 0.9) perform well.

That's all for our introduction to using RAGAS with real LangChain agents. We've seen how to set up an XML agent with RAG, initialize a RAGAS instance, and use RAGAS to evaluate our XML agent performance.

The ability to measure retrieval and generation performance with a framework like RAGAS allows us to modify and improve the performance of our agents reliably. We should integrate it as part of a metrics-driven optimization process for any AI agent use case.