# Evaluation Measures in Information Retrieval

Evaluation of information retrieval (IR) systems is critical to making well-informed design decisions. From search to recommendations, evaluation measures are paramount to understanding what does and does not work in retrieval.

Many big tech companies contribute much of their success to well-built IR systems. One of Amazon’s earliest iterations of the technology was reportedly driving more than 35% of their sales[1]. Google attributes 70% of YouTube views to their IR recommender systems[2][3].

IR systems power some of the greatest companies in the world, and behind every successful IR system is a set of evaluation measures.

## Metrics in Information Retrieval

Evaluation measures for IR systems can be split into two categories: online or offline metrics.

Online metrics are captured during actual usage of the IR system when it is online. These consider user interactions like whether a user clicked on a recommended show from Netflix or if a particular link was clicked from an email advertisement (the click-through rate or CTR). There are many online metrics, but they all relate to some form of user interaction.

Offline metrics are measured in an isolated environment before deploying a new IR system. These look at whether a particular set of relevant results are returned when retrieving items with the system.

Evaluation measures can be categorized as either offline or online metrics. Offline metrics can be further divided into order-unaware or order-aware, which we will explain soon.

Organizations often use both offline and online metrics to measure the performance of their IR systems. It begins, however, with offline metrics to predict the system’s performance before deployment.

We will focus on the most useful and popular offline metrics:

• Recall@K
• Mean Reciprocal Rank (MRR)
• Mean Average Precision@K (MAP@K)
• Normalized Discounted Cumulative Gain (NDCG@K)

These metrics are deceptively simple yet provide invaluable insight into the performance of IR systems.

We can use one or more of these metrics in different evaluation stages. During the development of Spotify’s podcast search; Recall@K (using $K=1$) was used during training on “evaluation batches”, and after training, both Recall@K and MRR (using $K=30$) were used with a much larger evaluation set.

The last paragraph will make sense by the end of this article. For now, understand that Spotify was able to predict system performance before deploying anything to customers. This allowed them to deploy successful A/B tests and significantly increase podcast engagement[4].

We have two more subdivisions for these metrics; order-aware and order-unaware. This refers to whether the order of results impacts the metric score. If so, the metric is order-aware. Otherwise, it is order-unaware.

## Cats in Boxes

Throughout the article, we will be using a very small dataset of eight images. In reality, this number is likely to be millions or more.

Example query and ranking of the eight possible results.

If we were to search for “cat in a box”, we may return something like the above. The numbers represent the relevance rank of each image as predicted by the IR system. Other queries would yield a different order of results.

Example query and ranking with actual relevant results highlighted.

We can see that results 2, 4, 5, and 7 are actual relevant results. The other results are not relevant as they show cats without boxes, boxes without cats, or a dog.

### Actual vs. Predicted

When evaluating the performance of the IR system, we will be comparing actual vs. predicted conditions, where:

• Actual condition refers to the true label of every item in the dataset. These are positive ($p$) if an item is relevant to a query or negative ($n$) if an item is irrelevant to a query.
• Predicted condition is the predicted label returned by the IR system. If an item is returned, it is predicted as being positive ($\hat{p}$) and, if it is not returned, is predicted as a negative ($\hat{n}$).

From these actual and predicted conditions, we create a set of outputs from which we calculate all of our offline metrics. Those are the true/false positives and true/false negatives.

The positive results focus on what the IR system returns. Given our dataset, we ask the IR system to return two items using the “cat in a box” query. If it returns an actual relevant result this is a true positive ($p\hat{p}$); if it returns an irrelevant result, we have a false positive ($n\hat{p}$).

On returning the top two results, we get $n\hat{p}$ and $p\hat{p}$ results for the positives. For the unreturned negatives we have a mix of $n\hat{n}$ and $p\hat{n}$.

For negative results, we must look at what the IR system does not return. Let’s query for two results. Anything that is relevant but is not returned is a false negative ($p\hat{n}$). Irrelevant items that were not returned are true negatives ($n\hat{n}$).

With all of this in mind, we can begin with the first metric.

## Recall@K

Recall@K is one of the most interpretable and popular offline evaluation metrics. It measures how many relevant items were returned ($p\hat{p}$) against how many relevant items exist in the entire dataset ($p\hat{p}$$+$$p\hat{n}$). $$Recall@K = \frac{true Positives}{true Positives + false Negatives} = \frac{p\hat{p}}{p\hat{p}+p\hat{n}}$$ The K in this and all other offline metrics refers to the number of items returned by the IR system. In our example, we have a total number of N = 8 items in the entire dataset, so K can be any value between $[1, …, N]$.

With recall@2 we return the predicted top K = 2 most relevant results.

When K = 2, our recall@2 score is calculated as the number of returned relevant results over the total number of relevant results in the entire dataset. That is: $$Recall@2 = \frac{p\hat{p}}{p\hat{p}+p\hat{n}} = \frac{1}{1+3} = 0.25$$ With recall@K, the score improves as K increases and the scope of returned items increases.

With recall@K we will see the score increase as K increases and more positives (whether true or false) are returned.

We can calculate the same recall@K score easily in Python. For this, we will define a function named recall that takes lists of actual conditions and predicted conditions, a K value, and returns a recall@K score.

Using this, we will replicate our eight-image dataset with actual relevant results in rank positions 2, 4, 5, and 7.

### Pros and Cons

Recall@K is undoubtedly one of the most easily interpretable evaluation metrics. We know that a perfect score indicates that all relevant items are being returned. We also know that a smaller k value makes it harder for the IR system to score well with recall@K.

Still, there are disadvantages to using recall@K. By increasing K to N or near N, we can return a perfect score every time, so relying solely on recall@K can be deceptive.

Another problem is that it is an order-unaware metric. That means if we used recall@4 and returned one relevant result at rank one, we would score the same as if we returned the same result at rank four. Clearly, it is better to return the actual relevant result at a higher rank, but recall@K cannot account for this.

## Mean Reciprocal Rank (MRR)

The Mean Reciprocal Rank (MRR) is an order-aware metric, which means that, unlike recall@K, returning an actual relevant result at rank one scores better than at rank four.

Another differentiator for MRR is that it is calculated based on multiple queries. It is calculated as: $$MRR = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{rank_q}$$ $Q$ is the number of queries, $q$ a specific query, and $rank_q$ the rank of the first *actual relevant* result for query $q$. We will explain the formula step-by-step.

Using our last example where a user searches for “cat in a box”. We add two more queries, giving us $Q = 3$.

We perform three queries while calculating the MRR score.

We calculate the rank reciprocal $\frac{1}{rank_q}$ for each query $q$. For the first query, the first actual relevant image is returned at position two, so the rank reciprocal is $\frac{1}{2}$. Let’s calculate the reciprocal rank for all queries:

$$query \space 1: \frac{1}{rank_1} = \frac{1}{2} = 0.5$$

$$query \space 2: \frac{1}{rank_2} = \frac{1}{1} = 1.0$$

$$query \space 3: \frac{1}{rank_3} = \frac{1}{5} = 0.2$$

Next, we sum all of these reciprocal ranks for queries $q=[1, …, Q]$ (e.g., all three of our queries):

$$\sum_{q=1}^{Q} \frac{1}{rank_q} = 0.5 + 1.0 + 0.2 = 1.7$$

As we are calculating the mean reciprocal rank (MRR), we must take the average value by dividing our total reciprocal ranks by the number of queries $Q$:

$$MRR = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{rank_q} = \frac{1}{3}1.7 ≅ 0.57$$

Now let’s translate this into Python. We will replicate the same scenario where $Q = 3$ using the same actual relevant results.

And as expected, we calculate the same MRR score of 0.57.

### Pros and Cons

MRR has its own unique set of advantages and disadvantages. It is order-aware, a massive advantage for use cases where the rank of the first relevant result is important, like chatbots or question-answering.

On the other hand, we consider the rank of the first relevant item, but no others. That means for use cases where we’d like to return multiple items like recommendation or search engines, MRR is not a good metric. For example, if we’d like to recommend ~10 products to a user, we ask the IR system to retrieve 10 items. We could return just one actual relevant item in rank one and no other relevant items. Nine of ten irrelevant items is a terrible result, but MRR would score a perfect 1.0.

Another minor disadvantage is that MRR is less readily interpretable compared to a simpler metric like recall@K. However, it is still more interpretable than many other evaluation metrics.

## Mean Average Precision (MAP)

Mean Average Precision@K (MAP@K) is another popular order-aware metric. At first, it seems to have an odd name, a mean of an average? It makes sense; we promise.

There are a few steps to calculating MAP@K. We start with another metric called precision@K: $$Precision@K = \frac{true Positives}{true Positives + false Positives} = \frac{p\hat{p}}{p\hat{p}+n\hat{p}}$$ You may be think this looks very similar to recall@K, and it is! The only difference is that we’ve swapped $p\hat{n}$ in recall for $n\hat{p}$ here. That means we now consider both actual relevant and non-relevant results only from the returned items. We do not consider non-returned items with precision@K.

The difference between Recall@K and Precision@K calculation where K = 2.

Let’s return to the “cat in a box” example. We will return $K = 2$ items and calculate precision@2.

Precision@K calculation where K = 2.

$$Precision@2 = \frac{p\hat{p}}{p\hat{p}+n\hat{p}} = \frac{p\hat{p}}{K} = \frac{1}{2} = 0.5$$

Note that the denominator in precision@K always equals $K$. Now that we have the precision@K value, we move on to the next step of calculating the Average Precision@K (AP@K):

$$AP@K = \frac{\sum_{k = 1}^{K} (Precision@k * rel_k)}{ number \space of \space relevant \space results}$$

Note that for AP@K we are taking the average precision@k score for all values of $k = [1, …, K]$. Meaning that for AP@5 we calculate precision@k where $k=[1, 2, 3, 4, 5]$.

We have not seen the $rel_k$ parameter before. It is a relevance parameter that (for AP@K) is equal to 1 when the $k^{th}$ item is relevant or 0 when it is not.

We calculate $precision@k$ and $rel_k$ iteratively using $k = [1, ..., K]$.

As we multiply precision@k and $rel_k$, we only need to consider precision@k where the $k^{th}$ item is relevant.

$Precision@k$ and $rel_k$ values for all relevant items across three queries $q = [1, ..., Q]$.

Given these values, for each query $q$, we can calculate the $AP@K_q$ score where $K=8$ as:

$$AP@8_1 = \frac{0.5*1 + 0.5*1 + 0.6*1 + 0.57*1}{4} = 0.54$$

$$AP@8_2 = \frac{1*1 + 0.5*1 + 0.4*1 + 0.43*1}{4} = 0.67$$

$$AP@8_3 = \frac{0.2*1 + 0.25*1}{2} = 0.22$$

Each of these individual $AP@K_q$ calculations produces a single Average Precision@K score for each query $q$. To get the Mean Average Precision@K (MAP@K) score for all queries, we simply divide by the number of queries $Q$:

$$MAP@K = \frac{1}{Q}\sum_{q=1}^{Q}AP@K_q = \frac{1}{3}*(0.54+0.67+0.22) = 0.48$$

That leaves us with a final MAP@K score of 0.48. To calculate all of this with Python, we write:

Returning the same MAP@K score of $0.48$.

### Pros and Cons

MAP@K is a simple offline metric that allows us to consider the order of returned items. Making this ideal for use cases where we expect to return multiple relevant items.

The primary disadvantage of MAP@K is the $rel_K$ relevance parameter is binary. We must either view items as relevant or irrelevant. It does not allow for items to be slightly more/less relevant than others.

## Normalized Discounted Cumulative Gain (NDCG@K)

Normalized Discounted Cumulative Gain @K ($NDCG@K$) is another order-aware metric that we can derive from a few simpler metrics. Starting with Cumulative Gain ($CG@K$) calculated like so: $$CG@K = \sum_{k=1}^{K}rel_k$$ The $rel_k$ variable is different this time. It is a range of relevance ranks where *0* is the least relevant, and some higher value is the most relevant. The number of ranks does not matter; in our example, we use a range of $0 \rightarrow 4$.

Using $rel_k$ we rank every item based on its relevance to a particular query $q$.

Let’s try applying this to another example. We will use a similar eight-image dataset as before. The circled numbers represent the IR system’s predicted ranking, and the diamond shapes represent the $rel_k$ actual ranking.

A small dataset with predicted ranks (circles) and actual ranks (diamonds).

To calculate the cumulative gain at position K (CG@K), we sum the relevance scores up to the predicted rank K. When $K = 2$:

$$CG@2 = \sum_{k=1}^{2}rel_k = rel_1 + rel_2 = 0 + 4 = 4$$

It’s important to note that CG@K is not order-aware. If we swap images 1 and 2, we will return the same score when $K \geq 2$ despite having the more relevant item placed first.

Images 1 and 2 have been swapped.

$$CG@2 = \sum_{k=1}^{2}rel_k = rel_1 + rel_2 = 4 + 0 = 4$$

To handle this lack of order awareness, we modify the metric to create DCG@K, adding a penalty in the form of $log_{2}(1+k)$ to the formula:

$$DCG@2 = \sum_{k=1}^{K}\frac{rel_k}{log_2(1+k)}$$

Now when we calculate DCG@2 and swap the position of the first two images, we return different scores:

$$original: \space DCG@2 = \frac{0}{\log_2(1+1)}+\frac{4}{\log_2(1+2)} = 0 + 2.52 = 2.52$$

$$swapped: \space DCG@2 = \frac{4}{\log_2(1+1)}+\frac{0}{\log_2(1+2)} = 4 + 0 = 4$$

$DCG@K$ score as $K$ increases using the query #1 order of results.

Using the order-aware $DCG@K$ metric means the preferred swapped results returns a better score.

Unfortunately, DCG@K scores are very hard to interpret as their range depends on the variable $rel_k$ range we chose for our data. We use the Normalized DCG@K (NDCG@K) metric to fix this.

$NDCG@K$ is a special modification of standard NDCG that cuts off any results whose rank is greater than $K$. This modification is prevalent in use-cases measuring search performance[5].

NDCG@K normalizes DCG@K using the Ideal DCG@K (IDCG@K) rankings. For IDCG@K, we assume that the most relevant items are ranked highest and in order of relevance.

Calculating IDCG@K takes nothing more than reordering the assigned ranks and performing the same DCG@K calculation:

$$IDCG@2 = \frac{4}{\log_2(1+1)}+\frac{4}{\log_2(1+2)} = 4 + 2.52 = 6.52$$

IDCG@K score as $K$ increases compared against the DCG@K score calculated with using the query #1 order of results.

Now all we need to calculate NDCG@K is to normalize our DCG@K score using the IDCG@K score:

$$NDCG@K = \frac{DCG@K}{IDCG@K} = \frac{2.52}{6.52} = 0.39$$

NDCG@K score as $K$ increases calculated by normalizing DCG@K using IDCG@K.

Using NDCG@K, we get a more interpretable result of 0.41, where we know that 1.0 is the best score we can get with all items ranked perfectly (e.g., the IDCG@K).

### Pros and Cons

NDCG@K is one of the most popular offline metrics for evaluating IR systems, in particular web search engines. That is because NDCG@K optimizes for highly relevant documents, is order-aware, and is easily interpretable.

However, there is a significant disadvantage to NDCG@K. Not only do we need to know which items are relevant for a particular query, but we need to know whether each item is more/less relevant than other items; the data requirements are more complex.

Example of data for the other metrics (left) and the more complex data required for NDCG@K (right).

These are some of the most popular offline metrics for evaluating information retrieval systems. A single metric can be a good indicator of system performance. For even more confidence in retrieval performance you can use several metrics, just as Spotify did with recall@1, recall@30, and MRR@30.

These metrics are still best supported with online metrics during A/B testing, which act as the next step before deploying your retrieval system to the world. However, these offline metrics are the foundation behind any retrieval project.

Whether you’re prototyping your very first product, or evaluating the latest iteration of Google search, evaluating your retrieval system with these metrics will help you deploy the best retrieval system possible.

## References

Code Walkthrough

[1] G. Linden et al., Item-to-Item Collaborative Filtering (2003), IEEE

[2] J. Solsman, YouTube’s AI is the puppet master over most of what you watch (2018)

[3] P. Covington et al., Deep Neural Networks for YouTube Recommendations (2016), RecSys

[4] A. Tamborrino, Introducing Natural Language Search for Podcast Episodes (2022), Engineering at Spotify Blog

[5] Y. Wang et al., A Theoretical Analysis of NDCG Ranking Measures (2013), JMLR

## Nomenclature

$K: number \space of \space retrieved \space results$

$k: position \space k \space of \space retrieved \space items$

$Q: number \space of \space queries$

$q: query$

$rank_q: rank \space of \space first \space relevant \space result \space for \space query \space q$

$rel_k: relevance \space of \space item \space at \space position \space k$

$p: actual \space relevant \space result$

$\hat{p}: predicted \space relevant \space result$

$n: actual \space irrelevant \space result$

$\hat{n}: predicted \space irrelevant \space result$

$p\hat{p}: true \space positive$

$n\hat{p}: false \space positive$

$n\hat{n}: true \space negative$

$p\hat{n}: false \space negative$