# Evaluation Measures in Information Retrieval

> Evaluation of information retrieval (IR) systems is critical to making well-informed design decisions. From search to recommendations, evaluation measures are paramount to understanding what does and does not work in retrieval.

Laura Carnevali · 2023-06-30

Evaluation of information retrieval (IR) systems is critical to making well-informed design decisions. From search to recommendations, evaluation measures are paramount to understanding what does and _does not_ work in retrieval.

Many big tech companies contribute much of their success to well-built IR systems. One of Amazon’s earliest iterations of the technology was reportedly driving more than 35% of their sales[1]. Google attributes 70% of YouTube views to their IR recommender systems[2][3].

IR systems power some of the greatest companies in the world, and behind every successful IR system is a set of evaluation measures.

---

## Metrics in Information Retrieval

[Video](https://www.youtube.com/watch?v=BD9TkvEsKwM)


Evaluation measures for IR systems can be split into _two_ categories: _online_ or _offline_ metrics.

**Online metrics** are captured during actual usage of the IR system when it is _online_. These consider user interactions like whether a user clicked on a recommended show from Netflix or if a particular link was clicked from an email advertisement (the click-through rate or CTR). There are many online metrics, but they all relate to some form of user interaction.

**Offline metrics** are measured in an isolated environment before deploying a new IR system. These look at whether a particular set of _relevant_ results are returned when retrieving items with the system.



![Evaluation measures can be categorized as either offline or online metrics. Offline metrics can be further divided into order-unaware or order-aware, which we will explain soon.](https://cdn.sanity.io/images/vr8gru94/production/74baa4032f93d8444e0b52e3aacbb1e5278c1f90-921x561.png)


Organizations often use _both_ offline and online metrics to measure the performance of their IR systems. It begins, however, with offline metrics to predict the system’s performance _before deployment_.

We will focus on the most useful and popular offline metrics:

- Recall@K
- **M**ean **R**eciprocal **R**ank (MRR)
- **M**ean **A**verage **P**recision@K (MAP@K)
- **N**ormalized **D**iscounted **C**umulative **G**ain (NDCG@K)

These metrics are deceptively simple yet provide invaluable insight into the performance of IR systems.

We can use one or more of these metrics in different evaluation stages. During the development of Spotify’s podcast search; _Recall@K_ (using $K=1$) was used during training on “evaluation batches”, and after training, [both](https://www.pinecone.io/learn/spotify-podcast-search/#:~:text=Spotify%20details%20their%20full%2Dretrieval%20setting%20metrics%20as%20using%20Recall%4030%20and%20MRR%4030%2C%20performed%20both%20on%20queries%20from%20the%20eval%20set%20and%20on%20their%20curated%20dataset.) [Recall@K](https://www.pinecone.io/learn/spotify-podcast-search/#:~:text=Spotify%20details%20their%20full%2Dretrieval%20setting%20metrics%20as%20using%20Recall%4030%20and%20MRR%4030%2C%20performed%20both%20on%20queries%20from%20the%20eval%20set%20and%20on%20their%20curated%20dataset.) [and](https://www.pinecone.io/learn/spotify-podcast-search/#:~:text=Spotify%20details%20their%20full%2Dretrieval%20setting%20metrics%20as%20using%20Recall%4030%20and%20MRR%4030%2C%20performed%20both%20on%20queries%20from%20the%20eval%20set%20and%20on%20their%20curated%20dataset.) [MRR](https://www.pinecone.io/learn/spotify-podcast-search/#:~:text=Spotify%20details%20their%20full%2Dretrieval%20setting%20metrics%20as%20using%20Recall%4030%20and%20MRR%4030%2C%20performed%20both%20on%20queries%20from%20the%20eval%20set%20and%20on%20their%20curated%20dataset.) (using $K=30$) were used with a much larger evaluation set.

The last paragraph will make sense by the end of this article. For now, understand that Spotify was able to predict system performance _before_ deploying anything to customers. This allowed them to deploy successful A/B tests and significantly increase podcast engagement[4].

We have two more subdivisions for these metrics; _order-aware_ and _order-unaware_. This refers to whether the order of results impacts the metric score. If so, the metric is _order-aware_. Otherwise, it is _order-unaware_.

## Cats in Boxes

Throughout the article, we will be using a _very_ small dataset of eight images. In reality, this number is likely to be millions or more.

![Example query and ranking of the eight possible results.](https://cdn.sanity.io/images/vr8gru94/production/50c0ec47628de98ba72b939cb5ef8663695161d5-3108x2287.png)


If we were to search for _“cat in a box”_, we may return something like the above. The numbers represent the relevance _rank_ of each image as predicted by the IR system. Other queries would yield a different order of results.

![Example query and ranking with actual relevant results highlighted.](https://cdn.sanity.io/images/vr8gru94/production/7415ad2dc6ee9948f1f9c222af52a8f4c116a26a-3112x2287.png)


We can see that results _2_, _4_, _5_, and _7_ are _actual relevant_ results. The other results are _not_ relevant as they show cats _without_ boxes, boxes _without_ cats, or a dog.

### Actual vs. Predicted

When evaluating the performance of the IR system, we will be comparing _actual_ vs. _predicted_ conditions, where:

- **Actual condition** refers to the true label of every item in the dataset. These are _positive_ ($p$) if an item is relevant to a query or _negative_ ($n$) if an item is _ir_relevant to a query.
- **Predicted condition** is the _predicted_ label returned by the IR system. If an item is returned, it is predicted as being _positive_ ($\hat{p}$) and, if it is not returned, is predicted as a _negative_ ($\hat{n}$).

From these actual and predicted conditions, we create a set of outputs from which we calculate all of our offline metrics. Those are the true/false positives and true/false negatives.

The _positive_ results focus on what the IR system returns. Given our dataset, we ask the IR system to return _two_ items using the _“cat in a box”_ query. If it returns an _actual relevant_ result this is a _true positive_ ($p\hat{p}$); if it returns an irrelevant result, we have a _false positive_ ($n\hat{p}$).

![](https://cdn.sanity.io/images/vr8gru94/production/583a91ad93ffeba3ad59cd8f2a5e049137d2f0e1-3196x2416.png)


For _negative_ results, we must look at what the IR system _does not_ return. Let’s query for two results. Anything that _is relevant_ but is _not_ returned is a _false negative_ ($p\hat{n}$). Irrelevant items that were _not_ returned are _true negatives_ ($n\hat{n}$).

With all of this in mind, we can begin with the first metric.

## Recall@K

_Recall@K_ is one of the most interpretable and popular offline evaluation metrics. It measures how many relevant items were returned ($p\hat{p}$) against how many relevant items exist in the entire dataset ($p\hat{p} + p\hat{n}$). 

$$
Recall@K = \frac{true Positives}{true Positives + false Negatives} = \frac{p\hat{p}}{p\hat{p}+p\hat{n}}
$$

The _K_ in this and all other offline metrics refers to the number of items returned by the IR system. In our example, we have a total number of _N = 8_ items in the entire dataset, so _K_ can be any value between $[1, ..., N]$.

![With recall@2 we return the predicted top K = 2 most relevant results.](https://cdn.sanity.io/images/vr8gru94/production/f04df4e20cd5948ed3bc2e88e2475573b8ca1d2a-3314x2617.png)


When _K = 2_, our _recall@2_ score is calculated as the number of _returned_ relevant results over the total number of relevant results in the _entire dataset_. That is:

$$
Recall@2 = \frac{p\hat{p}}{p\hat{p}+p\hat{n}} = \frac{1}{1+3} = 0.25
$$

With recall@K, the score improves as _K_ increases and the scope of returned items increases.

![With recall@K we will see the score increase as K increases and more positives (whether true or false) are returned.](https://cdn.sanity.io/images/vr8gru94/production/96815bfcc118cfb256df8f5d4a64ffd6b408429f-1300x721.png)


We can calculate the same recall@K score easily in Python. For this, we will define a function named `recall` that takes lists of _actual conditions_ and _predicted conditions_, a _K_ value, and returns a recall@K score.

```python
# recall@k function
def recall(actual, predicted, k):
    act_set = set(actual)
    pred_set = set(predicted[:k])
    result = round(len(act_set & pred_set) / float(len(act_set)), 2)
    return result
```

Using this, we will replicate our eight-image dataset with _actual relevant_ results in rank positions _2_, _4_, _5_, and _7_.

```json
{
  "_key": "51066ad20f50",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Recall@1 = 0.0\\n\",\n      \"Recall@2 = 0.25\\n\",\n      \"Recall@3 = 0.25\\n\",\n      \"Recall@4 = 0.5\\n\",\n      \"Recall@5 = 0.75\\n\",\n      \"Recall@6 = 0.75\\n\",\n      \"Recall@7 = 1.0\\n\",\n      \"Recall@8 = 1.0\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"actual = [\\\"2\\\", \\\"4\\\", \\\"5\\\", \\\"7\\\"]\\n\",\n    \"predicted = [\\\"1\\\", \\\"2\\\", \\\"3\\\", \\\"4\\\", \\\"5\\\", \\\"6\\\", \\\"7\\\", \\\"8\\\"]\\n\",\n    \"for k in range(1, 9):\\n\",\n    \"    print(f\\\"Recall@{k} = {recall(actual, predicted, k)}\\\")\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.9.12 ('ml')\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.12\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

### Pros and Cons

Recall@K is undoubtedly one of the most easily interpretable evaluation metrics. We know that a perfect score indicates that all relevant items are being returned. We also know that a smaller _k_ value makes it harder for the IR system to score well with recall@K.

Still, there are disadvantages to using _recall@K_. By increasing _K_ to _N_ or near _N_, we can return a perfect score every time, so relying solely on recall@K can be deceptive.

Another problem is that it is an _order-unaware metric_. That means if we used recall@4 and returned one relevant result at rank _one_, we would score the same as if we returned the same result at rank _four_. Clearly, it is better to return the actual relevant result at a higher rank, but recall@K _cannot_ account for this.

## Mean Reciprocal Rank (MRR)

The **M**ean **R**eciprocal **R**ank (MRR) is an _order-aware metric_, which means that, unlike recall@K, returning an actual relevant result at rank _one_ scores better than at rank _four_.

Another differentiator for MRR is that it is calculated based on multiple queries. It is calculated as:

$$
RR = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{rank_q}
$$

$Q$ is the number of queries, $q$ a specific query, and $rank_q$ the rank of the first *actual relevant* result for query $q$. We will explain the formula step-by-step.

Using our last example where a user searches for _“cat in a box”_. We add two more queries, giving us $Q=3$.

![We perform three queries while calculating the MRR score.](https://cdn.sanity.io/images/vr8gru94/production/3581ebed5a4faf7e840fe9a60e6100595c614c14-5354x1648.png)


We calculate the rank reciprocal $\frac{1}{rank_q}$ for each query $q$. For the first query, the first actual relevant image is returned at position _two_, so the rank reciprocal is $\frac{1}{2}$. Let’s calculate the reciprocal rank for all queries:

$$
query \space 1: \frac{1}{rank_1} = \frac{1}{2} = 0.5
$$

$$
query \space 2: \frac{1}{rank_2} = \frac{1}{1} = 1.0
$$

$$
query \space 3: \frac{1}{rank_3} = \frac{1}{5} = 0.2
$$

Next, we sum all of these reciprocal ranks for queries $q=[1, ..., Q]$(e.g., all three of our queries):

$$
\sum_{q=1}^{Q} \frac{1}{rank_q} = 0.5 + 1.0 + 0.2 = 1.7
$$

As we are calculating the **mean** reciprocal rank (**M**RR), we must take the average value by dividing our total reciprocal ranks by the number of queries $Q$:

$$
MRR = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{rank_q} = \frac{1}{3}1.7 ≅ 0.57
$$

Now let’s translate this into Python. We will replicate the same scenario where $Q=3$ using the same _actual relevant_ results. 

```json
{
  "_key": "7e0b259e4ccf",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# relevant results for query #1, #2, and #3\\n\",\n    \"actual_relevant = [\\n\",\n    \"    [2, 4, 5, 7],\\n\",\n    \"    [1, 4, 5, 7],\\n\",\n    \"    [5, 8]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"query #1 = 1/2 = 0.5\\n\",\n      \"query #2 = 1/1 = 1.0\\n\",\n      \"query #3 = 1/5 = 0.2\\n\",\n      \"MRR = 0.57\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# number of queries\\n\",\n    \"Q = len(actual_relevant)\\n\",\n    \"\\n\",\n    \"# calculate the reciprocal of the first actual relevant rank\\n\",\n    \"cumulative_reciprocal = 0\\n\",\n    \"for i in range(Q):\\n\",\n    \"    first_result = actual_relevant[i][0]\\n\",\n    \"    reciprocal = 1 / first_result\\n\",\n    \"    cumulative_reciprocal += reciprocal\\n\",\n    \"    print(f\\\"query #{i+1} = 1/{first_result} = {reciprocal}\\\")\\n\",\n    \"\\n\",\n    \"# calculate mrr\\n\",\n    \"mrr = 1/Q * cumulative_reciprocal\\n\",\n    \"\\n\",\n    \"# generate results\\n\",\n    \"print(\\\"MRR =\\\", round(mrr,2))\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.9.12 ('ml')\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.12\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

And as expected, we calculate the same MRR score of _0.57_.

### Pros and Cons

MRR has its own unique set of advantages and disadvantages. It is _order-aware_, a massive advantage for use cases where the rank of the first relevant result is important, like chatbots or [question-answering](https://www.pinecone.io/learn/series/nlp/question-answering/).

On the other hand, we consider the rank of the _first_ relevant item, but no others. That means for use cases where we’d like to return multiple items like recommendation or search engines, MRR is not a good metric. For example, if we’d like to recommend ~10 products to a user, we ask the IR system to retrieve 10 items. We could return just one _actual relevant_ item in rank one and no other relevant items. Nine of ten irrelevant items is a terrible result, but MRR would score a perfect _1.0_.

Another _minor_ disadvantage is that MRR is less readily interpretable compared to a simpler metric like recall@K. However, it is still more interpretable than many other evaluation metrics.

## Mean Average Precision (MAP)

**M**ean **A**verage **P**recision@K (_MAP@K_) is another popular _order-aware_ metric. At first, it seems to have an odd name, a _mean_ of an _average_? It makes sense; we promise.

There are a few steps to calculating _MAP@K_. We start with another metric called _precision@K_:

$$
Precision@K = \frac{true Positives}{true Positives + false Positives} = \frac{p\hat{p}}{p\hat{p}+n\hat{p}}
$$

You may be think this looks very similar to _recall@K_, and it is! The only difference is that we’ve swapped $p\hat{n}$ in recall for $n\hat{p}$ here. That means we now consider both actual relevant and non-relevant results _only_ from the returned items. We _do not_ consider non-returned items with _precision@K_.

![The difference between Recall@K and Precision@K calculation where K = 2.](https://cdn.sanity.io/images/vr8gru94/production/73c9adf7611d9ebaaa3256cb70a80c28477b22f5-1910x670.png)


Let’s return to the _“cat in a box”_ example. We will return $K=2$ items and calculate _precision@2_.

![Precision@K calculation where K = 2.](https://cdn.sanity.io/images/vr8gru94/production/bc51c02815d7fab8381ed3c9f8db613ddb4bf1fb-3314x2619.png)


$$
Precision@2 = \frac{p\hat{p}}{p\hat{p}+n\hat{p}} = \frac{p\hat{p}}{K} = \frac{1}{2} = 0.5
$$

Note that the denominator in _precision@K_ always equals $K$. Now that we have the _precision@K_ value, we move on to the next step of calculating the **A**verage **P**recision@K (_AP@K_):

$$
AP@K = \frac{\sum_{k = 1}^{K} (Precision@k * rel_k)}{ number \space of \space relevant \space results}
$$

Note that for _AP@K_ we are taking the average _precision@k_ score for all values of $k = [1, ..., K]$. Meaning that for AP@5 we calculate _precision@k_ where $k=[1, 2, 3, 4, 5]$.

We have not seen the $rel_k$ parameter before. It is a _relevance_ parameter that (for _AP@K_) is equal to _1_ when the $k^{th}$ item is relevant or _0_ when it is not.

![](https://cdn.sanity.io/images/vr8gru94/production/c009584211e87b2bad916931b348085d13f34bc5-1206x990.png)


We calculate precision@k and $rel_k$ iteratively using $k=[1,...,K]$.

As we multiply _precision@k_ and  $rel_k$, we only need to consider _precision@k_ where the  $k^{th}$ item is relevant.

![Precision@k and rel_k values for all relevant items across three queries q = [1, ..., Q]](https://cdn.sanity.io/images/vr8gru94/production/5bbb920bf69343baedc4b93d2f7a48c576a18d77-737x1005.png)


Given these values, for each query $q$, we can calculate the $AP@K_q$ score where $K=8$ as:

$$
AP@8_1 = \frac{0.5*1 + 0.5*1 + 0.6*1 + 0.57*1}{4} = 0.54
$$

$$
AP@8_2 = \frac{1*1 + 0.5*1 + 0.4*1 + 0.43*1}{4} = 0.67
$$

$$
AP@8_3 = \frac{0.2*1 + 0.25*1}{2} = 0.22
$$

Each of these individual $AP@K_q$ calculations produces a single **A**verage **P**recision@K score for each query q�. To get the **M**ean **A**verage **P**recision@K (_MAP@K_) score for all queries, we simply divide by the number of queries $Q$:

$$
MAP@K = \frac{1}{Q}\sum_{q=1}^{Q}AP@K_q = \frac{1}{3}*(0.54+0.67+0.22) = 0.48
$$

That leaves us with a final _MAP@K_ score of _0.48_. To calculate all of this with Python, we write:

```json
{
  "_key": "00f4032ad1f1",
  "_type": "colabBlock",
  "jsonContent": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"AP@8_1 = 0.54\\n\",\n      \"AP@8_2 = 0.67\\n\",\n      \"AP@8_3 = 0.23\\n\",\n      \"mAP@8 = 0.48\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# initialize variables\\n\",\n    \"actual = [\\n\",\n    \"    [2, 4, 5, 7],\\n\",\n    \"    [1, 4, 5, 7],\\n\",\n    \"    [5, 8]\\n\",\n    \"]\\n\",\n    \"Q = len(actual)\\n\",\n    \"predicted = [1, 2, 3, 4, 5, 6, 7, 8]\\n\",\n    \"k = 8\\n\",\n    \"ap = []\\n\",\n    \"\\n\",\n    \"# loop through and calculate AP for each query q\\n\",\n    \"for q in range(Q):\\n\",\n    \"    ap_num = 0\\n\",\n    \"    # loop through k values\\n\",\n    \"    for x in range(k):\\n\",\n    \"        # calculate precision@k\\n\",\n    \"        act_set = set(actual[q])                                                                                                                                   \\n\",\n    \"        pred_set = set(predicted[:x+1])\\n\",\n    \"        precision_at_k = len(act_set & pred_set) / (x+1)\\n\",\n    \"        # calculate rel_k values\\n\",\n    \"        if predicted[x] in actual[q]:\\n\",\n    \"            rel_k = 1\\n\",\n    \"        else:\\n\",\n    \"            rel_k = 0\\n\",\n    \"        # calculate numerator value for ap\\n\",\n    \"        ap_num += precision_at_k * rel_k\\n\",\n    \"    # now we calculate the AP value as the average of AP\\n\",\n    \"    # numerator values\\n\",\n    \"    ap_q = ap_num / len(actual[q])\\n\",\n    \"    print(f\\\"AP@{k}_{q+1} = {round(ap_q,2)}\\\")\\n\",\n    \"    ap.append(ap_q)\\n\",\n    \"\\n\",\n    \"# now we take the mean of all ap values to get mAP\\n\",\n    \"map_at_k = sum(ap) / Q\\n\",\n    \"\\n\",\n    \"# generate results\\n\",\n    \"print(f\\\"mAP@{k} = {round(map_at_k, 2)}\\\")\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"interpreter\": {\n   \"hash\": \"b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.9.12 ('ml')\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.12\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}"
}
```

Returning the same _MAP@K_ score of 0.480.48.

### Pros and Cons

MAP@K is a simple offline metric that allows us to consider the _order_ of returned items. Making this ideal for use cases where we expect to return multiple relevant items.

The primary disadvantage of MAP@K is the $rel_K$ relevance parameter is binary. We must either view items as _relevant_ or _irrelevant_. It does not allow for items to be slightly more/less relevant than others.

## Normalized Discounted Cumulative Gain (NDCG@K)

_**N**ormalized_ _**D**iscounted_ _**C**umulative_ _**G**ain_ **_@K_** ($NDCG@K$) is another _order-aware metric_ that we can derive from a few simpler metrics. Starting with **C**umulative **G**ain ($CG@K$) calculated like so:

$$
CG@K = \sum_{k=1}^{K}rel_k
$$

The $rel_K$ variable is different this time. It is a range of relevance ranks where *0* is the least relevant, and some higher value is the most relevant. The number of ranks does not matter; in our example, we use a range of $0 \rightarrow 4$.

![Using rel_k we rank every item based on its relevance to a particular query q.](https://cdn.sanity.io/images/vr8gru94/production/52fa890651d63fb86eada79cd6dac2265f10e59a-1284x528.png)


Let’s try applying this to another example. We will use a similar eight-image dataset as before. The circled numbers represent the IR system’s _predicted_ ranking, and the diamond shapes represent the $rel_k$_ actual ranking_.

![A small dataset with predicted ranks (circles) and actual ranks (diamonds).](https://cdn.sanity.io/images/vr8gru94/production/b5a33fa20bfd0555b99520cd743b61895439f698-3032x2623.png)


To calculate the cumulative gain at position _K_ (_CG@K_), we sum the relevance scores up to the _predicted_ rank _K_. When $K=2$:

$$
CG@2 = \sum_{k=1}^{2}rel_k = rel_1 + rel_2 = 0 + 4 = 4
$$

It’s important to note that _CG@K_ is _not_ order-aware. If we swap images _1_ and _2_, we will return the same score when $K \geq 2$ despite having the more relevant item placed first.

![Images 1 and 2 have been swapped.](https://cdn.sanity.io/images/vr8gru94/production/05a0faaa720b1d94f2a333600bd444bb6e56c369-3032x2631.png)


$$
CG@2 = \sum_{k=1}^{2}rel_k = rel_1 + rel_2 = 4 + 0 = 4
$$

To handle this lack of order awareness, we modify the metric to create _DCG@K_, adding a penalty in the form of $log_{2}(1+k)$ to the formula:

$$
DCG@2 = \sum_{k=1}^{K}\frac{rel_k}{log_2(1+k)}
$$

Now when we calculate _DCG@2_ and swap the position of the first two images, we return different scores:

$$
original: \space DCG@2 = \frac{0}{\log_2(1+1)}+\frac{4}{\log_2(1+2)} = 0 + 2.52 = 2.52
$$

$$
swapped: \space DCG@2 = \frac{4}{\log_2(1+1)}+\frac{0}{\log_2(1+2)} = 4 + 0 = 4
$$

```python
from math import log2

# initialize variables
relevance = [0, 7, 2, 4, 6, 1, 4, 3]
K = 8

dcg = 0
# loop through each item and calculate DCG
for k in range(1, K+1):
    rel_k = relevance[k-1]
    # calculate DCG
    dcg += rel_k / log2(1 + k)
```

![](https://cdn.sanity.io/images/vr8gru94/production/baf630c47bdd3406a51facc9b5b2e8e1ed929b1a-1268x723.png)


Using the _order-aware_ $DCG@K$ metric means the preferred swapped results returns a better score.

Unfortunately, _DCG@K_ scores are very hard to interpret as their range depends on the variable $rel_k$ range we chose for our data. We use the **N**ormalized **DCG@K** (_NDCG@K_) metric to fix this.

_NDCG@K_ _is a special modification of standard NDCG that cuts off any results whose rank is greater than_ _K. This modification is prevalent in use-cases measuring search performance[5]._

_NDCG@K_ normalizes _DCG@K_ using the **I**deal **DCG@K** (_IDCG@K_) rankings. For _IDCG@K_, we assume that the most relevant items are ranked highest and in order of relevance.

Calculating _IDCG@K_ takes nothing more than reordering the assigned ranks and performing the same _DCG@K_ calculation:

$$
IDCG@2 = \frac{4}{\log_2(1+1)}+\frac{4}{\log_2(1+2)} = 4 + 2.52 = 6.52
$$

```python
# sort items in 'relevance' from most relevant to less relevant
ideal_relevance = sorted(relevance, reverse=True)

print(ideal_relevance)

idcg = 0
# as before, loop through each item and calculate *Ideal* DCG
for k in range(1, K+1):
    rel_k = ideal_relevance[k-1]
    # calculate DCG
    idcg += rel_k / log2(1 + k)
```

![IDCG@K score as K increases compared against the DCG@K score calculated with using the query #1 order of results.](https://cdn.sanity.io/images/vr8gru94/production/4d77a9067b2cba0139a4273187ac86517779d86b-1278x784.png)


Now all we need to calculate _NDCG@K_ is to normalize our _DCG@K_ score using the _IDCG@K_ score:

$$
NDCG@K = \frac{DCG@K}{IDCG@K} = \frac{2.52}{6.52} = 0.39
$$

```python
dcg = 0
idcg = 0

for k in range(1, K+1):
    # calculate rel_k values
    rel_k = relevance[k-1]
    ideal_rel_k = ideal_relevance[k-1]
    # calculate dcg and idcg
    dcg += rel_k / log2(1 + k)
    idcg += ideal_rel_k / log2(1 + k)
    # calcualte ndcg
    ndcg = dcg / idcg
```

![NDCG@K score as K increases calculated by normalizing DCG@K using IDCG@K.](https://cdn.sanity.io/images/vr8gru94/production/dd62a488e49de85ac7778e5a66d44df0f7556ad2-1295x721.png)


Using _NDCG@K_, we get a more interpretable result of _0.41_, where we know that _1.0_ is the _best_ score we can get with all items ranked perfectly (e.g., the _IDCG@K_).

### Pros and Cons

_NDCG@K_ is one of the most popular offline metrics for evaluating IR systems, in particular web search engines. That is because _NDCG@K_ optimizes for highly relevant documents, is _order-aware_, and is easily interpretable.

However, there is a significant disadvantage to _NDCG@K_. Not only do we need to know which items are relevant for a particular query, but we need to know whether each item is more/less relevant than other items; the data requirements are more complex.

![Example of data for the other metrics (left) and the more complex data required for NDCG@K (right).](https://cdn.sanity.io/images/vr8gru94/production/52261251dda5d26d15c6772086611227e7764b04-1340x383.png)


These are some of the most popular offline metrics for evaluating information retrieval systems. A single metric can be a good indicator of system performance. For even more confidence in retrieval performance you can use several metrics, just as Spotify did with recall@1, recall@30, and MRR@30.

These metrics are still best supported with online metrics during A/B testing, which act as _the next step_ before deploying your retrieval system to the world. However, these offline metrics are the foundation behind any retrieval project.

Whether you’re prototyping your very first product, or evaluating the latest iteration of Google search, evaluating your retrieval system with these metrics will help you deploy the best retrieval system possible.

## References

[Code Walkthrough](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/algos-and-libraries/offline-evaluation/offline-evaluation.ipynb)

[1] G. Linden et al., [Item-to-Item Collaborative Filtering](https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf) (2003), IEEE

[2] J. Solsman, [YouTube’s AI is the puppet master over most of what you watch](https://www.cnet.com/tech/services-and-software/youtube-ces-2018-neal-mohan/) (2018)

[3] P. Covington et al., [Deep Neural Networks for YouTube Recommendations](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf) (2016), RecSys

[4] A. Tamborrino, [Introducing Natural Language Search for Podcast Episodes](https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/) (2022), Engineering at Spotify Blog

[5] Y. Wang et al., [A Theoretical Analysis of NDCG Ranking Measures](http://proceedings.mlr.press/v30/Wang13.pdf) (2013), JMLR

## Nomenclature

$$
K: number \space of \space retrieved \space results
$$

$$
k: position \space k \space of \space retrieved \space items
$$

$$
Q: number \space of \space queries
$$

$$
q: query
$$

$$
rank_q: rank \space of \space first \space relevant \space result \space for \space query \space q
$$

$$
rel_k: relevance \space of \space item \space at \space position \space k
$$

$$
p: actual \space relevant \space result
$$

$$
\hat{p}: predicted \space relevant \space result
$$

$$
n: actual \space irrelevant \space result 
$$

$$
\hat{n}: predicted \space irrelevant \space result
$$

$$
p\hat{p}: true \space positive
$$

$$
n\hat{p}: false \space positive
$$

$$
n\hat{n}: true \space negative 
$$

$$
p\hat{n}: false \space negative
$$