Unsupervised Training of Retrievers Using GenQ
Fine-tuning effective dense retrieval models is challenging. Bi-encoders (sentence transformers) are the current best models for dense retrieval in semantic search. Unfortunately, they’re also notoriously data-hungry models that typically require a particular type of labeled training data.
Hard problems like this attract attention, and plenty of that attention goes into building ever-better techniques for training retrievers.
One of the most impressive is GenQ. This approach to building bi-encoder retrievers uses the latest text generation techniques to synthetically generate training data. In short, all we need are passages of text. The generation model then augments these passages with synthetic queries, giving us the exact format we need to train an effective bi-encoder model.
Let’s work through the details of this training method. At a high level, there are two key steps.
- Generate queries for pre-existing but unlabeled passages: Creating (query, passage) pairs.
- Fine-tune a bi-encoder model using these (query, passage) pairs and Multiple Negatives Ranking (MNR) loss.
High-level view of the GenQ training process.
Don’t worry if any (or even all) of the above doesn’t make sense. We’ll detail everything from start to finish.
We can describe data as either in-domain or out-of-domain. The domain here refers to the target data and use case of the eventual fine-tuned bi-encoder model.
For example, we may want to build a retriever model that encodes sentences (passages) for financial documents in German. In that case, any text from German financial documents is in-domain, and everything else is out-of-domain.
For our target domain of German financial documents, anything that fits the topic and we would expect our model to encounter is in-domain. Anything else is out-of-domain.
To achieve good performance with a language model (LM), we need to train (fine-tune) it on in-domain data. We would typically need a lot of labeled in-domain data to fine-tune a bi-encoder.
For most domains, we can either have a lot of unlabeled data or a little labeled data. It’s hard to get both, and most bi-encoder training needs both.
GenQ aims to break the reliance on labeled data by synthetically generating queries for otherwise unlabeled passages of text, producing (query, passage) pairs from an unlabeled dataset. That means that given a large, in-domain, but unlabeled dataset, we can train with GenQ.
The task that GenQ is designed for is referred to as asymmetric semantic search. That means the query is much shorter than the passage we aim to retrieve. A typical query may consist of (for example) six words, “How do I tie my shoelaces?", while the relevant passage can be much longer:
“To tie your shoelaces, take both laces and place one over the other, pulling them tightly together…"
Asymmetric semantic search is where queries are typically much shorter than the passages/contexts being searched.
It is this task, with asymmetry between queries and passages, where GenQ can be applied.
Generation of Queries
We need passages and a query generation model to generate the (query, passage) pairs. The model used by GenQ is the Text-to-Text Transfer Transformer (T5).
The philosophy behind T5 is that every NLP task can be cast as a text-to-text problem, so the model is pretrained on many different tasks with vast amounts of data.
T5 views every task as a text-to-text problem. Here are a few examples adapted from the paper that introduced T5.
One of these tasks is query generation. In this case, the input text, or passage, is fed into a special query generation T5 model that generates questions that the passage may answer.
Given a large corpus of passages, such as paragraphs scraped from documentation or web pages, we use T5 to generate several queries for each passage.
Using a T5 model fine-tuned for query generation (like BeIR/query-gen-msmarco-t5-large-v1) we can generate sets of queries using passages of text.
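As a sketch of this step, we can load the BeIR query generation model with the transformers library. The generation settings here (top-k sampling, a 64-token limit) are reasonable defaults rather than values prescribed by the text:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "BeIR/query-gen-msmarco-t5-large-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout, etc.

passage = (
    "Python is an interpreted, high-level, general-purpose programming "
    "language. Its design philosophy emphasizes code readability."
)

inputs = tokenizer(passage, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,          # sampling produces diverse queries
        top_k=25,
        num_return_sequences=3,  # three queries per passage
    )
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
for q in queries:
    print(q)
```

Because we sample rather than decode greedily, the three queries differ from one another (and from run to run).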
It’s important to note that query generation is not perfect. The query generation model has not seen our target domain, so the queries it produces can be noisy, with plenty of randomness and the occasional nonsensical query. Because of that, GenQ is prone to poor performance where the synthetic data is too noisy.
We have what should be a very large dataset of (query, passage) pairs. With this data, we can move on to fine-tuning the bi-encoder model.
Fine-Tuning the Bi-Encoder
To fine-tune the bi-encoder (sentence transformer) we use Multiple Negatives Ranking (MNR) loss. MNR loss is ideal for training where our dataset consists of pairs of related sentences.
For example, when training a QA retriever model, we can train with MNR loss if we have sets of (question, answer) pairs. If we have a Natural Language Inference (NLI) dataset, we can use MNR loss to train on (anchor, positive) pairs. In this case, we fine-tune on (query, passage) pairs.
MNR loss works by placing all of these pairs into batches. For each batch, the model is optimized so that pair (Qi, Pj=i) has the highest similarity. Meaning that within a batch of 32, the similarity score between Qi=3 and Pj=3 must be higher than the similarity between Qi=3 and any other passage Pj≠3.
Similarity scores using five (query, passage) pairs. MNR loss optimizes so that (Qi, Pi) scores higher than any other pair (Qi, Pj≠i)
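As an illustrative re-implementation (not the sentence-transformers code itself), the in-batch MNR objective can be sketched in a few lines of PyTorch; the scale factor of 20 mirrors a common default:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, dim = 32, 768

# Normalized query and passage embeddings for one batch of pairs.
q = F.normalize(torch.randn(batch_size, dim), dim=-1)
p = F.normalize(torch.randn(batch_size, dim), dim=-1)

# scores[i, j] = cosine similarity between query i and passage j,
# scaled so that the softmax produces sharper distributions.
scores = q @ p.T * 20.0

# The correct passage for query i sits on the diagonal (j = i), so the
# target "class" for row i is simply i: cross-entropy then pushes
# sim(Q_i, P_i) above sim(Q_i, P_j) for every j != i.
labels = torch.arange(batch_size)
loss = F.cross_entropy(scores, labels)
print(float(loss))
```

In practice, sentence-transformers provides this as MultipleNegativesRankingLoss, which computes the same kind of in-batch softmax over similarity scores.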
At the end of this training process, we have a new bi-encoder fine-tuned to a specific domain. The model’s performance can vary depending on the models being used, source and target domains, and many other variables. However, GenQ can sometimes achieve performance approaching models trained with supervised methods.
Let’s move on to the implementation of GenQ.
First, we need a dataset to train on. We will take the context paragraphs from the Stanford Question Answering Dataset (SQuAD), which we will download from HuggingFace Datasets.
In this dataset, we already have query ('question') and passage ('context') pairs. However, we want to emulate the scenario in which we do not have queries, so we will remove all but the 'context' data.
Now that we have our passages, we can begin generating queries. For this, we need a query generation model. We will use a T5 model fine-tuned for query generation as part of the BeIR project, named BeIR/query-gen-msmarco-t5-large-v1.
Some layers in the model behave differently during training and inference. To ensure the model is running in “inference mode”, we call model.eval().
With this, the model will generate three queries for each passage. In this case, we generate 56,673 pairs from 18,891 passages and save them as TSV files.
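The bookkeeping around this step, three queries per passage written out as tab-separated (query, passage) rows, can be sketched independently of the model; generate_queries below is a stand-in for the actual T5 generation call:

```python
import csv

def generate_queries(passage, n=3):
    # Stand-in for the T5 generation step (which would call
    # model.generate with num_return_sequences=n).
    return ["sample query %d about: %s" % (i, passage[:30]) for i in range(n)]

passages = [
    "Python is an interpreted, high-level programming language.",
    "The Amazon rainforest covers much of the Amazon basin.",
]

pairs = []
for passage in passages:
    for query in generate_queries(passage, n=3):
        pairs.append((query, passage))

# Write (query, passage) rows as TSV; strip stray tabs so the format
# stays at exactly two columns per line.
with open("pairs.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for query, passage in pairs:
        writer.writerow((query.replace("\t", " "), passage.replace("\t", " ")))

print(len(pairs))  # 3 queries per passage -> 6 pairs here
```

Run over the full 18,891 SQuAD passages, this loop is what produces the 56,673 pairs mentioned above.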
We can see that the queries are generally much smaller than the passages; this is where the asymmetric in asymmetric semantic search comes from.
Example of a few generated queries given a paragraph about Python.
The next step is to fine-tune a model using MNR loss. We do this easily with the sentence-transformers library.
We start by loading the pairs dataset we created into a list of InputExample objects.
Next, we load the pairs into a NoDuplicatesDataLoader. We use the no-duplicates data loader to avoid placing duplicate passages in the same batch, as duplicates would confuse the ranking mechanism of MNR loss.
Now we initialize the bi-encoder that we will be fine-tuning. We create the transformer-to-pooler architecture using sentence-transformers modules.
Here we are initializing from a pretrained MPNet model, which outputs one 768-dimensional embedding per token (for sequences of up to 512 tokens). The second module is a mean pooling layer that averages these token embeddings to create a single sentence embedding.
With this, our bi-encoder is initialized. We now need to fine-tune the model, which we do using MNR loss.
Everything is now in place, and we fine-tune the model by calling the fit method.
We now have a fine-tuned bi-encoder that we can use for asymmetric semantic search. Let’s move on to setting up a search index and testing a few searches to see what we return.
For evaluation, we will work through a simple qualitative test. We take a few example questions from the SQuAD validation set, and we will (hopefully) see that we are returning relevant contexts.
We can use Pinecone as an ultra-fast way to store our vectors. All we need is an API key and to install the Pinecone client with pip install pinecone-client. To initialize our connection to Pinecone and create an index to store the vectors, we write:
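A sketch of that initialization, assuming the older pinecone-client interface (client APIs have changed across versions; newer releases use a Pinecone class instead of init) and a hypothetical index name:

```python
import pinecone

# Older pinecone-client interface (matching `pip install pinecone-client`);
# newer client versions use `pinecone.Pinecone(api_key=...)` instead.
pinecone.init(
    api_key="YOUR_API_KEY",          # from the Pinecone console
    environment="YOUR_ENVIRONMENT",  # e.g. "us-west1-gcp"
)

index_name = "genq-squad"  # hypothetical index name
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=768,    # must match the bi-encoder's embedding size
        metric="cosine",
    )
index = pinecone.Index(index_name)
```

The dimension must equal the 768-dimensional output of our MPNet bi-encoder, and cosine similarity matches the similarity function used during MNR training.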
The vector database will store all encoded contexts from the SQuAD validation set, so let’s download, encode, and upsert our contexts.
To download, we use HuggingFace Datasets as before.
We can now encode using our newly trained bi-encoder.
And finally upsert to Pinecone.
We’re now ready to begin querying; we can take a few example queries from SQuAD.
We immediately return the best possible answer as the highest-rated passage. Let’s try some more SQuAD queries.
Another great result; let’s try one final query.
All of these great results show that our model fine-tuned with GenQ has adapted well to the SQuAD domain.
That’s it for this chapter covering the GenQ training method, a clearly powerful approach to fine-tuning models where we have limited datasets.
Using this approach, we can take passages of text, generate (query, passage) pairs, and use these pairs to train effective bi-encoder models ideal for asymmetric semantic search.
GenQ is an excellent, low-effort technique enabling projects that focus or rely on retrieving passages of text from natural language queries. Using GenQ you can begin fine-tuning models with limited data, unlocking previously inaccessible domains.
 J. Ma, et al., Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation (2021), ACL
 N. Reimers, GenQ Page, SBERT.net
N. Reimers, et al., Semantic Search Page, SBERT.net
C. Raffel, et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2020), JMLR