Long Form Question Answering in Haystack

Question-Answering (QA) has exploded as a subdomain of Natural Language Processing (NLP) in the last few years. QA is a widely applicable use case in NLP yet was out of reach until the introduction of transformer models in 2017.

Without transformer models, the level of language comprehension required to make something as complex as QA work simply was not possible.

Although QA is a complex topic, it comes from a simple idea. The automatic retrieval of information via a more human-like interaction. The task of information retrieval (IR) is performed by almost every organization in the world. Without other options, organizations rely on person-to-person IR and rigid keyword search tools. This haphazard approach to IR generates a lot of friction, particularly for larger organizations.

Consider that many large organizations contain thousands of employees, each producing pages upon pages of unstructured text data. That data quickly gets lost in the void of unused directories and email archives.

QA offers a solution to this problem. Rather than these documents being lost in an abyss, they can be stored within a space where an intelligent QA agent can access them. Unlike humans, our QA agent can scan millions of documents in seconds and return answers from these documents almost instantly.

To interact with a QA agent, we don’t need to know any fancy search logic or code. Instead, we just ask a question as we would ask another human being. Suppose we want to understand why process X exists. In that case, we can ask, “why do we follow process X?” and the relevant information will be returned within milliseconds.

QA capability is not a “nice to have”. It is a key that can unlock ~90% of your organization’s data. Without it, unstructured data is lost almost as soon as it is made, akin to searching in the dark.

With QA tools, employees can stop wasting time searching for snippets of information and focus on their real, value-adding tasks.

A small investment in QA is, for most organizations, a no-brainer.

Long-Form Question-Answering

There are two overarching approaches to QA, abstractive (/generative) and extractive. The difference between the two lies in how the answer is constructed.

For abstractive QA, an answer is generated by an NLP generator model, usually based on external documents. Extractive QA differs from this approach. Rather than generating answers, it uses a reader model to extract them directly from external documents, similar to cutting out snippets from a newspaper.

We are using external documents to inform our generator/reader models in both cases. We call this open-book QA as it emulates open-book exams where students (the models) can refer to a book (external documents) during the exam. In our case, the void of unstructured data becomes our open-book.

An open-book abstractive QA pipeline looks like this:


Open-book abstractive QA pipeline. We start by indexing documents into a document store, and then perform searches with queries encoded by a retriever model. The retrieved contexts and the initial query are passed to a generator to produce an answer.

One form of open-book abstractive QA is Long-Form Question-Answering (LFQA). LFQA focuses on the generation of multi-sentence answers to open-ended questions.

Let’s work through an implementation of LFQA using the Haystack library.

LFQA in Haystack

Haystack is a popular Python library for building QA pipelines, allowing us to create an LFQA pipeline with just a few lines of code. You can find the full LFQA code here.

We start by installing the necessary libraries.

!pip install -U 'farm-haystack[pinecone]'>=1.3.0 pinecone-client datasets

Data Preparation

Our first task is to find a dataset to emulate our “void” of unstructured data. For that, we will use the Wikipedia Snippets dataset. The full dataset contains over 17M passages from Wikipedia, but for this demo, we will restrict the dataset to ~50K passages. Feel free to use the whole dataset, but it will take some time to process.

We can find the dataset in Hugging Face’s Datasets library. The streaming=True parameter allows us to stream the dataset rather than download it. The full dataset is over 9GB, and we don’t need it all; streaming allows us to iteratively download records one at a time.

The dataset contains eight features, of which we are most interested in the passage_text and section_title.

As we are limiting our dataset to ~50K passages, We will tighten the scope of topics and only extract records where the section_title feature is History.

Our dataset is now prepared. We can move on to initializing the various components in our LFQA pipeline.

Document Store

The document store is (not surprisingly) where we store our documents. Haystack allows us to use various document stores, each with its pros and cons. A key consideration is whether we want to support a sparse keyword search or enable a full semantic search. Naturally, a human-like QA system requires full semantic search capability.

With that in mind, we must use a document store that supports dense vectors. If we’d like to scale to larger datasets, we must also use a document store that supports Approximate Nearest Neighbors (ANN) search.

To satisfy these requirements, we use the PineconeDocumentStore, which also supports:

  • Single-stage metadata filtering, if using different section_title documents, we could use metadata filtering to tighten the search scope.
  • Instant index updates, meaning we can add millions of new documents and immediately see these new documents reflected in new queries.
  • Scalability to billions of documents.
  • Free hosting for up to 1M documents.

We first need to sign up for a free API key. After signing up, API keys can be found by clicking on a project and navigating to API Keys. Next, we initialize the document store using:

Here we specify the name of the index where we will store our documents, the similarity metric, and the embedding dimension embedding_dim. The similarity metric and embedding dimension can change depending on the retriever model used. However, most retrievers use "cosine" and 768.

We can check the current document and embedding count of our document store like so:

If there is an existing index called haystack-lfqa, the above will connect to the existing index rather than initialize a new one. Existing indexes can be found and managed by visiting the active project in the Pinecone dashboard.

We can start adding documents to our document store. To do this, we first create Haystack Document objects, where we will store the text content alongside some metadata for each document. The indexing process will be done in batches of 10_000.

Now, if we check the document count, we will see that ~50K documents have been added:

When looking at the embedding count, we still see zero; this reflects the embedding count in the Pinecone dashboard. This is because we have not created any embeddings of our documents. That is another step that requires the retriever component.


The QA pipeline relies on retrieving relevant information from our document store. In reality, this document store is what is called a vector database. Vector databases store vectors (surprise!), and each vector represents a single document’s text content (the context).


Queries and contexts with similar meaning are embedding into a similar vector space.

Using the retriever model, we can take text content and encode it into a vector embedding that numerically represents the text’s original “human” meaning.

The vector database is where we store these vectors. If we introduce a new query vector to an already populated vector database, we could use similarity metrics to measure its proximity to existing vectors. From there, we return the top k most similar vectors (e.g., the most semantically similar contexts).


We return the top k (in this case, k=5) most similar context vectors to our query vector. This visual demonstrates the top k vectors using Euclidean (or L2) distance, other common metrics include cosine similarity and dot product.

We will use Haystack’s EmbeddingRetriever component, which allows us to use any retriever model from the Sentence Transformers library hosted via the Hugging Face Model Hub.


The model we use on Hugging Face Model Hub.

First, we check that we are using the GPU, as this will make the retriever embedding process much faster.

The embedding step will still run if you do not have access to a CUDA-enabled GPU, but it may be slow. We initialize the EmbeddingRetriever component with the all_datasets_v3_mpnet-base model shown above.

We call the document_store.update_embeddings method and pass in our new retriever to begin the embedding process.

The batch_size parameter can be increased to reduce the embedding time. However, it is limited by GPU/CPU hardware and cannot be increased beyond those limits.

We can confirm that our document store now contains the embedded documents by calling document_store.get_embedding_count() or checking the embedding count in the Pinecone dashboard.


Screenshot from the Pinecone dashboard showing 49,995 vectors have been stored. Calling document_store.get_embedding_count() will return the same number.

Before moving on to the next step, we can test our document retrieval:

It looks like we’re returning good results. We now have the first two components of our LFQA pipeline: the document store and retriever. Let’s move on to the final component.


The generator is the component that builds our answer. Generators are sequence-to-sequence (Seq2Seq) models that take the query and retrieved contexts as input and use them to generate an output, the answer.


The input to our generator model is a concatenation of the question and any retrieved contexts. In this example each context is preceded by "

" to mark that the following is a new chunk of information for the generator to consider.

We initialize the generator using Haystack’s Seq2SeqGenerator with a model trained specifically for LFQA, for example, vblagoje/bart_lfqa or yjernite/bart_eli5 [1].

Now we can initialize the entire abstractive QA pipeline using Haystack’s GenerativeQAPipeline object. This pipeline combines all three components, with the document store included as part of the retriever.

With that, our abstractive QA pipeline is ready, and we can move on to making some queries.


When querying, we can specify the number of contexts for our retriever to return and the number of answers for our generator to generate using the top_k parameters.

There’s a lot here, but we can see that we have 1 final answer, followed by the 3 retrieved contexts. It’s hard to understand what is happening here, so we can use Haystack’s print_answer util to clean up the output.

Our output is now much more readable. The answer looks good, but there is not much detail. When we find an answer is either not good or lacks detail, there can be two reasons for this:

  • The generator model has not been trained on data that includes information about the “war on currents”, so it has not memorized this information within its model weights.

  • We have not returned any contexts that contain the answer, so the generator has no reliable external sources of information.

If neither condition is satisfied, the generator cannot produce a factually correct answer. However, in our case, we are returning some good external context. We can try and return more detail by increasing the number of contexts retrieved.

Now we’re seeing much more information. The latter half does descend into nonscensical gibberish, most likely because the higher top_k value retrieved several irrelevant contexts. However, given a larger dataset we could likely avoid this. We can also compare these results to an answer generated without any context by querying the generator directly.

Clearly, the retrieved contexts are important. Although this isn’t always the case, for example, if we ask about a more well-known fact:

For general knowledge queries, the generator model can often pull the answer directly from its own “memory” (the model weights optimized during training), where it may have seen training data containing the answer. Larger models have a larger memory and, in turn, are better at direct answer generation.

When we ask more specific questions, like our question about the war on currents, both smaller and larger generators rarely return good answers without an external data source.

We can ask a few more questions:

To confirm that this answer is correct, we can check the contexts used to generate the answer.

In this case, the answer looks correct. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, like with this question about COVID-19:

The issue with nonsensical or false answers is one drawback of the LFQA approach. However, it can be mitigated somewhat by implementing thresholds to filter our low confidence answers and referring to the sources behind generated answers.

Let’s finish with a final few questions.

That’s it for this walkthrough of Long-Form Question-Answering with Haystack. As mentioned, there are many approaches to building a pipeline like this. Many different retriever and generator models can be tested by simply switching the model names for other retriever/generator models.

With the wide variety of off-the-shelf models, many use cases can be built with little more than what we have worked through here. All that is left is to find potential use cases and try implementing LFQA (or other QA pipelines) and reap the benefits of enhanced data visibility and workplace efficiency that come with it.


[1] A. Fan, et al., ELI5: Long Form Question Answering (2019)

Haystack Example Notebooks

PineconeDocumentStore Integration Docs

Haystack Github Repo


What will you build?

Upgrade your search or recommendation systems with just a few lines of code, or contact us for help.