Haystack Integration

In this guide we will see how to integrate Pinecone with the popular Haystack library for question answering.

Installing Haystack

We start by installing the latest version of Haystack with all dependencies required for the PineconeDocumentStore.

!pip install -U 'farm-haystack[pinecone]'

Initializing the PineconeDocumentStore

We initialize a PineconeDocumentStore by providing an API key and environment name. (Create an account to get your API key.)

from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(
    api_key='<YOUR_API_KEY>',
    environment='us-west1-gcp'
)

Warning

If you see a ModuleNotFoundError or ImportError, try installing the Pinecone client manually using pip install -U pinecone-client.

Data Preparation

Before adding data to the document store, we must download and convert data into the Document format that Haystack uses.

We download pages from the Game of Thrones wiki.

from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http, print_answers

doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

Then convert these files into the Document format.

dicts = convert_files_to_dicts(
    dir_path=doc_dir,
    clean_func=clean_wiki_text,
    split_paragraphs=True
)

The Document format contains two fields: 'content' for the text content or paragraphs, and 'meta', where we can place any additional information that can later be used to apply metadata filtering in our search. Here is an example of the Document format:

{'content': "'''David Benioff''' (; né '''Friedman''' ; September 25, 1970) is "
            'an American screenwriter and television producer, writer, and '
            'director. Along with his collaborator D. B. Weiss, he is best '
            "known as co-creator, showrunner, and writer of ''Game of "
            "Thrones'' (2011–2019), the HBO adaptation of George R. R. "
            "Martin's series of books ''A Song of Ice and Fire''. He is also "
            "known for writing ''Troy'' (2004) and co-writing ''X-Men Origins: "
            "Wolverine'' (2009).",
 'meta': {'name': '33_David_Benioff.txt'}}

Indexing Documents

To index the documents we use the PineconeDocumentStore.write_documents method.

document_store.write_documents(dicts)

Creating and Upserting Embeddings

To create embeddings for our documents we must initialize a DensePassageRetriever model.

from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=2,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True
)

Then we run the PineconeDocumentStore.update_embeddings method with the retriever provided as an argument. GPU acceleration can greatly reduce the time required for this step.

document_store.update_embeddings(
    retriever,
    batch_size=16
)

Inspecting Documents and Embeddings

We can get documents by their ID with the PineconeDocumentStore.get_documents_by_id method.

d = document_store.get_documents_by_id(ids=['49091c797d2236e73fab510b1e9c7f6b'], return_embedding=True)[0]

From here we can view the document content with d.content and the document embedding with d.embedding.
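
For example, a quick sketch inspecting the document retrieved above:

print(d.content[:300])    # first 300 characters of the document text
print(len(d.embedding))   # embedding dimensionality (768 for the DPR models used above)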

Initializing an Extractive QA Pipeline

An ExtractiveQAPipeline contains three key components by default:

  • a document store (PineconeDocumentStore)
  • a retriever model
  • a reader model

We can initialize a reader model from the Hugging Face Model Hub; here we use deepset/roberta-base-squad2.

from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path="deepset/roberta-base-squad2", 
    use_gpu=True
)

We are now ready to initialize the ExtractiveQAPipeline.

from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

Asking Questions

Using our QA pipeline we can begin querying with pipe.run.

prediction = pipe.run(
    query="Who created the Dothraki vocabulary?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

We are also passing two top_k values: the retriever top_k defines how many records to retrieve from Pinecone. These records are then passed to the reader model, which identifies a specific answer within each content paragraph and reranks the returned records. The reader top_k defines how many of these reranked records to return.
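
The prediction returned by pipe.run is a dictionary whose "answers" key holds the reranked results as Haystack Answer objects, so we could loop over them directly (a minimal sketch), though Haystack also provides a convenience printer, shown next:

for answer in prediction["answers"]:
    print(answer.answer, answer.score)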

To view the answers we use haystack.utils.print_answers.

from haystack.utils import print_answers
print_answers(prediction, details="minimum")

This will return:

Query: Who created the Dothraki vocabulary?
Answers:
[   {   'answer': 'David J. Peterson',
        'context': 'orld. The language was developed for the TV series by the '
                   'linguist David J. Peterson, working off the Dothraki words '
                   "and phrases in Martin's novels.\n"
                   ','},
    {   'answer': 'David J. Peterson',
        'context': '\n'
                   '===Valyrian===\n'
                   'David J. Peterson, who created the Dothraki language for '
                   'the first season of the show, was entrusted by the '
                   'producers to design a new '},
    {   'answer': 'David J. Peterson',
        'context': "age for ''Game of Thrones''\n"
                   'The Dothraki vocabulary was created by David J. Peterson '
                   'well in advance of the adaptation. HBO hired the Language '
                   'Creatio'},
    {   'answer': 'D. B. Weiss and David Benioff',
        'context': '\n'
                   '===Conception and development===\n'
                   'Showrunners D. B. Weiss and David Benioff created the '
                   'series, wrote most of its episodes and directed several.\n'
                   'In Ja'},
    {   'answer': 'books',
        'context': 'ints.  First, the language had to match the uses already '
                   'put down in the books. Secondly, it had to be easily '
                   'pronounceable or learnable by the actors'}]

We can view more details, including the score of each answer, by specifying details="all".

print_answers(prediction, details="all")

Which returns:

Query: Who created the Dothraki vocabulary?
Answers:
[   <Answer {'answer': 'David J. Peterson', 'type': 'extractive', 'score': 0.9532108306884766, 'context': "orld. The language was developed for the TV series by the linguist David J. Peterson, working off the Dothraki words and phrases in Martin's novels.\n,", 'offsets_in_document': [{'start': 329, 'end': 346}], 'offsets_in_context': [{'start': 67, 'end': 84}], 'document_id': '308dca876f94e5e839187f1463693015', 'meta': {'name': '214_Dothraki_language.txt'}}>,
    <Answer {'answer': 'David J. Peterson', 'type': 'extractive', 'score': 0.8807850480079651, 'context': '\n===Valyrian===\nDavid J. Peterson, who created the Dothraki language for the first season of the show, was entrusted by the producers to design a new ', 'offsets_in_document': [{'start': 16, 'end': 33}], 'offsets_in_context': [{'start': 16, 'end': 33}], 'document_id': 'b368200c210d555625bd409b0dc27be1', 'meta': {'name': '87_Valar_Dohaeris.txt'}}>,
    <Answer {'answer': 'David J. Peterson', 'type': 'extractive', 'score': 0.8687494099140167, 'context': "age for ''Game of Thrones''\nThe Dothraki vocabulary was created by David J. Peterson well in advance of the adaptation. HBO hired the Language Creatio", 'offsets_in_document': [{'start': 139, 'end': 156}], 'offsets_in_context': [{'start': 67, 'end': 84}], 'document_id': '27baa56e5aab6b04d38f19e97e078bc6', 'meta': {'name': '214_Dothraki_language.txt'}}>,
    <Answer {'answer': 'D. B. Weiss and David Benioff', 'type': 'extractive', 'score': 0.10197015851736069, 'context': '\n===Conception and development===\nShowrunners D. B. Weiss and David Benioff created the series, wrote most of its episodes and directed several.\nIn Ja', 'offsets_in_document': [{'start': 46, 'end': 75}], 'offsets_in_context': [{'start': 46, 'end': 75}], 'document_id': 'd8b7f165cc64c549532b74249cc692dd', 'meta': {'name': '229_Game_of_Thrones.txt'}}>,
    <Answer {'answer': 'books', 'type': 'extractive', 'score': 0.0460672490298748, 'context': 'ints.  First, the language had to match the uses already put down in the books. Secondly, it had to be easily pronounceable or learnable by the actors', 'offsets_in_document': [{'start': 166, 'end': 171}], 'offsets_in_context': [{'start': 73, 'end': 78}], 'document_id': '8767e85c7a9bcec61f95e13bb61f3e98', 'meta': {'name': '214_Dothraki_language.txt'}}>]

Metadata Filtering

The PineconeDocumentStore gives us access to Pinecone's powerful metadata filtering functionality. When filtering with Haystack, we use a slightly different filter syntax from that used by Pinecone.

Using the Game of Thrones dataset we can filter by filename.

prediction = pipe.run(
    query="Who created the Dothraki vocabulary?",
    params={"Retriever": {
        "top_k": 10,
        "filters": {
            "name": {"$eq": "368_Jaime_Lannister.txt"}
        }
    }, "Reader": {"top_k": 5}}
)

Haystack Filter Examples

Here are a few more examples of Haystack filtering syntax.

filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
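
As in the filename example above, these filters would then be passed to the retriever through the params argument of pipe.run. A minimal sketch, assuming our documents were indexed with these metadata fields (the query string is only illustrative):

prediction = pipe.run(
    query="Which articles cover the economy?",
    params={
        "Retriever": {"top_k": 10, "filters": filters},
        "Reader": {"top_k": 5}
    }
)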