Time Series Similarity Search

A time series, a sequence of values ordered by time, is one of the fundamental forms of data. Consequently, there is a wealth of time-series analysis methods and tools, ranging from forecasting to anomaly detection.

Here we demonstrate how to perform time-series "pattern" matching with a similarity search service, where the goal is to retrieve all historical time series that match a particular pattern. Such matching is a core ingredient of time-series applications such as clustering, labeling, and recommendation. For example, consider a time series describing web page visitors and the need to retrieve all historical peak surges, drops, or trends.

We will walk you through a simple approach that uses the time series' raw data as-is. In other words, it requires no modeling heavy lifting. Such an approach is very appealing because it demands neither domain-specific expertise nor extra resources for building models. Sounds too good to be true?

Our demo indicates that this simple approach provides satisfying results. We will show you how to index and search a set of daily stock-price time series. Then we will compare the simple approach with an alternative that utilizes a comprehensive time-series library recently published by Facebook AI.

What we'll cover:

  • Prerequisites
  • Simple Time-Series Embeddings

    • Prepare data
    • Index
    • Search
  • Facebook's Kats Time-Series Embeddings

    • Index
    • Search
  • Conclusion


Prerequisites

Install and import the relevant Python packages.

!pip install -qU convertdate kats kaggle matplotlib==3.1.3
!pip install -qU pinecone-client
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pprint
from sklearn.preprocessing import MinMaxScaler
from kats.consts import TimeSeriesData
from kats.tsfeatures.tsfeatures import TsFeatures

import warnings
warnings.simplefilter(action='ignore')

Simple Time-Series Embeddings

Upload Time Series to Pinecone's Similarity Search Service

The steps below show how to set up Pinecone's similarity search service and upload the time series into the service's index data structure. Pinecone stores and searches vector embeddings. These embeddings, or feature vectors, are a numerical representation of the raw data's semantics.

We want to create two indexes:

  • An index that contains vectors representing the raw data of historical prices of different stocks. In other words, vector embedding is simply the time-series sequence of numbers.
  • An index that stores feature embeddings calculated using Facebook's Kats toolkit. Kats is a powerful time-series analysis tool that includes a time-series embedding functionality.
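Both indexes use the cosine metric, which compares the direction of two vectors rather than their magnitude. As a quick refresher, here is a minimal NumPy sketch of cosine similarity (illustrative only, not part of the notebook):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their Euclidean norms; 1.0 means identical direction.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two price sequences with the same shape but different magnitudes
# still score 1.0, which is what we want for pattern matching.
print(cosine_similarity([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

This magnitude-invariance is one reason cosine is a sensible choice for matching price patterns across stocks that trade at very different price levels.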

Configure Pinecone

Let's start by configuring the Pinecone service.

Pinecone Setup

import pinecone

# Load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key)
pinecone.list_indexes()

Get your API key and try this example yourself!

Create a New Service

Let's start with the simple approach and create an index.

# Pick a name for the new index
simple_index_name = 'stocks-trends'
# Check whether the index with the same name already exists
if simple_index_name in pinecone.list_indexes():
    pinecone.delete_index(simple_index_name)

# Create a new index
pinecone.create_index(name=simple_index_name, metric='cosine', shards=1)

# Establish a connection
simple_index = pinecone.Index(name = simple_index_name, response_timeout=300)

Prepare data

We start with the simple embedding approach described earlier, in which we represent a time series directly as the vector of its sequence of numbers.

Throughout the demo, we use a Stock Market Dataset. This dataset contains historical daily prices for all tickers trading on NASDAQ, up to April 2020. The dataset is hosted on Kaggle and requires either a manual download or the Kaggle API.

The data processing (i.e., the ETL part) is the heavy lifting here and includes:

  • Downloading the data from Kaggle. (Recall, you will need a Kaggle API key.)
  • Defining the time series raw data.
  • Extracting the time series from the relevant files.
  • Transforming the raw data into vectors and uploading the vectors into Pinecone's service.

Download Kaggle Stock Market Dataset

%%writefile kaggle.json
{"username":"KAGGLE_USERNAME","key":"KAGGLE_KEY"}

#Check Kaggle username and key
! cat ./kaggle.json

!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d jacksoncrow/stock-market-dataset
!unzip -q stock-market-dataset.zip -d data

Set Up Time Series Hyperparameters

We set two hyperparameters defining how we extract the time series:

  • sliding window, which controls the length of the time series in consecutive day periods.
  • step size, which defines a gap in the start dates of two consecutive vectors.

Feel free to set the window or step size to a different value.

# Define sliding window and step size
SLIDING_WINDOW_SIZE = 64
STEP = 10
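To get a feel for these values: with a 64-day window and a step of 10, a one-year daily series yields about 30 overlapping windows. A quick back-of-the-envelope check (the series length here is hypothetical, just for illustration):

```python
import numpy as np

SLIDING_WINDOW_SIZE = 64
STEP = 10

series_length = 365  # hypothetical one-year daily series
# Start offsets of all windows that fit entirely inside the series
starts = np.arange(0, series_length - SLIDING_WINDOW_SIZE + 1, STEP)
print(len(starts))   # 31 full windows
print(starts[:4])    # first starts: 0, 10, 20, 30
```

A smaller step produces more (and more overlapping) windows, which improves pattern coverage at the cost of a larger index.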

Define Extract-Transform-Load Functions

Before we do all of the other steps, we will define utility functions to help us extract, transform, and upload the time series data.

Note that we will:

  • Work only with the stock prices and disregard the ETF data folder.
  • Load data for a set of symbols from the stock folder for the simplicity of the example.
  • When creating vectors, we will include daily Open and Close prices. This way, our vectors will be double the size of a sliding window. Feel free to try different prices when creating vectors.

def windows(data, window_size, step):
    r = np.arange(len(data))
    s = r[::step]
    z = list(zip(s, s + window_size))
    f = '{0[0]}:{0[1]}'.format
    g = lambda t: data.iloc[t[0]:t[1]]
    return pd.concat(map(g, z), keys=map(f, z))

def get_feature_embedding_for_window(df, stock):
    # replace() removes the extension; str.strip would treat '.csv' as a character set
    ts_name = f"{stock.replace('.csv', '')}_{str(df.Date.min())}_{str(df.Date.max())}"
    scaler=MinMaxScaler()
    df[['Open', 'Close']] = scaler.fit_transform(df[['Open', 'Close']])
    prices = df[['Open', 'Close']].values.tolist()
    flat_values = [item for sublist in prices for item in sublist]
    df = df.rename(columns={"Date":"time"}) 
    ts_df = pd.DataFrame({'time':df.time.repeat(2), 
                          'price':flat_values})
    ts_df.drop_duplicates(keep='first', inplace=True)  

    # Use Kats to extract features for the time window
    try:
      if not (len(np.unique(ts_df.price.tolist())) == 1 \
         or len(np.unique(ts_df.price.tolist())) == 0):
          timeseries = TimeSeriesData(ts_df)
          features = TsFeatures().transform(timeseries)
          feature_list = [v if not pd.isnull(v) else 0 for _, v in features.items()]
          return (ts_name, np.array(feature_list))
    except np.linalg.LinAlgError as e:
        print(f"Can't process {ts_name}:{e}")
    return None

def get_simple_pair_for_window(df, stock):
    ts_name = f"{stock.replace('.csv', '')}_{str(df.Date.min())}_{str(df.Date.max())}"
    prices = df[['Open', 'Close']].values.tolist()
    flat_values = [item for sublist in prices for item in sublist]
    return (ts_name, np.array(flat_values))
def upload_data_to_index(index, create_pair_func, verbose=False):
    # Define path to the folder
    stocks = sorted(os.listdir('./data/stocks'))
    
    # Iterate over files, create vectors and upload data
    for stock in stocks[::50]:
        print(stock.replace('.csv', ''))
        data = pd.read_csv(os.path.join('./data/stocks', stock))
        data = data.sort_index(axis=0, ascending=True)
        data["Date"] = pd.to_datetime(data["Date"]).dt.date

        # Interpolate data for missing dates
        data.set_index('Date', inplace=True)
        data = data.reindex(pd.date_range(start=data.index.min(),
                                          end=data.index.max(),
                                          freq='1D'))
        data = data.interpolate(method='linear')
        data = data.reset_index().rename(columns={'index': 'Date'})
        data["Date"] = pd.to_datetime(data["Date"]).dt.date
        
        # Create sliding windows dataset
        wdf = windows(data, SLIDING_WINDOW_SIZE, STEP)
        
        # Prepare sequences for upload 
        items_to_upload = []
        for window, new_df in wdf.groupby(level=0):
            if new_df.shape[0] == SLIDING_WINDOW_SIZE:
                pair = create_pair_func(new_df, stock)
                if pair is not None:
                    items_to_upload.append(pair)
               
        # Upload data for the symbol
        acks = index.upsert(items=items_to_upload, 
                            batch_size=2000)
        if verbose: print(acks[-2:])
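To see how the windowing works in isolation, here is a self-contained toy run that mirrors the logic of the windows helper above, on a tiny synthetic frame with a window of 3 and a step of 2 for readability:

```python
import numpy as np
import pandas as pd

def windows(data, window_size, step):
    # Mirrors the helper above: window start offsets every `step` rows,
    # each slice keyed "start:end" in the resulting MultiIndex.
    starts = np.arange(len(data))[::step]
    spans = [(s, s + window_size) for s in starts]
    return pd.concat([data.iloc[a:b] for a, b in spans],
                     keys=[f'{a}:{b}' for a, b in spans])

toy = pd.DataFrame({'price': range(7)})
wdf = windows(toy, window_size=3, step=2)
for key, chunk in wdf.groupby(level=0):
    print(key, chunk['price'].tolist())
# 0:3 [0, 1, 2]
# 2:5 [2, 3, 4]
# 4:7 [4, 5, 6]
# 6:9 [6]
```

Note the last window is incomplete; this is exactly why the upload loop checks `new_df.shape[0] == SLIDING_WINDOW_SIZE` before creating a vector.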

Index

Let's upsert data into the simple index.

upload_data_to_index(simple_index, get_simple_pair_for_window)

# Check the index size
simple_index.info()
InfoResult(index_size=61212)

Search

Now that we have uploaded the items into the vector index, it is time to check the similarities between vectors.

In this section, we will:

  • Define stocks and their windows for the query.
  • Fetch these query items from the index to retrieve their vectors.
  • Query the index using these vectors. Pinecone will return top K most similar vectors for each query item.
  • Show the results.

Below we define utility functions for data preparation and display.

def prepare_items_for_graph(ids, query_item, query_vec):    
    scaler = MinMaxScaler()
    result_list = []

    if not query_item in ids:
        ids.append(query_item)

    fetch_res = simple_index.fetch(ids=ids, disable_progress_bar=True)
    for res in fetch_res:
        if len(res.vector) > 0:
            vec = res.vector
            scaled_vec = scaler.fit_transform(res.vector.reshape(-1,1))
            result_list.append((res.id, (vec, scaled_vec)))
    return result_list
def show_query_results(query_item, query_vec, data):
    data_prepared = prepare_items_for_graph(data.id.tolist(), query_item, query_vec)
    graph_index = pd.Float64Index(np.arange(start=0, stop=SLIDING_WINDOW_SIZE, step=0.5))

    print('\n The most similar items from the vector index:')
    data.reset_index(inplace=True, drop=True)
    display(data)
      
    fig = plt.figure(figsize=(20,7))
    for item in data_prepared:
        _id, vectors = item
        ax1 = plt.subplot(1, 2, 1)
        graph = plt.plot(graph_index, vectors[0], label = _id, marker='o' if _id == query_item else None)
        ax2 = plt.subplot(1, 2, 2)
        graph = plt.plot(graph_index, vectors[1], label = _id, marker='o' if _id == query_item else None)    
    ax1.set_xlabel("Days in time window")
    ax2.set_xlabel("Days in time window")
    ax1.set_ylabel("Stock values")
    ax2.set_ylabel("Normalized Stock Values")
    ax1.title.set_text(f'Similar stock patterns and their market values')
    ax2.title.set_text(f'Similar stock patterns and their normalized market values')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()

Note that we will filter the retrieved results to make sure we present a diverse set of stocks. Otherwise, we might get consecutive time windows for the same stock.

def filter_results(query_item, data, historical_only=False):
    already_present = []
    
    # Remove symbol that is already included
    for i, row in data.iterrows():
        check_name = row.id.split('_')[0]
        if check_name not in already_present:
            already_present.append(check_name)
        else:
            data.drop(i,axis=0,inplace=True)
            
    # Include only data prior to query interval
    if historical_only:
        _, start_dt, end_dt = query_item.split('_')
        start_dt = pd.to_datetime(start_dt).date()
        data['final_date'] = data.id.apply(lambda x: x.split('_')[2])
        data['final_date'] =  data.final_date.apply(lambda x: pd.to_datetime(x).date())
        data = data[data.final_date <= start_dt]
        del data['final_date']
       
    return data
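The symbol-level deduplication above can also be expressed with pandas' drop_duplicates. A quick sketch on made-up ids (assuming, as in the query results, that rows are already sorted by score):

```python
import pandas as pd

res_df = pd.DataFrame({
    'id': ['AAA_2019-01-01_2019-03-05',
           'AAA_2019-01-11_2019-03-15',   # same symbol, overlapping window
           'BBB_2018-06-01_2018-08-03'],
    'score': [0.999, 0.998, 0.997],
})

# Keep only the highest-ranked window per symbol (results are already
# sorted by score, so keep='first' retains the best match per symbol).
res_df['symbol'] = res_df.id.str.split('_').str[0]
deduped = res_df.drop_duplicates(subset='symbol', keep='first').drop(columns='symbol')
print(deduped.id.tolist())
# ['AAA_2019-01-01_2019-03-05', 'BBB_2018-06-01_2018-08-03']
```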

Numerical Examples

Let's examine a few interesting price patterns and their corresponding best matches.

Here we define the query items, fetch them and prepare vectors for the query.

# Define query examples
items_to_query = ['BORR_2019-10-18_2019-12-20', 'HCCO_2020-01-28_2020-03-31', 'PUMP_2019-11-22_2020-01-24']

# Fetch vectors from the index
fetch_res = simple_index.fetch(ids=items_to_query)

# Create a list of vectors for the fetched items
query_vectors = [res.vector for res in fetch_res if len(res.vector) > 0]

The next step is to perform the query for the query vectors.

# Query the pinecone index
query_result = simple_index.query(queries=query_vectors, 
                                  top_k=100,
                                  disable_progress_bar=True)

Finally, iterate over the results, get all vectors needed for the graphs and display them.

Note that graphs on the left show the absolute price values for the query items selected, while graphs on the right show each vector on a 0-1 scale. The price normalization ignores the magnitude of stock prices and thus focuses on the time series pattern only.
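The 0-1 rescaling on the right-hand graphs is plain min-max normalization. In NumPy terms (equivalent to the default MinMaxScaler behavior used in the utility functions):

```python
import numpy as np

def min_max(values):
    # Rescale to [0, 1]: subtract the minimum, divide by the range.
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# Two series with the same shape at very different price levels
# collapse onto the same normalized curve.
a = min_max([10, 12, 11, 15])
b = min_max([100, 120, 110, 150])
print(np.allclose(a, b))  # True
```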

Similar trends are more likely to appear within the same time interval. The historical_only flag lets you choose whether to consider only time intervals prior to the query interval, or any interval in the index.

# Iterate and show query results for each query item
for query_item, query_vector, res in zip(items_to_query, query_vectors, query_result):
    print(f'\nQueried: {query_item}')
    res_df = pd.DataFrame({'id':res.ids, 
                           'score':res.scores,
                           })
    res_df = filter_results(query_item, res_df, historical_only=False)
    show_query_results(query_item, query_vector, res_df.head(6))
Queried: BORR_2019-10-18_2019-12-20

 The most similar items from the vector index:
id score
0 BORR_2019-10-18_2019-12-20 1.000000
1 OSK_1990-11-25_1991-01-27 0.999467
2 BH_2009-10-21_2009-12-23 0.999385
3 MLAB_2007-03-30_2007-06-01 0.999307
4 SORL_2008-03-19_2008-05-21 0.999293
5 PKBK_2012-12-02_2013-02-03 0.999265

Example of vector similarity search for time series data


Queried: HCCO_2020-01-28_2020-03-31

 The most similar items from the vector index:
id score
0 HCCO_2020-01-28_2020-03-31 1.000000
1 TEI_2016-09-22_2016-11-24 0.999958
2 BSD_2002-01-31_2002-04-04 0.999945
3 AMCI_2016-10-04_2016-12-06 0.999942
4 HPI_2012-01-29_2012-04-01 0.999939
5 GENC_2012-08-16_2012-10-18 0.999930

Example of vector similarity search for time series data

Queried: PUMP_2019-11-22_2020-01-24

 The most similar items from the vector index:
id score
0 PUMP_2019-11-22_2020-01-24 1.000000
1 AEL_2003-12-14_2004-02-15 0.999547
2 THS_2006-10-21_2006-12-23 0.999510
3 RUBY_2008-07-05_2008-09-06 0.999508
4 BH_2017-11-18_2018-01-20 0.999466
5 LMNR_2010-11-19_2011-01-21 0.999459

Example of vector similarity search for time series data

Notice that we found look-alike patterns across different stocks and different time windows.

Facebook's Kats Time-Series Embeddings

It is time to test another approach. This time we create feature embeddings and upload them using the same stocks and windows as in our previous index. Here we utilize Facebook's Kats toolkit. Kats is a powerful time-series analysis tool that includes a time-series embedding functionality.

Let's create a new index first.

Create a New Pinecone Service

# Pick a name for the new index
kats_index_name = 'stocks-trends-with-features'

# Check whether the index with the same name already exists
if kats_index_name in pinecone.list_indexes():
    pinecone.delete_index(kats_index_name)

# Create a new index
pinecone.create_index(name=kats_index_name, metric='cosine', shards=1)

# Establish a connection
kats_index = pinecone.Index(name = kats_index_name, response_timeout=300)

Index

We will use Kats and its time-series feature extraction module to create feature embeddings for each stock and corresponding time window. These features cover several types: seasonality, autocorrelation, modeling parameters, changepoints, moving statistics, and raw statistics of the time-series array. We used Kats' default feature set for our example, which produces 40-dimensional feature embeddings.
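To build intuition for what such a feature embedding contains, here is a hand-rolled miniature version. This is not Kats itself, just a few summary statistics of the same flavor (level, spread, trend, short-term memory):

```python
import numpy as np

def toy_features(prices):
    # A tiny stand-in for a feature embedding: summary statistics
    # describing level, spread, linear trend, and lag-1 autocorrelation.
    p = np.asarray(prices, dtype=float)
    slope = np.polyfit(np.arange(len(p)), p, 1)[0]  # linear trend
    lag1 = np.corrcoef(p[:-1], p[1:])[0, 1]         # lag-1 autocorrelation
    return np.array([p.mean(), p.std(), slope, lag1])

# A steadily rising series: positive slope, lag-1 autocorrelation near 1
print(toy_features([1, 2, 3, 4, 5, 6]))
```

Kats computes a much richer set (40 features by default), but the principle is the same: the embedding summarizes the shape of the window rather than storing its raw values.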

upload_data_to_index(kats_index, get_feature_embedding_for_window)

kats_index.info()
InfoResult(index_size=61114)

Note that the Kats-based index has fewer vectors than the simple embeddings index. This happens because Kats fails to compute some features for certain patterns, for example when a time series holds a constant value throughout a time window.

Search

We will use the same query items that we used to query the simple index.

# Fetch vectors from the index
fetch_res = kats_index.fetch(ids=items_to_query)

# Create a list of vectors for the fetched items
query_vectors = [res.vector for res in fetch_res if len(res.vector) > 0]

# Query the pinecone index
query_result = kats_index.query(queries=query_vectors, 
                                top_k=100,
                                disable_progress_bar=True)

# Use simple index to retrieve historical prices for query items
simple_index = pinecone.Index(name = simple_index_name, response_timeout=300)

# Fetch vectors from the index
fetch_res = simple_index.fetch(ids=items_to_query)

# Create a list of vectors for the fetched items
time_series_vectors = [res.vector for res in fetch_res if len(res.vector) > 0]

# Iterate and show query results for each query item
for query_item, query_vector, res in zip(items_to_query, time_series_vectors, query_result):
    print(f'\nQueried: {query_item}')
    res_df = pd.DataFrame({'id':res.ids, 
                           'score':res.scores,
                           })
    res_df = filter_results(query_item, res_df, historical_only=False)
    show_query_results(query_item, query_vector, res_df.head(6))
Queried: BORR_2019-10-18_2019-12-20

 The most similar items from the vector index:
id score
0 BORR_2019-10-18_2019-12-20 1.000000
1 NEON_2004-08-08_2004-10-10 0.999648
2 AGTC_2019-10-27_2019-12-29 0.999632
3 DUO_2014-07-14_2014-09-15 0.999611
4 MLAB_2016-04-11_2016-06-13 0.999604
5 GENC_2004-01-11_2004-03-14 0.999585

Example of vector similarity search for time series data

Queried: HCCO_2020-01-28_2020-03-31

 The most similar items from the vector index:
id score
0 HCCO_2020-01-28_2020-03-31 1.000000
1 TEI_2002-04-29_2002-07-01 0.999724
2 BY_2018-06-25_2018-08-27 0.999667
3 OSK_2016-12-18_2017-02-19 0.999616
4 PKBK_2019-10-27_2019-12-29 0.999610
5 RGR_2012-02-08_2012-04-11 0.999577

Example of vector similarity search for time series data

Queried: PUMP_2019-11-22_2020-01-24

 The most similar items from the vector index:
id score
0 PUMP_2019-11-22_2020-01-24 1.000000
1 RGR_2014-07-07_2014-09-08 0.999932
2 TVIX_2015-09-15_2015-11-17 0.999832
3 SCVL_2004-03-18_2004-05-20 0.999734
4 STNE_2018-12-14_2019-02-15 0.999730
5 NIB_2011-02-20_2011-04-24 0.999683

Example of vector similarity search for time series data

Conclusion

Pattern matching of time series data is an important task affecting time series clustering, labeling, classification, and recommendation.

We used similarity search to find the most similar patterns in stock data. We tried two different approaches for creating vector representations of time series. First, we used the raw data of historical prices; then we represented each time series as a set of statistical features. In both cases, we retrieved the top 100 best matches from the Pinecone similarity search service, then further filtered the results and showed only the top 5 most similar stock trends. (We did that to ensure a diverse set of stocks; otherwise, we might get consecutive time windows for the same stock.)

The simple approach turned out to give good results. When using Kats' time-series features, we got somewhat mixed results. We noticed that the most similar feature embeddings sometimes retrieve reversed patterns.
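This reverse-pattern behavior is easy to reason about: many statistical features (mean, variance, and similar summaries) are insensitive to the ordering of values, so a series and its time-reversed mirror can land very close together in feature space. A minimal NumPy illustration on a made-up series:

```python
import numpy as np

rising = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
falling = rising[::-1]  # the same values, reversed in time

# Order-insensitive statistics are identical for both series...
print(rising.mean() == falling.mean())  # True
print(rising.std() == falling.std())    # True

# ...so a feature vector built mostly from such statistics cannot
# tell an uptrend from its mirrored downtrend.
```

Raw-value embeddings do not have this problem, since the vector itself encodes the ordering of the prices.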

Still, the literature offers plenty of advanced time-series representation techniques: sequential deep neural networks such as RNNs and LSTMs, frequency-domain representations combined with convolutional neural networks, and even deep graph embeddings. We encourage you to explore this fascinating domain. Feel free to try these techniques along with the Pinecone service, and share your findings with us!

Delete the indexes

Delete the indexes once you are sure you no longer need them. Once an index is deleted, it cannot be used again.

pinecone.delete_index(simple_index_name)
pinecone.delete_index(kats_index_name)