Note: Pinecone operated under the name HyperCube.ai until the end of 2020. References to the HyperCube system in the PDF version of the paper refer to Pinecone.
From the lab to production: A case study of session-based recommendations in the home-improvement domain
- Pigi Kouki firstname.lastname@example.org RelationalAI
- Ilias Fountalis email@example.com RelationalAI
- Nikolaos Vasiloglou firstname.lastname@example.org RelationalAI
- Xiquan Cui email@example.com The Home Depot
- Edo Liberty firstname.lastname@example.org Pinecone Systems Inc.
- Khalifeh Al Jadda email@example.com The Home Depot
E-commerce applications rely heavily on session-based recommendation algorithms to improve the shopping experience of their customers. Recent progress in session-based recommendation algorithms shows great promise. However, translating that promise to real-world outcomes is a challenging task for several reasons, but mostly due to the large number and varying characteristics of the available models.
In this paper, we discuss the approach and lessons learned from the process of identifying and deploying a successful session-based recommendation algorithm for a leading e-commerce application in the home-improvement domain. To this end, we initially evaluate fourteen session-based recommendation algorithms in an offline setting using eight different popular evaluation metrics on three datasets.
The results indicate that offline evaluation does not provide enough insight to make an informed decision, since no method wins clearly on all metrics. Additionally, we observe that standard offline evaluation metrics fall short for this application. Specifically, they reward an algorithm only when it predicts the exact item that the user clicked next or eventually purchased. In a practical scenario, however, near-identical products that are assigned different identifiers should be considered equally good recommendations. To overcome these limitations, we perform an additional round of evaluation, in which human experts provide both objective and subjective feedback on the recommendations of the five algorithms that performed best in the offline evaluation.
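The limitation of exact-match metrics described above can be illustrated with a small sketch. It compares a strict hit rate at k, which credits only the exact target identifier, with a relaxed variant that also credits any item in the same near-identical group. The function name `hit_rate_at_k`, the `group_of` mapping, and the toy product identifiers are assumptions made for illustration; the paper does not specify how near-identical products were grouped.

```python
from typing import Dict, List, Optional, Sequence


def hit_rate_at_k(
    recommendations: Sequence[List[str]],
    targets: Sequence[str],
    k: int,
    group_of: Optional[Dict[str, str]] = None,
) -> float:
    """Fraction of sessions whose target appears in the top-k recommendations.

    If group_of is given (hypothetical mapping: item id -> near-identical
    group id), a recommendation counts as a hit when it falls in the same
    group as the target. Items missing from the mapping form their own group.
    """
    mapping = group_of or {}
    hits = 0
    for recs, target in zip(recommendations, targets):
        top_k_groups = {mapping.get(item, item) for item in recs[:k]}
        if mapping.get(target, target) in top_k_groups:
            hits += 1
    return hits / len(targets)


# Toy data (invented for illustration): three sessions, top-3 lists each.
recs = [
    ["drill_A_red", "saw_1", "hammer_2"],
    ["paint_white_1gal", "brush_3", "tape_9"],
    ["ladder_6ft", "glove_M", "bulb_60w"],
]
targets = ["drill_A_blue", "brush_3", "bucket_5"]

# The red and blue drills are the same product under different identifiers.
group_of = {"drill_A_red": "drill_A", "drill_A_blue": "drill_A"}

strict = hit_rate_at_k(recs, targets, k=3)
relaxed = hit_rate_at_k(recs, targets, k=3, group_of=group_of)
print(f"strict={strict:.3f} relaxed={relaxed:.3f}")
```

In this toy case the first session is a miss under strict matching but a hit once the two drill variants are treated as equivalent, which is the kind of gap between offline metrics and perceived relevance that the expert evaluation was meant to surface.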
We find that the experts’ opinions often differ from the offline evaluation results. Analysis of the feedback confirms that the performance of all models is significantly higher when near-identical product recommendations are counted as relevant.
Finally, we run an A/B test with one of the models that performed the best in the human evaluation phase. The treatment model increased conversion rate by 15.6% and revenue per visit by 18.5% when compared with a leading third-party solution.
Because serving recommendations for millions of items to millions of customers in real time is critical to the success of a model, we leveraged Pinecone, a commercially available real-time ranking platform that can efficiently handle both the deep learning model transformations and vector search.
Pinecone allows the lookup of product embeddings and computation of the session embedding on the fly, as well as a speedy tri-linear product computation against millions of products in the catalog. STAMP was decomposed into a query transformer and an item transformer, both of which produced vectors whose dot product was the recommendation score. Shopping items were indexed by the platform by applying the item transformer offline and storing the resulting vectors in replicated nearest-neighbor retrieval servers. Query transformers were applied in real time to shopping carts as well as other session information. The service then used the query vector to retrieve the highest-scoring items. For 99% of customers, STAMP model inference completed in under 60 ms (i.e., p99 < 60 ms).
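The two-sided decomposition described above (item vectors computed offline, a query vector computed per session, and a dot product as the score) can be sketched as follows. This is a minimal illustration under stated assumptions, not the STAMP model or the Pinecone API: the `item_transformer` here only normalizes a raw embedding, the `query_transformer` simply averages the session's embeddings, retrieval is a brute-force scan rather than a nearest-neighbor index, and excluding items already in the session is an added choice. All names and the toy embeddings are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product, used as the recommendation score."""
    return sum(a * b for a, b in zip(u, v))


def item_transformer(raw: Sequence[float]) -> Vector:
    """Hypothetical offline transform: L2-normalize the raw item embedding."""
    norm = sum(x * x for x in raw) ** 0.5
    return [x / norm for x in raw] if norm else list(raw)


def query_transformer(session: Sequence[str], raw: Dict[str, Vector]) -> Vector:
    """Hypothetical real-time transform: mean of the session items' embeddings."""
    vectors = [raw[item] for item in session]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def recommend(
    session: Sequence[str],
    raw: Dict[str, Vector],
    index: Dict[str, Vector],
    k: int,
) -> List[str]:
    """Score every indexed item against the session vector; return the top k."""
    query = query_transformer(session, raw)
    seen = set(session)  # assumption: do not recommend items already in the session
    scored = ((dot(query, vec), item) for item, vec in index.items() if item not in seen)
    return [item for _, item in heapq.nlargest(k, scored)]


# Toy raw embeddings (invented for illustration).
raw_embeddings = {
    "drill": [1.0, 0.0],
    "drill_bits": [0.9, 0.1],
    "paint": [0.0, 1.0],
    "brush": [0.1, 0.9],
}

# Offline step: apply the item transformer once and store the results.
index = {item: item_transformer(vec) for item, vec in raw_embeddings.items()}

# Online step: compute the session vector and retrieve the best-scoring items.
print(recommend(["drill"], raw_embeddings, index, k=2))
```

The point of the split is that the expensive per-item work happens once, offline, so the online path reduces to one query transformation plus a maximum-inner-product search, which a production system would serve from an approximate nearest-neighbor index instead of the linear scan used here.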