MESA - Music Embeddings Search Architecture


Music Embeddings Search Architecture (MESA; see A folk tale of naming in AmazonMusic if you're curious about the name) is a platform for hosting music embeddings for low-latency, high-recall retrieval. In other words, MESA is a back-end software system that allows customers to search and discover music quickly. By integrating with the MESA API, Amazon Music engineers can build powerful music exploration experiences across any slice of the music corpus, with deep understanding, personalized for any active customer. In the future, the MESA platform will allow Amazon Music AI/ML data scientists and engineers to quickly experiment with and launch new models.

This document explains what embeddings are and how they work, and then goes into how we use them in practice. The Q&A session that followed the document reading is recorded here: https://broadcast.amazon.com/videos/595855

Amazon Music Search is a team of ~150 software engineers, product managers, managers, data scientists, and more, based mainly in the San Francisco Bay Area and Bangalore. Nelson Burton (@neburton) is the technical leader of this project and the author of this document. The step-by-step process of building this system is captured in P13n Embeddings Infrastructure- Table of Contents - Start Here, which links to a number of product requirements, performance tests, proof-of-concept validations, design docs, and more.

If you already know what embeddings and embedding nearest neighbors are, you can skip down to Amazon Music Embeddings, which has information specific to what we've built in Music Search.

Background on Embeddings

What are embeddings?

"Embeddings" are a way to use a number to represent something.  For words, images, songs, etc., we can make up a number (or several, using a vector) to represent each item. The Machine Learning approach to create embeddings, is called Representation Learning, and literally means how do we learn how to represent an item with a number. Representing something with a number, is more formally called embedding an item in vector space. Abstractly, the distances between numbers within vector space, give rise to contextual meaning based on how the vector space is defined. More concretely, different numbers mean different things, based on what the numbers are describing. 

1-Dimensional Example: Fruit preference

For example, let's say we represent fruit on a 1-10 scale of how much I (Nelson) like them: Apples are 4.0, Nectarines are 8.1, Bananas are 6.5, and Mangosteens are 9.8. Those would be my fruit embeddings, living in 1-dimensional vector space (1-dimensional here refers to only using one number per fruit; the vector space is the range of numbers I could rate fruit, in this case 1-10). Each fruit embedding is composed of an identifier (the name of the fruit) and a vector (in this case, a vector of length 1 whose value is my rating). For my fruit preferences, Apples and Mangosteens are the most different, as the distance between their numbers (9.8 - 4.0) = 5.8 is the largest.

2-Dimensional Example: Animal size

To take another example, let's say we have dogs and cats. We measure all the dogs and cats we encounter, noting their length and weight. We create embeddings for these animals by embedding them in 2-dimensional vector space, using 2 numbers to represent each of them. The 2 dimensions of our animal vectors are length and weight. Our animal embeddings are below:

| Animal | Length (inches) | Weight (lbs) |
|---|---|---|
| French Bull Dog | 15 | 16 |
| Husky | 35 | 38 |
| Labrador | 36 | 30 |
| Siamese Cat | 16 | 11 |
| Tabby Cat | 13 | 9 |
Screen Shot 2022-08-23 at 2.35.18 PM.png




Here is the same animal data from the table above, graphed in 2-dimensional vector space. "Vector" here means a value, e.g. a Husky has a vector of (35, 38).

Note that for embedding vector spaces, the embedding is not equal to the entity being embedded; clearly the numbers 15 and 16 are not equivalent to an actual dog. Additionally, in this example we've chosen real physical attributes, but just like the first example, these numbers only mean something in the vector space in which they're created. In the animal case, it's a physical 2-dimensional vector space. In the fruit case above, it's a 1-dimensional Nelson-preference vector space.

300-Dimensional Example: word2vec

A famous embeddings model for words, created by engineers at Google and learned from analyzing a huge amount of written text, is called word2vec. word2vec is a 300-dimensional model: it generates embeddings representing words as vectors of 300 numbers each, e.g. Paris = [0.5, 1.2, -0.2, ... 297 more numbers]. word2vec was based on the idea that the meaning of a word can be learned from the company it keeps, so if 2 words frequently appear beside each other in writing, they may be nearby in vector space.

A neat property of word2vec is that although an individual vector for a word might not mean anything on its own, in aggregate the vectors mean something. For example, taking the vectors for the following places and putting them into a formula, researchers found France - Paris ~= Germany - Berlin, meaning the model learned some notion of a capital city. Similarly, Queen - King ~= Woman - Man.
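To make the arithmetic concrete, here is a tiny sketch using made-up 4-dimensional toy vectors (not real word2vec output, which is 300-dimensional), showing that king - man + woman lands near queen under cosine similarity:

```python
# Toy illustration of word-vector arithmetic (NOT real word2vec output).
# The 4-dimensional vectors below are made up; real word2vec vectors have 300 dimensions.
import numpy as np

vectors = {
    "king":  np.array([0.8, 0.9, 0.1, 0.3]),
    "queen": np.array([0.8, 0.9, 0.9, 0.3]),
    "man":   np.array([0.2, 0.1, 0.1, 0.4]),
    "woman": np.array([0.2, 0.1, 0.9, 0.4]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Queen - King ~= Woman - Man" rearranged: king - man + woman should land near queen.
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(cosine(analogy, vectors["queen"]))   # close to 1.0 for these toy vectors
```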

Even larger dimensions

From my own observation of ML papers I've read, most embedding models use dimensions in the range of 100-5000. With too many dimensions you risk over-fitting, essentially memorizing your input data; with too few dimensions, your embeddings will not capture enough information to be useful.

Embeddings Nearest Neighbors

A single embedding can be used in relation to other embeddings in vector space by determining which embeddings are the closest (shortest distance) to it, called its "nearest neighbors". For the 2-Dimensional Example: Animal size, let's say we measured an unknown animal and found its length to be 40 inches and its weight to be 40 lbs, so we have UnknownAnimal: (40, 40). We refer to this starting point as our source/basis embedding. From the basis, we could measure the distance between this vector (40, 40) and all the other animal points, and then guess that this (40, 40) animal is more likely to be similar to (or actually be) a Husky or Labrador than any of the other animals, because Husky and Labrador are the nearest neighbors to our new point.
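As a minimal sketch of this calculation, the brute-force lookup for the (40, 40) animal against the table above could look like the following (numpy assumed; the animal data is the example data from this section):

```python
# Brute-force nearest neighbors for the 2-D animal example above.
import numpy as np

animals = {
    "French Bull Dog": (15, 16),
    "Husky":           (35, 38),
    "Labrador":        (36, 30),
    "Siamese Cat":     (16, 11),
    "Tabby Cat":       (13, 9),
}

def k_nearest(query, embeddings, k):
    """Return the k entries whose vectors are closest (Euclidean) to the query."""
    query = np.array(query, dtype=float)
    distances = {
        name: float(np.linalg.norm(np.array(vec, dtype=float) - query))
        for name, vec in embeddings.items()
    }
    return sorted(distances.items(), key=lambda kv: kv[1])[:k]

# The unknown (40, 40) animal's nearest neighbors are Husky and Labrador.
print(k_nearest((40, 40), animals, k=2))
```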

Terminology-wise, KNN (K-Nearest Neighbors) is the abbreviation for this calculation, where K is the number of results you want. For example, "K = 50, do KNN" means find the 50 nearest neighbors to a starting embedding. The term "affinity" is used to express a customer's preference for particular music. For example, I have an affinity for 115 BPM repetitive House music. For an embedding model that contains music and listeners, the tracks I have affinity to might be expressed as the nearest neighbors to the vector that represents me.

To understand how embeddings are used in Amazon Music Search, it's important to establish a foundation of what it means to find nearest neighbors. If it is not quite clicking, you can reference the Visual Explanation of Nearest Neighbors Search.

Nearest Neighbor Distance Functions

There are multiple ways to measure the distance between embeddings, and thereby calculate the nearest neighbors to an embedding. The most obvious is regular "Euclidean distance", which means treating vectors as points in space and finding the straight-line distance between them, as we learned in geometry: √[(x₂ − x₁)² + (y₂ − y₁)²]. We can also calculate angles between vectors (cosine similarity), project vectors onto one another (dot product), or use Manhattan distance (e.g. the number of blocks to walk on a city grid).
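For reference, here are minimal sketches of those four measures (for Euclidean and Manhattan, smaller means closer; for dot product and cosine similarity, larger means closer):

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance: sqrt((x2 - x1)^2 + (y2 - y1)^2 + ...)
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    # Sum of absolute per-dimension differences (city-block distance).
    return float(np.sum(np.abs(a - b)))

def dot_product(a, b):
    # Projection of one vector onto another; sensitive to vector magnitude.
    return float(np.dot(a, b))

def cosine_similarity(a, b):
    # Angle between vectors; magnitude is normalized out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([35.0, 38.0])   # Husky from the earlier example
b = np.array([36.0, 30.0])   # Labrador
print(euclidean_distance(a, b), manhattan_distance(a, b),
      dot_product(a, b), cosine_similarity(a, b))
```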

For a particular embedding model, using different distance functions can surface different nearest neighbors. For example, with Cuttlefish, a model trained on customer listening history, more popular tracks have been listened to more, hence have more training data, which leads the ML model to learn a stronger signal; as a result, the most popular tracks end up with the largest vector magnitudes. In this case, choosing dot product for vector distances in Cuttlefish, as compared to cosine, leads to finding nearest neighbors that are more popular (because cosine divides out the vector magnitude, and thereby normalizes out the effect of popularity). Here's the formal mathematical definition of cosine similarity:
cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)


Playing around with different distance functions can surface different candidates for the same model. MESA engineers should choose the best distance function for a particular ML model and use case, which clients will then use.

Nearest Neighbor Algorithms

Once you've settled on a distance function between any 2 vectors, the brute-force approach to calculating nearest neighbors is to calculate the distance between your basis vector and every other vector, which is O(N). For example, if you have a customer vector and 10 million songs, you do 10 million calculations, sort the results, and find which tracks are nearest. From our performance testing, we found that calculating a dot product between two 50-dimensional vectors, with the data already in memory, on an i3.4xlarge 2.3 GHz processor, takes on average 100 nanoseconds. So if you wanted to find the nearest neighbors for a customer across 10 million tracks on a single processor, the math alone would take 1 second of CPU time. This brute-force O(N) approach, which we refer to as "Exact KNN", is one of the nearest neighbor algorithms used by MESA.
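A minimal, vectorized sketch of this Exact KNN scan is below; the corpus size is scaled down and the vectors are random, purely for illustration:

```python
# Exact KNN sketch: score one customer vector against every track vector,
# then take the top-K by dot product. Sizes here are scaled down for illustration.
import numpy as np

num_tracks, dims, k = 1_000_000, 50, 50          # the real corpus is ~10MM tracks
track_vectors = np.random.rand(num_tracks, dims).astype(np.float32)
customer_vector = np.random.rand(dims).astype(np.float32)

scores = track_vectors @ customer_vector          # one dot product per track, O(N)
top_k = np.argpartition(-scores, k)[:k]           # indices of the k highest scores
top_k = top_k[np.argsort(-scores[top_k])]         # sort those k by score, descending
```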

However, this brute-force approach can be slow, especially with high dimensionality and lots of entities, so researchers have come up with faster algorithms. These algorithms make a quality-for-speed trade-off: the set of nearest neighbors they return (the candidates) is not always exactly equivalent to the "Exact KNN" result, and hence these algorithms are referred to as "Approximate Nearest Neighbors" (ANN). At the time of building, we used the state-of-the-art ANN approach, called Hierarchical Navigable Small Worlds (HNSW), which pre-arranges the vectors by proximity in vector space, thereby decreasing the number of calculations needed. For a deeper dive on nearest neighbor algorithms, including HNSW, see Appendix 3: Approaches to finding nearest neighbors.

ANN vs exact KNN

There is a trade-off between recall and speed when choosing between ANN and exact KNN.

Recall

Recall is a term used in search/information retrieval that refers to the ratio of relevant documents actually returned in the response, out of the total number of relevant documents that would match. For example, if there was a corpus of 50 documents and 5 were actually relevant, but your information retrieval system returned only 3 of those 5, your recall would be 60%. One thing to note is that in search we also give preference to relative ranking (getting the 1st place result correct is more important than getting the 5th place result correct), and therefore score recall using a measurement called NDCG (Normalized Discounted Cumulative Gain), which we calculate as the following:

NDCG score, by descending weight over the top 10 hits. 
Screen Shot 2020-08-17 at 12.52.44 PM.png
A perfect recall score would be 100%.  
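For reference, here is the standard textbook NDCG@K formulation as a small sketch; the exact per-position weights we use internally (shown in the screenshot above) may differ slightly:

```python
# Standard NDCG@K as a recall-style quality score (the exact per-position
# weights used internally may differ from this textbook formulation).
import math

def dcg(relevances):
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=10):
    """relevances: relevance of each returned hit, in the order returned."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ordering scores 1.0 (100%); swapping the top two hits scores less.
print(ndcg([3, 2, 1, 0, 0]))   # 1.0
print(ndcg([2, 3, 1, 0, 0]))   # < 1.0
```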


When it comes to embeddings and measuring the recall of our nearest neighbor algorithm, we can compare the recall quality of an "exact KNN" approach, which has 100% recall by definition, against an ANN approach, which has imperfect recall. Note that this measurement was done using the default HNSW settings chosen by OpenSearch; ANN algorithms, HNSW in particular, often have hyperparameters that can be tuned (e.g. changing the number of graph links in HNSW) that will also affect recall. Using ANN as the scoring mechanism for MESA risks a recall quality problem, which manifests in two ways:
  • Approximation inherent in the chosen ANN scoring algorithm (e.g. HNSW NMSLIB), which is what allows for the faster retrieval.
    • This difference led to ~86% recall.
  • Filtering behavior when calculating nearest neighbors in Elasticsearch: exact KNN pre-filters, while ANN post-filters. Pre- and post-filtering here refer to when filtering is applied to your results, before scoring or after scoring. With ANN, you score all results, then filter afterwards. With exact KNN, you filter the results, then score afterwards. The risk with ANN's post-filtering is that if your filter filters too much, you might not end up with anything (see the sketch after this list).
    • With one set of genre filters, for our Cuttlefish embeddings, we found this difference led to ~25% recall.
    • The key difference is that ANN scores the top K results, then post-filters; whereas exact KNN pre-filters results, then scores those matches.
    • For example, for a customer whose Cuttlefish embedding preference leans to hip hop, if a filter is applied for classical, the initial top 10,000 hits calculated from Cuttlefish embedding distance are more likely to be hip hop, so if we post-filter the results to only classical, it is less likely those results contain the highest-scoring classical hits. (10,000 is a hard limit from ANN.)
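Here is a toy sketch (made-up scores and genres, not real Cuttlefish data) of why post-filtering a fixed top-K can lose almost all of the relevant hits, while pre-filtering keeps them:

```python
# Why post-filtering (ANN) can miss results that pre-filtering (exact KNN) finds.
# Toy data: scores stand in for embedding-distance scores, genre is a filterable field.
tracks = [{"id": i, "genre": ("classical" if i % 100 == 0 else "hiphop"), "score": 1000 - i}
          for i in range(10_000)]

K = 100  # ANN only returns the top-K scored hits before any filter is applied

# ANN-style: score everything, keep top K, THEN filter by genre.
ann_top_k = sorted(tracks, key=lambda t: -t["score"])[:K]
ann_results = [t for t in ann_top_k if t["genre"] == "classical"]

# Exact-KNN-style: filter by genre FIRST, then score only the matches.
knn_matches = [t for t in tracks if t["genre"] == "classical"]
knn_results = sorted(knn_matches, key=lambda t: -t["score"])[:K]

print(len(ann_results), len(knn_results))   # 1 vs 100: post-filtering lost most hits
```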
More information about Recall, and how we found these numbers, is here: Recall quality analysis of Amazon Elasticsearch OpenDistro ANN vs Custom Scoring with KNN plugin

Speed/Latency

We measure the speed of the nearest neighbor algorithm by timing how long it takes to get results back. As we power search for the Amazon Music app, as well as for Alexa customers asking to play music, we have a large volume of customers who need their responses quickly (low latency). In search, customers are very sensitive to latency: Amazon famously found that every 100ms (0.1 seconds) of latency cost 1% in sales, and Google found that an extra 500ms of latency in search page generation time dropped traffic by 20%.

To measure how well MESA does, we ran a number of performance experiments comparing ANN vs exact KNN.

Note: these findings are all based on evidence from experiments in Search Embedding Infrastructure Performance Evaluation; please see that document for the data that supports these claims.
  • ANN will almost always outperform KNN in both latency and throughput for match-all queries (no filtering).
  • Both ANN and KNN are CPU bound, and for best performance we should use instances optimized for compute (c5, c6, etc.).
  • ANN is very sensitive to Elasticsearch segment count; I found that latency ~= 6 ms * num_segments. This means live indexing will worsen ANN latency, as new indexing creates new segments.
  • KNN latency is very sensitive to selectivity; its latency per shard is approximately O(n), where n is the number of documents that need to be scored: latency ~= 0.3 µs/document-scored * number_of_documents_that_match_filter.
    • Selectivity here refers to how many documents a query matches. A highly selective query matches a small number of documents; an unselective/low-selectivity query matches a large percentage of the corpus.
  • Calculating a dot product between two 50-dimensional vectors, with the data already in memory, on an i3.4xlarge 2.3 GHz processor, takes on average 100 nanoseconds.
  • Sharding KNN leads to a trade-off between tail (p90, p99) and median (p50) latency; the more shards, the lower the tail latency, but the cost of fan-out increases average latency.
  • On both KNN and ANN, if average CPU utilization crosses above ~35%, we start seeing slight latency degradation.

MESA exact KNN vs ANN Choice

For CuttlefishV4, we chose an exact KNN approach, so that we could have better recall when applying filtering. To deal with latency, we split the data into multiple shards that are searched in parallel: 10MM track documents split into 72 shards, at 300 nanoseconds/document, gives roughly (10,000,000 / 72) × 300 ns ≈ 41ms match-all latency (see the sketch below).
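The same back-of-the-envelope arithmetic as a small helper, using the per-document scoring rule of thumb from the bullet points above (this is only a cost model, not a measured benchmark):

```python
# Back-of-the-envelope exact-KNN latency model using the rules of thumb above.
def estimated_knn_latency_ms(matching_docs, shards, us_per_doc=0.3):
    """Per-shard scoring cost, assuming shards are searched fully in parallel."""
    docs_per_shard = matching_docs / shards
    return docs_per_shard * us_per_doc / 1000.0   # microseconds -> milliseconds

# 10MM match-all across 72 shards at 0.3 microseconds per scored document:
print(estimated_knn_latency_ms(10_000_000, 72))   # ~41.7 ms
```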

For our GEMS demo cluster, we chose to use ANN, as it is faster/cheaper, and our first pass does not need filtering, so we won’t have a recall problem. 

Amazon Music Embeddings

In Amazon Music, we have many embedding models, both already developed and currently under development. MESA currently hosts only Cuttlefish, plus a test cluster for GEMS.

Amazon Music ML Models
The Amazon Music Machine Learning (MusicML) team created Cuttlefish, a 50-dimensional embedding model trained on Amazon Music customers' listening history. If a customer listens to only Kanye West, her Cuttlefish embedding vector's nearest neighbors will likely include the artist Kanye West's vector. If my friend and I listen to all the same music, our embedding vectors will be nearby one another, i.e. nearest neighbors. More information on how Cuttlefish actually creates these embeddings is in Appendix 4: How are Cuttlefish embeddings created?

Another embedding model created by MusicML, called Deep Cuts, is trained on sonic similarity. This means that if 2 songs share lots of the same sounds, they will be nearby in vector space: for example, a song played on acoustic guitar with a baritone singer would be near other acoustic guitar music with baritone singers.

Amazon Music Search Models
In the past, we trained a BERT model with transfer learning to prefer music with a high click-through rate. Currently, Music Search Relevance is working on GEMS, a series of embedding models built using graphs over catalog metadata, as well as search-query-to-entity relationships created from text search customer feedback. These embedding models will support better semantic understanding of queries, meaning we can find documents that have the same meaning even if they don't share the same words as the customer query. For example, "British Comedian Podcast" could return podcasts with interviews with Ricky Gervais in them, even though neither the podcast title nor description includes the exact tokens "Ricky Gervais". More about the plan can be read here: Intuitive Search CX- Demo & PRFAQ - V2 DRAFT.

MESA Components

We have microservices on both the search side, which customer queries flow through, and the ingestion side, where we keep our music data up to date.

The architecture diagram below depicts which pieces of Tenzing search and ingestion MESA uses; there is significant overlap between MESA and Tenzing components, as MESA uses a lot of the same microservices.
TenzingEmbeddingsNew.png

Elasticsearch + KNN Plugin

The Elasticsearch clusters are where the entity (track, album, station, playlist, artist) embeddings are stored, and where the nearest neighbor calculation happens. For a customer query, we find the nearest neighbors by calculating exact KNN using a custom Elasticsearch plugin that overrides the scoring of all query matches with dot product, cosine, etc.

You can think of Elasticsearch + KNN as a 2-phase approach. First, Elasticsearch calculates matches for a customer query using the underlying inverted indices. For example, let's say a customer types "rock", and we show 50 results on the page. Out of 10 million songs, 100,000 songs are tagged with / have "rock" in their metadata, and Elasticsearch then scores each of those 100,000 songs. Rather than using a traditional information retrieval score like TF/IDF or BM25, we override the Elasticsearch default scoring with a call to the KNN plugin to calculate the vector score.
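To illustrate the two-phase idea (not the actual plugin code or Elasticsearch query DSL), a Python sketch of "match on the inverted index, then override the score with a vector score" might look like this:

```python
# Two-phase flow, sketched in Python: (1) inverted-index match, (2) override the
# lexical score with a vector score. The real system does this inside Elasticsearch
# via a custom KNN scoring plugin; this is only an illustration of the idea.
import numpy as np

def search(query_text, customer_vector, documents, k=50):
    # Phase 1: "inverted index" match, e.g. all docs tagged with the query term.
    matches = [d for d in documents if query_text in d["tags"]]

    # Phase 2: score each match by dot product against the customer vector,
    # instead of a lexical score such as BM25 or TF/IDF.
    scored = [(float(np.dot(d["vector"], customer_vector)), d) for d in matches]
    scored.sort(key=lambda pair: -pair[0])
    return [d for _, d in scored[:k]]
```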

Redis - Embeddings Key/Value Store

We place both customer embeddings and entity (track, album, station, playlist, artist) embeddings in Redis. A client sends in a customer identifier/device identifier, an Alexa voice/speaker identifier, or an entity identifier, and we return the vector in a few milliseconds.

The Redis embeddings KV store is called by 3 clients: TenzingBackfillScheduler to write the embeddings, TenzingIndexingAggregator to read entity embeddings, and TenzingCoreService to retrieve an embedding and insert the vector into a search query that goes to Elasticsearch. We pioneered the use of Redis in MusicSearch, as it is a highly available, fast key/value store.

Ingestion Services - TenzingBackfillScheduler, TenzingIndexingAggregator, TenzingWorkflow

After MusicML generates the embeddings, our series of ingestion services puts the embeddings into Redis and Elasticsearch. TenzingBackfillScheduler downloads the raw embedding files from MusicML's S3 bucket and writes them to Redis. TenzingIndexingAggregator fetches the embeddings, and the rest of the music metadata, from their respective authority sources. The assembled data is then used by TenzingWorkflow to construct an Elasticsearch document that fits our Elasticsearch data model.

The most complicated part of ingesting embeddings is the "Index and Swap", which performs a daily update. We generate a new minor version of the Cuttlefish embedding model each day, so personalization incorporates the most recent listening history. Due to the nature of the machine learning methods used to create embedding models, in particular Cuttlefish, embeddings change their location in vector space with each daily retraining of the model. This changing data means we cannot incrementally update an index in place, as the nearest neighbors would be thrown off; instead we must fully recreate a new index and perform a live swap to it, so no data is lost in the process.

TenzingBackfillScheduler performs the steps: writing all embeddings to Redis, creating new empty Elasticsearch indexes for each entity type, triggering writes to Elasticsearch to fill the indexes with data by scheduling a backfill, verifying the Elasticsearch indexes have all the data, and then setting a flag (alias) to signal that the newly created Elasticsearch indexes are up to date. The search service TenzingCoreService checks this flag to know which Elasticsearch index to send queries to. This long-running job typically takes around 4 hours to complete, and uses an Amazon-internal tool called Herd as a distributed state machine/orchestration/coordination engine, which controls the aforementioned series of steps, handling failures, retries, etc. The "Index and Swap" process is described in more detail here: https://w.amazon.com/bin/view/Amazon_Music/Music_Search_Platform/Design/Tenzing/TenzingBackfillScheduler/Runbook
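As a rough sketch of the index-and-swap pattern (all function and alias names here are hypothetical illustrations; the real steps are orchestrated by Herd with retries and failure handling):

```python
# Sketch of the daily "Index and Swap" pattern. All names here are hypothetical
# illustrations; the real workflow is orchestrated by Herd, with retries and
# failure handling around each step.
import datetime

def index_and_swap(steps, model_version, entity_types):
    """`steps` bundles the real operations (Redis writes, ES index management)."""
    date_suffix = datetime.date.today().isoformat()

    steps.write_embeddings_to_redis(model_version)            # 1. refresh the KV store

    new_indexes = {}
    for entity in entity_types:                               # 2. create empty indexes
        name = f"{entity}-{model_version}-{date_suffix}"
        steps.create_empty_index(name)
        new_indexes[entity] = name

    steps.schedule_backfill(new_indexes, model_version)       # 3. fill them with data
    steps.verify_document_counts(new_indexes)                 # 4. sanity-check counts

    for entity, name in new_indexes.items():                  # 5. flip the alias so queries
        steps.point_alias(f"{entity}-live", name)             #    hit the new index
```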

Search Services - TenzingBrowse, TenzingCore, TenzingTextSearch, ...

Search services turn customer queries into Elasticsearch queries that compute nearest neighbors, in order to find songs for customers. We choose to hide some of the complexity of constructing nearest neighbor computations by specifying them differently in our API. For example, with our Cuttlefish model, rather than forcing clients to specify "cosine" vs "dot product", we ran experiments, found, for example, that cosine works better for entity-to-entity similarity, and encoded the best choices into configuration, so clients get the best distance function for most use cases without needing to specify it. We also provide a way to override this behavior if needed.

What customer-facing use cases can MESA power?

External (non-Search) customers can integrate with MESA by sending browse requests to the TenzingBrowseService API. Internal (Search) customers can integrate with MESA by sending requests to the TenzingCoreService API. They can filter on any catalog metadata; return tracks, playlists, albums, artists, or stations; and rank hits by customer affinity, entity affinity (similarity to other catalog documents), or combinations thereof.

For example, a client could implement "New Releases for You", a personalized experience showing which new releases we think a customer will enjoy, in a single request to the TenzingBrowseService API with a date filter combined with an AffinityRank. Here is an example request: https://w.amazon.com/bin/view/Amazon_Music/Music_Search_Platform/TenzingBrowseService/Runbook/CoralDiver#HExampleNewreleasesforyouQuery

Another use case MESA can power is entity similarity search: for example, find me stations that are similar to The Rolling Stones, or even artists that are similar to The Rolling Stones, The Who, and The Beatles. Here is an example: https://w.amazon.com/bin/view/Amazon_Music/Music_Search_Platform/TenzingCoreService/Runbook/CoralDiver/#HExampleQuery2CSimilarplayliststoTheRollingStones2CTheBeatles2CandTheWho:

Tech debt

Like any engineering project, MESA is imperfect. One place that could be improved is our "Index and Swap" mechanism. The challenge is that we need to switch to a new minor version of the embedding model each day, while serving live customer traffic, and the flip of both the embedding KV Redis lookup and the embedding Elasticsearch indexes being routed to needs to happen at the exact same time. Our current solution works and solves this problem, but it is not as elegant as we would like. As we add more embedding models, this area of the code base could require extra engineering effort.

Looking ahead, future work

Combination with Lexical

Once we complete the work to add multi-language text analyzers to our title fields, outlined in Personalized Text Search, we will be able to serve all current text search requests via MESA, although it will require tuning to combine BM25 (traditional lexical string-comparison ranking) with the embedding scoring. I am hopeful this will allow for better candidate retrieval, for example personalizing spearfishing queries with overlapping titles: a customer types "Justice", and we get the best French electronic duo, or the Justin Bieber album, correct at retrieval time.
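One simple way this combination could look, sketched with a tunable linear blend (the weighting scheme and normalization are assumptions that would need offline tuning):

```python
# One simple way to blend lexical and embedding scores: a tunable linear mix.
# The weight (and whether to normalize each score) would need tuning offline.
import numpy as np

def hybrid_score(bm25_score, doc_vector, customer_vector, alpha=0.5):
    """alpha=1.0 is pure lexical BM25; alpha=0.0 is pure embedding affinity."""
    embedding_score = float(np.dot(doc_vector, customer_vector))
    return alpha * bm25_score + (1.0 - alpha) * embedding_score

# e.g. two catalog items that both match the text "Justice" can be re-ordered
# by the customer's embedding affinity rather than by lexical score alone.
```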

1-click Ingestion for new ML Models

To allow data scientists to experiment with generating new models directly, we should work backwards from a vision of a "1-click ingestion" experience, where, with limited heavy lifting, a scientist can run an experiment to see if her model improves the customer experience. As Search Data Management has made AWS OpenSearch cluster formation quick (<1 hour), the next step will be developing automatic ingestion, which may take the form of batch indexing using something like Spark, incremental indexing using the existing ingestion flow, or perhaps something else. This will require more design/thought.

Hybrid exact KNN and ANN

One fancy thing we could do to minimize the effect of the performance vs. recall trade-off between ANN and KNN is to use both. For example, if we create a query selectivity estimator that, before querying, could quickly guess how many results would match, we could choose optimally between exact KNN and ANN. We could have an index with both an ANN field and a KNN field; based on selectivity (how many hits), we could choose to score using the ANN field (lots of hits/unselective) or the KNN field (selective, few hits).
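A sketch of what that routing could look like; the threshold and the selectivity estimator here are placeholders, not tuned values:

```python
# Sketch of hybrid routing: estimate how many documents a filter will match,
# then pick exact KNN for selective queries and ANN for broad ones.
# The threshold and the estimator itself are placeholders, not tuned values.
SELECTIVITY_THRESHOLD = 500_000   # hypothetical cut-over point, in matching docs

def choose_algorithm(estimated_matches):
    return "exact_knn" if estimated_matches < SELECTIVITY_THRESHOLD else "ann"

def run_query(query, selectivity_estimator, exact_knn_search, ann_search):
    estimated_matches = selectivity_estimator(query)
    if choose_algorithm(estimated_matches) == "exact_knn":
        return exact_knn_search(query)    # pre-filter, then score: better recall
    return ann_search(query)              # HNSW over the ANN field: lower latency
```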

Appendix

Appendix 1 - A corny folk tale of naming in AmazonMusic

Mesa is a tabletop mountain, figuratively representing a platform on which we can build.
image.png


Amazon Music used to have a product called Cloud Player, where customers could upload their music to the "cloud" and then download/stream it. The back-end software services that powered Cloud Player were thus named after clouds: Cirrus (music metadata library), Stratus (customer accounts), and Nimbus (fulfillment from purchases).

Cirrus was formerly built on software called A-Drive, a key/value storage system, but in order to scale to growing customer usage, the team decided to be an early adopter of DynamoDB and led a large refactor with lofty aspirations, on a firmer base that would still touch the clouds, hence the name "Everest". Everest held each customer's music.

Customers also needed a way to explore their music library. In mountaineering, the famous first ascent of Mount Everest was made by Edmund Hillary and Tenzing Norgay, a Sherpa; Tenzing became the search system that searched Everest. Later on, Tenzing's code base became a mess, hence a return to Basecamp (a large refactoring project) to refactor the search code. MESA is another mountain platform. Later on, GEMS (Graph Embeddings something something), mined from our mountains of data, will be proudly displayed on the MESA platform (hosted and available for search/recall).

Appendix 2 - Visual Explanation of Nearest Neighbors Search

The Blue Dots below are albums that Amazon Music customers listen to. Based on customers' listening history, album embeddings are created, and in this made-up example they are located in 2-dimensional vector space. Hip Hop fans, by listening to Hip Hop, have swayed the vectors for their favorite music to be nearby one another (the upper left). To the right, we see Classical music grouped together, formed by customer listening. Note that the axes here don't actually mean anything; they only provide a visual frame of reference to locate these items in vector space.
Slide1.png

Now we add in a customer, Nelson, the Orange Dot. Based on his listening history, he has listened to Frank Ocean and Anjunadeep, and has a vector somewhere in between.
Slide2.png
Grandpa Joe, the other Orange Dot, shows up on the far right. Based on his preference for Classical music, his vector is near the Classical cluster.
Slide3.png
Now, we fetch Nelson’s 3 nearest neighbors, in order to recommend him something to listen to. To calculate these, we find the distances between Nelson and all Blue Dots (albums), and then find the 3 with the smallest distance. These albums, due to vector distance, are most similar to his listening history. 
Slide4.png
Similarly, here we fetch Grandpa Joe's 2 nearest neighbors.
Slide5.png
We can also retrieve results by applying filters, sometimes referred to as search with constraints or facets. Let's say we apply a filter to only match songs that have been tagged with Hip Hop as a genre. In the diagram below, the now-grey dots will not surface, and we have filtered down to just the Hip Hop songs.
Slide6.png
If we then calculate nearest neighbors over only those results that match the query (the dark Blue Dots), we can power a customer-facing use case of "Genre for You". In the example below, it would be "Hip Hop for Nelson".
Slide7.png


Appendix 3: Approaches to finding nearest neighbors

Note: This section goes into detail about KNN and ANN, and touches on some of the algorithmic implementations, because choosing an ANN approach means trading recall for speed, and it's worth exploring why that is. It is not critical that the reader understand the algorithms, just that, at a high level, we have two options: exact and approximate.

KNN - K Nearest Neighbors - O(n)

Finding the exact nearest neighbors for a target vector means computing the distance to every other point in space. For example, to find the 10 closest tracks to a customerId/speakerId, we need to compare all 10 million tracks to the customerId/speakerId's embedding. This is linear time (the more possible candidates, the longer it takes to score them all).

image
Computing the distance to each vector to see which are the nearest neighbors (less efficient, best recall).


ANN - Approximate Nearest Neighbors - O(log n)

As opposed to exact KNN, where accuracy is guaranteed, an alternative approach is ANN. Approximate nearest neighbors finds what are probably the closest points, but without a guarantee. There is a performance/recall trade-off: the closer you get to perfect nearest neighbors (the exact KNN result), the slower the search (you have to do more comparisons).

There are different algorithms to do ANN, each with its own performance/recall trade off curve. 

Algorithm 1 - annoy

One way to reduce the number of comparisons you have to do is to repackage the vectors into hierarchical trees (e.g. R-trees), so you can compare your vector against the split criteria rather than against all points. This means you can find nearest neighbors in logarithmic time instead.

To explain what that means visually, one ANN algorithm, annoy, can be explained with pictures. Annoy (Approximate Nearest Neighbors Oh Yeah) was developed at Spotify, and powers their recommendations (source: https://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup):
image
Split the vector space with hyperplanes to reduce the number of comparisons you have to do. Rather than computing the distance to every point, you compute which side of a hyperplane you are on, and look for nearest neighbors within that subspace.


  • Am I above/below the middle bold line?
  • Okay, I'm above; that puts me in the green or orange territory. Next, am I left or right of the next split (am I in the orange or green area)?
  • The splits continue, etc.
  • At the bottom, the nearest neighbors are the ones in the same subspace.

Screen Shot 2020-05-07 at 4.35.57 PM.png

This is the data structure that gets built; to decide whether to go down the left or right branch, you compare against the pivot criteria (am I above/below a hyperplane?).

Once you have descended all the way down to the leaf nodes, you have found your approximate nearest neighbors. 
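For the curious, here is a compact sketch of one annoy-style tree (random hyperplane splits plus a greedy descent); real annoy builds a forest of such trees and merges their candidates:

```python
# Compact sketch of one annoy-style tree: recursively split the points with
# random hyperplanes, then answer queries by descending to a leaf.
# Real annoy builds many such trees and merges their candidates.
import numpy as np

def build_tree(points, leaf_size=5):
    if len(points) <= leaf_size:
        return {"leaf": points}
    normal = np.random.randn(points.shape[1])         # random hyperplane through the data
    projections = points @ normal
    offset = np.median(projections)                   # split roughly in half
    mask = projections < offset
    if mask.all() or not mask.any():                  # degenerate split; stop here
        return {"leaf": points}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points[mask], leaf_size),
            "right": build_tree(points[~mask], leaf_size)}

def query(tree, target):
    if "leaf" in tree:                                 # neighbors are the leaf's points
        return tree["leaf"]
    side = "left" if target @ tree["normal"] < tree["offset"] else "right"
    return query(tree[side], target)

points = np.random.rand(1000, 50)
tree = build_tree(points)
candidates = query(tree, np.random.rand(50))           # approximate nearest neighbors
```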


Algorithm 2 - Hierarchical Navigable Small World (HNSW) 

There is an implementation of HNSW called the Non-Metric Space Library (NMSLIB), which is more efficient than the "annoy" approach described above. Rather than splitting into trees, it hops around in vector space, from sub-graph to sub-graph and from vector to vector, greedily choosing the next nearest point until it descends close to the target.

The Amazon Elasticsearch team wrote an Open Distro plugin to make HNSW-NMSLIB run on Elasticsearch. This is what MusicKNNService uses. It is written about here.



nsw_graph.png
The algorithm starts by choosing a point at random and computing its distance to the target (the query).

It then proceeds to explore the graph, repeatedly choosing the next closest point (based on Euclidean distance), until it finds no closer points.



The graphs are built as a hierarchy of smaller graphs, so you start at a coarse layer (a small subset of points), find the nearest sub-graphs, then navigate those sub-graphs.
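A minimal sketch of the greedy walk within a single layer of such a graph (real HNSW keeps a candidate heap and multiple layers; this only shows the core hop-to-the-closer-neighbor idea):

```python
# Minimal sketch of the greedy search used within one layer of an HNSW-style
# graph: from a starting node, keep hopping to whichever neighbor is closer to
# the query, until no neighbor improves.
import numpy as np

def greedy_search(graph, vectors, query, start):
    """graph: node -> list of neighbor node ids; vectors: node -> np.ndarray."""
    current = start
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        improved = False
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist, improved = neighbor, d, True
        if not improved:
            return current          # local minimum: the approximate nearest neighbor

# Tiny 4-node example graph (node id -> neighbor ids) and 2-D vectors:
vectors = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]),
           2: np.array([1.0, 1.0]), 3: np.array([2.0, 1.0])}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, vectors, query=np.array([1.9, 1.1]), start=0))   # -> 3
```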



Benchmarking Different ANN Algorithms’ Performance

ANN-Benchmarks is a benchmarking environment for approximate nearest neighbor search algorithms.
http://ann-benchmarks.com/ 


Spotify uses annoy for recommendations (dark blue line in the graph below).
  • Algorithm used: Annoy, the approach described above
Amazon Elasticsearch KNN Service uses HNSW (NMSLIB) (light blue line in the graph below).
  • Algorithm used: HNSW (NMSLIB)
  • For 80% recall, the benchmark below suggests we can reach ~3000 TPS; for ~95% recall, ~500 TPS

glove-100-angular.png
                                                                        ( source : http://ann-benchmarks.com )


Appendix 4: How are Cuttlefish embeddings created?

A full write up is here: https://w.amazon.com/index.php/Amazon%20Music/MusicML/Cuttlefish

Cuttlefish uses Matrix Factorization to create user-to-item embeddings. 

One method, collaborative filtering with latent factors, popularized by its efficacy in making recommendations for the Netflix Prize (https://en.wikipedia.org/wiki/Netflix_Prize), uses matrix factorization to solve the problem.

Let’s say you have all users (rows) x all songs (columns), in a massive sparsely populated matrix, where the value is how many times the user listened to the song (at least 30 seconds). 
| | Song1 | Song2 | Song3 | Song4 | Song5 |
|---|---|---|---|---|---|
| Nelson | 4 | | | 1 | |
| Jess | 2 | | 6 | 8 | |
| Neil | | | 5 | | |
| Sidd | | | | 5 | 2 |
| Joaquin | 1 | 5 | | | 3 |
| Dhara | | | | 2 | |

To make recommendations, the goal is to "guess" how many times a customer would listen to a song they haven't heard. If your guess is that they would listen to it 5 times, you recommend it, and the user does listen to it 5 times, it was a good recommendation.

The "latent factor" idea is to break the problem up: in order to compute the rating matrix, which has final size N customers by M songs, you split it into:

(N customers × M songs) = (N customers × D latent factors) × (D latent factors × M songs)

In 2 dimensions (D = 2) this would be:

| | LatentFactor1 | LatentFactor2 |
|---|---|---|
| Nelson | ? | ? |
| Jess | ? | ? |
| Neil | ? | ? |
| Sidd | ? | ? |
| Joaquin | ? | ? |
| Dhara | ? | ? |

× (matrix product)

| | Song1 | Song2 | Song3 | Song4 | Song5 |
|---|---|---|---|---|---|
| LatentFactor1 | ? | ? | ? | ? | ? |
| LatentFactor2 | ? | ? | ? | ? | ? |

≈ (the original listen-count matrix)

| | Song1 | Song2 | Song3 | Song4 | Song5 |
|---|---|---|---|---|---|
| Nelson | 4 | | | 1 | |
| Jess | 2 | | 6 | 8 | |
| Neil | | | 5 | | |
| Sidd | | | | 5 | 2 |
| Joaquin | 1 | 5 | | | 3 |
| Dhara | | | | 2 | |
It then becomes a machine learning problem, where you use different techniques to solve for those "?" values. Those values become your embeddings.

For example, you may initialize all the "?" values at random, multiply the matrices, see how far you are from the sparse matrix of observed listens, and keep adjusting until you arrive at a good solution. In practice, you use a loss function plus gradient descent to solve for those D latent factors.

Customers with similar latent factors will have similar taste. Once you have the latent factors, you find the nearest neighbors (the closest other latent-factor vectors, measured by dot product or cosine similarity) to compute the nearest songs. The vectors that multiply together to give the highest predicted rating will be the best recommendations for the user.
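To make the "guess and check" idea concrete, here is a small gradient-descent sketch that factors the example listen-count matrix above into 2-dimensional user and song embeddings (learning rate, step count, and initialization are arbitrary choices for illustration):

```python
# Gradient-descent sketch of the latent-factor idea: factor the sparse
# listen-count matrix R (users x songs) into U (users x D) and V (D x songs),
# fitting only the observed entries. Hyperparameters here are arbitrary.
import numpy as np

R = np.array([
    [4, 0, 0, 1, 0],    # Nelson   (0 = unobserved, not "zero listens")
    [2, 0, 6, 8, 0],    # Jess
    [0, 0, 5, 0, 0],    # Neil
    [0, 0, 0, 5, 2],    # Sidd
    [1, 5, 0, 0, 3],    # Joaquin
    [0, 0, 0, 2, 0],    # Dhara
], dtype=float)
observed = R > 0

D, lr, steps = 2, 0.01, 5000
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], D))     # user embeddings
V = rng.normal(scale=0.1, size=(D, R.shape[1]))     # song embeddings

for _ in range(steps):
    error = (U @ V - R) * observed                   # only penalize observed cells
    grad_U = error @ V.T
    grad_V = U.T @ error
    U -= lr * grad_U
    V -= lr * grad_V

predictions = U @ V            # "guessed" listen counts for every user/song pair
```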