Search is everywhere - be it looking for information on Google or Bing, finding a product on Amazon, hunting for music on Spotify or rummaging for videos on Netflix. Search is one of the fundamental techniques for information retrieval (IR) and has revolutionized access to information over the internet.
It is also a fundamental part of information retrieval in databases. Binary search is a technique widely used in computer science to locate precise data in a large database. B-trees are a variation of binary search to locate precise data in disk and are used extensively in Relations databases (RDBMS).
In an RDBMS the lowest unit in which data is stored is a row in a table. A row consists of multiple columns of similar or dissimilar data that represents a real world object.
Here the data has a defined structure that is well known in advance to the user who is querying.
When you have to find that object in the database you need to provide an unique identifier e.g. name of a person when you are searching in a table containing customer data.
This works remarkably well and has a great degree of accuracy. To speed up the process databases use indexes on such columns, to make them searchable quickly.
However when you need to search for similar or same ‘text’ data or ‘image’ data or ‘song’ or ‘video’ you can’t use the row in a relational database since the query object will never be an exact match and you can’t even provide an exact identifier most of the time.
Here the data is unstructured and there is no well defined schema that is known in advance to the user who is querying.
Say you want to search for articles in Google, you really do not know in advance all the thousands of words or sentences or paragraphs that are contained in documents so that you can exactly specific the sentence or a paragraph to find documents that contain them.
Search queries for text are usually a pointer or reference to locate information in a fuzzy way by approximately matching the query with documents that contain similar information.
The technique that was invented to search for text, image, video, songs or any other form of unstructured data is called vector search.
Let’s start by understanding what a vector search is and how data is represented in a vector search.
A vector search is a data structure that can store the features of an object, such as a paragraph or an image, which has been generated by a machine learning algorithm. Vectors are also known as embeddings.
A vector has a size which is referred to as the dimension of the vector and you can pack many features of an object into that.
Text such as words, sentences, paragraphs etc. can be represented as word embeddings, sentence embeddings, paragraph embedding and stored in a multi dimensional vector array. Prominent machine learning algorithms like word2vec, Glove, BERT are used to generate representations of text as vectors in a neural network.
These algorithms capture the lexical and semantic features of text data.
Images are represented as the output of a CNN algorithm in a neural network.
CNN takes patches of pixel data and converts them into features of the object in an image and packs them into an image vector.
Most of the neural networks used for generating such vectors for text and image use deep learning models so that you can uncover prominent features of the object which define the identity features. These identity features can be then used to search for the same object or similar objects in a vector space that contains representative vectors or embeddings of the searchable objects.
Searching for Vectors is challenging in addition to being a compute and memory intensive operation. Unlike searching for records in a table in a RDBMS where we are trying to match one or more columns in a record with an incoming set of values, in vector search we have to match with every dimension of a vector.
In RDBMS we can scan the entire table to fetch all records and then do a match with every record. This will work only if the table has less number of records. However in a big table having a million records or more this technique will not perform well. Hence we use an index like a B-Tree index to enhance the search speed. These concepts can be extended to vector search.
First we have to determine how we can measure if two vectors are the same or are similar. Typically the metrics used in vector search are ''Euclidean distance'' or ''Cosine similarity'' . The difference between both is that the magnitude is not critical in cosine similarity but the direction is.
If we scan all vectors in a vector graphics space and calculate the Euclidean distance or Cosine similarity between each of them and a query vector then this is similar to doing a table scan in a RDBMS table. Typically the top K matches are returned since in vector search it will be difficult to get an exact match unless the query vector is a duplicate of the vectors in the vector space. This is called k-NN or k nearest neighbor algorithm.
In order to reduce the computational overhead associated with k-NN, which does an exhaustive search in a vector space, an approximate search can be performed. This search is similar to using indexes in RDBMS to cut down the search time. It is called Approximate Nearest Neighbor or ANN search. Most ANN algorithms use techniques such as hashing, indexing, clustering and graphs to cut down the search time.
One of the earliest ANN algorithms came from Spotify, called Annoy. Annoy is based on a tree based structure. An index is created with multiple tree structures (called a forest). Each tree is built by picking two points randomly in the vector space and splitting the vector space into two by their hyperplane. The subspaces are further split and so on until points associated with a node are small. For searching a query vector, the trees in the forest are traversed in order to obtain a set of points which are closest to the query vector.
Another technique to speed up vector search is by using a hashing algorithm. Locality Sensitive Hashing or LSH is one such technique. Vectors are hashed into hash buckets such that similar vectors are most likely to be located in the same bucket.
An incoming query vector is hashed to determine the bucket it will fall into and only the vectors in that bucket are matched. This reduces the number of matches and hence increases the speed.
One of the recent algorithms that achieves high performance in vector search is called Hierarchical Navigable Small World Algorithm (HNSW).
HNSW is a graph based algorithm to approximate the kNN match. The search index in HNSW is multi-layered such that each layer is a proximity graph with two vertices linked to each other based on proximate distance..
Each node in a layer in the graph corresponds to one or more vectors.
A nearest neighbor search starts at an entry point node in the top most layer, is recursive and a greedy graph traversal is undertaken in each layer until a local minimum is reached in the bottomost layer.
Navigable Small World means logarithmic or polylogarithmic scaling of the greedy graph routing.
The detailed explanation of how this algorithm works can be found in the original HNSW paper.
A benchmark was completed for current state of the art vector search algorithms and can be found here: http://ann-benchmarks.com/
As we collect and store huge quantities of unstructured data, having a fast vector search technique will make the data more accessible to users.