Most of us have heard of SQL and NoSQL databases and know how they are used. Popular databases such as MySQL and MongoDB are used extensively to build software applications.
But what are vector databases and why are they required?
Let's talk about vector databases
AI and ML applications, especially deep learning applications in NLP and computer vision, deal with lots of vectors. They have to operate on them and persist them in large numbers.
Semantic search, image search, recommendation systems and question answering are some of the AI-based applications that rely heavily on vector arithmetic.
So what are vectors? In ML applications, a vector is an ordered list of numbers, often with hundreds or thousands of dimensions, and a collection of vectors is stored as a large matrix.
Vectors are used to store representations of words, sentences, paragraphs, documents or images, called embeddings. With deep learning models at the center of most AI-based applications, the importance of vectors has grown rapidly, and we need systems that can store and retrieve such data in huge quantities in real time.
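To make the idea of an embedding concrete, here is a minimal sketch in plain Python. The three 4-dimensional vectors are made up for illustration (real models produce vectors with hundreds of dimensions), and cosine similarity is just one common way to compare them.

```python
import math

# Hypothetical 4-dimensional embeddings, invented for this example.
embeddings = {
    "cat": [0.9, 0.1, 0.0, 0.2],
    "kitten": [0.8, 0.2, 0.1, 0.3],
    "car": [0.1, 0.9, 0.7, 0.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
```

With these made-up numbers, "cat" scores closer to "kitten" than to "car", which is the property that semantic search relies on.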
Unfortunately, the popular databases of earlier years have not kept pace with the advances in AI, and few existing general-purpose databases offer a good enough solution for operating on vectors.
AI applications need to generate millions of such vectors while training models, store them effectively and efficiently in a persistent store, read blocks of them quickly at inference time, and perform mathematical operations such as multiplication and division on large matrices.
For AI applications there is also a need to perform ‘search’ operations on large collections of vectors in real time and quickly retrieve the most relevant matching vectors.
This has led to the development of indexes for vectors, the creation of partitions or clusters of such vectors, and the implementation of various nearest neighbor algorithms to quickly search for and retrieve the most relevant matches.
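The difference between scanning everything and searching through partitions can be sketched with NumPy. This is a simplified illustration, not the algorithm of any particular library: the data is synthetic, the function names are hypothetical, and randomly chosen vectors stand in for the cluster centroids that a real index would typically learn with k-means.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for 1,000 embeddings with 32 dimensions each.
vectors = rng.normal(size=(1000, 32)).astype(np.float32)
# A query that is a slightly perturbed copy of vector 42.
query = vectors[42] + 0.01 * rng.normal(size=32).astype(np.float32)


def exact_search(data, q, k):
    """Brute force: measure the distance to every vector, return the k closest ids."""
    dists = np.linalg.norm(data - q, axis=1)
    return np.argsort(dists)[:k]


def build_partitions(data, n_partitions, rng):
    """Use random vectors as centroids (assumption: a real index would train them)."""
    centroids = data[rng.choice(len(data), size=n_partitions, replace=False)]
    # Squared distance from every vector to every centroid.
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = d.argmin(axis=1)  # partition id for each vector
    return centroids, assignment


def partitioned_search(data, centroids, assignment, q, k, n_probe):
    """Approximate: scan only the n_probe partitions whose centroids are closest to q."""
    centroid_dists = np.linalg.norm(centroids - q, axis=1)
    probe = np.argsort(centroid_dists)[:n_probe]
    candidates = np.flatnonzero(np.isin(assignment, probe))
    dists = np.linalg.norm(data[candidates] - q, axis=1)
    return candidates[np.argsort(dists)[:k]]


centroids, assignment = build_partitions(vectors, n_partitions=16, rng=rng)
exact_ids = exact_search(vectors, query, k=5)
approx_ids = partitioned_search(vectors, centroids, assignment, query, k=5, n_probe=4)
```

Probing fewer partitions means fewer distance computations at the cost of possibly missing a true neighbor. Probing all of them reduces to the exact search.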
The most rudimentary way of storing and operating on such vectors is to keep NumPy arrays in files and load them into memory when needed. Although NumPy arrays are a very efficient way to store vectors and operate on them, they are technically not databases.
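As a rough sketch of this file-based approach, the snippet below saves a synthetic matrix of embeddings with np.save and reads one block back with np.load. The file name and array sizes are placeholders chosen for the example.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 10,000 vectors with 128 dimensions each.
embeddings = rng.normal(size=(10_000, 128)).astype(np.float32)

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, embeddings)

# mmap_mode="r" maps the file read-only instead of reading it all into RAM,
# so only the rows that are actually touched get paged in.
stored = np.load(path, mmap_mode="r")
block = np.asarray(stored[2_000:2_010])  # copy one block of 10 vectors into memory
```

This gives fast block reads, but everything else a database provides (queries, indexing, concurrent access, recovery) is left to the application.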
Most databases offer query engines, compute engines, memory management and storage engines out of the box so that users can just focus on app development and use an easy interface to store and retrieve data from the database.
Since many ML applications run in the cloud there is also a requirement to provide cloud infrastructure for scale, high availability and fault tolerance for such data.
Another important consideration is that most matrix operations on such vectors happen in memory, and the results have to live in the application's memory so that ML code can access them.
This means we need a highly optimized memory management system when we deal with such a huge collection of vectors so that we do not run into out-of-memory conditions.
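One simple way to keep memory bounded, sketched below with NumPy, is to process the collection in fixed-size batches so that only one slice is being worked on at a time. The function name, batch size and data are illustrative assumptions, not part of any database's API.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(5_000, 64)).astype(np.float32)  # synthetic vectors
q = rng.normal(size=64).astype(np.float32)              # synthetic query


def batched_scores(data, q, batch_size):
    """Dot product of q with every row of data, computed one batch at a time."""
    scores = np.empty(len(data), dtype=np.float32)
    for start in range(0, len(data), batch_size):
        stop = start + batch_size  # slicing past the end is safe in NumPy
        scores[start:stop] = data[start:stop] @ q
    return scores


scores = batched_scores(data, q, batch_size=256)
```

Combined with a memory-mapped file as the source of data, this pattern lets an application score far more vectors than would fit in RAM at once.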
An easy query interface goes a long way toward improving the usability of a database. No wonder SQL is the most popular database language today; it has been so for more than forty years and is still going strong.
It would be really nice to have a SQL-like interface for vector databases too, so that they can be used easily inside AI applications and the query processing and data access parts can be abstracted away.
Now that we know why vector databases are needed and what features they have to provide, are there any popular vector databases available?
The most popular vector databases
There are a few emerging vector libraries and databases, mostly available as open source products, but they do not provide all the features of a standard database.
One of the earliest open source vector search libraries is Annoy from Spotify. It is a C++ library with a Python interface that performs Approximate Nearest Neighbour (ANN) search using a forest of random projection trees. However, it is not really a vector database.
Similar to Annoy is NMSLIB, another popular ANN library written in C++ with C++ and Python interfaces.
One of the most popular and heavily used open source vector indexing and search libraries is Faiss, from the Facebook AI Research group. It is also a C++ library with Python and C++ interfaces, and it provides many indexing options, with and without compression. It sits somewhere along the path from a search library to a true vector database.
Elasticsearch also provides a way to store and retrieve vectors and perform similarity search on such vectors.
Milvus is an emerging open source vector database that provides many options for indexing and similarity search and proposes a truly cloud native architecture for vector databases.
Under the hood, it uses the Faiss, NMSLIB and Annoy libraries but offers them as a true vector database.
The new kid on the block is Pinecone. It offers a cloud-based vector database and claims to index billions of vectors and search them in milliseconds. It offers various options for managing vector indexes and performing similarity search on them using Python.
Final thoughts
As more and more AI applications are deployed into production to support business-critical use cases, there is a pressing need for a vector database that is as good as the mission-critical SQL databases operating in the cloud today.
More research and investment in this area will enable the development of a high performing and reliable vector database.
At Engati, we use vectors a lot and look forward to the development of large-scale, cloud-based vector databases.