Vector embedding database? #995
Replies: 7 comments 9 replies
-
This is very interesting. In theory, if you can translate an image/text/video/whatever into something that can be indexed by ArcadeDB (using our fast LSM Tree), it should be straightforward. Unfortunately, it's not easy to find Java libraries that do this job. Does anybody have an idea about how to achieve that?
-
[I updated my initial post, please re-read!] Apart from encoding unstructured data into vectors, there are a few special algorithms to index such databases (classic tree-based indexes are apparently not used) and to do fast approximate/similarity search, which make this class of databases a thing of their own:
There are also different ways to measure distance between vectors (such as cosine similarity and Euclidean distance, a.k.a. L2 distance) and different ways to produce the embedding vectors (e.g. BERT). Where I see opportunity for ArcadeDB is:
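For reference, the two distance measures mentioned above can be sketched in a few lines of plain Java over `float[]` vectors; this is just an illustration of the math, not ArcadeDB code:

```java
// Minimal sketch of the two common vector distance measures, assuming
// plain float[] vectors with equal length (no library involved).
public final class VectorDistance {

    // Cosine similarity: dot(a, b) / (|a| * |b|), result in [-1, 1].
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean (L2) distance: sqrt(sum over i of (a_i - b_i)^2).
    public static double l2Distance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Note that cosine similarity ranks by angle only (vector magnitude is ignored), while L2 distance is sensitive to magnitude, which is why datasets like GloVe-Angular are paired with the cosine/angular metric.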
This is a curated list of such libraries, but I have not checked whether any of them is in Java. I also want to mention that there is a Java tensor (i.e. multi-dimensional matrix) library, ND4J (https://github.com/deeplearning4j/deeplearning4j), but it is pretty old school and does not use the JEPs mentioned above. Other than that, I found these blog posts to be a nice introduction to how such databases work:
There are also some insights (quite interesting!) not related to the AI/ML "secret sauce", but more to general modern distributed database design, which may (or may not) overlap with what is in ArcadeDB:
-
@hrstoyanov I've been exploring the vector database world and I definitely would like to provide a new vector model to ArcadeDB! I've created a new branch, "vector-database", where I can do some experiments. Right now I'm loading a Word2Vec dataset with 3M words, each with a 300-dimension vector. Playing with @gramian to make this more efficient, I was wondering if there are any resources on quantization, to reduce the amount of data and speed up the cosine function, which currently takes 0.7 secs (a real-time lookup of the top 20 words for the word X).
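One of the simplest quantization schemes to start from is scalar quantization: map each float32 component onto an int8 bucket over the dataset's value range, for a 4x memory reduction (at some cost in precision). A minimal sketch, assuming a known global min/max; the class and field names are illustrative, not part of any ArcadeDB API:

```java
// Sketch of scalar quantization: each float32 component is mapped to one of
// 256 buckets and stored as an int8, shrinking vectors 4x. Real systems often
// compute the min/max range per dimension or per vector; a single global
// range is used here for simplicity.
public final class ScalarQuantizer {
    final float min;
    final float scale; // width of one bucket

    public ScalarQuantizer(float min, float max) {
        this.min = min;
        this.scale = (max - min) / 255f;
    }

    // float[] -> byte[]: round each component to its nearest bucket.
    public byte[] quantize(float[] v) {
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++)
            q[i] = (byte) (Math.round((v[i] - min) / scale) - 128);
        return q;
    }

    // byte[] -> float[]: recover the bucket center (lossy, max error ~scale/2).
    public float[] dequantize(byte[] q) {
        float[] v = new float[q.length];
        for (int i = 0; i < q.length; i++)
            v[i] = (q[i] + 128) * scale + min;
        return v;
    }
}
```

Besides the memory savings, distance functions can then run on the int8 codes directly with integer arithmetic, which is typically much faster than the float32 cosine loop.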
-
@lvca sorry, I was absent the last few days, traveling back to Europe, etc.

For multi-dimensional arrays (i.e. tensors) you may want to check out ND4J. Although it is part of deeplearning4j, I think it is separate, with no other dependencies. (Also check the datavec module while you are at it.) As explained in the ND4J docs, storing a multi-dimensional tensor as a flat (1-dimensional) Java array can easily exceed the array size limit (which in Java is MAX INT rather than MAX LONG, as it should have been!). Therefore, ND4J resorts to off-heap arrays (I can't recall if it is via JNI or native byte buffers). I would personally take it one step further and try to improve (or rewrite) ND4J with JEP 442: Foreign Function & Memory API, in preview in recent JDK releases, but this might be too much for ArcadeDB...

As far as quantization/vector compression goes (at the expense of precision, which is a very acceptable trade-off in ML), I would look at another over-hyped project: llama.cpp. It is a C++ re-implementation of Facebook's LLaMA model, but internally it uses a small C tensor library (just like ND4J). This tensor library is very cool because it leverages modern Apple M-series silicon, SIMD instructions as a fallback, or GPUs. Again, if you are adventurous, you can do the same in Java with JEP 442, or use JEP 426: Vector API for SIMD vector ops (it works with both x64 AVX and ARM NEON SIMD instructions).

Anyway, take a look at llama.cpp, specifically at GGML here and here. One of the features there is quantization/de-quantization, which allows llama.cpp to compress the original 13-billion-parameter model to fit into the RAM of an average desktop/laptop computer! Not only does the model compress well, it is also very fast at inference, because of the vector acceleration features I described earlier.

So yes, vector databases are in high demand right now.
Reading this to figure out the next wave of AutoGPT, BabyAGI and autonomous agents: it will all depend on vector databases, which at the moment are provided by Pinecone (now worth $700m), Weaviate ($200m), and Chroma ($75m). I have not looked at more efficient functions than cosine; I will follow up on that separately.
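The off-heap storage trick ND4J uses can be sketched with a plain direct `ByteBuffer`, the pre-JEP-442 approach available in every JDK; `MemorySegment` from the Foreign Function & Memory API would be the modern, long-indexed equivalent. The class here is illustrative, not any library's actual API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a fixed-size off-heap vector store using a direct ByteBuffer.
// The data lives outside the GC heap. Note a single ByteBuffer is still
// int-indexed (max ~2GB), so real libraries chunk data across several
// buffers, or use the long-indexed MemorySegment from JEP 442.
public final class OffHeapVectors {
    private final ByteBuffer buffer;
    private final int dimensions;

    public OffHeapVectors(int count, int dimensions) {
        this.dimensions = dimensions;
        this.buffer = ByteBuffer.allocateDirect(count * dimensions * Float.BYTES)
                                .order(ByteOrder.nativeOrder());
    }

    // Write one vector at a slot; duplicate() keeps this thread-safe
    // with respect to the shared buffer's position.
    public void put(int index, float[] vector) {
        ByteBuffer view = buffer.duplicate();
        view.position(index * dimensions * Float.BYTES);
        for (float f : vector) view.putFloat(f);
    }

    // Read one vector back into an on-heap float[].
    public float[] get(int index) {
        float[] out = new float[dimensions];
        ByteBuffer view = buffer.duplicate();
        view.position(index * dimensions * Float.BYTES);
        for (int i = 0; i < dimensions; i++) out[i] = view.getFloat();
        return out;
    }
}
```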
-
Good news about the new vector model in ArcadeDB. I'm working on the branch "vector-model" and this is the latest update:
First Performance Results
Short term todo list (in a week or less)
Open Questions
-
First PR is out! #1148
-
Inspired by the Qdrant benchmark page (https://qdrant.tech/benchmarks/#filtered-search-benchmark), I downloaded the GloVe-100-Angular dataset (100 dimensions, 1.2M entries) and loaded it into ArcadeDB. Below are some numbers from my MacBook Pro 2019 (Intel CPU, 32GB RAM):
I can't tell what the accuracy of the responses is; I guess it should be measured against distances calculated over all the vector embeddings.
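That measurement is usually expressed as recall@k: run an exact brute-force top-k search over all the embeddings as ground truth, then count how many of the approximate index's results appear in it. A minimal sketch, assuming result ids are already collected into lists (the helper names are hypothetical, not ArcadeDB API):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Recall@k: fraction of the exact (brute-force) top-k results that
// the approximate index also returned. 1.0 means perfect accuracy.
public final class Recall {
    public static double recallAtK(List<Integer> approximate, List<Integer> exact) {
        Set<Integer> truth = new HashSet<>(exact);
        long hits = approximate.stream().filter(truth::contains).count();
        return (double) hits / exact.size();
    }
}
```

The Qdrant benchmarks report exactly this metric, averaged over a few thousand query vectors, which is what makes throughput numbers from different engines comparable.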
-
ArcadeDB is a general-purpose database, combining document, graph and time-series models. However, there is a new kind of database on the horizon that has emerged over the last 1-2 years: "vector" databases, for efficiently storing "vector embeddings". These are used in the latest crop of generative AI of 2022, and are believed to be the next frontier in ML/AI search and inference. Maybe ArcadeDB can add such capabilities as well?
Here is a list of several such projects, most of them open source and written in Go. Check out the associated blogs and architecture posts to get a better understanding of why they are needed and what problems they address:
Although most of the above are billed as "open source", there is typically a commercial entity behind each that provides a paid-for cloud service. The interest seems to be huge: Pinecone, for example, did a $700m investment round; I'm not sure about the rest. (I am not advocating VC-funded open-source databases!)
This thread from Hacker News is also super useful.