Vector embedding database? #995
Replies: 7 comments 9 replies
-
This is very interesting. In theory, if you can translate an image/text/video/whatever into something that can be indexed by ArcadeDB (using our fast LSM Tree), it should be straightforward. Unfortunately, it's not easy to find Java libraries that do this job. Does anybody have an idea about how to achieve that?
-
[I updated my initial post, please re-read!] Apart from encoding unstructured data into vectors, there are a few special algorithms to index such databases (classic tree-based indexes are apparently not used) and to do fast approximate/similarity search, which make this class of databases a thing of their own:
There are also different ways to measure distance between vectors (such as cosine similarity and Euclidean distance, a.k.a. L2 distance) and different ways to produce the embedding vectors (e.g. BERT). Where I see opportunity for ArcadeDB is:
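For reference, the two distance measures mentioned above can be sketched in a few lines of plain Java over `float[]` vectors; this is just an illustration of the math, not ArcadeDB code:

```java
// Minimal sketch of the two common vector distance measures, assuming
// plain float[] vectors with equal length (no library involved).
public final class VectorDistance {

    // Cosine similarity: dot(a, b) / (|a| * |b|), result in [-1, 1].
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean (L2) distance: sqrt(sum over i of (a_i - b_i)^2).
    public static double l2Distance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Note that cosine similarity ranks by angle only (vector magnitude is ignored), while L2 distance is sensitive to magnitude, which is why datasets like GloVe-Angular are paired with the cosine/angular metric.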
This is a curated list of such libraries, but I have not checked whether any of them is in Java. I also want to mention that there is a Java tensor (i.e. multi-dimensional matrix) library, ND4J (https://github.com/deeplearning4j/deeplearning4j), but it is pretty old school and does not use the JEPs mentioned above. Other than that, I found these blog posts to be a nice introduction to how such databases work:
There are also some insights (quite interesting!) not related to the AI/ML "secret sauce", but more to general modern distributed database design, which may (or may not) overlap with what is in ArcadeDB:
-
@hrstoyanov I've been exploring the vector database world and I definitely would like to provide a new vector model to ArcadeDB! I've created a new branch, "vector-database", where I can do some experiments. Right now I'm loading a Word2Vec dataset with 3M words, each with a 300-dimension vector. Playing with @gramian to make this more efficient, I was wondering if there are any resources on quantization, to reduce the amount of data and speed up the cosine function, which currently takes 0.7 secs (a real-time lookup of the top 20 words for the word X).
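One of the simplest quantization schemes to start from is scalar quantization: map each float32 component onto an int8 bucket over the dataset's value range, for a 4x memory reduction (at some cost in precision). A minimal sketch, assuming a known global min/max; the class and field names are illustrative, not part of any ArcadeDB API:

```java
// Sketch of scalar quantization: each float32 component is mapped to one of
// 256 buckets and stored as an int8, shrinking vectors 4x. Real systems often
// compute the min/max range per dimension or per vector; a single global
// range is used here for simplicity.
public final class ScalarQuantizer {
    final float min;
    final float scale; // width of one bucket

    public ScalarQuantizer(float min, float max) {
        this.min = min;
        this.scale = (max - min) / 255f;
    }

    // float[] -> byte[]: round each component to its nearest bucket.
    public byte[] quantize(float[] v) {
        byte[] q = new byte[v.length];
        for (int i = 0; i < v.length; i++)
            q[i] = (byte) (Math.round((v[i] - min) / scale) - 128);
        return q;
    }

    // byte[] -> float[]: recover the bucket center (lossy, max error ~scale/2).
    public float[] dequantize(byte[] q) {
        float[] v = new float[q.length];
        for (int i = 0; i < q.length; i++)
            v[i] = (q[i] + 128) * scale + min;
        return v;
    }
}
```

Besides the memory savings, distance functions can then run on the int8 codes directly with integer arithmetic, which is typically much faster than the float32 cosine loop.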
-
@lvca sorry, I was absent the last few days, traveling back to Europe, etc.

For multi-dimensional arrays (i.e. tensors) you may want to check out ND4J. Although it is part of deeplearning4j, I think it is separate, with no other dependencies. (Also check the datavec module while you are at it.) As explained in the ND4J docs, storing a multi-dimensional tensor as a flat (1-dimensional) Java array can easily exceed the array size limit (which in Java is MAX INT rather than MAX LONG, as it should have been!). Therefore, ND4J resorts to off-heap arrays (I can't recall if it is via JNI or native byte buffers). I would personally take it one step further and try to improve (or rewrite) ND4J with JEP 442: Foreign Function & Memory API, in preview in recent JDK releases, but this might be too much for ArcadeDB...

As far as quantization/vector compression goes (at the expense of precision, which is a very acceptable trade-off in ML), I would look at another over-hyped project: llama.cpp. It is a C++ re-implementation of Facebook's LLaMA model, but internally it uses a small C tensor library (just like ND4J). This tensor library is very cool because it leverages modern Apple M-series silicon, SIMD instructions as a fallback, or GPUs. Again, if you are adventurous, you can do the same in Java with JEP 442, or use JEP 426: Vector API for SIMD vector ops (it works with both x64 AVX and ARM NEON SIMD instructions).

Anyway, take a look at llama.cpp, specifically at GGML here and here. One of the features there is quantization/de-quantization, which allows llama.cpp to compress the original 13-billion-parameter model to fit into the RAM of an average desktop/laptop computer! Not only does the model compress well, it is also very fast at inference, because of the vector acceleration features I described earlier.

So yes, vector databases are in high demand right now.
Reading this to figure out the next wave of AutoGPT, BabyAGI and autonomous agents: it will all depend on vector databases, which at the moment are provided by Pinecone (now worth $700m), Weaviate ($200m), and Chroma ($75m). I have not looked at more efficient functions than cosine; I will follow up on that separately.
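The off-heap storage trick ND4J uses can be sketched with a plain direct `ByteBuffer`, the pre-JEP-442 approach available in every JDK; `MemorySegment` from the Foreign Function & Memory API would be the modern, long-indexed equivalent. The class here is illustrative, not any library's actual API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a fixed-size off-heap vector store using a direct ByteBuffer.
// The data lives outside the GC heap. Note a single ByteBuffer is still
// int-indexed (max ~2GB), so real libraries chunk data across several
// buffers, or use the long-indexed MemorySegment from JEP 442.
public final class OffHeapVectors {
    private final ByteBuffer buffer;
    private final int dimensions;

    public OffHeapVectors(int count, int dimensions) {
        this.dimensions = dimensions;
        this.buffer = ByteBuffer.allocateDirect(count * dimensions * Float.BYTES)
                                .order(ByteOrder.nativeOrder());
    }

    // Write one vector at a slot; duplicate() keeps this thread-safe
    // with respect to the shared buffer's position.
    public void put(int index, float[] vector) {
        ByteBuffer view = buffer.duplicate();
        view.position(index * dimensions * Float.BYTES);
        for (float f : vector) view.putFloat(f);
    }

    // Read one vector back into an on-heap float[].
    public float[] get(int index) {
        float[] out = new float[dimensions];
        ByteBuffer view = buffer.duplicate();
        view.position(index * dimensions * Float.BYTES);
        for (int i = 0; i < dimensions; i++) out[i] = view.getFloat();
        return out;
    }
}
```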
-
Good news about the new vector model in ArcadeDB. I'm working on the branch "vector-model" and this is the latest update:
First Performance Results
Short term todo list (in a week or less)
Open Questions
-
First PR is out! #1148
-
Inspired by the Qdrant benchmark page (https://qdrant.tech/benchmarks/#filtered-search-benchmark), I downloaded the GloVe-100-Angular dataset (100 dimensions, 1.2M entries) and loaded it into ArcadeDB. Below are some numbers from my MacBook Pro 2019 (Intel CPU, 32GB RAM):
I can't tell what the accuracy of the responses is; I guess it should be measured against distances calculated over all the vector embeddings.
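That measurement is usually expressed as recall@k: run an exact brute-force top-k search over all the embeddings as ground truth, then count how many of the approximate index's results appear in it. A minimal sketch, assuming result ids are already collected into lists (the helper names are hypothetical, not ArcadeDB API):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Recall@k: fraction of the exact (brute-force) top-k results that
// the approximate index also returned. 1.0 means perfect accuracy.
public final class Recall {
    public static double recallAtK(List<Integer> approximate, List<Integer> exact) {
        Set<Integer> truth = new HashSet<>(exact);
        long hits = approximate.stream().filter(truth::contains).count();
        return (double) hits / exact.size();
    }
}
```

The Qdrant benchmarks report exactly this metric, averaged over a few thousand query vectors, which is what makes throughput numbers from different engines comparable.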
-
ArcadeDB is a general-purpose database, combining document, graph and time-series models. However, there is a new kind of database on the horizon that has emerged over the last 1-2 years: "vector" databases, for efficiently storing "vector embeddings". These are used in the latest crop of generative AI of 2022, and are believed to be the next frontier in ML/AI search and inference. Maybe ArcadeDB can add such capabilities as well?
Here is a list of several such projects, most of them open source and written in Go. Check out the associated blogs and architecture posts to get a better understanding of why they are needed and what problems they address:
Although most of the above are billed as "open source", there is typically a commercial entity behind each that provides a paid-for cloud service. The interest seems to be huge: Pinecone, for example, did a $700m investment round; I'm not sure about the rest. (I am not advocating VC-funded open-source databases!)
This thread from Hacker News is also super useful.