Tweak parquet BEIR yaml configs #2609

Draft
wants to merge 2 commits into
base: master
2 changes: 1 addition & 1 deletion bin/run.sh
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
#!/bin/sh

-java -cp `ls target/*-fatjar.jar` -Xms512M -Xmx64G --add-modules jdk.incubator.vector $@
+java -cp `ls target/*-fatjar.jar` -Xms512M -Xmx128G --add-modules jdk.incubator.vector $@
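
One way to confirm that the larger `-Xmx` value is actually picked up is to print the heap ceiling the JVM reports. The sketch below is not part of this PR; `HeapCheck` and `toGiB` are hypothetical names used only for illustration, and the reported figure may differ slightly from the requested `-Xmx` depending on the garbage collector.

```java
// Hypothetical helper, not part of the PR: reports the heap ceiling the JVM received.
final class HeapCheck {
    /** Converts a byte count to whole gibibytes, rounding down. */
    static long toGiB(long bytes) {
        return bytes / (1024L * 1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() returns the maximum heap the JVM will attempt to use, in bytes.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toGiB(maxBytes) + " GiB");
    }
}
```

Running this class with the same `-Xms`/`-Xmx` flags as `bin/run.sh` should show whether the host can actually grant the requested heap.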
@@ -49,24 +49,19 @@ public class ParquetDenseVectorDocumentGenerator<T extends SourceDocument> imple
public Document createDocument(T src) throws InvalidDocumentException {

try {
LOG.info("Processing document ID: " + src.id() + " with thread: " + Thread.currentThread().getName());

// Parse vector data from document contents
float[] contents = parseVectorFromString(src.contents());
if (contents == null || contents.length == 0) {
LOG.error("Vector data is null or empty for document ID: " + src.id());
throw new InvalidDocumentException();
}

LOG.info("Vector length: " + contents.length + " for document ID: " + src.id());

// Create and populate the Lucene document
final Document document = new Document();
document.add(new StringField(Constants.ID, src.id(), Field.Store.YES));
document.add(new BinaryDocValuesField(Constants.ID, new BytesRef(src.id())));
document.add(new KnnFloatVectorField(Constants.VECTOR, contents, VectorSimilarityFunction.DOT_PRODUCT));

LOG.info("Document created for ID: " + src.id());
return document;

} catch (Exception e) {
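
Because the field is indexed with `VectorSimilarityFunction.DOT_PRODUCT`, which Lucene documents as intended for unit-length vectors, a norm check alongside the null/empty check may be worth considering. The sketch below is an illustration only and not part of this PR; `UnitNormCheck`, `l2Norm`, `isUnitLength`, and the tolerance value are assumptions.

```java
// Hypothetical helper, not part of the PR: checks that a vector is (approximately) unit length.
final class UnitNormCheck {
    /** Computes the Euclidean (L2) norm, accumulating in double to limit rounding error. */
    static double l2Norm(float[] v) {
        double sum = 0.0;
        for (float x : v) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    /** Returns true when the norm is within {@code tol} of 1.0. */
    static boolean isUnitLength(float[] v, double tol) {
        return Math.abs(l2Norm(v) - 1.0) <= tol;
    }

    public static void main(String[] args) {
        float[] normalized = {0.6f, 0.8f};
        float[] raw = {3f, 4f};
        System.out.println(isUnitLength(normalized, 1e-4)); // unit-length example
        System.out.println(isUnitLength(raw, 1e-4));        // norm is 5, so not unit length
    }
}
```

Whether a failed check should raise `InvalidDocumentException` or only log a warning is a design choice left to the maintainers.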
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-arguana.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/arguana.parquet

-index_path: indexes/parquet/arguana
+index_path: indexes/lucene-flat.beir-v1.0.0-arguana.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
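
The new `index_path` values in these configs all follow the same pattern: `indexes/lucene-flat.` followed by the `corpus` value and a trailing slash. A minimal sketch of that convention is shown below; `IndexPathConvention` and `indexPath` are hypothetical names, and extending the pattern to index types other than `flat` is an assumption rather than something this PR shows.

```java
// Hypothetical helper, not part of the PR: derives index_path from index_type and corpus.
final class IndexPathConvention {
    /** Builds "indexes/lucene-<indexType>.<corpus>/" as used by the updated configs. */
    static String indexPath(String indexType, String corpus) {
        return "indexes/lucene-" + indexType + "." + corpus + "/";
    }

    public static void main(String[] args) {
        System.out.println(indexPath("flat", "beir-v1.0.0-arguana.bge-base-en-v1.5"));
    }
}
```

A helper like this could be used to verify that each YAML file's `index_path` agrees with its `corpus` field.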
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-bioasq.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/bioasq.parquet

-index_path: indexes/parquet/bioasq
+index_path: indexes/lucene-flat.beir-v1.0.0-bioasq.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-climate-fever.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/climate-fever.parquet

-index_path: indexes/parquet/climate-fever
+index_path: indexes/lucene-flat.beir-v1.0.0-climate-fever.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-android.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-android.parquet

-index_path: indexes/parquet/cqadupstack-android
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-android.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-english.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-english.parquet

-index_path: indexes/parquet/cqadupstack-english
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-english.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-gaming.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-gaming.parquet

-index_path: indexes/parquet/cqadupstack-gaming
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-gaming.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-gis.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-gis.parquet

-index_path: indexes/parquet/cqadupstack-gis
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-gis.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-mathematica.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-mathematica.parquet

-index_path: indexes/parquet/cqadupstack-mathematica
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-mathematica.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-physics.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-physics.parquet

-index_path: indexes/parquet/cqadupstack-physics
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-physics.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-programmers.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-programmers.parquet

-index_path: indexes/parquet/cqadupstack-programmers
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-programmers.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-stats.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-stats.parquet

-index_path: indexes/parquet/cqadupstack-stats
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-stats.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-tex.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-tex.parquet

-index_path: indexes/parquet/cqadupstack-tex
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-tex.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-unix.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-unix.parquet

-index_path: indexes/parquet/cqadupstack-unix
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-unix.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-webmasters.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-webmasters.parquet

-index_path: indexes/parquet/cqadupstack-webmasters
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-webmasters.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-cqadupstack-wordpress.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/cqadupstack-wordpress.parquet

-index_path: indexes/parquet/cqadupstack-wordpress
+index_path: indexes/lucene-flat.beir-v1.0.0-cqadupstack-wordpress.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-dbpedia-entity.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/dbpedia-entity.parquet

-index_path: indexes/parquet/dbpedia-entity
+index_path: indexes/lucene-flat.beir-v1.0.0-dbpedia-entity.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-fever.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/fever.parquet

-index_path: indexes/parquet/fever
+index_path: indexes/lucene-flat.beir-v1.0.0-fever.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-fiqa.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/fiqa.parquet

-index_path: indexes/parquet/fiqa
+index_path: indexes/lucene-flat.beir-v1.0.0-fiqa.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-hotpotqa.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/hotpotqa.parquet

-index_path: indexes/parquet/hotpotqa
+index_path: indexes/lucene-flat.beir-v1.0.0-hotpotqa.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-nfcorpus.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus.parquet

-index_path: indexes/parquet/nfcorpus
+index_path: indexes/lucene-flat.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-nq.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/nq.parquet

-index_path: indexes/parquet/nq
+index_path: indexes/lucene-flat.beir-v1.0.0-nq.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-quora.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/quora.parquet

-index_path: indexes/parquet/quora
+index_path: indexes/lucene-flat.beir-v1.0.0-quora.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-robust04.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/robust04.parquet

-index_path: indexes/parquet/robust04
+index_path: indexes/lucene-flat.beir-v1.0.0-robust04.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-scidocs.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/scidocs.parquet

-index_path: indexes/parquet/scidocs
+index_path: indexes/lucene-flat.beir-v1.0.0-scidocs.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-scifact.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/scifact.parquet

-index_path: indexes/parquet/scifact
+index_path: indexes/lucene-flat.beir-v1.0.0-scifact.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-signal1m.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/signal1m.parquet

-index_path: indexes/parquet/signal1m
+index_path: indexes/lucene-flat.beir-v1.0.0-signal1m.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-trec-covid.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/trec-covid.parquet

-index_path: indexes/parquet/trec-covid
+index_path: indexes/lucene-flat.beir-v1.0.0-trec-covid.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-trec-news.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/trec-news.parquet

-index_path: indexes/parquet/trec-news
+index_path: indexes/lucene-flat.beir-v1.0.0-trec-news.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator
@@ -1,7 +1,7 @@
corpus: beir-v1.0.0-webis-touche2020.bge-base-en-v1.5
corpus_path: collections/beir-v1.0.0/bge-base-en-v1.5/webis-touche2020.parquet

-index_path: indexes/parquet/webis-touche2020
+index_path: indexes/lucene-flat.beir-v1.0.0-webis-touche2020.bge-base-en-v1.5/
index_type: flat
collection_class: ParquetDenseVectorCollection
generator_class: ParquetDenseVectorDocumentGenerator