File Formats

General Rules

To simplify storing results, a common format is maintained:

A single line in a file corresponds to information about a single query.
When the information consists of a single value, it's simply written down, e.g., a query: yerba mate.
When the information is a list of simple values, it is single-space-delimited, e.g., a list of document IDs: 0 10 15 100 145 1000.
When the information consists of multiple values of different semantic type (or two corresponding lists), we divide them into separate files, e.g., tuples (query, queryLength) or pairs of lists ([documentId], [score]).
When the information consists of nested lists, it is divided into separate files (with the indices saved in the file name) until each file holds either a single value or a list of simple values per line (see above). To illustrate this, assume that we want to store the scores of consecutive documents returned by each of K shards. We would then have K files named <basename>#k.score, where k is the shard number. Each line of the k-th file would contain the scores of consecutive documents for the corresponding query.
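
The rules above can be sketched in a few lines of Python. This is only an illustration, not part of any existing tool: the function name `write_shard_scores` and the choice to number shards from 0 are assumptions.

```python
def write_shard_scores(basename, scores_per_shard):
    """Write one <basename>#k.score file per shard (hypothetical helper).

    scores_per_shard[k][q] is the list of scores that shard k returned
    for query q. Shards are numbered from 0 here, which is an assumption.
    """
    paths = []
    for k, per_query in enumerate(scores_per_shard):
        path = f"{basename}#{k}.score"
        with open(path, "w", encoding="utf-8") as f:
            for scores in per_query:
                # One line per query; values are separated by a single space.
                f.write(" ".join(str(s) for s in scores) + "\n")
        paths.append(path)
    return paths
```

A query for which a shard returned no documents produces an empty line, so line numbers stay aligned with the queries across all files.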

This section describes available file formats.

*.queries

Lines containing queries.

yerba mate
labradoodle
the meatball shop
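
Reading such a file back is straightforward. A minimal sketch, assuming UTF-8 encoding and a hypothetical helper name `read_queries`:

```python
def read_queries(path):
    """Return the queries from a *.queries file, one per line."""
    with open(path, encoding="utf-8") as f:
        # Strip only the line terminator so that spaces inside a query survive.
        return [line.rstrip("\n") for line in f]
```

The position of a query in the returned list matches its line number in every other file that belongs to the same run.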

*.results

*.score

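IDs (*.results) and scores (*.score) of consecutive documents returned by the search engine for each query. The two files hold corresponding lists: the i-th score on a line belongs to the i-th ID on the same line of the paired file.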

Note: It is assumed that the IDs are sorted in ascending order.

*.results

1 2 3 4 5 6 7 8 9 10
10 1001
67 190 2004

*.score

1.2 1.1 0.9 0.85 0.8 0.6 0.55 0.5 0.5 0.3
2.0 1.0
1.0 1.0 0.99
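
Because the two files hold corresponding lists, they are naturally read together. The sketch below pairs each document ID with its score; the function name `read_results` is hypothetical, and integer IDs and floating-point scores are assumed from the examples above.

```python
def read_results(results_path, score_path):
    """Return, for each query, a list of (document_id, score) pairs."""
    per_query = []
    with open(results_path) as rf, open(score_path) as sf:
        # zip stops at the shorter file; both files are expected to have
        # one line per query and therefore the same number of lines.
        for line_no, (ids_line, scores_line) in enumerate(zip(rf, sf), start=1):
            ids = [int(x) for x in ids_line.split()]
            scores = [float(x) for x in scores_line.split()]
            if len(ids) != len(scores):
                raise ValueError(
                    f"line {line_no}: {len(ids)} IDs but {len(scores)} scores"
                )
            per_query.append(list(zip(ids, scores)))
    return per_query
```

Checking the lengths line by line catches files that have drifted out of sync, which would otherwise silently pair scores with the wrong documents.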

*#k.results

*#k.score

The same formats as *.results and *.score, stored separately for each shard, where k is the shard number.
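
To collect the per-shard files of a run, one can match the `#k` part of the file name. A minimal sketch, assuming k is written as a plain decimal number; the helper name `shard_files` is hypothetical.

```python
import glob
import re


def shard_files(basename, extension):
    """Map each shard number k to the path <basename>#k.<extension>."""
    pattern = re.compile(
        re.escape(basename) + r"#(\d+)\." + re.escape(extension)
    )
    found = {}
    # glob.escape protects any wildcard characters in the basename itself.
    for path in glob.glob(glob.escape(basename) + "#*." + extension):
        match = pattern.fullmatch(path)
        if match:  # skip names whose shard part is not a number
            found[int(match.group(1))] = path
    # Sort numerically so that shard 10 comes after shard 2.
    return dict(sorted(found.items()))
```

For example, `shard_files("run", "score")` would return the `run#k.score` files in the current directory, keyed by k.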
