File Formats

Michał Siedlaczek edited this page Jan 27, 2017 · 1 revision

General Rules

To simplify storing results, a common format is maintained:

  • A single line in a file corresponds to information about a single query.
  • When the information consists of a single value, it's simply written down, e.g., a query: yerba mate.
  • When the information is a list of simple values, it is single-space-delimited, e.g., a list of document IDs: 0 10 15 100 145 1000.
  • When the information consists of multiple values of different semantic types (or two corresponding lists), we divide them into separate files, e.g., tuples (query, queryLength) or pairs of lists ([documentId], [score]).
  • When the information consists of nested lists, it is divided into separate files (with the indices saved in the file name) until what remains to be written to each file is a single value or a list of simple values (see above). To illustrate, assume that we want to store the scores of consecutive documents returned by each of K shards. We would then have K files: <basename>#k.score, where k is the shard number. Each line in the k-th file would consist of the scores of consecutive documents for the corresponding query.
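
The nested-list rule can be illustrated with a minimal Python sketch. It is not part of any existing tool: the function name write_shard_scores, the basename run, and the score values are made up for illustration, and numbering shards from 0 is an assumption, since this page does not say where k starts.

```python
import tempfile
from pathlib import Path


def write_shard_scores(basename, scores_by_shard):
    """Write one <basename>#k.score file per shard, one line per query.

    scores_by_shard[k][q] is the list of scores shard k returned for query q.
    Shards are numbered from 0 here, which is an assumption.
    """
    paths = []
    for k, per_query in enumerate(scores_by_shard):
        path = Path(f"{basename}#{k}.score")
        # Each query becomes a single-space-delimited line.
        lines = [" ".join(str(s) for s in scores) for scores in per_query]
        path.write_text("".join(line + "\n" for line in lines))
        paths.append(path)
    return paths


# Made-up scores for 2 shards and 2 queries.
scores_by_shard = [
    [[1.2, 0.9], [2.0]],
    [[1.1, 0.85, 0.8], [1.0]],
]
out_dir = tempfile.mkdtemp()
paths = write_shard_scores(str(Path(out_dir) / "run"), scores_by_shard)
```

The outer list index ends up in the file name and the middle index becomes the line number, so each file holds only flat lists of simple values.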

File Descriptions

This section describes the available file formats.

Queries

File Name Pattern

*.queries

Description

Each line contains a single query.

Example

yerba mate
labradoodle
the meatball shop

Results

File Name Pattern

*.results

*.score

Description

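IDs (*.results) and scores (*.score) of consecutive documents returned by the search engine for each query. Following the general rules, the n-th line of each file describes the same query.

Note: The IDs are assumed to be sorted in ascending order.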

Example

*.results

1 2 3 4 5 6 7 8 9 10
10 1001
67 190 2004

*.score

1.2 1.1 0.9 0.85 0.8 0.6 0.55 0.5 0.5 0.3
2.0 1.0
1.0 1.0 0.99
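
As a rough sketch of how the two files fit together, the Python below pairs each document ID with its score, line by line, using the example data above. The helper names parse_line and pair_results are hypothetical. The checks that both lines have the same length and that IDs are ascending are assumptions drawn from the rules on this page, not a description of any existing reader.

```python
def parse_line(line, cast):
    """Split a single-space-delimited line and convert each token."""
    return [cast(token) for token in line.split()]


def pair_results(results_lines, score_lines):
    """Return, per query, a list of (document ID, score) pairs."""
    if len(results_lines) != len(score_lines):
        raise ValueError("both files must contain one line per query")
    paired = []
    for ids_line, scores_line in zip(results_lines, score_lines):
        ids = parse_line(ids_line, int)
        scores = parse_line(scores_line, float)
        if len(ids) != len(scores):
            raise ValueError("ID and score lists differ in length")
        if ids != sorted(ids):
            raise ValueError("IDs are expected in ascending order")
        paired.append(list(zip(ids, scores)))
    return paired


# Lines copied from the *.results and *.score examples.
results_lines = ["1 2 3 4 5 6 7 8 9 10", "10 1001", "67 190 2004"]
score_lines = [
    "1.2 1.1 0.9 0.85 0.8 0.6 0.55 0.5 0.5 0.3",
    "2.0 1.0",
    "1.0 1.0 0.99",
]
paired = pair_results(results_lines, score_lines)
```

For the second query, this yields document 10 with score 2.0 and document 1001 with score 1.0.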

Shard Results

File Name Pattern

*#k.results

*#k.score

where k is the shard number.
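
One possible way to recover the shard number from such a file name is sketched below in Python. The helper parse_shard_file_name and the regular expression are illustrative assumptions: the sketch takes k to be a non-negative decimal integer and treats the last # in the name as the separator, neither of which this page states explicitly.

```python
import re

# <basename>#<k>.results or <basename>#<k>.score, with k assumed to be decimal.
SHARD_FILE = re.compile(r"(?P<basename>.+)#(?P<shard>\d+)\.(?P<kind>results|score)")


def parse_shard_file_name(name):
    """Return (basename, shard number, kind), or None if the name does not match."""
    match = SHARD_FILE.fullmatch(name)
    if match is None:
        return None
    return match["basename"], int(match["shard"]), match["kind"]


parsed = parse_shard_file_name("run#3.score")
```

A name without the #k part, such as run.score, does not match and is reported as None, which keeps plain result files separate from shard files.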