File Formats
Michał Siedlaczek edited this page Jan 27, 2017 · 1 revision
To simplify storing results, a common format is maintained:
- A single line in a file corresponds to information about a single query.
- When the information consists of a single value, it is simply written down, e.g., a query: `yerba mate`.
- When the information is a list of simple values, it is single-space-delimited, e.g., a list of document IDs: `0 10 15 100 145 1000`.
- When the information consists of multiple values of different semantic types (or two corresponding lists), we divide them into separate files, e.g., tuples `(query, queryLength)` or pairs of lists `([documentId], [score])`.
- When the information consists of nested lists, it is divided into separate files (saving indices in the file name) until what remains is a single value or a list of simple values, which is then written down to a file (see above). To illustrate this, let us assume that we want to store scores of consecutive documents returned by each of `K` shards. Then, we would have `K` files: `<basename>#k.score`, where `k` is the shard number. Each line in the `k`-th file would consist of scores of consecutive documents for the given queries.
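The rules above can be sketched in Python as follows. This is only an illustration, not part of the project: the function names are hypothetical, and numbering shards from 0 is an assumption, since the page does not say where `k` starts.

```python
from pathlib import Path


def format_line(values):
    """Join a list of simple values into one single-space-delimited line."""
    return " ".join(str(v) for v in values)


def parse_line(line, convert=str):
    """Split one line back into a list of values, converting each token."""
    return [convert(token) for token in line.split()]


def write_shard_scores(basename, scores):
    """Write nested scores to <basename>#k.score, one query per line.

    `scores` is indexed first by shard, then by query; the innermost
    list holds the scores of consecutive documents.
    Assumption: shards are numbered from 0.
    """
    paths = []
    for k, per_query in enumerate(scores):
        path = Path(f"{basename}#{k}.score")
        lines = [format_line(doc_scores) for doc_scores in per_query]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

The shard index moves into the file name and the query index becomes the line number, so each file ends up holding only lists of simple values.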
This section describes available file formats.
*.queries
Lines containing queries.
yerba mate
labradoodle
the meatball shop
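A minimal sketch of reading such a file in Python, assuming UTF-8 encoding (the page does not specify one); `read_queries` is a hypothetical helper name. Because a query is a single value, the whole line is kept intact rather than split on spaces.

```python
def read_queries(path):
    """Return the queries stored in a *.queries file, one per line."""
    with open(path, encoding="utf-8") as handle:
        # Strip only the line terminator; spaces inside a query are part of it.
        return [line.rstrip("\n") for line in handle]
```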
*.results
*.score
IDs and scores of consecutive documents returned by the search engine for the query.
Note: It is assumed that the IDs on each line are sorted in ascending order.
*.results
1 2 3 4 5 6 7 8 9 10
10 1001
67 190 2004
*.score
1.2 1.1 0.9 0.85 0.8 0.6 0.55 0.5 0.5 0.3
2.0 1.0
1.0 1.0 0.99
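Since the two files are corresponding lists, line `i` of one pairs with line `i` of the other. The sketch below shows one way a reader might combine them and check the ordering assumption; the helper names are hypothetical, and treating IDs as integers and scores as floats is an assumption based on the examples.

```python
def read_rows(path, convert):
    """Read a file of single-space-delimited values, one list per line."""
    with open(path, encoding="utf-8") as handle:
        return [[convert(token) for token in line.split()] for line in handle]


def read_results_with_scores(results_path, score_path):
    """Pair each document ID with its score, query by query."""
    ids_per_query = read_rows(results_path, int)
    scores_per_query = read_rows(score_path, float)
    if len(ids_per_query) != len(scores_per_query):
        raise ValueError("files describe a different number of queries")
    paired = []
    rows = zip(ids_per_query, scores_per_query)
    for line_no, (ids, scores) in enumerate(rows, start=1):
        if len(ids) != len(scores):
            raise ValueError(f"line {line_no}: {len(ids)} IDs but {len(scores)} scores")
        if ids != sorted(ids):
            raise ValueError(f"line {line_no}: IDs are not in ascending order")
        paired.append(list(zip(ids, scores)))
    return paired
```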
*#k.results
*#k.score
where k is the shard number.
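Collecting the per-shard files for a given basename could look like the following sketch. `find_shard_files` is a hypothetical helper, and it assumes that `k` is written as a non-negative decimal integer directly after the `#`.

```python
import re
from pathlib import Path


def find_shard_files(basename, extension):
    """Map each shard number k to the path <basename>#k.<extension>."""
    base = Path(basename)
    # Escape both parts so they are matched literally, not as regex syntax.
    pattern = re.compile(
        re.escape(base.name) + r"#(\d+)\." + re.escape(extension)
    )
    shards = {}
    for candidate in base.parent.iterdir():
        match = pattern.fullmatch(candidate.name)
        if match:
            shards[int(match.group(1))] = candidate
    # Sort numerically so that shard 10 comes after shard 2.
    return dict(sorted(shards.items()))
```

For example, `find_shard_files("runs/test", "score")` would pick up `runs/test#0.score` and `runs/test#1.score` but skip `runs/test.score`.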