TREC Washington Post Corpus

Synopsis

The TREC Washington Post Corpus contains 608,180 news articles and blog posts from January 2012 through August 2017. It was originally used for the Common Core Track at TREC 2018 (http://trec-core.github.io/2018/). The original document collection contains duplicate docids; these duplicates have been removed from the dataset stored here, leaving 595,037 documents. The documents are stored in a single JSON Lines file (http://jsonlines.org/).
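
The duplicate removal described above can be illustrated with a minimal sketch. This is not the code from the scripts folder: the function name `drop_duplicate_docids`, the field name `id`, and the choice to keep the first occurrence of each docid are all assumptions made for illustration.

```python
import json
from typing import Iterable, Iterator


def drop_duplicate_docids(lines: Iterable[str], id_field: str = "id") -> Iterator[str]:
    """Yield JSON Lines records, keeping only the first record seen for each docid.

    Assumption: the docid is stored under `id_field` in every record.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        docid = json.loads(line)[id_field]
        if docid in seen:
            continue  # a record with this docid was already emitted
        seen.add(docid)
        yield line


# Made-up sample records; the real file would be read line by line instead.
sample = [
    '{"id": "a1", "title": "First"}',
    '{"id": "b2", "title": "Second"}',
    '{"id": "a1", "title": "First again"}',
]
unique = list(drop_duplicate_docids(sample))
print(len(unique))
```

Because the function consumes any iterable of lines, an open file handle can be passed in directly, so the whole collection never has to be held in memory; only the set of seen docids is.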

Files and Folders

archives: contains the original files
data: contains the JSON Lines file
scripts: contains the Python scripts for duplicate removal
license-agreement: contains the license agreement
topics-and-qrels: contains txt files with 50 topics and the corresponding qrels
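
As a rough sketch of how the qrels could be loaded, the snippet below assumes the files follow the standard TREC qrels layout of four whitespace-separated columns (topic, iteration, docid, relevance). The function name `parse_qrels` and the sample lines are hypothetical and do not come from this repository.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each topic id to a {docid: relevance} dictionary.

    Assumption: each line has the four standard TREC qrels columns.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docid, relevance = parts
        qrels[topic][docid] = int(relevance)
    return dict(qrels)


# Made-up judgments for illustration only.
sample_qrels = [
    "321 0 doc-a 1",
    "321 0 doc-b 0",
    "336 0 doc-c 2",
]
judgments = parse_qrels(sample_qrels)
print(judgments["321"])
```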

Research and Use Cases

This dataset will also be used in CENTRE@CLEF2019 (http://www.centre-eval.org/clef2019/). This track focuses on the replicability, reproducibility, and generalizability of retrieval systems. We are planning to participate in the CENTRE track.

License Information

@pschaer signed a license agreement, which can be found in the license-agreement folder.

Data Source

The original data can be retrieved from NIST:
https://ir.nist.gov/wapo/

Topic and relevance files can be retrieved from:
https://trec.nist.gov/act_part/tracks2018.html

Publications

Alexander Bondarenko, Michael Völske, Alexander Panchenko, Chris Biemann, Benno Stein, and Matthias Hagen. Webis at TREC 2018: Common Core Track. In Ellen M. Voorhees and Angela Ellis, editors, 27th International Text Retrieval Conference (TREC 2018), NIST Special Publication, November 2018. National Institute of Standards and Technology (NIST).
