Tipster Corpus

Synopsis

The complete Tipster Corpus consists of a test collection built at NIST for the TIPSTER project and the related TREC project. The test collection is made up of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.

The documents in the test collection vary in style, size and subject domain. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging and no breakdown into individual sentences or paragraphs, because the purpose of this collection is to test retrieval against real-world data.
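As a rough illustration of how such SGML-like markup can be read, here is a minimal Python sketch. It assumes each document is wrapped in `<DOC>` ... `</DOC>` and that fields such as `<DOCNO>` and `<TEXT>` appear as simple paired tags; the exact tag names differ between sub-collections, so check the DTD files before relying on them. The function name `iter_documents` and the sample document numbers are hypothetical and only serve the example.

```python
import re
from typing import Dict, Iterator

# Assumption: documents are delimited by <DOC>...</DOC> and fields are
# simple paired upper-case tags. Real files may use other tags; see the DTDs.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<([A-Z][A-Z0-9]*)>(.*?)</\1>", re.DOTALL)


def iter_documents(sgml_text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per document, mapping top-level tag name to its text."""
    for doc_match in DOC_RE.finditer(sgml_text):
        fields: Dict[str, str] = {}
        for tag, value in FIELD_RE.findall(doc_match.group(1)):
            # Keep the first occurrence of each field and trim whitespace.
            fields.setdefault(tag, value.strip())
        yield fields


# Made-up sample input, not taken from the corpus.
sample = """<DOC>
<DOCNO> AP890101-0001 </DOCNO>
<HEAD>Example headline</HEAD>
<TEXT>
Example body text.
</TEXT>
</DOC>
<DOC>
<DOCNO> AP890101-0002 </DOCNO>
<TEXT>
Another body.
</TEXT>
</DOC>
"""

docs = list(iter_documents(sample))
for doc in docs:
    print(doc["DOCNO"], "-", doc["TEXT"])
```

A regular expression is enough here because the markup is flat and well delimited; a full SGML parser would be the safer choice if the files are validated against the DTDs.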

Files and Folders

Tipster Vol. 1

  • AP includes copyrighted stories from the AP Newswire (1989)
  • DOE includes short abstracts from the Department of Energy
  • DTD contains the DTD files for the AP, DOE, FR, WSJ and ZIFF files
  • FR includes whole issues of the Federal Register (1989)
  • WSJ includes copyrighted stories from the Wall Street Journal (1987, 1988, 1989)
  • ZIFF includes information from the Computer Select disks (1989/1990, Ziff-Davis Publishing)

Tipster Vol. 2

  • AP includes copyrighted stories from the AP Newswire (1988)
  • DTD contains the DTD files for the AP, FR, WSJ and ZIFF files
  • FR includes more issues of the Federal Register (1988)
  • WSJ includes copyrighted stories from the Wall Street Journal (1990, 1991, 1992)
  • ZIFF includes more information from the Computer Select disks (1989/1990, Ziff-Davis Publishing)

Tipster Vol. 3

  • AP includes copyrighted stories from the AP Newswire (1990)
  • DTD contains the DTD files for the AP, PATENTS, SJM and ZIFF files
  • PATENTS includes U.S. Patent Documents (1983-1991)
  • SJM includes copyrighted stories from the San Jose Mercury News (1991)
  • ZIFF includes information from the Computer Select disks (1991/1992, Ziff-Davis Publishing)

These disks are a revision of the first set of disks. The following files detail the changes between the previous set of disks and these disks:

  • README.doc: a detailed list of the changes made to disk1, disk2 and disk3, as well as to the qrels files.
  • README.d1: a mapping of old document numbers to new document numbers for the ZIFF data on disk one.
  • README.d2: a mapping of old document numbers to new document numbers for the ZIFF data on disk two.
  • README.tag: a mapping of old tag names to new tag names for the tags that were changed on all three disks.

Research and Use Cases

License Information

Data Source

https://trec.nist.gov/data/qa/T8_QAdata/disks4_5.html

Publications