The complete Tipster Corpus comprises a test collection built at NIST for the TIPSTER project and the related TREC project. The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.
The documents in the test collection vary in style, size and subject domain. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs, as the purpose of this collection is to test retrieval against real-world data.
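As a rough illustration of how SGML-like tags can separate documents and document fields, the following Python sketch splits a text into documents and extracts their top-level fields. The tag names (DOC, DOCNO, HEAD, TEXT), the function name iter_documents and the sample text are assumptions made up for this example; the actual tag names are defined by the DTD files on each disk and may differ between sources. Nested fields are not handled.

```python
import re
from typing import Dict, Iterator

# Assumed tag names for illustration only; the real names are defined
# by the DTD files shipped on each disk.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<([A-Z][A-Z0-9]*)>(.*?)</\1>", re.DOTALL)


def iter_documents(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <DOC> block, mapping field tag to stripped text."""
    for doc_match in DOC_RE.finditer(text):
        fields: Dict[str, str] = {}
        for field_match in FIELD_RE.finditer(doc_match.group(1)):
            tag = field_match.group(1)
            value = field_match.group(2).strip()
            # A tag that occurs more than once is concatenated, not overwritten.
            fields[tag] = fields[tag] + "\n" + value if tag in fields else value
        yield fields


# Made-up sample data; not taken from the corpus.
sample = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<HEAD>Hypothetical headline</HEAD>
<TEXT>
First made-up document body.
</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>
Second made-up document body.
</TEXT>
</DOC>
"""

docs = list(iter_documents(sample))
for doc in docs:
    print(doc["DOCNO"], "->", doc["TEXT"])
```

A regular expression is used here instead of an XML parser because SGML-like text is not guaranteed to be well-formed XML (for example, it may contain unescaped ampersands).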
Tipster Vol. 1
AP
includes copyrighted stories from the AP Newswire (1989)
DOE
includes short abstracts from the Department of Energy
DTD
contains the DTD files for the AP, DOE, FR, WSJ and ZIFF files
FR
includes whole issues of the Federal Register (1989)
WSJ
includes copyrighted stories from the Wall Street Journal (1987, 1988, 1989)
ZIFF
includes information from the Computer Select disks (1989/1990, Ziff-Davis Publishing)
Tipster Vol. 2
AP
includes copyrighted stories from the AP Newswire (1988)
DTD
contains the DTD files for the AP, FR, WSJ and ZIFF files
FR
includes more issues of the Federal Register (1989)
WSJ
includes copyrighted stories from the Wall Street Journal (1990, 1991, 1992)
ZIFF
includes more information from the Computer Select disks (1989/1990, Ziff-Davis Publishing)
Tipster Vol. 3
AP
includes copyrighted stories from the AP Newswire (1990)
DTD
contains the DTD files for the AP, PATENTS, SJM and ZIFF files
PATENTS
includes U.S. Patent Documents (1983-1991)
SJM
includes copyrighted stories from the San Jose Mercury News (1991)
ZIFF
includes information from the Computer Select disks (1991/1992, Ziff-Davis Publishing)
These disks represent a revision of the first set of disks. The following files detail the changes between the previous set of disks and these disks:
README.doc
A detailed list of changes that were made to disk1, disk2 and disk3, as well as to the qrels files.
README.d1
A mapping of old document numbers to new document numbers which were changed in the ziff data on disk one.
README.d2
A mapping of old document numbers to new document numbers which were changed in the ziff data on disk two.
README.tag
A mapping of old tag names to new tag names which were changed on all three disks.