This web page describes RCV1-v2/LYRL2004, a text categorization test collection which is distributed as a set of on-line appendices to a JMLR journal article.
In most cases, the following article should be cited:
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
instead of the web page you are now reading. I will refer to this article as LYRL2004. LYRL2004 contains all the information that the web page does, except formatting details and such.
If for some reason you need to cite this web page, you could cite it as:
Lewis, D. D. RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection (12-Apr-2004 Version). http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
or as appropriate for your bibliographic format. If the web page is cited, we ask that LYRL2004 always be cited as well, since it is part of the archival literature.
The Agreement which researchers must sign to obtain the RCV1 CD-ROMs from Reuters, Ltd. states:
"Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published provided it is not possible to reconstruct the Data from the summary."
Based on this clause, Reuters personnel have stated that distributing term/document matrices is not a violation of the Agreement:
http://groups.yahoo.com/group/ReutersCorpora/message/70
http://groups.yahoo.com/group/ReutersCorpora/message/106
To ensure that the original data cannot be reconstructed, the term/document matrices we distribute as appendices to the LYRL2004 paper remove words from a large stop list (including essentially all linguistic function words), replace the remaining words with stems, and scramble the order of the stems.
While the data described here is available without license, I encourage those using the data to also obtain a copy of the RCV1 CD-ROMs under the standard Agreement. At the very least, I very strongly encourage everyone using this data to respect the intent of clause 3.3 of the RCV1 Agreement:
"3.3 All publications resulting from research carried out using the Data must provide an attribution to Reuters. Such attribution should include a reference to the specific corpus used. You agree to provide a copy of each such publication to Reuters on publication."
Doing so will encourage Reuters to make additional data sets available in the future.
The RCV1-v2/LYRL2004 test collection is made up of a large number of files, which take the form of 18 On-Line Appendices to the LYRL2004 article. We describe the files in the order of the corresponding On-Line Appendix numbers in the LYRL2004 paper.
On-Line Appendix 1 consists of the ASCII file rcv1.topics.txt, which is 488 bytes in size. It contains a list, one per line, of the names of the 103 RCV1 Topics categories that were available to Reuters indexers. See Section 3.2.1 of LYRL2004 for more information.
On-Line Appendix 2 consists of the ASCII file rcv1.topics.hier.orig, which is 6,965 bytes in size. It contains a 104 node hierarchy (tree) of Reuters Topics categories. There is 1 root node, plus nodes for the 103 assignable categories (some of which are leaf nodes and some of which are internal nodes). Each node is represented by a line of the form:
parent: <cat> child: <cat> child-description: <desc>
where <cat> is the name of a category, the string "Root" to indicate the root node, or the string "None" which is a placeholder that does not correspond to a node. There are 104 lines, one for each node, with the node specified in the child field, and the structure specified by giving the parent of each child.
See Section 3.2.1 of LYRL2004 for more information.
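As a rough illustration of the line format above, the following Python sketch reads hierarchy lines into parent and child lookups. The function names, the regular expression, and the sample lines are assumptions made for this example (the sample descriptions are invented, not copied from the file); the real files may pad fields with extra whitespace, which the pattern tolerates.

```python
import re
from collections import defaultdict

# Pattern for: parent: <cat> child: <cat> child-description: <desc>
LINE_RE = re.compile(
    r"^parent:\s*(\S+)\s+child:\s*(\S+)\s+child-description:\s*(.*)$"
)


def parse_hierarchy(lines):
    """Return (parent_of, children_of, description_of) dictionaries."""
    parent_of, description_of = {}, {}
    children_of = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized hierarchy line: {line!r}")
        parent, child, desc = match.groups()
        parent_of[child] = parent
        description_of[child] = desc.strip()
        # "None" is a placeholder, not a node, so it gets no child list.
        if parent != "None":
            children_of[parent].append(child)
    return parent_of, dict(children_of), description_of


def ancestors(cat, parent_of):
    """List the ancestors of a category, nearest first, ending at Root."""
    chain = []
    node = parent_of.get(cat)
    while node is not None and node != "None":
        chain.append(node)
        node = parent_of.get(node)
    return chain


# Invented sample lines in the described format.
SAMPLE_LINES = [
    "parent: None child: Root child-description: Root",
    "parent: Root child: ECAT child-description: illustrative top-level node",
    "parent: ECAT child: E11 child-description: illustrative child node",
]

parent_of, children_of, description_of = parse_hierarchy(SAMPLE_LINES)
print(ancestors("E11", parent_of))
```

The same function should apply to the other hierarchy files that share this format.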
On-Line Appendix 3 consists of the ASCII file rcv1.topics.hier.expanded, which is 7,810 bytes in size. It contains a 117 node hierarchy (tree) of Reuters Topics categories. There is 1 root node, 13 third-level nodes that do not correspond to assignable categories, and 103 nodes for the 103 assignable categories (some of which are leaf nodes and some of which are internal nodes). The format of the file is the same as for On-Line Appendix 2. See Section 3.2.1 of LYRL2004 for more information.
On-Line Appendix 4 consists of the ASCII file rcv1.industries.txt, which is 2688 bytes in size. It contains a list, one per line, of the names of the 354 RCV1 Industry categories that were available to Reuters indexers. See Section 3.2.2 of LYRL2004 for more information.
On-Line Appendix 5 consists of the ASCII file rcv1.industries.hier, which is 30,162 bytes in size. It contains a 365 node hierarchy (tree) of Reuters Industry categories. There is 1 root node, 10 second level nodes which do not correspond to assignable categories, plus nodes for the 354 assignable categories (some of which are leaf nodes and some of which are internal nodes). The format of the file is the same as for On-Line Appendix 2. See Section 3.2.2 of LYRL2004 for more information.
On-Line Appendix 6 consists of the ASCII file rcv1.regions.txt, which is 2065 bytes in size. It contains a list, one per line, of the names of the 366 RCV1 Region categories that were available to Reuters indexers. See Section 3.2.3 of LYRL2004 for more information.
On-Line Appendix 7 consists of the ASCII file rcv1v2-ids.dat.gz, which is 1,715,108 bytes in gzipped form, or 5,527,301 bytes when uncompressed. It contains a list, one per line, of the Reuters-assigned IDs of the 804,414 documents in RCV1-v2. Each line contains a single ID. See Section 4 of LYRL2004 for more information.
On-Line Appendix 8 consists of the ASCII file rcv1-v2.topics.qrels.gz, which is 7,272,130 bytes in gzipped form, or 35,382,548 bytes when uncompressed. It specifies which Topic categories each RCV1-v2 document belongs to. The file has the format of a TREC qrels file, as we now describe. Each category/document pair is specified by a separate one-line record. There are 2,606,875 lines, and each line has the format:
<category name> <did> 1
where <category name> is the name of the category, <did> is a Reuters-assigned document ID, and the 1 is redundant but required for TREC format.
As an example, the records for the first two documents in rcv1-v2.topics.qrels look like this:
E11 2286 1
ECAT 2286 1
M11 2286 1
M12 2286 1
MCAT 2286 1
C24 2287 1
CCAT 2287 1
These indicate that document 2286 belongs to Topic categories E11, ECAT, M11, M12, and MCAT. Document 2287 belongs to Topic categories C24 and CCAT.
See Section 4 of LYRL2004 for more information.
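A minimal Python sketch of reading records in this format into a mapping from document ID to the set of assigned categories is given below. The function name read_qrels is an assumption of this example, and the sample lines are the seven records shown above.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map each document ID to the set of categories assigned to it."""
    doc_to_cats = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # Each record is: <category name> <did> 1
        if len(fields) != 3 or fields[2] != "1":
            raise ValueError(f"unexpected qrels record: {line!r}")
        category, did = fields[0], int(fields[1])
        doc_to_cats[did].add(category)
    return dict(doc_to_cats)


SAMPLE_QRELS = [
    "E11 2286 1",
    "ECAT 2286 1",
    "M11 2286 1",
    "M12 2286 1",
    "MCAT 2286 1",
    "C24 2287 1",
    "CCAT 2287 1",
]

labels = read_qrels(SAMPLE_QRELS)
print(sorted(labels[2287]))
```

To read the distributed file directly, the lines could be taken from gzip.open("rcv1-v2.topics.qrels.gz", "rt", encoding="ascii") in place of the sample list.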
On-Line Appendix 9 consists of the ASCII file rcv1-v2.industries.qrels.gz, which is 2,005,036 bytes in gzipped form, or 9,055,643 bytes when uncompressed. It has 560,922 lines and specifies which Industry categories each RCV1-v2 document belongs to. The file has the same format as On-Line Appendix 8. See Section 4 of LYRL2004 for more information.
On-Line Appendix 10 consists of the ASCII file rcv1-v2.regions.qrels.gz, which is 3,214,040 bytes in gzipped form, or 14,348,799 bytes when uncompressed. It has 1,057,880 lines and specifies which Region categories each RCV1-v2 document belongs to. The file has the same format as On-Line Appendix 8. See Section 4 of LYRL2004 for more information.
On-Line Appendix 11 consists of the ASCII file english.stop, which is 3,589 bytes in size. It contains a list of 571 stop words, one per line, that was developed by the SMART project. See Section 7 of LYRL2004 for more information.
On-Line Appendix 12 consists of ten ASCII files containing tokenized documents. The files fall in two groups.
Five of the files contain the exact RCV1-v2 token files used to produce the vectors that were then used for training and testing supervised learners in LYRL2004. Four files contain test set tokenized documents, and the fifth contains the training set tokenized documents.
In gzipped form the file sizes in bytes are:
lyrl2004_tokens_test_pt0.dat.gz : 44734992
lyrl2004_tokens_test_pt1.dat.gz : 45595102
lyrl2004_tokens_test_pt2.dat.gz : 44507510
lyrl2004_tokens_test_pt3.dat.gz : 42052117
lyrl2004_tokens_train.dat.gz : 5108963
In uncompressed form the file sizes in bytes are:
lyrl2004_tokens_test_pt0.dat : 153955383
lyrl2004_tokens_test_pt1.dat : 156091348
lyrl2004_tokens_test_pt2.dat : 153363982
lyrl2004_tokens_test_pt3.dat : 145174772
lyrl2004_tokens_train.dat : 17590105
The number of documents in each file is:
lyrl2004_tokens_test_pt0.dat : 199328 test documents
lyrl2004_tokens_test_pt1.dat : 199339 test documents
lyrl2004_tokens_test_pt2.dat : 199576 test documents
lyrl2004_tokens_test_pt3.dat : 183022 test documents
lyrl2004_tokens_train.dat : 23149 training documents
There are 23,149 training documents and 781,265 test documents in these files, for a total of 804,414 documents, i.e. all the documents from RCV1-v2 as defined in LYRL2004. The documents have been tokenized, stopworded, and stemmed. Most but not all punctuation was removed during stemming. Note that while the LYRL2004 experiments used the particular training/test split reflected in the files, this split had no impact on how the tokenized documents were created. Therefore, these files could be used in experiments with any other training/test split desired.
Each document in a file is represented in a format used by the SMART text retrieval system. A document has the format:
.I <did>
.W
<textline>+
<blankline>
where we have:
<did> : Reuters-assigned document id.
<textline> : A line of white-space separated strings, one for each token produced by preprocessing for the specified document. These lines never begin with a period followed by an upper case alphabetic character.
<blankline> : A single end of line character.
Each line that begins with ".I" indicates the start of a new document.
Here's an example of the tokenized document file format:
.I 1
.W
now is the time for all good documents
to come to the aid of the ir community

.I 2
.W
i am the best document since i have only one line

.I 3
.W
no i am the best document
Actual tokenized documents are typically longer.
See Section 7 of LYRL2004 for further details.
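The following Python sketch shows one way the SMART-style format above might be read, yielding a document ID and a token list for each document. The generator name read_token_docs is an assumption of this example; the sample input is the toy example given above, and blank lines are simply skipped.

```python
def read_token_docs(lines):
    """Yield (did, tokens) pairs from lines in the .I/.W format."""
    did, tokens = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(".I "):
            # A new document starts; emit the previous one, if any.
            if did is not None:
                yield did, tokens
            did, tokens = int(line.split()[1]), []
        elif line.strip() == ".W":
            continue  # marker introducing the text lines
        elif line.strip():
            if did is None:
                raise ValueError("text line found before any .I line")
            tokens.extend(line.split())
    if did is not None:
        yield did, tokens


SAMPLE_TOKENS = [
    ".I 1",
    ".W",
    "now is the time for all good documents",
    "to come to the aid of the ir community",
    "",
    ".I 2",
    ".W",
    "i am the best document since i have only one line",
    "",
    ".I 3",
    ".W",
    "no i am the best document",
    "",
]

docs = dict(read_token_docs(SAMPLE_TOKENS))
print(len(docs), len(docs[1]))
```

Because the generator processes one line at a time, it can be fed an open (or gzip-opened) file object without loading a whole test file into memory.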
The remaining five files in On-Line Appendix 12 contain tokenized documents that correspond to original RCV1 documents that we did not include in the RCV1-v2 collection, and so are not included in the five files discussed above. They correspond to RCV1 documents that had demonstrably invalid category codes. In gzipped form the file sizes in bytes are:
lyrl2004-non-v2_tokens_test_pt0.dat.gz : 149887
lyrl2004-non-v2_tokens_test_pt1.dat.gz : 171205
lyrl2004-non-v2_tokens_test_pt2.dat.gz : 102370
In uncompressed form the file sizes in bytes are:
lyrl2004-non-v2_tokens_test_pt0.dat : 567844
lyrl2004-non-v2_tokens_test_pt1.dat : 592220
lyrl2004-non-v2_tokens_test_pt2.dat : 357564
lyrl2004-non-v2_tokens_test_pt3.dat : 435917
lyrl2004-non-v2_tokens_train.dat : 161010
The number of documents in each file is:
lyrl2004-non-v2_tokens_test_pt0.dat : 671 documents
lyrl2004-non-v2_tokens_test_pt1.dat : 661 documents
lyrl2004-non-v2_tokens_test_pt2.dat : 424 documents
lyrl2004-non-v2_tokens_test_pt3.dat : 463 documents
lyrl2004-non-v2_tokens_train.dat : 158 documents
These files contain tokenized versions of documents that are in RCV1, but not in RCV1-v2. They were produced in exactly the same fashion as the RCV1-v2 tokenized document files, and have the same format. Of these files, only lyrl2004-non-v2_tokens_train.dat was used in the LYRL2004 experiments. It was used to produce lyrl2004-non-v2_vectors_train.dat which was in turn used (unintentionally) only for generating IDF weights. We include these files for completeness, but most researchers will not need to use them.
See Section 7 of LYRL2004 for further details.
On-Line Appendix 13 consists of ten ASCII files containing document vectors. They were produced using the On-Line Appendix 12 files as the starting point. As discussed in Section 7 of LYRL2004, for most kinds of research the On-Line Appendix 12 files will be more useful.
Five of the files contain the exact RCV1-v2 vectors used for training and testing supervised learners in LYRL2004. Four files contain test set vectors, and the fifth contains the training set vectors. In gzipped form the file sizes in bytes are:
lyrl2004_vectors_test_pt0.dat.gz : 159879168
lyrl2004_vectors_test_pt1.dat.gz : 161878016
lyrl2004_vectors_test_pt2.dat.gz : 158580736
lyrl2004_vectors_test_pt3.dat.gz : 149512192
lyrl2004_vectors_train.dat.gz : 18620416
In uncompressed form the file sizes in bytes are:
lyrl2004_vectors_test_pt0.dat : 367197611
lyrl2004_vectors_test_pt1.dat : 371378053
lyrl2004_vectors_test_pt2.dat : 364319208
lyrl2004_vectors_test_pt3.dat : 343575752
lyrl2004_vectors_train.dat : 42955532
The number of vectors in each file is:
lyrl2004_vectors_test_pt0.dat : 199328 test vectors
lyrl2004_vectors_test_pt1.dat : 199339 test vectors
lyrl2004_vectors_test_pt2.dat : 199576 test vectors
lyrl2004_vectors_test_pt3.dat : 183022 test vectors
lyrl2004_vectors_train.dat : 23149 training vectors
There are 23,149 training vectors and 781,265 test vectors in this data set, for a total of 804,414 vectors, i.e. all vectors from RCV1-v2 as defined in LYRL2004. Vectors are cosine-normalized, log TF-IDF vectors.
IDF weights in the above vectors were computed from the union of lyrl2004_vectors_train.dat and lyrl2004-non-v2_vectors_train.dat (see below), i.e. a small number of RCV1-v1 vectors that are not in RCV1-v2 were used in computing IDF weights. No vectors from the lyrl2004-non-v2 files were used in any supervised learning. Any term in a test document that did not occur in one or more documents from the union of lyrl2004_vectors_train.dat and lyrl2004-non-v2_vectors_train.dat was discarded before cosine normalization. This discarding of terms, as well as the impact of the training/test split on IDF computation, means the Appendix 13 vectors should not be used in experiments with any training/test split besides the one used in LYRL2004!
The main reason to use the Appendix 13 files would be if a researcher wants to directly compare a supervised learning algorithm against those tested in the LYRL2004 paper, while keeping the training/test split and text representation exactly the same. For all other purposes, the token files (On-Line Appendix 12) are likely to be more useful.
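To make the dependence on the training/test split concrete, here is a small Python sketch of the discarding step: terms that never occurred in the documents used for IDF computation are dropped from a test document, and only then is the vector cosine-normalized. The function name, the string term keys, and the toy weights are assumptions of this example; the actual term weighting is the one described in LYRL2004 and is not reproduced here.

```python
import math


def restrict_and_normalize(raw_weights, known_terms):
    """Drop terms outside known_terms, then scale to unit Euclidean length."""
    kept = {t: w for t, w in raw_weights.items() if t in known_terms}
    norm = math.sqrt(sum(w * w for w in kept.values()))
    if norm == 0.0:
        return {}  # nothing left to normalize
    return {t: w / norm for t, w in kept.items()}


# Toy unnormalized weights for one test document (invented values).
raw = {"a": 3.0, "b": 4.0, "unseen": 12.0}
# Terms seen in the documents used for IDF computation (invented).
known = {"a", "b"}

vec = restrict_and_normalize(raw, known)
print(vec)
```

With a different split, the set of known terms would change, so both the surviving terms and the normalized weights of a test document could differ from those in the distributed files.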
Each vector in a file of vectors is represented by a single line of the form:
<did> [<tid>:<weight>]+
where we have:
<did> : Reuters-assigned document id.
<tid> : A positive integer term id. Term ids are between 1 and 47,236. The corresponding type (string form) for each term id is found in On-Line Appendix 14.
<weight> : The numeric feature value, i.e. within document weight, assigned to this term for this document, as described in LYRL2004.
Here's an example of the vector file format:
999995 1:0.03 3:0.047 8:0.38749738478937479 14:0.1 2748:0.03
999996 7:0.13 19:0.138 255:0.58588 314:0.28101 18800:0.005
999998 2:0.00001 3:0.108 184:0.228 488:0.0821 40917:0.111
Actual vectors are much longer, both due to more terms and more decimal places in weights.
See Section 7 of LYRL2004 for further details.
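A line in the vector format above could be parsed along the following lines. This is a sketch only: the names parse_vector_line and l2_norm are assumptions of this example, and the sample line is the first toy line shown above.

```python
import math


def parse_vector_line(line):
    """Return (did, {tid: weight}) for one line of a vector file."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected an id and at least one pair: {line!r}")
    did = int(fields[0])
    weights = {}
    for pair in fields[1:]:
        tid, weight = pair.split(":")
        weights[int(tid)] = float(weight)
    return did, weights


def l2_norm(weights):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in weights.values()))


SAMPLE_VECTOR = "999995 1:0.03 3:0.047 8:0.38749738478937479 14:0.1 2748:0.03"

did, weights = parse_vector_line(SAMPLE_VECTOR)
print(did, sorted(weights), l2_norm(weights))
```

Since the distributed vectors are cosine-normalized, l2_norm should come out very close to 1 on real lines; the toy line above is not normalized, so its length is well below 1.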
The remaining 5 files in Appendix 13 contain vectors that correspond to original RCV1 documents that we did not include in the RCV1-v2 collection, and so are not included in the five files discussed above. They correspond to RCV1 documents that had demonstrably invalid category codes. In gzipped form the file sizes in bytes are:
lyrl2004-non-v2_vectors_test_pt0.dat.gz : 532480
lyrl2004-non-v2_vectors_test_pt1.dat.gz : 524288
lyrl2004-non-v2_vectors_test_pt2.dat.gz : 339968
lyrl2004-non-v2_vectors_test_pt3.dat.gz : 413696
lyrl2004-non-v2_vectors_train.dat.gz : 172032
In uncompressed form the file sizes in bytes are:
lyrl2004-non-v2_vectors_test_pt0.dat : 1359872
lyrl2004-non-v2_vectors_test_pt1.dat : 1294336
lyrl2004-non-v2_vectors_test_pt2.dat : 839680
lyrl2004-non-v2_vectors_test_pt3.dat : 974848
lyrl2004-non-v2_vectors_train.dat : 389120
The number of vectors in each file is:
lyrl2004-non-v2_vectors_test_pt0.dat : 671 vectors
lyrl2004-non-v2_vectors_test_pt1.dat : 661 vectors
lyrl2004-non-v2_vectors_test_pt2.dat : 424 vectors
lyrl2004-non-v2_vectors_test_pt3.dat : 463 vectors
lyrl2004-non-v2_vectors_train.dat : 158 vectors
Of these files, only lyrl2004-non-v2_vectors_train.dat was used in the LYRL2004 experiments. It was used, along with lyrl2004_vectors_train.dat, only for generating IDF weights. These vectors were produced in exactly the same fashion as the RCV1-v2 vector files, and have the same format. We include them for completeness, but most researchers will not need to use them.
See Section 7 of LYRL2004 for further details.
On-Line Appendix 14 consists of the ASCII file stem.termid.idf.map.txt, which is 1,411,031 bytes in size. It specifies the mapping between the numeric term IDs used in our vector files (On-Line Appendix 13) and the stemmed tokens used in our tokenized document files (On-Line Appendix 12). There are 47,236 lines in the file, corresponding to the 47,236 unique stemmed tokens present in the 23,307 pre-breakpoint RCV1-v1 documents. The lines have the form:
<stem> <termid> <idf>
where we have:
<stem> : stemmed term, as appears in On-Line Appendix 12 files
<termid> : integer term ID, as appears in On-Line Appendix 13 files
<idf> : inverse document frequency value used for the term in the LYRL2004 experiments
Not all the 47,236 terms represented in this file were actually used in our experiments. See Section 6.5 and Section 7 of LYRL2004 for details.
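As an illustration of how this mapping ties the two appendices together, the Python sketch below reads lines of the form above and uses the result to show a sparse vector in terms of stems rather than term IDs. The function names and all sample stems, IDs, and IDF values are invented for this example.

```python
def read_term_map(lines):
    """Return {termid: (stem, idf)} from '<stem> <termid> <idf>' lines."""
    term_map = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"unexpected term map line: {line!r}")
        stem, termid, idf = fields[0], int(fields[1]), float(fields[2])
        term_map[termid] = (stem, idf)
    return term_map


def vector_as_stems(weights, term_map):
    """Replace term IDs in a {tid: weight} vector with their stems."""
    return {term_map[tid][0]: w for tid, w in weights.items()}


# Invented sample lines; real stems, IDs, and IDF values differ.
SAMPLE_MAP = [
    "aid 1 2.0",
    "communiti 2 3.5",
    "document 3 1.25",
]

term_map = read_term_map(SAMPLE_MAP)
print(vector_as_stems({1: 0.6, 3: 0.8}, term_map))
```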
On-Line Appendix 15 consists of nineteen ASCII files containing the contingency tables used to generate all experiment results reported in LYRL2004. The filenames (not the files) have the format:
<catset>.<alg>.<opt>.xml
where we have:
<catset> : Set of categories
<alg> : Classifier algorithm
<opt> : Whether the classifier was optimized for macroaveraging, microaveraging, or on a per-category basis.
All Industries files are 35067 bytes in size, all Regions files 35523 bytes, and all Topics files are 9909 bytes in size. The files are available as a gzipped tar archive, a15-contingency-tables.tar.gz, which is 73547 bytes in size.
All files are in XML format and have the following structure (note our convention for specifying file format is slightly different here than in the rest of this web page):
<allcats>
TABLELINE+
</allcats>
where "<allcats>" and "</allcats>" are actual XML tags in the file, each on its own line. Each TABLELINE is a line with the following format:
<c> <n> NAME </n> <tp> A </tp> <fp> B </fp> <fn> C </fn> <tn> D </tn> </c>
where we have:
NAME : name of a category
A : integer value for number of true positive (classifier assigned category to document, and category should be assigned to document) decisions the classifier made for this category on the RCV1-v2 test set documents.
B : integer value for number of false positive (classifier assigned category, category should not be assigned) decisions.
C : integer value for number of false negative (classifier did not assign category, category should be assigned) decisions.
D : integer value for number of true negative (classifier did not assign category, category should not be assigned) decisions.
Note that A+B+C+D sums to 781,265 (the number of RCV1-v2 test documents) on every line.
See Section 5.3 of LYRL2004 for more on effectiveness measures.
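The sketch below shows how a file with this structure might be read in Python and turned into per-category, macroaveraged, and microaveraged F1 values, using the standard definition F1 = 2A / (2A + B + C). The function names, category names, and counts are invented for this example (the toy counts do not sum to the real test set size), and returning 0.0 when the denominator is zero is a choice made here, not necessarily the convention of LYRL2004.

```python
import xml.etree.ElementTree as ET


def read_contingency_tables(xml_text):
    """Return {category: (tp, fp, fn, tn)} from an <allcats> document."""
    root = ET.fromstring(xml_text)
    tables = {}
    for c in root.findall("c"):
        name = c.findtext("n").strip()
        tables[name] = tuple(
            int(c.findtext(tag)) for tag in ("tp", "fp", "fn", "tn")
        )
    return tables


def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # zero case: sketch's choice


def macro_f1(tables):
    """Average of the per-category F1 values."""
    scores = [f1(tp, fp, fn) for tp, fp, fn, _ in tables.values()]
    return sum(scores) / len(scores)


def micro_f1(tables):
    """F1 computed from counts summed over all categories."""
    tp = sum(t[0] for t in tables.values())
    fp = sum(t[1] for t in tables.values())
    fn = sum(t[2] for t in tables.values())
    return f1(tp, fp, fn)


# Invented toy tables with hypothetical category names.
SAMPLE_XML = """<allcats>
<c> <n> CATA </n> <tp> 8 </tp> <fp> 2 </fp> <fn> 2 </fn> <tn> 88 </tn> </c>
<c> <n> CATB </n> <tp> 1 </tp> <fp> 1 </fp> <fn> 3 </fn> <tn> 95 </tn> </c>
</allcats>"""

tables = read_contingency_tables(SAMPLE_XML)
print(macro_f1(tables), micro_f1(tables))
```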
On-Line Appendix 16 consists of the ASCII file topics.rbb, which is 6143 bytes in size. It is a Reuters documentation file on a set of categories closely related to the Topics categories. See Section 3.2.4 of LYRL2004 for more information.
On-Line Appendix 17 consists of the ASCII file industries.rbb, which is 13222 bytes in size. It is a Reuters documentation file on a set of categories closely related to the Industries categories. See Section 3.2.4 of LYRL2004 for more information.
On-Line Appendix 18 consists of the ASCII file regions.rbb, which is 6900 bytes in size. It is a Reuters documentation file on a set of categories closely related to the Regions categories. See Section 3.2.4 of LYRL2004 for more information.