I have to say that I dislike most of the widely accepted data corpuses. This is not because I am subjective; I see purely technical reasons for it. For example, consider the picture used for testing image compression, known as Lenna. It was a copy made on an ancient scanner in 1972, with garbage sitting in the last 3 bits of every byte. For 30 years programmers desperately tried to achieve better compression ratios by compressing that white noise, while all real scanners and digital cameras were producing statistically different data.
A more recent example is this reduced picture from www.maximumcompression.com.
The last 3 bits are identical across the image, and the color palette is unrealistically small at 4,956 colors, while a typical photographic picture of that size should have nearly 300,000 colors. The color data in this image are statistically distributed like the bytes of a text document, which makes text compression algorithms work better than image compression on this picture. The choice is obviously wrong; a randomly picked image would be a better choice, yet this one has been used to evaluate the efficiency of image compression for many years.
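Both claims above are easy to verify mechanically. Here is a minimal sketch of the two checks; it operates on a plain list of (r, g, b) tuples, which in practice would come from an image library such as Pillow. The function name is my own, not from any particular tool.

```python
def color_stats(pixels):
    """Return (number of unique colors, number of distinct low-3-bit patterns).

    pixels is a list of (r, g, b) tuples. A typical photo should have
    hundreds of thousands of unique colors; if every pixel carries the
    same pattern in its last 3 bits, those bits hold quantization
    residue rather than image data.
    """
    unique_colors = len(set(pixels))
    low_bit_patterns = len({tuple(c & 0b111 for c in px) for px in pixels})
    return unique_colors, low_bit_patterns

# Example: synthetic pixels whose low 3 bits were truncated to zero.
pixels = [(r & ~0b111, 64, 128) for r in range(256)]
colors, patterns = color_stats(pixels)
print(colors, patterns)  # 32 distinct colors, 1 low-bit pattern
```

If `patterns` comes out as 1 on a real scan, the last 3 bits of every byte are identical, exactly the situation described above.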
The corpuses for text categorization look easy and not challenging enough for me. In the 20 Newsgroups corpus, file 51060 contains the words ATHEISM or ATHEIST repeated nearly 150 times. The corpus is used for programmatic identification of latent topics, but how can the topic ATHEISM be latent in this file? It looks rather like a corpus for testing keyword search.
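The point is easy to check: a trivial whole-word count already "categorizes" such a file, no topic modeling needed. A small sketch, using an inline stand-in string where one would read the real file 51060:

```python
import re

def keyword_count(text, keywords):
    """Count case-insensitive whole-word occurrences of any keyword."""
    pattern = r"\b(?:" + "|".join(keywords) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Stand-in text; with the real corpus, text = open("51060").read().
text = "Atheism is discussed; an atheist wrote about atheism."
print(keyword_count(text, ["atheism", "atheist"]))  # prints 3
```

When a single keyword fires on that scale, the category is given away by surface vocabulary, which is exactly what a latent-topic benchmark should not do.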
My document corpus is a set of technical descriptions of published and granted patents. Every patent is assigned a class. The class is assigned after very careful and lengthy examination by a patent examiner who is an expert in the field. A patent may also have a sub-class and be very close to another class, which makes recognition of the category not that simple. The suggested corpus is small: 128 documents in 8 categories (16 documents in each). All patents are public, so there is no problem with copying and using them. The files do not contain the classes or any information about inventors or attorneys; nothing in a file hints at the category besides technical terminology. Although I did not conduct such an experiment, I presume that engineers familiar with most of the 8 industries used in the corpus would not be able to categorize these documents manually with precision better than 70 percent. A USPTO examiner would probably show a better result.

The corpus can be downloaded from the link at the top. The structure is very simple: 128 files in a single directory. Each file name, such as C399P7460809.txt, starts with the letter C followed by the class number, then the letter P followed by the patent number. The tested software gets the name of the directory, enumerates the files, and sorts them into categories. After categorization, the file names in each category can be used to evaluate precision, which is very convenient for producing an automatic accuracy report. Needless to say, this corpus is freely redistributable because it is public information.
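The evaluation loop described above can be sketched in a few lines. This assumes only the file-naming scheme from the corpus (Cnnn...Pnnn....txt); the `categorize` callback stands for whatever classifier is under test and is a hypothetical name of mine.

```python
import os
import re

def true_class(filename):
    """Extract the ground-truth class from a name like C399P7460809.txt."""
    m = re.match(r"C(\d+)P(\d+)\.txt$", filename)
    return m.group(1) if m else None

def accuracy(directory, categorize):
    """Fraction of corpus files whose predicted class matches the file name.

    categorize(path) is the tested software: it takes a file path and
    returns the predicted class number as a string.
    """
    correct = total = 0
    for name in sorted(os.listdir(directory)):
        expected = true_class(name)
        if expected is None:
            continue  # skip non-corpus files
        total += 1
        if categorize(os.path.join(directory, name)) == expected:
            correct += 1
    return correct / total if total else 0.0
```

For example, `true_class("C399P7460809.txt")` yields `"399"`, so the report requires nothing beyond the directory listing itself.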
There are some large data corpuses that may be good for testing performance but not accuracy, for example the publicly available Enron e-mail set. Due to its large size it can be used to verify whether a program can complete categorization in reasonable time, but how can its accuracy be estimated? For that reason patents are well suited to accuracy testing: they were categorized manually during an examination that lasted several years.