| |
| | AQUAINT Text Corpus |
 | | The text data are separated into directories by source (apw, nyt, xie); within each source, data files are subdivided by year, and within each year, there is one file per date of collection. |
 | | The sampling for this corpus covers the period from January 1996 to September 2000, inclusive, for the Xinhua text collection, and from June 1998 to September 2000, inclusive, for New York Times and Associated Press. |
 | | However, despite our best intentions, there is unavoidable variation in the formatting of text data transmitted over these newswire services, and in a small percentage of stories, the typical cues for delimiting the TEXT content are lacking. |
| www.ldc.upenn.edu/Catalog/docs/LDC2002T31 |
|
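The directory layout and sampling periods described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus root path is supplied by the caller, the mapping of `xie`, `nyt`, and `apw` to Xinhua, New York Times, and Associated Press is inferred from the description, year directories are assumed to sit directly under each source directory, and no file naming scheme is assumed. The names `COVERAGE`, `in_coverage`, and `iter_data_files` are hypothetical helpers, not part of any corpus tooling.

```python
from datetime import date
from pathlib import Path

# Sampling periods as stated in the description (inclusive).
# Assumption: xie = Xinhua, nyt = New York Times, apw = Associated Press.
COVERAGE = {
    "xie": (date(1996, 1, 1), date(2000, 9, 30)),
    "nyt": (date(1998, 6, 1), date(2000, 9, 30)),
    "apw": (date(1998, 6, 1), date(2000, 9, 30)),
}


def in_coverage(source, day):
    """Return True if `day` falls inside the stated sampling period for `source`."""
    start, end = COVERAGE[source]
    return start <= day <= end


def iter_data_files(root):
    """Yield (source, year, path) for each file under <root>/<source>/<year>/.

    Year directories are returned by name only; nothing is assumed about
    how the per-date files inside them are named.
    """
    root = Path(root)
    for source in sorted(COVERAGE):
        source_dir = root / source
        if not source_dir.is_dir():
            continue  # tolerate a partial copy of the corpus
        for year_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
            for path in sorted(p for p in year_dir.iterdir() if p.is_file()):
                yield source, year_dir.name, path
```

Because there is one file per date of collection, counting the files yielded for a given source and year gives a quick check on how many collection dates are present in a local copy.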