| | LSA: A Solution to Plato's Problem |
 | | From each article we took a sample consisting of (usually) the whole text or its first 2,000 characters, whichever was less, for a mean text sample length of 151 words, roughly the size of a rather long paragraph. |
 | | The variation appears to be largely determined by the size of the dictionaries sampled, and to some extent by the way in which words are defined as being separate from each other and by the testing methods employed. |
 | | Conveniently, this corpus is nearly the same as our encyclopedia sample in both overall size (five million words) and number of word types (68,000, counting, for the encyclopedia sample, singletons not included in the SVD analysis), so no correction for sample size, which alters word frequency distributions, was necessary. |
| lsi.telcordia.com/lsi/papers/PSYCHREV96.html (20008 words) |