| 20080281581 | METHOD OF IDENTIFYING DOCUMENTS WITH SIMILAR PROPERTIES UTILIZING PRINCIPAL COMPONENT ANALYSIS - The present invention generally provides methods and systems for characterizing texts, for example, for identifying textual documents by language, topic, author, or other attributes. In some embodiments, a method of the invention can include creating an n-gram frequency spectrum for a document under analysis, preferably selecting a subset of the n-gram frequency spectrum, transforming the n-gram frequency spectrum into principal component space, and identifying one or more attributes of the document according to its similarity to (or distinction from) reference documents in the principal component space. | 11-13-2008 |
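
The steps named in this abstract (build an n-gram frequency spectrum, keep a subset of it, transform into principal component space, compare against reference documents) can be sketched roughly as follows. This is only an illustrative sketch, not the patented method: the use of character trigrams, the top-k selection rule, the Euclidean nearest-neighbour comparison, and all function names (`ngram_spectrum`, `select_vocabulary`, `fit_pca`, `identify`) are assumptions introduced here for illustration.

```python
from collections import Counter

import numpy as np


def ngram_spectrum(text, n=3):
    """Relative frequency of each character n-gram in the text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def select_vocabulary(spectra, top_k=50):
    """Assumed subset rule: keep the top_k n-grams by summed frequency."""
    totals = Counter()
    for spectrum in spectra:
        totals.update(spectrum)
    # Sort by frequency, then alphabetically, so ties break deterministically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def to_vector(spectrum, vocab):
    """Frequency vector of a spectrum restricted to the chosen n-grams."""
    return np.array([spectrum.get(g, 0.0) for g in vocab])


def fit_pca(X, n_components=2):
    """PCA via SVD of the mean-centred reference matrix."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(x, mean, components):
    """Map one vector (or a matrix of row vectors) into component space."""
    return (x - mean) @ components.T


def identify(text, references, n=3, top_k=50, n_components=2):
    """Return the label of the reference closest to text in PC space.

    references is a list of (label, text) pairs.
    """
    labels = [label for label, _ in references]
    spectra = [ngram_spectrum(ref_text, n) for _, ref_text in references]
    vocab = select_vocabulary(spectra, top_k)
    X = np.vstack([to_vector(s, vocab) for s in spectra])
    mean, components = fit_pca(X, n_components)
    ref_points = project(X, mean, components)
    query = project(to_vector(ngram_spectrum(text, n), vocab), mean, components)
    distances = np.linalg.norm(ref_points - query, axis=1)
    return labels[int(np.argmin(distances))]
```

A call such as `identify(unknown_text, [("en", english_sample), ("es", spanish_sample)])` returns the label of the nearest reference document. In practice the reference set would hold many documents per attribute (language, topic, or author), and the n-gram length, subset size, and number of components would need tuning for the attribute of interest.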