REPOST: Artificial Intelligence FAQ: AI Web Directories & Online Papers 5/6 [Monthly posting]
Section - [5-6] Where can I get a machine readable dictionary, thesaurus, and other text corpora?


   Roget's 1911 Thesaurus is available by anonymous FTP from the
   Consortium for Lexical Research.

   It is also available from

   An old Webster's dictionary is in /text/dict/{DICT.Z,DICT.INDEX.Z}.
   Project Gutenberg also has Roget's 1911 Thesaurus. The Project
   Gutenberg archive collects public domain electronic books. For more
   information, write to Michael S. Hart, Professor of Electronic Text,
   Executive Director of Project Gutenberg Etext, Illinois Benedictine
   College, 5700 College Road, Lisle, IL 60532.

   For people without FTP, Austin Code Works sells floppy disks
   containing Roget's 1911 Thesaurus for $40.00. This money helps support
   the production of other useful texts, such as the 1913 Webster's dictionary.

   The Online Book Initiative maintains a text repository on (a public access UNIX system, 617-739-WRLD). See the
   README file on For more information, send email to, write to Software Tool & Die, 1330 Beacon Street,
   Brookline, MA 02146, or call 617-739-0202.

   The CHILDES project at Carnegie Mellon University has a large body of
   data on children speaking to adults, as well as the adult written and
   adult spoken corpora from the CORNELL project. Contact Brian
   MacWhinney for more information.

   The Association for Computational Linguistics (ACL) has a Data
   Collection Initiative. For more information, contact Donald Walker at

   Two lists of common female first names (4967 names) and male first
   names (2924 names) are available for anonymous ftp from

   Read the file README first. Send mail to for more

   A list of 110,000 English words (one per line, in ASCII) is
   available in the PD1:<MSDOS.LINGUISTICS> directory on SIMTEL20 as the
   files WORDS1.ZIP, WORDS2.ZIP, WORDS3.ZIP, and WORDS4.ZIP. Although the
   list is in MS-DOS files, it can easily be used on other machines (but
   first you'll have to unzip the files on a DOS machine). The list
   includes inflected forms of the words, such as plural nouns and the
   -s, -ed, and -ing forms of verbs; thus the number of lexical stems in
   the list is considerably smaller than the total number of word forms.
   These files are available via FTP from WSMR-SIMTEL20.ARMY.MIL.
   SIMTEL20 files are mirrored on
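
   The point about inflected forms can be illustrated with a small sketch.
   It assumes only what is stated above (one word per line, plain ASCII);
   the suffix-stripping rule and the sample words are hypothetical and
   much cruder than a real stemmer, so treat the result as a rough
   estimate only.

```python
# Rough estimate of how many distinct stems a one-word-per-line list holds.
# The suffix rule below is an illustrative assumption, not a real stemmer.

SUFFIXES = ("ing", "ed", "s")


def load_words(lines):
    """Build a vocabulary set from an iterable of lines (one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def crude_stem(word, vocabulary):
    """Strip -ing, -ed, or -s only when the remaining base is itself listed."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            if base in vocabulary:
                return base
    return word


def count_stems(vocabulary):
    """Count the distinct stems produced by crude_stem over the vocabulary."""
    return len({crude_stem(word, vocabulary) for word in vocabulary})


# Hypothetical sample standing in for the contents of an unzipped word file.
sample = ["walk", "walks", "walked", "walking", "cat", "cats", "sing"]
vocab = load_words(sample)
print(len(vocab), "word forms,", count_stems(vocab), "stems")
```

   With a real file, the same functions could be applied to an open file
   object, for example load_words(open(path)) where path is wherever the
   unzipped list was saved.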

   The Collins English Dictionary encoded as a Prolog fact base is
   available from the Oxford Text Archive by anonymous ftp.

   The Oxford Text Archive includes many other texts, dictionaries,
   thesauri, word lists, and so on, most of which are available for
   scholarly use and research only. See the files

   for more information, or write to, Oxford Text Archive,
   Oxford University Computing Services, 13 Banbury Road, Oxford OX2
   6NN, UK, call 44-865-273238 or fax 44-865-273275.

   Chuck Wooters has extracted the most
   likely pronunciation for each of about 6100 words in the hand-labeled
   TIMIT database, and made them available by anonymous ftp from

   A list of homophones from general American English is available by
   anonymous ftp from as the file
   homophones-1.01.txt. To receive the list by email, send mail to The list was compiled by Tony Robinson.

   Sigurd P. Crossland has been compiling
   a dictionary of English words, including most common American words,
   abbreviations, hyphenations, and even incorrect spellings. The most
   recent version is available by anonymous ftp from

   The tar file includes 31 text files, one for each word-length from 2
   to 32. The compressed tar file takes up just over 4 MB of space, and
   includes approximately 870,000 words.
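
   The one-file-per-word-length layout described above can be mimicked
   with a short sketch. The function name and the sample words are
   hypothetical; only the length range of 2 to 32 comes from the
   description, and nothing here reflects the actual file names in the
   archive.

```python
from collections import defaultdict


def bucket_by_length(words, min_len=2, max_len=32):
    """Group words by length, keeping only lengths in [min_len, max_len]."""
    buckets = defaultdict(list)
    for word in words:
        if min_len <= len(word) <= max_len:
            buckets[len(word)].append(word)
    # Return a plain dict ordered by word length, each bucket sorted.
    return {length: sorted(group) for length, group in sorted(buckets.items())}


# Lengths 2 through 32 inclusive give 31 possible buckets.
print(len(range(2, 33)))

# Hypothetical sample; the one-letter word "a" falls outside the range.
print(bucket_by_length(["to", "at", "cat", "a", "dictionary"]))
```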

   WordNet is an English lexical reference system based on current
   psycholinguistic theories of human lexical memory. It organizes nouns,
   verbs and adjectives into synonym sets corresponding to lexical
   concepts. The sets are linked by a variety of relations. Besides being
   of scientific interest, it makes a handy thesaurus. WordNet is
   available by anonymous ftp from

   If you retrieve a copy of wordnet by ftp, please send mail to 
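
   The idea of synonym sets linked by relations can be sketched as a small
   data structure. This is a toy model under stated assumptions: the
   identifiers, the sample entries, and the "hypernym" relation name are
   illustrative choices, and the code does not read WordNet's own file
   format or use its software.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words standing for one lexical concept."""
    words: frozenset
    pos: str  # part of speech: "noun", "verb", or "adjective"
    relations: dict = field(default_factory=dict)  # relation -> synset ids


# Hypothetical toy data, keyed by made-up synset identifiers.
SYNSETS = {
    "car.n": Synset(frozenset({"car", "automobile"}), "noun",
                    {"hypernym": ["vehicle.n"]}),
    "vehicle.n": Synset(frozenset({"vehicle"}), "noun"),
}


def synonyms(word):
    """Thesaurus-style lookup: other words sharing a synset with `word`."""
    found = set()
    for synset in SYNSETS.values():
        if word in synset.words:
            found |= synset.words - {word}
    return sorted(found)


def related(word, relation):
    """Follow one named relation from every synset that contains `word`."""
    found = set()
    for synset in SYNSETS.values():
        if word in synset.words:
            for target_id in synset.relations.get(relation, []):
                found |= SYNSETS[target_id].words
    return sorted(found)


print(synonyms("car"))
print(related("automobile", "hypernym"))
```

   Keeping relations on the synset rather than on individual words is
   what lets every member of a set share the same links.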


   Illumind publishes the Moby lexical databases: Moby Thesaurus (25,000
   roots/1.2 million synonyms), Moby Words (560,000 entries), Moby
   Hyphenator (155,000 entries), Moby Part-of-Speech (214,000 entries),
   Moby Pronunciator (167,000 entries with IPA encoding, syllabification,
   and primary, secondary, and tertiary stress marks), and Moby Language
   (100,000-word word lists in five major world languages). All
   databases are supplied in pure ASCII, royalty-free, in
   both Macintosh and MS-DOS disk formats (also in .Z file formats). Both
   commercial (to resell derived structures as part of commercial
   applications) and educational/research licenses are available. Samples
   of each of the lexical databases are available by anonymous ftp.
   For more
   information, write to Illumind, Attn: Grady Ward, 3449 Martha Court,
   Arcata, CA 95521, call/fax 707-826-7715, or send email to
   [Maintainer's note:  This contact information is no longer valid.
   We're working on finding a current address.]

   The Oxford Text Archive has hundreds of online texts in a wide variety
   of languages, including a few dictionaries (the OED, Collins, etc.).
   The Lancaster-Oslo-Bergen (LOB), Brown, and London-Lund corpora are also
   available from them.  For more information, write to Oxford Electronic
   Publishing, Oxford University Press, 200 Madison Avenue, New York, NY
   10016, call 212-889-0206, or send mail to
   (Their contact information in England is Oxford Text Archive, Oxford
   University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK, +44
   (865) 273238.)

Mailing Lists:

   CORPORA is a mailing list for Text Corpora. It welcomes information
   and questions about text corpora such as availability, aspects of
   compiling and using corpora, software, tagging, parsing, and
   bibliography. To be added to the list, send a message to Contributions should be sent to

Linguistic Data Consortium:

   The Linguistic Data Consortium was established to broaden the collection
   and distribution of speech and natural language data bases for the
   purposes of research and technology development in automatic speech
   recognition, natural language processing, and other areas where large
   amounts of linguistic data are needed.  Information about the LDC is
   available by anonymous ftp from [].
   Documents available in this directory include a paper on the background,
   rationale and goals of the LDC, a brief list of available data bases,
   and some tables summarizing these corpora. For further information,
   contact Elizabeth Hodas, Mark Liberman, or Jack Godfrey.
