
REPOST: Artificial Intelligence FAQ: AI Web Directories & Online Papers 5/6 [Monthly posting]
Section - [5-6] Where can I get a machine readable dictionary, thesaurus, and other text corpora?




Free:

   /usr/dict/words

   Roget's 1911 Thesaurus is available by anonymous FTP from the
   Consortium for Lexical Research 

      clr.nmsu.edu:/CLR/lexica/roget-1911 [128.123.1.12]

   It is also available from

      ftp://src.doc.ic.ac.uk/literary/collections/project_gutenberg/

   An old Webster's dictionary is in /text/dict/{DICT.Z,DICT.INDEX.Z}.
   Project Gutenberg also has Roget's 1911 Thesaurus. The Project
   Gutenberg archive is at ftp://mrcnext.cso.uiuc.edu/pub/etext/. The
   Project Gutenberg archive collects public domain electronic books. For more
   information, write to Michael S. Hart, Professor of Electronic Text,
   Executive Director of Project Gutenberg Etext, Illinois Benedictine
   College, 5700 College Road, Lisle, IL 60532 or send email to
   hart@vmd.cso.uiuc.edu. 

   For people without FTP, Austin Code Works sells floppy disks
   containing Roget's 1911 Thesaurus for $40.00. This money helps support
   the production of other useful texts, such as the 1913 Webster's dictionary.

   The Online Book Initiative maintains a text repository on
   ftp.std.com (a public access UNIX system, 617-739-WRLD). See the
   README file on obi.std.com:/obi/. For more information, send email to
   obi@world.std.com, write to Software Tool & Die, 1330 Beacon Street,
   Brookline, MA 02146, or call 617-739-0202.

   The CHILDES project at Carnegie Mellon University has a large
   collection of data on children speaking to adults, as well as the
   adult written and adult spoken corpora from the CORNELL project.
   Contact Brian MacWhinney <brian@andrew.cmu.edu> for more information.

   The Association for Computational Linguistics (ACL) has a Data
   Collection Initiative. For more information, contact Donald Walker at
   Bellcore, walker@flash.bellcore.com.

   Two lists of common female first names (4967 names) and male first
   names (2924 names) are available for anonymous ftp from 

      ftp.cs.cmu.edu:/user/ai/areas/nlp/corpora/names/

   Read the file README first. Send mail to mkant@cs.cmu.edu for more
   information. 

   A list of 110,000 English words (one per line, in ASCII) is
   available in the PD1:<MSDOS.LINGUISTICS> directory on SIMTEL20 as the
   files WORDS1.ZIP, WORDS2.ZIP, WORDS3.ZIP, and WORDS4.ZIP. Although the
   list is in MS-DOS files, it can easily be used on other machines (but
   first you'll have to unzip the files on a DOS machine). The list
   includes inflected forms of the words, such as plural nouns and the
   -s, -ed, and -ing forms of verbs; thus the number of lexical stems in
   the list is considerably smaller than the total number of word forms.
   These files are available via FTP from WSMR-SIMTEL20.ARMY.MIL
   [192.88.110.20].  SIMTEL20 files are mirrored on wuarchive.wustl.edu.
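   To illustrate why the number of lexical stems is smaller than the
   number of word forms, here is a minimal sketch in Python. It is not
   part of the SIMTEL20 distribution. The function names, the suffix
   list, and the sample words are all assumptions made for this example,
   and the suffix stripping is deliberately naive (it will mis-stem many
   real words), so treat the counts it produces as a rough estimate only.

```python
# Naive suffix stripping: an illustration only, not a real stemmer.
# The suffix list mirrors the -s, -ed, and -ing forms mentioned above.
SUFFIXES = ("ing", "ed", "s")  # checked in this order, longest first


def naive_stem(word):
    """Strip one inflectional suffix if at least 3 letters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def count_forms_and_stems(words):
    """Return (distinct word forms, distinct naive stems)."""
    forms = {w.strip().lower() for w in words if w.strip()}
    stems = {naive_stem(w) for w in forms}
    return len(forms), len(stems)


# Hypothetical sample; a real run would read the unzipped list,
# one word per line, from a file instead.
sample = ["walk", "walks", "walked", "walking", "cat", "cats"]
print(count_forms_and_stems(sample))  # (6, 2)
```

   To run this on a real list, pass an open text file to
   count_forms_and_stems() in place of the sample list.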

   The Collins English Dictionary encoded as a Prolog fact base is
   available from the Oxford Text Archive by anonymous ftp from

      ftp://ota.ox.ac.uk/pub/ota/dicts/1192/  [129.67.1.165]

   The Oxford Text Archive includes many other texts, dictionaries,
   thesauri, word lists, and so on, most of which are available for
   scholarly use and research only. See the files

      ota.ox.ac.uk:/pub/ota/textarchive.form
      ota.ox.ac.uk:/pub/ota/textarchive.info
      ota.ox.ac.uk:/pub/ota/textarchive.list
      ota.ox.ac.uk:/pub/ota/textarchive.sgml

   for more information, or write to archive@ox.ac.uk, Oxford Text Archive,
   Oxford University Computing Services, 13 Banbury Road, Oxford OX2
   6NN, UK, call 44-865-273238 or fax 44-865-273275.

   Chuck Wooters <wooters@icsi.berkeley.edu> has extracted the most
   likely pronunciation for each of about 6100 words in the hand-labeled
   TIMIT database, and made them available by anonymous ftp from
   ftp.icsi.berkeley.edu:/pub/speech/TIMIT.mostlikely.Z.

   A list of homophones from general American English is available by
   anonymous ftp from svr-ftp.eng.cam.ac.uk:/comp.speech/data/ as the file
   homophones-1.01.txt. To receive the list by email, send mail to
   Evan.Antworth@sil.org. The list was compiled by Tony Robinson.

   Sigurd P. Crossland <sig@seuss.vantage.gte.com> has been compiling 
   a dictionary of English words, including most common American words,
   abbreviations, hyphenations, and even incorrect spellings. The most
   recent version is available by anonymous ftp from

      ftp://wocket.vantage.gte.com/pub/standard_dictionary/

   The tar file includes 31 text files, one for each word length from 2
   to 32. The compressed tar file takes up just over 4 MB of space, and
   includes approximately 870,000 words.
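   The one-file-per-word-length layout described above can be
   reproduced for any word list with a few lines of Python. This is a
   sketch only: the function name and sample words are hypothetical, and
   the actual file names inside the tar file are not stated here, so the
   example keeps the groups in memory rather than guessing at names.

```python
from collections import defaultdict


def group_by_length(words, min_len=2, max_len=32):
    """Group words into buckets keyed by word length.

    The default bounds follow the 2-to-32 range described above;
    words outside that range are dropped.
    """
    buckets = defaultdict(list)
    for word in words:
        w = word.strip()
        if min_len <= len(w) <= max_len:
            buckets[len(w)].append(w)
    return dict(buckets)


# Hypothetical sample input, one word per entry.
sample = ["an", "to", "cat", "dog", "tree", "a"]
groups = group_by_length(sample)
print(sorted(groups))  # [2, 3, 4]
```

   Note that the range 2 to 32 covers exactly 31 lengths, which matches
   the 31 files in the distribution.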

   WordNet is an English lexical reference system based on current
   psycholinguistic theories of human lexical memory. It organizes nouns,
   verbs and adjectives into synonym sets corresponding to lexical
   concepts. The sets are linked by a variety of relations. Besides being
   of scientific interest, it makes a handy thesaurus. WordNet is
   available by anonymous ftp from

      ftp://clarity.princeton.edu/pub/

   If you retrieve a copy of WordNet by ftp, please send mail to
   wordnet@princeton.edu.
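   The idea of synonym sets linked by relations can be sketched with a
   small Python data structure. This is a toy model for illustration
   only. It does not read the WordNet distribution or reflect its file
   format. The set identifiers, the relation names, and the handful of
   entries below are made up for this example.

```python
# Toy synonym sets. Keys such as "car.n" are hypothetical identifiers,
# not WordNet's own. Each set lists its words and its links to other sets.
SYNSETS = {
    "car.n": {"pos": "noun", "words": {"car", "auto", "automobile"},
              "hypernym": ["vehicle.n"]},
    "vehicle.n": {"pos": "noun", "words": {"vehicle"}},
    "big.a": {"pos": "adjective", "words": {"big", "large"},
              "antonym": ["small.a"]},
    "small.a": {"pos": "adjective", "words": {"small", "little"},
                "antonym": ["big.a"]},
}


def synonyms(word):
    """Thesaurus lookup: other words sharing a synonym set with `word`."""
    found = set()
    for entry in SYNSETS.values():
        if word in entry["words"]:
            found |= entry["words"] - {word}
    return found


def related(word, relation):
    """Words in the sets reached from `word` by one link of `relation`."""
    found = set()
    for entry in SYNSETS.values():
        if word in entry["words"]:
            for key in entry.get(relation, []):
                found |= SYNSETS[key]["words"]
    return found


print(sorted(synonyms("car")))            # ['auto', 'automobile']
print(sorted(related("car", "hypernym")))  # ['vehicle']
```

   A word that belongs to several sets would return the union of its
   synonyms, which is how a sense-based organization doubles as a
   thesaurus.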

Commercial:

   Illumind publishes the following lexical databases: Moby Thesaurus
   (25,000 roots/1.2 million synonyms), Moby Words (560,000 entries),
   Moby Hyphenator (155,000 entries), Moby Part-of-Speech (214,000
   entries), Moby Pronunciator (167,000 entries with IPA encoding,
   syllabification, and primary, secondary, and tertiary stress marks),
   and Moby Language (100,000-word word lists in five major world
   languages). All databases are supplied in pure ASCII, royalty-free, in
   both Macintosh and MS-DOS disk formats (also in .Z file formats). Both
   commercial (to resell derived structures as part of commercial
   applications) and educational/research licenses are available. Samples
   of each of the lexical databases are available by anonymous ftp from
   netcom.com:/pub/grady/Moby_Sampler.tar.Z [192.100.81.100].  For more
   information, write to Illumind, Attn: Grady Ward, 3449 Martha Court,
   Arcata, CA 95521, call/fax 707-826-7715, or send email to
   grady@netcom.com.
   [Maintainer's note:  This contact information is no longer valid.
   We're working on finding a current address.]

   The Oxford Text Archive has hundreds of online texts in a wide variety
   of languages, including a few dictionaries (the OED, Collins, etc.).
   The Lancaster-Oslo-Bergen (LOB), Brown, and London-Lund corpora are also
   available from them.  For more information, write to Oxford Electronic
   Publishing, Oxford University Press, 200 Madison Avenue, New York, NY
   10016, call 212-889-0206, or send mail to archive@vax.oxford.ac.uk.
   (Their contact information in England is Oxford Text Archive, Oxford
   University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK, +44
   (865) 273238.)

Mailing Lists:

   CORPORA is a mailing list for Text Corpora. It welcomes information
   and questions about text corpora such as availability, aspects of
   compiling and using corpora, software, tagging, parsing, and
   bibliography. To be added to the list, send a message to
   corpora-request@x400.hd.uib.no. Contributions should be sent to 
   corpora@x400.hd.uib.no.

Linguistic Data Consortium:

   The Linguistic Data Consortium was established to broaden the collection
   and distribution of speech and natural language data bases for the
   purposes of research and technology development in automatic speech
   recognition, natural language processing, and other areas where large
   amounts of linguistic data are needed.  Information about the LDC is
   available by anonymous ftp from ftp.cis.upenn.edu:/pub/ldc [130.91.6.8].
   Documents available in this directory include a paper on the background,
   rationale and goals of the LDC, a brief list of available data bases,
   and some tables summarizing these corpora. For further information,
   contact Elizabeth Hodas, <ehodas@walnut.ling.upenn.edu>, Mark Liberman
   <myl@unagi.cis.upenn.edu>, or Jack Godfrey <jgodfrey@unagi.cis.upenn.edu>.
