Top Document: Artificial Intelligence FAQ:5/6 AI Web Directories & Online Papers [Monthly posting]
Previous Document: [5-6] Technical resources for/by undergraduate students
Next Document: [5-8] Where can I get training sets for machine learning algorithms?
[5-7] Where can I get a machine readable dictionary, thesaurus, and other text corpora?
Linguistic Data Consortium:

The Linguistic Data Consortium (LDC) was established to broaden the
collection and distribution of speech and natural language databases
for research and technology development in automatic speech
recognition, natural language processing, and other areas where large
amounts of linguistic data are needed. LDC corpora are the most
commonly used in published research. Information about the LDC is at
http://www.ldc.upenn.edu/

Free:

On Unix systems, /usr/dict/words is a fine word list.

===========

The Moby lexical databases are available at:
http://www.dcs.shef.ac.uk/research/ilash/Moby/
They comprise the Moby Thesaurus (25,000 roots/1.2 million synonyms),
Moby Words (560,000 entries), Moby Hyphenator (155,000 entries), Moby
Part-of-Speech (214,000 entries), Moby Pronunciator (167,000 entries
with IPA encoding, syllabification, and primary, secondary, and
tertiary stress marks), and Moby Language (100,000-word word lists in
five major world languages). The collection was once commercial but is
now in the public domain. [thanks to Robert Bechtel]

===========

Roget's 1911 Thesaurus is available by anonymous FTP from the
Consortium for Lexical Research
clr.nmsu.edu:/CLR/lexica/roget-1911 [128.123.1.12]
It is also available from
src.doc.ic.ac.uk:/literary/collections/project_gutenberg/roget11.txt.Z
An old Webster's dictionary is in /text/dict/{DICT.Z,DICT.INDEX.Z}.

Project Gutenberg, which collects public domain electronic books, also
has Roget's 1911 Thesaurus. The Project Gutenberg archive is at
mrcnext.cso.uiuc.edu:/pub/etext/. For more information, write to
Michael S. Hart, Professor of Electronic Text, Executive Director of
Project Gutenberg Etext, Illinois Benedictine College, 5700 College
Road, Lisle, IL 60532 or send email to hart@vmd.cso.uiuc.edu.

The Online Book Initiative maintains a text repository at
http://obi.std.com:/obi/

The CHILDES project at Carnegie Mellon University holds a large amount
of data from children speaking to adults, as well as the adult written
and adult spoken corpora from the CORNELL project. Contact Brian
MacWhinney <brian@andrew.cmu.edu> for more information.

The Association for Computational Linguistics (ACL) has a Data
Collection Initiative. For more information, contact Donald Walker at
Bellcore, walker@flash.bellcore.com.

Lists of common female first names (4967 names) and common male first
names (2924 names) are available by anonymous ftp from
ftp.cs.cmu.edu:/user/ai/areas/nlp/corpora/names/
Read the file README first. Send mail to mkant@cs.cmu.edu for more
information.

A list of 110,000 English words (one per line, in ASCII) is available
in the PD1:<MSDOS.LINGUISTICS> directory on SIMTEL20 as the files
WORDS1.ZIP, WORDS2.ZIP, WORDS3.ZIP, and WORDS4.ZIP. Although the list
is in MS-DOS files, it can easily be used on other machines (but first
you'll have to unzip the files on a DOS machine). The list includes
inflected forms of the words, such as plural nouns and the -s, -ed,
and -ing forms of verbs; thus the number of lexical stems in the list
is considerably smaller than the total number of word forms. These
files are available via FTP from WSMR-SIMTEL20.ARMY.MIL
[192.88.110.20]. SIMTEL20 files are mirrored on wuarchive.wustl.edu.

The Collins English Dictionary encoded as a Prolog fact base is
available from the Oxford Text Archive by anonymous ftp from
ota.ox.ac.uk:/pub/ota/dicts/1192/ [129.67.1.165]
The Oxford Text Archive includes many other texts, dictionaries,
thesauri, word lists, and so on, most of which are available for
scholarly use and research only. See the files
   ota.ox.ac.uk:/pub/ota/textarchive.form
   ota.ox.ac.uk:/pub/ota/textarchive.info
   ota.ox.ac.uk:/pub/ota/textarchive.list
   ota.ox.ac.uk:/pub/ota/textarchive.sgml
for more information, or write to archive@ox.ac.uk, Oxford Text
Archive, Oxford University Computing Services, 13 Banbury Road, Oxford
OX2 6NN, UK, call 44-865-273238 or fax 44-865-273275.

Chuck Wooters <wooters@icsi.berkeley.edu> has extracted the most
likely pronunciation for each of about 6100 words in the hand-labeled
TIMIT database, and made them available by anonymous ftp from
ftp.icsi.berkeley.edu:/pub/speech/TIMIT.mostlikely.Z.

A list of homophones from general American English is available by
anonymous ftp from svr-ftp.eng.cam.ac.uk:/comp.speech/data/ as the
file homophones-1.01.txt. To receive the list by email, send mail to
Evan.Antworth@sil.org. The list was compiled by Tony Robinson.

Sigurd P. Crossland <sig@seuss.vantage.gte.com> has been compiling a
dictionary of English words, including most common American words,
abbreviations, hyphenations, and even incorrect spellings. The most
recent version is available by anonymous ftp from
wocket.vantage.gte.com:/pub/standard_dictionary/dic-0394.tar.gz
The tar file includes 31 text files, one for each word length from 2
to 32. The compressed tar file takes up just over 4 MB of space, and
includes approximately 870,000 words.

WordNet is an English lexical reference system based on current
psycholinguistic theories of human lexical memory. It organizes nouns,
verbs and adjectives into synonym sets corresponding to lexical
concepts. The sets are linked by a variety of relations. Besides being
of scientific interest, it makes a handy thesaurus. WordNet is
available by anonymous ftp from clarity.princeton.edu:/pub/
If you retrieve a copy of WordNet by ftp, please send mail to
wordnet@princeton.edu.

Commercial:

The Oxford Text Archive has hundreds of online texts in a wide variety
of languages, including a few dictionaries (the OED, Collins, etc.).
The Lancaster-Oslo-Bergen (LOB), Brown, and London-Lund corpora are
also available from them. For more information, write to Oxford
Electronic Publishing, Oxford University Press, 200 Madison Avenue,
New York, NY 10016, call 212-889-0206, or send mail to
archive@vax.oxford.ac.uk. (Their contact information in England is
Oxford Text Archive, Oxford University Computing Services, 13 Banbury
Road, Oxford OX2 6NN, UK, +44 (865) 273238.)

Mailing Lists:

CORPORA is a mailing list for Text Corpora. It welcomes information
and questions about text corpora such as availability, aspects of
compiling and using corpora, software, tagging, parsing, and
bibliography. To be added to the list, send a message to
corpora-request@x400.hd.uib.no. Contributions should be sent to
corpora@x400.hd.uib.no.
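
Example:

Several of the free word lists above (/usr/dict/words, the SIMTEL20
list) are plain ASCII with one word per line, and Crossland's
dictionary is split into one file per word length. The Python sketch
below shows one way such a list might be loaded and grouped by length.
The function names load_words and group_by_length are hypothetical,
chosen for illustration only, and the file path is an assumption that
will not hold on every system.

```python
from collections import defaultdict
from pathlib import Path


def load_words(lines):
    """Return the distinct, lower-cased words from one-word-per-line input."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word:  # skip blank lines
            words.add(word)
    return words


def group_by_length(words):
    """Map each word length to a sorted list of the words of that length."""
    groups = defaultdict(list)
    for word in sorted(words):
        groups[len(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    # Assumed location; many modern systems use /usr/share/dict/words instead.
    path = Path("/usr/dict/words")
    if path.exists():
        with path.open(encoding="utf-8", errors="replace") as handle:
            by_length = group_by_length(load_words(handle))
        # Print each word length followed by the number of words of that length.
        for length in sorted(by_length):
            print(length, len(by_length[length]))
```

Lower-casing merges entries such as "Apple" and "apple"; drop the
call to lower() if capitalization matters, as it does for the lists
of first names.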
Send corrections/additions to the FAQ Maintainer:
crabbe@usna.edu
Last Update September 05 2008 @ 00:12 AM