Top Document: comp.ai.neural-nets FAQ, Part 4 of 7: Books, data, etc.
Previous Document: How to benchmark learning methods?
1. UCI machine learning database
++++++++++++++++++++++++++++++++

A large collection of data sets accessible via anonymous FTP at
ftp.ics.uci.edu [22.214.171.124] in directory
/pub/machine-learning-databases or via web browser at
http://www.ics.uci.edu/~mlearn/MLRepository.html

2. UCI KDD Archive
++++++++++++++++++

The UC Irvine Knowledge Discovery in Databases (KDD) Archive at
http://kdd.ics.uci.edu/ is an online repository of large datasets which
encompasses a wide variety of data types, analysis tasks, and
application areas. The primary role of this repository is to serve as a
benchmark testbed to enable researchers in knowledge discovery and data
mining to scale existing and future data analysis algorithms to very
large and complex data sets. This archive is supported by the
Information and Data Management Program at the National Science
Foundation, and is intended to expand the current UCI Machine Learning
Database Repository to datasets that are orders of magnitude larger and
more complex.

3. The neural-bench Benchmark collection
++++++++++++++++++++++++++++++++++++++++

Accessible at http://www.boltz.cs.cmu.edu/ or via anonymous FTP at
ftp://ftp.boltz.cs.cmu.edu/pub/neural-bench/. In case of problems or if
you want to donate data, email contact is
"firstname.lastname@example.org". The data sets in this repository include
the 'nettalk' data, 'two spirals', protein structure prediction, vowel
recognition, sonar signal classification, and a few others.

4. Proben1
++++++++++

Proben1 is a collection of 12 learning problems consisting of real data.
The data files all share a single simple common format. Along with the
data comes a technical report describing a set of rules and conventions
for performing and reporting benchmark tests and their results.
Accessible via anonymous FTP on ftp.cs.cmu.edu [126.96.36.199] as
/afs/cs/project/connect/bench/contrib/prechelt/proben1.tar.gz and also
on ftp.ira.uka.de as /pub/neuron/proben1.tar.gz.
The file is about 1.8 MB and unpacks into about 20 MB.

5. Delve: Data for Evaluating Learning in Valid Experiments
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Delve is a standardised, copyrighted environment designed to evaluate
the performance of learning methods. Delve makes it possible for users
to compare their learning methods with other methods on many datasets.
The Delve learning methods and evaluation procedures are well
documented, so that meaningful comparisons can be made. The data
collection includes not only isolated data sets, but "families" of data
sets in which properties of the data, such as number of inputs and
degree of nonlinearity or noise, are systematically varied. The Delve
web page is at http://www.cs.toronto.edu/~delve/

6. Bilkent University Function Approximation Repository
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

A repository of data sets collected mainly by searching resources on the
web can be found at http://funapp.cs.bilkent.edu.tr/DataSets/ Most of
the data sets are used for the experimental analysis of function
approximation techniques and for training and demonstration by the
machine learning and statistics community. The original sources of most
data sets can be accessed via associated links. A compressed tar file
containing all data sets is available.

7. NIST special databases of the National Institute Of Standards
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
And Technology:
+++++++++++++++

Several large databases, each delivered on a CD-ROM. Here is a quick
list.
 o NIST Binary Images of Printed Digits, Alphas, and Text
 o NIST Structured Forms Reference Set of Binary Images
 o NIST Binary Images of Handwritten Segmented Characters
 o NIST 8-bit Gray Scale Images of Fingerprint Image Groups
 o NIST Structured Forms Reference Set 2 of Binary Images
 o NIST Test Data 1: Binary Images of Hand-Printed Segmented Characters
 o NIST Machine-Print Database of Gray Scale and Binary Images
 o NIST 8-Bit Gray Scale Images of Mated Fingerprint Card Pairs
 o NIST Supplemental Fingerprint Card Data (SFCD) for NIST Special
   Database 9
 o NIST Binary Image Databases of Census Miniforms (MFDB)
 o NIST Mated Fingerprint Card Pairs 2 (MFCP 2)
 o NIST Scoring Package Release 1.0
 o NIST FORM-BASED HANDPRINT RECOGNITION SYSTEM

Here are example descriptions of two of these databases:

NIST special database 2: Structured Forms Reference Set (SFRS)
--------------------------------------------------------------

The NIST database of structured forms contains 5,590 full page images of
simulated tax forms completed using machine print. THERE IS NO REAL TAX
DATA IN THIS DATABASE. The structured forms used in this database are 12
different forms from the 1988 IRS 1040 Package X. These include Forms
1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E,
F and SE. Eight of these forms contain two pages or form faces, making a
total of 20 form faces represented in the database. Each image is stored
in bi-level black and white raster format. The images in this database
appear to be real forms prepared by individuals, but the images have
been automatically derived and synthesized using a computer and contain
no "real" tax data. The entry field values on the forms have been
automatically generated by a computer in order to make the data
available without the danger of distributing privileged tax information.
In addition to the images, the database includes 5,590 answer files, one
for each image.
Each answer file contains an ASCII representation of the data found in
the entry fields on the corresponding image. Image format documentation
and example software are also provided. The uncompressed database totals
approximately 5.9 gigabytes of data.

NIST special database 3: Binary Images of Handwritten Segmented
---------------------------------------------------------------
Characters (HWSC)
-----------------

Contains 313,389 isolated character images segmented from the 2,100
full-page images distributed with "NIST Special Database 1": 223,125
digits, 44,951 upper-case, and 45,313 lower-case character images. Each
character image has been centered in a separate 128 by 128 pixel region;
the error rate of the segmentation and assigned classification is less
than 0.1%. The uncompressed database totals approximately 2.75 gigabytes
of image data and includes image format documentation and example
software.

The system requirements for all databases are a 5.25" CD-ROM drive with
software to read ISO-9660 format.

Contact: Darrin L. Dimmick; email@example.com; (301)975-4147

The prices of the databases are between US$250 and US$1,895. If you wish
to order a database, please contact:

   Standard Reference Data
   National Institute of Standards and Technology
   221/A323
   Gaithersburg, MD 20899
   Phone: (301)975-2208
   FAX: (301)926-0416

Samples of the data can be found by ftp on sequoyah.ncsl.nist.gov in
directory /pub/data. A more complete description of the available
databases can be obtained from the same host as
/pub/databases/catalog.txt

8. CEDAR CD-ROM 1: Database of Handwritten Cities, States,
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ZIP Codes, Digits, and Alphabetic Characters
++++++++++++++++++++++++++++++++++++++++++++

The Center Of Excellence for Document Analysis and Recognition (CEDAR),
State University of New York at Buffalo, announces the availability of
CEDAR CDROM 1: USPS Office of Advanced Technology. The database contains
handwritten words and ZIP Codes in high resolution grayscale (300 ppi
8-bit) as well as binary handwritten digits and alphabetic characters
(300 ppi 1-bit). This database is intended to encourage research in
off-line handwriting recognition by providing access to handwriting
samples digitized from envelopes in a working post office.

Specifications of the database include:

 + 300 ppi 8-bit grayscale handwritten words (cities, states, ZIP Codes)
    o 5632 city words
    o 4938 state words
    o 9454 ZIP Codes
 + 300 ppi binary handwritten characters and digits:
    o 27,837 mixed alphas and numerics segmented from address blocks
    o 21,179 digits segmented from ZIP Codes
 + every image supplied with a manually determined truth value
 + extracted from live mail in a working U.S. Post Office
 + word images in the test set supplied with dictionaries of postal
   words that simulate partial recognition of the corresponding ZIP Code.
 + digit images included in test set that simulate automatic ZIP Code
   segmentation. Results on these data can be projected to overall ZIP
   Code recognition performance.
 + image format documentation and software included

System requirements are a 5.25" CD-ROM drive with software to read
ISO-9660 format. For further information, see
http://www.cedar.buffalo.edu/Databases/CDROM1/ or send email to Ajay
Shekhawat at <ajay@cedar.Buffalo.EDU>

There is also a CEDAR CDROM-2, a database of machine-printed Japanese
character images.

9. AI-CD-ROM (see question "Other sources of information")
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

10. Time series
+++++++++++++++

Santa Fe Competition
--------------------

Various datasets of time series (to be used for prediction learning
problems) are available for anonymous ftp from ftp.santafe.edu in
/pub/Time-Series. Data sets include:

 o Fluctuations in a far-infrared laser
 o Physiological data of patients with sleep apnea
 o High frequency currency exchange rate data
 o Intensity of a white dwarf star
 o J.S. Bach's final (unfinished) fugue from "Die Kunst der Fuge"

Some of the datasets were used in a prediction contest and are described
in detail in the book "Time series prediction: Forecasting the future
and understanding the past", edited by Weigend and Gershenfeld,
Proceedings Volume XV in the Santa Fe Institute Studies in the Sciences
of Complexity series of Addison-Wesley (1994).

M3 Competition
--------------

3003 time series from the M3 Competition can be found at
http://forecasting.cwru.edu/Data/index.html

The numbers of series of various types are given in the following table:

Interval   Micro  Industry  Macro  Finance  Demog  Other  Total
Yearly       146       102     83       58    245     11    645
Quarterly    204        83    336       76     57      0    756
Monthly      474       334    312      145    111     52   1428
Other          4         0      0       29      0    141    174
Total        828       519    731      308    413    204   3003

Rob Hyndman's Time Series Data Library
--------------------------------------

A collection of over 500 time series on subjects including agriculture,
chemistry, crime, demography, ecology, economics & finance, health,
hydrology & meteorology, industry, physics, production, sales, simulated
series, sport, transport & tourism, and tree-rings can be found at
http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/

11. Financial data
++++++++++++++++++

http://chart.yahoo.com/d?s=
http://www.chdwk.com/data/index.html

12. USENIX Faces
++++++++++++++++

The USENIX faces archive is a public database, accessible by ftp, that
can be of use to people working in the fields of human face recognition,
classification and the like.
It currently contains 5592 different faces (taken at USENIX conferences)
and is updated twice each year. The images are mostly 96x128 greyscale
frontal images and are stored in ASCII files in a way that makes it easy
to convert them to any usual graphic format (GIF, PCX, PBM etc.). Source
code for viewers, filters, etc. is provided. Each image file takes
approximately 25K.

For further information, see http://facesaver.usenix.org/

According to the archive administrator, Barbara L. Dijker
(firstname.lastname@example.org), there is no restriction on their use.

However, the image files are stored in separate directories
corresponding to the Internet site to which the person represented in
the image belongs, with each directory containing a small number of
images (two on average). This makes it difficult to retrieve by ftp even
a small part of the database, as you have to get each one individually.

A solution, as Barbara suggested to me, would be to compress the whole
set of images (in separate files of, say, 100 images) and maintain them
as a specific archive for research on face processing, similar to the
ones that already exist for fingerprints and others. The whole
compressed database would take some 30 megabytes of disk space. I
encourage anyone willing to host this database on his/her site,
available for anonymous ftp, to contact her for details (unfortunately I
don't have the resources to set up such a site).

Please consider that UUNET has graciously provided the ftp server for
the FaceSaver archive and may discontinue that service if it becomes a
burden. This means that people should not download more than maybe 10
faces at a time from uunet.

A last remark: each file represents a different person (except for
isolated cases). This makes the database quite unsuitable for training
neural networks, since for proper generalisation several instances of
the same subject are required. However, it is still useful as a test set
for a trained network.

13. Linguistic Data Consortium
++++++++++++++++++++++++++++++

The Linguistic Data Consortium (URL:
http://www.ldc.upenn.edu/ldc/noframe.html) is an open consortium of
universities, companies and government research laboratories. It
creates, collects and distributes speech and text databases, lexicons,
and other resources for research and development purposes. The
University of Pennsylvania is the LDC's host institution. The LDC
catalog includes pronunciation lexicons, varied lexicons, broadcast
speech, microphone speech, mobile-radio speech, telephone speech,
broadcast text, conversation text, newswire text, parallel text, and
varied text, at widely varying fees.

   Linguistic Data Consortium
   University of Pennsylvania
   3615 Market Street, Suite 200
   Philadelphia, PA 19104-2608
   Tel (215) 898-0464
   Fax (215) 573-2175
   Email: email@example.com

14. Otago Speech Corpus
+++++++++++++++++++++++

The Otago Speech Corpus contains speech samples in RIFF WAVE format that
can be downloaded from
http://divcom.otago.ac.nz/infosci/kel/software/RICBIS/hyspeech_main.html

15. Astronomical Time Series
++++++++++++++++++++++++++++

Prepared by Paul L. Hertz (Naval Research Laboratory) & Eric D.
Feigelson (Pennsylvania State University):

 o Detection of variability in photon counting observations 1
   (QSO1525+337)
 o Detection of variability in photon counting observations 2
   (H0323+022)
 o Detection of variability in photon counting observations 3 (SN1987A)
 o Detecting orbital and pulsational periodicities in stars 1 (binaries)
 o Detecting orbital and pulsational periodicities in stars 2
   (variables)
 o Cross-correlation of two time series 1 (Sun)
 o Cross-correlation of two time series 2 (OJ287)
 o Periodicity in a gamma ray burster (GRB790305)
 o Solar cycles in sunspot numbers (Sun)
 o Deconvolution of sources in a scanning operation (HEAO A-1)
 o Fractal time variability in a Seyfert galaxy (NGC5506)
 o Quasi-periodic oscillations in X-ray binaries (GX5-1)
 o Deterministic chaos in an X-ray pulsar? (Her X-1)

URL: http://xweb.nrl.navy.mil/www_hertz/timeseries/timeseries.html

16. Miscellaneous Images
++++++++++++++++++++++++

The USC-SIPI Image Database:
   http://sipi.usc.edu/services/database/Database.html
CityU Image Processing Lab:
   http://www.image.cityu.edu.hk/images/database.html
Center for Image Processing Research:
   http://cipr.rpi.edu/
Computer Vision Test Images:
   http://www.cs.cmu.edu:80/afs/cs/project/cil/ftp/html/v-images.html
Lenna 97: A Complete Story of Lenna:
   http://www.image.cityu.edu.hk/images/lenna/Lenna97.html

17. StatLib
+++++++++++

The StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon
University has a large collection of data sets, many of which can be
used with NNs.

------------------------------------------------------------------------

Next part is part 5 (of 7). Previous part is part 3.

-- 
Warren S. Sarle                 SAS Institute Inc.   The opinions expressed here
firstname.lastname@example.org  SAS Campus Drive     are mine and not necessarily
(919) 677-8000                  Cary, NC 27513, USA  those of SAS Institute.
Send corrections/additions to the FAQ Maintainer:
email@example.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM