FAQ, Part 4 of 7: Books, data, etc.
Section - Databases for experimentation with NNs?

Top Document: FAQ, Part 4 of 7: Books, data, etc.
Previous Document: How to benchmark learning methods?

1. UCI machine learning database

   A large collection of data sets accessible via anonymous FTP in directory
   "/pub/machine-learning-databases" or via web browser.

2. UCI KDD Archive

   The UC Irvine Knowledge Discovery in Databases (KDD) Archive is an online
   repository of large datasets which encompasses a wide variety of data
   types, analysis tasks, and application
   areas. The primary role of this repository is to serve as a benchmark
   testbed to enable researchers in knowledge discovery and data mining to
   scale existing and future data analysis algorithms to very large and
   complex data sets. This archive is supported by the Information and Data
   Management Program at the National Science Foundation, and is intended to
   expand the current UCI Machine Learning Database Repository to datasets
   that are orders of magnitude larger and more complex. 

3. The neural-bench Benchmark collection

   Accessible via the web or via anonymous FTP. In case of problems or if
   you want to donate data, contact the archive by email. The
   data sets in this repository include the 'nettalk' data, 'two spirals',
   protein structure prediction, vowel recognition, sonar signal
   classification, and a few others. 

4. Proben1

   Proben1 is a collection of 12 learning problems consisting of real data.
   The datafiles all share a single simple common format. Along with the
   data comes a technical report describing a set of rules and conventions
   for performing and reporting benchmark tests and their results.
   Accessible via anonymous FTP as
   /afs/cs/project/connect/bench/contrib/prechelt/proben1.tar.gz and also,
   on a second host, as /pub/neuron/proben1.tar.gz. The file is about 1.8 MB
   and unpacks into about 20 MB.

5. Delve: Data for Evaluating Learning in Valid Experiments

   Delve is a standardised, copyrighted environment designed to evaluate the
   performance of learning methods. Delve makes it possible for users to
   compare their learning methods with other methods on many datasets. The
   Delve learning methods and evaluation procedures are well documented,
   such that meaningful comparisons can be made. The data collection
   includes not only isolated data sets, but "families" of data sets in
   which properties of the data, such as number of inputs and degree of
   nonlinearity or noise, are systematically varied. Delve has its own web
   page.
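
   To make the idea of a "family" of data sets concrete, here is a minimal
   sketch in Python. It is not Delve's own generator: the function name
   make_family, the sine target function, and the noise levels are all
   assumptions chosen for illustration. It builds several versions of one
   regression task that share the same inputs and differ only in the amount
   of noise added to the targets.

```python
import math
import random

def make_family(n_cases=200, noise_levels=(0.0, 0.1, 0.5), seed=0):
    """Return {noise_sd: [(x, y), ...]} for one task at several noise levels.

    Hypothetical illustration only; this is not how Delve builds its data.
    """
    rng = random.Random(seed)
    # Draw the inputs once so every family member shares the same x values.
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n_cases)]
    family = {}
    for sd in noise_levels:
        # Same underlying function; only the noise standard deviation changes.
        family[sd] = [(x, math.sin(x) + rng.gauss(0.0, sd)) for x in xs]
    return family

family = make_family()
for sd, cases in family.items():
    # Mean squared deviation from the noise-free target for this member.
    mse = sum((y - math.sin(x)) ** 2 for x, y in cases) / len(cases)
    print(f"noise sd={sd}: {len(cases)} cases, noise MSE={mse:.3f}")
```

   Running a learning method on each member of such a family shows how its
   error changes as one property of the data is varied while everything else
   is held fixed.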

6. Bilkent University Function Approximation Repository

   A repository of data sets collected mainly by searching resources on the
   web. Most of the data sets are used for the experimental analysis of
   function approximation techniques and for training and demonstration by
   the machine learning and statistics communities. The original sources of
   most data sets
   can be accessed via associated links. A compressed tar file containing
   all data sets is available. 

7. NIST special databases of the National Institute Of Standards
   And Technology:

   Several large databases, each delivered on a CD-ROM. Here is a quick
   list:
    o NIST Binary Images of Printed Digits, Alphas, and Text 
    o NIST Structured Forms Reference Set of Binary Images 
    o NIST Binary Images of Handwritten Segmented Characters 
    o NIST 8-bit Gray Scale Images of Fingerprint Image Groups 
    o NIST Structured Forms Reference Set 2 of Binary Images 
    o NIST Test Data 1: Binary Images of Hand-Printed Segmented Characters 
    o NIST Machine-Print Database of Gray Scale and Binary Images 
    o NIST 8-Bit Gray Scale Images of Mated Fingerprint Card Pairs 
    o NIST Supplemental Fingerprint Card Data (SFCD) for NIST Special
      Database 9 
    o NIST Binary Image Databases of Census Miniforms (MFDB) 
    o NIST Mated Fingerprint Card Pairs 2 (MFCP 2) 
    o NIST Scoring Package Release 1.0 
   Here are example descriptions of two of these databases: 

   NIST special database 2: Structured Forms Reference Set (SFRS)

   The NIST database of structured forms contains 5,590 full page images of
   simulated tax forms completed using machine print. THERE IS NO REAL TAX
   DATA IN THIS DATABASE. The structured forms used in this database are 12
   different forms from the 1988 IRS 1040 Package X. These include Forms
   1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F
   and SE. Eight of these forms contain two pages or form faces making a
   total of 20 form faces represented in the database. Each image is stored
   in bi-level black and white raster format. The images in this database
   appear to be real forms prepared by individuals but the images have been
   automatically derived and synthesized using a computer and contain no
   "real" tax data. The entry field values on the forms have been
   automatically generated by a computer in order to make the data available
   without the danger of distributing privileged tax information. In
   addition to the images the database includes 5,590 answer files, one for
   each image. Each answer file contains an ASCII representation of the data
   found in the entry fields on the corresponding image. Image format
   documentation and example software are also provided. The uncompressed
   database totals approximately 5.9 gigabytes of data. 

   NIST special database 3: Binary Images of Handwritten Segmented
   Characters (HWSC)

   Contains 313,389 isolated character images segmented from the 2,100
   full-page images distributed with "NIST Special Database 1": 223,125
   digits, 44,951 upper-case, and 45,313 lower-case character images. Each
   character image has been centered in a separate 128 by 128 pixel region;
   the error rate of the segmentation and assigned classification is less
   than 0.1%. The uncompressed database totals approximately 2.75 gigabytes
   of image data and includes image format documentation and example
   software.
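
   The centering of a character in a fixed 128 by 128 region can be sketched
   as follows. This is a minimal illustration and not the NIST software: the
   helper name center_in_region and the representation of an image as a list
   of rows of 0/1 values (1 = ink) are assumptions.

```python
def center_in_region(glyph, size=128):
    """Place a small binary image (list of rows of 0/1) in the middle of a
    size x size background of zeros. Hypothetical helper for illustration."""
    height = len(glyph)
    width = len(glyph[0]) if height else 0
    if height > size or width > size:
        raise ValueError("glyph does not fit in the target region")
    # Integer offsets that leave equal margins (off by at most one pixel).
    top = (size - height) // 2
    left = (size - width) // 2
    region = [[0] * size for _ in range(size)]
    for r, row in enumerate(glyph):
        region[top + r][left:left + width] = row
    return region

# A glyph 3 rows high and 2 columns wide lands at row 62, column 63.
glyph = [[1, 1], [1, 0], [1, 1]]
region = center_in_region(glyph)
print(len(region), len(region[0]), sum(map(sum, region)))
```

   Giving every character the same region size means a network can use a
   fixed number of inputs regardless of how large the original character
   was written.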

   The system requirements for all databases are a 5.25" CD-ROM drive with
   software to read ISO-9660 format. Contact: Darrin L. Dimmick; (301)975-4147

   The prices of the databases are between US$ 250 and US$ 1895. If you wish to
   order a database, please contact: Standard Reference Data; National
   Institute of Standards and Technology; 221/A323; Gaithersburg, MD 20899;
   Phone: (301)975-2208; FAX: (301)926-0416

   Samples of the data can be found by ftp in directory /pub/data. A more
   complete description of the available databases can be obtained from the
   same host.

8. CEDAR CD-ROM 1: Database of Handwritten Cities, States,
   ZIP Codes, Digits, and Alphabetic Characters

   The Center Of Excellence for Document Analysis and Recognition (CEDAR),
   State University of New York at Buffalo, announces the availability of
   CEDAR CDROM 1: USPS Office of Advanced Technology. The database contains
   handwritten words and ZIP Codes in high resolution grayscale (300 ppi
   8-bit) as well as binary handwritten digits and alphabetic characters
   (300 ppi 1-bit). This database is intended to encourage research in
   off-line handwriting recognition by providing access to handwriting
   samples digitized from envelopes in a working post office. 

        Specifications of the database include:
        +    300 ppi 8-bit grayscale handwritten words (cities,
             states, ZIP Codes)
             o    5632 city words
             o    4938 state words
             o    9454 ZIP Codes
        +    300 ppi binary handwritten characters and digits:
             o    27,837 mixed alphas and numerics segmented
                  from address blocks
             o    21,179 digits segmented from ZIP Codes
        +    every image supplied with a manually determined
             truth value
        +    extracted from live mail in a working U.S. Post
             Office
        +    word images in the test set supplied with
             dictionaries of postal words that simulate partial
             recognition of the corresponding ZIP Code
        +    digit images included in test set that simulate
             automatic ZIP Code segmentation. Results on these
             data can be projected to overall ZIP Code
             recognition performance.
        +    image format documentation and software included

   System requirements are a 5.25" CD-ROM drive with software to read
   ISO-9660 format. For further information, send email to Ajay
   Shekhawat at <ajay@cedar.Buffalo.EDU>

   There is also a CEDAR CDROM-2, a database of machine-printed Japanese
   character images. 

9. AI-CD-ROM (see question "Other sources of information")

10. Time series

   Santa Fe Competition

   Various datasets of time series (to be used for prediction learning
   problems) are available for anonymous ftp in "/pub/Time-Series". Data
   sets include:
    o Fluctuations in a far-infrared laser
    o Physiological data of patients with sleep apnea
    o High frequency currency exchange rate data
    o Intensity of a white dwarf star
    o J.S. Bach's final (unfinished) fugue from "Die Kunst der Fuge"
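
   To use a series like these for prediction learning, it is commonly recast
   as input/target pairs built from lagged values. A minimal sketch follows;
   the function name make_windows, the window length, and the toy series are
   assumptions for illustration and are not part of the competition data.

```python
def make_windows(series, n_lags=3, horizon=1):
    """Turn a sequence into (inputs, target) pairs: n_lags consecutive
    values as inputs, the value `horizon` steps later as the target."""
    pairs = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        inputs = list(series[start:start + n_lags])
        target = series[start + n_lags + horizon - 1]
        pairs.append((inputs, target))
    return pairs

toy = [10, 11, 13, 16, 20, 25]
for inputs, target in make_windows(toy):
    print(inputs, "->", target)
```

   Each pair can then be fed to a network as one training case; a series
   shorter than n_lags + horizon simply yields no pairs.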

   Some of the datasets were used in a prediction contest and are described
   in detail in the book "Time series prediction: Forecasting the future and
   understanding the past", edited by Weigend and Gershenfeld, Proceedings
   Volume XV in the Santa Fe Institute Studies in the Sciences of Complexity
   series of Addison Wesley (1994). 

   M3 Competition

   3003 time series from the M3 Competition are available.

   The numbers of series of various types are given in the following table: 

   Interval  Micro Industry    Macro  Finance    Demog    Other    Total
   Yearly      146      102       83       58      245       11      645
   Quarterly   204       83      336       76       57        0      756
   Monthly     474      334      312      145      111       52     1428
   Other         4        0        0       29        0      141      174
   Total       828      519      731      308      413      204     3003
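
   The margins of this table can be recomputed from its cells, which is a
   handy sanity check after loading the series. A minimal sketch, with
   variable names chosen here for illustration:

```python
# Counts copied from the table above (rows: interval, columns: category).
categories = ["Micro", "Industry", "Macro", "Finance", "Demog", "Other"]
counts = {
    "Yearly":    [146, 102,  83,  58, 245,  11],
    "Quarterly": [204,  83, 336,  76,  57,   0],
    "Monthly":   [474, 334, 312, 145, 111,  52],
    "Other":     [  4,   0,   0,  29,   0, 141],
}

# Sum across each row, down each column, and over the whole table.
row_totals = {interval: sum(row) for interval, row in counts.items()}
col_totals = {c: sum(row[i] for row in counts.values())
              for i, c in enumerate(categories)}
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```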

   Rob Hyndman's Time Series Data Library

   A collection of over 500 time series on subjects including agriculture,
   chemistry, crime, demography, ecology, economics & finance, health,
   hydrology & meteorology, industry, physics, production, sales, simulated
   series, sport, transport & tourism, and tree-rings.

11. Financial data

12. USENIX Faces

   The USENIX faces archive is a public database, accessible by ftp, that
   can be of use to people working in the fields of human face recognition,
   classification and the like. It currently contains 5592 different faces
   (taken at USENIX conferences) and is updated twice each year. The images
   are mostly 96x128 greyscale frontal images and are stored in ASCII files
   in a way that makes it easy to convert them to any usual graphic format
   (GIF, PCX, PBM etc.). Source code for viewers, filters, etc. is provided.
   Each image file takes approximately 25K. 
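
   As an illustration of such a conversion, here is a minimal sketch that
   turns ASCII pixel data into a binary PGM file, the greyscale sibling of
   the PBM format mentioned above. The layout of the archive's files is not
   described here, so the sketch ASSUMES two hexadecimal digits per 8-bit
   pixel, stored row by row; the function name hex_text_to_pgm is also
   hypothetical. Check the archive's own documentation and viewers before
   relying on it.

```python
def hex_text_to_pgm(hex_text, width, height):
    """Convert ASCII hex pixel data to a binary PGM (P5) byte string.

    ASSUMPTION: two hex digits per 8-bit pixel, row by row, whitespace
    ignored. The real archive files may use a different layout.
    """
    digits = "".join(hex_text.split())
    pixels = bytes.fromhex(digits)
    if len(pixels) != width * height:
        raise ValueError("expected %d pixels, got %d"
                         % (width * height, len(pixels)))
    # PGM header: magic number, dimensions, maximum grey value.
    header = b"P5\n%d %d\n255\n" % (width, height)
    return header + pixels

# Tiny 4 x 2 example; a real face would use width=96, height=128.
sample = "00 40 80 ff\nff 80 40 00\n"
pgm = hex_text_to_pgm(sample, width=4, height=2)
print(pgm[:11], len(pgm))
```

   Under this assumption a 96x128 image needs 24,576 hex characters, which
   is in line with the roughly 25K per file quoted above.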

   For further information, see

   According to the archive administrator, Barbara L. Dijker, there is no
   restriction on their use.
   However, the image files are stored in separate directories corresponding
   to the Internet site to which the person represented in the image
   belongs, with each directory containing a small number of images (two on
   average). This makes it difficult to retrieve by ftp even a small
   part of the database, as you have to get each one individually.
   A solution, as Barbara proposed to me, would be to compress the whole set
   of images (in separate files of, say, 100 images) and maintain them as a
   specific archive for research on face processing, similar to the ones
   that already exist for fingerprints and others. The whole compressed
   database would take some 30 megabytes of disk space. I encourage anyone
   willing to host this database on his/her site, available for anonymous
   ftp, to contact her for details (unfortunately I don't have the resources
   to set up such a site).

   Please consider that UUNET has graciously provided the ftp server for the
   FaceSaver archive and may discontinue that service if it becomes a
   burden. This means that people should not download more than maybe 10
   faces at a time from uunet. 

   A last remark: each file represents a different person (except for
   isolated cases). This makes the database quite unsuitable for training
   neural networks, since for proper generalisation several instances of the
   same subject are required. However, it is still useful as a test set
   for a trained network.

13. Linguistic Data Consortium

   The Linguistic Data Consortium is an open consortium of
   universities, companies and government research laboratories. It creates,
   collects and distributes speech and text databases, lexicons, and other
   resources for research and development purposes. The University of
   Pennsylvania is the LDC's host institution. The LDC catalog includes
   pronunciation lexicons, varied lexicons, broadcast speech, microphone
   speech, mobile-radio speech, telephone speech, broadcast text,
   conversation text, newswire text, parallel text, and varied text, at
   widely varying fees. 

      Linguistic Data Consortium 
      University of Pennsylvania 
      3615 Market Street, Suite 200 
      Philadelphia, PA 19104-2608 
      Tel (215) 898-0464 Fax (215) 573-2175

14. Otago Speech Corpus

   The Otago Speech Corpus contains speech samples in RIFF WAVE format that
   can be downloaded.
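
   RIFF WAVE files can be inspected with the wave module in Python's
   standard library. The following is a minimal sketch; the helper name
   describe_wave is an assumption, and the test tone built in memory merely
   stands in for a corpus file so that the example runs on its own.

```python
import io
import math
import struct
import wave

def describe_wave(fileobj):
    """Return basic header fields of a RIFF WAVE file (path string or
    binary file-like object)."""
    with wave.open(fileobj, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "frames": frames,
            "duration_s": frames / rate,
        }

# Build a 0.1 s, 440 Hz mono test tone in memory so the sketch runs
# without any corpus file; pass a real path to describe_wave instead.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)          # 16-bit samples
    out.setframerate(8000)
    samples = [int(20000 * math.sin(2 * math.pi * 440 * n / 8000))
               for n in range(800)]
    out.writeframes(struct.pack("<%dh" % len(samples), *samples))
buffer.seek(0)
info = describe_wave(buffer)
print(info)
```

   Checking the sample rate and sample width first is useful because they
   determine how the raw frames must be unpacked before they are presented
   to a network.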

15. Astronomical Time Series

   Prepared by Paul L. Hertz (Naval Research Laboratory) & Eric D. Feigelson
   (Pennsylvania State University):
    o Detection of variability in photon counting observations 1
    o Detection of variability in photon counting observations 2 (H0323+022)
    o Detection of variability in photon counting observations 3 (SN1987A) 
    o Detecting orbital and pulsational periodicities in stars 1 (binaries) 
    o Detecting orbital and pulsational periodicities in stars 2 (variables)
    o Cross-correlation of two time series 1 (Sun) 
    o Cross-correlation of two time series 2 (OJ287) 
    o Periodicity in a gamma ray burster (GRB790305) 
    o Solar cycles in sunspot numbers (Sun) 
    o Deconvolution of sources in a scanning operation (HEAO A-1) 
    o Fractal time variability in a Seyfert galaxy (NGC5506)
    o Quasi-periodic oscillations in X-ray binaries (GX5-1) 
    o Deterministic chaos in an X-ray pulsar? (Her X-1) 

16. Miscellaneous Images

   The USC-SIPI Image Database:

   CityU Image Processing Lab:

   Center for Image Processing Research:

   Computer Vision Test Images:

   Lenna 97: A Complete Story of Lenna:

17. StatLib

   The StatLib repository at Carnegie Mellon
   University has a large collection of data sets, many of which can be used
   with NNs. 


Next part is part 5 (of 7). Previous part is part 3. 


Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
                      SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.

Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM