Neural Network Databases and Learning Data

Repositories and Databases
CMU AI Repository - Neural Networks
This directory contains neural networks, connectionist systems, and neural systems software and related materials.
DELVE
Delve is a standardised environment designed to evaluate the performance of methods that learn relationships based primarily on empirical data. Delve makes it possible for users to compare their learning methods with other methods on many datasets. The Delve learning methods and evaluation procedures are well documented, such that meaningful comparisons can be made.
EconData
Several hundred thousand economic time series, produced by a number of U.S. Government agencies and distributed in a variety of formats and media, can be found here. They have been put into a standard, highly efficient, easy-to-use form for personal computers and made publicly available through this site. These series include national income and product accounts (NIPA), labor statistics, price indices, current business indicators, industrial production, information on states and regions, as well as international data.
FuNet - Neural
Artificial neural network repository at FuNet.
National Space Science Data Center
The National Space Science Data Center serves as the permanent archive for NASA space science mission data. "Space science" means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science. As permanent archive, NSSDC teams with NASA's discipline-specific space science "active archives" which provide access to data to researchers and, in some cases, to the general public.
Collection of various documents, software, data, etc. related to Neural Networks.
Netlib Repository
Netlib is a collection of mathematical software, papers, and databases.
Neuroprose
Old archive of neural network papers collected during the period 1990-2000. A mirror can also be found here.
NIST Technical Databases
NIST provides well-documented numeric data to scientists and engineers for use in technical problem-solving, research, and development. These recommended values are based on data which have been extracted from the world's literature, assessed for reliability, and then evaluated to select the preferred values. These data activities are conducted by scientists at NIST and in university data centers.
RISE
RISE is a distributed repository of online information sources that are used for the empirical analysis of learning algorithms that generate extraction patterns. The sources included in this repository are provided by people from the information extraction (IE) and wrapper generation (WG) communities. Both communities use machine learning algorithms to generate extraction patterns for online information sources.
Reinforcement Learning Repository
The purpose of this web site is to provide a centralized resource for research on Reinforcement Learning (RL), which is currently an actively researched topic in artificial intelligence. This site contains resources on both RL research and applications to areas such as robotics and industrial problems. Resources available include technical publications, sample testbeds, implementations of various algorithms, online simulation packages, workshop information, and discussion forums for a variety of research areas within RL.
GANN
WWW repository of resources on Evolutionary Design of Neural Architectures (EDNA).
Solar Data Services
Data pertaining to solar activity and the upper atmosphere. Beyond the obvious considerations of heat and light, some examples of these direct and indirect solar influences are the effects on short-wave radio communications, navigation, use of satellites for communication and navigation, hazards to humans and instruments in space, electrical power transmission, geomagnetic prospecting, gas pipeline monitoring, and possibly weather and human and animal behaviour.
StatLib
StatLib is a system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW. StatLib started out as an e-mail service and some of the organization still reflects that heritage.
Time Series Data Library
This is a collection of about 800 time series, drawn from many different fields. The time series may be freely copied and used, provided this source is clearly acknowledged.
UCI Machine Learning Repository
This is a repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
WIPO Automated Categorization Datasets
The WIPO automated categorization datasets site provides information about the datasets' contents, how to get access, how to subscribe to the discussion list, and links to various relevant background sources of information.

Specific Task Data Sets
20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection.
Document Understanding Database
There are five concepts, expressed as predicates, to be learned. They concern five logical components that can be identified in a sample of business letters, namely sender, receiver, logotype, reference number and date.
DMEF
Four individual data sets, each containing customer buying history for about 100,000 customers of nationally known catalog and non-profit database marketing businesses, are available through DMEF to approved academic researchers for use within academic situations. Corporate names are anonymous and customer names and addresses have been removed, but the business type is indicated. ZIP codes have been retained (if possible) to provide a potential link to Census ZIP level demographics.
Faces
Face recognition using neural networks. 32 images of each of 20 students in the class were taken with a variety of head positions and facial expressions. These images were then used to train and test neural networks to recognize individual people, and to recognize different face poses.
KDD Cup Data 2000, 2001, 2002, 2003, 2004
Three real-world datasets are available.
Mesh Design
Training set for learning on a Finite Element Mesh Design problem. This is a complete set of the training examples. Ten different FE mesh models have been used as a source of examples.
Natural Language Learning Data
This directory contains data for natural-language learning experiments in a form suitable for Inductive Logic Programming (ILP) systems.
NIST Special Database 4
The NIST database of fingerprint images contains 2000 8-bit gray scale fingerprint image pairs. Each image is 512-by-512 pixels with 32 rows of white space at the bottom and is classified into one of five classes. The database is evenly distributed over the five classifications, with 400 fingerprint pairs from each class.
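The image layout described above (512-by-512 pixels, with the last 32 rows being white padding) suggests a simple preprocessing step. The following is a minimal sketch only: it assumes an image has already been decoded into a list of pixel rows, and the function name strip_bottom_padding is hypothetical, not part of any NIST tool.

```python
# Assumption: an image is already decoded into a list of 512 rows,
# each a list of 512 integers in 0..255 (8-bit grayscale).
IMAGE_SIZE = 512     # rows and columns, taken from the description above
PADDING_ROWS = 32    # white rows at the bottom, taken from the description above


def strip_bottom_padding(image):
    """Return the image without the trailing white rows."""
    if len(image) != IMAGE_SIZE or any(len(row) != IMAGE_SIZE for row in image):
        raise ValueError("expected a 512-by-512 image")
    return image[:IMAGE_SIZE - PADDING_ROWS]


# Synthetic stand-in for a decoded fingerprint: grey content, white padding.
fake_image = [[128] * IMAGE_SIZE for _ in range(IMAGE_SIZE - PADDING_ROWS)]
fake_image += [[255] * IMAGE_SIZE for _ in range(PADDING_ROWS)]

cropped = strip_bottom_padding(fake_image)
print(len(cropped), len(cropped[0]))  # 480 512
```

The synthetic image is only a placeholder so the sketch runs without the database; decoding the actual files is not shown here.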
PKDD'99 Discovery Challenge Datasets
PKDD'99 Discovery Challenge conference data. Data is divided into 2 groups: financial and medical.
Sensor Data of a Mobile Robot
A set of data sets, where each data set is represented in first order logic. Valid restrictions of all data sets are: facts can be linked using the argument of type TIME, and there are never two different facts concerning the same sensor and the same point in time. Each data set corresponds to learning disjoint concepts at one level. The levels are organized in a hierarchy.
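The second restriction above (never two different facts for the same sensor at the same point in time) can be checked mechanically. The sketch below is illustrative only: the Fact tuple is a hypothetical flattened view of a first-order fact, and the predicate and sensor names are invented; the real argument layout is defined by the data set itself.

```python
from collections import namedtuple

# Hypothetical flattened view of a first-order fact.
Fact = namedtuple("Fact", ["predicate", "sensor", "time"])


def find_conflicts(facts):
    """Return pairs of different facts that share a sensor and a time point."""
    seen = {}        # (sensor, time) -> first fact seen for that key
    conflicts = []
    for fact in facts:
        key = (fact.sensor, fact.time)
        earlier = seen.get(key)
        if earlier is None:
            seen[key] = fact
        elif earlier != fact:
            conflicts.append((earlier, fact))
    return conflicts


# Invented example facts, not taken from the data sets.
valid = [
    Fact("increasing", "s1", 1),
    Fact("stable", "s1", 2),
    Fact("stable", "s2", 1),
]
invalid = valid + [Fact("decreasing", "s1", 2)]

print(find_conflicts(valid))         # []
print(len(find_conflicts(invalid)))  # 1
```

Keying on (sensor, time) also reflects the first restriction: the TIME argument is what allows facts from different sensors to be linked.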
Stage Transitions Data Set
The work on reverse engineering a human controlling a simulation of an F-16 aircraft required the definition of a flight plan (sequence of flight stages). In that work a controller was semi-automatically induced from behavioural traces using Quinlan's C4.5. For each stage and for each aircraft control a tree was induced. When using the induced controller as an auto-pilot the values for the aircraft controls are obtained by interpreting the corresponding decision tree for that control and for that flight stage. The transitions from stage to stage were done by means of hand-coded rules.
Text Categorization
Test collections are standard data sets used to measure the effectiveness of information retrieval systems. Most were originally developed to support research on IR, but practitioners often find them useful as well.
TIC
The Insurance Company (TIC) Benchmark. This datamining benchmark dataset is ideally suited for testing your datamining algorithms or using it as a case for datamining lab sessions. The data was supplied by Sentient Machine Research. The main question is: Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?
The 4 Universities Data Set
This data set contains WWW-pages collected from computer science departments of various universities in January 1997 by the World Wide Knowledge Base project of the CMU text learning group. The 8,282 pages were manually classified into 7 categories.
The Penn Treebank Project
The Penn Treebank Project annotates naturally occurring text for linguistic structure.
TREC
Text REtrieval Conference data. The organizations supplying data have either provided it free of charge, or for a fee. The data is copyrighted, and also has commercial value as data, so one must be careful to use it only for research purposes rather than for its informational uses.
Web Index Recommendation Data
Data for multi-instance learning based web index recommendation. The 113 web index pages are labeled by 9 volunteers according to their interests. Therefore there are 9 data sets. If the volunteer is interested in at least one linked page of the index, then the web index page is labeled as positive. Otherwise the index page is labeled as negative. There is no label for the linked pages.
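The labeling rule above is the standard multi-instance assumption: a bag (the index page) is positive if at least one of its instances (the linked pages) is positive. A minimal sketch follows; the per-link interest flags are hypothetical and only illustrate how the bag labels arise, since the data itself carries no labels for linked pages.

```python
def label_index_page(linked_page_interest):
    """Bag label: positive if the volunteer is interested in at least one linked page."""
    return any(linked_page_interest)


# Invented interest flags for the links on two index pages.
pages = {
    "index_a": [False, True, False],
    "index_b": [False, False],
}

labels = {name: label_index_page(flags) for name, flags in pages.items()}
print(labels)  # {'index_a': True, 'index_b': False}
```

A learner sees only the bag-level labels, which is what makes the task a multi-instance problem.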
Web Page Data
Our dataset consists of 10,000 web documents classified into 10 equally-sized categories, each containing 1,000 web documents. The categories were chosen in two steps. The first step was to select 4 distinct themes from the real world, namely Banking and Finance, Programming Languages, Science, and Sport. The second step was to select 2 or 3 categories from each theme, giving a total of 10 categories. This method of category selection allows the user to select a wide range of categorization experiments with varying degrees of categorization difficulty.
WordSimilarity-353
The WordSimilarity-353 Test Collection contains two sets of English word pairs along with human-assigned similarity judgements. The collection can be used to train and/or test computer algorithms implementing semantic similarity measures (i.e., algorithms that numerically estimate similarity of natural language words).
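One common way to test a similarity measure against human judgements of this kind is to compute the Spearman rank correlation between the two sets of scores. The sketch below is an assumption about how such an evaluation might look, not a procedure prescribed by the collection; the scores are invented toy values and do not come from WordSimilarity-353.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented scores for four word pairs (not taken from the collection).
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.8, 0.9, 0.3, 0.1]

rho = spearman(human_scores, model_scores)
print(round(rho, 3))  # 0.8
```

Rank correlation is used here because it does not require the algorithm's scores to be on the same scale as the human ratings.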

See Also...
DatGen, DBCat, CoCoMac, CoCoDat, NeuroDatabase, FlyBrain, IBVD, INDP, NeuronDB, SynapseWEB, WormBase, ZFIN, PDB