Datasets for testing
During our time in the search business we have collected some datasets and word lists that we think can be useful for testing and for research into search and information retrieval in general.
By Searchdaimon
-
Wikipediadoc
This dataset consists of 67 537 Wikipedia articles converted to Word format. The dataset was made by parsing an XML database dump of Wikipedia and converting it to individual HTML files. Each HTML file was then opened in Microsoft Word 2002 (Office XP) and saved by Word as .doc.
Download by BitTorrent (recommended)
Download by http
Creative Commons Attribution 3.0
-
English and Norwegian word stemmers with word lists for reverse lookup
A stemmer is a heuristic transformation that aims to reduce a word to its stem, base or root form. For example, walked, walks and walking are all derived from the word walk.
This dataset consists of Perl scripts that can stem English and Norwegian words, e.g. stem(walked) –> walk, and two word lists for reverse stemming lookup, e.g. lookup(walk) –> walked, walks, walking. The word lists were created by stemming the most common words found on 64 million web pages.
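To illustrate how stemming and reverse lookup fit together, here is a minimal Python sketch. It is not the Perl code from the dataset: `toy_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real stemmer, and `build_reverse_index` and `lookup` are names invented for this example.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper for illustration only (assumption: not the
    algorithm used by the Perl scripts in the dataset)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_reverse_index(words):
    """Map each stem to the set of surface forms that reduce to it."""
    index = defaultdict(set)
    for w in words:
        index[toy_stem(w)].add(w.lower())
    return index


def lookup(index, stem):
    """Return all known words for a stem, sorted alphabetically."""
    return sorted(index.get(stem, ()))


index = build_reverse_index(["walked", "walks", "walking", "walk", "talked"])
print(toy_stem("walked"))     # walk
print(lookup(index, "walk"))  # ['walk', 'walked', 'walking', 'walks']
```

A real word list would be built the same way, only with a proper stemmer and a much larger vocabulary, such as the most common words from a web crawl.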
Demo
29979 English words
66479 Norwegian words
Download by http
Creative Commons Attribution 3.0
-
Lists of adult words
Lists of words and two-word phrases often seen on pornographic sites.
Read more
Creative Commons Attribution-ShareAlike 3.0
Recommended third party resources
-
EDRM Enron Email Data Set v2
One of the best and most used data sets in information retrieval research. This data set contains Enron e-mail messages and attachments from about 150 users, mostly senior management of Enron, organized into folders. This data was originally made public, and posted to the web, by the US Federal Energy Regulatory Commission during its Enron investigation. The data set was created by EDRM.
As xml
Contains XML description, EML files with attachments, native attachments, text email bodies and text email attachments.
Download by BitTorrent (recommended)
Download by http at edrm.net
Pst
Download by BitTorrent (recommended)
Download by http at edrm.net
Creative Commons Attribution 3.0 United States License
-
Geocities - 641 GB of fun
Have you been missing the blink tag lately? Or maybe you want to test your HTML parser on some real data. The Geocities torrent is one of the largest data collections of real hand-made documents, created by millions of users. The data set was created by The Archive Team.
Download by BitTorrent
Unknown license