- The Weblog Data Collection, containing about three weeks of blog posts written during the London Bombings, is being released by Buzzmetrics specially for this workshop. Those wanting the data for the workshop can download it by FTP after faxing the completed and signed Data Share Agreement to +1 412 802 7986 and sending an email (containing a simple statement requesting the data, plus name, affiliation, and email address) to Matthew Hurst.
- Duplicate Detection, Record Linkage and Identity Uncertainty
- Pine-Info discussion list web archive containing emails of users reporting problems and responses from other users offering solutions and advice.
- Enron Email Corpus containing about 0.5 million email messages from about 150 users. Various other sites host this data with different levels of processing (for example here).
- Historical Texts
- Lancaster Newsbook Corpus comprising over 22,000 documents, including over 7,000 newsbooks from 1640-1661, which are key resources for both early journalism and English politics.
- WordHoard contains four different corpora: Chaucer, Spenser, Shakespeare, and Early Greek Epic, which includes Homer, Hesiod, and the Homeric Hymns in the original Greek, with English and/or German translations.
- Corpus del Español, a 100-million-word corpus of Spanish texts.
- ISRI OCR data comprising scanned page images with corresponding ground-truth text.
- English SMS Dataset that has been manually translated to standard form and automatically aligned at the word level.
- Speech, Machine Translation, Large Electronic Texts etc.
- LDC data, for example SPINE, Switchboard, Broadcast News, TDT, etc. Available on payment.
- TDT-2 database with both reference transcriptions (10% WER) and ASR transcriptions (30% WER). Available on payment.
- The Linguist List: Texts and Corpora, a meta site containing links to many corpora.
Please let us know if you are aware of any other relevant datasets.