Welcome to this LibGuide supporting TDM practitioners in Cambridge, students and researchers considering a project employing TDM, and librarians fielding enquiries about TDM from their library users.
This guide is a work in progress and, as far as we are aware, the first LibGuide on TDM in the UK! We want to make this useful to you, so please email any suggestions or ideas to ejournals@lib.cam.ac.uk. Thank you.
OpenMinTeD is working on a platform that will be a gateway to many types of language data, including tagsets, ontologies, publications and corpora. The platform will also offer services and functionalities that are useful for text and data mining, and allow miners to share their tools and build their own workflows.
Read the blog post on the Unlocking Research blog from the Office of Scholarly Communication about where we are in Cambridge with TDM.
Sharing thinking on TDM in Cambridge: Links to description and social media on the Cambridge Symposium on TDM
The University Library has acquired digital archives from Gale Cengage, a publisher of large primary source materials, including historical documents and newspapers. These digital archives are now available within a new resource called the “Gale Digital Scholar Lab” which has been specifically designed for the purpose of enabling text-mining and analysis.
Using the Lab you can search the archives as you would on their native platforms and build content sets from the search results. You can make multiple content sets and analyse the corpus you amass using the tools provided in the Lab. The tools currently available in the Lab are all open source, and the publisher intends to expand them over time: Topic Modelling (MALLET); Frequencies (Lucene); Clustering (scikit-learn); Parts-of-Speech Tagger (spaCy); Sentiment Analysis (OpenNLP); Named Entity Recognition (spaCy); Ngrams (Lucene).
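For readers new to these methods, the short Python sketch below illustrates what the "Frequencies" and "Ngrams" analyses compute in principle. It is not the Lab's implementation (the Lab uses Lucene for both); the sample sentence and the function names `tokenize` and `ngram_counts` are invented for illustration, and the tokenisation is deliberately simplistic.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n staggered copies of the token list yields each n-gram once
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text standing in for an OCR'd newspaper passage
sample = "The Times reported the news. The Times printed the news again."

tokens = tokenize(sample)
word_freq = Counter(tokens)        # single-word frequencies
bigrams = ngram_counts(tokens, 2)  # two-word sequences

print(word_freq.most_common(3))
print(bigrams.most_common(3))
```

On real OCR text, results depend heavily on the tokenisation and on OCR quality, so counts from a sketch like this will not necessarily match those produced by the Lab's tools.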
The Lab promises to open up new possibilities for the relative newcomer to digital scholarship in this area, allowing natural language processing tools to be applied to raw text data (OCR), facilitating new discoveries and insights. The Lab makes much of visualization of results and data and thus lends itself to scholarly sharing and “bridging the gap between scholarly resources and faculty researchers/students”. The Lab facilitates organisation of content sets, including renaming, duplicating and versioning as well as identifying the searches used to create the content set, which makes sharing and reproducing research projects easier than is usually the case.
A Getting Started "Walkthrough" Guide to the Lab is available here.
Archives included in the Lab to which Cambridge has access for analysis are:
- 17th and 18th century Burney collection
- 19th century UK periodicals
- British Library newspapers
- Economist historical archive, 1843-2014
- Eighteenth century collections online
- Illustrated London News historical archive, 1842-2003
- Making of modern law: legal treatises, 1800-1926
- Nineteenth century U.S. newspapers
- Times digital archive
- Times literary supplement historical archive
- U.S. declassified documents online
Access to the Lab is on a trial basis, to help Cambridge assess its usefulness to practitioners and to promote the resource for digital humanities scholarship in Cambridge generally.
Please contact ejournals@lib.cam.ac.uk for the username and password to access the Lab. Thank you.
The TDM Test Kitchen is an experimental service supported by Cambridge Digital Humanities, Cambridge University Library and Cambridge University Press.
The TDM Test Kitchen aims to:
- Explore the application of TDM (Text and Data Mining) methods to CUP and UL collections.
- Provide a ‘live’ learning environment where researchers, CUP and library staff involved in either using TDM methods or developing TDM support services can learn more about TDM methods, share good practice and exchange knowledge about how to overcome challenges.
- Facilitate discussion between researchers, the UL and CUP about how to develop TDM methods and services in future.
OpenMinTeD "sets out to create an open, service-oriented e-Infrastructure for Text and Data Mining (TDM) of scientific and scholarly content. Researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific related resources in a seamless way".
OpenMinTeD has a Knowledge Base comprising a range of materials including visualizations of the TDM workflows, textual guides, Webinars, and training videos showing methods applied in practice by experts in the field:
- Key concepts and areas in TDM explained - part 1
- Key concepts and areas in TDM explained - part 2: Knowledge representation
- Key concepts and areas in TDM explained - part 3: Recommenders and filtering
- Key concepts and areas in TDM explained - part 4: Semantic search
- Key concepts and areas in TDM explained - part 5: Knowledge discovery