
Text & Data Mining: Cambridge TDM

Dr. Gabriel Recchia, Research Associate, CRASSH, writes from the interdisciplinary Concept Lab project:

For the past two and a half years, I have been a research associate at the Concept Lab (under PI Peter de Bolla), an interdisciplinary project housed at the Centre for Research in the Arts, Social Sciences, and Humanities at the University of Cambridge. The Concept Lab combines perspectives from digital humanities, computational linguistics, and cognitive science to attempt a bold goal: the development of a practical and theoretical framework for the analysis of conceptual structure. This is linked to a deep commitment to developing new methods for understanding the history of ideas. By attending to how statistical associations between groups of words in large corpora of printed texts constellate, disaggregate, and change in other ways over time, it is possible to obtain a deeper understanding of how the discourse surrounding particular concepts has changed from decade to decade -- and, we would argue, to discover how the very concepts being discussed have changed as well.
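As an illustration only, and not the Concept Lab's actual pipeline, one simple way to quantify a statistical association between two words is pointwise mutual information (PMI), computed separately for each decade so that changes over time become visible. The sketch below assumes a corpus supplied as (year, text) pairs and counts co-occurrence at the document level. The function name `pmi_by_decade` and the toy corpus are hypothetical, and a real study would need proper tokenisation, OCR cleanup, and far more data.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def pmi_by_decade(docs):
    """Return {decade: {(word_a, word_b): PMI}} for (year, text) pairs.

    Hypothetical helper: co-occurrence means "both words appear in the
    same document", and tokenisation is a plain whitespace split.
    """
    # Group documents by decade, keeping each document as a set of words.
    by_decade = defaultdict(list)
    for year, text in docs:
        by_decade[year // 10 * 10].append(set(text.lower().split()))

    result = {}
    for decade, doc_words in by_decade.items():
        n = len(doc_words)
        word_counts = Counter()  # number of documents containing each word
        pair_counts = Counter()  # number of documents containing both words
        for words in doc_words:
            word_counts.update(words)
            pair_counts.update(combinations(sorted(words), 2))
        # PMI = log2( p(a, b) / (p(a) * p(b)) ), probabilities as document shares.
        result[decade] = {
            (a, b): math.log2(c * n / (word_counts[a] * word_counts[b]))
            for (a, b), c in pair_counts.items()
        }
    return result


# Toy corpus, invented purely for illustration.
toy_docs = [
    (1701, "liberty rights"),
    (1705, "liberty rights"),
    (1708, "trade commerce"),
    (1791, "liberty trade"),
    (1795, "rights commerce"),
]
scores = pmi_by_decade(toy_docs)
for decade in sorted(scores):
    print(decade, scores[decade])
```

Comparing the scores for the same word pair across decades gives a crude picture of an association strengthening, weakening, or disappearing. In the toy data, "liberty" and "rights" co-occur in the 1700s but not in the 1790s.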

In this regard, the eresources and services provided by the Cambridge University Library Eresources Subscriptions team have been invaluable. Early on, we considered turning to Google Books data to derive lexical associations for the eighteenth century, but quickly discovered that the coverage and variety of sources searchable within ECCO (Eighteenth Century Collections Online) were far superior to what Google Books had to offer for this time period. In addition, we were aware that the Stanford Literary Lab had successfully obtained access to the full text of ECCO documents in a machine-readable format, although in their case this required the metadata to be extracted from ancient tape drives. We hoped it might be possible to get such access given Cambridge's ECCO subscription, and made our request known to the Library.

Thus began a lengthy period of negotiation between the publisher and the Library to hash out the terms under which Cambridge University researchers might obtain access to ECCO metadata for purposes of text and data mining. Although I only had a slim window into this process, I was continually struck by the Library's strong commitment to obtaining terms that would be in the best interest of University projects. Ultimately, a deal was made, and the Cambridge University Library Eresources team provided us with what we needed in an easy-to-use format under terms that permitted us to do the research we had hoped to do. It was well worth the wait, as ECCO has allowed us to explore the trajectory of political, economic, psychological and other concepts over time in a way that simply would not otherwise have been possible. Through my participation in the Digital Humanities Network, I have been glad to point many people at Cambridge towards ECCO and the other subscription eresources that we make use of on a regular basis, available at http://libguides.cam.ac.uk/az.php?a=all. These include Early English Books Online and Gale Primary Sources, which provides advanced search capabilities for texts within the Times Digital Archive, British Library Newspapers, and eleven other databases. We owe Cambridge University Library and the Eresources Subscriptions team many thanks for their efforts.
 

The Cambridge Big Data Strategic Research Initiative brings together researchers from across the University to address challenges presented by our access to unprecedented volumes of data. Our research spans all six Schools of the University, from the underlying fundamentals in mathematics and computer science to applications ranging from astronomy and bioinformatics to medicine, social science and the humanities.

In parallel, our research addresses important issues around law, ethics and economics, in order to apply Big Data to solve challenging problems for society.

Cambridge Big Data supports collaboration and knowledge transfer in this growing field.