Text & Data Mining: What is TDM?

What is TDM?

“Text and data mining (TDM) is the process of deriving information from machine-read material. It works by copying large quantities of material, extracting the data, and recombining it to identify patterns.” 

– UK Government


[Figure: illustration of the basic stages of the TDM process, as described in the text below]


The 3, or 4, basic stages of TDM

Jisc's model above represents the TDM process in four stages. First, potentially relevant documents are identified. Second, these documents are converted into a machine-readable format ("normalized documents") so that structured data can be extracted. Third, the useful information is extracted (Stage 3, "derived dataset"). Fourth, that dataset is mined (Stage 4, "extracted information to discovered knowledge") to discover new knowledge, test hypotheses, and identify new relationships. The process is also frequently described in three stages: [1] identification and packaging into structured data through information retrieval techniques, which merges the first two stages of Jisc's model; [2] extraction, which locates and pulls out the data of interest and the relationships between them; and [3] mining, which is not extraction as such but the exposure of patterns in the extracted data to reveal new discoveries and research outcomes.
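As a rough illustration, the four stages described above can be sketched in a few lines of Python. This is a toy example under stated assumptions, not Jisc's method: the corpus, the keyword filter, the "X reduces Y" rule, and the function names `identify`, `normalise`, `extract` and `mine` are all invented for illustration, and real TDM projects would use far richer retrieval and language-processing tools.

```python
import re
from collections import Counter

# Hypothetical toy corpus: names and texts are invented for illustration.
CORPUS = {
    "doc1.txt": "Aspirin reduces fever. Aspirin may irritate the stomach.",
    "doc2.txt": "The committee met on Tuesday to discuss parking.",
    "doc3.txt": "Ibuprofen reduces fever. Ibuprofen reduces inflammation.",
}

def identify(corpus, keyword):
    """Stage 1: select potentially relevant documents (here, a simple keyword match)."""
    return {name: text for name, text in corpus.items() if keyword in text.lower()}

def normalise(text):
    """Stage 2: convert raw text to a machine-readable form (sentences as lists of lower-case words)."""
    sentences = [s for s in re.split(r"[.!?]", text.lower()) if s.strip()]
    return [re.findall(r"[a-z]+", s) for s in sentences]

def extract(sentences):
    """Stage 3: pull out the data of interest (here, 'X reduces Y' pairs) as a derived dataset."""
    pairs = []
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token == "reduces" and 0 < i < len(tokens) - 1:
                pairs.append((tokens[i - 1], tokens[i + 1]))
    return pairs

def mine(pairs):
    """Stage 4: look for patterns across the derived dataset (here, how often each effect recurs)."""
    return Counter(effect for _, effect in pairs)

relevant = identify(CORPUS, "fever")
derived = [pair for text in relevant.values() for pair in extract(normalise(text))]
patterns = mine(derived)
print(patterns.most_common())
```

In the three-stage description, `identify` and `normalise` would be treated as a single information retrieval step, with `extract` and `mine` unchanged.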

The journey from unstructured text to structured content

In the National Centre for Text Mining's model, the processes of information retrieval, extraction (including manipulation and annotation of the data), and knowledge discovery take the original "unstructured text (implicit knowledge)" to the final stage of "structured content (explicit knowledge)". Text mining more specifically has been defined as the process of "structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and final evaluation and interpretation of the output" (Sayantani Ghosh, 'A tutorial review on text mining algorithms', International Journal of Advanced Research in Computer and Communication Engineering, 1 (2012), 223). The last element of that definition is more commonly termed data mining: whether the original content is text or data, the mining that reveals new knowledge is performed on what has become "structured data".

Conceptual modelling of knowledge discovery

Three- and four-part models have been expanded further, with the emphasis on the final process of data mining, in the standards CRISP-DM (Cross Industry Standard Process for Data Mining) and SEMMA (Sample, Explore, Modify, Model and Assess). These build on conceptual models such as the Knowledge Discovery Process (KDP), which has nine discrete steps: developing and understanding the application domain; creating a target data set; data cleaning and preprocessing; data reduction and projection; choosing the data mining task; choosing the data mining algorithm; data mining; interpreting mined patterns; and consolidating discovered knowledge (see: Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski, Lukasz A. Kurgan, Data mining: a knowledge discovery approach (Springer, 2007), p. 12).

Image credit: Figure 2. Schematic overview of the processes involved in text mining of scholarly content. ©JISC. CC BY-NC-ND (Image in report: [section 2] Value and benefits of text mining)

Thing 19

TDM is Thing 19 in 23 Research Things Cambridge.

Text and data mining (TDM) is a process through which large amounts of information can be analysed electronically.  This allows researchers to work through far more research content than they would ever be able to do manually. 

Detailed definition

TDM, as text mining and as data mining, has been defined thus:

Data Mining is the computational process of discovering and extracting knowledge from structured data. Text Mining is the computational process of discovering and extracting knowledge from unstructured data.

Text Mining may be viewed as a specific form of Data Mining, in which the various algorithms firstly transform unstructured textual data into structured data which may then be analysed more systematically. Therefore the term TDM (Text & Data Mining) is often used.  The term TDM is also increasingly used to designate the Text & Data Mining of scholarly content, such as journal articles, book chapters or conference proceedings.
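The two-step relationship described above, in which text is first structured and then analysed, can be sketched as follows. This is a minimal illustration only: the three short "abstracts" are invented, the `tokenize` helper is hypothetical, and a simple table of word counts stands in for the much richer structuring that real text mining tools perform.

```python
import re
from collections import Counter

# Hypothetical abstracts, invented for illustration only.
texts = [
    "Gene expression changes in response to heat stress.",
    "Heat stress alters protein folding in yeast.",
    "Protein folding depends on gene expression.",
]

def tokenize(text):
    """Split text into lower-case words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1 (text mining): unstructured text -> structured data.
# Each document becomes a row of term counts over a shared vocabulary.
rows = [Counter(tokenize(t)) for t in texts]
vocabulary = sorted(set().union(*rows))
matrix = [[row[term] for term in vocabulary] for row in rows]

# Step 2 (data mining): analyse the structured table systematically,
# here by counting how many documents each term appears in.
doc_freq = {term: sum(1 for row in rows if row[term] > 0) for term in vocabulary}
shared_terms = sorted(term for term, n in doc_freq.items() if n >= 2)
print(shared_terms)
```

Once the text is in the form of `matrix`, any technique that works on structured data can be applied to it, which is why the two activities are so often discussed together as TDM.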

The IPO (Intellectual Property Office, UK) defines TDM as "the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information".

The future of open science

This video, part of a presentation at UKSG, argues that TDM is effectively the way all research will be conducted in the future.

Untangling Text and Data Mining

"It turns out that 'mining' is not a very good metaphor for what people in the field actually do. Mining implies extracting precious nuggets of ore from otherwise worthless rock. If data mining really followed this metaphor, it would mean that people were discovering new factoids within their inventory databases. However, in practice this is not really the case. Instead, data mining applications tend to be (semi)automated discovery of trends and patterns across very large datasets, usually for the purposes of decision making"

From 'Untangling Text Data Mining' by Marti A. Hearst, in 'Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics' (Association for Computational Linguistics, Stroudsburg, PA, 1999)
