Text and data mining (TDM)

Text and data mining (TDM) is an important tool in contemporary research, and one that is growing in popularity and influence. It forms part of artificial intelligence, machine learning, big data, natural language processing, semantic analysis and other research activity that uses programming techniques to analyse data and obtain insight.

There is an exception in UK copyright law which allows those conducting research at the University to use TDM provided:

  • The researcher(s) has lawful access to the source material – this would include any journals and databases the University subscribes to, or material that is legally openly available to the public;
  • The purpose of the research is non-commercial – this includes any research which is not produced for commercial purposes, even if it is commercially funded;
  • The source of the data is cited in the usual way, unless it is impossible, practically, to do so.

This exception also includes a clause that means, in theory, it cannot be overridden by a contract or licence: suppliers, publishers, website owners and others cannot prevent researchers from mining, provided the criteria above are met. In practice, some suppliers do try to do this, requiring researchers to use the platform’s tools or interfaces, or to contact them in advance of any activity. Some suppliers may even restrict the amount of data that can be mined or the rate at which it is mined.
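Where a supplier does state a rate limit, a mining script can be written to stay within it. The following is a minimal Python sketch, not a recommended implementation: the `Throttle` class and the figure of 30 requests per minute are hypothetical, and any real limit should be taken from the supplier's own terms.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        # max_per_minute is an assumed figure; check the supplier's terms.
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was allowed

    def wait(self):
        """Pause until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Hypothetical limit of 30 requests per minute (one request every 2 seconds).
throttle = Throttle(max_per_minute=30)
```

In use, the script would call `throttle.wait()` immediately before each request to the supplier's platform, so that requests are spaced at least `min_interval` seconds apart.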

If you encounter any difficulties of this nature, please contact Library Services. BEAR can also help with access to storage and computing power, and offers support with code writing and software development. They hold regular drop-in sessions where you can find out more.

The information below provides some guidance on TDM and issues that may arise. Please also see our guidance on:

Text and data mining community of practice

Library Services is currently running a project which explores issues and factors associated with text and data mining, as well as the services and support needed for this type of research. Project lead Lisa Bird (Head of Copyright and Licensing) would like to establish a community of practice so that those with an interest in text and data mining have a space to discuss their needs, break down silos and create a supportive network. If you would be interested in being part of this, please contact Lisa at l.s.bird@bham.ac.uk.

What is TDM?

Text and data mining is a research method that involves extracting information from content using software and other technological methods. TDM can include the analysis of files, text, webpages and social media posts, and may also be applied to non-text-based works such as images. It allows data to be searched, analysed and extracted far more quickly than manual searching: thousands of data sources can be searched in a matter of seconds, with results surfaced and structured ready for exploration.

TDM is often an essential part of AI, machine learning and big data activities, functioning as a mechanism for training programs and for analysing and interpreting data.

Under the Copyright, Designs and Patents Act 1988, the exception allowing for TDM is headed ‘Copies for text and data analysis for non-commercial research’ but the wording within the legislation simply refers to the notion of ‘computational analysis’.

The law allows this ‘computational analysis’ to be carried out on any content to which a person engaged in non-commercial research has lawful access. This could be content available online or content that is procured and subscribed to by institutions.

As LIBER Europe explain, there are four stages to the TDM process. First, potentially relevant documents are identified (Stage 1). These documents are then turned into a machine-readable format so that structured data can be extracted (Stage 2). The useful information is extracted (Stage 3) and then mined (Stage 4) to discover new knowledge, test hypotheses, and identify new relationships. 

Image credit: JISC / Value and Benefits of Text Mining (2012)
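The four stages described above can be illustrated with a short Python sketch. This is a toy example under stated assumptions rather than a real mining workflow: the `CORPUS` dictionary, the function names and the keyword are all hypothetical, and the "mining" step is simply a count of how many documents each word appears in.

```python
import re
from collections import Counter

# Hypothetical in-memory corpus standing in for lawfully accessed sources.
CORPUS = {
    "doc1": "<p>Copyright law permits text mining for research.</p>",
    "doc2": "<p>Weather report: rain expected.</p>",
    "doc3": "<p>Text mining supports research into copyright.</p>",
}


def identify(corpus, keyword):
    """Stage 1: select potentially relevant documents."""
    return {k: v for k, v in corpus.items() if keyword.lower() in v.lower()}


def to_machine_readable(raw):
    """Stage 2: strip markup so that plain text remains."""
    return re.sub(r"<[^>]+>", " ", raw).strip()


def extract(text):
    """Stage 3: extract lower-case word tokens from the text."""
    return re.findall(r"[a-z]+", text.lower())


def mine(token_lists):
    """Stage 4: count the number of documents in which each word occurs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))
    return counts


relevant = identify(CORPUS, "mining")
tokens = [extract(to_machine_readable(raw)) for raw in relevant.values()]
doc_freq = mine(tokens)
```

Real projects would replace each function with something far more capable, but the shape of the process (identify, convert, extract, mine) stays the same.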

TDM activity involves harvesting data for analysis, cleaning the data, and ordering and indexing it. To do that, a copy of the data to be analysed must be obtained or extracted and transferred to the appropriate tool for analysis. This is where copyright comes in: making a copy would normally infringe copyright, but the TDM exception permits it without the permission of the owner of the data.

Accessing content

Any content that can be accessed legally can be mined for non-commercial research purposes in the UK and this includes content subscribed to by the library or content that can be openly accessed online.

Content suppliers often have tools and mechanisms that can be used to supply or extract the data. Tools such as APIs and feeds are common, as is the supply of data on physical hard drives. Alternatively, researchers could use third-party packages to extract and analyse the data, or develop their own tools.

There are trade-offs between using supplier tools, third-party tools and tools created independently. The most suitable solution will depend on the circumstances of each project and the resources available.

Post-analysis

The TDM exception covers the ability to carry out the ‘computational analysis’; it does not apply to the archiving of research data, the reproducibility of research or publication activities within the scholarly communications lifecycle. Essentially, as soon as the analysis is complete the data is no longer covered by the TDM exception, so researchers must rely either on other exceptions in law or on permissions (licences) from data owners to enable these activities.

Research Data Management and reproducibility 

The University offers guidance on the management of research data, especially where data are sourced from and owned by external parties. Where the subscription agreement and/or the tool's terms and conditions allow data to be retained, the retention terms they set out should be followed. It may be that data can be retained only for certain periods, and researchers will be responsible for managing compliance with those terms.

If the data cannot legally be stored beyond the analysis stage, it is important to document the methodology and data sources used so that other researchers can reproduce and validate your results.
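One way to document data sources without retaining the data itself is to keep a small manifest alongside the methodology. The Python sketch below is illustrative only: the field names, the `manifest_entry` and `build_manifest` helpers and the example values are all hypothetical, and it assumes that keeping a checksum and citation (rather than the content) is acceptable under the relevant terms.

```python
import hashlib
import json
from datetime import date


def manifest_entry(source_id, citation, content, accessed):
    """Describe one source without storing its content."""
    return {
        "source_id": source_id,
        "citation": citation,
        "accessed": accessed.isoformat(),
        # The checksum lets others confirm they obtained identical content.
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "length_chars": len(content),
    }


def build_manifest(method, entries):
    """Combine a description of the method with the per-source entries."""
    return json.dumps({"method": method, "sources": entries}, indent=2, sort_keys=True)


# Hypothetical example values, for illustration only.
entry = manifest_entry(
    source_id="example-001",
    citation="Example Author (2020), Example Title",
    content="hello",
    accessed=date(2024, 1, 2),
)
manifest = build_manifest("Keyword search followed by term counting", [entry])
```

Another researcher with lawful access to the same sources could then re-obtain them, compare checksums and repeat the analysis.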

Publication

Some library subscription licences and supplier tool or API agreements allow data to be stored, and may also allow extracts from the data to be included within publications. Unfortunately, there is a trend to limit these permissions so significantly that they become almost unworkable for researchers.

Such clauses limit reuse in publications to so-called ‘snippets’ of data, e.g. 150–200 characters from a single copyright work or source. This is frequently unusable for publication purposes, as a quote or a single line in a concordance table may exceed the limit.
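Before submitting a manuscript, it can help to check planned extracts against whatever snippet limit applies. The following Python sketch assumes a hypothetical limit of 200 characters per extract and counts every character, including spaces; the actual figure and counting method should be taken from the relevant licence.

```python
SNIPPET_LIMIT = 200  # hypothetical figure; use the limit in the actual licence


def check_snippets(extracts, limit=SNIPPET_LIMIT):
    """Return (source, length) pairs for extracts longer than the limit."""
    return [(source, len(text)) for source, text in extracts if len(text) > limit]


# Hypothetical extracts: (source label, quoted text).
extracts = [
    ("source-a", "x" * 150),
    ("source-b", "y" * 201),
]
too_long = check_snippets(extracts)
```

Any extract returned by `check_snippets` would need either to be shortened or to be covered by one of the options described below.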

Where such limitations are unworkable, researchers are left with two options:

  1. Separate publication permission:

    As is common in the publication process, permission to reuse extracts can be obtained from the rights holders. 

    Obtaining this can be difficult and time-consuming, especially where there are many rights holders or the original content is old.  Depending on what content you are using and how you are using it, there may also be a fee involved.

  2. Use other exceptions:

    There are other exceptions in UK law which permit the use of extracts for the purposes of quotation or criticism and review.  If you are using the extract for either of these purposes and you are citing the author or owner in the usual way, your use may fall within the scope of the exception.

    Publishers may object to the use of these other exceptions and require authors to follow option 1 above. 

We would encourage authors to discuss the use of these exceptions with their publisher rather than relying solely on permission.

More detail on exceptions.
