Text and data mining (TDM)

Text and data mining (TDM) is an important tool in contemporary research, and one that is gaining in popularity and influence. It forms part of Artificial Intelligence, Machine Learning, Big Data, Natural Language Processing, semantic analysis and other research activity that involves using programming techniques to analyse data and obtain insight.

There is an exception in UK copyright law which allows those conducting research at the University to use TDM provided:

  • The researcher(s) have lawful access to the source material – this includes any journals and databases the University subscribes to, and material that is legally and openly available to the public;
  • The purpose of the research is non-commercial – this covers any research not produced for commercial purposes, even if it is commercially funded;
  • The source of the data is cited in the usual way, unless it is practically impossible to do so.

This exception also includes a clause which means that, in theory, it cannot be overridden by contract or licence. Suppliers, publishers, website owners and the like cannot prevent researchers from mining provided they meet the criteria above. In practice, some suppliers do attempt to do so, requiring researchers to use the platform’s own tools or interfaces, or to contact them in advance of any activity. Some suppliers may even restrict the amount of data that can be mined, or the rate at which it can be mined. If you encounter any difficulties of this nature, please contact Library Services.

The information below provides some guidance on TDM and issues that may arise.

What is TDM?

Text and data mining is a research method that involves extracting information from content using software and technological methods. TDM can include the analysis of files, text, webpages and social media posts, and may also be applied to non-text-based works such as images. TDM allows data to be searched, analysed and extracted far more quickly than manual searching: thousands of data sources can be queried in a matter of seconds, with results surfaced and structured ready for exploration.

TDM is often an essential part of AI, machine learning and big data activities, functioning as a mechanism for training programs and for analysing and interpreting data.

Under the law, the exception allowing for TDM is headed ‘Copies for text and data analysis for non-commercial research’, but the wording within the legislation simply refers to the notion of ‘computational analysis’.[1]

The law allows this ‘computational analysis’ to be carried out on any content a person engaged in non-commercial research has legal access to.[2] This could be content openly available online, or content procured and subscribed to by institutions.

There are four stages to the TDM process. First, potentially relevant documents are identified (Stage 1). These documents are then turned into a machine-readable format so that structured data can be extracted (Stage 2). The useful information is extracted (Stage 3) and then mined (Stage 4) to discover new knowledge, test hypotheses, and identify new relationships.[3]

[Figure: the four stages of the text mining process. Image credit: JISC / Value and Benefits of Text Mining (2012)]
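To make these stages concrete, the following is a minimal, illustrative Python sketch of stages 2 to 4 run over a toy corpus. The documents and the frequency analysis are invented for illustration; a real project would identify documents programmatically (stage 1) and use dedicated TDM libraries over a far larger collection.

```python
from collections import Counter
import re

# Toy corpus standing in for stage 1 (document identification),
# which a real project would perform programmatically and at scale.
documents = [
    "Text mining extracts patterns from text.",
    "Data mining discovers relationships in data.",
]

term_counts = Counter()
for doc in documents:
    # Stages 2-3: normalise each document into a machine-processable
    # form and extract the useful units (lower-cased word tokens).
    tokens = re.findall(r"[a-z]+", doc.lower())
    term_counts.update(tokens)

# Stage 4: mine the structured data; here, the simplest possible
# analysis, surfacing the most frequent terms across the corpus.
print(term_counts.most_common(5))
```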

TDM activity involves harvesting data for analysis, then cleaning, ordering and indexing it. To do that, a copy of the data to be analysed must be obtained or extracted and transferred to the appropriate tool for analysis. This is where copyright comes in: making a copy would normally be an act of copyright infringement, but the exception permits it without the permission of the owner of the data.

[1] Copyright, Designs and Patents Act 1988 (as amended), s.29A, https://www.legislation.gov.uk/ukpga/1988/48/section/29A

[2] Ibid., s.29A(1)(a)

[3] LIBER, Text & Data Mining, https://libereurope.eu/topic/text-data-mining/ (accessed 5 March 2021)

Accessing content

In the UK, any content that can be accessed legally can be mined for non-commercial research purposes. This includes content subscribed to by the library and content that can be openly accessed online.

Content suppliers often have tools and mechanisms that can be used to supply or extract the data. Tools such as APIs and feeds are common, as is the supply of data on physical hard drives. Alternatively, researchers can use third-party packages to extract and analyse the data, or develop their own tools.

There are trade-offs between supplier tools, third-party tools and tools created independently. The most suitable solution will depend on the circumstances of each project and the resources available.

Supplier and platform tools

Many suppliers now have APIs, feeds and other systems that researchers can query quickly and easily, granting access to huge swathes of data in a few clicks. More and more ‘ready-made’/‘off-the-shelf’ publisher services are available, with increasing flexibility and usability, especially for those beginning to explore data mining projects.
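As an illustration of querying such a service, the sketch below uses Python’s third-party requests library against Crossref’s open REST API (a real, openly documented scholarly metadata service, used here purely as a stand-in). A commercial supplier’s API will have its own endpoints, authentication and terms, which you should check before mining; the contact address in the example is a placeholder.

```python
import requests  # third-party HTTP library, assumed installed

# Illustrative only: Crossref's open REST API stands in for a
# supplier platform. Check your supplier's documentation and terms.
BASE_URL = "https://api.crossref.org/works"

params = {
    "query": "text and data mining copyright",  # free-text search
    "rows": 5,                                  # keep the request small
}
# Many APIs ask callers to identify themselves; this address is a placeholder.
headers = {"User-Agent": "tdm-demo/0.1 (mailto:researcher@example.ac.uk)"}

response = requests.get(BASE_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()

for item in response.json()["message"]["items"]:
    title = (item.get("title") or ["(untitled)"])[0]
    print(item.get("DOI"), "-", title)
```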

The benefit of these tools is that they have been developed by the suppliers working with their own content. They have been specifically designed and tested with researchers and students in mind, to ensure they meet the needs of the community they serve. Many suppliers have extensive engagement with the research community and have used it to develop and refine these packages. This may include ensuring that the structure of the data, any tags and the formatting are compatible with, and usable by, the tool. They also benefit from publisher support and growing communities of users.

In addition, as these are official supplier tools, there should be no issue with accessing the supplier’s content through them: issues with licence breaches, or security alerts triggered by machine-driven searching and downloading, should not arise, unlike with third-party or bespoke tools (see below).

Researchers should also note that some of these tools are designed to work with specific datasets and may not be usable with other sources of content. They may also be limited in scope and scale, imposing data flow limits or restricting which datasets can be accessed.

All of these supplier tools will have separate terms and conditions governing their use, and it is important that you read and understand them. These terms will state what can be done with the content, how it can be accessed, who can use it, how it can be shared, and so on. They are particularly important for publication activity and for data retention for reproducibility purposes, and their limits may cause difficulty for your project and dissemination activity.

Some of these tools are already included within library subscriptions and can be accessed directly via the platforms/ suppliers. Some of the tools are also only available to subscribing institutions and may even require advanced subscription levels, over and above what might already be in place. Other tools may form standalone products unconnected to subscriptions.  These are likely to be chargeable and the costs can be considerable.

Researchers should ensure that any tool is adequately trialled, tested and reviewed before any investment is made.

When seeking to use or trial such a tool please do consult with Library Services for advice and guidance.

Third-party tools

There is a growing number of open-source and commercial packages available to enable TDM activity. Specific disciplines may have different preferences in terms of function and need, which will determine the suitability of a particular tool for a particular activity. Some of these tools are commercial software packages that can be licensed (at cost) for a particular researcher or group. Free and Open Source Software (FOSS) tools are also available, designed so that researchers can make changes and add developments as required.

Many of these tools have support infrastructures and communities of practice behind them, with researchers and developers improving or customising the functionality for specific projects. They can handle datasets from multiple sources and are not tied to one particular supplier. This increased flexibility may be attractive to some researchers, especially those more confident in using programming or coding techniques to tailor the package to their needs.

As TDM becomes more accessible and mainstream, many tools are far easier to use than they once were. Many can be picked up quickly and easily, allowing novice researchers to engage with their research questions more readily. Others, especially those intended for bespoke customisation, do require some knowledge of programming languages such as R or Python.
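As a brief illustration, the sketch below uses one widely used FOSS Python package, scikit-learn (assumed installed), to build a document-term matrix from raw text in a few lines, a task a bespoke tool would otherwise have to implement itself. The documents are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Text mining extracts patterns from text.",
    "Data mining discovers relationships in data.",
]

# Build a document-term matrix: one row per document, one column per
# term, with common English stop words removed.
vectoriser = CountVectorizer(stop_words="english")
matrix = vectoriser.fit_transform(documents)

print(vectoriser.get_feature_names_out())  # the extracted vocabulary
print(matrix.toarray())                    # term counts per document
```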

Additionally, the structure, format and usability of data extracted by these tools may require additional processing (i.e. stage 2 above) before it can be analysed. For example, text may need structure added or removed, HTML/XML tags stripped, or optical character recognition applied to ensure it is machine readable as a textual work. This pre-analysis processing can be considerable, especially if you are working with digitised historic print or handwritten materials.
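For instance, a minimal sketch of one such pre-analysis step, stripping HTML tags from a harvested page using only Python’s standard library, might look like this (real projects often prefer fuller parsers such as Beautiful Soup):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the markup itself is skipped.
        self.chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by the removed markup.
        return " ".join(" ".join(self.chunks).split())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Results</h1><p>Mined  text.</p></body></html>")
print(extractor.text())  # -> Results Mined text.
```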

As with the supplier tools, some of these third-party tools may be limited to, or focused on, particular datasets or data types, and may have limited data flow rates. They may lack some of the interactivity, ‘gloss’ or easy functionality that the supplier solutions have, and researchers may need to combine different tools to achieve the desired result. They may also require technical skill and expertise, along with server space, to operate effectively.

Bespoke and researcher-created tools

Beyond supplier and third-party tools, researchers can create their own tools to carry out the mining activity if they so wish. These tools could be based on an existing FOSS tool or created entirely from scratch.

These custom tools will be dedicated to the specific task they are required for and, by their very nature, will do exactly what the researcher wants them to do. They can target the precise data sources, formats and structures the research activity requires, and can control flow rates and other variables that enable the activity to take place. As with the third-party tools mentioned above, other pre-analysis processes may be required before the data can be analysed correctly.

Many researchers, however, do not have the time to write, develop and test custom tools within the scope of a particular research project, the technical skill to create such bespoke programs, or the funding to pay for a Research Software Engineer.

When building and running bespoke solutions like this, the question of suitable research infrastructure arises. Is there enough computing power and storage space available to perform the functions required? Where and how will the data be stored? Support may be available from BEAR and the Advanced Research Computing team, but this may be chargeable.

Researchers should also ensure that their research methodology is reproducible which may mean ensuring the code is sufficiently accessible, available and functional for peer reviewers to validate their results.

Supplier restrictions

Tools that are independent of suppliers may encounter difficulties, as some sites deploy security mechanisms to protect themselves from attack. Some sites use machine-readable robots.txt files or robots meta tags that will prevent scripts from running effectively, which may result in the tool not functioning or results being unavailable. Anything that places unusual loads on supplier servers may be questioned and objected to by suppliers, who have the right to complain about mining activity, especially where it undermines the integrity of their infrastructure. Engaging in dialogue and cooperation with suppliers may help avoid some of these issues.
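A minimal sketch of a considerate approach, using Python’s standard library to honour a site’s robots.txt and to throttle requests, is shown below. The site, user agent, pages and delay are placeholders, not real targets or recommended values.

```python
import time
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"   # placeholder, not a real target
AGENT = "tdm-demo/0.1"             # identify your script honestly

# Fetch and parse the site's robots.txt before mining anything.
rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

pages = [SITE + "/articles/1", SITE + "/articles/2"]
delay = rp.crawl_delay(AGENT) or 1.0  # honour any declared crawl delay

for url in pages:
    if not rp.can_fetch(AGENT, url):
        print("Skipping (disallowed by robots.txt):", url)
        continue
    print("Fetching:", url)
    # ... fetch and process the page here ...
    time.sleep(delay)  # pause between requests to limit server load
```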

Library Services receives notifications from suppliers informing us of suspicious or unusual activity which suppliers might think is associated with unauthorised access or hacking. Occasionally this can cause researchers or IP addresses to be locked out of a certain site until the query is resolved. In rare circumstances a supplier will suspend all UoB access while it explores the problem.

If such an alert is received, Library Services will work with IT Services to identify the machines and researchers involved and investigate the cause.  If it relates to TDM activity we will liaise with the supplier to explain the circumstances.  Very rarely, we may ask the researcher to join those conversations. 

Library Services will work hard to ensure that your ability to exercise the legal right to mine content is protected as far as possible, while respecting a supplier’s need to protect the stability of their platform.

Post analysis

The TDM exception covers the act of ‘computational analysis’ itself; it does not apply to the archiving of research data, the reproducibility of research, or publication activities within the scholarly communications lifecycle. Essentially, as soon as the analysis is complete the data is no longer covered by the TDM exception, so researchers must either rely on other exceptions in law, or on permissions (licences) from data owners, to enable these activities.

Research Data Management and reproducibility

The University offers guidance on the management of research data, especially where data are sourced from, and owned by, external parties. Where the subscription agreement and/or the tool’s terms and conditions allow, those data retention provisions should be followed. It may be that data can be retained for certain periods, and researchers will be responsible for managing compliance with those terms.

If the data cannot legally be stored beyond the analysis stage, it is important to document the methodology and data sources used so that other researchers can reproduce and validate your results.
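One lightweight way to do this is to record a small machine-readable manifest alongside the results. The sketch below is a minimal example; every field name, value and source in it is illustrative rather than a required standard.

```python
import json
from datetime import date

# Illustrative manifest recording what was mined, when and how, so the
# run can be reproduced after the raw data has been deleted.
manifest = {
    "project": "Example TDM study",  # placeholder name
    "run_date": date.today().isoformat(),
    "sources": [
        {
            "name": "Supplier X full-text API",  # hypothetical supplier
            "query": "climate AND policy",
            "records_retrieved": 12000,
        }
    ],
    "tools": {"python": "3.12", "pipeline_version": "0.3.1"},
    "notes": "Raw text deleted after analysis per licence; derived counts retained.",
}

with open("tdm_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```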

Publications

Some library subscription licences and supplier tool/API agreements allow data to be stored, and may also allow extracts from the data to be included within publications. Unfortunately, there is a trend towards limiting these permissions significantly, rendering them almost unworkable for researchers.

Such clauses limit reuse in publications to so-called ‘snippets’ of data, e.g. 150–200 characters from a single copyright work or source. This is frequently unusable for publication purposes, where a quotation or a single line in a concordance table may exceed the limit.

Where such limitations are unworkable, researchers are left with two options:

  1. Separate publication permission:

    As is common in the publication process, permission to reuse extracts can be obtained from the rights holders. 

    Obtaining this permission can be difficult and time-consuming, especially where there are many rights holders or the original content is old. Depending on what content you are using and how, there may also be a fee involved.

  2. Use other exceptions:

    There are other exceptions in UK law which permit the use of extracts for the purposes of quotation or criticism and review. If you are using the extract for either of these purposes, and you are citing the author/owner in the usual way, the use may fall within the scope of the exception.

    Publishers may object to the use of these other exceptions and require authors to follow option 1 above. 

We would encourage authors to discuss the use of these exceptions with their publisher rather than relying on permission alone.

More detail on exceptions.

 
