Text and data mining (TDM)

Text and data mining (TDM) is an important tool in contemporary research, and one that is gaining in popularity and influence. It forms part of Artificial Intelligence, Machine Learning, Big Data, Natural Language Processing, semantic analysis and other research activity that involves using programming techniques to analyse data and obtain insight.

There is an exception in UK copyright law which allows those conducting research at the University to use TDM provided:

  • The researcher(s) have lawful access to the source material – this includes any journals and databases the University subscribes to, and material that is legally and openly available to the public;
  • The purpose of the research is non-commercial – this covers any research not produced for commercial purposes, even if it is commercially funded;
  • The source of the data is cited in the usual way, unless it is practically impossible to do so.

This exception also includes a clause which means that, in theory, it cannot be overridden by contract or licence. Suppliers, publishers, website owners and the like cannot prevent researchers from mining provided they meet the criteria above. In practice, some suppliers do attempt to do so, requiring researchers to use the platform’s own tools or interfaces, or to contact them in advance of any activity. Some suppliers may even restrict the amount of data that can be mined, or the rate at which it can be mined. If you encounter any difficulties of this nature, please contact Library Services.

The information below provides some guidance on TDM and issues that may arise.

What is TDM?

Text and data mining is a research method that involves extracting information from content using software and technological methods. TDM can include the analysis of files, text, webpages and social media posts, and may also be applied to non-text-based works such as images. TDM allows data to be searched, analysed and extracted far more quickly than manual searching: thousands of data sources can be queried in a matter of seconds, with results surfaced and structured ready for exploration.

TDM is often an essential part of AI, machine learning and big data activities, functioning as a mechanism for training programs and for analysing and interpreting data.

Under the law, the exception allowing for TDM is headed ‘Copies for text and data analysis for non-commercial research’, but the wording within the legislation simply refers to the notion of ‘computational analysis’.[1]

The law allows this ‘computational analysis’ to be carried out on any content a person engaged in non-commercial research has legal access to.[2] This could be content openly available online, or content procured and subscribed to by institutions.

There are four stages to the TDM process. First, potentially relevant documents are identified (Stage 1). These documents are then turned into a machine-readable format so that structured data can be extracted (Stage 2). The useful information is extracted (Stage 3) and then mined (Stage 4) to discover new knowledge, test hypotheses, and identify new relationships.[3]

[Figure: the four stages of the text mining process. Image credit: JISC / Value and Benefits of Text Mining (2012)]
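To make these stages concrete, the following is a minimal, illustrative Python sketch of stages 2 to 4 run over a toy corpus. The documents and the frequency analysis are invented for illustration; a real project would identify documents programmatically (stage 1) and use dedicated TDM libraries over a far larger collection.

```python
from collections import Counter
import re

# Toy corpus standing in for stage 1 (document identification),
# which a real project would perform programmatically and at scale.
documents = [
    "Text mining extracts patterns from text.",
    "Data mining discovers relationships in data.",
]

term_counts = Counter()
for doc in documents:
    # Stages 2-3: normalise each document into a machine-processable
    # form and extract the useful units (lower-cased word tokens).
    tokens = re.findall(r"[a-z]+", doc.lower())
    term_counts.update(tokens)

# Stage 4: mine the structured data; here, the simplest possible
# analysis, surfacing the most frequent terms across the corpus.
print(term_counts.most_common(5))
```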

TDM activity involves harvesting data for analysis, then cleaning, ordering and indexing it. To do that, a copy of the data to be analysed must be obtained or extracted and transferred to the appropriate tool for analysis. This is where copyright comes in: making a copy would normally be an act of copyright infringement, but the exception permits it without the permission of the owner of the data.

[1] Copyright, Designs and Patents Act 1988 (as amended), s.29A, https://www.legislation.gov.uk/ukpga/1988/48/section/29A

[2] Ibid., s.29A(1)(a)

[3] LIBER, Text & Data Mining, https://libereurope.eu/topic/text-data-mining/ (accessed 5 March 2021)

Accessing content

In the UK, any content that can be accessed legally can be mined for non-commercial research purposes. This includes content subscribed to by the library and content that can be openly accessed online.

Content suppliers often have tools and mechanisms that can be used to supply or extract the data. Tools such as APIs and feeds are common, as is the supply of data on physical hard drives. Alternatively, researchers can use third-party packages to extract and analyse the data, or develop their own tools.

There are trade-offs between supplier tools, third-party tools and tools created independently. The most suitable solution will depend on the circumstances of each project and the resources available.

Supplier and platform tools

Many suppliers now have APIs, feeds and other systems that researchers can query quickly and easily, granting access to huge swathes of data in a few clicks. More and more ‘ready-made’/‘off-the-shelf’ publisher services are available, with increasing flexibility and usability, especially for those beginning to explore data mining projects.
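As an illustration of querying such a service, the sketch below uses Python’s third-party requests library against Crossref’s open REST API (a real, openly documented scholarly metadata service, used here purely as a stand-in). A commercial supplier’s API will have its own endpoints, authentication and terms, which you should check before mining; the contact address in the example is a placeholder.

```python
import requests  # third-party HTTP library, assumed installed

# Illustrative only: Crossref's open REST API stands in for a
# supplier platform. Check your supplier's documentation and terms.
BASE_URL = "https://api.crossref.org/works"

params = {
    "query": "text and data mining copyright",  # free-text search
    "rows": 5,                                  # keep the request small
}
# Many APIs ask callers to identify themselves; this address is a placeholder.
headers = {"User-Agent": "tdm-demo/0.1 (mailto:researcher@example.ac.uk)"}

response = requests.get(BASE_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()

for item in response.json()["message"]["items"]:
    title = (item.get("title") or ["(untitled)"])[0]
    print(item.get("DOI"), "-", title)
```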

The benefit of these tools is that they have been developed by the suppliers working with their own content. They have been specifically designed and tested with researchers and students in mind, to ensure they meet the needs of the community they serve. Many suppliers have extensive engagement with the research community and have used it to develop and refine these packages. This may include ensuring that the structure of the data, any tags and the formatting are compatible with, and usable by, the tool. They also benefit from publisher support and growing communities of users.

In addition, as these are official supplier tools, there should be no issue with accessing the supplier’s content through them: issues with licence breaches, or security alerts triggered by machine-driven searching and downloading, should not arise, unlike with third-party or bespoke tools (see below).

Researchers should also note that some of these tools are designed to work with specific datasets and may not be usable with other sources of content. They may also be limited in scope and scale, imposing data flow limits or restricting which datasets can be accessed.

All of these supplier tools will have separate terms and conditions governing their use, and it is important that you read and understand them. These terms will state what can be done with the content, how it can be accessed, who can use it, how it can be shared, and so on. They are particularly important for publication activity and for data retention for reproducibility purposes, and their limits may cause difficulty for your project and dissemination activity.

Some of these tools are already included within library subscriptions and can be accessed directly via the platforms/ suppliers. Some of the tools are also only available to subscribing institutions and may even require advanced subscription levels, over and above what might already be in place. Other tools may form standalone products unconnected to subscriptions.  These are likely to be chargeable and the costs can be considerable.

Researchers should ensure that any tool is adequately trialled, tested and reviewed before any investment is made.

When seeking to use or trial such a tool please do consult with Library Services for advice and guidance.

Third-party tools

There is a growing number of open-source and commercial packages available to enable TDM activity. Specific disciplines may have different preferences in terms of function and need, which will determine the suitability of a particular tool for a particular activity. Some of these tools are commercial software packages that can be licensed (at cost) for a particular researcher or group. Free and Open Source Software (FOSS) tools are also available, designed so that researchers can make changes and add developments as required.

Many of these tools have support infrastructures and communities of practice behind them, with researchers and developers improving or customising the functionality for specific projects. They can handle datasets from multiple sources and are not tied to one particular supplier. This increased flexibility may be attractive to some researchers, especially those more confident in using programming or coding techniques to tailor the package to their needs.

As TDM becomes more accessible and mainstream, many tools are far easier to use than they once were. Many can be picked up quickly and easily, allowing novice researchers to engage with their research questions more readily. Others, especially those intended for bespoke customisation, do require some knowledge of programming languages such as R or Python.
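As a brief illustration, the sketch below uses one widely used FOSS Python package, scikit-learn (assumed installed), to build a document-term matrix from raw text in a few lines, a task a bespoke tool would otherwise have to implement itself. The documents are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Text mining extracts patterns from text.",
    "Data mining discovers relationships in data.",
]

# Build a document-term matrix: one row per document, one column per
# term, with common English stop words removed.
vectoriser = CountVectorizer(stop_words="english")
matrix = vectoriser.fit_transform(documents)

print(vectoriser.get_feature_names_out())  # the extracted vocabulary
print(matrix.toarray())                    # term counts per document
```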

Additionally, the structure, format and usability of data extracted by these tools may require additional processing (i.e. stage 2 above) before it can be analysed. For example, text may need structure added or removed, HTML/XML tags stripped, or optical character recognition applied to ensure it is machine readable as a textual work. This pre-analysis processing can be considerable, especially if you are working with digitised historic print or handwritten materials.
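For instance, a minimal sketch of one such pre-analysis step, stripping HTML tags from a harvested page using only Python’s standard library, might look like this (real projects often prefer fuller parsers such as Beautiful Soup):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the markup itself is skipped.
        self.chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by the removed markup.
        return " ".join(" ".join(self.chunks).split())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Results</h1><p>Mined  text.</p></body></html>")
print(extractor.text())  # -> Results Mined text.
```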

As with the supplier tools, some of these third-party tools may be limited to, or focused on, particular datasets or data types, and may have limited data flow rates. They may lack some of the interactivity, ‘gloss’ or easy functionality that the supplier solutions have, and researchers may need to combine different tools to achieve the desired result. They may also require technical skill and expertise, along with server space, to operate effectively.

Bespoke and researcher-created tools

Beyond supplier and third-party tools, researchers can create their own tools to carry out the mining activity if they so wish. These tools could be based on an existing FOSS tool or created entirely from scratch.

These custom tools will be dedicated to the specific task they are required for and, by their very nature, will do exactly what the researcher wants them to do. They can target the precise data sources, formats and structures the research activity requires, and can control flow rates and other variables that enable the activity to take place. As with the third-party tools mentioned above, other pre-analysis processes may be required before the data can be analysed correctly.

Many researchers, however, do not have the time to write, develop and test custom tools within the scope of a particular research project, the technical skill to create such bespoke programs, or the funding to pay for a Research Software Engineer.

When building and running bespoke solutions like this, the question of suitable research infrastructure arises. Is there enough computing power and storage space available to perform the functions required? Where and how will the data be stored? Support may be available from BEAR and the Advanced Research Computing team, but this may be chargeable.

Researchers should also ensure that their research methodology is reproducible which may mean ensuring the code is sufficiently accessible, available and functional for peer reviewers to validate their results.

Supplier restrictions

Tools that are independent of suppliers may encounter difficulties, as some sites deploy security mechanisms to protect themselves from attack. Some sites use machine-readable robots.txt files or robots meta tags that will prevent scripts from running effectively, which may result in the tool not functioning or results being unavailable. Anything that places unusual loads on supplier servers may be questioned and objected to by suppliers, who have the right to complain about mining activity, especially where it undermines the integrity of their infrastructure. Engaging in dialogue and cooperation with suppliers may help avoid some of these issues.
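A minimal sketch of a considerate approach, using Python’s standard library to honour a site’s robots.txt and to throttle requests, is shown below. The site, user agent, pages and delay are placeholders, not real targets or recommended values.

```python
import time
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"   # placeholder, not a real target
AGENT = "tdm-demo/0.1"             # identify your script honestly

# Fetch and parse the site's robots.txt before mining anything.
rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

pages = [SITE + "/articles/1", SITE + "/articles/2"]
delay = rp.crawl_delay(AGENT) or 1.0  # honour any declared crawl delay

for url in pages:
    if not rp.can_fetch(AGENT, url):
        print("Skipping (disallowed by robots.txt):", url)
        continue
    print("Fetching:", url)
    # ... fetch and process the page here ...
    time.sleep(delay)  # pause between requests to limit server load
```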

Library Services receives notifications from suppliers informing us of suspicious or unusual activity which suppliers might think is associated with unauthorised access or hacking. Occasionally this can cause researchers or IP addresses to be locked out of a certain site until the query is resolved. In rare circumstances a supplier will suspend all UoB access while it explores the problem.

If such an alert is received, Library Services will work with IT Services to identify the machines and researchers involved and investigate the cause.  If it relates to TDM activity we will liaise with the supplier to explain the circumstances.  Very rarely, we may ask the researcher to join those conversations. 

Library Services will work hard to ensure that your ability to exercise the legal right to mine content is protected as far as possible, while respecting a supplier’s need to protect the stability of their platform.

Post analysis

The TDM exception covers the act of ‘computational analysis’ itself; it does not apply to the archiving of research data, the reproducibility of research, or publication activities within the scholarly communications lifecycle. Essentially, as soon as the analysis is complete the data is no longer covered by the TDM exception, so researchers must either rely on other exceptions in law, or on permissions (licences) from data owners, to enable these activities.

Research Data Management and reproducibility

The University offers guidance on the management of research data, especially where data are sourced from, and owned by, external parties. Where the subscription agreement and/or the tool’s terms and conditions allow, those data retention provisions should be followed. It may be that data can be retained for certain periods, and researchers will be responsible for managing compliance with those terms.

If the data cannot legally be stored beyond the analysis stage, it is important to document the methodology and data sources used so that other researchers can reproduce and validate your results.
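One lightweight way to do this is to record a small machine-readable manifest alongside the results. The sketch below is a minimal example; every field name, value and source in it is illustrative rather than a required standard.

```python
import json
from datetime import date

# Illustrative manifest recording what was mined, when and how, so the
# run can be reproduced after the raw data has been deleted.
manifest = {
    "project": "Example TDM study",  # placeholder name
    "run_date": date.today().isoformat(),
    "sources": [
        {
            "name": "Supplier X full-text API",  # hypothetical supplier
            "query": "climate AND policy",
            "records_retrieved": 12000,
        }
    ],
    "tools": {"python": "3.12", "pipeline_version": "0.3.1"},
    "notes": "Raw text deleted after analysis per licence; derived counts retained.",
}

with open("tdm_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```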

Publications

Some library subscription licences and supplier tool/API agreements allow data to be stored, and may also allow extracts from the data to be included within publications. Unfortunately, there is a trend towards limiting these permissions significantly, rendering them almost unworkable for researchers.

Such clauses limit reuse in publications to so-called ‘snippets’ of data, e.g. 150–200 characters from a single copyright work or source. This is frequently unusable for publication purposes, where a quotation or a single line in a concordance table may exceed the limit.

Where such limitations are unworkable, researchers are left with two options:

  1. Separate publication permission:

    As is common in the publication process, permission to reuse extracts can be obtained from the rights holders. 

    Obtaining this permission can be difficult and time-consuming, especially where there are many rights holders or the original content is old. Depending on what content you are using and how, there may also be a fee involved.

  2. Use other exceptions:

    There are other exceptions in UK law which permit the use of extracts for the purposes of quotation or criticism and review. If you are using the extract for either of these purposes, and you are citing the author/owner in the usual way, the use may fall within the scope of the exception.

    Publishers may object to the use of these other exceptions and require authors to follow option 1 above. 

We would encourage authors to discuss the use of these exceptions with their publisher rather than relying on permission alone.

More detail on exceptions.

 
