Text and data mining tools

The following guidance might be useful to you when deciding what tools to use for text and data mining.

Supplier and platform tools

Many suppliers now have APIs, feeds and other systems that researchers can query quickly and easily, granting access to huge swathes of data within a matter of clicks. There are more and more ‘ready made’ or ‘off the shelf’ publisher services available, with increasing flexibility and usability, especially for those beginning to explore data mining projects.

The benefit of these tools is that they have been developed by the suppliers working with their own content. They have been specifically designed and tested with researchers and students in mind to ensure they meet the needs of the community they serve. Many suppliers have extensive engagement with the research community and have used that to develop and refine the packages. This might also include ensuring that the structure of the data, any tags and the formatting are compatible with and usable by the tool. The tools also benefit from publisher support and growing communities of users.

In addition, as these are official supplier tools there should be no issue with accessing the supplier’s content through them. Issues with licence breaches or security threats posed by machine-driven searching and downloading will not arise, unlike when using third-party or bespoke tools (see below).

Researchers should also note that some of these tools are designed to work with specific data sets and may not be usable for other sources of content. They may also be limited in scope and scale, and may impose data flow limits or restrict which datasets can be accessed.

All of these supplier tools will have separate terms and conditions governing their use and it is important that you read and understand them. These terms will state what can be done with the content, how it can be accessed, who can use it, how it can be shared, etc. They are very important, especially for publication activity and for data retention for reproducibility purposes. These limits may cause some difficulty for your project and dissemination activity.

Some of these tools are already included within library subscriptions and can be accessed directly via the platforms and suppliers. See Text and Data Mining specific resources for information about some of these. Some tools are only available to subscribing institutions and may even require advanced subscription levels, over and above what might already be in place. Other tools may form standalone products unconnected to subscriptions.  These are likely to be chargeable, and the costs can be considerable.

Researchers should ensure that any tool is adequately trialled, tested and reviewed before any investment is made.

When seeking to use or trial such a tool please do consult with Library Services for advice and guidance.

Third-party tools

There is a growing number of open source and commercial packages available to enable TDM activity. Specific disciplines may have different preferences in terms of function and need, which will determine the suitability of a specific tool for a particular activity. Some of these tools are commercial software packages that can be licensed (at cost) for a particular researcher or group. Free and Open Source Software (FOSS) tools are also available, which are designed to enable changes and developments to be added by researchers as required.

Many of these tools have support infrastructures and communities of practice behind them, with researchers and developers improving or customising the functionality for specific projects. They can handle datasets from multiple sources and are not tied to one particular supplier. This increased flexibility may be attractive to some researchers, especially those more confident in using programming or coding techniques to tailor the package to their needs.

Some of the newer tools are far easier to use than they used to be, as TDM is becoming more accessible and mainstream. Many can be picked up quickly and easily, allowing novice researchers to engage with their research questions more readily. Others, especially those intended for bespoke customisation, do require some knowledge of programming languages such as R or Python. BEAR run Software Carpentry workshops on R and Python designed for beginners with little to no prior computational experience. The instructors put a priority on creating a friendly environment to empower researchers and enable data-driven discovery.

The structure, format and usability of data extracted by these tools may require additional processes, i.e. stage 2 above, before it can be analysed. For example, text may need structure to be added or removed, or HTML/XML tags deleted, or it may even need to undergo character recognition to ensure it is machine readable as text. This pre-analysis processing could be considerable, especially if you are working with historic print or handwritten materials that have been digitised.
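As an illustration of the kind of pre-analysis clean-up described above, the following is a minimal sketch of removing HTML tags from a document using only the Python standard library. The names `TextExtractor` and `strip_markup` are hypothetical helpers written for this example, not part of any supplier or third-party tool, and real projects may prefer a dedicated parsing library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and discards markup (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only; script and style content is not prose.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        # Note: a word split by inline tags (e.g. <b>w</b>ord) gains a space.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(markup: str) -> str:
    """Return the plain text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b> &amp; all</p><script>x=1</script>"
    print(strip_markup(sample))
```

Scanned print or handwritten material would need a character recognition step before any clean-up of this kind, which is outside the scope of this sketch.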

As with the supplier tools, some of these third-party tools may be limited to or focused on particular data sets or types and may have limited data flow rates. They may lack some of the interactivity, ‘gloss’ or easy functionality that the supplier solutions have, and researchers may need to combine different tools to achieve the desired result. They may also require technical skill and expertise, along with server space, to operate effectively.

Bespoke and researcher-created tools

Beyond supplier and third-party tools, researchers could create their own tools to carry out the mining activity if they so wish. These tools could be based on an existing FOSS tool or could be created completely from scratch.

These custom tools will be dedicated to the specific task they are required for and, by their very nature, will do exactly what the researcher wants them to do. They will include the precise data sources, formats and structures the research activity requires. They can dictate flow rates and other variables that enable the activity to take place. As with the third-party tools mentioned above, other pre-analysis processes may be required before the data can be analysed correctly.
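To illustrate what dictating a flow rate might look like in a bespoke tool, here is a minimal sketch of a throttle that enforces a minimum gap between successive requests. The class name `Throttle` and its parameters are assumptions made for this example; an appropriate interval would depend on the supplier's terms and any limits they publish.

```python
import time


class Throttle:
    """Enforces a minimum interval between successive calls to wait().

    Hypothetical helper: the clock and sleep functions are injectable
    so the behaviour can be checked without real delays.
    """

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous permitted request

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Typical use (fetch is a placeholder for whatever download function you use):
#
#     throttle = Throttle(min_interval_s=1.0)
#     for url in urls:
#         throttle.wait()
#         fetch(url)
```

Calling `wait()` before every request keeps the load on a supplier's servers predictable, which may help avoid the access problems described under Supplier restrictions.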

Many researchers do not have the time to write, develop and test custom tools within the scope of a particular research project. Many also lack the technical skill to write or create such bespoke programmes, or the funding to pay for a Research Software Engineer.

When building and running bespoke solutions like this, the question of suitable research infrastructure arises. Is there enough computing power and storage space available to perform the functions required? Where and how will the data be stored? Support may be available from BEAR and the Advanced Research Computing Team, but this may be chargeable.

Researchers should also ensure that their research methodology is reproducible, which may mean ensuring the code is sufficiently accessible, available and functional for peer reviewers to validate their results.

Supplier restrictions

Researchers using tools that are independent of suppliers may encounter difficulties, as some sites deploy security mechanisms to protect themselves from attack. Some sites may include machine-readable ‘robots.txt’ files or tags that will prevent scripts from running effectively and may result in the tool not functioning or results being unavailable. Anything that places unusual loads on supplier servers may be questioned and objected to by suppliers. Suppliers have the right to complain about mining activity, especially where it undermines the integrity of their infrastructure. Engaging in dialogue and cooperation with suppliers may help avoid some of these issues.
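The machine-readable exclusion file described above is conventionally published as robots.txt at the root of a site. As a minimal sketch, a script could consult it before requesting a page using Python's standard `urllib.robotparser` module. The robots.txt content, the `publisher.example` domain and the `my-tdm-script` user agent below are all invented for illustration; passing this check does not replace reading the supplier's licence terms.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, standing in for what a supplier might publish.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS)
    agent = "my-tdm-script"  # hypothetical user-agent name

    # Is this script permitted to request each path?
    print(parser.can_fetch(agent, "https://publisher.example/articles/123"))
    print(parser.can_fetch(agent, "https://publisher.example/search/results"))

    # Requested pause between requests in seconds, or None if not stated.
    print(parser.crawl_delay(agent))
```

In a real script, `RobotFileParser.set_url()` followed by `read()` would fetch the live file from the supplier's site instead of parsing a string.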

Library Services receives notifications from suppliers informing us of suspicious and unusual activity that suppliers think might be associated with unauthorised access or hacks. Occasionally this can cause researchers or IP addresses to be locked out of a certain site until the query is resolved. In rare circumstances a supplier will suspend all UoB access while it explores the problem.

If such an alert is received, Library Services will work with IT Services to identify the machines and researchers involved and investigate the cause.  If it relates to TDM activity we will liaise with the supplier to explain the circumstances.  Very rarely, we may ask the researcher to join those conversations. 

Library Services will work hard to ensure that your ability to exercise the legal right to mine content is protected as far as possible, while respecting a supplier’s need to protect the stability of their platform.
