Posts Tagged ‘text mining’

The Institute of Historical Research now offers a wide selection of digital research training packages designed for historians and made available online on History SPOT. Most of these have been mentioned on this blog from time to time, and hopefully some of you will have had a good look at them. These courses are freely available; we only ask that you register for History SPOT to access them (which is a free and easy process). Full details of our online and face-to-face courses can also be found on the IHR website. Here is a brief look at one of them.

When the Institute of Historical Research began building research training modules online, we decided fairly early on that they needed to be much more than just text. In the Text Mining for Historians module we included various videos to help learners improve their knowledge of the subject. One of these was a very simple introduction to natural language processing.

This video, available on the course and on Vimeo, is very short and discusses natural language processing (NLP for short) in very basic terms. This is intentional, as the rest of this section of the module looks at the subject in much more detail.

What is Natural Language Processing? from History SPOT on Vimeo.

If you would like to have a look at this module, please register for History SPOT for free and follow the instructions (http://historyspot.org.uk). If you would like further information about this course and the others that the IHR offers, please have a look at our Research Training pages on the IHR website.

Read Full Post »

Example page from the Text Mining course

The Institute of Historical Research now offers a wide selection of digital research training packages designed for historians and made available online on History SPOT. Most of these have been mentioned on this blog from time to time, and hopefully some of you will have had a good look at them. These courses are freely available; we only ask that you register for History SPOT to access them (which is a free and easy process). Full details of our online and face-to-face courses can also be found on the IHR website.

I thought that it might be useful to talk a little more about these courses on the blog and provide a brief sample. Over the coming months I will post a series of pieces about each of our training courses and give you a little sneak peek, so that you have a better idea of what to expect.

I have chosen the Text Mining module as the first because it is probably the one that best exemplifies what we are trying to do: to make digital tools accessible to historians through a series of introductory training courses. The Text Mining for Historians module does just this, beginning from the very simple and slowly moving forward toward the more complex.

Text mining is not a tool in itself, but a series of tools that enable us to explore, interrogate, and analyse large bodies of text. Imagine, if you will, that you have gathered together a corpus of text: perhaps it’s a diary or series of diaries from a particular period, perhaps it’s a series of publications on a particular subject, or maybe it’s a set of official records spanning many decades or even centuries. Normally you would wade through these documents one at a time and take notes. Text mining allows you to automate certain elements of this task and helps you to discover trends and connections that you might never find by reading the texts through traditional methods.

This training module takes you from the theory (i.e. what is text mining all about) through to its application for historical texts, and eventually on to the more complex areas of what is called topic modelling, natural language processing, and named entity recognition.  In this post I’m going to quote from the opening section of this course as it gives a description of what historians might consider a good use for text mining.  In this example we are looking at the Old Bailey Trial accounts used on the popular Old Bailey Proceedings Online website:

 ****

Would you like to know how often the word ‘guilty’ appears in the Old Bailey trial accounts? The answer is findable using a standard search engine on the Old Bailey Online website (it’s 182612). How about how many people were found guilty? The answer is 163261. What about the number of defendants found guilty of murder? The answer is 1518. These last two figures are not possible to find through the standard search engine as they are an entirely different type of question; we are not looking for how many times the word ‘guilty’ appears in the proceedings but how many trials resulted in a guilty verdict. We want to discover something meaningful within the body of texts, automatically rather than manually checking each and every trial account.

This is a relatively simple example of text mining where the original documents have been marked up and tagged by surname, given name, alias, offence, verdict, and punishment. To calculate those results manually you would have to work your way through 197,745 criminal trial accounts (some 127 million words in total).

This form of text mining, however, is little more than an advanced search engine – useful but limited. As the creators of the Old Bailey Online themselves admit (and have attempted to redress in a subsequent project):

‘Analyzing this kind of data by decade, or trial type, or defendant gender etc., can re-enforce the categories, the assumptions, and the prejudices the user brings to each search and those applied by the team that provided the XML markup when the digital archive was first created’.

– Dan Cohen et al, ‘Data Mining with Criminal Intent’, Final White Paper (31 August 2011), p. 12.

In other words the search options and text tagging were emphasising and reinforcing a pre-determined expectation of what the resource creators believed was the important data. Text mining tools can help to explore alternative questions more openly.

The Data Mining with Criminal Intent (DMCI) project has done just this by enabling researchers not only to query the Old Bailey site but also to export those results to a Zotero library to be managed, and from there to Voyeur and other text mining tools for text analysis and visualisation.

The team behind the project uses the example of an investigator trying to understand the role poison might have had in murder cases. Using the search engine brings up 448 entries for ‘poison’ but doesn’t tell us much about what this means. Using Zotero and Voyeur it is possible to filter out the stop words and legal terminology common in all entries to find out what other words commonly appear near to the word ‘poison’. Through this method of text mining it was possible to conclude that poison was probably more commonly administered through drinks such as coffee than through food (see pp. 6-7 of the white paper report ‘Data Mining with Criminal Intent’).

****
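The ‘poison’ example quoted above rests on a simple idea: look at the words that appear within a few words of a keyword, ignoring very common ones. What follows is only a minimal sketch of that idea in Python, not a description of how Voyeur or Zotero actually work. The three sample sentences, the stop word list, and the window size of four words are all invented for illustration.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list (an assumption for this sketch;
# real tools ship much longer lists).
STOP_WORDS = {"the", "a", "of", "in", "and", "was", "her", "his",
              "to", "he", "she", "that", "had", "into", "it"}

def collocates(texts, keyword, window=4, stop_words=STOP_WORDS):
    """Count the words found within `window` tokens of `keyword`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            # Words before and after the keyword, inside the window.
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
            for word in neighbours:
                if word not in stop_words and word != keyword:
                    counts[word] += 1
    return counts

# Invented sentences standing in for trial accounts.
sample_texts = [
    "She put the poison into his coffee in the morning.",
    "The prisoner mixed poison in the coffee and gave it to the deceased.",
    "He swore that poison was in the tea.",
]

counts = collocates(sample_texts, "poison")
for word, n in counts.most_common():
    print(word, n)
```

On a real corpus the stop word list would also need to include the legal terminology that appears in every trial, as the white paper describes.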

If you would like to have a look at this module, please register for History SPOT for free and follow the instructions (http://historyspot.org.uk). If you would like further information about this course and the others that the IHR offers, please have a look at our Research Training pages on the IHR website.

Read Full Post »

Two new inter-related free modules are now available on the History SPOT platform beginning a series on digital tools.

The first is about semantic mark-up: this is a beginner’s guide to marking up a text in XML so that it is searchable for information pertinent to historical research. Semantic mark-up is extremely common in digital history projects. Take for example the Old Bailey Proceedings Online or TAMO (The Acts and Monuments Online, otherwise known as the John Foxe Project). Both websites have marked up their texts so that you can find specific persons, places, and types of information from an index or via a search engine. Our module will guide you from beginning to end of that process from the starting point of no knowledge whatsoever (beyond some basic knowledge of using computers).
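To give a flavour of why mark-up matters, here is a minimal sketch in Python that contrasts counting a word in the running text with counting tagged values. The XML below is a made-up, much simplified structure; the element and attribute names (trial, offence, verdict, category) are assumptions for illustration and are not the schema used by the Old Bailey Proceedings Online or TAMO.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample records using a hypothetical, simplified schema.
SAMPLE = """
<trials>
  <trial id="t1">
    <offence category="theft"/>
    <verdict category="guilty"/>
    <text>The prisoner pleaded not guilty, but the jury found him guilty.</text>
  </trial>
  <trial id="t2">
    <offence category="murder"/>
    <verdict category="notGuilty"/>
    <text>The jury could not say he was guilty of the charge.</text>
  </trial>
  <trial id="t3">
    <offence category="murder"/>
    <verdict category="guilty"/>
    <text>She was found guilty of poisoning her husband.</text>
  </trial>
</trials>
"""

root = ET.fromstring(SAMPLE.strip())

def count_word(root, word):
    """Count how often a word appears in the running text of all trials."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sum(len(pattern.findall(trial.findtext("text", "")))
               for trial in root.iter("trial"))

def count_trials(root, verdict, offence=None):
    """Count trials by their tagged verdict, optionally by tagged offence too."""
    total = 0
    for trial in root.iter("trial"):
        if trial.find("verdict").get("category") != verdict:
            continue
        if offence is not None and trial.find("offence").get("category") != offence:
            continue
        total += 1
    return total

print(count_word(root, "guilty"))              # word occurrences in the text
print(count_trials(root, "guilty"))            # trials tagged with a guilty verdict
print(count_trials(root, "guilty", "murder"))  # guilty verdicts for murder
```

The word ‘guilty’ appears in every sample trial, including the acquittal, so only the tagged verdict can answer the question of how many defendants were actually found guilty.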

The same is true for the second module, on the topic of text mining.  Where semantic mark-up enables you to find known information more easily, text mining gives you the opportunity to add additional structure to unstructured text (or texts) to enable research on relationships otherwise difficult to identify.  For example, text mining would enable you to explore a large body of text, such as the Old Bailey records, and ask questions about associations: does, for example, mention of wine or beer appear most often with acts of violence or illness?  You would presume it does, but text mining will allow you to confirm that fact.
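As a rough illustration of the wine and beer question, the sketch below counts how many documents that mention a drink also mention a term from each of two themes. It is a deliberately naive keyword approach written in Python; the word lists and the five sample sentences are invented, and a real study would need to deal with spelling variation, inflected forms, and context.

```python
import re

# Hypothetical term lists chosen for this sketch only.
DRINK = {"wine", "beer"}
THEMES = {
    "violence": {"struck", "stabbed", "beat", "fought"},
    "illness": {"sick", "ill", "fever", "vomited"},
}

def words_in(text):
    """Return the set of lower-cased words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def cooccurrence(docs, anchor_terms, themes):
    """For documents mentioning an anchor term, count those that also
    mention at least one term from each theme."""
    result = {name: 0 for name in themes}
    for doc in docs:
        words = words_in(doc)
        if not words & anchor_terms:
            continue  # no drink mentioned, so skip this document
        for name, terms in themes.items():
            if words & terms:
                result[name] += 1
    return result

# Invented sentences standing in for trial accounts.
docs = [
    "After drinking beer at the alehouse he struck the deceased.",
    "They fought over a bottle of wine.",
    "She drank the wine and was sick all night.",
    "He was ill with a fever for a week.",
    "The beer was paid for and they parted friends.",
]

result = cooccurrence(docs, DRINK, THEMES)
print(result)
```

Counting at the level of the whole document is the simplest choice; counting within a window of words around each mention would be a stricter test of association.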

Now, text mining is undoubtedly a more specialist tool and to use it properly requires some extensive technical expertise.  Our module tries to introduce you to the subject lightly and builds upon the training given for semantic mark-up.  Beginners should be able to work their way through the module and understand what they are being asked to do and why.  The module won’t tell you everything there is to know about text mining, nor will it train you in using development tools (although it will show you where you can go to get some basic knowledge on these).  What it will do is show you what is involved and what you might get out of it before taking extensive time to learn the tools in-depth.

Sample page from the Text Mining module

In addition to these two digital tools training modules, we have a tool audit (a list of various digital tools with a little introductory information) and a series of case studies.  These are on the topics of semantic data, text mining, visualisation, linked data, and cloud computing.

These training materials are all outcomes from the JISC-funded HISTORE project. For more details of that project please visit the HISTORE Blog and the Digital Skills workshop podcasts on History SPOT.

 

To view the Digital Tools course materials click here.

Read Full Post »

Example from one of the upcoming semantic data module pages.

All of the podcasts from last month’s digital tools workshop are now available on History SPOT. The Histore workshop introduced our new modules on semantic data and text mining, which will be released at the end of August. It also looked more generally at the uses of semantic data and text mining for historians. The presentations are as follows:

Introduction to the Project
Jonathan Blaney (IHR) and Dr Matt Phillpott (IHR)

Discussion of the Histore Project, its aims and outputs. Jonathan talks about the various digital tools that we are looking at and discusses the tools audit and case studies. Matt Phillpott discusses the semantic data and text mining modules.

An Introduction to Text Mining
Matteo Romanello (DAI/KCL)

Matteo describes the process of text mining and how it might be useful for historians.

New Tools for Old Books
Pip Willcox (Bodleian Library, Oxford)

Pip talks about the Text Creation Partnership for Early English Books Online (EEBO).  This is part of the Text Encoding Initiative (TEI), which makes machine readable copies of old texts for greater searchability and analysis.  Pip’s talk touches upon the issues of semantic data and text mining.

To listen to these podcasts click here.

Also check out the Histore Blog.

Read Full Post »

A few weeks ago the IHR held an afternoon workshop on the topic of digital tools. We were promoting the fruits of a JISC-funded project called Histore, from which we will develop guidance and information about digital tools useful for historians. Amongst these will be two modules that will appear on History SPOT in late August. The modules relate to one another and are on the topics of semantic mark-up and text mining for use by historians. Both modules are designed as introductions to the tools for beginners with little or no knowledge of what they do or how to use them. Last week I posted a summary of the workshop on the Histore blog, but due to its relevance to History SPOT, I thought it worthwhile to repeat it here.

This blog post first appeared on 9 July 2012 on the Histore blog: Digital tools Workshop – overview of the breakout sessions

Our recent workshop on digital tools for historians has given us plenty of food for thought.  Do historians want training in digital tools?  The answer seemed to be yes (although admittedly we might have been talking with the already converted).

Do historians have time or incentive to undertake training in digital tools? Ah! Now we have a problem. The overwhelming response during our breakout sessions was that there was little incentive or guidance within the profession in regard to digital tools. Indeed, newly off the press, a British Library study funded by JISC has confirmed that Generation Y at least (that is, those born between 1982 and 1994) are not as ready to use complex digital tools as is often assumed. The report Researchers of Tomorrow: The research behaviour of Generation Y doctoral students (2012) suggests more tailor-made training is required, although it also agrees that there remains a reluctance to undertake such training unless it is already recognised as essential to students’ current research.

A further problem presents itself on this subject that was touched upon in our breakout sessions; there is a lack of basic knowledge about what tools there are to achieve research tasks.  There is no advice as to how easy or difficult those tools are to use (including how much time and cost it will take to learn).  Neither is there much advice on how tools can be adapted and used in historical research in general.

Sample page from the Text Mining module in development

 

These are all serious impediments that historians will need to address, as digital tools can offer exciting new opportunities to learn things from our textual heritage. Group 2 from our breakout sessions, for example, argued for digital tools training to be included within undergraduate tuition. Such training, they argued, should be viewed as a fundamental research skill and be given as much weight as non-digital skills tuition. Group 3 suggested adding digital tools training to skills workshops as a means of adding to the PhD ‘package’.

What was interesting, however, and came out of all three groups, was a feeling that such dedicated training is not generally where they themselves go to learn these skills, nor something that they necessarily want to go through to achieve their initial aims. They liked to dip into a subject to learn what they need and then, if it is useful enough, consider a full face-to-face or online course. Group 1 emphasised that if they need to learn something about a digital tool they will generally Google it and find the information on forums, blogs, and wikis. Indeed, many participants had used free training materials found through these methods.

Nevertheless, such searching relies upon knowing what tools exist in the first place and which are useful to research. Group 1 discussed the need for a central location where such information could be found by historians. It was pointed out that Arts-Humanities.net provides such a service. It was interesting that few in the group were aware of this.

In all, it would appear from the discussion in our breakout groups, that historians want more easily available information on what tools there are and how these might be applicable to their own research.  They want to be able to find out a little bit about these tools quickly, and, where possible, gain a basic knowledge of how they work and what can be done with them, before considering spending their time on a training course.  What type of training course was, however, not quite made clear.  Do historians want face to face training on specific tools or techniques?   Or would they prefer online courses?  Perhaps a mixture of both?

From these discussions it would appear that our approach with the two HISTORE modules (one on semantic data and another on text mining) was the right one.  We are creating two relatively short freely available modules that introduce each subject and which suggest what historians can potentially gain from using such tools.  The modules are broken down into sections which work through the process from the basic to the more complex (although they are not intended to give everything you would want to know about the tools).  These then, are introductions.  The first section of each course will introduce you to the tool and can be read within 30 minutes (probably more like 10 if you don’t do the exercises).  From there you can go further if you would like to gain a basic grasp of the tool.  In some cases that might well be enough for what you need.  At the very least the modules should enable you to judge for yourself whether more training and time should be spent learning about the tool.

Over the course of the next week we shall post brief bullet point notes from each of the breakout sessions, so you can see a little more of what was said.  Soon after this, we will also post the audio and hopefully video from the presentations given at the workshop.  By the end of August we hope to have the modules ready for release and so we will be talking a little more about these very soon!

Read Full Post »

History SPOT will soon be home to training modules from the IHR Digital project (funded by JISC) HISTORE. This project is developing short modules introducing various digital tools that might be of use to historians. For instance, I’m currently working on the introduction pieces for a module on Text Mining. This tool allows historians to search large corpora of digitised texts in a deeper and more meaningful way than an ordinary search engine could ever achieve on its own.

As part of the project the IHR will be holding an afternoon workshop on Thursday 21 June (2pm to 4.30pm) on the topic of using and learning about digital tools for historical research. This blog post, then, is an invitation. If you would like to join us please email Jonathan Blaney at jonathan.blaney@sas.ac.uk. It doesn’t matter whether you have in-depth knowledge of digital tools or are just interested in finding out something about what such tools might offer; we would very much like to have you there.

There will be several talks followed by a break-out session.  The project team will discuss the work we’ve done to date and there will be more general talks on the topics of semantic markup and text mining.  Attendees are encouraged to bring digital project ideas to discuss during the break-out. There will also be an opportunity to discuss your projects with us one-to-one, if you’d like to.

This workshop is free but places are limited. So again, if you’d like to come to the workshop, or have any questions about it, just drop us an email at jonathan.blaney@sas.ac.uk.

Read Full Post »
