Posts Tagged ‘HISTORE’

Example page from the Text Mining course

The Institute of Historical Research now offers a wide selection of digital research training packages designed for historians and made available online on History SPOT. Most of these have received mention on this blog from time to time and hopefully some of you will have had a good look at them. These courses are freely available and we only ask that you register for History SPOT to access them (which is a free and easy process). Full details of our online and face-to-face courses can also be found on the IHR website.

I thought that it might be useful to talk a little more about these courses on the blog and provide a brief sample. Over the coming months I will publish a series of posts about each of our training courses, and give you a little sneak peek so that you have a better idea of what to expect.

I have chosen the Text Mining module as the first, chiefly because it is probably the one that best exemplifies what we are trying to do: to make digital tools accessible to historians through a series of introductory training courses. The Text Mining for Historians module does just this, beginning from the very simple and slowly moving forward toward the more complex.

Text mining is not a tool in itself, but a series of tools that enable us to explore, interrogate, and analyse large bodies of text. Imagine, if you will, that you have gathered together a corpus of text – perhaps it’s a diary or series of diaries from a particular period, perhaps it’s a series of publications on a particular subject, or maybe it’s a set of official records spanning many decades or even centuries. Normally you would wade through these documents one at a time and take notes. Text mining allows you to automate certain elements of this task and helps you to discover trends and connections that you might never find by reading the texts through traditional methods.

This training module takes you from the theory (i.e. what is text mining all about) through to its application for historical texts, and eventually on to the more complex areas of what is called topic modelling, natural language processing, and named entity recognition.  In this post I’m going to quote from the opening section of this course as it gives a description of what historians might consider a good use for text mining.  In this example we are looking at the Old Bailey Trial accounts used on the popular Old Bailey Proceedings Online website:


Would you like to know how often the word ‘guilty’ appears in the Old Bailey trial accounts? The answer is findable using a standard search engine on the Old Bailey Online website (it’s 182612). How about how many people were found guilty? The answer is 163261. What about the number of defendants found guilty of murder? The answer is 1518. These last two figures are not possible to find through the standard search engine as they are an entirely different type of question; we are not looking for how many times the word ‘guilty’ appears in the proceedings but how many trials resulted in a guilty verdict. We want to discover something meaningful within the body of texts, automatically rather than manually checking each and every trial account.

This is a relatively simple example of text mining where the original documents have been marked up and tagged by surname, given name, alias, offence, verdict, and punishment. To calculate those results manually you would have to work your way through 197,745 criminal trial accounts (some 127 million words in total).
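To make the difference between the two kinds of question concrete, here is a minimal sketch in Python. It is only an illustration: the element names (`trial`, `offence`, `verdict`, `text`) and the three sample trials are invented for this post, and the real Old Bailey markup is richer and structured differently. The first function does what a search engine does and counts a word in the running text; the second counts trials by their tagged verdict.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, much-simplified markup invented for illustration;
# the real Old Bailey XML uses a different and richer structure.
SAMPLE = """
<trials>
  <trial id="t1">
    <offence>murder</offence>
    <verdict>guilty</verdict>
    <text>The prisoner pleaded not guilty but the jury found him guilty.</text>
  </trial>
  <trial id="t2">
    <offence>theft</offence>
    <verdict>not guilty</verdict>
    <text>The jury were not satisfied that she was guilty.</text>
  </trial>
  <trial id="t3">
    <offence>theft</offence>
    <verdict>guilty</verdict>
    <text>He confessed the fact before the court.</text>
  </trial>
</trials>
"""


def count_word(root, word):
    """Count occurrences of a word in the running text (a plain search)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(
        len(pattern.findall(trial.findtext("text", default="")))
        for trial in root.iter("trial")
    )


def count_verdicts(root, verdict, offence=None):
    """Count trials whose tagged verdict (and, optionally, offence) matches."""
    total = 0
    for trial in root.iter("trial"):
        if trial.findtext("verdict") != verdict:
            continue
        if offence is not None and trial.findtext("offence") != offence:
            continue
        total += 1
    return total


root = ET.fromstring(SAMPLE)
print(count_word(root, "guilty"))                        # word occurrences
print(count_verdicts(root, "guilty"))                    # guilty verdicts
print(count_verdicts(root, "guilty", offence="murder"))  # guilty of murder
```

In this toy sample the word ‘guilty’ appears more often than there are guilty verdicts, because ‘not guilty’ pleas and acquittals also contain the word. Only the tagged verdict answers the second kind of question.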

This form of text mining, however, is little more than an advanced search engine – useful but limited. As the creators of the Old Bailey Online themselves admit (and have attempted to redress in a subsequent project):

‘Analyzing this kind of data by decade, or trial type, or defendant gender etc., can re-enforce the categories, the assumptions, and the prejudices the user brings to each search and those applied by the team that provided the XML markup when the digital archive was first created’.

– Dan Cohen et al, ‘Data Mining with Criminal Intent’, Final White Paper (31 August 2011), p. 12.

In other words the search options and text tagging were emphasising and reinforcing a pre-determined expectation of what the resource creators believed was the important data. Text mining tools can help to explore alternative questions more openly.

The Data Mining with Criminal Intent (DMCI) project has done just this by enabling researchers not only to query the Old Bailey site but to export those results to a Zotero library to be managed and from there to Voyeur and other text mining tools for text analysis and visualisation.

The team behind the project uses the example of an investigator trying to understand the role poison might have had in murder cases. Using the search engine brings up 448 entries for ‘poison’ but doesn’t tell us much about what this means. Using Zotero and Voyeur it is possible to filter out the stop words and legal terminology common in all entries to find out what other words commonly appear near to the word ‘poison’. Through this method of text mining it was possible to conclude that poison was probably more commonly administered through drinks such as coffee than through food (see pp. 6-7 of the white paper report ‘Data Mining with Criminal Intent’).
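The idea of filtering out stop words and then looking at what appears near a keyword can be sketched in a few lines of Python. This is not how Voyeur works internally, merely an illustration of the general technique; the stop list, the window size, and the three sample sentences are all invented for the purpose.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real tools ship much longer ones and
# would also let you add legal boilerplate terms.
STOP_WORDS = {
    "the", "a", "an", "of", "in", "and", "to", "was", "her", "his",
    "he", "she", "with", "that", "had", "some", "into", "by",
}


def collocates(texts, keyword, window=4, stop_words=STOP_WORDS):
    """Count the words found within `window` content words of `keyword`.

    Stop words are removed first, so the window is measured in
    remaining (content) words rather than in raw word positions.
    """
    counts = Counter()
    for text in texts:
        tokens = [
            token
            for token in re.findall(r"[a-z]+", text.lower())
            if token not in stop_words
        ]
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            # Ignore repeats of the keyword itself.
            counts.update(n for n in neighbours if n != keyword)
    return counts


# Invented sentences, standing in for exported trial accounts.
texts = [
    "She put the poison in his coffee that morning.",
    "The poison was mixed into the coffee by the servant.",
    "He had some poison hidden in the bread.",
]

for word, count in collocates(texts, "poison").most_common(3):
    print(word, count)
```

With a real corpus you would feed in the exported trial texts rather than three made-up sentences, and the choice of stop list and window size would noticeably affect which neighbouring words rise to the top.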


If you would like to have a look at this module please register for History SPOT for free and follow the instructions (http://historyspot.org.uk). If you would like further information about this course, and the others that the IHR offers, please have a look at our Research Training pages on the IHR website.


Example from one of the upcoming semantic data module pages.

All of the podcasts from last month’s digital tools workshop are now available on History SPOT. The Histore workshop introduced our new modules on semantic data and text mining, which will be released at the end of August. It also looked more generally at how historians can use semantic data and text mining. The presentations are as follows:

Introduction to the Project
Jonathan Blaney (IHR) and Dr Matt Phillpott (IHR)

Discussion of the Histore Project – its aims and outputs. Jonathan talks about the various digital tools that we are looking at and discusses the tools audit and case studies. Matt Phillpott discusses the semantic data and text mining modules.

An Introduction to Text Mining
Matteo Romanello (DAI/KCL)

Matteo describes the process of text mining and how it might be useful for historians.

New Tools for Old Books
Pip Willcox (Bodleian Library, Oxford)

Pip talks about the Text Creation Partnership for Early English Books Online (EEBO). The partnership encodes its transcriptions following the guidelines of the Text Encoding Initiative (TEI), producing machine-readable copies of old texts for greater searchability and analysis. Pip’s talk touches upon the issues of semantic data and text mining.

To listen to these podcasts click here.

Also check out the Histore Blog.


A few weeks ago the IHR held an afternoon workshop on the topic of digital tools. We were promoting the fruits of a JISC-funded project called Histore, from which we will develop guidance and information about digital tools useful for historians. Amongst these will be two modules that will appear on History SPOT in late August. The modules relate to one another and are on the topics of semantic mark-up and text mining for use by historians. Both modules are designed as introductions to the tools for beginners with little or no knowledge of what they do or how to use them. Last week I posted a summary of the workshop on the Histore blog, but due to its relevance to History SPOT, I thought it worthwhile to repeat it here.

This blog post first appeared on 9 July 2012 on the Histore blog: Digital tools Workshop – overview of the breakout sessions

Our recent workshop on digital tools for historians has given us plenty of food for thought.  Do historians want training in digital tools?  The answer seemed to be yes (although admittedly we might have been talking with the already converted).

Do historians have time or incentive to undertake training in digital tools? Ah! Now we have a problem. The overwhelming response during our breakout sessions was that there was little incentive or guidance within the profession in regard to digital tools. Indeed, a British Library study funded by JISC, fresh off the press, has confirmed that Generation Y at least (that is, those born between 1982 and 1994) are not as ready to use complex digital tools as is often assumed. The report Researchers of Tomorrow: The research behaviour of Generation Y doctoral students (2012) suggests more tailor-made training is required, although it also agrees that there remains a reluctance to undertake such training unless it is already recognised as essential to students’ current research.

A further problem, touched upon in our breakout sessions, is a lack of basic knowledge about what tools exist to achieve research tasks. There is no advice as to how easy or difficult those tools are to use (including how much time and cost it will take to learn them). Neither is there much advice on how tools can be adapted and used in historical research in general.

Sample page from the Text Mining module in development


These are all serious impediments that historians will need to address, as digital tools can offer exciting new opportunities to learn things from our textual heritage. Group 2 from our breakout sessions, for example, argued for digital tools training to be included within undergraduate tuition. Such training, they argued, should be viewed as a fundamental research skill and be given as much weight as non-digital skills tuition. Group 3 suggested adding digital tools training to skills workshops as a means of adding to the PhD ‘package’.

What was interesting, however, was a feeling that came out of all three groups: such dedicated training is not generally where they themselves go to learn these skills, nor something that they necessarily want to go through to achieve their initial aims. They liked to dip into a subject to learn what they need, and then, if it is useful enough, consider a full face-to-face or online course. Group 1 emphasised that if they need to learn something about a digital tool they will generally Google it and find the information on forums, blogs, and wikis. Indeed, many participants had used free training materials found through these methods.

Nevertheless, such searching relies upon knowing what tools exist in the first place and which are useful to research. Group 1 discussed the need for a central location where such information could be found by historians. It was pointed out that Arts-Humanities.net provides such a service. It was interesting that few in the group were aware of this.

In all, it would appear from the discussion in our breakout groups that historians want more easily available information on what tools there are and how these might be applicable to their own research. They want to be able to find out a little bit about these tools quickly and, where possible, gain a basic knowledge of how they work and what can be done with them, before considering spending their time on a training course. What type of training course they want was, however, not made quite clear. Do historians want face-to-face training on specific tools or techniques? Or would they prefer online courses? Perhaps a mixture of both?

From these discussions it would appear that our approach with the two HISTORE modules (one on semantic data and another on text mining) was the right one. We are creating two relatively short, freely available modules that introduce each subject and suggest what historians can potentially gain from using such tools. The modules are broken down into sections which work through the process from the basic to the more complex (although they are not intended to give everything you would want to know about the tools). These, then, are introductions. The first section of each course will introduce you to the tool and can be read within 30 minutes (probably more like 10 if you don’t do the exercises). From there you can go further if you would like to gain a basic grasp of the tool. In some cases that might well be enough for what you need. At the very least the modules should enable you to judge for yourself whether more training and time should be spent learning about the tool.

Over the course of the next week we shall post brief bullet point notes from each of the breakout sessions, so you can see a little more of what was said.  Soon after this, we will also post the audio and hopefully video from the presentations given at the workshop.  By the end of August we hope to have the modules ready for release and so we will be talking a little more about these very soon!


History SPOT will soon be home to training modules from the IHR Digital project (funded by JISC) HISTORE. This project is developing short modules introducing various digital tools that might be of use to historians. For instance, I’m currently working on the introduction pieces for a module on Text Mining. This tool allows historians to search large corpora of digitised texts in a deeper and more meaningful way than an ordinary search engine could ever achieve on its own.

As part of the project the IHR will be holding an afternoon workshop on Thursday 21 June (2pm to 4.30pm) on the topic of using and learning about digital tools for historical research. This blog post, then, is an invitation. If you would like to join us please email Jonathan Blaney at jonathan.blaney@sas.ac.uk. It doesn’t matter if you have in-depth knowledge of digital tools or whether you are just interested in finding out something about what such tools might offer; we would very much like to have you there.

There will be several talks followed by a break-out session.  The project team will discuss the work we’ve done to date and there will be more general talks on the topics of semantic markup and text mining.  Attendees are encouraged to bring digital project ideas to discuss during the break-out. There will also be an opportunity to discuss your projects with us one-to-one, if you’d like to.

This workshop is free but places are limited. So again, if you’d like to come to the workshop, or have any questions about it, just drop us an email at jonathan.blaney@sas.ac.uk.


I have been writing this blog ever since I took on History SPOT for the IHR over two years ago.  It took me a while to find my feet as I had never created or written a blog before.  My remit was to make the blog more interesting than just relaying update reports which would quickly become dull not only to read but also to write. 

“What we want is a ‘day in the life’ of a project officer,” Jane Winters (head of IHR Publications) told me on my first day. Looking back at my blog posts I don’t think I have ever actually done that. I have discussed research training and the nature of podcasts. I have narrated the highs and lows of live streaming. I have summarised or reviewed numerous IHR podcasts and given the odd project update. But I have never talked about my working day. Perhaps it is time to do just that. Time to indulge in a little bit of ‘this is what I do’, although I won’t go on for too long, I promise.

My working day begins at a railway station, queuing with other commuters in untidy columns roughly where the train doors will open. My train journey takes about 30 minutes, in which time I often listen to one of our podcasts and take notes. This morning I was listening to a talk about the development of cricket as a sport in France. Yesterday, the subject was ‘Memory’ as a focus for looking at the early modern period. I never know what subject will come up next, which makes the process all the more fun.

After dodging crowds of commuters it’s coffee time! In the café I will generally write up my blog posts, usually from the recording I was listening to on the train. Then it’s a short walk into the office where I pick up the audio recorders from seminars held the night before. Once at work proper, I check my emails, upload the day’s podcast to History SPOT and add a new blog post to the History SPOT blog. These are daily tasks Monday to Thursday which I tend to do early on so that I can start to work through my task list for the rest of the day.

I then upload the audio file from the recorder to my computer and edit the file.  This usually consists of chopping off the beginning and end, adjusting the sound levels (as much as possible), and adding metadata to the finished mp3. 

For the rest of this morning I worked on the HISTORE project. At the moment I’m working on a short case study about the John Foxe Online project as an example of semantic data. Although John Foxe and his Acts and Monuments were the focus of my PhD thesis, and despite my helping out on some of the text transcription, I had thought next to nothing about what any of this meant in terms of the digital tools employed, so this work is proving quite illuminating.

In the afternoon I finished editing one of the Digital History videos – adding images to the video and zooming in and out where appropriate. This is time-consuming work but quite relaxing and enjoyable. There is something satisfying about creating a short video.

My penultimate task of the day was to continue working on the Online Databases course that we are developing for launch in 2012/13. Mark Merry (its author) provided me with additional text and images this morning so now it’s a matter of uploading this to History SPOT and making it into something that will display nicely. This often involves working with some straightforward HTML coding and working out in what format the data should be displayed. Again, time-consuming work, but quite enjoyable to do once I get into it.

The final task of the day is to set up the audio recorders for tonight’s seminars. This varies. Some nights there won’t be any to record. Today is one such day. As far as the seminars are concerned we are still in the Easter period so groups have temporarily ground to a halt. Other nights there can be anywhere between one and three events scattered throughout Senate House and Stewart House. This can mean some running around and up and down stairs.

So, in a nutshell, that is roughly a day in the life of the History SPOT Project Officer. From tomorrow I’ll get back to posting some more summaries of our podcasts.

