At some stage or another any historian in the twenty-first century will consider embarking on a digitisation project of their own. Back in 2010 I briefly explored the possibility of organising the digitisation of some old school textbooks that I had been researching as part of my work on the Teaching Reading in Australia project. If I were to organise this, I wanted to do it properly and ensure the resulting data could be linked to other similar historical data and be useful for other researchers. I did not want to do another project that merely reproduced pretty pictures of text (PDFs) that were not machine readable.
I was quickly confronted by the sad fact that my ambitions exceeded my skills. From attending THATCamps, reading blogs and following digital humanists on Twitter I knew that I should encode the data in XML using the framework provided by the Text Encoding Initiative (TEI), but I didn’t know how to do that. I don’t like doing something unless I do it properly, and I always have too much to do, so I dropped the idea.
Like all historians I have transcribed many hand-written documents from photos of primary sources I have taken for research purposes in the archives. Each document is idiosyncratic. The relevant items on a page are not restricted to words. There are underlines, crossed out words (who did the crossing out?), and notes scribbled in the margins by the original author at a later date or by someone else. There are arrows, drawings or diagrams. Too often the writing is illegible. Each of these important bits of information needs to be recorded in the transcription. Quite often I will use markup borrowed from HTML or make up my own methods to signal a type of message in a transcription.
Since then I have been fascinated by a project of Dr Melodee Beals who is a Senior Lecturer in History at Sheffield Hallam University. Beals is marking up her transcriptions of historic documents in TEI. Separating the design from the text is a fundamental principle of web design. TEI enables us to prepare the transcription in a way that can be easily formatted for display on websites via XSLT. Beals’ project makes so much sense for historians. Why not incorporate some basic TEI markup in our transcriptions from the moment we start transcribing documents?
I needed to learn more about this mysterious TEI.
Fortune smiled and one of the workshops offered at this week’s Global Digital Humanities Conference covered basic TEI. For the last day and a half I have been learning about TEI and manipulation of images in the workshop, ‘Introduction to Digital Manuscript Studies’, conducted by Elena Pierazzo, Professor of Italian Studies and Digital Humanities at the University of Grenoble 3 ‘Stendhal’, and Peter Stokes, Senior Lecturer in Digital Humanities at King’s College London. (Have a look at the impressive results of Pierazzo’s TEI transcription work on Proust’s notebook).
I now have the kickstart that I need. Last night I worked on marking up a transcription I had done of a document from my own project to reinforce what I had learned. One thing that has been bothering me about some transcriptions available on the internet is the lack of consistency with date formatting. There are many ways we can write dates and authors of handwritten documents use all sorts of approaches. Last night I discovered ‘13 Names, Dates, People, and Places’. This is the TEI chapter for me! I discovered how to encode a consistent, searchable date format while preserving the idiosyncratic way it was recorded in the original document. Oh, the potential of this!
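To make the date idea concrete, here is a minimal sketch, not taken from the workshop materials. It assumes the TEI `<date>` element with its `when` attribute holding an ISO-style date, while the element's text keeps the wording of the original document. The sample sentence, the dates in it and the function name `extract_dates` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The TEI namespace; the "tei" prefix is just a local label for lookups.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: two dates written the way a minute-taker might write them.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0">Meeting held on
<date when="1915-04-25">25th April 1915</date> and adjourned until
<date when="1915-05-03">Monday 3/5/15</date>.</p>"""


def extract_dates(xml_text):
    """Return (normalised date, original wording) pairs in document order."""
    root = ET.fromstring(xml_text)
    results = []
    for el in root.iterfind(".//tei:date", TEI_NS):
        # The attribute carries the consistent, searchable form.
        normalised = date.fromisoformat(el.get("when"))
        # The element text preserves what the author actually wrote.
        original = "".join(el.itertext()).strip()
        results.append((normalised, original))
    return results


dates = extract_dates(SAMPLE)
for normalised, original in sorted(dates):
    print(normalised.isoformat(), "<-", original)
```

Because the searchable form lives in the attribute, the dates can be sorted or filtered without touching the transcribed wording.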
Another thing I loved about this was the ability to tag names of people who occur in the document and link that to properly formatted biographical detail of those people in the same document. This leads to a document where potentially all mentions of that person can be flagged whether they are referred to as Harry or Smith or Mr Smith or the driver or….
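A rough sketch of how that linking might look follows. It assumes TEI's `<persName>` and `<rs>` elements pointing, via a `ref` attribute, to a `<person>` entry with a matching `xml:id` in a `<listPerson>`. The fragment is trimmed and is not a complete, valid TEI file; the identifier `hsmith`, the sample sentences and the function name `mentions_by_person` are all made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented fragment: three different ways of referring to one person.
SAMPLE_NAMES = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p><persName ref="#hsmith">Harry</persName> arrived late.
       <persName ref="#hsmith">Mr Smith</persName> apologised, and
       <rs type="person" ref="#hsmith">the driver</rs> was thanked.</p>
  </body>
  <back>
    <listPerson>
      <person xml:id="hsmith">
        <persName><forename>Harry</forename> <surname>Smith</surname></persName>
      </person>
    </listPerson>
  </back>
</text>"""


def clean(el):
    """Join an element's text and collapse runs of whitespace."""
    return " ".join("".join(el.itertext()).split())


def mentions_by_person(xml_text):
    """Group every tagged mention under the person's full name."""
    root = ET.fromstring(xml_text)
    # Build a lookup from xml:id to the name recorded in the person entry.
    canonical = {}
    for person in root.iterfind(".//tei:listPerson/tei:person", NS):
        canonical[person.get(XML_ID)] = clean(person.find("tei:persName", NS))
    # Collect each element that points at a person with ref="#id".
    mentions = defaultdict(list)
    for el in root.iter():
        ref = el.get("ref")
        if ref and ref.startswith("#"):
            mentions[canonical.get(ref[1:], ref)].append(clean(el))
    return dict(mentions)


print(mentions_by_person(SAMPLE_NAMES))
```

The transcription keeps whatever the document calls the person, and the pointer does the work of saying they are all the same Harry Smith.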
I can do this, but how much will I do? As Professor Pierazzo stressed throughout the workshop, we need to work out what the goal of our transcription is and who the audience is for the transcription before we embark on a transcription project. With TEI a document can be encoded to the most minuscule detail, but encoding is time-consuming and therefore expensive. We should mark up only to the extent that is necessary to fulfil the goals of the project and no more.
I was transcribing a minute book, so I decided that I would only encode the first occurrence of a name on a page. The transcriptions that I am doing are for my personal research purposes so that is sufficient. To make it even quicker I realised today that I could continue to transcribe documents in Word but just add TEI-consistent tagging while I am transcribing, rather than my home-made tags. If I want to publish my transcriptions at a later date at least they will be part of the way there.
As Dr Beals has pointed out, the act of encoding the transcription makes us slow down and read the page carefully. Historians do a lot of fine-grain analysis of historic documents alert for the nearly hidden references that can signal a significant historical issue. I agree with Beals that close reading is still an essential skill for all historians. For a critical document, close reading is slow reading.
Dr Stokes taught us some of the issues behind digital imaging and ran us through some basic exercises in image manipulation, principally centring around exposing text that is difficult to read or hidden. Using Gimp this afternoon I managed to get a lot more clarity on a crucial but difficult field court-martial I am working on. Unfortunately the clarity did not make up for the poor handwriting and the relatively low resolution of the original. Sometimes there is no substitute for viewing the original paper document – I live in hope!
What distinguished this workshop was how both the presenters urged us to really examine the documents that we were working with, the physical nature of them, how they were constructed, their provenance etc. Throughout the workshop they urged us to consider the ethical issues behind the work that we were doing and interrogate our understanding of the nature of the text and the physical item that embodies it. We need to re-think the unquestioned knowledge about the nature of a manuscript which in the past we blithely drew upon. In an interesting way the new technology is causing us to rethink our understanding of the nature of objects that are centuries old.
These two factors are, to my mind, a distinguishing characteristic of Digital Humanities. This workshop was not just about learning technical skills. It was about applying a humanities mindset to the material we were working with.
Historic documents are not cold, hard facts. Nowhere is this more evident than when working with war documents. I constantly remind myself that these documents are like quicksand. The soldiers and nurses in the military had many reasons to obscure certain truths. They were writing to family at home so they didn’t want to make them any more anxious than they were. The soldiers and nurses were too often faced with horrific situations which they may not have wished to recall. The cloak of nationalism naturally pervaded the war effort, subtly and not so subtly shading the reporting of the combatants. At any time many of the combatants were suffering from psychological distress of one sort or another, thus affecting their writing. A significant issue is that we only have access to a tiny minority of the writing of the combatants. There is a significant private archive of soldier writing held in repositories of individual households under the care of family curators which is not so readily accessible to the researcher. And then there is all the writing we have lost over the years…
The database of diaries and letters I am using cannot in any way be regarded as representative. The diaries and letters I am working with are exceptional. I have to be very careful about the conclusions I draw from such an unrepresentative sample of writing and thinking during the war.
I have to remind myself of these issues every time I run my programs and receive a beautiful spreadsheet of results. The tidiness of the results, the ability to search, to filter, to order is so seductive. Before we know it we are so absorbed in trying to get more out of the data that we can easily lose the scepticism about evidence that leads historians to produce illuminating analysis.
Every time I run a program or extract data via an API, I explicitly voice these concerns to myself in order to keep myself in check. Digital Humanities work is not about partying with data. We need to constantly remind ourselves of the ethical issues connected with our work. If we think there are no ethical issues we need to look again. No historical research is free from ethical issues. Can anyone challenge this statement?
It is not possible to become a master of TEI or digital manipulation in one and a half days, nor is it possible to delve deeply into the nature of manuscripts and the complex ethical issues behind this form of research. Professor Pierazzo and Dr Stokes did an excellent job in raising these three aspects of Digital Humanities work. I feel that I am now equipped to introduce new skills in my work and more grounded in the significant theoretical and ethical issues surrounding this form of research.
The Introduction to Digital Manuscript Studies workshop is part of the training programme provided by DiXit. They have longer training programmes as well.
Other Blog Posts About #dh2015 Workshops
- Jussi-Pekka Hakkarainen, ‘What Did I Learn from the DH2015 Workshops? Recap, Days 1-2’, Fenno-Ugrica, 30/6/2015.