In my spare time over the last few weeks, I have been experimenting with the tools developed by Tim Sherratt to extract data from Australian digitised newspapers available through Trove. In a previous post I discussed how Sherratt’s tools can produce graphs showing the frequency with which particular words were used in Australian newspapers over time. In this post I will look at other methods of text analysis, explaining how I used Sherratt’s tools to extract a large number of articles from the Trove database and then used a text analysis tool to analyse them further. This post is about possibilities, not conclusions. It is a work in progress, so I am keen to hear your suggestions and experiences.
Creating the Corpus
The first step was to identify and collect the documents that I wanted to analyse. I used Sherratt’s “harvester” to collect all newspaper articles containing the phrase “secular education”. The search I constructed in Trove returned over 7,800 articles. It would have taken a ridiculously long time to click on each article, download it and save it, but Sherratt’s tool made the exercise feasible. After I had run it, the articles were all saved as text files in subdirectories according to the year they were published, with file names that included the name of the newspaper and the date and page where the article was found. Wonderful! I encourage you to have a go with the harvester. It requires a bit of work, but Tim Sherratt has given good instructions in his posts, Mining the Treasures of Trove (Part 1), and Part 2.
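A directory layout of this kind is easy to work with programmatically. The sketch below builds and then walks a tiny mock harvest; the directory and file names are invented for illustration and are not the harvester’s actual output.

```python
import os
import tempfile

# Build a tiny mock harvest: one sub-directory per year, article text
# files inside, with newspaper, date and page encoded in the file name.
# (These names are invented examples, not real harvester output.)
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "1870"))
sample = os.path.join(root, "1870", "south-australian-advertiser_1870-05-12_p3.txt")
with open(sample, "w") as f:
    f.write("A purely secular education ...")

# Walk the tree and gather article paths grouped by year
articles = {}
for year in sorted(os.listdir(root)):
    year_dir = os.path.join(root, year)
    articles[year] = [os.path.join(year_dir, name) for name in sorted(os.listdir(year_dir))]

print({year: len(paths) for year, paths in articles.items()})  # → {'1870': 1}
```

Grouping the files by year like this makes it straightforward to feed a single year’s articles to an analysis tool, as I do below.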
Text Analysis – the Basics
Now that I had this lovely collection of more than 7,800 articles I wanted to analyse them. I could read them each individually, and I probably will read a good proportion of them, but that will take a long time. Tim Sherratt has given examples of the type of analysis that could be produced with computerised tools. Now that I had my own collection (or corpus), I could explore the world of digital text analysis.
The United Kingdom’s Research Information Network has released a report which explains how information technology is used in the humanities. The chapter on ‘corpus linguistics’ (pp. 48-53) explains the type of work undertaken by researchers who specialise in this area, and identifies some of the basic methods that these researchers use to analyse written texts (p. 51).
The first method is word frequency: the researcher counts the number of times each word occurs in the documents. You might expect this to return a lot of uninteresting data; after all, who cares how many times words such as ‘the’, ‘and’ and ‘is’ occur? Words like these can be designated as “stop words”, and most text analysis tools allow users to exclude them from results. A commonly used stop word list is the one generated by the University of Glasgow.
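A frequency count of this kind can be sketched in a few lines. The stop-word list here is a tiny invented subset standing in for a real one such as the Glasgow list, and the sample sentence is illustrative only.

```python
import re
from collections import Counter

# A tiny illustrative stop-word subset, not a real stop-word list.
STOP_WORDS = {"the", "and", "is", "of", "a", "in", "to", "was"}

def word_frequencies(text):
    """Count word occurrences, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

sample = "The question of secular education was debated, and secular instruction was approved."
print(word_frequencies(sample).most_common(3))
```

Real tools do exactly this at scale, which is why excluding stop words matters: without the filter, ‘the’ and ‘was’ would dominate the list.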
The second method is collocation: the researcher selects a keyword and examines the words most often used in its vicinity. In my case the words “secular” and “education” are frequent collocates. It is also interesting to analyse the types of adjectives used with certain nouns in a given period. Are they pejorative, neutral or positive?
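A simple collocation count tallies the words that appear within a few words either side of the keyword. This is a minimal sketch; the sample sentence and window size are invented for illustration.

```python
import re
from collections import Counter

def collocates(text, keyword, window=3):
    """Count words appearing within `window` words either side of `keyword`."""
    words = re.findall(r"[a-z]+", text.lower())
    near = Counter()
    for i, w in enumerate(words):
        if w == keyword:
            lo, hi = max(0, i - window), i + window + 1
            near.update(words[lo:i] + words[i + 1:hi])
    return near

sample = "a purely secular education is a purely secular system"
print(collocates(sample, "secular"))
```

Run over a whole year of articles, a count like this would surface exactly the kind of pattern discussed above, such as “purely” clustering around “secular”.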
The third method is the concordance: a list of the contexts in which selected words appear, ie the sentence or paragraph where the keyword is located. These lists are called KWICs, or Key Word in Context lists. A review of concordances can reveal the circumstances in which a word was used, eg whether a word or phrase was used largely in the context of one particular debate or in a variety of contexts.
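A minimal KWIC list can be sketched like this, showing the keyword with a few words of context either side; the sample sentence and window size are illustrative only.

```python
import re

def kwic(text, keyword, window=4):
    """Return Key Word in Context lines: keyword bracketed, with
    up to `window` words of context on each side."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        # Strip punctuation before comparing, so "secular," matches "secular"
        if re.sub(r"\W", "", w).lower() == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{w}] {right}")
    return lines

sample = "The bill provides for a purely secular education, free from sectarian influence."
for line in kwic(sample, "secular"):
    print(line)
```

Tools such as Voyeur produce these lists automatically, but the principle is the same: every occurrence of the keyword is lined up with its immediate context.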
The last two techniques interest me because these are additional tools that I can use to explore the meaning of words that I am interested in.
Using Voyeur Tools to Analyse my Corpus
There are many free tools to help people wishing to do text analysis. I chose to use Voyeur Tools as it enabled me to upload a zip file, rather than being restricted to uploading files individually.
As I am experimenting rather than producing definitive analysis, I decided to analyse just one year of articles. I focussed on the articles from 1870 containing the phrase “secular education”.
Voyeur Tools produces many different forms of analysis. I will leave you to explore it all, but I will highlight some of the analysis that caught my eye.
There is some work to be done when Voyeur returns its results. First, the top left panel shows the frequency of every word in the collection that was uploaded, including stop words. I wanted to focus on the word secular, so I typed “secular” in the search area at the bottom of this panel, which narrowed the results to those words that included the string “secular”. As you can see, the OCR was not 100% accurate. This is something to keep in mind with text analysis: it is limited by the quality of the OCR.
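One way to hunt for OCR variants of a word in a vocabulary list is approximate string matching. This sketch uses Python’s standard difflib; the variant spellings are invented examples of the kind of noise OCR produces, not words from my actual results.

```python
import difflib

# An invented vocabulary list with two mock OCR variants of "secular".
vocab = ["secular", "sccular", "secnlar", "education", "school"]

# get_close_matches scores candidates by similarity ratio;
# cutoff=0.8 keeps near-misses and discards unrelated words.
matches = difflib.get_close_matches("secular", vocab, n=5, cutoff=0.8)
print(matches)
```

A check like this will not recover every OCR error, but it can help estimate how many occurrences of a keyword the exact search is missing.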
In order to generate more analysis I selected the word “secular” in this list.
The top right panel lists the articles in which the selected word occurs and the frequency with which it occurs in each article. The trend graph indicates where in the article the word occurs. The first article shows a flat line and then a spike about a third of the way through. This article is a transcription of a session of the Legislative Assembly, extending to a page and a half of the newspaper. Naturally the members of parliament were debating a number of different measures, and the debate about education occurred about a third of the way through the article.
As I found in my previous post, the discussion about secular education was particularly high in South Australian newspapers in 1870 so I selected an article from the South Australian Advertiser for further analysis.
The bottom right panel displays the Keywords in Context (KWIC) for selected items. It will not appear unless an item or number of items is selected in the top right panel. This panel is useful for quickly highlighting occurrences of selected words in lengthy articles. A quick review of the KWICs for this article indicates some interesting words used in conjunction with the word “secular”. The word “purely” occurs several times as well as the words “paying”, “pay” and “payment”.
Throughout this process I have had to deal with recalcitrant servers which upset these tools. These problems were out of the control of the developers of the tools. It required persistence to get there, and I am grateful for suggestions on how to deal with this from Tim Sherratt regarding his Trove Harvester, and from Stéfan Sinclair and Geoffrey Rockwell with regard to Voyeur Tools. Due to these problems I used an old version of Voyeur Tools, as the newer version was being temperamental. Hopefully that will be sorted out soon.
I am just beginning my exploration of text analysis. Over the next few weeks I plan to test more text analysis tools and refine the process I described above so that I can easily deploy it when I need to.
Brian Sarnacki has noted that a blog is a means of sharing ideas and receiving feedback on them. The work I have presented in this post is by no means complete. I am still reviewing text analysis tools such as the ones listed on the Digital Research Tools Wiki, finding out what is currently available, and I am eager to hear about your experiences.
- Bulger, Monica, Eric T. Meyer, Grace de la Flor, Melissa Terras, Sally Wyatt, Marina Jirotka, Katherine Eccles, Christine Madsen, Reinventing research? Information practices in the humanities, A Research Information Network Report, April 2011, available on the Research Information Network website, accessed 5/5/2011.