Analysis of #OzHA2013

A few weeks ago I attended the 2013 annual conference of the Australian Historical Association in Wollongong.  There were many tweets from the conference and a few blogposts were written.  Yet many sessions were not reported online.

In order to give a view of the whole conference, not just the sessions I attended, I have analysed the conference program and shared the results on my history blog, Stumbling Through the Past.  I also analysed the conference tweets in order to understand more about the live reporting of the conference using this medium.

This post explains the methodology underpinning my analysis.  I do this because I believe it is important that researchers hold themselves accountable for their work by allowing others to check what they have done.  I also believe it is important that researchers share their data where possible so that other researchers can use it to do their own analysis.

There is also a third reason for this post.  I hope it will help other people learn some simple but powerful techniques that they can use in their research.  This blog is aimed at people starting out in digital humanities so the explanations will be basic.

My analysis of the proceedings of the 2013 annual conference of the Australian Historical Association in Wollongong relied on the following primary sources:

Other available sources which I have not used are:

From these sources I created the following spreadsheets which I used in my analysis of the conference:

These attachments are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Analysis of the Conference Program

A large proportion of the time spent on this type of analysis always seems to be in preparing the data.  While we have many tools at hand to do this, it can be tedious.

The first step was to copy the Conference Handbook into an Excel spreadsheet.  I was particularly interested in recording the details of each paper that was to be delivered.  The original format was designed to be easy to read in print, not for data analysis.

Have a look at the conference timetable starting at page 28.  You will see that the details of each paper spread over several rows and that the name of the presenter is in the same cell as the title of the paper.  For analysis, it is important that each paper takes one, and only one, row in the spreadsheet, and that each descriptive element of the paper is separated into its own column.  So I re-organised the data so that it was stored in the following fashion:

Session Type | Session Theme | Session Title | Name | Gender | Role | Paper Title
Concurrent session | Environmental History | Bushfires | Ann Smith | F | Speaker | Bushfires in late nineteenth century South Australia
Roundtable | None | History on the web | Peter Browning | M | Speaker | None

It took some hours to put the program into a format where each paper took just one row and the data was divided into uniform categories.  The Excel formulae LEFT, LEN, RIGHT and SEARCH helped to disaggregate pieces of data that were stored in a single cell, such as separating the presenter's name from the title of their paper.
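For anyone who prefers to work outside Excel, here is a minimal Python sketch of the same find-and-slice logic that SEARCH, LEFT and RIGHT perform.  It assumes, purely for illustration, that the presenter's name and the paper title share a cell separated by a colon; the handbook's actual layout may well differ.

# A sketch of the Excel SEARCH/LEFT/RIGHT logic in Python.
# The colon delimiter is an assumption made for illustration only.
def split_presenter_and_title(cell):
    pos = cell.find(':')                # like SEARCH: locate the delimiter
    if pos == -1:
        return cell.strip(), ''         # no delimiter found; leave the title blank
    name = cell[:pos].strip()           # like LEFT: everything before the delimiter
    title = cell[pos + 1:].strip()      # like RIGHT/LEN: everything after it
    return name, title

print(split_presenter_and_title('Ann Smith: Bushfires in late nineteenth century South Australia'))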

I manually assigned a gender to each presenter.  This is an area where an error could easily creep in.  I tried to be careful not to guess and was left with a list of names to check on the internet.

Once I had prepared the data, I used pivot tables in Excel to calculate statistics such as the number of papers delivered for each session theme.  I have found pivot tables incredibly useful over the last few weeks; my analysis of over one thousand reviews for the Australian Women Writers’ Challenge also relied on them.
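Purely as an aside for anyone who prefers Python to Excel, the same count can be produced with pandas.  This is not what I did – I stayed in Excel – and the file name and column headings below are assumptions based on the example table above.

import pandas as pd

# A rough pandas equivalent of the Excel pivot table.
# The file name and column headings are assumptions based on the example table above.
papers = pd.read_csv('OzHA2013_program.csv')

# Count the papers for each session theme, much like a pivot table with
# Session Theme as the rows and a count of Paper Title as the values.
papers_per_theme = papers.groupby('Session Theme')['Paper Title'].count()
print(papers_per_theme.sort_values(ascending=False))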

I could spend ages explaining pivot tables through words and screenshots, but I find that videos are the best way to learn this sort of thing.  You may find this video by Eugene O’Loughlin helpful; if not, there are plenty of other videos that explain it.  Make sure the video you watch demonstrates pivot tables in the same version of Excel that you have on your computer, as Microsoft changed the way pivot tables work between the 2007 and 2010 versions of Excel.

The last thing I did was to run my spreadsheet through Voyant Tools to get a word cloud and examine the text in more detail.  To do this I copied the data that I wanted to examine more closely (the titles of the papers) into another spreadsheet and saved it as a .csv file.
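That step could also be scripted.  The sketch below uses Python's csv module to pull just the paper titles out into their own .csv file; the file and column names are hypothetical, and I actually did this by copy-and-paste in Excel.

import csv

# Extract just the 'Paper Title' column into a new .csv file for Voyant Tools.
# The file and column names here are hypothetical.
with open('OzHA2013_program.csv', newline='', encoding='utf-8') as infile, \
        open('paper_titles.csv', 'w', newline='', encoding='utf-8') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(['Paper Title'])
    for row in reader:
        title = row['Paper Title']
        if title and title != 'None':
            writer.writerow([title])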

I have explained the use of Voyant Tools in an earlier post – Delving Deeper into Trove (note that Voyant Tools used to be called Voyeur Tools).  This time around I found it more stable, making it a pleasure to use.

One thing I learned this time around is that just because I can produce a pretty picture such as a word cloud or a graph, I should not necessarily use it.  Lesson no. 1: unless the pretty picture makes a meaningful contribution to the analysis, it is an unwarranted distraction.

I had wanted to analyse the titles of the papers delivered in sessions which did not have a theme assigned. However, the most frequently used words were ‘Australia’ (21 occurrences), ‘Australian’ (17) and ‘history’ (17). That is hardly surprising at an Australian history conference!  The more interesting words, which would have given a richer picture of the papers presented, each occurred fewer than ten times.  At such low counts I did not feel it was possible to reach any conclusions.  The problem probably lies in the very small amount of text I was working with: the corpus had only 1,131 words and 631 unique words.

Analysis of Conference Tweets

The first thing I did with the spreadsheet of conference tweets which Sharon Howard sent me was to delete the columns of data I was not interested in.  I saved the modified spreadsheet as a new file.  The golden rule is that the original data should be preserved: any modifications should be made to a copy of the data saved in a new file, so that if a mistake is made the original data has not been compromised.

As I was interested in comparing the reporting of the conference with the program, I wanted to restrict my analysis to the tweets sent during the conference, i.e. those dated 8th–12th July.  I deleted the tweets dating from 5th February to 7th July, as well as those sent after the main conference had concluded.  There were 746 tweets in the original spreadsheet; I now had 554 tweets to work with.
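I did this by deleting rows in Excel, but for anyone working in Python the same date filter can be applied with pandas.  The file name and the 'created_at' column below are assumptions about how a tweet archive might be laid out.

import pandas as pd

# Keep only the tweets sent while the conference was running (8-12 July 2013).
# The file name and the 'created_at' column are assumptions.
tweets = pd.read_csv('OzHA2013_tweets.csv', parse_dates=['created_at'])

during_conference = tweets[(tweets['created_at'] >= '2013-07-08') &
                           (tweets['created_at'] < '2013-07-13')]

print(len(tweets), 'tweets in total;', len(during_conference), 'sent during the conference')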

All the hashtags used in a tweet were stored together in a separate column. I wanted to allocate one cell per hashtag, so I had to employ LEFT, LEN, RIGHT and SEARCH again.  Those formulae are useful! That was not the only manipulation of the data I had to do, however: the REPLACE, ROUND and ISERR formulae were handy too.
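In Python the same de-aggregation takes a single call to str.split, assuming – and this is an assumption – that the hashtags sit in the cell separated by spaces:

# Split a cell of space-separated hashtags into one value per column.
# The example cell and the fixed width of five columns are assumptions.
cell = '#OzHA2013 #envhist #twitterstorians'
hashtags = cell.split()                            # ['#OzHA2013', '#envhist', '#twitterstorians']

columns = hashtags + [''] * (5 - len(hashtags))    # pad so each hashtag gets its own cell
print(columns)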

At this point I had lots of columns, and formulae within formulae, producing nice-looking results. The results might have been nice looking but they didn’t bear scrutiny: there was a serious logic flaw which I had to rectify.  Lesson no. 2: the fact that you have a result doesn’t necessarily mean it is correct. So it was back to work to fix the errors.  It is easy to overlook mistakes in spreadsheets, as the famous Reinhart and Rogoff error demonstrates.

I won’t bore you with the more mundane spreadsheet manipulation I did to produce a list of hashtags that I could examine. It still isn’t as tidy as I would like it to be.  Lesson no. 3: there will always be more that you can do – know when to stop!

The next issue I had to address was that the tweets themselves still contained hashtags and @mentions.  I wanted a word cloud that featured text, not handles and hashtags.  We know that every tweet had #OzHA2013 in it, so a word cloud proclaiming that this was the most popular word used in the conference tweets would not tell us much.

The problem with @mentions and hashtags is that they do not occur in predictable places in a tweet.  It may be common to put a hashtag at the end of a tweet, but well-practised tweeps can incorporate them into the text in a way that reads well. Excel cannot deal with this easily.

So I turned to programming.  Over the last week I have been working through the lessons in Python on The Programming Historian.  They are very good but after working through five lessons I felt the need to pause and consolidate what I had learned so far.  This was the ideal project.

I saved the tweets into a text file.  Using what I had learned on The Programming Historian, I wrote the following code to strip the tweets of their Twitter handles and hashtags:

# Striptxt.py
# Functions to strip Twitter @handles and #hashtags from a block of tweet text.

def Removehandles(tweet):
    # inside is 1 while we are skipping over a handle
    inside = 0
    text = ''

    for char in tweet:
        if char == '@':
            inside = 1                  # start of a handle
        elif inside == 1 and char in (' ', '\n'):
            inside = 0                  # a space or line break ends the handle
            text += char                # keep the separator so words stay apart
        elif inside == 1:
            continue                    # still inside the handle; skip the character
        else:
            text += char                # ordinary text; keep it

    return text


def Removehashtags(tweet):
    # inside is 1 while we are skipping over a hashtag
    inside = 0
    text = ''

    for char in tweet:
        if char == '#':
            inside = 1                  # start of a hashtag
        elif inside == 1 and char in (' ', '\n'):
            inside = 0                  # a space or line break ends the hashtag
            text += char                # keep the separator so words stay apart
        elif inside == 1:
            continue                    # still inside the hashtag; skip the character
        else:
            text += char                # ordinary text; keep it

    return text


The following code draws on the module above and removes the handles and hashtags from the tweets:

# OzHA2013_tweet_strip.py

import Striptxt

# Read the raw conference tweets from the text file.
f = open('OzHA2013_20130714_textonly.txt', 'r')
tweets = f.read()
f.close()

# Strip the @handles, convert to lower case, then strip the #hashtags.
text = Striptxt.Removehandles(tweets).lower()
tweetwordsonly = Striptxt.Removehashtags(text)

# Write the stripped text to a new file, ready to upload to Voyant Tools.
g = open('OzHA2013_Stripped_Tweets.txt', 'w')
g.write(tweetwordsonly)
g.close()

Now I had a text file containing just the core text of the tweets, ready to upload to Voyant Tools. All I needed to do was add the following to my stop word list in Voyant Tools:

  • RT
  • MT
  • http
  • t.co
  • amp

… and start analysing the tweets!  This time the result was better: there were 8,098 words in the corpus and 1,990 unique words.
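For anyone who would rather repeat the word count outside Voyant Tools, a rough sketch in Python is below.  It tallies the words in the stripped tweets file and skips the same stop words as above; the tokenisation is cruder than Voyant's, so the counts will not match exactly.

from collections import Counter

# Tally word frequencies in the stripped tweets, skipping the stop words listed above.
stop_words = {'rt', 'mt', 'http', 't.co', 'amp'}

with open('OzHA2013_Stripped_Tweets.txt', encoding='utf-8') as f:
    tokens = f.read().lower().split()

words = (t.strip('.,;:!?()"\'') for t in tokens)   # trim surrounding punctuation
counts = Counter(w for w in words if w and w not in stop_words)

print(counts.most_common(20))                      # the twenty most frequent words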

Conclusion

The combination of Excel spreadsheets, Voyant Tools and simple Python code, plus those essential tools patience and perseverance, helped to expose information in the conference program and Twitter stream which would not otherwise be evident – see the analysis I have posted on Stumbling Through the Past.

I am excited at the potential which I have unearthed through the simple programming I have learned on The Programming Historian.  I will now go back and finish the lessons as there are a few more techniques I want to learn.

At the beginning of this post I said that it was in part about accountability. Please critique what I have done and play around with the data to produce your own analysis.  For those who are new to these methods of analysing data, I hope this post has explained what you need to know and encouraged you to give it a go yourself.

This post is the last in a three-part series giving an overview of the 2013 conference of the Australian Historical Association.  The other posts can be read on my history blog, Stumbling Through the Past:
