A Closer Look at Newspaper Items via the Trove API

A simple search using the Trove API is quick and easy, but the results are often too broad to provide the information we need. Any insights are lost in the sludge of irrelevant results. In this post I will explain how to narrow the search so that you can close in on the data that is most helpful for your research. If you have not used the Trove API before and need to learn how to do a simple search, you should first read ‘An Introduction to the Trove API’.

Trove is a huge database of information contributed by over two thousand libraries in Australia as well as other organisations. It is an ever increasing data mine. Today the Trove website says that it holds 389,961,760 items. The Trove API gives access to items in several zones: book, picture, article, music, map, collection, newspaper, list. This series of posts focuses on the newspaper zone, which allows access to digitised newspapers.

This post is designed to be used as a reference when you need to find answers for the particular types of searches that you are most likely to be conducting. Keep in mind that there is a lot more that you can do in the newspaper zone via the Trove API, and that there is also a large amount of data to explore in the other zones. If the answer to your question is not here then you should consult the Trove API Technical Guide. The other essential document you will need to consult regularly is the URL Encoding Reference, so you can translate spaces, punctuation and non-ASCII characters into codes that will be recognised on the Web.

First Things First – Learning About Facets

Each item in Trove has a number of attributes which are called facets. A newspaper item, for example, has a date of publication, a title and a type (e.g. advertising, family notice).

A facet must be preceded by ‘l-’. Now it can be very hard to tell with the naked eye what the character at the beginning of the string is. It could be l, I or 1. In the font that I was using in Word it was clearly not a number but the two letters looked exactly the same. Is this character the lower case letter that falls between k and m in the alphabet, or is it the uppercase letter that falls between h and j?

This is a recurrent issue in computing but one that can be easily resolved. Simply copy the unknown character and paste it in an Excel spreadsheet. In a neighbouring cell query this letter using the =code function and a number corresponding to the ASCII code for the letter appears in the cell. Using this method we find that the character is the lower case letter that falls between k and m in the alphabet.
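The same check can be done outside Excel. Here is a minimal sketch in Python (my choice of language for illustration, not something this series otherwise uses), where the built-in ord function plays the role of Excel's CODE function:

```python
# The three look-alike characters: lower case L, upper case i, and the digit one.
candidates = ["l", "I", "1"]

# ord() returns the character's code point, which for these
# characters is the same as its ASCII code.
codes = {ch: ord(ch) for ch in candidates}

for ch, code in codes.items():
    print(repr(ch), code)
# 'l' 108
# 'I' 73
# '1' 49
```

Paste the unknown character between the quotes of a single ord("...") call and compare the result with the three codes above.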

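The other issue is whether the dash needs to be URL encoded. By consulting the URL Encoding Reference you will find that the hyphen is an ordinary ASCII character whose URL-encoded form is %2D. Encoding it is optional (as the comments below note, a plain l- also works), but I have used the encoded form throughout this post.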

So now you have the essential element that is used at the beginning of all calls that specify facets:

l%2D

Now you are ready to look up the facet that you want to include in your call. Remember to also place the ‘&’ character before this section of your API call.
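If you want to confirm for yourself that l%2D and l- are the same thing once decoded, a short sketch using Python's standard urllib.parse module will do it. The facet name category is taken from the examples later in this post. Whether the Trove API accepts both forms is something to test against the API itself.

```python
from urllib.parse import quote, unquote

# Decoding the prefix used in this post gives the plain "l-" form.
encoded_prefix = "l%2D"
decoded_prefix = unquote(encoded_prefix)
print(decoded_prefix)  # l-

# Going the other way, quote() leaves the hyphen alone because it is
# one of the characters that never needs encoding in a URL.
requoted = quote("l-category")
print(requoted)  # l-category
```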

Category Facets – articles and more

The category facet covers many types of written items in the newspapers:

  • Articles
  • Advertising
  • Detailed lists, results, guides
  • Family notices
  • Literature

I won’t explain how to use each facet, but I will give an explanation of how I constructed successful API calls for three of these facets. By showing you the methods I used to work out how to use them as well as the API calls themselves I hope you can work out how to use the other category facets.

Article

Computing is an arena where the pedant excels. Every space, dot and other blip is significant. You should assume that you are working in a case sensitive world unless advised otherwise. In this case make sure you have an initial capital for the word ‘Article’ or else you will get no results:

http://api.trove.nla.gov.au/result?key=&zone=newspaper&q=secular&l%2Dcategory=Article&s=0&n=100&include=articletext
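Typing these URLs by hand is where the rogue spaces and wrong cases creep in, so it can help to assemble them in code. The sketch below is only an illustration: build_trove_url is a hypothetical helper of my own, YOUR_API_KEY is a placeholder, and the base URL is simply the one used in this post. It builds the URL but does not send the request. Note that urlencode writes the prefix as a plain l- rather than l%2D, which the comments below report works as well.

```python
from urllib.parse import urlencode

# Base URL as used in the examples in this post.
BASE_URL = "http://api.trove.nla.gov.au/result"


def build_trove_url(api_key, query, facets=None, **extra):
    """Assemble a newspaper-zone API call (hypothetical helper).

    facets maps a facet name (without the l- prefix) to its value;
    extra holds any other parameters such as s, n or include.
    """
    params = {"key": api_key, "zone": "newspaper", "q": query}
    for name, value in (facets or {}).items():
        params["l-" + name] = value  # every facet gets the l- prefix
    params.update(extra)
    return BASE_URL + "?" + urlencode(params)


url = build_trove_url(
    "YOUR_API_KEY",  # placeholder, substitute your own key
    "secular",
    facets={"category": "Article"},  # note the initial capital
    s=0,
    n=100,
    include="articletext",
)
print(url)
```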

Detailed lists, results, guides

The following example is more complicated, so I will only include the compulsory search terms required for a simple search of the word ‘horse’ in this category:

http://api.trove.nla.gov.au/result?key=&zone=newspaper&q=horse&l%2Dcategory=Detailed+lists%2C+results%2C+guides

Hint: If you are getting an error message from your API call, or it is generally failing to work, do the same search in the Trove Advanced Search facility. Then copy and paste the URL that is returned into a text file or Word document and study how it is constructed. Using this method I found that I had mistakenly used %82 to replace the comma whereas the correct code was %2C. Have a look at these two characters in the URL Encoding Reference and you will see why I made that mistake!
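One way to avoid looking up the wrong code is to let a library do the encoding. As a small sketch, Python's standard quote_plus function turns spaces into + and commas into %2C, which reproduces the facet value used in the call above:

```python
from urllib.parse import quote_plus

# The category name exactly as it appears in Trove.
category = "Detailed lists, results, guides"

# quote_plus() replaces each space with "+" and each comma with "%2C".
encoded = quote_plus(category)
print(encoded)  # Detailed+lists%2C+results%2C+guides
```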

Family notices

There is way more to this than I expected:

http://api.trove.nla.gov.au/result?key=&zone=newspaper&q=Perkins&l%2Dcategory=Family+Notices

When I was testing this I received only one result using the API but 7676 results using the Advanced Search function. Clearly I had made an error but I could not work out the problem using the Trove API documentation so I studied the URL generated from the Advanced Search:

http://trove.nla.gov.au/newspaper/result?q=Perkins&exactPhrase=&anyWords=&notWords=&l-textSearchScope=*ignore*|*ignore*&fromdd=&frommm=&fromyyyy=&todd=&tomm=&toyyyy=&l-category=Family+Notices|category%3AFamily+Notices&l-word=*ignore*|*ignore*&sortby=

Hint: You might have an initial pang of anxiety at the sight of this loooong URL. Take a big breath and recall what your maths teacher taught you, “break down the problem into little steps and lay it out neatly”. So I split this URL into what I regard as active elements, the bits that actually generate our results, and the inactive bits. Laid out this way, with the ‘active’ elements (the q= and l-category= pieces) picked out, the long stream of gibberish starts to make sense:

http://trove.nla.gov.au/newspaper/result?q=Perkins

&exactPhrase=&anyWords=&notWords=&l-textSearchScope=*ignore*|*ignore*&fromdd=&frommm=&fromyyyy=&todd=&tomm=&toyyyy=

&l-category=Family+Notices|category%3AFamily+Notices

&l-word=*ignore*|*ignore*&sortby=

To my surprise this shows that to generate a list of Family Notices, two elements are required, not one, and that they are separated by a bar character, |. And note that we don’t seem to have to change the bar character into a URL code. However, if you are having problems getting things to work there is always someone online to help. Tim Sherratt pointed out that the second element was not needed and showed it working in his Trove Console, so I was able to update this section of the post.
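The same deconstruction can be sketched in code. The example below uses Python's standard urllib.parse module to split the Advanced Search URL quoted above into its parameters. Treating empty values and *ignore* values as ‘inactive’ is my own rule of thumb for this URL, not something taken from the Trove documentation.

```python
from urllib.parse import urlsplit, parse_qsl

# The Advanced Search URL from this post, broken across lines for readability.
advanced_search_url = (
    "http://trove.nla.gov.au/newspaper/result?q=Perkins&exactPhrase=&anyWords="
    "&notWords=&l-textSearchScope=*ignore*|*ignore*&fromdd=&frommm=&fromyyyy="
    "&todd=&tomm=&toyyyy=&l-category=Family+Notices|category%3AFamily+Notices"
    "&l-word=*ignore*|*ignore*&sortby="
)

# Everything after the "?" is the query string.
query_string = urlsplit(advanced_search_url).query

# parse_qsl() decodes each name=value pair and, by default,
# drops the parameters whose value is empty.
pairs = parse_qsl(query_string)

# Assumption: values made up of *ignore* placeholders do nothing.
active = [(name, value) for name, value in pairs if "*ignore*" not in value]

for name, value in active:
    print(name, "=", value)
# q = Perkins
# l-category = Family Notices|category:Family Notices
```

Only the search term and the category facet survive, which matches the manual breakdown.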

Illustrated articles

You will notice that in the Advanced Search screen the check box for illustrated articles is in a different section to the check boxes for the other newspaper items such as those listed above. To restrict your search to illustrated articles you need to use an API call like this:

http://api.trove.nla.gov.au/result?key=&zone=newspaper&q=Perkins&l%2Dillustrated=Y

I worked out the coding for this API call by going straight to the URL provided by the equivalent search through the Advanced Search screen and once again found that the coding was quite different to that suggested by the Trove API Technical Guide. Tim Sherratt then lent a hand and proposed a more succinct way of expressing this API call.

Conclusion

You have to work for your rewards. This post demonstrates that the documentation provided by the National Library of Australia for the Trove API has been correctly named as a guide. It sends you in the right general direction but the best way of working out how to get the API to work is by deconstructing the Trove search URL for your query.

The manager of Trove, Tim Sherratt, has offered some assistance to people working out how to use the API. If your API call fails, enter it in his Trove Console and share the link for the results with him via @wragge on Twitter. He will endeavour to help you with the issue. This is a great offer from Tim, who is the expert in Trove.

This post was a lot more work and a lot longer than I expected. I had hoped to cover all the basic Advanced Search features that researchers would be most likely to want to incorporate into their API calls. Instead I will cover searching for articles within specific dates and sorting and relevance in my next post.

This is the third post in a series exploring the Trove API. The other posts are:

1. An Introduction to the Trove API

2. Using the Trove API with Excel Spreadsheets

I will be updating these posts in response to comments pointing out corrections needed or further explanations required. This post has been updated thanks to comments made by Tim Sherratt on Twitter and Travis M Sellers in the comments below.


4 thoughts on “A Closer Look at Newspaper Items via the Trove API”

  1. Hi Yvonne –

    Thanks again for your excellent post. If you don’t mind my taking up your valuable time, I thought I might share a few, hopefully accurate observations I’ve noticed during my period of insanity programming in VBA this weekend.

    Firstly, I’ve found that I don’t need to convert many of the spaces, commas etc to the ASCII equivalent in my API queries.

    For example, http://api.trove.nla.gov.au/result?key=&zone=newspaper&q=secular&l-category=Article&s=0&n=100&include=articletext

    returns the same number of records in Trove Advanced search as your example, viz.,

    http://api.trove.nla.gov.au/result?key=&zone=newspaper&q=secular&l%2Dcategory=Article&s=0&n=100&include=articletext

    Secondly, I can enter “l-category=Detailed lists, results, guides” (no quotes) as opposed to your example “l%2Dcategory= Detailed+lists%2C+results%2C+guides” (no quotes). [I think the space after = in your example might be an error ??]

    But as always, there’s probably a valid reason to use the ASCII codes that I have overlooked in your blog, but it works fine for me when I compare the results returned with Trove Advanced search. That said, I am not worried if there was a slight discrepancy of a few dozen records, give or take when many searches return over 10,000 results AND as you pointed out, a search today will return less records than a search next month due to text corrections etc.

    I guess if I am on the right track, then the less ASCII conversion required the easier it is to understand the elements of the API query and less chance of an error.

    Finally, can I just confirm with the Family Notices category that the two elements ARE NOT required? I haven’t noticed any difference with using the one element and comparing with Advanced Search.

    • Yes, I noticed that some unencoded characters seem to work in a Trove API call. The Trove API – Technical Guide says that URL encoding “may” be required. w3schools states “the URL has to be converted into a valid ASCII format”. There is some discussion about this at Stack Overflow.

      From all this I conclude that it is safer to use URL encoding. While I’m constructing a call I split it into components on different lines or in different colours so that I can read it clearly, then I stitch it together. One downside of doing this is that sometimes rogue spaces creep in as you noticed (I’ve corrected that in the post).

      Re family notices: you are right, the two elements are not required. Tim Sherratt also pointed this out so I have updated the post to remove the second element.

  2. It might be worth adding that you can use the proximity syntax “~[number]” in the API query as well. As you know, if you are searching for John Smith, using “John Smith”~2 will return instances of “John William Smith”. I find it very useful for cemeteries that often had the word “general” included. So “Brighton Cemetery”~2 will return “Brighton General Cemetery”.

    http://api.trove.nla.gov.au/result?key=[key removed]&zone=newspaper&q="Brighton cemetery"~2 [returns 16,839 records]

    http://api.trove.nla.gov.au/result?key=[key removed]&zone=newspaper&q="Brighton cemetery" [returns 14,612 records]

    Again, I don’t need to convert the ~ symbol to the ASCII equivalent, which is great.
