A Street Near You – a case study in linking disparate datasets

Note – this is a rather hurried post to accompany my talk for the Science Museum public webinar on Wikidata and cultural heritage collections, part of the Heritage Connector project. It will be tidied up and links and images added over the coming days, but the core details are here!
Presentation slides

At its heart, A Street Near You was conceived as a way of demonstrating what I saw as the significant potential of linking First World War datasets, enriching them, and presenting them in an engaging way. This blog post expands on my original post, which covered how the site was built and the impact it had, to demonstrate some of the techniques used and the challenges faced on the road to achieving this.

As I write this, some stats about the current site:

  • 1.1 million core records of Commonwealth service men and women who died in the First World War
  • 733,000 location records, plus every person recorded at their place of burial/commemoration
  • 31,000 portraits and 14,000 other images (e.g. gravestones or memorials)
  • At least 4.6 million links to 3.2 million distinct pages across countless external websites
  • Since 9 Nov 2018 the site has seen 758,000 distinct users, and it is currently adding to this total by approximately 5,000 to 8,000 per week

How the data was connected

Although it feels at times like all the datasets are held together with rubber bands and sticky tape, it is really a story of how just enough aspects of the data, often fortuitous rather than intended, allowed them to be accessed, processed and connected. It’s remarkable to realise how decisions made about how information was collected and stored nearly 100 years ago made it possible to georeference that information using a truly 21st century cloud-based API; how hard graft and logic applied during the creation of the Lives of the First World War crowdsourcing project helped forge connections between over 600,000 records and ensure that so many placemarks have further details for users to explore; how the choice of curators to supplement photographic portraits with text that was simply cut and pasted from another source helped ensure the site has over 8,000 portraits displayed; how teams of volunteers have enriched those basic records to add images, facts and stories to take users beyond the map and into the real people behind the names; how other volunteers have worked (and still continue to work) tirelessly on countless regional projects. Of course, none of that is to forget the wider role of the organisations who have gathered and preserved the content, and then shared it and allowed others to use it in ways like this. Nor is it to forget the vast quantity of information that remains in disconnected silos for want of someone having added the right bit of information in the right format at the right moment, or because someone decided that they would not permit someone else to re-use their data.

What has this achieved?

By connecting all these datasets you can quickly enrich individual records with further images, information and links. This helps the end user, whether a local amateur historian or an academic, by providing a single point of access to more complete and rich datasets, but also the onward links for further research.

You can also start to analyse and present the data in different ways, for example gathering related records by cemetery, by regiment, by date of death and most recently by parish (currently England, Scotland and Wales).

You can also provide data back to other projects to help them enrich their own records.

Core Datasets

At its heart, A Street Near You draws data and images from three key sources: the Commonwealth War Graves Commission casualty records, the Lives of the First World War centenary project, and the Imperial War Museums collections.

Commonwealth War Graves Commission (CWGC)

Over 1 million official records of those who died whilst serving Commonwealth countries in the First World War. Each record typically has the first names, surname, rank, regiment and date of death. Many records also have service numbers and age at death. They all give the cemetery where the person is buried, but in terms of location, what I was interested in exploring, and what forms the key component of the map data, is the field that they simply call ‘additional information’. This information was collected from families. About 600,000 records have some text in this field, and most have family information; a typical example is “Son of John Joseph and Jenny Lapidge, of 104, Felix Rd., West Ealing, London”.

Two main issues arose:

  1. Access to data. The CWGC do not have a search API, but any results set from a manual web search can be downloaded as CSV. Originally this was limited to, I seem to recall, 5,000 records per download. Not very helpful if you are trying to access 600,000 records. Another significant issue with the previous downloads was that they did not contain the record ID, so earlier versions of maps I created could not have links to the source record! Thankfully, some time in the latter part of 2018 the limit was increased to 50,000 and the CSV does now include IDs. They do update small numbers of records on a fairly regular basis, so I have now implemented semi-automated incremental updates.
  2. Terms & Conditions. Reading the Terms & Conditions, it wasn’t entirely clear how their data could be used, and there were other rather contradictory statements on the CWGC site. But overall I made a judgement call and felt that my use of their data certainly fell within the spirit of their terms. They have since confirmed that they are happy, and have also promoted the site themselves.
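
The semi-automated incremental updates mentioned in point 1 could work roughly as follows. This is only a minimal sketch, not the code behind the site: the column names (`id`, `surname`, `additional_info`) and the inline sample rows are hypothetical, and a real CWGC download has many more fields.

```python
import csv
import io

def load_records(csv_text, id_field="id"):
    """Index the rows of a CSV export by record ID (the field name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_field]: row for row in reader}

def diff_downloads(previous, latest):
    """Compare two downloads and report which record IDs are new, changed or gone."""
    added = sorted(set(latest) - set(previous))
    removed = sorted(set(previous) - set(latest))
    changed = sorted(
        rec_id for rec_id in set(previous) & set(latest)
        if previous[rec_id] != latest[rec_id]
    )
    return {"added": added, "changed": changed, "removed": removed}

# Hypothetical sample data standing in for two successive downloads.
OLD_CSV = """id,surname,additional_info
1001,SMITH,Son of John Smith
1002,JONES,
"""
NEW_CSV = """id,surname,additional_info
1001,SMITH,"Son of John Smith, of Leeds"
1002,JONES,
1003,BROWN,Husband of Mary Brown
"""

changes = diff_downloads(load_records(OLD_CSV), load_records(NEW_CSV))
print(changes)
```

With a comparison like this, only the added and changed IDs would need re-processing, rather than the whole dataset.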

A key lesson – if they had not provided the improved access to their data through larger downloads, and if there had been any further doubts in my mind about their Terms & Conditions, the A Street Near You site may never have happened. As simple as that. Equally if they hadn’t provided their IDs I wouldn’t have been able to add a specific attribution link and I feel they would have lost out on a lot of traffic (I know that the site has sent tens of thousands of visitors to the Lives of the First World War project, and recently CWGC confirmed that “astreetnearyou.org accounts for 5.21% of our referrals and is the highest non-facebook or twitter referrer”).

Lives of the First World War

I was lucky enough whilst working at the Imperial War Museum to get quite familiar with this dataset. The database is hugely complex and absolutely huge – 7.8 million ‘Life Stories’ (all those who served, not just those who died), over 10 million sources, countless facts, and tens of thousands of images. The database is made up of the original ‘seed’ data (a primary name record from sources such as Medal Index Cards and CWGC) but then supplemented by volunteer and crowdsourced details taken from a range of sources including census records, military records, and users’ own documents, images and memories. Whilst the site is no longer gathering new submissions, it is a credit to those who conceived the site, I guess in around 2013, that they put work to create a permanent archive into the plans, the contract, and the budget!  IWM have now undertaken this work and as well as the public site the data has been released under a Creative Commons license. The major obstacle that remains is access, with data only available via flat and horribly complicated csv downloads. But it is access!

To unlock this data it needs to be connected to CWGC record identifiers. The original site had just under 600,000 already linked, but using a rules-based approach, matching combinations of surname, regiment, service number and date of death (tweaked extensively depending on the target subset of data, e.g. a regiment), it has been possible to automatically and reliably match approximately 200,000 more. Beyond this, it’s a manual task to mop up the rest, but a tool has been written for a couple of dedicated ex-Lives of the First World War volunteers to add these, and about 35,000 have been added so far.
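
A rules-based match of this kind might look something like the sketch below. It is an illustration under stated assumptions rather than the actual matching code: the field names, the normalisation, the order of the rules and the sample records are all hypothetical. Each rule is a combination of fields that must agree, and a match is accepted only when a rule points to exactly one CWGC record.

```python
import re
from collections import defaultdict

def norm(value):
    """Lower-case a value and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (value or "").lower())

# Each rule is a combination of fields that must all agree, strictest first.
# The choice and order of rules here is an assumption for illustration.
RULES = [
    ("surname", "regiment", "service_number", "date_of_death"),
    ("surname", "service_number", "date_of_death"),
    ("surname", "regiment", "service_number"),
]

def rule_key(record, fields):
    """Build a comparison key, or None if any field in the rule is blank."""
    parts = tuple(norm(record.get(f)) for f in fields)
    return parts if all(parts) else None

def match_records(lives_records, cwgc_records, rules=RULES):
    """Return {lives_id: cwgc_id} for records matched unambiguously by a rule."""
    matches = {}
    for fields in rules:
        index = defaultdict(list)
        for rec in cwgc_records:
            key = rule_key(rec, fields)
            if key:
                index[key].append(rec["id"])
        for rec in lives_records:
            if rec["id"] in matches:
                continue  # already matched by a stricter rule
            key = rule_key(rec, fields)
            candidates = index.get(key, []) if key else []
            if len(candidates) == 1:  # accept only a single, unambiguous hit
                matches[rec["id"]] = candidates[0]
    return matches

# Hypothetical records, not taken from either dataset.
cwgc = [
    {"id": "C1", "surname": "Smith", "regiment": "Middlesex Regiment",
     "service_number": "G/1234", "date_of_death": "1916-07-01"},
    {"id": "C2", "surname": "Jones", "regiment": "Royal Fusiliers",
     "service_number": "5678", "date_of_death": "1917-04-09"},
]
lives = [
    {"id": "L1", "surname": "SMITH", "regiment": "Middlesex Regt.",
     "service_number": "G1234", "date_of_death": "1916-07-01"},
    {"id": "L2", "surname": "Jones", "regiment": "Royal Fusiliers",
     "service_number": "9999", "date_of_death": "1917-04-09"},
]
result = match_records(lives, cwgc)
print(result)
```

In this sample the first record fails the strictest rule because the regiment is written differently, but is picked up by the second rule. The other record is left for manual review because its service number disagrees.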

The reason this is so critical is that it then opens up the opportunity to link in data held for that person in Lives, which could include full names instead of initials, variations in spelling, images (22,000 new portraits have been added this way), and links to stories and communities (326,000 links have been added to 162,000 distinct records).

Imperial War Museums (IWM) collections

The principal collection from the IWM that is used in A Street Near You is the Bond of Sacrifice Collection. This consists of over 16,000 portraits of men who served. Whilst they are not exclusively of those who died, a large proportion are, as the portraits were contributed by families.

I love this example of the most tenuous of connections: 6,000 of these images could be connected simply because at some point in the past (a practice now ended) a curator went through them, looked each person up on the CWGC site, and copied and pasted the next of kin information. Not the link, not the unique identifier, but a blob of text that was thankfully unique enough, and almost always prefaced with the exact phrase “CWGC family information: ”, to connect records. Indeed, in the few cases where it was not unique and the text was found on more than one CWGC record, most of the time this was because they were brothers, a relationship that had otherwise gone uncaptured. A further set of images that has been integrated since launch is the Women’s War Work Collection, but for reasons described below this was much more difficult to connect up to names and then locations.
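
As an illustration of how such a text-based link could be made, here is a minimal sketch. It assumes that the pasted text runs to the end of the caption and that each CWGC record holds its text in a field called `additional_info`. The record shapes, IDs and the second family text are hypothetical, and this is not the code used on the site.

```python
import re
from collections import defaultdict

PREFIX = "CWGC family information:"

def norm_text(text):
    """Collapse whitespace, trim stray full stops and lower-case the text."""
    return re.sub(r"\s+", " ", text).strip(" .").lower()

def extract_family_text(caption):
    """Return the normalised text following the prefix, or None if it is absent."""
    _, found, rest = caption.partition(PREFIX)
    return norm_text(rest) if found and rest.strip() else None

def link_portraits(portraits, cwgc_records):
    """Map each portrait ID to the CWGC IDs that share the same family text."""
    index = defaultdict(list)
    for rec in cwgc_records:
        info = norm_text(rec.get("additional_info", ""))
        if info:
            index[info].append(rec["id"])
    links = {}
    for portrait in portraits:
        text = extract_family_text(portrait["caption"])
        if text and text in index:
            links[portrait["id"]] = index[text]
    return links

# Hypothetical IDs and captions; C2 and C3 stand in for two brothers.
cwgc = [
    {"id": "C1", "additional_info": "Son of John Joseph and Jenny Lapidge, "
                                    "of 104, Felix Rd., West Ealing, London."},
    {"id": "C2", "additional_info": "Son of Mr. and Mrs. A. Example, of 1, High St., Anytown."},
    {"id": "C3", "additional_info": "Son of Mr. and Mrs. A. Example, of 1, High St., Anytown."},
]
portraits = [
    {"id": "P1", "caption": "Portrait of a soldier. CWGC family information: Son of John "
                            "Joseph and Jenny Lapidge, of 104, Felix Rd., West Ealing, London"},
    {"id": "P2", "caption": "Studio portrait, no further details."},
    {"id": "P3", "caption": "CWGC family information:  Son of Mr. and Mrs. A. Example, "
                            "of 1, High St., Anytown."},
]
links = link_portraits(portraits, cwgc)
print(links)
```

A portrait that maps to a single CWGC ID can be linked directly. One that maps to several, like the brothers case, would need a second check such as the name.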

Much work has been done to try to manually connect more and the total now stands at about 6,800.

The images are clearly licensed under IWM’s bespoke non-commercial license (based on the Open Government Licence and effectively equivalent to CC BY-NC).

Further datasets

A key motivation in creating the site was to show the potential of linking datasets. For the First World War there are thousands of them, at a global, national and regional level, including countless local projects, many of which were funded, in the UK at least, by the Heritage Lottery Fund (now National Lottery Heritage Fund). These come in all sorts of shapes and sizes, and the key questions for any such dataset are:

  • Is the core data easily accessible?
  • Are there restrictions that would prevent use of the data (e.g. licensing)?
  • How do we connect their records to the matching individuals, accurately but automatically (or at least minimising the human judgement and effort involved)?

As you can imagine, the answers to these vary considerably, and in practice, as this is a personal project, it has been a question of tackling those that will yield the most reward for the minimum effort.

Surrey in the Great War

This was a council-led, HLF-funded First World War centenary project. It continues to be actively built on to this day, which is rarer than you might think for such projects! I believe that the underlying system was built by OrangeLeaf Systems.

With no API or data downloads, access to the data involved scraping web pages. Crude, but effective. It worked thanks to the presence of links to a CWGC record (and hence identifier) and/or a link to the corresponding Lives of the First World War page. If only the latter existed, but the Lives page had been linked to the CWGC record (see above!), then you could extrapolate the direct link. This was so successful that 29,000 links to individual records have been added.
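
The link-following logic described above can be sketched as below. This is a hedged illustration only: the URL patterns for CWGC and Lives pages are assumptions that would need checking against the live sites, the IDs and the `lives_to_cwgc` lookup are made up, and the sketch parses HTML that has already been fetched rather than doing any scraping itself.

```python
import re
from html.parser import HTMLParser

# Assumed URL shapes, for illustration only; check them against the live sites.
CWGC_LINK = re.compile(r"cwgc\.org/\S*?casualty-details/(\d+)")
LIVES_LINK = re.compile(r"livesofthefirstworldwar\.\S*?/lifestory/(\d+)")

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def find_cwgc_id(page_html, lives_to_cwgc):
    """Return a CWGC ID from a direct link, else via a linked Lives page."""
    collector = LinkCollector()
    collector.feed(page_html)
    # Prefer a direct link to the CWGC record.
    for href in collector.hrefs:
        direct = CWGC_LINK.search(href)
        if direct:
            return direct.group(1)
    # Otherwise fall back to a Lives link that is already tied to a CWGC record.
    for href in collector.hrefs:
        lives = LIVES_LINK.search(href)
        if lives and lives.group(1) in lives_to_cwgc:
            return lives_to_cwgc[lives.group(1)]
    return None

# Hypothetical lookup built from previously matched Lives records.
LIVES_TO_CWGC = {"7000001": "900002"}

page_a = ('<p><a href="https://www.cwgc.org/find-records/find-war-dead/'
          'casualty-details/900001/">CWGC record</a></p>')
page_b = ('<p><a href="https://livesofthefirstworldwar.iwm.org.uk/'
          'lifestory/7000001">Life Story</a></p>')

id_a = find_cwgc_id(page_a, LIVES_TO_CWGC)
id_b = find_cwgc_id(page_b, LIVES_TO_CWGC)
print(id_a, id_b)
```

Here the first page yields its ID from the direct link, while the second is resolved indirectly through the Lives page.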

Other similar projects include:

  • Royal Welsh Fusiliers Museums – added 10,000 links and 2,100 portraits, provided them with links to portraits they didn’t have
  • British Jews in the First World War – 1,000 links
  • CWGC Archives – 1,100 links to documents
  • Isle of Wight Family History Society – 6,600 links
  • Dawlish in WW1 – 300 links

Online Cenotaph, Auckland Museum 

This is a good example of a major national project. Access was available through a good quality API and the data is openly licensed. Record matching was potentially problematic, but both their records and the CWGC records typically contained a service number, so that, coupled with surname and date of death (plus knowing that they were New Zealanders) made the process pretty simple and in total 17,000 links were added. By sharing back data they have now added links to both CWGC and A Street Near You to those matched records, something that they could never have done themselves and potentially of great benefit to their users.

A similar case is Virtual War Memorial Australia where 45,000 records have been matched. I’m still looking at Canadian and South African data.

Wikidata

This feature is currently turned off for technical reasons, but the aim is to include open data, including images and Wikipedia (and other) links, on records for individuals, regiments and cemeteries. There’s also a huge opportunity to provide Wikidata with data collected in the project.

User contributed links and embedded tweets

It was always clear in my mind that this was a data project and I would not be allowing users to add content in the same way that for example Lives of the First World War had. However, by popular demand I opened up the facility for users to add links to an individual’s records. One possible source of links was Twitter, so a tool was written to take this out of the hands of users and harvest tweets that had an identifiable person (by virtue of the fact that they contained a link to a CWGC, Lives or A Street Near You record). All links are moderated before they are displayed as ‘Additional information’, with the Tweets being embedded in full rather than just displayed as a link. 

Work in progress

To follow, literally!
