Behind the scenes on an award winning story

So this time last week, I was getting ready to catch a very early flight to Paris ahead of the Data Journalism Awards.

My story looking at where social services in Wales sent children for care placements and which councils from other parts of the UK were sending children to Wales won the award for data driven investigation, small media. So I set off for an exciting trip to Paris. It was an honour to be recognised alongside some awesome projects from around the world.

One of the things I had to do for the ceremony was put together a video about how the project was put together. Seeing the videos from the other winners and finding out more about the project was definitely the most interesting part of the ceremony. You can view all the videos here.

I’m currently working on a book on how to get started with data journalism, you can sign up for updates here, and this story is one of the projects featured in the book is the care placements for children – looking at ways of cleaning up data so you can use it to find stories.

Below is an extract from the Cleaning Data chapter of the book, which explains some of the techniques used to put to together the story.

The data for this story was gathered through FOI requests, which were replied to in a variety of formats.

When formatting data for use, you need to think about what question you were asking when you sought out the data and what information you were hoping to get back – this will help you lay out a table so that you can use it to look for answers.

In this example, the request asked for the information held by Welsh councils on care placements over the past three financial years.

To get a better understanding of the data, it made sense to split it into two tables, one for children being sent elsewhere by Welsh Councils and one for children being sent to areas in Wales.

To make the information understandable, you need to set out your column headings to organise it – placing council, area where placed and each of the financial years and then formatting the data provided in order to fit.

The master spreadsheet for care placements

In some cases, the data was already formatted in a way that would fit, with just the name of the council that provided the information needing adding.

In others, the data was provided as a table for each year, so these needed to be joined together so information for the same area could be read across all the year columns.

Mostly the data was provided as spreadsheets, making this an easy task to complete. Other forms such as Word documents and tables within the body of the email could also be easily copied into a spreadsheet and added to the table.

While putting together the data, it became clear that a summary table with totals for each council would need to be added, as not all the councils had provided a breakdown by area. Some had only provided a total, while some had provided a total plus information about where children were sent, but not individual totals for each area to avoid data protection issues.

Clean-Data-8

Once the data was all in one table, it makes it easier to start analysing it, for example, adding up the totals for all the councils to see an overall total for the whole of Wales.

Once the data had been gathered together in a master spreadsheet, it became clear that all of the councils had different ways of naming councils, and so the data would need to be tidied and standardised in order to get a clear picture of how many children are being sent to different areas.

Tidier data also makes it easier to get the data into a format that can be used to create visualisations – as you will see in a later chapter, you sometimes need specific formats in order to work with particular tools.

The data on where children had been placed by Welsh councils came from 18 different councils and as such contained a number of different variations for council names.

To start cleaning, the data you need to open up Open Refine and create a new project by uploading a CSV of the data you want to tidy up.

Once you have your data open, select Facet > Text Facet from the dropdown menu on the column you want to clean.

A box will appear on the left-hand side of the screen showing all the different data items in the column and a count of the number of times they appear.

Click on Cluster to bring up the panel to start matching up names.

Clean-Data-26

As you can see, the programme identifies data that is likely to be the same, either because they are the identical but use different capitalisation, because they look like shortened versions of a similar thing, or because they are a misspelling.

For example, NEATH PORT TALBOT, Neath Port Talbot, and Neath port talbot or Birmingham and Birgmingham.

For each identified possibility, there is a box to tick if you agree that they are a match. There is also a space to enter what you would like everything renamed to – you can change it from the default selection by entering your own text or by clicking on the link in the list for the version you prefer.

When you have been through the list picking out the ones you want to merge and what you want the merged group to be called, you need to click either Merge Selected and Re-cluster or Merge Selected and Close, depending on whether you think there are more similar data items to be found.

If the programme is not finding many similar items, you can change the way it makes comparisons by changing the method and keying function at the top of the panel – this essentially changes the number of similar words or phrases that will trigger a match.

The most effective way of working is to keep changing the criteria until either no more matches are found or none of the suggested similar items match.

In some cases, you may need to be quite liberal with your criteria. Here, to catch things such as Swansea County Council and Swansea City Council, you need to set a criterion that also tries to match Southampton and Northampton.

Clean-Data-24

Using the cluster method, it was possible to go from 289 different items to 209 in about five minutes, which is much quicker than trying to do it manually.

Once you have used the cluster function, you may then need to go through the list to catch any ones you may have missed. As it has been sorted alphabetically, you should be able to see similar data listed together. You can also sort by count, which may help identify items that only appear once or twice, possibly because they are a misspelling or an uncommon variation.

To edit, hover over the item you want to change and then click the edit button and type in what you want to change the title to. As you edit, the programme will automatically recalculate the list to reflect the new count of items.

Clean-Data-25

Once you have all the data tidied up, you can see more clearly which councils are seeing the most children sent to their area. You can also then use this to map the locations where children, in this case using Tableau.

Leave a Reply