Data Journalism for Beginners: What to do about too-awesome-to-be-true data?


Part of a series of posts looking at tips and ideas for getting started with data journalism

Why data journalism and getting started
Some maths and Excel basics
Finding your data
Cleaning up your data

Want more? Buy Getting Started with Data Journalism, a complete beginner's guide to finding, cleaning, analysing and visualising data in any size of newsroom.

I meant to get this series finished on Sunday, but life and the Bank Holiday slightly got in the way (in other news, my garden is looking a bit better).

But first a short detour as I forgot to add in a post on the perils of analysing data and some things you might want to avoid doing.

Understanding what the numbers might mean

When working with numbers, always be aware that they may not be telling you what you think they are telling you.

Some golden rules for dealing with data:

* Who put the figures together and how did they do it?

* Watch out for small numbers and rare events

* Consider how reliable your data is…and how accurate

* What are the long-term trends?

* Don’t cherry-pick your data

* Be careful of what the numbers mean – what information is actually being collected?

I occasionally get irritable and feel compelled to blog (or tweet) about terrible uses of data (usually in support of an agenda).

Generally, if you think the data is suggesting a brilliant story – one you just have to get out there because it's so great – stop.

Check you haven't made an arithmetic mistake, or knocked your spreadsheet out of line while cleaning up your data – so that you end up applying Middlesbrough's figures to Westminster, or calculating the percentage against the column next to the total rather than the total itself.
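As an illustration, here's a minimal pandas sketch of the kind of sanity checks that catch both mistakes. The figures and column names here are made up purely for the example:

```python
import pandas as pd

# Made-up figures, just to illustrate the two checks
df = pd.DataFrame({
    "area": ["Middlesbrough", "Westminster"],
    "cases": [50, 20],
    "neighbouring_column": [10, 5],
    "total": [200, 100],
})

# Check 1: confirm each area label still sits next to its own figures --
# sorting one column but not the rest is a classic way rows drift apart
assert len(df) == df["area"].nunique()

# Check 2: divide by the total column, not whichever column happens
# to sit beside it in the spreadsheet
df["pct_of_total"] = df["cases"] / df["total"] * 100
print(df[["area", "pct_of_total"]])
```

In Excel the equivalent habit is double-checking that the cell reference in your percentage formula really points at the total column before you fill it down.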

Next, common-sense check your data – is it actually telling you what you think it is?

It's useful to think about things like reliability. Could that outlier just be down to someone noting down the wrong thing? Don't forget that hospital episode data contains quite a few pregnant men, thanks to someone putting in the wrong diagnosis code.
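That kind of plausibility check is easy to automate. Here's a sketch with hypothetical column names and made-up records (not real hospital episode data):

```python
import pandas as pd

# Made-up records with hypothetical column names
records = pd.DataFrame({
    "patient_sex": ["F", "M", "F", "M"],
    "diagnosis":   ["pregnancy", "fracture", "asthma", "pregnancy"],
})

# Flag combinations that can't be right -- almost certainly a
# miscoded entry rather than a genuine medical first
impossible = records[(records["patient_sex"] == "M")
                     & (records["diagnosis"] == "pregnancy")]
print(len(impossible), "record(s) worth querying with the data owner")
```

Flagged rows aren't stories in themselves – they're questions to put to whoever compiled the data.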

The ONS is pretty good about flagging up issues with its data – such as with the recent zero-hours figures, which may have shown an increase in zero-hours contracts, or an increase in the number of people describing their contract as a zero-hours one after seeing the phrase in the media.

Confidence intervals are useful for evaluating whether something falls outside the expected range – and so is significant and worth a story – or whether it really isn't. This is the current problem with the Welsh hospital death rates row: without confidence intervals we, like Sir Bruce Keogh, are unable to form a view.
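One simple version of that check uses the normal approximation to put a 95% confidence interval around a rate. The numbers below are illustrative, not the actual Welsh figures:

```python
import math

def proportion_ci(events, n, z=1.96):
    """Approximate 95% confidence interval for a rate,
    using the normal approximation to the binomial."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# e.g. 30 deaths in 1,000 admissions: anything inside this band
# is consistent with the underlying rate; only values outside it
# start to look like a genuine outlier
low, high = proportion_ci(30, 1000)
print(f"95% CI: {low:.4f} to {high:.4f}")
```

For small counts the normal approximation gets rough, and a Wilson or exact interval is safer – but the principle is the same: no interval, no view.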

Check as much of the available data as possible – is the pattern you think you're seeing in a small timescale or subset really there across the whole dataset?

Labour recently attacked the coalition over the falling numbers of women getting smear tests since the coalition came to power. What they appear to have forgotten to do was check further back (probably because they were looking at snapshot before-and-after figures) – the problem is that 2009/10 looks like the outlier, with the numbers attending tests about the same under the coalition as under Labour.
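That mistake boils down to comparing a two-point snapshot with the full series. A sketch, with made-up coverage figures chosen only to show the shape of the problem:

```python
# Made-up yearly coverage figures (%), purely illustrative --
# one anomalously high year just before the "before/after" comparison
coverage = {2006: 78.0, 2007: 78.1, 2008: 77.9,
            2009: 80.5, 2010: 78.2, 2011: 78.0}
years = sorted(coverage)

snapshot_change = coverage[2010] - coverage[2009]           # before/after only
longterm_change = coverage[years[-1]] - coverage[years[0]]  # whole series

# The snapshot shows a sharp fall; the full series shows the
# "before" year was the outlier and the long-term trend is flat
print(f"snapshot: {snapshot_change:+.1f}, long term: {longterm_change:+.1f}")
```

Plotting the whole series before writing the headline would catch this in seconds.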

