clairemiller.net

May 6, 2014
by Claire
0 comments

Data Journalism for Beginners: What to do about too-awesome-to-be-true data?

Part of a series of posts looking at tips and ideas for getting started with data journalism

Why data journalism and getting started
Some maths and Excel basics
Finding your data
Cleaning up your data

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

I meant to get this series finished on Sunday, but life and the Bank Holiday slightly got in the way (in other news, my garden is looking a bit better).

But first, a short detour, as I forgot to include a post on the perils of analysing data and some things you might want to avoid doing.

Understanding what the numbers might mean

When working with numbers, always be aware that they may not be telling you what you think they are telling you.

Some golden rules for dealing with data:

* Who put the figures together and how did they do it?

* Watch out for small numbers and rare events

* Consider how reliable your data is…and how accurate

* What are the long-term trends?

* Don’t cherry-pick your data

* Be careful of what the numbers mean – what information is actually being collected?

I occasionally get irritable and feel compelled to blog (or tweet) about terrible uses of data (usually in support of an agenda).

Generally, if you think the data is suggesting a brilliant story, one you just have to get out there because it’s so great – stop.

Check you haven’t screwed up your maths or knocked your spreadsheet out of line while cleaning up your data, so that you’re applying Middlesbrough’s data to Westminster, or calculating the percentage of the column next to the total rather than the total itself.

Next, common sense check your data – is it actually telling you what you think it is?

It’s useful to think about things like reliability. Could that outlier just be down to someone noting down the wrong thing? Don’t forget that hospital episode data contains quite a few pregnant men, thanks to someone putting in the wrong diagnosis code.

The ONS is pretty good about flagging up issues with its data – such as with the recent zero hours figures, which may have shown an increase in zero hours contracts, or an increase in the number of people describing their contract as a zero hours one after seeing the phrase in the media.

Confidence intervals are useful for evaluating whether something is significant – falling outside the confidence intervals – and worth a story, or not. That is the current problem with the Welsh hospital death rates row: without confidence intervals we, like Sir Bruce Keogh, are unable to form a view.
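As an illustration of the idea (not the method used in the Keogh review), here is a minimal Python sketch that puts a rough 95% confidence interval around a rate using the normal approximation; all the figures are made up.

import math

deaths = 120        # hypothetical count of events
patients = 2000     # hypothetical population at risk

rate = deaths / patients
margin = 1.96 * math.sqrt(rate * (1 - rate) / patients)   # normal approximation
print(f"{rate:.3f} (95% CI {rate - margin:.3f} to {rate + margin:.3f})")

If another area’s rate falls inside that interval, the difference may just be noise rather than a story.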

Check as much of the available data as possible – is it really showing the pattern you think you’re seeing in a small timescale or a small set of data?

Labour recently attacked the coalition over the falling number of women getting smear tests since the coalition came to power. What they appear to have forgotten to do is check further back (probably unintentionally, as they were looking at snapshot before-and-after figures) – the problem is that 2009/10 looks like the outlier, with the numbers making it to tests about the same under the coalition as under Labour.

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

May 3, 2014
by Claire
0 comments

Data Journalism for Beginners: Cleaning up your data

Part of a series of posts looking at tips and ideas for getting started with data journalism

May 1 - why data journalism and getting started
May 2 - finding your data
May 3 - cleaning it up
May 4 - mapping and visualising

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

Cleaning Data

Data is messy

You will need to tidy up much of the data you collect even before you start analysing it and finding stories.


Some of the ways in which data can be messy:

• It comes from several different sources and needs to be combined into one file.
• It is not in an easily useable format (for example, your data is stored as PDFs, web tables or other formats).
• It is full of abbreviations and/or the same data is coded in several different ways.

If you cannot understand or read the data, it is very hard to get a story from it.

    Data may go through a number of edits from its original format before it is in a state where it can be analysed.

    In order to keep track of these edits and to avoid making changes that make the data inaccurate, it may be useful to keep a log of changes that have been made as you go along.

    Here, we’re mainly going to be looking at the perennial problem for journalists, data sent as PDFs.

    Getting data out of PDFs

    Quite often official data will be supplied as a PDF, particularly if the data is contained in a report.

    As it can be difficult to work with data held in a PDF, you will often need to get the data out and into a more usable format, such as a spreadsheet.

    The following steps are just some of the ways you can liberate that data and make it easier to work with.

    Step 1

Ask whether it is possible for the people responsible for the data to send you a more useable version of the information, such as a spreadsheet.

    They may be able to send you the original spreadsheet or text document the information came in.

    In the UK, you can add a request for a particular file format to an FOI request, which may help encourage officers to send it in that format.

Changes to the Freedom of Information Act mean organisations will be required to provide any dataset in a reusable format, a change that should give more power to requesters seeking easier-to-use formats.

    Step 2

    Basic copy and paste sometimes works, allowing you to copy the data as a formatted table into a Microsoft Word or Excel document.

In this case, the data from an FOI response from the Welsh Ambulance Service on ambulance response times almost copies and pastes into Word with no problems. There are a few rows that do not have data for all the months, meaning the numbers do not necessarily line up with the right months. These rows would need comparing with the original document so they can be corrected.

    Once finished, the cleaned data could then be transferred to a spreadsheet.

    Step 3

    Copy and paste and then clean up the data.

For small amounts of data, or not very complicated data such as columns of numbers, you can copy and paste it into a Microsoft Word or Excel document. You can then tidy it up, either by converting the data to tables using delimiters (often a space) or by dragging columns back to the right place.

Health Behaviour in Young People

The data is contained in a massive report from the World Health Organisation looking at health behaviours in children across the West. The report is too big to convert from PDF to XLS (see below), so another method was needed to get the data into tables.

    When copied and pasted into Word, the table data appears as one column with a list of countries followed by the figures for boys and girls as one list, alternating between the two genders.

    The first step is to convert the copied data into a table, using the Table > Convert text to table option. This should turn it into a table with one column and many rows.

    Copy and paste this into a spreadsheet. From the chart in the report you can see that there is a percentage for girls and a percentage for boys – turn those labels into column headers.

    All the numbers for each country are listed one after the other. To get two columns, one for girls and one for boys, you need to paste the numbers twice, with the second column dropped one row down so the girls percentage is next to the boys percentage for the same country.

    You then need to delete every other row, so you are left with the data for each country, with the girls percentage followed by the boys’ percentage. To delete rows, highlight the row by clicking on the number next to the row and then right-click to bring up a menu with the option to delete row. You can use Ctrl and click to highlight several rows at the same time.

    Next, move the list of countries up to next to the corresponding percentages. You may need to make sure that none of the country names have spilled over into adjoining cells when you created the table – if they have, just tidy them up so they are in one cell.

    Your table is now ready to use.
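If you are comfortable with a little code, the same shuffle can be done in one step. Below is a minimal Python sketch, with made-up figures, assuming the copied column holds the countries followed by the percentages alternating girls then boys:

countries = ["Austria", "Belgium", "Canada"]   # hypothetical country list
values = [81, 74, 79, 70, 77, 69]              # hypothetical alternating girls/boys figures

girls = values[0::2]   # every other value, starting with the first
boys = values[1::2]    # every other value, starting with the second
for country, girl, boy in zip(countries, girls, boys):
    print(country, girl, boy)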

    Step 4

    Use a converter to turn the PDF into a useable format.

There are many free online PDF converters, of varying effectiveness. The best tip is to upload your file to several and use the best conversion you get back.

Online converters often cannot cope with big files or PDFs where the table data is combined with large quantities of text. With these, it may be better to convert to a Word document and extract the information that way.

To give an idea of how they work, here is an example of what they generally return: the same simple PDF file was uploaded to a number of different converters to see how each would cope.

    The original PDF of ambulance data can be downloaded from here

    Some of the available PDF to XLS converters:

    * PDFtoexcel.org
    * zamzar.com
    * cometdocs.com
    * PDFtoexcelonline.com
    * free-PDFtoexcel.com

To upload your PDF file, just follow the instructions on screen. The first four require an email address for the file to be sent to, while the last is entirely online based.

    Zamzar.com, PDFtoexcel.org and cometdocs.com all came back within minutes. On one occasion with this file, Free-PDFtoexcel.com also finished converting very quickly but on another, there was an error and it failed to upload.

Zamzar.com and, when it works, Free-PDFtoexcel.com put each page of the document in a different spreadsheet table. The file from Zamzar.com threw up some warning messages when it was opening but seems to be ok.

PDFtoexcel.org and cometdocs.com put all of the PDF information in the same spreadsheet, with blank rows to indicate where the page breaks were.

    Which type of conversion works better will depend on the PDF you are converting and what you are planning to do with the results.

    Whichever conversion site you use, the data is likely to be messy when you get it back. The converters tend to add blank rows when there is a page break in the PDF.

In addition, depending on the size of the space between columns, they may not read two columns as separate and may leave them combined as one. If this is the case, it may just mean piecing tables back together and deleting blank spaces.

You may also need to take care with rows where cells have been shifted out of line with their original column. For example, where there was a blank cell in the original PDF, the converter may have read it either as not being there or as more than one blank space. In these cases, you may have to refer back to the original copy to make sure everything is lined up properly.

    Step 5

    What if you are sent a PDF where the information is a scan of a document?

This happens frustratingly often – if the only version of the information the FOI officers can track down is a printout, or if there has been some redaction to protect personal data and the easiest way to do it was by physically covering up and then rescanning the documents.

    Getting data out of these types of files is hard – you cannot copy and paste or use online converters.

One thing you can try is optical character recognition (OCR) – this converts images of text into actual text, which you can then edit or analyse in other programmes.

    Software to do this often comes with scanners, so it may be worth checking round to see if anyone has it loaded on their computer.

    Otherwise, there is online OCR, which works in a similar way to online PDF converters.

    You can upload files to sites such as Free Online OCR and click preview to see an image of the file.

    An example file to practice on.

You can then drag the box on the page around the content you want to convert. This means you can narrow it down to just the data, rather than all the information in the document.

    Converting PDFs like this is likely to be slow going. You have to work page by page with the online converter, making it a less feasible option for big documents.

    Data you get out of documents in this way is likely to be very messy – tables will not be converted as tables and the character recognition does not always work, so you end up with odd bits of text.

The cleaner the document, the better the recognition is likely to be – poor quality scans tend to mean blurry characters that will not register properly with the programme.

    The above example contains a simple three-column table of ambulance response times, with each table headed with a postcode.

    When you copy the converted data into a spreadsheet, it shows as one column.

From there, it is fairly easy to split up the data. As each piece of data is one phrase – the date (connected with a hyphen), the number of calls, and the average response time – separated by spaces, you can use those spaces to split the single column back into three.

    You can do this by using the Text to Columns option – which works in a similar way to Word’s convert text to table.

    Highlight the column you want to split and then select Data > Text to Columns (this works in both Microsoft Excel and OpenOffice Calc). This will bring up a dialogue box that gives you options for how you want to split the column, using either a fixed width or a delimiter such as a space, comma, or tab.

    With this data, select space and click OK to get the data back into three columns.
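The same split can be done in a couple of lines of code. Here is a minimal Python sketch, assuming each OCR’d line reads date, number of calls and average time separated by spaces (the values are made up):

rows = ["01-Jan 431 00:07:51", "02-Jan 402 00:08:13"]   # hypothetical OCR output
table = [line.split() for line in rows]                 # split each line on whitespace
for date, calls, avg_time in table:
    print(date, calls, avg_time)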

    Step 6

What to do if none of these work?

    Sometimes transcribing is an option – a very slow one, but it is possible.

Sometimes it may be the only way – court data in England and Wales is very difficult to get in a useable form, and the data team at the Guardian had to extract the data on riot-related court cases by hand when carrying out its Reading the Riots project.

    There is also the option of using code to extract data from PDFs.

    Investigative journalism website ProPublica had to use a number of different techniques to extract data from PDFs as part of its Dollars for Docs investigation, which tracks financial ties between doctors and medical companies.

    The techniques they used are outlined on their website:

    * Turning PDFs to Text

    * Getting Text Out of an Image-Only PDF
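To give a flavour of the code route (this is an illustration, not ProPublica’s method), here is a minimal sketch using the pdfplumber Python library on a text-based PDF; the filename is hypothetical.

import pdfplumber

with pdfplumber.open("ambulance_times.pdf") as pdf:   # hypothetical file
    for page in pdf.pages:
        table = page.extract_table()                  # returns a list of rows, or None
        if table:
            for row in table:
                print(row)

This only works on PDFs that contain actual text; scanned images still need the OCR route above.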

    Open Refine for cleaning data

A great tool for cleaning up messy data is Open Refine, which has a number of options to help tidy your data. Among them is the cluster option, which tries to find matches in your data, helping you iron out misspellings and similar references to get consistent naming (particularly helpful if you want to do counts of things).

    Open Refine offers a number of quick transformations you may want to carry out on your data to make it tidier and easier to work with.

    From the drop-down menu next to the column you are editing, select Edit Cells > Common Transformations.

The first two options allow you to get rid of unnecessary spaces, either between words or at the start or end of the cell. It is not uncommon for people to add these when entering data, and while two cells may look the same, the computer will see the different number of spaces and count the cells as different.

The next option is Unescape HTML entities. Sometimes entries contain HTML – for example, on the Food Hygiene Ratings, the right to reply text contains HTML so it shows up formatted on the website. This option turns HTML entities for things like punctuation and spaces (e.g. &nbsp; for a space) back into what they are supposed to look like.

    The next three options allow you to change all the items in the column to upper case, lower case or title case, where all the words are capitalised. This does have some limitations with hyphens and acronyms as it simply capitalises the first letter of everything the programme thinks is a word.

It is also possible to change the case of items in Microsoft Excel and OpenOffice Calc, using the following formulas.

    * Titlecase: =PROPER(*Cell Reference*), e.g. =PROPER(B1)

    * Uppercase: =UPPER(*Cell Reference*), e.g. =UPPER(B1)

    * Lowercase: =LOWER(*Cell Reference*), e.g. =LOWER(B1)
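For anyone working in code rather than a spreadsheet, a rough pandas equivalent of the trimming and case transformations might look like this (the column values are made up):

import pandas as pd

df = pd.DataFrame({"name": ["  Cardiff ", "cardiff", "CARDIFF  "]})   # hypothetical messy column
df["name"] = df["name"].str.strip()    # trim leading/trailing whitespace
df["name"] = df["name"].str.title()    # title case
print(df["name"].value_counts())       # all three now count as the same value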

The To number, To date and To text options do the same thing as formatting cells does in Microsoft Excel or OpenOffice Calc. For example, if you have a list of phone numbers, you may want them read as text, because programmes have a tendency to delete any leading 0s when they think the data is a number.

    The last option blanks out all the cells.

Under the Edit Cells menu, there are also options to blank down repeated values and to fill content down into empty cells.

Split and Join multi-valued cells can be used to split or join cells using commas, spaces or other punctuation. This can be useful if the data you have been sent has addresses split across several cells and you want them all in one.

    Cluster and edit will take you back to the Cluster panel.

    Options under Edit Columns allow you to remove or rename columns or swap the order around.

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

    May 2, 2014
    by Claire
    0 comments

    Data Journalism for Beginners: Where to find some data

    Part of a series of posts looking at tips and ideas for getting started with data journalism

    May 1 - why data journalism and getting started
    May 2 - finding your data
    May 3 - cleaning it up
    May 4 - mapping and visualising

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

    Sources of Data

    Data can be found in a variety of places.

    As more and more is released, knowing how to find the most useful datasets and those that contain the most interesting stories will be a key skill.

    Government Statistics

The ONS publication hub, a useful daily source for data

    Governments release a lot of data and tend to do it on a regular basis.

Much of this will be collated data, where figures are combined over a large area or a longer period of time to give an overview of a subject, such as the birth rate or burglaries.
    These kinds of releases provide much of the day-to-day information for data stories.

    They are the kind of stories most journalists will regularly cover, particularly anyone with a specialist beat such as crime, health or education. Most journalists probably would not particularly see these kinds of stories as data journalism.

    Working with these types of datasets tends to have a fairly straightforward workflow:

• Look at the data when it is released, usually early in the morning.
• Analyse it to pick out any possible changes – as the data tends to already be summarised, most stories will come from comparing different years or different areas, to see changes over time or areas where the numbers are particularly high or low.
• Write stories, get comments, and try to get experts to explain why the numbers might be as they are or, if the numbers highlight a particular issue, get political comment on that.
• Put together visualisations and datasets to illustrate any stories.

    Data sources to get you started

• ONS – the Office for National Statistics website holds all of the datasets the ONS releases, some going back to the 1960s.
• ONS Publication Hub – data organised by date of release – this is the best way to keep up with official statistics as they come out, and a good way to plan stories around upcoming releases.
• Data.gov.uk – another repository for UK data – this one is broader than the ONS, and tends to include datasets put together by individual departments as well as data released as open data.
• Data.gov – the US equivalent of data.gov.uk – bringing together data from federal agencies. It also has links to open data sites for states and cities, as well as around the world.

    Written answers

    In many parliaments, there is an opportunity for opposition politicians to submit questions to those representing government departments. The answers to these questions are often published on the website for the parliament or government and in some cases will include tables of data.

It is often easy to search through the published answers – scrolling down, scanning for tables – and the data included can be interesting; the information on children in custody, for example, came from a parliamentary answer.

    If you are looking for data on a specific topic or related to a particular area, you can use They Work For You to create RSS feeds or alerts for searches.

While you have little control over what data is published, it is worth checking the published records of the relevant parliament – you may turn up data that does not apply to the area you cover but gives you ideas for FOI requests.

    Searching

    The internet is full of data, buried away on sites, waiting to be found and turned into stories.

Sometimes just looking on a website will not lead you to the most interesting data – council websites, in particular, can be notoriously difficult to navigate.

    Often a Google search will return everything but what you are looking for.

    However, Google search has a number of options that make pinpointing datasets or searching out what public bodies hold easier.

• To restrict a search to just one site, type site: followed by the address of the site, without the http:// or the www.
• To restrict results to one file type, type filetype: followed by the file type you are looking for (e.g. filetype:xls).
    • Put phrases inside quotation marks (“  ”) in order to search for the exact phrase instead of separate words
    • Use a minus sign (-) in order to eliminate words or phrases from your search
    • Narrow down a time period or range of numbers by putting the two numbers at either end of the range separated by two full stops (e.g. 2009..2011)
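Combining several of these can be powerful. A hypothetical example, hunting for spreadsheets on government sites: site:gov.uk filetype:xls "zero hours" 2009..2014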

    Freedom of Information (FOI)

    Freedom of Information acts can be powerful tools in obtaining data you are interested in. Many countries now have laws in place allowing people to request access to information.

    What data are you looking for?

Envisage what the data will look like, then ask for it. Be specific – do you want a number per year, per area? Per month? You do not just have to ask for figures – you can also ask for documents.

    Better yet…

Seek data in its original form – ask the organisation to send the anonymised database exported to a spreadsheet.

Potentially, this gives you more options for finding stories, as you have the raw data to work with and to look for patterns and outliers in.

    But it is not always easy to convince authorities to do this.

    You may run up against problems such as a lack of understanding about why you might be interested in the raw data, uncertainty about how to deal with big datasets, programmes that do not seem to allow export, concerns about personal data, and worries about the amount of time it will take to get the data.

It may help, when seeking a dataset, to ask for the record, code sheet or scheme – i.e. the categories under which the data is collected, e.g. date, time, description – to get a better idea of what the data you are seeking looks like. Hopefully this will make it easier to work out what is personal data and should be excluded, and what can be released.

    FOI Tips

• You do not have to mention the Freedom of Information Act, but doing so helps get your request to the right person.
    • You have to include a name and contact address (although this can be an email address, and you are not required to give a postal address). It helps to include several forms of contact details, in case someone needs to get in touch to clarify your request – telephone numbers are a good idea.
• You can use a fake name (but make sure it is something sensible), although generally this should be avoided. The Act says requesters must give their real name, and authorities can refuse requests from people they think are using a pseudonym. You may also run into problems if you want to make an appeal.
    • You do not have to say why you want the information and the fact you are a journalist is not relevant.
    • Be specific.
    • If you know what documents you want, ask for them by name, or keep requests limited to a small subject area.
• If you know the technical terms, use them – e.g. children placed for adoption without their parents’ consent are subject to a care and/or placement order made through the courts.
    • For police, crimes recorded are usually easier to find on the computer system than incidents reported.
    • Councils tend to mainly work to financial years, as that is usually how projects are budgeted.
    • With FOIs it is best to ask for facts and figures – how much was spent on something, what is the current policy – rather than opinions. You can, of course, ask for documents containing people’s views on things.
• If you suspect there may be a lot of data for a topic, limit the time period you ask for – you can always ask for more years in a later request if you think having more information over a longer period is important. However, do not try to get around time limits by breaking your request into pieces and sending each one off as a separate request. This just annoys FOI officers, and they can combine them all into one request and refuse it on time limit grounds anyway.
    • Ask for confirmation that the request was received – it is always annoying to find out the reason you have not had a reply is because the spam filter blocked your request and you have to start over. Most authorities are pretty good at sending back something saying they are working on a response and giving an indication of when it will be due. If you do not get a confirmation, it is worth checking the request was received and has not gone astray.
    • Make a note of when a response is due back. Authorities have 20 working days to answer from when they receive a request. It is worth putting a note on the calendar, taking into account weekends and holidays and allowing a bit of leeway as it can take a couple of days for people to check their emails. Schools have up to 20 school days or 60 working days to account for the much longer holidays.
    • Information requests about the environment will often be dealt with under Environmental Information Regulations, which are generally treated in the same way as FOI requests. They have the same time limit of 20 working days for the authority to respond, but fewer exemptions for not providing the data – primarily personal data, national security, and whether a request is manifestly unreasonable.
    • If your request is for information about yourself, such as your medical records, you should make a subject access request under the Data Protection Act. You may be charged a £10 fee to have the authority compile this data.

Need some ideas for FOI requests? Here’s a great big list of possible requests, and here are some tips on making use of FOI more efficiently.

    Scraping

Scraping is the process of creating a computer program to extract data from a webpage, data stream or PDF and pull it into a useable resource such as a database.
It can be an incredibly powerful technique for gathering data that is not in a useable format; however, many of the techniques require some understanding of computer code.

    Very simple scraping

    The food hygiene ratings are available as open data from the Food Standards Agency.

Open up Open Refine (formerly Google Refine). You may need to download the program first – just follow the installation instructions and start the program running.

    Open Refine runs in the Command window (the window with the black background) while you interact with it via a webpage. It describes itself as a power tool for working with messy data.

    Once you have the program running, click Create Project and then select Get data from Web Addresses (URLs).

Then you need to copy in the URL of the page where the data is stored as XML. Click Next.

You will then see a screen showing the layout of the data in XML; you need to highlight it at the level you want to import into the table.

    With the food hygiene data, the level you need to start at is the Establishment Detail level, as it contains all of the data about each establishment that has been rated. Click on that level to highlight it in yellow.

Open Refine will then scrape the XML data into a table for you and display it on the next page so you can check it is scraping what you want. If it is, you can click next to see your table, where you can either download it to work on elsewhere or use Open Refine to clean and analyse the data.
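The same scrape can be done in a few lines of code. Here is a minimal Python sketch; the file URL is hypothetical, and the element names (EstablishmentDetail, BusinessName, RatingValue) are taken from the published FHRS schema, so check them against the file you download.

import urllib.request
import xml.etree.ElementTree as ET

url = "https://ratings.food.gov.uk/OpenDataFiles/FHRS541en-GB.xml"   # hypothetical file URL
xml_data = urllib.request.urlopen(url).read()
root = ET.fromstring(xml_data)

for establishment in root.iter("EstablishmentDetail"):   # one element per rated premises
    name = establishment.findtext("BusinessName")
    rating = establishment.findtext("RatingValue")
    print(name, rating)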

    You can also use the data to create things like this map of Welsh premises with the lowest food hygiene ratings.

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

    May 1, 2014
    by Claire
    1 Comment

    Data Journalism for Beginners: Some maths and Excel basics.

    Part of a series of posts looking at tips and ideas for getting started with data journalism

    May 1 – why data journalism and getting started
    May 2 – finding your data
    May 3 – cleaning it up
    May 4 – mapping and visualising

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

It’s all about the spreadsheets

    Some basic maths

    The most useful thing for analysing data is a few simple sums.

Usually, when we’re looking for a story, we only use a few basic sums – mostly to even out variations in population size, to make better comparisons, or to see what’s changed from year to year.

    Percentage of

    =(value/total)*100

    Useful if you want to calculate how much a figure is in comparison to a total amount.

For example, you might want to calculate how much is spent on supply teachers as a percentage of the total school budget, or what percentage of a prison’s total population has been identified as self-harming.
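A quick worked example with made-up figures: if a school spent £50,000 on supply teachers out of a £2m budget, =(50000/2000000)*100 gives 2.5%.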

    Percentage change

    =((New – Old)/Old)*100

    If you have a dataset where you have data for two different time periods, one of the things you can look at is whether there has been a change between those two periods.
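For example, with made-up figures: if evictions rose from 1,250 one year to 1,400 the next, =((1400-1250)/1250)*100 gives a 12% increase. Note that the base matters – if they then fell back to 1,250, =((1250-1400)/1400)*100 gives a 10.7% decrease, not 12%.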

    Per 100/1,000/10,000 population

    =(number/total population)*100 (or 1,000 or 10,000)

This is the best way to compare figures for areas with vastly different population sizes – otherwise, it is fairly likely that areas with lots of people will have more of whatever you are trying to measure.
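For example, with made-up figures: 320 burglaries in an area of 85,000 people is =(320/85000)*10000, or 37.6 burglaries per 10,000 population – a figure that can be compared fairly with a much bigger area.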

    Equivalent to how many – days lost, per minute

    Another way to show how often something happens is to calculate it in units that people can grasp – days, weeks, minutes etc.

For example, 65,000 parking tickets issued in a year is a lot; divided by 365, it is 178.1 a day.

    You can then divide it by the 24 hours in a day, or more realistically, the number of hours in the working day for traffic wardens, e.g. 7am to 11pm = 16 hours, so 178.1 divided by 16 equals 11.1 tickets an hour.

From there you could work out how often a ticket is issued, on average, in minutes – 60 minutes divided by 11.1 equals one every 5.4 minutes. Round to five minutes and you potentially have a figure to start your story with.

    Mean and median

    The mean is another way of describing what is generally known as the average.

To get the mean, add up all of the values in the group you want the average for, then divide the total by the number of items in the group.

    Sometimes, averages can be misleading – for example, when looking at average wages, a small number of people who earn very high wages can skew the average so it is higher than a number that reflects what most people earn.

    In these instances, using the median, or the mid point of the numbers, may be more accurate. To get the median, you would line up all the numbers in ascending order and then pick the number that falls in the middle of the sequence.

Quartiles, quintiles and deciles are calculated in a similar way to the median: quartiles are found by dividing the group into four equal parts, quintiles into five, and deciles into 10.

    Spreadsheet Basics

    Working with your data in a spreadsheet can be a good way to speed up your analysis by helping you to perform lots of calculations quickly.

    =SUM

    If you want to know the total of a column or row of numbers, this is the formula you need.

To calculate a total, type =SUM( into an empty cell, then drag a box around the cells you want to add together and close the bracket. If you want to add several unconnected cells or groups of cells, you can include them in the same sum by separating them with commas.
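A couple of hypothetical examples: =SUM(B2:B20) totals one block of cells, while =SUM(B2:B10, D2:D10) adds two separate blocks together.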

In Microsoft Excel, clicking the Sigma symbol in the toolbar will highlight nearby cells so you can total them. If you want to change the cells highlighted, drag the box to the area you want.

    =AVERAGE

    Just as knowing the total amount spent can be helpful, so can knowing the average amount spent.

    The formula works in much the same way as =SUM.

    In some cases, using the median, or the mid-point in the data where 50% of values are higher and 50% are lower, may be a better indicator.

    You can work out the median in the same way as the average by starting your formula =MEDIAN.
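For example, with a hypothetical range: =AVERAGE(B2:B100) gives the mean of that block of cells, and =MEDIAN(B2:B100) gives its midpoint.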

    =MIN, =MAX

    One way to find out the minimum and maximum amount in a column or row of figures is to use the above formulas.

However, if you want to know which item in the list is the minimum or the maximum, and any additional connected data, you would otherwise have to scroll back through the whole list to find it – not practical with a very long list.
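One workaround, assuming for illustration that your labels are in column A and your figures in column B: =INDEX(A2:A100, MATCH(MAX(B2:B100), B2:B100, 0)) returns the label on the same row as the biggest figure (swap MAX for MIN to find the smallest).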

    =COUNT

    If you want to know how many items are in a list or part of a list, the count formula can be really helpful.

    There are a number of different options for using count:

    =COUNT will count all of the cells in the given range that have numbers in.

=COUNTA counts all the cells that have something (text or numbers) in them.

    =COUNTBLANK counts all the cells that are blank.

You can also add conditions to the counts so only certain cells are included in the total. For example, =COUNTIF can be used with conditions such as > (more than), < (less than) or = (equal to).
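For example, with hypothetical ranges: =COUNTIF(B2:B100, ">50") counts the cells containing numbers over 50, and =COUNTIF(A2:A100, "Cardiff") counts the cells that say Cardiff.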

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

    May 1, 2014
    by Claire
    1 Comment

    Data Journalism for Beginners: Why do it and how to get started

The School of Data Journalism is currently under way at the International Journalism Festival in Perugia, and to coincide I’m going to try some actual blogging about doing data journalism to go alongside it – so if you haven’t been able to make it to Italy and haven’t had a chance to watch the live streams, here are some ideas, tips and tutorials for getting started with data journalism.

    May 1 – why data journalism and getting started
    May 2 – finding your data
    May 3 – cleaning it up
    May 4 – mapping and visualising

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

    Why data journalism?

What makes data journalism skills worth learning? Because they can help you find stories, or add weight to ideas, in data that is becoming more plentiful and freely available. Because you can do things more quickly and easily, like analysing your FOI responses in a flash. Because visualisations can help tell stories in a more interesting way. And because, if you don’t, other people will beat you to these stories.

Steve Doig told those at the School of Data this morning: “The important word in ‘data journalism’ is ‘journalism’, not ‘data’.” That pretty much sums it up – the data is where you can find your story, but you need journalism to turn it into one.

    The key to getting to grips with data journalism is remembering that it is just like all other types of journalism – what is the story?

    All the skills you would use to find a story in any other situation still apply when you are looking at a spreadsheet full of numbers.

Some headlines from data journalism related stories (any excuse to get a Star Wars reference in)

    In many ways, you set about questioning a dataset or putting together a Freedom of Information request in much the same way you would think about interviewing a contact. You work out what information you are hoping to get back and the questions you might need to ask to do that.

    Things that might make a data story:

    • The biggest, the highest, the most
    • The smallest, the fewest
    • Fastest growing
    • Location
    • Successes
    • Failures
    • Why?

    People want information about things that affect their lives – how a government policy will affect them, how well local services are performing, how where they live compares to other places and whether things are better or whether they need to change.

Many stories have scope for data journalism. Many stories already are data journalism.

    The following examples show how combining traditional journalism skills with data journalism skills can help to build the story.

    Housing

     Data journalism

    Official statistics show a jump in the number of landlords going to court to get their tenants evicted.

    The figure may just be a quarterly blip, so monitoring the trend over the longer term would be a good idea.

    Traditional journalism

    However, the timing of the period the data covers coincides with people starting to have their benefits reduced because of changes to Housing Benefit – the figures could mean more people are getting into arrears leading to their landlords trying to evict them.

    Data journalism

    If you have the number of eviction attempts in each area, can you also get the data for how much Housing Benefit has been reduced in those areas? Is there a correlation between areas with a big shortfall between the old and new Housing Benefit payments and high levels of eviction orders?

    Traditional journalism

    Talk to charities that help people facing housing problems and councils who receive applications from people in need of housing. See if they are experiencing an increase they can put down to landlords evicting more people with rent arrears because of Housing Benefit changes. Is it possible to go to court hearings for possession orders and find out the reasons behind applications?

    Go where the stories are

Another reason to explore data journalism is that it can be the answer to a story drought.

Newspapers and websites come with a high demand for new content, and ideas are sometimes in short supply: sources aren’t always supplying good tips, and press releases are of varying quality, but with data, at least, a bit of digging can lead to something new.

On the Trinity Mirror Data Unit, our main day-to-day job is to analyse available datasets (a mix of the ONS releases for the day, ideas for things we think might have datasets, and any datasets or FOIs we’ve got hanging around) and come up with possible lines for the papers in the group – covering the two data journalism steps above – while journalists on the regional titles, with the local knowledge and contacts, fill in the traditional journalism steps.

Doing this, we can often come up with four or five potential leads a day just working through datasets, often looking for the basic story ideas (see above) or doing fairly simple analysis (combining datasets, finding rates and percentages). The emphasis is mostly on finding the most interesting things we can from the dataset as quickly as possible, because the story idea needs to be with reporters early so they can start getting comments on it.

    There are interesting stories out there, so getting the skills to go get them is definitely worthwhile.

    What you need to get started

    In order to get started with data journalism, there are a few tools you need, many of which you will already have, and all of which have free options.

    • A program that opens spreadsheet files – Microsoft Excel or OpenOffice Calc, which is open source software, so is free to download and use. Alternatively, you could use the spreadsheet option that is part of Google Drive.
• Depending on what files you are working with, you may also need a PDF reader and a program to open Word documents – there is one included in OpenOffice, or again you could use Google Drive.
    • A dataset or several datasets – the next section looks at where you can find data.
• A place to write stories and a means of publishing them.

    You may also need:

    • A Google account – you will need this to access Google Fusion Tables and it is helpful for storing and sharing documents.
• Something to visualise data with – there are a number of online options, including Google Charts, Google Fusion Tables and Tableau, which is available in both free and paid-for versions.
• Open Refine (formerly Google Refine) – for cleaning data

Want more? Buy Getting Started with Data Journalism, a complete beginner’s guide to finding, cleaning, analysing and visualising data in any size newsroom

    April 24, 2014
    by Claire
    0 comments

    Why Labour’s smear test attack on the coalition just doesn’t stand up

The coalition is failing women: the number who are overdue smear tests has soared 11% since it came to power, possibly due to the difficulty of getting GP appointments. Or so Labour’s argument goes.

When working with data, it’s a really good idea to make sure you’re working with as comprehensive a dataset as possible, and that you’ve sense-checked it before coming to any conclusions (this is probably less important if you’re a political party trying to make a point, but for data journalists, let’s assume this is your starting point).

Every morning we go through the PA schedule for possible data stories to work on that day, looking for things where we can find local angles or more data.

One of the stories this morning was Labour’s research into the number of women who have missed smear tests, comparing the most recent figures (2012/13) with those from before the coalition came to power (2009/10), based on HSCIC data.

Obviously the dates were chosen to fit Labour’s narrative that the coalition is doing a bad job, but the other thing that springs to mind is Jade Goody, and the rise in women making smear test appointments as a result of the coverage of her illness and death.

The timing would fit, with much of the press coverage coming in the early part of 2009, boosting numbers into the next financial year.

    So is the number of women missing smear tests a lot worse than it used to be or was 2009/10 just a lot better than normal?

The figures from the HSCIC usefully go back to 2007/08 and show that while the number of women overdue a smear test (by 3.5+ years for 25 to 49s and 5+ years for 50 to 64s) is 10.9% higher than in 2009/10, it is only 1.3% higher than in 2007/08 – suggesting this isn’t a steady downward trajectory.

Also, the number of women is a good figure for headlines but not a great piece of data for comparison, as the population is growing – so even if the proportion of women missing tests stayed the same, or even fell, there could be more women in total who missed appointments.

In terms of the proportion of women who are up to date, the figures show that in 2012/13, 73.4% of women were up to date with their smear test, compared with 72.5% in 2007/08. So a smaller proportion of women are missing their smear tests now than in 2007/08 under Labour.

    (And now I’m thinking I should have made a graph of this, as it would have explained it better).

Given all this, what’s most likely is that government policy has less impact on how likely women are to make a smear test appointment than press coverage of a celebrity cancer battle, and we’re just reverting to the norm – a standard level of refuseniks and the disorganised – after a few years when getting screened was a higher priority.

    Simon Rogers was talking earlier about the need for transparency in data journalism, the need to publish or link to your data. This is an example of where knowing the data source means you can check the analysis (and for us, dig a little into the regional variations).

    January 9, 2014
    by Claire
    2 Comments

    With FOI think MVP

Being quite busy recently, I’ve been thinking about ways to use Freedom of Information Act requests to generate stories without taking up huge amounts of time (mine, and FOI officers’).

    I’ve got a couple on the go at the moment – one involves working through reams of documents to pull out information and get it into some kind of structured spreadsheet, the other involves adding a line or so of data to a spreadsheet each time a new response comes in.

Actual reality – I’m never going to find time to get through the reams of documents to turn them into a spreadsheet, as it involves reading the documents, pulling out the relevant data and then re-typing it into the spreadsheet. So it’s not very likely to get put into the structure it needs to be in to find the stories.

The other one – looking at the number of operations cancelled because of bed and staff shortages at each hospital trust – is more realistic. The data comes in as a small table for each trust, which can then be copied into the master spreadsheet: about five minutes of work for each trust. This is a bit of a slog when lots of responses come in at once (which they invariably do on day 20) but is much more doable.

When putting in FOI requests, it’s worth thinking about what you’re trying to get back. It’s tempting to ask for all the information you can, but what are you going to do with it? Think ‘minimum viable response’ – a bit like minimum viable product: what’s the minimum I need to get this question answered and this story off the ground? If you get more, great; if the answers raise more interesting questions, you can expand on what you planned.

With the cancelled operations example, the question I was looking to answer was whether hospital trusts were seeing more cancellations due to staff and bed shortages over the past three years, possibly as a result of increasing pressure on services. The information needed is the number of cancellations in each of those categories for each of the past three years. Some trusts may send more of a breakdown, and others may give an idea of how the number of cancellations compares to the overall number of operations – all useful, but secondary to the main question.

With a question to be answered and an idea of what the responses will look like, it’s easier to build a system, and a spreadsheet, to feed the data into – which makes analysis a lot easier than a vague-ish fishing trip.

    January 2, 2014
    by Claire
    0 comments

    Journalism New Year Resolutions

    Do I write this post every year? Probably.

    Some things I’m planning to work on this year.

    Blog more

    Yes, I say this a lot, then I get busy and don’t write anything for months.

Not really sure of the best way to tackle this one. Might try setting time aside each day and just writing, hoping that some of it might be interesting. Also actually writing up ideas instead of thinking about writing them up and then not doing it.

    So I’m going to try for a blog post a day for a week to try and get things going and then maybe actually work on scheduling a bit of time a few times a week just to write something.

    Learn new things

Really, if your journalism resolution doesn’t include learning something new, I’m not sure you’re going to survive in this industry (because the only thing keeping this industry going is trying new things (and hoping some of them work)).

    I think I might have got stuck in a bit of a visualisation rut. Everything ends up looking the same because it’s quick and easy, which is good for getting stuff done but not great for innovation and interesting-ness.

So learning new stuff this year has two aims: I want to generally improve my coding skills, because I want to get a bit further than the current situation of ‘can sometimes make things sort of work’, and I also want to learn some visualisation languages, like Processing and D3.js, to try to get a bit more flexibility with creating stuff.

I’ve signed up for the Harvard intro to computer science course, which seems to be pretty good in not having weekly work expectations.

    Love online courses, rarely actually finish them.

    Also looking at Code Year to try and get into the habit.

    If you’re also looking for some new things to learn, some other ideas:

If you want to learn SQL and other databases, the Stanford introduction to databases course is excellent and starting soon. It’s a good guide to the basics, especially SQL queries, and actually goes at a reasonable pace (I find some of the Coursera courses involve more hours of video watching a week than people with jobs actually have as free time).

If you want to learn some data journalism skills, there’s Getting Started with Data Journalism. Apparently there’s some call for those types of skills…the Trinity Mirror Data Unit is hiring (see here and here), as are Newsnight, the Telegraph and the Guardian.

    Get more organised

Another one that is twofold. Having decided to start 2014 with another round robin FOI request, I think I need a better system (or to send more FOIs to big over-arching organisations and fewer to councils).

If your journalism resolution is to create more work for public bodies (I mean, put in more FOI requests), there’s a great big long list of FOI ideas that could get you started.

Meanwhile, I need to sort through all the data I currently have stored. There are some good stories in there – things that could be updated, improved on, or mashed together – but it isn’t in a very useful system at the moment. Implementing good file naming conventions is a really boring resolution, but it’s probably a good idea.

    October 30, 2013
    by Claire
    0 comments

    The boundaries of data

Main Entrance to West Middlesex Hospital, Isleworth, London

What do you do about tracking data relating to an organisation or area over time, if the boundaries keep changing?

Recently, I’ve been doing more with English NHS data – a new one for me, as I’ve mostly been dealing with Welsh NHS data which, it turns out, is relatively straightforward in comparison.

One of the problems is that, over even a short period of time, hospital trusts are created, merged, and cease to exist. This means that if you’re looking to collate data over a longer period, you need to find a way to combine data to reflect the most recent incarnation of the trust.

It isn’t too difficult to sort things out if what’s happened is a straight merger and you have the raw numbers. So, for example, Barts Health is made up of Barts and the Royal London, Newham, and Whipps Cross. By adding together the number of attendances at A&E, for example, you should get a reasonably accurate idea of what’s been going on at all the hospitals in the trust, even when they used to belong to other trusts.

However, as of October 1, South London Healthcare NHS Trust has ceased to exist. But it has not simply merged with another trust. Instead, the three hospitals it managed have been taken over by different trusts – Princess Royal University Hospital, Bromley, became part of King’s College Hospital NHS Foundation Trust; Queen Elizabeth Hospital became part of Lewisham Healthcare Trust (now Lewisham and Greenwich); and Queen Mary’s, Sidcup, was taken over by Oxleas NHS Foundation Trust (which wasn’t even an acute trust before this – see, I said this was complicated).

    So you end up with data for three hospital trusts that you can’t compare to previous years.

One way around this might be to break everything down to the smallest unit – in this case hospitals – as they’re less likely to be subject to change, given they have a physical location, and it would be easier to keep track of how the data has changed over time.

    August 21, 2013
    by Claire
    0 comments

    FOI and Datasets

Request Initiative, a not-for-profit community interest company that helps charities, NGOs and non-profits make use of FOI, is blogging top tips for using the FOI Act.

This week’s topic, asking for datasets, is timely, with the new FOI rules on datasets coming in from September 1.

    The advice is good – datasets are always much better for analysing than the response you are likely to get if you let an FOI officer do the number-crunching for you (usually it’s the chance to maybe find several stories in the dataset, rather than just getting the answer to one question).

    All good advice until I got to this bit, which just doesn’t match what I’ve found to be the case.

    The best way to request a whole dataset is in CSV (comma separated values) format. Whatever proprietary software might be in use, every database can export to CSV and you can open it with Excel when you receive it.

    Not the bit about asking for CSVs (ask for them, they’re great!), but about every database being able to export to CSV.

    You’d think, right?

The problem, and experience has taught me this, is that either no one knows how to get the programme to export to CSV (because it’s never come up before, because that’s not how the programme is used, because people generally only learn to do what they need to do – and if they’ve never needed to do that, it probably isn’t something they know the programme can do) or it genuinely doesn’t export to CSV (or at least everyone in the organisation swears it doesn’t).

Yes, there is an entire story in how the public sector may have invested in proprietary software for organising data and processes where, if they ever decide to change software, they would have no easy way of exporting their old data and importing it into a new system (except, probably, at huge cost). That, or the programme does have the option but it’s not enabled.

    There was a great post yesterday from Andy Dickinson about trying to get data from the Home Office that suggests staff have to look through 75 screens to find information in just one database. Also yesterday, I had a conversation with South Wales Police, well known for their excellent records management, about how it would take them 2,000 plus hours to get some data out of a database (I did ask about exporting, you know, as opposed to typing it out by hand).

Which brings me to why I don’t think the new rules on datasets are going to make it any easier to get datasets out of public bodies.

The ICO has issued guidance on the new requirements. Reading through it, in theory nothing has changed for most FOIs: small data tables put together to answer a question aren’t datasets covered by the new requirements, so most FOI responses will go out much as they did before.

Apart from that, most responses concerning what probably are datasets under the requirements could be made compliant by not converting spreadsheets to PDF (seriously, stop doing that), and instead saving them to CSV and making them reusable under the Open Government Licence.

Do what Monmouthshire Council do and make the Open Government Licence your default setting.

    Simple.

However, I suspect the new dataset requirements are not going to lead to the bounty of data the Government is hoping for.

    The issue with requesting datasets now isn’t format (yes, PDFs are annoying when you get them but most are convertible), it’s getting the raw material out in the first place – the programmes that just don’t work like that, the information that isn’t held in that format.

It’s not going to solve the stand-off where a requester incredulously points out that being able to export information in a simple, open format should be a basic requirement of any software purchase, while a council officer insists that, of course, the information can’t be exported, because why would anyone need to do that when everything is done in the programme – and the FOI officer is stuck in the middle.

The new requirements may mean more attention starts to be paid to finding out how to export datasets, and may give FOI officers more power to get people asking questions of suppliers about export options, but I suspect the caveat that data should be released in this form “so far as reasonably practicable” will win out, and datasets will remain locked down.

    On the subject of blogging about FOI and blogging weekly, I’m aiming to try (not committing myself there) to blog a bit more regularly (or at all). As I’ve finished my book, I have a little more time, so should probably put it to better use.

The plan is some more posts on FOI, data journalism (and I’m hoping the free time will allow for a bit more experimenting), and open data, at least semi-regularly. I have great hopes but then again, this blog is three-and-a-half years old and has fewer than 35 posts.