clairemiller.net

Data Journalism for Beginners: Cleaning up your data

| 2 Comments

Part of a series of posts looking at tips and ideas for getting started with data journalism

May 1 - why data journalism and getting started
May 2 - finding your data
May 3 - cleaning it up
May 4 - mapping and visualising

Want more? Buy Getting Started with Data Journalism, a complete beginners guide to finding, cleaning analysing and visualising data in any size newsroom

Cleaning Data

Data is messy

You will need to tidy up much of the data you collect even before you start analysing it and finding stories.

20140503-232636.jpg

Some of the ways in which data can be messy:

    • It comes from several different sources and needs to be combined into one file.
      It is not in an easily useable format (for example, your data is stored as PDFs, web tables or other formats).
      It is full of abbreviations and/or the same data is coded in several different ways.
  • If you cannot understand or read the data, it is very hard to get a story from it.

    Data may go through a number of edits from its original format before it is in a state where it can be analysed.

    In order to keep track of these edits and to avoid making changes that make the data inaccurate, it may be useful to keep a log of changes that have been made as you go along.

    Here, we’re mainly going to be looking at the perennial problem for journalists, data sent as PDFs.

    Getting data out of PDFs

    Quite often official data will be supplied as a PDF, particularly if the data is contained in a report.

    As it can be difficult to work with data held in a PDF, you will often need to get the data out and into a more usable format, such as a spreadsheet.

    The following steps are just some of the ways you can liberate that data and make it easier to work with.

    Step 1

    Ask whether it is possible for the people responsible for the data to send you a more useable version of the information in the spreadsheet.

    They may be able to send you the original spreadsheet or text document the information came in.

    In the UK, you can add a request for a particular file format to an FOI request, which may help encourage officers to send it in that format.

    Changes to the Freedom of Information Act mean organisations will be required to provide any dataset in a reusable format, a change that should hopefully give more power to requesters seeking easier to use formats.

    Step 2

    Basic copy and paste sometimes works, allowing you to copy the data as a formatted table into a Microsoft Word or Excel document.

    In this case, the data from an FOI response from the Welsh Ambulance Service on ambulance response times almost copies and pastes into word with no problems. There are a few rows that do not have data for all the months, meaning the numbers do not necessarily line up with the right months. These would need comparing with the original document, so these rows could be corrected.

    Once finished, the cleaned data could then be transferred to a spreadsheet.

    Step 3

    Copy and paste and then clean up the data.

    For small amounts of data or not very complicated data, such as columns of numbers, you can copy and paste it into a Microsoft Word or Excel document. You can then tidy it up, either by converting the data to tables using delimiters, often a space, or by dragging columns back to the right place.

    Health behaviour in Young People

    The data is contained in a massive report from the World Health Organisation looking at health behaviours in children across the West – the report is too big to convert from PDF to XLS (see below) so another method was needed to get the data into tables.

    When copied and pasted into Word, the table data appears as one column with a list of countries followed by the figures for boys and girls as one list, alternating between the two genders.

    The first step is to convert the copied data into a table, using the Table > Convert text to table option. This should turn it into a table with one column and many rows.

    Copy and paste this into a spreadsheet. From the chart in the report you can see that there is a percentage for girls and a percentage for boys – turn those labels into column headers.

    All the numbers for each country are listed one after the other. To get two columns, one for girls and one for boys, you need to paste the numbers twice, with the second column dropped one row down so the girls percentage is next to the boys percentage for the same country.

    You then need to delete every other row, so you are left with the data for each country, with the girls percentage followed by the boys’ percentage. To delete rows, highlight the row by clicking on the number next to the row and then right-click to bring up a menu with the option to delete row. You can use Ctrl and click to highlight several rows at the same time.

    Next, move the list of countries up to next to the corresponding percentages. You may need to make sure that none of the country names have spilled over into adjoining cells when you created the table – if they have, just tidy them up so they are in one cell.

    Your table is now ready to use.

    Step 4

    Use a converter to turn the PDF into a useable format.

    There are many free online PDF converters on the internet, of varying effectiveness. The best tip is to upload your file to several and use the best conversion you get back.

    Online converters cannot cope with big files or PDF files where the table data is combined with large quantities of text. With these, it may be better to try converting those to Word documents and trying to extract the information in that way.

    In order to give an idea of how they work, this is an example of what they generally return when converting a simple PDF file. This involves uploading a simple PDF file to a number of different file converters to try to get an idea of how they would cope.

    The original PDF of ambulance data can be downloaded from here

    Some of the available PDF to XLS converters:

    * PDFtoexcel.org
    * zamzar.com
    * cometdocs.com
    * PDFtoexcelonline.com
    * free-PDFtoexcel.com

    To upload your PDF file, just follow the instructions on screen. The first four require an email for the file to be sent to, while the last is online based.

    Zamzar.com, PDFtoexcel.org and cometdocs.com all came back within minutes. On one occasion with this file, Free-PDFtoexcel.com also finished converting very quickly but on another, there was an error and it failed to upload.

    Zamzar.com and, when it works, Free-PDFtoexcel.com , puts each page of the document in a different spreadsheet table. The file from Zamzar.com threw up some warning messages when it was opening but seems to be ok.

    PDFtoexcel.org and cometdocs.com puts all of the PDF information in the same spreadsheet, with blank rows to indicate where page breaks were.

    Which type of conversion works better will depend on the PDF you are converting and what you are planning to do with the results.

    Whichever conversion site you use, the data is likely to be messy when you get it back. The converters tend to add blank rows when there is a page break in the PDF.

    In addition, depending on the size of the space between columns, they may not read them as two separate columns and leave them combined as one. If this is the case, it may just mean piecing tables back together and deleting blank spaces.

    You may also need to take care with rows where cells have been shifted out of line with their original column. For example, where there was a blank cell in the original PDF and the converter has read it either as not being there or as more than one blank space. In these cases, you may have to refer back to the original copy to ensure you have everything lined up properly.

    Step 5

    What if you are sent a PDF where the information is a scan of a document?

    This happens frustratingly often – if the only version of the information that can be tracked down by FOI officers is a printout, or if there has been some redaction done to protect personal data and the easiest way to do it was by physically covering up and then rescanning the documents.

    Getting data out of these types of files is hard – you cannot copy and paste or use online converters.

    One thing you can try is optical character recognition – this converts images of text into actual text, which you can then edit or analyse in other programmes.

    Software to do this often comes with scanners, so it may be worth checking round to see if anyone has it loaded on their computer.

    Otherwise, there is online OCR, which works in a similar way to online PDF converters.

    You can upload files to sites such as Free Online OCR and click preview to see an image of the file.

    An example file to practice on.

    You can then use the box on the page to drag it around the content you want to convert. This means you can narrow it down to just the data, rather than all the information in the document.

    Converting PDFs like this is likely to be slow going. You have to work page by page with the online converter, making it a less feasible option for big documents.

    Data you get out of documents in this way is likely to be very messy – tables will not be converted as tables and the character recognition does not always work, so you end up with odd bits of text.

    The cleaner the document, the better recognition is likely to work – poor quality scans tend to mean blurry characters that will not register properly with the programme.

    The above example contains a simple three-column table of ambulance response times, with each table headed with a postcode.

    When you copy the converted data into a spreadsheet, it shows as one column.

    From there it is fairly easy to split up the data. As each individual piece of data is all one phrase (the date, connected with a hyphen, the number of calls, and the average response time) separated with a space, you can use that space to split one column back into three.

    You can do this by using the Text to Columns option – which works in a similar way to Word’s convert text to table.

    Highlight the column you want to split and then select Data > Text to Columns (this works in both Microsoft Excel and OpenOffice Calc). This will bring up a dialogue box that gives you options for how you want to split the column, using either a fixed width or a delimiter such as a space, comma, or tab.

    With this data, select space and click OK to get the data back into three columns.

    Step 6

    What to do if none these work?

    Sometimes transcribing is an option – a very slow one, but it is possible.

    Sometimes it may be the only way – court data in England and Wales is very difficult to get in a useable form – the data team at the Guardian had to extract the data on riot related court cases by hand when it was carrying out its Reading the Riots project.

    There is also the option of using code to extract data from PDFs.

    Investigative journalism website ProPublica had to use a number of different techniques to extract data from PDFs as part of its Dollars for Docs investigation, which tracks financial ties between doctors and medical companies.

    The techniques they used are outlined on their website:

    * Turning PDFs to Text

    * Getting Text Out of an Image-Only PDF

    Open Refine for cleaning data

    A great tool for cleaning up messy data is Open Refine, which has a number of options that help tidy up your data, among them the cluster option which tries to find matches in your data helping you iron out misspellings and similar references to get consistent naming (particularly helpful if you want to do counts of things).

    Open Refine offers a number of quick transformations you may want to carry out on your data to make it tidier and easier to work with.

    From the drop-down menu next to the column you are editing, select Edit Cells > Common Transformations.

    The first two options allow you to get rid of unnecessary spaces, either in between words or at the start or end of the cell. It is not uncommon for people to sometimes add these when entering data, and while two cells may look the same, the computer will read the different number of spaces and count them as being different.

    The next option is Unescape HTML entities. Sometimes people add HTML to the entries – for example on the Food Hygiene Ratings, the right to reply text contains HTML to make it show up formatted on the website. This option will turn HTML for things like punctuation and space (e.g. ` ` for a space) back into want it is supposed to look like.

    The next three options allow you to change all the items in the column to upper case, lower case or title case, where all the words are capitalised. This does have some limitations with hyphens and acronyms as it simply capitalises the first letter of everything the programme thinks is a word.

    It is also possible to change the case of items in Microsoft Excel and OpenOffice Calc as well, by using the following formulas.

    * Titlecase: =PROPER(*Cell Reference*), e.g. =PROPER(B1)

    * Uppercase: =UPPER(*Cell Reference*), e.g. =UPPER(B1)

    * Lowercase: =LOWER(*Cell Reference*), e.g. =LOWER(B1)

    To number, date and text do the same thing as formatting numbers in cells does in Microsoft Excel or OpenOffice Calc. For example if you have a list of phone numbers you may want that to be read as text because programmes have a tendency to delete any starting 0s when they think the data is a number.

    The last option blanks out all the cells.

    Under the Edit Cells menu, there is another option to blank out all the cells and one to fill content down.

    Split and Join multi-valued cells can be used to split or join cells using commas or spaces, or other punctuation. This can be useful if the data you have been sent has addresses across several cells and you want it all in one.

    Cluster and edit will take you back to the Cluster panel.

    Options under Edit Columns allow you to remove or rename columns or swap the order around.

    Want more? Buy Getting Started with Data Journalism, a complete beginners guide to finding, cleaning analysing and visualising data in any size newsroom

    2 Comments

    1. Pingback: 50 blogs for journalists, by journalists 2014 | mshglobalnews

    2. Pingback: 50 blogs for journalists, by journalists 2014 « BLOG do SAMPAIO

    Leave a Reply

    Required fields are marked *.