• Post category:Data Journalism
  • Reading time:7 mins read

Using Excel as a web scraper – Getting Started with Data Journalism

Want to know if your area is a Costa capital or just how many Greggs you have per head of population? The answer may be found in the Food Standards Agency’s food hygiene ratings data.

Everywhere in the UK that sells food is inspected to ensure they meet basic standards of not giving people food poisoning. This means it provides a fairly comprehensive list of all food and drink venues in the country.

The other plus is the FSA publishes all the ratings as open data (fairly regularly updated by councils with the inspections they’ve done). This makes it a much easier resource to deal with than trying to navigate store locator on chains’ websites.

How the food hygiene open data XML files look

However, how do you get all of the food hygiene rating data for councils across the UK into a single spreadsheet (without downloading and copying them all individually)? Turns out Excel, at least the newer versions, is an option for doing this.

For non-R/non-coding options, Open Refine is usually the most straightforward way to get data out of an XML file and into a spreadsheet, as it has the option to create projects using a web URL. You can even add several URLs at the same time.

However, you can’t add 300+ all in one go. But Excel offers a relatively quick and painless way of adding all of the XML files in one go.

This method used Excel for Microsoft 365 – how well it works in earlier versions will depend on whether options for importing XML files. Similar to Open Refine, there are limited options for importing XML files into both Google Sheets and OpenOffice Calc, but these mostly only allow one file at a time to be imported.

Firstly, you need to download (right-click and save link as…) at least one of the XML files of food hygiene data from the FSA website. In Excel, under the Developer tab at the top, select Source.

This will bring up a new pane on the right-hand side labelled XML Source. At the bottom of this is option labelled XML Maps… which gives you the option to create an XML Map by adding an XML schema to a workbook – basically tell Excel what the contents and structure of your XML files looks like.

This is helpful as while the food hygiene ratings data is broadly the same with the same data in columns with the same name – not every file has exactly the same columns (some don’t have a fourth address column, for example), and not every one has the columns in the same order (a problem when trying to combine the data a file at a time). Creating a schema makes it easier to get the same data in the same place over lots of files.

Next you need to click Add, and then find and select the XML file you downloaded earlier. You’ll likely get a pop-up that tells you the specified XML source does not refer to a scheme, so Excel will create a schema based on this source data (you can tick not to see that message).

After adding your file, you can click OK to exit the dialog box. When you do, you’ll see the map for your data appear in the window in the right-hand pane.

From looking at the original XML files (see above), you can see that the information about individual businesses is set out under the EstablishmentDetail part of the file (the header covers how up to date the file is and how many venues are included. In the right-hand pane, right-click on EstablishmentDetail and select Map element.

You’ll then need to select where in your sheet the elements will be mapped, then click OK. The items in the list will then appear as column headers.

Next you need to download all of the XML files you want from the FSA website. The number you need will depend on what you want to do – there’s a version of the page that groups councils by region if you were looking at a regional level. If you need to download all 300+ files, using a mass downloader extension (like this) will probably speed up the process.

Back under the Developer tab, select Import. You can then select all of the XML files you want to import. This may take some time (but hopefully won’t cause Excel to crash), but they should eventually all load into one spreadsheet ready for use.

Want more? The second edition of Getting Started with Data Journalism, a complete beginners guide to finding, cleaning analysing and visualising data in any size newsroom, is coming soon. Sign up to get updates when it launches.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.