
September 10 2010

15:00

Playing with heat-mapping UK data on OpenHeatMap

[image: heat-mapping test map]

Last night OpenHeatMap creator Pete Warden announced that the tool now allowed you to visualise UK data. I’ve been gleefully playing with the heat-mapping tool today and thought I’d share some pointers on visualising data on a map.

This is not a tutorial for OpenHeatMap – Pete’s done a great job of that himself (video below) – but rather an outline of the steps to get some map-ready data in the first place.

1. Find a dataset to visualise.

You firstly need data that fits the geographical areas supported by OpenHeatMap (countries, constituencies, local authorities, districts and counties), and which suits geographical visualisation.

My first stop was the data.gov.uk RSS feed to see what recent datasets had been released, but you could also do advanced searches for “unemployment by county” etc. if you are looking for something specific to visualise.

Helpfully, each dataset description includes a field on “Geographical granularity”. This helps you quickly rule out those that only give country-level data (most of them). I ended up discovering the Scottish Neighbourhood Statistics page which allows you to download very specific data. Once on the site’s very helpful data download page I clicked on “Up to 100 indicators” and selected ‘Local authority’ from the first drop-down menu to ensure it matched up to the councils on OpenHeatMap.

I decided to map smoking data, downloaded it, and unzipped it. I then uploaded it to Google Spreadsheets (thankfully, the dataset was not too large).

2. Clarify the data

It’s at this point that the real work begins. Most datasets use a lot of jargon and codes, and you’ll have to spend time decoding those. In this case the zipped file contained a spreadsheet of “meta-data” that explained some of the codes used in the main spreadsheets. That cleared up part of my dataset, but there was a problem. Instead of local authority names, the spreadsheet (as is often the case) used codes.

I started scouring the web for a spreadsheet that might have those codes and what they meant. Eventually I found a page that listed them. Had the information been in a table, or even a list, I could have used the importHTML function in Google Docs (or Excel) to import it – there's an example formula after this list. But it wasn't, so I had three choices:

  • Convert the codes manually into authority names
  • Keep searching to see if I could find the same list in a better format
  • Write a scraper to convert the list on that page into a dataset
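
(As an aside, had the page been structured as an HTML table or list, the Google Spreadsheets formula would have looked something like the line below – the URL here is just a placeholder, not the real page:

=IMPORTHTML("http://example.gov.uk/la-codes", "list", 1)

The second argument says whether to import a "table" or a "list", and the third says which one on the page to grab.)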

I tweeted an appeal for someone to scrape the page (the data was relatively structured – each item was prefixed by a 3-digit number, for example – which should make it easier) and Tom Smith quickly did just that. However, the scraper had for some reason not managed to capture the Scottish codes (possibly because they used the same numbers as the English ones).
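
For anyone curious what such a scraper might look like, here is a minimal sketch in Python. The URL and the exact markup of the page are assumptions – I am only relying on the detail above that each entry is a 3-digit code followed by an authority name – and Tom's actual scraper may well have worked quite differently:

import csv
import re
import urllib.request

# Hypothetical URL for the page listing local authority codes
URL = "http://example.gov.uk/local-authority-codes.html"

html = urllib.request.urlopen(URL).read().decode("utf-8")

# Strip the HTML tags so we are left with the visible text only
text = re.sub(r"<[^>]+>", " ", html)

# Each entry is assumed to be a 3-digit code followed by an authority name
pairs = re.findall(r"\b(\d{3})\s+([A-Z][A-Za-z&'.\- ]+)", text)

# Write the code/name pairs out as a lookup table
with open("la_codes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["code", "local_authority"])
    writer.writerows(pairs)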

Once those codes and associated local authority names were added to my spreadsheet I could convert my own codes to local authority names by using the VLOOKUP formula.

In Google Spreadsheets this means typing something like this into an empty cell to the right of your data: =VLOOKUP(D2,A:B,2)

You need to adapt the formula as follows:

  • D2 is the cell with the unknown code in it
  • A:B is the range of cells which contain both the code you are looking for and the associated word (A:B means all cells in columns A and B; you might also use something like A2:B340)
  • 2 (the “index”) is which of those columns contains the word you want to ‘pull’. 1 would be column A, because it is the first in your range. 2 is B.
  • As a whole, then, the formula takes the code in D2 and looks down column A (the first column in your range) to find the same code. If it finds it, it will ‘copy across’ whatever is in the B column next to that code.
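
One small extra note on the formula: adding FALSE as a fourth argument – =VLOOKUP(D2,A:B,2,FALSE) – forces an exact match, which is usually what you want when matching codes. And if you would rather do this lookup step outside the spreadsheet altogether, here is a rough Python sketch of the same idea; the file names, and the assumption that the unknown code sits in column D, are just for illustration:

import csv

# Build a code -> local authority name dictionary from the lookup table
with open("la_codes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    codes = {code: name for code, name in reader}

# Append the matching authority name to each row of the main dataset
# (the unknown code is assumed to sit in column D, i.e. index 3)
with open("smoking_data.csv") as src, open("smoking_data_named.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        row.append(codes.get(row[3], "UNKNOWN CODE"))
        writer.writerow(row)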

3. Create a spreadsheet for OpenHeatMap to use

With all my codes converted to local authority names (which OpenHeatMap should recognise), I’ve got all the data I need.

I create a new spreadsheet that only contains the relevant data: local authority, and a value (in this case, the percentage of the population that smokes).

I change the header of the local authority column to uk_council and the header of the other column to value.

OpenHeatMap will use the names in uk_council to decide what areas of a UK map to colour in – and it will use the numbers in value to paint them according to whether they are high, low, or near the middle ground.
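
To make that concrete, the finished sheet is just two columns, something like this (the figures below are placeholders rather than values from the real dataset):

uk_council,value
Aberdeen City,21
Glasgow City,27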

Now I publish the spreadsheet as a webpage by going to Share > Publish as webpage, and copy the URL that is generated.

Then it’s off to openheatmap.com to follow the instructions there.

At this point, I hit a problem – one of the local authorities is not recognised: Eilean Siar. That’s okay – I thought this might happen. On the webpage of local authority codes, that one was followed by a parenthetical note: (formerly Western Isles). This sort of name change or alternative name is worth looking out for when gathering data, as it’s exactly the sort of thing that tends to trip up scripts.

In my spreadsheet, I change Eilean Siar to Western Isles, go to Share > Publish as a webpage again and click ‘Republish now‘. Returning to OpenHeatMap, I try again, and it seems to have solved the problem. The map works.

It’s a wonderfully simple way to achieve something that would have required a lot of programming know-how six months ago. The UK is notoriously problematic in the range of different geographical reference points it uses (health authorities, police forces and other data sources each use still others), so this is a massive step forward in being able to present regional differences in stories.

If you use OpenHeatMap or mapping data yourself, please let me know how you get on and any other tips you pick up.

June 28 2010

09:22

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB], http://coi.gov.uk/websitemetricsdata/websitemetrics2009-10.pdf), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
[table image from the summary report]

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
[table image from the summary report]

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
[table image from the summary report]

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report presents fragments of the released dataset as information, in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

[image: cost column headings from the COI data CSV]

and:

[image: further cost columns from the COI data CSV]

There are also columns headed SEO/SIO and HEO, for example, which may or may not relate to costs. (To see all the headings, see the CSV doc on Google Spreadsheets.)

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, have little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over the original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went into making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X” or “the cost per visit was Y”.)

[Just an aside on summary claims made by, or "discovered" by, the media. Transparency, in terms of being able to justify a calculation from raw data, is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived ("According to the Guardian/Times/Telegraph/FT…", etc.). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

So what would I like to see? Taking the example of the COI websites report, I’d like to see links from each of the tables to a page that “shows the working”.

In Using CSV Docs As a Database, I show how, by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

[image: COI website report costs query results]
select A,B,EL,EN,EP,ER,ET

We can even have a go at summing the costs:

[image: summed COI website costs query results]
select A,B,EL+EN+EP+ER+ET
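
For anyone who wants to pull those query results out programmatically rather than via the browser, a query like the one above can also be run against the Google Visualization API “tq” endpoint for the spreadsheet. The Python sketch below is a rough illustration only: the spreadsheet key is a placeholder, and the exact endpoint URL has changed over the years (newer sheets use a docs.google.com/spreadsheets/d/KEY/gviz/tq path), so treat it as the shape of the request rather than a copy-and-paste recipe:

import urllib.parse
import urllib.request

# Placeholder key for the Google spreadsheet holding the COI data
KEY = "YOUR_SPREADSHEET_KEY"

# Same query as above: label columns plus the summed cost columns
query = "select A,B,EL+EN+EP+ER+ET"

url = ("https://spreadsheets.google.com/tq?tqx=out:csv"
       "&key=" + KEY +
       "&tq=" + urllib.parse.quote(query))

# Print the query result as CSV
print(urllib.request.urlopen(url).read().decode("utf-8"))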

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of the data-derived information we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (a journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports, to find out exactly how they were derived from the released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer; see also this for more hints), please feel free to share the links in the comments below :-)

