
October 18 2010

13:10

Mapping the budget cuts

budget cuts map

Richard Pope and Jordan Hatch have been building a very useful site tracking recent budget cuts, building up to this week’s spending review.

Where Are The Cuts? uses the code behind the open source Ushahidi platform (covered previously on OJB by Claire Wardle) to present a map of the UK representing where cuts are being felt. Users can submit their own reports of cuts, or add details to others via a comments box.

It’s early days in the project – currently many of the cuts are to national organisations with local-level impacts yet to be dug out.

Closely involved is the public expenditure-tracking site Where Does My Money Go? which has compiled a lot of relevant data.

Meanwhile, in Birmingham a couple of my MA Online Journalism students have set up a hyperlocal blog for the 50,000 public sector workers in the region, primarily to report those budget cuts and how they are affecting people. Andy Watt, who – along with Hedy Korbee – is behind the site, has blogged about the preparation for the site’s launch here. It’s a good example of how journalists can react to a major issue with a niche blog. Andy and Hedy will be working with the local newspapers to combine expertise.

October 13 2010

07:47

Stories hidden in the data, stories in the comments

the tax gap

My attention was drawn this week by David Hayward to a visualisation by David McCandless of the tax gap (click on image for larger version). McCandless does some beautiful stuff, but what was particularly interesting in this graphic was how it highlighted areas that are rarely covered by the news agenda.

Tax avoidance and evasion, for example, account for £7.4bn each, while benefit fraud and benefit system error account for £1.5bn and £1.6bn respectively.

Yet while the latter dominate the news agenda, and benefit cheats are subject to regular exposure, tax avoidance and evasion are rare guests on the pages of newspapers.

In other words, the data is identifying a news hole of sorts. There are many reasons for this – Galtung & Ruge would have plenty of ideas, for example – but still: there it is.

The comments

But that’s only part of what makes this so interesting. By publishing the data and having built the healthy community that exists around the data blog, McCandless and The Guardian benefit from some very useful comments (aside from the odd political one) on how to improve both the data and the visualisation.

This is a great example of how the newspaper is stealing an enormous march on its rivals in working beyond its newsroom in collaboration with users – benefiting from what Clay Shirky would call cognitive surplus. Data is not just an informational object, but a social one too.

October 08 2010

08:25

Online journalism student RSS reader starter pack: 50 RSS feeds

Teaching has begun in the new academic year and once again I’m handing out a list of recommended RSS feeds. Last year this came in the form of an OPML file, but this year I’m using Google Reader bundles (instructions on how to create one of your own are here). There are 50 feeds in all – 5 feeds in each of 10 categories. Like any list, this is reliant on my own circles of knowledge and arbitrary in various respects. But it’s a start. I’d welcome other suggestions.

Here is the list with links to the bundles. Each list is in alphabetical order – there is no ranking:

5 of the best: Community

A link to the bundle allowing you to add it to your Google Reader is here.

  1. Blaise Grimes-Viort
  2. Community Building & Community Management
  3. FeverBee
  4. ManagingCommunities.com
  5. Online Community Strategist

5 of the best: Data

This was a particularly difficult list to draw up – I went for a mix of visualisation (FlowingData), statistics (The Numbers Guy), local and national data (CountCulture and Datablog) and practical help on mashups (OUseful). I cheated a little by moving computer assisted reporting blog Slewfootsnoop into the 5 UK feeds and 10,000 Words into Multimedia. Bundle link here.

  1. CountCulture
  2. FlowingData
  3. Guardian Datablog
  4. OUseful.info
  5. WSJ.com: The Numbers Guy

5 of the best: Enterprise

There’s a mix of UK and US blogs covering the economic side of publishing here (if you know of ones with a more international perspective I’d welcome suggestions), and a blog on advertising to round things up. Frequency of updates was another factor in drawing up the list. Bundle link here.

  1. Ad Sales Blog
  2. Media Money
  3. Newsonomics
  4. Newspaper Death Watch
  5. The Information Valet

5 of the best: Industry feeds

Something of a catch-all category. There are a number of BBC blogs I could have included but The Editors is probably the most important. The other 4 feeds cover the 2 most important external drivers of traffic to news sites: search engines and Facebook. Bundle link here.

  1. All Facebook
  2. BBC News – The Editors
  3. Facebook Blog
  4. Search Engine Journal
  5. Search Engine Land

5 of the best: Feeds on law, ethics and regulation

Trying to cover the full range here: Jack of Kent is a leading source of legal discussion and analysis, and Martin Moore covers regulation, ethics and law regularly. Techdirt is quite transparent about where it sits on legal issues, but its passion is also a strength in how well it covers those grey areas of law and the web. Tech and Law is another regular source, while Judith Townend’s new blog on Media Law & Ethics is establishing itself at the heart of UK bloggers’ attempts to understand where they stand legally. Bundle link here.

  1. Jack of Kent
  2. Martin Moore
  3. Media Law & Ethics
  4. Tech and Law
  5. Techdirt

5 of the best: Media feeds

There’s an obvious UK slant to this selection, with Editors Weblog and E-Media Tidbits providing a more global angle. Here’s the bundle link.

  1. Editors Weblog
  2. E-Media Tidbits
  3. Journalism.co.uk
  4. MediaGuardian
  5. paidContent

5 of the best: Feeds about multimedia journalism

Another catch-all category. Andy Dickinson tops my UK feeds, but he’s also a leading expert on online video and related areas. 10,000 Words is strong on data, among other things. And Adam Westbrook is good on enterprise as well as practising video journalism and audio slideshows. Bundle link here.

  1. 10,000 Words
  2. Adam Westbrook
  3. Advancing the Story
  4. Andy Dickinson
  5. News Videographer

5 of the best: Technology feeds

A mix of the mainstream, the new, and the specialist. As the Guardian’s technology coverage is incorporated into its Media feed, I was able to include ReadWriteWeb instead, which often provides a more thoughtful take on technology news. Bundle link here.

  1. Mashable
  2. ReadWriteWeb
  3. TechCrunch
  4. Telegraph Connected
  5. The Register

5 of the best: UK feeds

Alison Gow’s Headlines & Deadlines is the best blog by a regional journalist I can think of (you may differ – let me know). Adam Tinworth’s One Man and his Blog represents the magazines sector, and Martin Belam’s Currybetdotnet casts an eye across a range of areas, including the more technical side of things. Murray Dick (Slewfootsnoop) is an expert on computer assisted reporting and has a broadcasting background. The Online Journalism Blog is there because I expect them to read my blog, of course. Bundle link here.

  1. Currybetdotnet
  2. Headlines and Deadlines
  3. One Man & His Blog
  4. Online Journalism Blog
  5. Slewfootsnoop

5 of the best: US feeds

Jay, Jeff and Mindy are obvious choices for me, after which it is relatively arbitrary, based on the blogs that update the most – particularly open to suggestions here. Bundle link here.

  1. BuzzMachine
  2. Jay Rosen: Public Notebook
  3. OJR
  4. Teaching Online Journalism
  5. Yelvington.com

October 06 2010

10:52

Something I wrote for the Guardian Datablog (and caveats)

I’ve written a piece on ‘How to be a data journalist’ for The Guardian’s Datablog. It seems to have proved very popular, but I thought I should blog briefly about it in case you haven’t seen one of those tweets.

The post is necessarily superficial – it was difficult enough to cover the subject area for a 12,000-word book chapter, so summarising further into a 1,000-word article was almost impossible.

In the process I had to leave a huge amount out, compensating slightly by linking to webpages which expanded further.

Visualising and mashing, as the more advanced parts of data journalism, suffered most, because it seemed to me that locating and understanding data necessarily took precedence. (Heather Billings blogged about a “very British footnote [which was the] only nod to visual presentation”. If you do want to know more about visualisation tips, I wrote 1,000 words on that alone here.)

On Monday I blogged the advice on where aspiring data journalists should start in full. There’s also the selection of passages from the book chapter linked above. And my Delicious bookmarks on data journalism, visualisation and mashups. Each has an RSS feed.

I hope that helps. If you do some data journalism as a result, it would be great if you could let me know about it – and what else you picked up.

October 04 2010

12:24

Open data meets FOI via some nifty automation

OpenlyLocal generated FOI request

Now this is an example of what’s possible with open data and some very clever thinking. Chris Taggart blogs about a new tool on his OpenlyLocal platform that allows you to send a Freedom of Information (FOI) request based on a particular item of spending. “This further lowers the barriers to armchair auditors wanting to understand where the money goes, and the request even includes all the usual ‘boilerplate’ to help avoid specious refusals.”

It takes around a minute to generate an FOI request.

The function is limited to items of spending above £10,000. Cleverly, it’s also all linked so you can see if an FOI request has already been generated and answered.

Although the tool sits on OpenlyLocal, Francis Irving at WhatDoTheyKnow gets enormous credit for making their side of the operation work with it.

Once again you have to ask why a media organisation isn’t creating these sorts of tools to help generate journalism beyond the walls of its newsroom.

07:41

Where should an aspiring data journalist start?

In writing last week’s Guardian Data Blog piece on How to be a data journalist I asked various people involved in data journalism where they would recommend starting. The answers are so useful that I thought I’d publish them in full here.

The Telegraph’s Conrad Quilty-Harper:

Start reading:

http://www.google.com/reader/bundle/user%2F06076274130681848419%2Fbundle%2Fdatavizfeeds

Keep adding to your knowledge and follow other data journalists/people who work with data on Twitter.

Look for sources of data:

The ONS stats release calendar is a good start (http://www.statistics.gov.uk/hub/release-calendar/index.html). Look at the Government data stores (Data.gov, Data.gov.uk, Data.london.gov.uk etc).

Check out What do they know, Freebase, Wikileaks, Manyeyes, Google Fusion charts.

Find out where hidden data is and try and get hold of it: private companies looking for publicity, under appreciated research departments, public bodies that release data but not in a granular form (e.g. Met Office).

Test out cleaning/visualisation tools:

You want to be able to collect data, clean it, visualise it and map it.

Obviously you need to know basic Excel skills (pivot tables are how journalists efficiently get headline numbers from big spreadsheets).

For publishing just use Google Spreadsheets graphs, or ManyEyes or Timetric. Google MyMaps coupled with http://batchgeo.com is a great beginner mapping combo.

Further on from that you want to try out Google Spreadsheets importURL service, Yahoo Pipes for cleaning data, Freebase Gridworks and Dabble DB.

More advanced stuff you want to figure out query language and be able to work with relational databases, Google BigQuery, Google Visualisation API (http://code.google.com/apis/charttools/), Google code playgrounds (http://code.google.com/apis/ajax/playground/?type=visualization#org_chart) and other Javascript tools. The advanced mapping equivalents are ArcGIS or GeoConcept, allowing you to query geographical data and find stories.

You could also learn some Ruby for building your own scrapers, or Python for ScraperWiki.

Get inspired:

Get the data behind some big data stories you admire, try and find a story, visualise it and blog about it. You’ll find that the whole process starts with the data, and your interpretation of it. That needs to be newsworthy/valuable.

Look to the past!

Edward Tufte’s work is very inspiring: http://www.edwardtufte.com/tufte/ His favourite data visualisation is from 1869! Or what about John Snow’s Cholera map? http://www.york.ac.uk/depts/maths/histstat/snow_map.htm

And for good luck here’s an assorted list of visualisation tutorials.
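To illustrate Conrad’s point about pivot tables in code rather than in Excel, here is a minimal sketch using pandas – the figures and column names are invented purely for illustration:

import pandas as pd

# Invented claims data: the kind of long, flat table a pivot table summarises.
claims = pd.DataFrame({
    "department": ["Health", "Health", "Transport", "Transport"],
    "category":   ["Travel", "Office", "Travel", "Office"],
    "amount":     [1200.0, 300.0, 800.0, 450.0],
})

# Headline numbers: total amount per department, broken down by category.
summary = claims.pivot_table(index="department", columns="category",
                             values="amount", aggfunc="sum")
print(summary)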

The Times’ Jonathan Richards

I’d say a couple of blogs.

Others that spring to mind are:

If people want more specific advice, tell them to come to the next London Hack/Hackers and track me down!

The Guardian’s Charles Arthur:

Obvious thing: find a story that will be best told through numbers. (I’m thinking about quizzing my local council about the effects of stopping free swimming for children. Obvious way forward: get numbers for number of children swimming before, during and after free swimming offer.)

If someone already has the skills for data journalism (which I’d put at (1) understanding statistics and relevance (2) understanding how to manipulate data (3) understanding how to make the data visual) the key, I’d say, is always being able to spot a story that can be told through data – and only makes sense that way, and where being able to manipulate the data is key to extracting the story. It’s like interviewing the data. Good interviewers know how to get what they want out from the conversation. Ditto good data journalists and their data.

The New York Times’ Aron Pilhofer:

I would start small, and start with something you already know and already do. And always, always, always remember that the goal here is journalism. There is a tendency to focus too much on the skills for the sake of skills, and not enough on how those skills help enable you to do better journalism. Be pragmatic about it, and resist the tendency to think you need to know everything about the techy stuff before you do anything — nothing could be further from the truth.

Less abstractly, I would start out learning some basic computer-assisted reporting skills and then moving from there as your interests/needs dictate. A lot of people see the programmer/journalism thing as distinct from computer-assisted reporting, but I don’t. I see it as a continuum. I see CAR as a “gateway drug” of sorts: Once you start working with small data sets using tools like Excel, Access, MySQL, etc., you’ll eventually hit limits of what you can do with macros and SQL.

Soon enough, you’ll want to be able to script certain things. You’ll want to get data from the web. You’ll want to do things you can only do using some kind of scripting language, and so it begins.

But again, the place to start isn’t thinking about all these technologies. The place to start is thinking about how these technologies can enable you to tell stories you would never otherwise be able to tell. And you should start small. Look for little things to start, and go from there.

September 23 2010

07:20

“The mass market was a hack”: Data and the future of journalism

The following is an unedited version of an article written for the International Press Institute report ‘Brave News Worlds’ (PDF).

For the past two centuries journalists have dealt in the currency of information: we transmuted base metals into narrative gold. But information is changing.

At first, the base metals were eye witness accounts, and interviews. Later we learned to melt down official reports, research papers, and balance sheets. And most recently our alloys have been diluted by statements and press releases.

But now journalists are having to get to grips with a new type of information: data. And this is a very rich seam indeed.

Data: what, how and why

Data is a broad term so I should define it here: I am not talking about statistics or numbers in general, because those are nothing new to journalists. When I talk about data I mean information that can be processed by computers.

This is a crucial distinction: it is one thing for a journalist to look at a balance sheet on paper; it is quite another to be able to dig through those figures on a spreadsheet, or to write a programming script to analyse that data, and match it to other sources of information. We can also more easily analyse new types of data, such as live data, large amounts of text, user behaviour patterns, and network connections.
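To make that distinction concrete, here is a minimal sketch of the kind of interrogation a short script makes possible once the figures are machine-readable – the file name and column headings are assumptions, not a real dataset:

import csv

# A hypothetical spending file with "supplier" and "amount" columns.
with open("council_spending.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Total spend per supplier - the sort of question that is tedious on paper
# but takes a few lines once the figures are machine-readable.
totals = {}
for row in rows:
    supplier = row["supplier"].strip()
    totals[supplier] = totals.get(supplier, 0) + float(row["amount"])

# The ten biggest recipients, largest first.
for supplier, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{supplier}: {total:,.2f}")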

And that, for me, is hugely important. Indeed, it is potentially transformational. Adding computer processing power to our journalistic arsenal allows us to do more, faster, more accurately, and with others. All of which opens up new opportunities – and new dangers. Things are going to change.

We’ve had over 40 years to see this coming. The growth of the spreadsheet and the database from the 1960s onwards kicked things off by making it much easier for organisations – including governments – to digitise information from what they spent our money on to how many people were being treated for which diseases, and where.

In the 1990s the invention of the world wide web accelerated the data at journalists’ disposal by providing a platform for those spreadsheets and databases to be published and accessed by both humans and computer programs – and a network to distribute it.

And now two cultural movements have combined to add a political dimension to the spread of data: the open data movement, and the linked data movement. Journalists should be familiar with these movements: the arguments that they have developed in holding power to account are a lesson in dealing with entrenched interests, while their experiments with the possibilities of data journalism show the way forward.

While the open data movement campaigns for important information – such as government spending, scientific information and maps – to be made publicly available for the benefit of society both democratically and economically, the linked data movement (championed by the inventor of the web, Sir Tim Berners-Lee) campaigns for that data to be made available in such a way that it can be linked to other sets of data so that, for instance, a computer can see that the director of a company named in a particular government contract is the same person who was paid as a consultant on a related government policy document. Advocates argue that this will also result in economic and social benefits.
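Proper linked data uses shared identifiers rather than names, but even a crude sketch shows how mechanical that kind of cross-matching becomes once the information can be processed – the files and column names here are invented:

import csv

def names_from(path, column):
    # Return a set of lower-cased names from one column of a CSV file.
    with open(path, newline="") as f:
        return {row[column].strip().lower() for row in csv.DictReader(f)}

# Hypothetical files: one lists directors of companies awarded contracts,
# the other lists paid consultants on related policy documents.
directors = names_from("contract_directors.csv", "director_name")
consultants = names_from("policy_consultants.csv", "consultant_name")

# People who appear in both datasets - leads to check by hand, not conclusions.
for name in sorted(directors & consultants):
    print(name)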

Concrete results of both movements can be seen in the US and UK – most visibly with the launch of government data repositories Data.gov and Data.gov.uk in 2009 and 2010 respectively – but also less publicised experiments such as Where Does My Money Go? – which uses data to show how public expenditure is distributed – and Mapumental – which combines travel data, property prices and public ratings of ‘scenicness’ to help you see at a glance which areas of a city might be the best place to live based on your requirements.

But there are dozens if not hundreds of similar examples in industries from health and science to culture and sport. We are experiencing an unprecedented release of data – some have named it ‘Big Data’ – and yet for the most part, media organisations have been slow to react.

That is about to change.

The data journalist

Over the last year an increasing number of news organisations have started to wake from their story-centric production lines and see the value of data. In the UK the MPs’ expenses story was seminal: when a newspaper dictates the news agenda for six weeks, the rest of Fleet Street pays attention – and at the core of this story was a million pieces of data on a disc. Since then every serious news organisation has expanded its data operations.

In the US the journalist-programmer Adrian Holovaty has pioneered the form with the data mashup ChicagoCrime.org and its open source offspring Everyblock, while Aron Pilhofer has innovated at the interactive unit at The New York Times, and new entrants from Talking Points Memo to ProPublica have used data as a launchpad for interrogating the workings of government.

To those involved, it feels like heady days. In reality, it’s very early days indeed. Data journalism takes in a huge range of disciplines, from Computer Assisted Reporting (CAR) and programming, to visualisation and statistics. If you are a journalist with a strength in one of those areas, you are currently exceptional. This cannot last for long: the industry will have to skill up, or it will have nothing left to sell.

Because while news organisations for years made a business out of being a middleman processing content between commerce and consumers, and government and citizens, the internet has made that business model obsolete. It is not enough any more for a journalist to simply be good at writing – or rewriting. There are a million others out there who can write better – large numbers of them working in PR, marketing, or government. While we will always need professional storytellers, many journalists are simply factory line workers.

So on a commercial level if nothing else, publishing will need to establish where the value lies in this new environment – and the new efficiencies to make journalism viable.

Data journalism is one of those areas. With a surfeit of public data being made available, there is a rich supply of raw material. The scarcity lies in the skills to locate and make sense of that – whether the programming skills to scrape it and compare it with other sources in the first place, the design flair to visualise it, or the statistical understanding to unpick it.

“The mass market was a hack”: opportunities for the new economy

The technological opportunity is massive. As processing power continues to grow, the ability to interrogate, combine and present data continues to increase. The development of augmented reality provides a particularly attractive publishing opportunity: imagine being able to see local data-based stories through your mobile phone, or indeed add data to the picture through your own activity. The experiments of the past five years will come to seem crude in comparison.

And then there is the commercial opportunity. Publishing is for most publishers, after all, not about selling content but about selling advertising. And here also data has taken on increasing importance. The mass market was a hack. As the saying goes: “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.”

But Google, Facebook and others have used the measurability of the web to reduce the margin of error, and publishers will have to follow suit. It makes sense to put data at the centre of that – while you allow users to drill into the data you have gathered around automotive safety, the offering to advertisers is likely to say “We can display different adverts based on what information the user is interested in”, or “We can point the user to their local dealership based on their location”.

A collaborative future

I’m skeptical of the ability of established publishers to adapt to such a future but, whether they do or not, others will. And the backgrounds of journalists will have to change. The profession has a history of arts graduates who are highly literate but not typically numerate. That has already been the source of ongoing embarrassment for the profession as expert bloggers have highlighted basic errors in the way journalists cover science, health and finance – and it cannot continue.

We will need more journalists who can write a killer Freedom of Information request; more researchers with a knowledge of the hidden corners of the web where databases – the ‘invisible web’ – reside. We will need programmer-journalists who can write a screen scraper to acquire, sort, filter and store that information, and combine or compare it with other sources. We will need designers who can visualise that data in the clearest way possible – not just for editorial reasons but distribution too: infographics are an increasingly significant source of news site traffic.
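A screen scraper in this sense can be very small. Here is a sketch of the acquire-filter-store loop; the URL, the page structure and the "name, amount" table layout are all assumptions for illustration:

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical page containing an HTML table of figures we want to keep.
URL = "http://example.gov.uk/spending/2010"

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 2 and cells[1].replace(",", "").isdigit():
        # Keep only rows that look like "name, amount" pairs.
        rows.append((cells[0], int(cells[1].replace(",", ""))))

# Store the filtered rows, sorted by amount, for later analysis.
with open("scraped_spending.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "amount"])
    writer.writerows(sorted(rows, key=lambda r: r[1], reverse=True))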

There is a danger of ‘data churnalism’ – taking public statistics and visualising them in a spectacular way that lacks insight or context. Editors will need the statistical literacy to guard against this, or they will be found out.

And it is not just in editorial that innovation will be needed. Advertising sales will need to experience the same revolution that journalists have experienced, learning the language of web metrics, behavioural advertising and selling the benefits to advertisers.

And as publishers of data too, executives will need to adopt the philosophies of the open data and linked data movements to take advantage of the efficiencies that they provide. The New York Times and The Guardian have both published APIs that allow others to build web services with their content. In return they get access to otherwise unaffordable technical, mathematical and design expertise, and benefit from new products and new audiences, as (in the Guardian’s case) advertising is bundled in with the service. As these benefits become more widely recognised, other publishers will follow.

I have a hope that this will lead to a more collaborative form of journalism. The biggest resource a publisher has is its audience. Until now publishers have simply packaged up that resource for advertisers. But now that the audience is able to access the same information and tools as journalists, to interact with publishers and with each other, they are valuable in different ways.

At the same time the value of the newsroom has diminished: its size has shrunk, its competitive advantage reduced; and no single journalist has the depth and breadth of skillset needed across statistics, CAR, programming and design that data journalism requires. A new medium – and a new market – demands new rules. The more networked and iterative form of journalism that we’ve already seen emerge online is likely to become even more conventional as publishers move from a model that sees the story as the unit of production, to a model that starts with data.

September 22 2010

10:40

Why did you get into data journalism?

In researching my book chapter I asked a group of journalists who worked with data what led them to do so. Here are their answers:

Jonathan Richards, The Times:

The flood of information online presents an amazing opportunity for journalists, but also a challenge: how on earth does one keep up with it, and make sense of it? You could go about it in the traditional way, fossicking in individual sites, but much of the journalistic value in this outpouring, it seems, comes in aggregation: in processing large amounts of data, distilling them, and exploring them for patterns. To do that – unless you’re superhuman, or have a small army of volunteers – you need the help of a computer.

I ‘got into’ data journalism because I find this mix exciting. It appeals to the traditional journalistic instinct, but also calls for a new skill which, once harnessed, dramatically expands the realm of ‘stories I could possibly investigate…’

Mary Hamilton, Eastern Daily Press:

I started coding out of necessity, not out of desire. In my day-to-day work for local newspapers I came across stories that couldn’t be told any other way. Excel spreadsheets full of data that I knew was relevant to readers if I could break it down or aggregate it up. Lists of locations that meant nothing on the page without a map. Timelines of events and stacks of documents. The logical response for me was to try to develop the skills to parse data to get to the stories it can tell, and to present it in interactive, interesting and – crucially – relevant ways. I see data journalism as an important skill in my storytelling toolkit – not the only option, but an increasingly important way to open up information to readers and users.

Charles Arthur, The Guardian:

When I was really young, I read a book about computers which made the point – rather effectively – that if you found yourself doing the same process again and again, you should hand it over to a computer. That became a rule for me: never do some task more than once if you can possibly get a computer to do it.

Obviously, to implement that you have to do a bit of programming. It turns out all programming languages are much the same – they vary in their grammar, but they’re all about making the computer do stuff. And it’s often the same stuff (at least in my ambit) – fetch a web page, mash up two sets of data, filter out some rubbish and find the information you want.

I got into data journalism because I also did statistics – and that taught me that people are notoriously bad at understanding data. Visualisation and simplification and exposition are key to helping people understand.

So data journalism is a compound of all those things: determination to make the computer do the slog, confidence that I can program it to, and the desire to tell the story that the data is holding and hiding.

I don’t think there was any particular point where I suddenly said “ooh, this is data journalism” – it’s more that the process of thinking “oh, big dataset, stuff it into an ad-hoc MySQL database, left join against that other database I’ve got, see what comes out” goes from being a huge experiment to your natural reaction.

It’s not just data though – I use programming to slough off the repetitive tasks of the day, such as collecting links, or resizing pictures, or getting the picture URL and photographer and licence from a Flickr page and stuffing it into a blogpost.

Data journalism is actually only half the story. The other half is that journalists should be actively unwilling to do repetitive tasks if it’s machine-like (say, removing line breaks from a piece of copy, or changing a link format).

Time spent doing those sorts of tasks is time lost to journalism and given up to being a machine. Let the damn machines do it. Humans have better things to do.
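The "stuff it into an ad-hoc database and left join it" reflex Charles describes needs surprisingly little code. Here is a sketch using SQLite rather than MySQL so it is self-contained; the tables, columns and figures are invented:

import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway, ad-hoc database
conn.executescript("""
    CREATE TABLE payments (council TEXT, amount REAL);
    CREATE TABLE population (council TEXT, residents INTEGER);
""")
conn.executemany("INSERT INTO payments VALUES (?, ?)",
                 [("Anytown", 120000.0), ("Otherville", 95000.0)])
conn.executemany("INSERT INTO population VALUES (?, ?)",
                 [("Anytown", 40000)])

# Left join so councils missing from the second table still appear.
query = """
    SELECT p.council, p.amount, pop.residents,
           p.amount / pop.residents AS per_head
    FROM payments p
    LEFT JOIN population pop ON pop.council = p.council
"""
for row in conn.execute(query):
    print(row)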

Stijn Debrouwere, Belgian information designer:

I used to love reading the daily newspaper, but lately I can’t seem to be bothered anymore. I’m part of that generation of people news execs fear so much: those that simply don’t care about what newspapers and news magazines have to offer. I enjoy being an information designer because it gives me a chance to help reinvent the way we engage and inform communities through news and analysis, both offline and online. Technology doesn’t solve everything, but it sure can help. My professional goal is simply this: make myself love news and newspapers again, and thereby hopefully get others to love it too.

September 20 2010

10:50

The BBC and missed data journalism opportunities

Bar chart: UN progress on eradication of world hunger

I’ve tweeted a couple of times recently about frustrations with BBC stories that are based on data but treat it poorly. As any journalist knows, two occasions of anything in close proximity warrant an overreaction about a “worrying trend”. So here it is.

“One in four council homes fails ‘Decent Homes Standard’”

This is a great piece of newsgathering, but a frustrating piece of online journalism. “Almost 100,000 local authority dwellings have not reached the government’s Decent Homes Standard,” it explained. But according to what? Who? “Government figures seen by BBC London”. Ah, right. Any more detail on that? No.

The article is scattered with random statistics from these figures: “In Havering, east London, 56% of properties do not reach Decent Homes Standard – the highest figure for any local authority in the UK … In Tower Hamlets the figure is 55%.”

It’s a great story – if you live in those two local authorities. But it’s a classic example of narrowing a story to fit the space available. This story-centric approach serves readers in those locations, and readers who may be titillated by the fact that someone must always finish bottom in a chart – but the majority of readers will not live in those areas, and will want to know what the figures are for their own area. The article does nothing to help them do this. There are only 3 links, and none of them are deep links: they go to the homepages for Havering Council, Tower Hamlets Council, and the Department of Communities and Local Government.

In the world of print and broadcast, narrowing a story to fit space was a regrettable limitation of the medium; in the online world, linking to your sources is a fundamental quality of the medium. Not doing so looks either ignorant or arrogant.
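Had the underlying figures been published as data, serving every reader the number for their own area would be a small job. Here is a minimal sketch, assuming a CSV with one row per local authority (the file and column names are invented):

import csv

def failure_rate(authority_name, path="decent_homes.csv"):
    # Look up the percentage of homes failing the standard for one authority.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["authority"].strip().lower() == authority_name.strip().lower():
                return float(row["percent_failing"])
    return None

# e.g. look up a single authority by name
print(failure_rate("Havering"))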

“Uneven progress of UN Millennium Development Goals”

An impressive piece of data journalism that deserves credit, this looks at the UN’s goals and how close they are to being achieved, based on a raft of stats, which are presented in bar chart after bar chart (see image above). Each chart gives the source of the data, which is good to see. However, that source is simply given as “UN”: there is no link either on the charts or in the article (there are 2 links at the end of the piece – one to the UN Development Programme and the other to the official UN Millennium Development Goals website).

This lack of a link to the specific source of the data raises a number of questions: did the journalist or journalists (in both of these stories there is no byline) find the data themselves, or was it simply presented to them? What is it based on? What was the methodology?

The real missed opportunity here, however, is around visualisation. The relentless onslaught of bar charts makes this feel like a UN report itself, and leaves a dry subject still looking dry. This needed more thought.

Off the top of my head, one option might have been an overarching visualisation of how funding shortfalls overall differ between different parts of the world (allowing you to see that, for example, South America is coming off worst). This ‘big picture’ would then draw in people to look at the detail behind it (with an opportunity for interactivity).

Had they published a link to the data someone else might have done this – and other visualisations – for them. I would have liked to try it myself, in fact.

Compare this article, for example, with the Guardian Datablog’s treatment of the coalition agreement: a harder set of goals to measure, and they’ve had to compile the data themselves. But they’re transparent about the methodology (it’s subjective) and the data is there in full for others to play with.

It’s another dry subject matter, but The Guardian have made it a social object.

No excuses

The BBC is not a print outlet, so it does not have the excuse of these stories being written for print (although I will assume they were researched with broadcast as the primary outlet in mind).

It should also, in theory, be well resourced for data journalism. Martin Rosenbaum, for example, is a pioneer in the field, and the team behind the BBC website’s Special Reports section does some world class work. The corporation was one of the first in the world to experiment with open innovation with Backstage, and runs a DataArt blog too. But the core newsgathering operation is missing some basic opportunities for good data journalism practice.

In fact, it’s missing just one basic opportunity: link to your data. It’s as simple as that.

On a related note, the BBC Trust wants your opinions on science reporting. On this subject, David Colquhoun raises many of the same issues: absence of links to sources, and anonymity of reporters. This is clearly more a cultural issue than a technical one.

Of all the UK’s news organisations, the BBC should be at the forefront of transparency and openness in journalism online. Thinking politically, allowing users to access the data they have spent public money to acquire also strengthens their ideological hand in the Big Society bunfight.

06:22

When crowdsourcing is your only option

Crowdsourced map - the price of weed

PriceOfWeed.com is a great example of when you need to turn to crowdsourcing to obtain data for your journalism. As Paul Kedrosky writes, it’s “Not often that you get to combine economics, illicit substances, map mashups and crowd-sourcing in one post like this.” The resulting picture is surprisingly clear.

And news organisations could learn a lot from the way this has been executed. Although the default map view is of the US, the site detects your location and offers you prices nearest to you. It’s searchable and browsable. Sadly, the raw data isn’t available – although it would be relatively straightforward to scrape it.

As the site expands globally it is also adding extra data on the social context – tolerance and law enforcement.

September 17 2010

16:18

A First – Not Very Successful – Look at Using Ordnance Survey OpenLayers…

What’s the easiest way of creating a thematic map, that shows regions coloured according to some sort of measure?

Yesterday, I saw a tweet go by from @datastore about Carbon emissions in every local authority in the UK, detailing those emissions for a list of local authorities (whatever they are… I’ll come on to that in a moment…)

Carbon emissions data table

The dataset seemed like a good opportunity to try out the Ordnance Survey’s OpenLayers API, which I’d noticed allows you to make use of OS boundary data and maps in order to create thematic maps for UK data:

OS thematic map demo

So – what’s involved? The first thing was to try and get codes for the authority areas. The ONS make various codes available (download here) and the OpenSpace website also makes available a list of boundary codes that it can render (download here), so I had a poke through the various code files and realised that the Guardian emissions data seemed to identify regions that were coded in different ways? So I stalled there and looked at another part of the jigsaw…

…specifically, OpenLayers. I tried the demo – Creating thematic boundaries – got it to work for the sample data, then tried to put in some other administrative codes to see if I could display boundaries for other area types… hmmm…. No joy:-) A bit of digging identified this bit of code:

// Create a boundary layer: area_code selects which type of boundary to draw
boundaryLayer = new OpenSpace.Layer.Boundary("Boundaries", {
strategies: [new OpenSpace.Strategy.BBOX()], // only fetch boundaries within the current bounding box
area_code: ["EUR"], // "EUR" = European Region; other area codes are listed below
styleMap: styleMap });

which appears to identify the type of area codes/boundary layer required, in this case “EUR”. So two questions came to mind:

1) does this mean we can’t plot layers that have mixed region types? For example, the emissions data seemed to list names from different authority/administrative area types?
2) what layer types are available?

A bit of digging on the OpenLayers site turned up something relevant on the Technical FAQ page:

OS OpenSpace boundary DESCRIPTION, (AREA_CODE) and feature count (number of boundary areas of this type)

County, (CTY) 27
County Electoral Division, (CED) 1739
District, (DIS) 201
District Ward, (DIW) 4585
European Region, (EUR) 11
Greater London Authority, (GLA) 1
Greater London Authority Assembly Constituency, (LAC) 14
London Borough, (LBO) 33
London Borough Ward, (LBW) 649
Metropolitan District, (MTD) 36
Metropolitan District Ward, (MTW) 815
Scottish Parliament Electoral Region, (SPE) 8
Scottish Parliament Constituency, (SPC) 73
Unitary Authority, (UTA) 110
Unitary Authority Electoral Division, (UTE) 1334
Unitary Authority Ward, (UTW) 1464
Welsh Assembly Electoral Region, (WAE) 5
Welsh Assembly Constituency, (WAC) 40
Westminster Constituency, (WMC) 632

so presumably all those code types can be used as area_code arguments in place of “EUR”?

Back to one of the other pieces of the jigsaw: the OpenLayers API is called using official area codes, but the emissions data just provides the names of areas. So somehow I need to map from the area names to an area code. This requires: a) some sort of lookup table to map from name to code; b) a way of doing that.

Normally, I’d be tempted to use a Google Fusion table to try to join the emissions table with the list of boundary area names/codes supported by OpenSpace, but then I recalled a post by Paul Bradshaw on using the Google spreadsheets VLOOKUP formula (to create a thematic map, as it happens: Playing with heat-mapping UK data on OpenHeatMap), so thought I’d give that a go… no joy:-( For some reason, the VLOOKUP just kept giving rubbish. Maybe it was happy with really crappy best matches, even if I tried to force exact matches. It almost felt like the formula was working on a differently ordered column to the one it should have been; I have no idea. So I gave up trying to make sense of it (something to return to another day maybe; I was in the wrong mood for trying to make sense of it, and now I am just downright suspicious of the VLOOKUP function!)…

…and instead thought I’d give the openheatmap application Paul had mentioned a go… After a few false starts (I thought I’d be able to just throw a spreadsheet at it and then specify the data columns I wanted to bind to the visualisation (cf. Semantic reports), but it turns out you have to use particular column names – value for the data value, and one of the specified locator labels) I managed to upload some of the data as uk_council data (quite a lot of it was thrown away) and get some sort of map out:

openheatmap demo

You’ll notice there are a few blank areas where council names couldn’t be identified.

So what do we learn? Firstly, the first time you try out a new recipe, it rarely, if ever, “just works”. When you know what you’re doing, and “all you have to do is…”, all is a little word. When you don’t know what you’re doing, all is a realm of infinite possibilities of things to try that may or may not work…

We also learn that I’m not really that much closer to getting my thematic map out… but I do have a clearer list of things I need to learn more about. Firstly, a few hello world examples using the various different OpenLayer layers. Secondly, a better understanding of the differences between the various authority types, and what sorts of mapping there might be between them. Thirdly, I need to find a more reliable way of reconciling data from two tables and in particular looking up area codes from area names (in two ways: code and area type from area name; code from area name and area type). VLOOKUP didn’t work for me this time, so I need to find out if that was my problem, or an “issue”.
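One route I should probably try next is doing that reconciliation in code rather than in a spreadsheet. Here is a minimal sketch using pandas, assuming two CSV files – one with the emissions figures keyed by area name, one mapping area names to boundary codes; the file and column names are made up, not the real Guardian/OS headers:

import pandas as pd

# Hypothetical inputs: emissions keyed by area name, and a name-to-code lookup.
emissions = pd.read_csv("emissions.csv")        # columns: area_name, emissions
codes = pd.read_csv("boundary_codes.csv")       # columns: area_name, area_code, area_type

# Normalise the join key a little before matching.
for df in (emissions, codes):
    df["area_name"] = df["area_name"].str.strip().str.lower()

merged = emissions.merge(codes, on="area_name", how="left")

# Rows with no code are the ones that need manual attention.
print(merged[merged["area_code"].isna()][["area_name"]])
merged.to_csv("emissions_with_codes.csv", index=False)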

Something else that comes to mind is this: the datablog asks: “Can you do something with this data? Please post your visualisations and mash-ups on our Flickr group”. If the data had included authority codes, I would have been more likely to persist in trying to get them mapped using OpenLayers. But my lack of understanding about how to get from names to codes meant I stumbled at this hurdle. There was too much friction in going from area name to OpenLayer boundary code. (I have no idea, for example, whether the area names relate to one administrative class, or several).

Although I don’t think the following is the case, I do think it is possible to imagine a scenario where the Guardian do have a table that includes the administrative codes as well as names for this data, or an environment/application/tool for rapidly and reliably generating such a table, and that they know this makes the data more valuable because it means they can easily map it, but others can’t. The lack of codes means that work needs to be done in order to create a compelling map from the data that may attract web traffic. If it was that easy to create the map, a “competitor” might make the map and get the traffic for no real effort.

The idea I’m fumbling around here is that there is a spectrum of stuff around a data set that makes it more or less easy to create visualisations. In the current example, we have area name, area code, map. Given an area code, it’s presumably (?) easy enough to map using e.g. OpenLayers because the codes are unambiguous. Given an area name, if we can reliably look up the area code, it’s presumably easy to generate the map from the name via the code. Now, if we want to give the appearance of publishing the data, but make it hard for people to use, we can make it hard for them to map from names to codes, either by messing around with the names, or using a mix of names that map on to area codes of different types. So we can taint the data to make it hard for folk to use easily whilst still being seen to publish the data.

Now I’m not saying the Guardian do this, but a couple of things follow: firstly, obfuscating or tainting data can help you prevent casual use of it by others (it can also help you track the data; e.g. mapping agencies that put false artefacts in their maps to help reveal plagiarism); secondly, if you are casual with the way you publish data, you can make it hard for people to make effective use of that data. For a long time, I used to hassle folk into publishing RSS feeds. Some of them did… or at least thought they did. For as soon as I tried to use their feeds, they turned out to be broken. No-one had ever tried to consume them. Same with data. If you publish your data, try to do something with it. So for example, the emissions data is illustrated with a Many Eyes visualisation of it; it works as data in at least that sense. From the place names, it would be easy enough to vaguely place a marker on a map showing a data value roughly in the area of each council. But for identifying exact administrative areas – the data is lacking.

It might seem as if I’m angling against the current advice to councils and government departments to just “get their data out there” even if it is a bit scrappy, but I’m not… What I am saying (I think) is that folk should just try to get their data out, but also:

- have a go at trying to use it for something themselves, or at least just demo a way of using it. This can have a payoff in at least three ways I can think of: a) it may help you spot a problem with the way you published the data that you can easily fix, or at least post a caveat about; b) it helps you develop your own data handling skills; c) you might find that you can encourage reuse of the data you have just published in your own institution…

- be open to folk coming to you with suggestions for ways in which you might be able to make the data more valuable/easier to use for them for little effort on your own part, and that in turn may help you publish future data releases in an ever more useful way.

Can you see where this is going? Towards Linked Data… ;-)

PS just by the by, a related post (that just happens to mention OUseful.info:-) on the Telegraph blogs about Open data ‘rights’ require responsibility from the Government led me to a quick chat with Telegraph data hack @coneee and the realisation that the Telegraph too are starting to explore the release of data via Google spreadsheets. So for example, a post on Councils spending millions on website redesigns as job cuts loom also links to the source data here: Data: Council spending on websites.


June 28 2010

09:22

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB], http://coi.gov.uk/websitemetricsdata/websitemetrics2009-10.pdf), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
COI web report 2009-10 table 1

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
COI web report 2009-10 table 2

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings

and:

COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)
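“Showing the working” for a figure like cost per visit need not be onerous. Here is a sketch of the shape of the calculation, with invented numbers standing in for the released CSV columns:

# Invented figures purely to show the shape of the calculation.
total_costs = {"example-site.gov.uk": 1000000.0}   # reported annual cost (pounds)
visits = {"example-site.gov.uk": 250000}           # reported visits over the same period

for site, cost in total_costs.items():
    per_visit = cost / visits[site]
    print(f"{site}: {per_visit:.2f} per visit")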

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived (“According to the Guardian/Times/Telegraph/FT”, etc.). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

So what would I like to see? Taking the example of the COI websites report, what I’d like to be able to see would be links from each of the tables to a page that “shows the working”.

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET
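The same sort of query can also be run from code against a published Google spreadsheet via the Visualization API’s HTTP endpoint, which can return plain CSV. A sketch, with a placeholder spreadsheet key – the endpoint details have shifted over time, so treat this as indicative rather than definitive:

import urllib.parse
import urllib.request

SPREADSHEET_KEY = "YOUR_SPREADSHEET_KEY"   # placeholder, not the real COI sheet
query = "select A,B,EL+EN+EP+ER+ET"        # same shape as the query above

url = (
    f"https://docs.google.com/spreadsheets/d/{SPREADSHEET_KEY}/gviz/tq"
    f"?tqx=out:csv&tq={urllib.parse.quote(query)}"
)

# The response is plain CSV, so it can be saved or piped into any other tool.
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))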

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries, in a bookmarkable form (e.g. using the COI website data explorer (see also this for more hints), please feel free to share the links in the comments below :-)


June 25 2010

12:51

Guardian Datastore MPs’ Expenses Spreadsheet as a Database

Continuing my exploration of what is and isn’t acceptable around the edges of doing stuff with other people’s data(?!), the Guardian datastore have just published a Google spreadsheet containing partial details of MPs’ expenses data over the period July-December 2009 (MPs’ expenses: every claim from July to December 2009):

thanks to the work of Guardian developer Daniel Vydra and his team, we’ve managed to scrape the entire lot out of the Commons website for you as a downloadable spreadsheet. You cannot get this anywhere else.

In sharing the data, the Guardian folks have opted to share the spreadsheet via a link that includes an authorisation token. Which means that if you try to view the spreadsheet just using the spreadsheet key, you won’t be allowed to see it; (you also need to be logged in to a Google account to view the data, both as a spreadsheet, and in order to interrogate it via the visualisation API). Which is to say, the Guardian datastore folks are taking what steps they can to make the data public, whilst retaining some control over it (because they have invested resource in collecting the data in the form they’re re-presenting it, and reasonably want to make a return from it…)

But in sharing the link that includes the token on a public website, we can see the key – and hence use it to access the data in the spreadsheet, and do more with it… which may be seen as providing a value-add service over the data, or unreasonably freeloading off the back of the Guardian’s data scraping efforts…

So, I just pasted the spreadsheet key and authorisation token into the cut-down Guardian datastore explorer script I used in Using CSV Docs As a Database to generate an explorer for the expenses data.

So, for example, we can run a report to group expenses by category and MP:

MP expenses explorer

Or how about claims over £5,000 (also viewing the information as an HTML table, for example).

Remember, on the datastore explorer page, you can click on column headings to order the data according to that column.

Here’s another example – selecting A, sum(E) where E>0, grouping by A, ordering by sum(E) ascending, and viewing as a column chart:

Datastore exploration
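For anyone who would rather work on a local copy, the same aggregation over a CSV download of the sheet might look something like this pandas sketch (the column names are assumed – column A as the MP, column E as the amount):

import pandas as pd

# Hypothetical export of the expenses sheet; column A = MP, column E = amount.
expenses = pd.read_csv("mps_expenses.csv")

positive = expenses[expenses["amount"] > 0]
by_mp = (positive.groupby("mp")["amount"]
         .sum()
         .sort_values())          # ascending, like the 'asc' option in the explorer

print(by_mp.head(10))             # the ten lowest totals; .tail(10) for the largest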

We can also (now!) limit the number of results returned, e.g. to show the 10 MPs with the lowest claims to date (the datastore blog post explains why the data is incomplete and should be treated warily).

Limiting results in datastore explorer

Changing the asc order to desc in the above query gives a possibly more interesting result: the MPs who have the largest claims to date (presumably because they have got round to filing their claims! ;-)

Datastore exploring

Okay – enough for now; the reason I’m posting this is in part to ask the question: is this an unfair use of the Guardian datastore data, does it detract from the work they put in that lets them claim “You cannot get this anywhere else”, and does it impact on the returns they might expect to gain?

Should they/could they try to assert some sort of database collection right over the collection/curation and re-presentation of the data that is otherwise publicly available that would (nominally!) prevent me from using this data? Does the publication of the data using the shared link with the authorisation token imply some sort of license with which that data is made available? E.g. by accepting the link by clicking on it, because it is a shared link rather than a public link, could the Datastore attach some sort of tacit click-wrap license conditions over the data that I accept when I accept the shared data by clicking through the shared link? (Does the/can the sharing come with conditions attached?)

PS It seems there was a minor “issue” with the settings of the spreadsheet, a result of recent changes to the Google sharing setup. Spreadsheets should now be fully viewable… But as I mention in a comment below, I think there are still interesting questions to be considered around the extent to which publishers of “public” data can get a return on that data?


