
July 15 2011

07:19

When information is power, these are the questions we should be asking

Various commentators over the past year have observed that “data is the new oil”. If that’s the case, journalists should be following the money. But they’re not.

Instead it’s falling to the likes of Tony Hirst (an Open University academic), Dan Herbert (an Oxford Brookes academic) and Chris Taggart (a developer who used to be a magazine publisher) to fill the scrutiny gap. Recently all three have shone a light on the move towards transparency and open data, in pieces that anyone with an interest in information would be well advised to read.

Hirst wrote a particularly detailed post breaking down the results of a consultation about higher education data.

Herbert wrote about the publication of the first Whole of Government Accounts for the UK.

And Taggart made one of the best presentations I’ve seen on the relationship between information and democracy.

What all three highlight is how control of information still represents the exercise of power, and how shifts in that control as a result of the transparency/open data/linked data agenda are open to abuse, gaming, or spin.

Control, Cost, Confusion

Hirst, for example, identifies the potential for data about higher education to be monopolised by one organisation – UCAS, or HEFCE – at extra cost to universities, resulting in less detailed information for students and parents.

His translation of the outcomes of a HEFCE consultation brings to mind the situation that existed for years around Ordnance Survey data: taxpayers were paying for the information up to 8 times over, and the prohibitive cost of accessing that data ended up inspiring the Free Our Data campaign. As Hirst writes:

“The data burden is on the universities?! But the aggregation – where the value is locked up – is under the control of the centre? … So how much do we think the third party software vendors are going to claim for to make the changes to their systems? And hands up who thinks that those changes will also be antagonistic to developers who might be minded to open up the data via APIs. After all, if you can get data out of your commercially licensed enterprise software via a programmable API, there’s less requirement to stump up the cash to pay for maintenance and the implementation of “additional” features…”

Meanwhile Dan Herbert analyses another approach to data publication: the arrival of commercial-style accounting reports for the public sector. On the surface this all sounds transparent, but it may be just the opposite:

“There is absolutely no empiric evidence that shows that anyone actually uses the accounts produced by public bodies to make any decision. There is no group of principals analogous to investors. There are many lists of potential users of the accounts. The Treasury, CIPFA (the UK public sector accounting body) and others have said that users might include the public, taxpayers, regulators and oversight bodies. I would be prepared to put up a reward for anyone who could prove to me that any of these people have ever made a decision based on the financial reports of a public body. If there are no users of the information then there is no point in making the reports better. If there are no users more technically correct reports do nothing to improve the understanding of public finances. In effect all that better reports do is legitimise the role of professional accountants in the accountability process.”

Like Hirst, he argues that the raw data – and the ability to interrogate that – should instead be made available because (quoting Anthony Hopwood): “Those with the power to determine what enters into organisational accounts have the means to articulate and diffuse their values and concerns, and subsequently to monitor, observe and regulate the actions of those that are now accounted for.”

This is a characteristic of transparency initiatives that journalists need to be sharper about. The Manchester Evening News discovered this when they wanted to look at spending cuts: the dataset they found had been ‘spun’ to make the story hidden within it harder to see, and to answer their question they first had to unspin it – or, in data journalism parlance, clean it. Likewise, having granular data – ideally from more than one source – allows us to better judge the quality of the information itself.
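To make the ‘cleaning’ step concrete, here is a minimal Python sketch (using pandas) of the kind of tidying involved – the filename, column names and steps are invented for illustration, not the actual Manchester Evening News dataset:

```python
# A hypothetical sketch of "unspinning" a spending dataset with pandas.
# The file and column names are assumptions made for illustration only.
import pandas as pd

df = pd.read_csv("council_spending.csv")  # hypothetical file

# Normalise inconsistently spelled department names so totals can be compared
df["department"] = df["department"].str.strip().str.title()

# Convert amounts published as text ("£1,234.00") into numbers
df["amount"] = (
    df["amount"].astype(str)
      .str.replace("£", "", regex=False)
      .str.replace(",", "", regex=False)
      .astype(float)
)

# Now the real question can be asked: where are the biggest cuts?
print(df.groupby("department")["amount"].sum().sort_values())
```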

Chris Taggart meanwhile looks at the big picture: friction, he says, underpins society as we know it. Businesses such as real estate are based on it; privacy exists because of it; and democracies depend on it. As friction is removed through access to information, we get problems such as “jurisdiction failure” (corporate lawyers having “hacked” local laws to international advantage), but also issues around the democratic accountability of ad hoc communities and how we deal with different conceptions of privacy across borders.

Questions to ask of ‘transparency’

The point isn’t about the answers to the questions that Taggart, Herbert and Hirst raise – it’s the questions themselves, and the fact that journalists are, too often, not asking them when we are presented with yet another ‘transparency initiative’.

If data is the new oil, those three posts and a presentation provide a useful introduction to following the money.

(By the way, for a great example of a journalist asking all the right questions of one such initiative, see The Telegraph’s Conrad Quilty-Harper on the launch of Police.uk.)

Data is not just some opaque term, something for geeks: it’s information – the raw material we deal in as journalists. Knowledge. Power. The site of a struggle for control. And considering it’s a site that journalists have always fought over, it’s surprisingly placid as we enter one of the most important ages in the history of information control.

As Heather Brooke writes today of the hacking scandal:

“Journalism in Britain is a patronage system – just like politics. It is rare to get good, timely information through merit (eg by trawling through public records); instead it’s about knowing the right people, exchanging favours. In America reporters are not allowed to accept any hospitality. In Britain, taking people out to lunch is de rigueur. It’s where information is traded. But in this setting, information comes at a price.

“This is why there is collusion between the elites of the police, politicians and the press. It is a cartel of information. The press only get information by playing the game. There is a reason none of the main political reporters investigated MPs’ expenses – because to do so would have meant falling out with those who control access to important civic information. The press – like the public – have little statutory right to information with no strings attached. Inside parliament the lobby system is an exercise in client journalism that serves primarily the interests of the powerful. Freedom of information laws bust open the cartel.”

But laws come with loopholes and exemptions, red tape and ignorance. And they need to be fought over.

One bill to extend the FOI law to “remove provisions permitting Ministers to overrule decisions of the Information Commissioner and Information Tribunal; to limit the time allowed for public authorities to respond to requests involving consideration of the public interest; to amend the definition of public authorities” and more, for example, was recently put on indefinite hold. How many publishers and journalists are lobbying to un-pause this?

So let’s simplify things. And in doing so, there’s no better place to start than David Eaves’ 3 laws of government data.

This is summed up as the need to be able to “Find, Play and Share” information. For the purposes of journalism, however, I’ll rephrase them as 3 questions to ask of any transparency initiative:

  1. If information is to be published in a database behind a form, then it’s hidden in plain sight. It cannot be easily found by a journalist, and only simple questions will be answered.
  2. If information is to be published in PDFs or JPEGs, or some format that you need proprietary software to see, then it cannot easily be questioned by a journalist.
  3. If you will have to pass a test to use the information, then obstacles will be placed between the journalist and that information.

The next time an organisation claims that they are opening up their information, tick those questions off. (If you want more, see Gurstein’s list of 7 elements that are needed to make effective use of open data).

At the moment, the history of information is being written without journalists.


May 27 2011

17:18

#Newsrw: ‘There is no point in having data unless you have context to go with it’

Open data is essential, but useless without context – that was the consensus at the local data session.

A lively debate took place, with delegates hearing from a range of speakers about their attempts to fill the niche in local data by creating “open data cities” and encouraging transparency.

Greg Hadfield of Cogapp cited three key elements that were required to create an open data city.

“Data must be made available in a structured format, they must have a commitment to transparency and accountability, and be a city that thinks like the web,” he said.

Hadfield also mentioned the Road Map for the Digital City, a New York project headed by Rachel Sterne and intended to provide a comprehensive guide to making New York the leading digital city.

He also referenced his own experience in trying to transform Brighton into an open data city, saying the only organisation that wasn’t part of a conversation involving councils, NHS trusts and businesses was the traditional media.

“Monolithic media is lacking in innovation, is organisationally dysfunctional, careless about readers, users and communities.

“It’s guilty of continuous betrayals of trust at the expense of journalists, communities and shareholders,” he said.

Philip John of the Lichfield Blog confirmed this view, saying that people who ran hyperlocal websites like his were often far more passionate about their local area than journalists who worked for local newspapers.

[Image: @philipjohn speaking at #newsrw, by JosephStash]

The Lichfield Blog was born out of connecting to the community and creating something more in tune with the people who live there.

“There is no point in having data unless you have context to go with it. If we’re talking about journalism, we’re trying to find a story,” he said.

Chris Taggart and Jonathan Carr-West (of OpenlyLocal and the Local Government Information Unit respectively) both touched on the idea of not confusing technology with innovation.

“What matters is not the technology or the tools, but the uses that you put them to. This is the new emphasis on openness,” said Carr-West.

He spoke about finding the right tool for the job, and not thinking there is a one-size fits all solution to finding and releasing data.

Taggart has a background in magazine journalism but identified the lack of local data that was available on the web when he started his OpenlyLocal project.

He spoke about the brilliant stories that journalists can find by digging in councils, and the new opportunities that open local data presented.

“The information is out there, but we have a lack of resources and journalists.

“The opportunity is out there, but we need people to chase it and follow it up,” he said.

13:32

LIVE: Session 3A – Local data

We have Matt Caines and Ben Whitelaw from Wannabe Hacks liveblogging for us at news:rewired all day. You can follow session 3A ‘Local Data’, below.

Session 3A features: Philip John, director, Lichfield Blog; Chris Taggart, founder, OpenlyLocal; Greg Hadfield, director of strategic projects, Cogapp; Jonathan Carr-West, director, Local Government Information Unit. Moderated by Matthew Eltringham, editor, BBC College of Journalism website.

March 25 2011

13:50

All the news that’s fit to scrape

[Image: Channel 4/ScraperWiki collaboration]

There have been quite a few scraping-related stories that I’ve been meaning to blog about – so many that I’ve decided to write a round-up instead. It demonstrates the increasing role that scraping is playing in journalism – and the possibilities it opens up for those who don’t yet know about them:

Scraping company information

Chris Taggart explains how he built a database of corporations which will be particularly useful to journalists and anyone looking at public spending:

“Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and in the US, the District of Columbia) … In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported in OpenCorporates.”

OpenCorporates are also offering a bounty for programmers who can scrape company information from other jurisdictions.

Scraperwiki on the front page of The Guardian…

The Scraperwiki blog gives the story behind a front page investigation by James Ball on lobbyist influence in the UK Parliament:

“James Ball’s story is helped and supported by a ScraperWiki script that took data from registers across parliament that is located on different servers and aggregates them into one source table that can be viewed in a spreadsheet or document.  This is now a living source of data that can be automatically updated.  http://scraperwiki.com/scrapers/all_party_groups/

“Journalists can put down markers that run and update automatically and they can monitor the data over time with the objective of holding ‘power and money’ to account. The added value  of this technique is that in one step the data is represented in a uniform structure and linked to the source thus ensuring its provenance.  The software code that collects the data can be inspected by others in a peer review process to ensure the fidelity of the data.”
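As a rough illustration of that aggregation step (the actual ScraperWiki script is linked above), a sketch along these lines pulls several registers into one table while recording where each row came from – the URLs and column names below are assumptions, not the real sources:

```python
# Illustrative sketch only: pull data from several sources into one uniform,
# re-runnable table. The URLs and columns are placeholders, not the registers
# James Ball's script actually used.
import pandas as pd

SOURCES = {
    "commons_register": "https://example.org/commons_groups.csv",  # hypothetical
    "lords_register":   "https://example.org/lords_groups.csv",    # hypothetical
}

frames = []
for name, url in SOURCES.items():
    frame = pd.read_csv(url)
    frame["source"] = name  # keep provenance: which register each row came from
    frames.append(frame)

# One uniform table, linked back to its sources, that can be refreshed on a schedule
all_groups = pd.concat(frames, ignore_index=True)
all_groups.to_csv("all_party_groups.csv", index=False)
```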

…and on Channel 4’s Dispatches

From the Open Knowledge Foundation blog (more on Scraperwiki’s blog):

“ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data visualisations, to help viewers understand what assets the UK Government owns … The first is a bubble chart of what central Government owns. The PDFs were mined by hand (by Nicola) to make the visualisation, and if you drill down you will see an image of the PDF with the source of the data highlighted. That’s quite an innovation – one of the goals of the new data industry is transparency of source. Without knowing the source of data, you can’t fully understand the implications of making a decision based on it.

“The second is a map of brownfield land owned by local councils in England … The dataset is compiled by the Homes and Communities Agency, who have a goal of improving use of brownfield land to help reduce the housing shortage. It’s quite interesting that a dataset gathered for purposes of developing housing is also useful, as an aside, for measuring what the state owns. It’s that kind of twist of use of data that really requires understanding of the source of the data.”

Which chiropractors were making “bogus” claims?

This is an example from last summer. Following the Simon Singh case, Simon Perry wrote a script to check which chiropractors were making the same “bogus claims” that Singh was being sued over:

“The BCA web site lists all it’s 1029 members online, including for many of them, about 400 web site URLs. I wrote a quick computer program to download the member details, record them in a database and then download the individual web sites. I then searched the data for the word “colic” and then manually checked each site to verify that the chiropractors were either claiming to treat colic, or implying that chiropractic was an efficacious treatment for it. I found 160 practices in total, with around 500 individual chiropractors.

“The final piece in the puzzle was a simple mail-merge. Not wanting to simultaneously report several quacks to the same Trading Standards office, I limited the mail-merge to one per authority and sent out 84 letters.

“On the 10th, the science blogs went wild when Le Canard Noir published a very amusing email from the McTimoney Chiropractic Association, advising their members to take down their web site. It didn’t matter, I had copies of all the web sites.”
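For readers curious what such a script involves, here is a rough Python sketch of the same idea – the member URLs, keyword check and storage are placeholders, not Perry’s actual code:

```python
# A rough sketch of the kind of check described above: download each member's
# site, flag any that mention "colic", and keep a local copy of the result.
# The URL list and database are hypothetical stand-ins.
import requests
import sqlite3

conn = sqlite3.connect("chiropractors.db")
conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT, mentions_colic INTEGER)")

member_urls = ["https://example-chiro-practice.co.uk"]  # would come from the scraped member list

for url in member_urls:
    try:
        html = requests.get(url, timeout=30).text
    except requests.RequestException:
        continue  # skip sites that are down; they can be rechecked later
    # Flag sites mentioning "colic" for manual verification, as described above
    conn.execute("INSERT INTO sites VALUES (?, ?)", (url, int("colic" in html.lower())))

conn.commit()
```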

11:25

OpenCorporates partners with ScraperWiki & offers bounties for open data scrapers

This is a guest post by Chris Taggart, co-founder of OpenCorporates

When we started OpenCorporates it was to solve a real need that we and a number of other people in the open data community had: whether it’s Government spending, subsidy info or court cases, we needed a database of corporate entities to match against, and not just for one country either.

But we knew from the first that we didn’t want this to be some heavily funded monolithic project that threw money at the problem in order to create a walled garden of new URIs unrelated to existing identifiers. It’s also why we wanted to work with existing projects like OpenKvK, rather than trying to replace them.

So the question was, how do we make this scale, and at the same time do the right thing – that is work with a variety of different people using different solutions and different programming languages. The answer to both, it turns out, was to use open data, and the excellent ScraperWiki.

How does it work? Well, the basics we need in order to create a company record at OpenCorporates are the company number, the jurisdiction and the company’s name. (If there’s a status field – e.g. dissolved/active – a company type, or a URL for more data, that’s a bonus.) So, all you need to do is write a scraper for a country we haven’t got data for, name the fields in a standard way (CompanyName, CompanyNumber, Status, EntityType, and RegistryUrl if the URL of the company page can’t be worked out from the company number), and bingo, we can pull it into OpenCorporates with just a couple of lines of code.
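As a hedged illustration of what “naming the fields in a standard way” might look like in a ScraperWiki Python scraper (the company details below are made up):

```python
# A minimal sketch of the kind of record a scraper would save. The field names
# follow the standard headings described above; the company itself is invented.
import scraperwiki  # classic ScraperWiki library; scraperwiki.sqlite.save stores rows

record = {
    "CompanyNumber": "000001C",  # hypothetical example value
    "CompanyName":   "Example Holdings Limited",
    "Status":        "Live",
    "EntityType":    "Limited Company",
    "RegistryUrl":   "https://example-registry.gov/companies/000001C",
}

# CompanyNumber is the unique key, so re-running the scraper updates rather than duplicates
scraperwiki.sqlite.save(unique_keys=["CompanyNumber"], data=record)
```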

Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and in the US, the District of Columbia). It’s written in Ruby, because that’s what we at OpenCorporates code in, but ScraperWiki allows you to write scrapers in Python or PHP too, and the important thing here is the data, not the language used to produce it.

The Isle of Man company registry website is a .Net system which uses all sorts of hidden fields and other nonsense in its forms and navigation. This is normally a bit of a pain, but because you can use the Ruby Mechanize library to submit the forms found on the pages (there’s even a tutorial scraper which shows how to do it), it becomes fairly straightforward.

The code itself should be fairly readable to anyone familiar with Ruby or Python, but essentially it tackles the problem by doing multiple searches for companies beginning with two letters, starting with ‘aa’ then ‘ab’ and so on, and for each letter pair iterating through each page of results, scraping each page to extract the data and saving it under the standardised headings. That’s it.
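The original is in Ruby, but the same strategy can be sketched in Python – the form-submission and parsing functions below are placeholders for the site-specific details, not the real scraper:

```python
# A Python sketch of the two-letter-prefix strategy described above.
# search_registry() and parse_results_page() stand in for the registry-specific
# form submission and HTML parsing, which are omitted here.
import itertools
import string
import scraperwiki

def search_registry(prefix, page):
    """Submit the registry search for companies starting with `prefix` and
    return the given page of results (site-specific, omitted in this sketch)."""
    raise NotImplementedError

def parse_results_page(html):
    """Extract a list of dicts using the standardised headings (omitted here)."""
    raise NotImplementedError

for a, b in itertools.product(string.ascii_lowercase, repeat=2):
    prefix, page = a + b, 1
    while True:
        rows = parse_results_page(search_registry(prefix, page))
        if not rows:
            break  # no more results for this prefix
        scraperwiki.sqlite.save(unique_keys=["CompanyNumber"], data=rows)
        page += 1
```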

In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported in OpenCorporates.

However, that’s not all. In order to kickstart the effort OpenCorporates (technically Chrinon Ltd, the micro start-up that’s behind OpenCorporates) is offering a bounty for new jurisdictions opened up.

It’s not huge (we’re a micro-startup remember): £100 for any jurisdiction that hasn’t been done yet, £250 for those territories we want to import sooner rather than later (Australia, France, Spain), and £500 for Delaware (there’s a captcha there, so not sure it’s even possible), and there’s an initial cap of £2500 on the bounty pot (details at the bottom of this post).

However, often the scrapers can be written in a couple of hours, and it’s worth stressing again that neither the code nor the data will belong to OpenCorporates, but to the open data community, and if people build other things on it, so much the better. Of course we think it would make sense for them to use the OpenCorporates URIs to make it easy to exchange data in a consistent and predictable way, but, hey, it’s open data ;-)

Small, simple pieces, loosely connected, to build something rather cool. So now you can do a search for, oh say Barclays, and get this:

The bounty details: how it works

Find a country/company registry that you fancy opening up the data for (here are a couple of lists of registries). Make sure it’s from the official registry, and not a commercial reseller. Check too that no-one has already written one, or is in the middle of writing one, by checking the scrapers tagged with opencorporates (be nice, and respect other people’s attempts, but feel free to start one if it looks as if someone’s given up on a scraper).

All clear? Go ahead and start a new scraper (useful tutorials here). Call it something like trial_fr_company_numbers (until it’s done and been OK’d) and get coding, using the headings detailed above for the CompanyNumber, CompanyName etc. When it’s done, and it’s churning away pulling in data, email us info@opencorporates.com, and assuming it’s OK, we’ll pay you by Paypal, or by bank transfer (you’ll need to give us an invoice in that case). If it’s not we’ll add comments to the scraper. Any questions, email us at info@opencorporates.com, and happy scraping.

February 23 2011

20:10

Help Me Investigate is now open source

I have now released the source code behind Help Me Investigate, meaning others can adapt it, install it, and add to it if they wish to create their own crowdsourcing platform or support the idea behind it.

This follows the announcement two weeks ago on the Help Me Investigate blog (more coverage on Journalism.co.uk and Editors Weblog).

The code is available on GitHub.

Collaborators wanted

I’m looking for collaborators and coders to update the code to Rails 3, write documentation to help users install it, improve the code/test, or even be the project manager for this project.

Over the past 18 months the site has surpassed my expectations. It’s engaged hundreds of people in investigations, furthered understanding and awareness of crowdsourcing, and been runner-up for Multimedia Publisher of the Year. In the process it attracted attention from around the world – people wanting to investigate everything from drug running in Mexico to corruption in South Africa.

Having the code on one site meant we couldn’t help those people: making it open source opens up the possibility, but it needs other people to help make that a reality.

If you know anyone who might be able to help, please shoot them a link. Or email me at paul(at)helpmeinvestigate.com

Many thanks to Chris Taggart and Josh Hart for their help with moving the code across.

October 04 2010

12:24

Open data meets FOI via some nifty automation

[Image: OpenlyLocal-generated FOI request]

Now this is an example of what’s possible with open data and some very clever thinking. Chris Taggart blogs about a new tool on his OpenlyLocal platform that allows you to send a Freedom of Information (FOI) request based on a particular item of spending. “This further lowers the barriers to armchair auditors wanting to understand where the money goes, and the request even includes all the usual ‘boilerplate’ to help avoid specious refusals.”

It takes around a minute to generate an FOI request.

The function is limited to items of spending above £10,000. Cleverly, it’s also all linked so you can see if an FOI request has already been generated and answered.
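Purely as an illustration of the idea (this is not OpenlyLocal’s actual code), a generated request might be assembled from a single spending record roughly like this – the record fields and wording are assumptions:

```python
# Illustrative sketch: build FOI request text from one spending record.
# The record and the boilerplate wording are invented for this example.
spend_item = {
    "council":  "Anytown Borough Council",
    "supplier": "Example Services Ltd",
    "amount":   12500.00,
    "date":     "2010-08-15",
}

request_text = (
    f"Dear {spend_item['council']},\n\n"
    "Under the Freedom of Information Act 2000, please provide a copy of the "
    f"invoice and any related contract for the payment of £{spend_item['amount']:,.2f} "
    f"made to {spend_item['supplier']} on {spend_item['date']}.\n\n"
    "Yours faithfully,\n"
)
print(request_text)
```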

Although the tool sits on OpenlyLocal, Francis Irving at WhatDoTheyKnow gets enormous credit for making their side of the operation work with it.

Once again you have to ask why a media organisation isn’t creating these sorts of tools to help generate journalism beyond the walls of its newsroom.

September 06 2010

20:35

Charities data opened up – journalists: say thanks.

Having made significant inroads in opening up council and local election data, Chris Taggart has now opened up charities data from the less-than-open Charity Commission website. The result: a new website – Open Charities.

The man deserves a round of applause. Charity data is enormously important in all sorts of ways – and is likely to become more so as the government leans on the third sector to take on a bigger role in providing public services. Making it easier to join the dots between charitable organisations, the private and public sector, contracts and individuals – which is what Open Charities does – will help journalists and bloggers enormously.

A blog post by Chris explains the site and its background in more depth. In it he explains that:

“For now, it’s just the simplest of things, a web application with a unique URL for every charity based on its charity number, and with the basic information for each charity available as data (XML, JSON and RDF). It’s also searchable, and sortable by most recent income and spending, and for linked data people there are dereferenceable Resource URIs.

“The entire database is available to download and reuse (under an open, share-alike attribution licence). It’s a compressed CSV file, weighing in at just under 20MB for the compressed version, and should probably only be attempted by those familiar with manipulating large datasets (don’t try opening it up in your spreadsheet, for example). I’m also in the process of importing it into Google Fusion Tables (it’s still churning away in the background) and will post a link when it’s done.”
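For anyone tempted to try, a minimal Python sketch of handling a file that size without a spreadsheet – the filename and the counting step are assumptions, not part of the Open Charities release:

```python
# Read a large compressed CSV in chunks rather than loading it all at once.
# The filename is a placeholder for whatever the downloaded dump is called.
import pandas as pd

total = 0
for chunk in pd.read_csv("opencharities.csv.gz", compression="gzip", chunksize=50_000):
    # Filter, aggregate or count per chunk instead of holding the whole file in memory
    total += len(chunk)

print(f"{total} charity records in the extract")
```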

Chris promises to add more features “if there’s any interest”.

Well, go on…

July 22 2010

19:08

Some other online innovators for some other list

Journalism.co.uk have a list of this year’s “leading innovators in journalism and media”. I have some additions. You may too.

Nick Booth

I brought Nick in to work with me on Help Me Investigate, a project for which he doesn’t get nearly enough credit. It’s his understanding of and connections with local communities that lie behind most of the successful investigations on the site. In addition, Nick helped spread the idea of the social media surgery, where social media savvy citizens help others find their online voice. The idea has spread as far as Australia and Africa.

Matt Buck and Alex Hughes

Matt and Alex have been busily reinventing news cartoons for a digital age with a number of projects, including Drawnalism (event drawing), animated illustrations, and socially networked characters such as Tobias Grubbe.

Pete Cashmore

Mashable.

Tony Hirst

Tony has been blogging about mashups for longer than most at OUseful.info, providing essential help for journalists getting to grips with Yahoo! Pipes, Google spreadsheets, scraping, and – this week – Google App Inventor.

Adrian Holovaty and Simon Willison

I’m unfairly bunching these two together because they were responsible – with others – for the Django web framework, which has been the basis for some very important data journalism projects including The Guardian’s experiment in crowdsourcing analysis of MPs’ redacted expenses, and Holovaty’s Everyblock.

Philip John

Behind the Lichfield Blog but, equally importantly, Journal Local, the platform for hyperlocal publishers which comes with a raft of useful plugins pre-installed. He also runs the West Midlands Future of News Group.

Christian Payne

Documentally has been innovating and experimenting with mobile journalism for years in the UK, with a relaxed-but-excitable on-screen/on-audio presence that suits the medium perfectly. And he really, really knows his kit.

Meg Pickard

Meg is an anthropologist by training, a perfect background for community management, especially when combined with blogging experience that pre-dates most of the UK. The practices she has established on the community management front at The Guardian’s online operations are an exemplar for any news organisation – and she takes lovely photos too.

Chris Taggart

Chris has been working so hard on open data in 2010 I expect steam to pour from the soles of his shoes every time I see him. His ambition to free up local government data is laudable and, until recently, unfashionable. And he deserves all the support and recognition he gets.

Rick Waghorn

One of the first regional newspaper reporters to take the payoff and try to go it alone online – first with his Norwich City website, then the MyFootballWriter network, and more recently with the Addiply self-serve ad platform. Rick is still adapting and innovating in 2010 with some promising plans in the pipeline.

I freely admit that these are based on my personal perspective and knowledge. And yes, lists are pointless, and linkbait.

July 06 2010

10:05

Don’t stop us digging into public spending data

A very disturbing discovery by Chris Taggart last week: a number of councils in the UK are handing over their ‘open’ data to a company which only allows it to be downloaded for “personal” use.

As Chris himself points out, this runs completely against the spirit of the push to release public data in a number of ways:

  • Data cannot be used for “commercial gain”. This includes publishers wanting to present the information in ways that make most sense to the reader, and startups wanting to find innovative ways to involve people in their local area. Oh, and that whole ‘Big Society‘ stuff.
  • The way the sites are built means you couldn’t scrape this information with a computer anyway
  • It’s only a part of the data. “Download the data from SpotlightOnSpend and it’s rather different from the published data [on the Windsor & Maidenhead site]. Different in that it is missing core data that is in W&M published data (e.g. categories), and that includes data that isn’t in the published data (e.g. data from 2008).”

It’s a very worrying path indeed. As Chris sums it up: “Councils hand over all their valuable financial data to a company which aggregates for its own purposes, and, er, doesn’t open up the data, shooting down all those goals of mashing up the data, using the community to analyse and undermining much of the good work that’s been done.”

The Transparency Board quickly issued a statement about the issue, saying that “urgent” measures are being taken to rectify the problem.

And Spikes Cavell, who make the software, responded in Information Age, pointing out that “it is first and foremost a spend analysis software and consultancy supplier, and that it publishes data through SpotlightOnSpend as a free, optional and supplementary service for its local government customers. The hope is that this might help the company to win business, he explains, but it is not a money-spinner in itself.”

They are now promising to make the data available for download in its “raw form”, although it’s not clear what that will be. Adrian Short’s comment to the piece is worth reading.

Nevertheless, this is an issue that anyone interested in holding power to account should keep a close eye on. And to that aim, Chris has started an investigation on Help Me Investigate to find out how and why councils are giving access to their spending data. Please join it and help here.

(Comment or email me on paul at helpmeinvestigate.com if you want an invitation.)
