May 27 2011


#newsrw: Making sense of the numbers in data journalism

The next big developments in data journalism are live data and getting your audience involved, according to Martin Stabe, an interactive producer at FT.com.

He was one of four data journalists giving tips on what belongs in the data journalism toolkit, advice on tools (many of which are free), and guidance on how to find data and clean it.

James Ball, data journalist on the Guardian investigations team, worked on the WikiLeaks cables and discussed the “use and abuse of statistics”.

He showed “a really awful infographic” on the amount of water it takes to make a pizza and a slice of bread.

“You don’t have to do much research to realise that is just tosh,” he said.

“We have to sense-check numbers.” He gave the example of culture secretary Jeremy Hunt putting expected TV viewership for the Royal Wedding at an unrealistic two billion; the estimated audience was 300 million.

He asked: “Why might it matter?” And explained the dangers of bad statistics and bad journalism. “The best bit of your toolkit is understanding a bit of maths,” he advised.

Kevin Anderson, data journalism trainer and digital strategist, who trained as a journalist in the US, gave more tips on tools. One revolution is access to data; the other is access to tools, he said.

One tool in his kit is Google Docs, in particular Google Spreadsheets, which Anderson used when he was at the Guardian; he also recommended the OUseful blog.

“You can import a live data feed,” he said, and he suggests collecting your own data with a form. You can ask questions, including multiple choice, and embed the form into a story.
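The same import-and-summarise step outside a spreadsheet takes only a few lines of Python. This is a minimal sketch: the feed is an inline sample string here (column names and values invented), whereas in practice the text would be fetched from the feed URL with urllib.request.

```python
import csv
import io

# Inline stand-in for a live CSV feed; in practice this text would come from
# urllib.request.urlopen(feed_url).read().decode() for some real feed_url.
feed_text = "area,value\nNorth,10\nSouth,7\nNorth,5\n"

# Parse the feed and tally the numeric column per area.
rows = list(csv.DictReader(io.StringIO(feed_text)))
totals = {}
for row in rows:
    totals[row["area"]] = totals.get(row["area"], 0) + int(row["value"])

print(totals)  # per-area sums ready to drop into a story or chart
```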

For easy mapping tools he advises Google and Zeemaps. Once you have the data, he said, the next process is “link scraping”.

You can “grab data” from existing sources. He gives the example of using Outwit Hub, a Firefox plugin that lets you pull in links, with their URLs, from any search and then export them to a Google Spreadsheet or SQL.
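As a rough illustration of what such link scraping does under the hood, here is a stdlib Python sketch that pulls anchor text and URLs out of a page, ready to export as CSV rows. The sample HTML and URLs are invented; this is not Outwit Hub's actual mechanism, just the same idea.

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collects (anchor text, href) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, url) pairs
        self._href = None  # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def extract_links(html):
    scraper = LinkScraper()
    scraper.feed(html)
    return scraper.links

# Invented sample page standing in for a fetched search-results page.
sample = ('<p><a href="http://example.com/a">First story</a> and '
          '<a href="http://example.com/b">second</a>.</p>')
print(extract_links(sample))
```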

Anderson also recommends tools that extract structure from text. He gives the example of OpenCalais, a Thomson Reuters tool, which “allows you to see patterns in your own coverage” and connections between stories.

He also pointed journalists towards ThinkMap and gave the example of ‘Who Runs Hong Kong’, a data visualisation showing the connections of power.

“The ability for news organisations to extract more value through data journalism is a huge opportunity,” he said.

Martin Stabe, interactive producer, FT.com, who, like Anderson, is originally from the States, described how data-driven news stories at FT.com are handled by a team.

He explained the team consists of a reporter, “who really knows the story”, a producer, like Stabe, a designer and a developer.

“One of the best things you can do in your newsroom is to get your head round administrative geography,” he said, along with understanding statistical data.

He said it is very difficult to get data on all local authorities, on when they hold local elections and how their public spending is changing. Local data is often coded in different ways, he explained, and gave the example of the “Cuts of £6bn hit the elderly the hardest” report on FT.com.

When you have a large dataset you need to ask questions. But data may be “dirty”, with a mix of local coding conventions.

“The very act of cleaning the data is the key step,” he said.
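The cleaning step Stabe describes often amounts to mapping every local variant of a name or code onto one canonical form before any analysis. A hedged sketch, with invented authority names and variants (the real FT pipeline is not described in this detail):

```python
import re

# Invented lookup table: normalised keys -> canonical authority names.
CANONICAL = {
    "birminghamcitycouncil": "Birmingham",
    "birmingham": "Birmingham",
    "leedscc": "Leeds",
    "leeds": "Leeds",
}

def normalise_authority(raw):
    """Lower-case, strip everything but letters, then look up a canonical name.

    Unrecognised names pass through unchanged so they can be inspected later.
    """
    key = re.sub(r"[^a-z]", "", raw.lower())
    return CANONICAL.get(key, raw.strip())

# The same authority coded four different ways across local datasets.
rows = ["Birmingham City Council", "BIRMINGHAM", "Leeds C.C.", "Unknown Borough"]
cleaned = [normalise_authority(r) for r in rows]
print(cleaned)
```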

“Data is only useful if it’s personal”, Bella Hurrell from the BBC recently said on Paul Bradshaw’s blog, a quote echoed by Stabe, who gave the example of data on how likely a 16-year-old receiving free school lunches is to get good or bad GCSE results.

He pointed out that readers are usually only interested in one area, one school, so an interactive version allows people to drill down. The data journalism steps are to obtain, warehouse and publish the data.

In obtaining the data, “sometimes we ask for it nicely” Stabe said, but usually the FT scrapes the data, and it then goes into a database.

His tips for journalists include learning how to manipulate text in Excel.

Next came advice from Simon Rogers, editor of the Guardian’s Data Store and Datablog.

Newspapers are all about the geography of the newsroom, he said, describing how he sits beside the investigations team and news desk.

He spoke about the difficulty in getting usable public data and dealing with the government’s “annual resource accounts”.

The Guardian is now providing ordered data to the people in government who supplied it, he explained.

The Guardian’s data workflow is: getting sent data, data from breaking news, recurring events and “theories to be exploited”. The journalists then have to think about how to “mash it together”, as combining data makes it more interesting.

A couple of Rogers’ tips are to use ManyEyes and Google Spreadsheets, but “sometimes numbers alone are interesting enough to make a story,” he said.

He gave the example of a map made using Google Fusion Tables showing “patterns of awfulness”: every death in Iraq mapped, which took about half an hour.

More recent examples include accessing data provided on the NATO Libya website. The site produces a daily archive of what happens each day, including data on missions.

Every day they add the NATO data to a map to show visually what has been hit and where. It can also generate stories as journalists notice patterns.


LIVE: Session 1A – The data journalism toolkit

We have Matt Caines and Ben Whitelaw from Wannabe Hacks liveblogging for us at news:rewired all day. You can follow session 1A ‘The data journalism toolkit’, below.

Session 1A features: Kevin Anderson, data journalism trainer and digital strategist; James Ball, data journalist, Guardian investigations team; Martin Stabe, interactive producer, FT.com; and Simon Rogers, editor, Guardian Datablog and Data Store. Moderated by David Hayward, head of journalism programme, BBC College of Journalism.

news:rewired – Session 1A: The data journalism kit


March 25 2011


All the news that’s fit to scrape

Channel 4/Scraperwiki collaboration

There have been quite a few scraping-related stories that I’ve been meaning to blog about – so many that I’ve decided to write a round-up instead. It demonstrates the increasing role that scraping is playing in journalism – and the possibilities it opens up for those who don’t yet know about it:

Scraping company information

Chris Taggart explains how he built a database of corporations which will be particularly useful to journalists and anyone looking at public spending:

“Let’s have a look at one we did earlier: the Isle of Man (there’s also one for Gibraltar, Ireland, and in the US, the District of Columbia) … In the space of a couple of hours not only have we liberated the data, but both the code and the data are there for anyone else to use too, as well as being imported in OpenCorporates.”

OpenCorporates are also offering a bounty for programmers who can scrape company information from other jurisdictions.

Scraperwiki on the front page of The Guardian…

The Scraperwiki blog gives the story behind a front page investigation by James Ball on lobbyist influence in the UK Parliament:

“James Ball’s story is helped and supported by a ScraperWiki script that took data from registers across parliament that is located on different servers and aggregates them into one source table that can be viewed in a spreadsheet or document.  This is now a living source of data that can be automatically updated.  http://scraperwiki.com/scrapers/all_party_groups/

“Journalists can put down markers that run and update automatically and they can monitor the data over time with the objective of holding ‘power and money’ to account. The added value  of this technique is that in one step the data is represented in a uniform structure and linked to the source thus ensuring its provenance.  The software code that collects the data can be inspected by others in a peer review process to ensure the fidelity of the data.”
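The aggregation the post describes – rows scraped from registers on different servers merged into one uniform table, each row linked to its source for provenance – can be sketched with sqlite3. The group names, chairs and register URLs below are invented placeholders, not the real parliamentary data or ScraperWiki's actual script.

```python
import sqlite3

# Rows as if scraped from two registers hosted on different servers.
# All names and URLs are invented for illustration.
scraped = {
    "http://example.org/register1": [("All-Party Beer Group", "Chair A")],
    "http://example.org/register2": [("All-Party Cycling Group", "Chair B"),
                                     ("All-Party Jazz Group", "Chair C")],
}

# One uniform table; source_url preserves each row's provenance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE groups (name TEXT, chair TEXT, source_url TEXT)")
for url, rows in scraped.items():
    conn.executemany("INSERT INTO groups VALUES (?, ?, ?)",
                     [(name, chair, url) for name, chair in rows])

total = conn.execute("SELECT COUNT(*) FROM groups").fetchone()[0]
print(total)  # every register's rows now live in one queryable table
```

Re-running the scrapers and repeating the inserts is what makes this a “living” source that can be updated automatically.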

…and on Channel 4’s Dispatches

From the Open Knowledge Foundation blog (more on Scraperwiki’s blog):

“ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data visualisations, to help viewers understand what assets the UK Government owns … The first is a bubble chart of what central Government owns. The PDFs were mined by hand (by Nicola) to make the visualisation, and if you drill down you will see an image of the PDF with the source of the data highlighted. That’s quite an innovation – one of the goals of the new data industry is transparency of source. Without knowing the source of data, you can’t fully understand the implications of making a decision based on it.

“The second is a map of brownfield land owned by local councils in England … The dataset is compiled by the Homes and Communities Agency, who have a goal of improving use of brownfield land to help reduce the housing shortage. It’s quite interesting that a dataset gathered for purposes of developing housing is also useful, as an aside, for measuring what the state owns. It’s that kind of twist of use of data that really requires understanding of the source of the data.”

Which chiropractors were making “bogus” claims?

This is an example from last summer. Following the Simon Singh case Simon Perry wrote a script to check which chiropractors were making the same “bogus claims” that Singh was being sued over:

“The BCA web site lists all its 1029 members online, including for many of them, about 400 web site URLs. I wrote a quick computer program to download the member details, record them in a database and then download the individual web sites. I then searched the data for the word “colic” and then manually checked each site to verify that the chiropractors were either claiming to treat colic, or implying that chiropractic was an efficacious treatment for it. I found 160 practices in total, with around 500 individual chiropractors.

“The final piece in the puzzle was a simple mail-merge. Not wanting to simultaneously report several quacks to the same Trading Standards office, I limited the mail-merge to one per authority and sent out 84 letters.

“On the 10th, the science blogs went wild when Le Canard Noir published a very amusing email from the McTimoney Chiropractic Association, advising their members to take down their web site. It didn’t matter, I had copies of all the web sites.”
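The core of Perry's keyword check is simple to sketch: given saved copies of member pages, flag the ones that mention the target word so each can be manually verified. The site names and page text below are invented stand-ins, and this compresses his download-database-search pipeline into one step.

```python
# Invented stand-ins for the downloaded member web sites.
pages = {
    "practice-one.example": "We treat back pain and infant colic.",
    "practice-two.example": "Sports injury specialists.",
    "practice-three.example": "Colic, asthma and ear infections treated.",
}

def flag_pages(pages, word):
    """Return the site keys whose text contains `word`, case-insensitively.

    Hits are candidates only; as in the original, each still needs manual
    verification before any report is made.
    """
    needle = word.lower()
    return sorted(k for k, text in pages.items() if needle in text.lower())

hits = flag_pages(pages, "colic")
print(hits)
```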

February 25 2011


Read all about it, read all about it: “ScraperWiki gets on the Guardian front page…”

A data-driven story by investigative journalist James Ball on lobbyist influence in the UK Parliament has made it on to the front page of the Guardian. What is exciting for us is that James Ball’s story is helped and supported by a ScraperWiki script that took data from registers across parliament that are located on different servers and aggregated them into one source table that can be viewed in a spreadsheet or document. This is now a living source of data that can be automatically updated. http://scraperwiki.com/scrapers/all_party_groups/

For the past year the team at ScraperWiki has been running media events around the country. Our next one is in Cardiff and is fully subscribed; we also have an event at BBC Scotland in Glasgow on 25 March. Throughout the programme we have had the opportunity to meet great journalists and bloggers from national and local press, so we always thought we would make it to the front page – we just didn’t know when or by whom.

The story demonstrates the potential power of ScraperWiki to help journalists and researchers join the dots efficiently by collaboratively working with data specialists and software systems. Journalists can put down markers that run and update automatically and they can monitor the data over time with the objective of holding ‘power and money’ to account. The added value  of this technique is that in one step the data is represented in a uniform structure and linked to the source thus ensuring its provenance.  The software code that collects the data can be inspected by others in a peer review process to ensure the fidelity of the data.

In addition, and because of the collaborative and social nature of the platform, there is also the potential to involve others in the wider technical and data community to continue to improve the data. Since the data is delivered using a scheduled script that runs daily, journalists and interested parties can now subscribe to the data set for future changes and amendments. So, for example, a journalist interested in any influence by a company, such as Virgin, can now set a specific email alert for donations or other actions by the conglomerate.
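The alerting idea above can be sketched as a diff between two snapshots of a scraped dataset: surface any new rows that mention a watched keyword. This is an assumed mechanism for illustration, not ScraperWiki's actual implementation, and the row contents are invented.

```python
# Two invented snapshots of the same scraped dataset, as sets of rows.
yesterday = {
    ("Donation", "Acme Ltd", "5000"),
}
today = {
    ("Donation", "Acme Ltd", "5000"),
    ("Donation", "Virgin", "10000"),
    ("Donation", "Other Co", "250"),
}

def new_rows_matching(old, new, keyword):
    """Rows present in `new` but not `old` whose fields mention `keyword`."""
    kw = keyword.lower()
    return sorted(row for row in new - old
                  if any(kw in field.lower() for field in row))

alerts = new_rows_matching(yesterday, today, "virgin")
print(alerts)  # these rows would be emailed to the subscribed journalist
```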

We know and understand that data in the media sector needs to be kept embargoed until the story breaks.  Next month we will be launching an opportunity for data consumers to request and subscribe to specific data feeds.

There is a tsunami of data being published and it’s increasingly hard for investigative journalists to find the time to sift through the masses of information and make sense of it. We believe that ScraperWiki helps to solve some of the ‘hard’ data issues that people in the media face on a daily basis.

Congratulations to James on his front page story and to the fantastic team at the Guardian who do fabulous work on open data and data driven journalism – long may it continue!

October 26 2010


Hacks/Hackers London meetup to discuss Iraq War logs

Scraperwiki will be supporting the November Hacks/Hackers London meetup at 7pm on Wednesday 24 November 2010 at The Irish Club, 2-4 Tudor Street, London EC4Y 0AA. A few tickets are still available, but places are filling fast.


  • 7.00pm: The data journalism behind the Iraq War Logs – James Ball, Bureau of Investigative Journalism

James, Development Producer for the Bureau of Investigative Journalism and Chief Data Analyst on the TBIJ/Channel 4 Dispatches investigation into the Iraq War Logs, will explain how data journalism powered the process.

  • 7.30pm: TBC
  • 8pm: Social!

October 22 2010


Help Me Investigate – anatomy of an investigation

Earlier this year Andy Brightwell and I conducted some research into one of the successful investigations on my crowdsourcing platform Help Me Investigate. I wanted to know what had made the investigation successful – and how (or if) we might replicate those conditions for other investigations.

I presented the findings (presentation embedded above) at the Journalism’s Next Top Model conference in June. This post sums up those findings.

The investigation in question was ‘What do you know about The London Weekly?’ – an investigation into a free newspaper that was, they claimed, about to launch in London (part of the investigation was to establish whether it was a hoax).

The people behind the paper had made a number of claims about planned circulation, staffing and investment that most of the media reported uncritically. Martin Stabe, James Ball and Judith Townend, however, wanted to dig deeper. So, after an exchange on Twitter, Judith logged onto Help Me Investigate and started an investigation.

A month later members of the investigation had unearthed a wealth of detail about the people behind The London Weekly and the facts behind their claims. Some of the information was reported in MediaWeek and The Media Guardian podcast Media Talk; some formed the basis for posts on James Ball’s blog, Journalism.co.uk and the Online Journalism Blog. Some has, for legal reasons, remained unpublished.

A note on methodology

Andrew conducted a number of semi-structured interviews with contributors to the investigation. The sample was randomly selected but representative of the mix of contributors, who were categorised as ‘alpha’ contributors (over 6 contributions), ‘active’ (2-6 contributions) or ‘lurkers’ (whose only contribution was to join the investigation). These interviews formed the qualitative basis for the research.

Complementing this data was quantitative information about users of the site as a whole. This was taken from two user surveys – one when the site was three months old and another at 12 months – and analysis of analytics from the investigation (such as numbers and types of actions, frequency, etc.).
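The contributor categories used above translate directly into code: ‘alpha’ is over 6 contributions, ‘active’ is 2-6, and a single action (joining) makes a ‘lurker’. The contribution counts below are invented sample data, not the study's real numbers.

```python
from collections import Counter

def categorise(n_contributions):
    """Map a contribution count onto the study's three categories."""
    if n_contributions > 6:
        return "alpha"
    if n_contributions >= 2:
        return "active"
    return "lurker"   # only action was joining the investigation

# Invented per-user contribution counts for ten sample users.
sample_counts = [1, 1, 1, 2, 3, 7, 12, 1, 5, 1]

breakdown = Counter(categorise(n) for n in sample_counts)
print(dict(breakdown))
```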

What are the characteristics of a crowdsourced investigation?

One of the first things I wanted to analyse was whether the investigation data matched up to patterns observed elsewhere in crowdsourcing and online activity. An analysis of the number of actions by each user, for example, showed a clear ‘power law’ distribution, where a minority of users accounted for the majority of activity.

This power law, however, did not translate into a breakdown approaching the 90-9-1 ‘law of participation inequality’ observed by Jakob Nielsen. Instead, the balance between those who made a couple of contributions (normally the 9% of the 90-9-1 split) and those who made none (the 90%) was roughly equal. This may have been because the design of the site meant it was not possible to ‘lurk’ without being a member of the site already, or being invited and signing up.

Adding in data on those looking at the investigation page who were not members may have shed further light on this.

What made the crowdsourcing successful?

Clearly, it is worth making a distinction between what made the investigation successful as a series of outcomes, and what made crowdsourcing successful as a method.

What made the community gather, and continue to return? One hypothesis was that the nature of the investigation provided a natural cue to interested parties – The London Weekly was published on Fridays and Saturdays and there was a build up of expectation to see if a new issue would indeed appear.

I was curious to see if the investigation had any ‘rhythm’. Would there be peaks of interest correlating to the expected publication?

The data threw up something else entirely. There was indeed a rhythm but it was Wednesdays that were the most popular day for people contributing to the investigation.

Why? Well, it turned out that one of the investigation’s ‘alpha’ contributors – James Ball – set himself a task to blog about the investigation every week. His blog posts appeared on a Wednesday.

That this turned out to be a significant factor in driving activity tells us one important lesson: talking publicly and regularly about the investigation’s progress is key.
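The weekday rhythm above can be reproduced in a few lines: bucket contribution timestamps by day of the week and find the peak. The dates here are invented samples chosen to show a Wednesday peak, not the investigation's real logs.

```python
from collections import Counter
from datetime import date

# Invented contribution dates (February 2010 falls within the investigation's era).
contributions = [date(2010, 2, 3), date(2010, 2, 3), date(2010, 2, 10),
                 date(2010, 2, 5), date(2010, 2, 17), date(2010, 2, 17)]

# Tally contributions per weekday name, then take the busiest day.
by_day = Counter(d.strftime("%A") for d in contributions)
busiest = by_day.most_common(1)[0][0]
print(busiest)
```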

This data was backed up from the interviews. One respondent mentioned the “weekly cue” explicitly.

More broadly, it seems that the site helped keep track of a number of discussions taking place around the web. Having been born from a discussion on Twitter, further conversations on Twitter resulted in further people signing up, along with comments threads and other online discussion. This fit the way the site was designed culturally – to be part of a network rather than asking people to do everything on-site.

But the planned technical connectivity of the site with the rest of the web (being able to pull related tweets or bookmarks, for example) had been dropped during development as we focused on core functionality. This was not a bad thing, I should emphasise, as it prevented us becoming distracted with ‘bells and whistles’ and allowed us to iterate in reaction to user activity rather than our own assumptions of what users would want. This research examines that user activity and informs future development accordingly.

The presence of ‘alpha’ users like James and Judith was crucial in driving activity on the site – a pattern observed in other successful investigations. They picked up the threads contributed by others and not only wove them together into a coherent narrative that allowed others to enter more easily, but also set the new challenges that provided ways for people to contribute. The fact that they brought with them a strong social network presence is probably also a factor – but one that needs further research.

The site has always been designed to emphasise the role of the user in driving investigations. The agenda is not owned by a central publisher, but by the person posing the question – and therefore the responsibility is theirs as well. In this sense it draws on Jenkins’ argument that “Consumers will be more powerful within convergence culture – but only if they recognise and use that power.” This cultural hurdle may be the biggest one that the site has to address.

Indeed, the site is also designed to offer “Failure for free”, allowing users to learn what works and what doesn’t, and begin to take on that responsibility where required.

The investigation also suited crowdsourcing well, as it could be broken down into separate parts and paths – most of which could be completed online: “Where does this claim come from?” “Can you find out about this person?” “What can you discover about this company?”. One person, for example, used Google Streetview to establish that the registered address of the company was a postbox.

Other investigations that are less easily broken down may be less suitable for crowdsourcing – or require more effort to ensure success.

A regular supply of updates provided the investigation with momentum. The accumulation of discoveries provided valuable feedback to users, who then returned for more. In his book on Wikipedia, Andrew Lih (2009 p82) notes a similar pattern – ‘stigmergy‘ – that is observed in the natural world: “The situation in which the product of previous work, rather than direct communication [induces and directs] additional labour”. An investigation without these ‘small pieces, loosely joined’ might not suit crowdsourcing so well.

One problem, however, was that those paths led to a range of potential avenues of enquiry. In the end, although the core questions were answered (was the publication a hoax and what were the bases for their claims) the investigation raised many more questions.

These remained largely unanswered once the majority of users felt that their questions had been answered. Like any investigation, there came a point at which those involved had to make a judgement whether they wished to invest any more time in it.

Finally, the investigation benefited from a diverse group of contributors who contributed specialist knowledge or access. Some physically visited stations where the newspaper was claiming distribution to see how many copies were being handed out. Others used advanced search techniques to track down details on the people involved and the claims being made, or to make contact with people who had had previous experiences with those behind the newspaper.

The visibility of the investigation online led to more than one ‘whistleblower’ approach providing inside information.

What can be done to make it better?

Looking at the reasons that users of the site as a whole gave for not contributing to an investigation, the majority cited ‘not having enough time’. Although at least one interviewee, in contrast, highlighted the simplicity and ease of contributing, contributing needs to be made as easy and simple as possible in order to lower the perceived effort and time required.

Notably, the second biggest reason for not contributing was a ‘lack of personal connection with an investigation’, demonstrating the importance of the individual and social dimension of crowdsourcing. Likewise, a ‘personal interest in the issue’ was the single largest factor in someone contributing. A ‘Why should I contribute?’ feature on each investigation may be worth considering.

Others mentioned the social dimension of crowdsourcing – the “sense of being involved in something together” – what Jenkins (2006) would refer to as “consumption as a networked practice”.

This motivation is also identified by Yochai Benkler in his work on networks. Looking at non-financial reasons why people contribute their time to online projects, he refers to “socio-psychological reward”. He also identifies the importance of “hedonic personal gratification”. In other words, fun. (Interestingly, these match two of the three traditional reasons for consuming news: because it is socially valuable, and because it is entertaining. The third – because it is financially valuable – neatly matches the third reason for working).

While it is easy to talk about “Failure for free”, more could be done to identify and support failing investigations. We are currently developing a monthly update feature that would remind users of recent activity and – more importantly – the lack of activity. The investigators in a group might be asked whether they wish to terminate the investigation in those cases, emphasising their role in its progress and helping ‘clean up’ the investigations listed on the first page of the site.

That said, there is also a danger of interfering too much in reducing failure. This is a natural instinct, and I have to continually remind myself that I started the project with an expectation of 95-99% of investigations ‘failing’ through a lack of motivation on the part of the instigator. That was part of the design. It was the 1-5% of questions that gained traction that would be the focus of the site (this is how Meetup works, for example – most groups ‘fail’, but there is no way to predict which ones; as it happens, the ‘success’ rate of investigations has been much higher than expected). One analogy is a news conference where members throw out ideas – only a few are chosen for investment of time and energy, the rest ‘fail’.

In the end, it is the management of that tension between interfering to ensure everything succeeds – and so removing the incentive for users to be self-motivated – and not interfering at all – leaving users feeling unsupported and unmotivated – that is likely to be the key to a successful crowdsourcing project. More than a year into the project, this is still a skill that I am learning.
