August 06 2012

07:38

A case study in online journalism part 2: verification, SEO and collaboration (investigating the Olympic torch relay)

corporate Olympic torchbearers image

Having outlined some of the data journalism processes involved in the Olympic torch relay investigation, in part 2 I want to touch on how verification and ‘passive aggressive newsgathering’ played a role.

Verification: who’s who

Data in this story not only provided leads which needed verifying, but also helped verify leads from outside the data.

In one example, an anonymous tip-off suggested that both children of one particular executive were carrying the Olympic torch on different legs of the relay. A quick check against his name in the data suggested this was so: two girls with the same unusual surname were indeed carrying the torch. Neither mentioned the company or their father. But how could we confirm it?

The answer involved checking planning applications, Google Streetview, and a number of other sources, including newsletters from the private school that they both attended which identified the father.

In another example, I noticed that one torchbearer had mentioned running alongside two employees of Aggreko, who were paying for their torches. I searched for other employees, and found a cake shop which had created a celebratory cake for three of them. Having seen how some corporate sponsors used their places, I went on a hunch and looked up the board of directors, searching in the data first for the CEO Rupert Soames. His name turned up – with no nomination story. A search for other directors found that more than half the executive board were carrying torches – which turned out to be our story. The final step: a call to the company to get a reaction and confirmation.
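
For readers who want to reproduce this kind of check, here is a minimal sketch in pandas. It assumes the scraped data sits in a CSV with `name` and `story` columns – an assumption about the schema, not the investigation’s actual code – and the director list below mixes the CEO named above with placeholder names.

```python
# A minimal sketch of the "check the board against the data" step; the CSV
# file and its column names are assumptions, and two director names are
# placeholders rather than real people.
import pandas as pd

torchbearers = pd.read_csv("torchbearers.csv")  # hypothetical scraped dataset

# Board members gathered from the company's own website (placeholders here)
directors = ["Rupert Soames", "Example Director One", "Example Director Two"]

matches = torchbearers[torchbearers["name"].isin(directors)]
print(matches[["name", "story"]])  # matching names plus any nomination story
```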

The more we knew about how torch relay places had been used, the easier it was to verify other torchbearers. As a pattern emerged of many coming from the telecoms industry, that helped focus the search – but we had to be aware that having suspicions ‘confirmed’ didn’t mean the name itself was confirmed – it simply meant we were more likely to hit a match we could verify.

Scepticism was important: at various times names seemed to match with individuals but you had to ask ‘Would that person not use his title? Why would he be nominated? Would he be that age now?’

Images helped – sometimes people used the same image that had been used elsewhere (you could match this with Google Images ‘match image’ feature, then refine the search). At other times you could match with public photos of the person as they carried the torch.

This post on identifying mystery torchbearers gives more detail.

Passive aggressive newsgathering

Alerts proved key to the investigation. Early on I signed up for daily alerts on any mention of the Olympic torch. 95% of stories were formulaic ‘local town/school/hero excited about torch’ reports, but occasionally key details would emerge in other pieces – particularly those from news organisations overseas.

Google Alerts for Olympic torch

It was from these that I learned exactly how many places Dow, Omega, Visa and others had, and how many people were nominated for them. It was how I learned about torchbearers who were not even listed on the official site, about the ‘criteria’ that some organisations were supposed to adhere to, about public announcements of places which suggested a change from previous numbers, and more besides.

As I came across anything that looked interesting, I bookmarked and tagged it. Some items were useful immediately, but most only became useful later, when I came to write up the full story. Essentially, they were pieces of a jigsaw I was yet to put together. (For example, this report mentioned that 2,500 employees were nominated within Dow for just 10 places. How must those employees feel when they find the company’s VP of Olympic operations took up one of the few places? Likewise, he fitted a broader pattern of sponsorship managers carrying the torch.)

I also subscribed to any mention of the torch relay in Parliament, and any mention in FOI requests.

SEO – making yourself findable

One of the things I always emphasise to my students is the importance of publishing early and often on a subject, to maximise the opportunities for others in the field to find out – and get in touch. This story was no exception. From the earliest stages through to the last week of the relay, users stumbled across the site as they looked for information on the relay – and passed on their concerns and leads.

It was particularly important with a big public event like the Olympic torch relay, which generated a lot of interest among local people. In the first week of the investigation one photographer stumbled across the site because he was searching for the name of one of the torchbearers we had identified as coming from adidas. He passed on his photographs – but more importantly, made me aware that there might be photographs of other executives who had already carried the torch.

That led to the strongest image of the investigation – two executives exchanging a ‘torch kiss’ (shown at the top of this post) – which was in turn picked up by The Daily Mail.

Other leads kept coming. The tip-off about the executive’s daughters mentioned above; someone mentioning two more Aggreko directors – one of whom had never been published on the official site, while the other had been listed and then removed. Questions about a Polish torchbearer who was not listed on the official site or, indeed, anywhere on the web other than the BBC’s torch relay liveblog. Challenges to one story we linkblogged, which led to further background that helped flesh out the processes behind the nominations given to universities.

When we published the ‘mystery torchbearers’ with The Guardian some got in touch to tell us who they were. In one case, that contact led to an interview which closed the book: Geoff Holt, the first quadriplegic to sail single-handed across the Atlantic Ocean.

Collaboration

I could have done this story the old-fashioned way: kept it to myself, done all the digging alone, and published one big story at the end.

It wouldn’t have been half as good. It wouldn’t have had the impact, it wouldn’t have had the range, and it would have missed key ingredients.

Collaboration was at the heart of this process. As soon as I started to unearth the adidas torchbearers I got in touch with The Guardian’s James Ball. His report the week after added reactions from some of the companies involved, and other torchbearers we’d simultaneously spotted. But James also noticed that one of Coca Cola’s torchbearers was a woman “who among other roles sits on a committee of the US’s Food and Drug Administration”.

It was collaborating with contacts in Staffordshire which helped point me to the ‘torch kiss’ image. They in turn followed up the story behind it (a credit for Help Me Investigate was taken out of the piece – it seems old habits die hard), and The Daily Mail followed up on that to get some further reaction and response (and no, they didn’t credit the Stoke Sentinel either). In Bournemouth and Sussex local journalists took up the baton (sorry), and the Times Higher did their angle.

We passed on leads to Ventnor Blog, whose users helped dig into a curious torchbearer running through the area. And we published a list of torchbearers missing stories in The Guardian, where users helped identify them.

Collaborating with an international mailing list for investigative journalists, I generated datasets of local torchbearers in Hungary, Italy, India, the Middle East, Germany, and Romania. German daily newspaper Der Tagesspiegel got in touch and helped trace some of the Germans.

And of course, within the Help Me Investigate network people were identifying mystery torchbearers, getting responses from sponsors, visualising data, and chasing interviews. One contributor in particular – Carol Miers – came on board halfway through and contributed some of the key elements of the final longform report – in particular the interview that opens the book, which I’ll talk about in the final part tomorrow.

August 02 2012

14:18

A case study in online journalism: investigating the Olympic torch relay

Torch relay places infographic by Caroline Beavon

For the last two months I’ve been involved in an investigation which has used almost every technique in the online journalism toolbox. From its beginnings in data journalism, through collaboration, community management and SEO, to ‘passive-aggressive’ newsgathering, verification and ebook publishing, it’s been a fascinating case study in such a range of ways that I’m going to struggle to get them all down.

But I’m going to try.

Data journalism: scraping the Olympic torch relay

The investigation began with the scraping of the official torchbearer website. It’s important to emphasise that this piece of data journalism didn’t take place in isolation – in fact, it was while working with Help Me Investigate the Olympics’ Jennifer Jones (coordinator for #media2012, the first citizen media network for the Olympic Games) and others that I stumbled across the torchbearer data. So networks and community are important here (more later).

Indeed, it turned out that the site couldn’t be scraped through a ‘normal’ scraper, and it was the community of the Scraperwiki site – specifically Zarino Zappia – who helped solve the problem and get a scraper working. Without both of those sets of relationships – with the citizen media network and with the developer community on Scraperwiki – this might never have got off the ground.
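
The actual ScraperWiki scraper isn’t reproduced here, but the general shape of a listing scraper looks something like the sketch below. The URL pattern and CSS selectors are hypothetical stand-ins; as noted above, the real site needed a less conventional approach.

```python
# A simplified sketch of a listing scraper, not the actual ScraperWiki code.
# The URL and CSS selectors below are hypothetical stand-ins for the real
# torchbearer site's structure.
import csv
import requests
from bs4 import BeautifulSoup

rows = []
for page in range(1, 6):  # in practice, loop until a page returns no results
    resp = requests.get("http://example.com/torchbearers", params={"page": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    for card in soup.select(".torchbearer"):  # hypothetical selector
        rows.append({
            "name": card.select_one(".name").get_text(strip=True),
            "hometown": card.select_one(".hometown").get_text(strip=True),
            "story": card.select_one(".story").get_text(strip=True),
        })

with open("torchbearers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "hometown", "story"])
    writer.writeheader()
    writer.writerows(rows)
```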

But it was also important to see the potential newsworthiness in that particular part of the site. Human stories were at the heart of the torch relay – not numbers. Local pride and curiosity were here too – key ingredients of any local newspaper. There were also the promises made by its organisers – had they been kept?

The hunch proved correct – this dataset would just keep on giving stories.

The scraper grabbed details on around 6,000 torchbearers. I was curious why more weren’t listed – yes, there were supposed to be around 800 invitations to high profile torchbearers including celebrities, who might reasonably be expected to be omitted at least until they carried the torch – but that still left over 1,000.

I’ve written a bit more about the scraping and data analysis process for The Guardian and the Telegraph data blog. In a nutshell, here are some of the processes used:

  • Overview (pivot table): where do most come from? What’s the age distribution?
  • Focus on details in the overview: what’s the most surprising hometown in the top 5 or 10? Who’s oldest and youngest? What about the biggest source outside the UK?
  • Start asking questions of the data based on what we know it should look like – and hunches
  • Don’t get distracted – pick a focus and build around it.
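
As a rough illustration of the first two steps, here is how the same overview questions might be asked in pandas rather than a spreadsheet pivot table. The column names (hometown, age, country) are assumptions about the scraped data, not the actual schema.

```python
# A rough pandas equivalent of the pivot-table overview; column names are
# assumed, not taken from the actual scraped dataset.
import pandas as pd

df = pd.read_csv("torchbearers.csv")

# Where do most torchbearers come from?
print(df["hometown"].value_counts().head(10))

# What does the age distribution look like?
print(df["age"].describe())

# Biggest sources outside the UK (assuming a country column exists)
print(df.loc[df["country"] != "UK", "country"].value_counts().head(5))
```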

The last point in that list is notable. As I looked for mentions of Olympic sponsors in nomination stories, I started to build up subsets of the data: a dozen people who mentioned BP, two who mentioned ArcelorMittal (the CEO and his son), and so on. Each was interesting in its own way – but where should you invest your efforts?

One story had already caught my eye: it was written in the first person and talked about having been “engaged in the business of sport”. It was hardly inspirational. As it mentioned adidas, I focused on the adidas subset, and found that the same story was used by a further six people – a third of all of those who mentioned the company.

Clearly, all seven people hadn’t written the same story individually, so something was odd here. And that made this more than a ‘rotten apple’ story – it was something potentially systemic.
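
A sketch of how that signal can be surfaced programmatically, again assuming the hypothetical CSV and column names used above:

```python
# A sketch of the duplicated-story check: filter to stories mentioning a
# sponsor, then look for identical stories shared by more than one person.
import pandas as pd

df = pd.read_csv("torchbearers.csv")

# Subset of torchbearers whose nomination story mentions the sponsor
adidas = df[df["story"].str.contains("adidas", case=False, na=False)]

# Identical stories shared by several people are the signal worth chasing
shared = adidas.groupby("story")["name"].agg(["count", list])
print(shared[shared["count"] > 1].sort_values("count", ascending=False))
```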

Signals

While the data was interesting in itself, it was important to treat it as a set of signals pointing to potentially more interesting exploration. Seven torchbearers having the same story was one of those signals. Mentions of corporate sponsors were another.

But there were many others too.

That initial scouring of the data had identified a number of people carrying the torch who held executive positions at sponsors and their commercial partners. The Guardian, The Independent and The Daily Mail were among the first to report on the story.

I wondered if the details of any of those corporate torchbearers might have been taken off the site afterwards. And indeed they had: seven disappeared entirely (many still had a profile if you typed in the URL directly, but could not be found through search or browsing), and a further two had had their stories removed.

Now, every time I scraped details from the site I looked for those who had disappeared since the last scrape, and those that had been added late.
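
In code, that scrape-to-scrape comparison reduces to a set difference. The sketch below assumes two dated CSV snapshots (hypothetical file names); in practice a profile URL or ID is a safer key than a bare name.

```python
# A sketch of comparing two scrapes to spot quiet removals and late additions.
# File names are hypothetical; a profile URL or ID would be a safer key than
# a name, which may not be unique.
import pandas as pd

previous = pd.read_csv("torchbearers_2012-06-01.csv")
latest = pd.read_csv("torchbearers_2012-06-08.csv")

removed = set(previous["name"]) - set(latest["name"])  # disappeared since last scrape
added = set(latest["name"]) - set(previous["name"])    # added late

print("Removed since last scrape:", sorted(removed))
print("Added since last scrape:", sorted(added))
```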

One, for example – who shared a name with a very senior figure at one of the sponsors – appeared just once before disappearing four days later. I wouldn’t have spotted them if they – or someone else – hadn’t been so keen on removing their name.

Another time, I noticed that a new torchbearer had been added to the list with the same story as the seven adidas torchbearers. He turned out to be the Group Chief Executive of the country’s largest catalogue retailer, providing “continuing evidence that adidas ignored LOCOG guidance not to nominate executives.”

Meanwhile, the proportion of torchbearers running without any nomination story went from just 2.7% in the first scrape of 6,056 torchbearers, to 7.2% of 6,891 torchbearers in the last week – and 8.1% of all torchbearers who appeared between the two dates, including those who had appeared and then disappeared.

Many were celebrities or sportspeople where perhaps someone had taken the decision that they ‘needed no introduction’. But many also turned out to be corporate torchbearers.

By early July the number of these ‘mystery torchbearers’ had reached 500 and, having identified only a fifth of them, we published the list through The Guardian datablog.

There were other signals, too, where knowing the way the torch relay operated helped.

For example, logistics meant that overseas torchbearers often carried the torch in the same location. This led to a cluster of Chinese torchbearers in Stansted, Hungarians in Dorset, Germans in Brighton, Americans in Oxford and Russians in North Wales.

As many corporate torchbearers were also based overseas, this helped narrow the search, with Germany’s corporate torchbearers in particular leading to an article in Der Tagesspiegel.

I also had the idea to total up how many torchbearers appeared each day, to identify days when details on unusually high numbers of torchbearers were missing – thanks to Adrian Short – but it became apparent that variation due to other factors such as weekends and the Jubilee made this worthless.

However, the percentage of torchbearers missing stories on each day did help (visualised below by Caroline Beavon), as this also identified days when large numbers of overseas torchbearers were carrying the torch. I cross-referenced this with the ‘mystery torchbearer’ spreadsheet to see how many had already been checked, and which days still needed attention.
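
The per-day percentage is a simple grouped calculation. The sketch below assumes the scraped data records the relay date for each torchbearer; the date and story column names are assumptions.

```python
# A sketch of the per-day "missing story" percentage; the date/story column
# names are assumed, not taken from the actual dataset.
import pandas as pd

df = pd.read_csv("torchbearers.csv", parse_dates=["date"])
df["missing_story"] = df["story"].isna() | (df["story"].str.strip() == "")

per_day = df.groupby(df["date"].dt.date)["missing_story"].agg(["mean", "count"])
per_day["pct_missing"] = (per_day["mean"] * 100).round(1)

# Days with unusually high percentages are the ones to check first
print(per_day.sort_values("pct_missing", ascending=False).head(10))
```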

Daily totals - bar chart

But the data was just the beginning. In the second part of this case study, I’ll talk about the verification process.

13:26

Stable at Last, PANDA Reaches 1.0!

Eleven months ago, we began prototyping PANDA. The PANDA project aims to make basic data analysis quick and easy for news organizations, and make data sharing simple. I hacked for a month on an experimental version, verified that our technology choices worked, and then threw it out and started over. Since that time, development has proceeded in steady, week-long iterations, checkpointed by numerous releases and two-day long PANDA team-planning sessions. We've implemented every feature from our "must have" list, a large chunk of our "want" list, and even one or two off our "not likely" list (in response to user feedback).

Today, I'm pleased to announce that we have reached the end of our road map: PANDA Version 1.0 is ready!

YOU CAN HAVE A PANDA NOW

If you've been taking a wait-and-see approach to getting PANDA in your newsroom, now is the time to see. Version 1.0 is the most polished release we've ever done. Among the highlights:

  • New user-oriented documentation at pandaproject.net.
  • No more default user accounts. A setup mode allows you to configure an admin user after installation.
  • Search for data within categories.
  • Additional metadata for datasets, including "related links."
  • Many, many, many bug fixes.

To get started with PANDA now, head over to our installation docs.

I have one month left to keep working full-time on PANDA. That means you have a month to get personalized help with any issues you encounter while setting up. If you get started now, I'll be answering your emails, tracking your bugs, and logging your future development requests. If you wait, you may have to get in line.

Still not persuaded? Check out an awesome presentation from Nolan Hicks, San Antonio Express-News reporter and PANDA beta tester.

Every newsroom can be a data-friendly newsroom. Get started with PANDA now.

04:37

Open data: Will citizen publishing of digital information trump online news?

Lean Back 2.0 | Economist Group :: Open data – not subject to costly licenses for access – is part of a growing phenomenon where citizens depend, not on government, but on their own production of digital information to tell stories. Crowdsourcing, using digital platforms, does not need to involve paid contributors (although it sometimes does in the case of Amazon’s Mechanical Turk). It does not require professional training as a journalist or data analyst. It seems to work faster than traditional publishing to galvanize people into social action.

Public policy and open data - A report by Robin Mansell, LSE, www.economistgroup.com

May 06 2012

07:57

Data journalism: Four ways to slice Obama’s 2013 budget proposal

A great example of how data journalism can help people get excited about “boring” issues.

New York Times :: "How $3.7 Trillion is Spent" - Explore every nook and cranny of President Obama's federal budget proposal.

Explore the interactive graphic Shan Carter, www.nytimes.com

07:49

'Journalism in the Age of Data': A 54min video documentary online

Perfect for rainy Sundays (and other similar days): a 54-minute video documentary.

Stanford :: A video report on data visualization as a storytelling medium. Produced during a 2009-2010 Knight Journalism Fellowship. Total running time: 54 min with related information and links.

Watch it here datajournalism.stanford.edu

07:38

Data journalism at the Guardian: What is it and how do we do it?

It is an “older” article in a fast-growing field, but I recommend reading it if you would like to find your way into data journalism.

Guardian :: Has data journalism become curation? - Sometimes. There's now so much data out there in the world that we try to provide the key facts for each story - and finding the right information can be as much of a lengthy journalistic task as finding the right interviewee for an article. We've started providing searches into world government data and international development data.

Our 10 point guide to data journalism - Continue to read Simon Rogers, www.guardian.co.uk

07:24

Why should journalists use data? Data journalism at a glance – a free handbook online

Data Journalism Handbook :: What This (free) Book Is (And What It Isn’t): The Data Journalism Handbook is intended to be a useful resource for anyone who thinks that they might be interested in becoming a data journalist, or dabbling in data journalism. Lots of people have contributed to writing it, and through our editorial we have tried to let their different voices and views shine through. We hope that it reads like a rich and informative conversation about what data journalism is, why it is important, and how to do it.

Download here - Continue to read datajournalismhandbook.org

May 03 2012

16:17

Big Data: Did you know? Twitter was faster at tracking the spread of cholera in Haiti

GigaOM :: There are expected to be more than 9 billion people on the planet by 2050, and you can expect that type of population growth to strain the world’s resources like energy, water and food. But it turns out big data tools will be particularly adept at helping organizations track — and attempt to solve — severe shortages in these resources.

Did you know?

A recent study by medical researchers at Harvard showed that Twitter was substantially faster at tracking the spread of cholera in Haiti than more traditional methods

Continue to read Katie Fehrenbacher, gigaom.com

April 27 2012

13:04

Free Data Journalism Handbook launched tomorrow

Data Journalism Handbook

I’ve contributed to a “free, open-source book that aims to help journalists to use data to improve the news” – and it will be published online tomorrow (Saturday 28th April).

The Data Journalism Handbook was coordinated by the European Journalism Centre and the Open Knowledge Foundation (in particular Liliana Bounegru), and includes contributions from:

“Dozens of data journalism’s leading advocates and best practitioners – including from Australian Broadcasting Corporation, the BBC, the Chicago Tribune, Deutsche Welle, the Guardian, the Financial Times, Helsingin Sanomat, La Nacion, the New York Times, ProPublica, the Washington Post, the Texas Tribune, Verdens Gang, Wales Online, Zeit Online and many others.”

The book will be available for download at datajournalismhandbook.org under a Creative Commons Attribution ShareAlike License. There will also be a printed and e-book version published by O’Reilly Media.

April 26 2012

04:07

Baby Steps in Data Journalism

This is a Tumblr blog that I started in early April 2012: Baby Steps in Data Journalism.

Its purpose is to collect links and other useful information related to learning about data journalism.

To go directly to something you might be searching for, try these links:

How I installed Python (April 2012): Mac OS

A review of basic Unix commands:

  1. Baby Steps in Unix/Linux Part 1: Listing
  2. Baby Steps in Unix/Linux Part 2: Change Directories
  3. Baby Steps in Unix/Linux Part 3: Your Home Directory

Major categories so far:

Other tagged categories:

Eventually I will organize some of this material into new sections here at Journalists’ Toolkit. But for the near future, it’s all at the “Baby Steps” site.

April 23 2012

11:40

Guide: How to start in a data journalist role

Online Journalism Blog :: Following my previous posts on the network journalist and community manager roles as part of an investigation team, this post expands on the first steps a student journalist can take in filling the data journalist role.

Step by step guide - Continue to read Paul Bradshaw, onlinejournalismblog.com

07:23

Step by step: how to start in a data journalist role

Investigations team flowchart

Following my previous posts on the network journalist and community manager roles as part of an investigation team, this post expands on the first steps a student journalist can take in filling the data journalist role.

1. Brainstorm data that might be relevant to your investigation or field

Before you begin digging for data, it’s worth mapping out the territory you’re working in. Some key questions to ask include:

  • Who measures or monitors your field? For example:
  • Where is spending recorded? This might be at both a local and national level.
  • What are the key things that might be measured in your field? For example, in prisons they might be interested in reoffending, or overcrowding, or staffing.
  • Can you find historical data?
  • What data do you need to provide basic context? e.g.
    • Where – addresses for all institutions in your field (e.g. schools, prisons, etc.)
    • Codes – often these are used instead of institution or area names
    • Who – names of those responsible for particular aspects of your field
    • Demographics – the distribution of age, gender, ethnicity, industries, wealth, property or other elements may be important to your work
    • Politics – who is in charge in each area (local authority and local MP)
  • How could you collate data that doesn’t exist? E.g. public awareness of something; or how the policies of different bodies compare, etc.

Sometimes the simplest and quickest way to find out these things is to pick up the phone and speak to someone in a relevant organisation and ask them: what information is collected about your field, and by whom?

You can also make content from this process of research: post a guide to how your field is regulated and measured (and what information isn’t); who’s who in your field - the regulators, monitors, politicians and bodies that all have a hand in keeping it on track.

2. Learn advanced techniques to obtain that data

Once you’ve mapped it all out you can start to prioritise the datasets that are most relevant to your particular investigation. You may need to use different techniques to get hold of these, including:

Again, you can make content from this process, for example: “How we found…” or “Why we’re asking the MoJ for…” (with a link to the FOI request) or “Get the data” (here’s how to publish data online)

The flow chart below (from this previous post) helps guide you to the relevant techniques for your data:

Gathering data: a flow chart for data journalists

3. Pull out the parts of data relevant to your field/investigation

For example:

4. Add value to the data

Here are just some suggestions. You can use one or many:

Any of these provide useful opportunities for posting new content with the new contextual information (e.g. “How the data on X was gathered”), new combined data (“Now with QOF data”), or the issues that they raise (“Why schools data may be worthless”).

5. Communicate the story in the data

I’ve written separately about the different ways of communicating data stories, so you can read that here. In short, human case studies are helpful, and visualisation is often useful.

And it’s at this point that you can also link to the further detail provided in all the content you’ve written in the previous 4 steps: How you got the data, the wider context, the specific data that’s of interest, the more detailed expert analysis or background, and so on.

05:43

Data journalism: Miso Project or how to create your own Guardian-style data visualisations

Guardian :: Here on the Guardian's data team, we've wanted to help you visualise our data and create new viz styles for a long time. And now, thanks to some great work by the Guardian's Interactive team, that dream has moved one step closer. This week, developers Alastair Dant and Alex Graul launched the first part of the Miso project. In this piece, Alex explains it.

Continue to read www.guardian.co.uk

April 20 2012

06:26

Programming and journalism students: A conversation

I think it’s pretty cool to use Storify to sort out the threads of a bunch of simultaneous conversations on Twitter:

[View the story "Programming and journalism students: A conversation" on Storify]

Please join in — on Twitter, on Facebook, or here.

April 19 2012

13:34

When data goes bad

Data is so central to the decision-making that shapes our countries, jobs and even personal lives that an increasing amount of data journalism involves scrutinising problems with the data itself. Here’s an illustrative list of cases where bad data became the story – and the lessons they can teach data journalists:

Deaths in police custody unrecorded

This investigation by the Bureau of Investigative Journalism demonstrates an important question to ask about data: who decides what gets recorded?

In this case, the BIJ identified “a number of cases not included in the official tally of 16 ‘restraint-related’ deaths in the decade to 2009 … Some cases were not included because the person has not been officially arrested or detained.”

As they explain:

“It turns out the IPCC has a very tight definition of ‘in custody’ –  defined only as when someone has been formally arrested or detained under the mental health act. This does not include people who have died after being in contact with the police.

“There are in fact two lists. The one which includes the widely quoted list of sixteen deaths in custody only records the cases where the person has been arrested or detained under the mental health act. So, an individual who comes into contact with the police – is never arrested or detained – but nonetheless dies after being restrained, is not included in the figures.

“… But even using the IPCC’s tightly drawn definition, the Bureau has identified cases that are still missing.”

Cross-checking the official statistics against wider reports was a key technique. So was using the Freedom of Information Act to request the details behind them, and the details of those “who died in circumstances where restraint was used but was not necessarily a direct cause of death”.

Cooking the books on drug-related murders

Drug related murders in Mexico

Cross-checking statistics against reports was also used in this investigation by Diego Valle-Jones into Mexican drug deaths:

“The Acteal massacre committed by paramilitary units with government backing against 45 Tzotzil Indians is missing from the vital statistics database. According to the INEGI there were only 2 deaths during December 1997 in the municipality of Chenalho, where the massacre occurred. What a silly way to avoid recording homicides! Now it is just a question of which data is less corrupt.”

Diego also used Benford’s Law to identify potentially fraudulent data – a technique that has also been used to highlight relationships between dodgy company data and real-world events such as the dotcom bubble and deregulation.
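
For readers unfamiliar with the technique, a Benford’s Law check simply compares observed first-digit frequencies with the expected logarithmic distribution. The sketch below uses made-up figures; real checks need far larger samples before deviations mean anything.

```python
# A minimal first-digit (Benford's Law) check with made-up example figures.
import math
from collections import Counter

figures = [112, 97, 230, 1450, 88, 301, 76, 1203, 540, 19, 187, 264]  # example data

first_digits = [int(str(abs(n))[0]) for n in figures if n != 0]
observed = Counter(first_digits)
total = len(first_digits)

print("digit  observed  expected")
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)  # Benford's expected frequency
    print(f"{d}      {observed[d] / total:.3f}     {expected:.3f}")
```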

Poor records mean no checks

Detective Inspector Philip Shakesheff exposed a “gap between [local authority] records and police data”, reported The Sunday Times in a story headlined ‘Care home loses child 130 times’:

“The true scale of the problem was revealed after a check of records on police computers. For every child officially recorded by local authorities as missing in 2010, another seven were unaccounted for without their absence being noted.”

Why is it important?

“The number who go missing is one of the indicators on which Ofsted judges how well children’s homes are performing and the homes have a legal duty to keep accurate records.

“However, there is evidence some homes are failing to do so. In one case, Ofsted gave a good report to a private children’s home in Worcestershire when police records showed 1,630 missing person reports in five years. Police stationed an officer at the home and pressed Ofsted to look closer. The home was downgraded to inadequate and it later closed.

“The risks of being missing from care are demonstrated by Zoe Thomsett, 17, who was Westminster council’s responsibility. It sent her to a care home in Herefordshire, where she went missing several times, the final time for three days. She had earlier been found at an address in Hereford, but because no record was kept, nobody checked the address. She died there of a drugs overdose.

“The troubled life of Dane Edgar, 14, ended with a drugs overdose at a friend’s house after he repeatedly went missing from a children’s home in Northumberland. Another 14-year-old, James Jordan, was killed when he absconded from care and was the passenger in a stolen car.”

Interests not registered

When there are no formal checks on declarations of interest, how can we rely on them? In Chile, the Ciudadano Inteligente Fundacion decided to check the Chilean MPs’ register of assets and interests by building a database:

“No-one was analysing this data, so it was incomplete,” explained Felipe Heusser, executive president of the Fundacion. “We used technology to build a database, using a wide range of open data and mapped all the MPs’ interests. From that, we found that nearly 40% of MPs were not disclosing their assets fully.”

The organisation has now launched a database that “enables members of the public to find potential conflicts of interest by analysing the data disclosed through the members’ register of assets.”

Data laundering

Tony Hirst’s post about how dodgy data was “laundered” by Facebook in a consultant’s report is a good illustration of the need to ‘follow the data’:

“We have some dodgy evidence, about which we’re biased, so we give it to an “independent” consultant who re-reports it, albeit with caveats, that we can then report, minus the caveats. Lovely, clean evidence. Our lobbyists can then go to a lazy policy researcher and take this scrubbed evidence, referencing it as finding in the Deloitte report, so that it can make its way into a policy briefing.”

“Things just don’t add up”

In the video below, Ellen Miller of the Sunlight Foundation takes the US government to task over the inconsistencies in its transparency agenda, and the flawed data published on USAspending.gov – so flawed that the foundation launched the Clearspending website to automate and highlight the discrepancies between two sources of the same data.

Key budget decisions made on useless data

Sometimes data might appear to tell an astonishing story, but this turns out to be a mistake – and that mistake itself leads you to something much more newsworthy, as Channel 4’s FactCheck found when it started trying to find out if councils had been cutting spending on Sure Start children’s centres:

“That ought to be fairly straightforward, as all councils by law have to fill in something called a Section 251 workbook detailing how much they are spending on various services for young people.

“… Brent Council in north London appeared to have slashed its funding by nearly 90 per cent, something that seemed strange, as we hadn’t heard an outcry from local parents.

“The council swiftly admitted making an accounting error – to the tune of a staggering £6m.”

And they weren’t the only ones. In fact, the Department for Education admitted the numbers were “not very accurate”:

“So to recap, these spending figures don’t actually reflect the real amount of money spent; figures from different councils are not comparable with each other; spending in one year can’t be compared usefully with other years; and the government doesn’t propose to audit the figures or correct them when they’re wrong.”

This was particularly important because the S251 form “is the document the government uses to reallocate funding from council-run schools to its flagship academies”:

“The Local Government Association (LGA) says less than £250m should be swiped from council budgets and given to academies, while the government wants to cut more than £1bn, prompting accusations that it is overfunding its favoured schools to the detriment of thousands of other children.

“Many councils’ complaints, made plain in responses to an ongoing government consultation, hinge on DfE’s use of S251, a document it has variously described as “unaudited”, “flawed” and “not fit for purpose”.

No data is still a story

Sticking with education, the TES reports on the outcome of an FOI request on the experience of Ofsted inspectors:

“[Stephen] Ball submitted a Freedom of Information request, asking how many HMIs had experience of being a secondary head, and how many of those had led an outstanding school. The answer? Ofsted “does not hold the details”.

““Secondary heads and academy principals need to be reassured that their work is judged by people who understand its complexity,” Mr Ball said. “Training as a good head of department or a primary school leader on the framework is no longer adequate. Secondary heads don’t fear judgement, but they expect to be judged by people who have experience as well as a theoretical training. After all, a working knowledge of the highway code doesn’t qualify you to become a driving examiner.”

“… Sir Michael Wilshaw, Ofsted’s new chief inspector, has already argued publicly that raw data are a key factor in assessing a school’s performance. By not providing the facts to back up its boasts about the expertise of its inspectors, many heads will remain sceptical of the watchdog’s claims.”

Men aren’t as tall as they say they are

To round off, here’s a quirky piece of data journalism by dating site OkCupid, which looked at the height of its members and found an interesting pattern:

Male height distribution on OKCupid

“The male heights on OkCupid very nearly follow the expected normal distribution—except the whole thing is shifted to the right of where it should be.

“Almost universally guys like to add a couple inches. You can also see a more subtle vanity at work: starting at roughly 5′ 8″, the top of the dotted curve tilts even further rightward. This means that guys as they get closer to six feet round up a bit more than usual, stretching for that coveted psychological benchmark.”

Do you know of any other examples of bad data forming the basis of a story? Please post a comment – I’m collecting examples.
