April 19 2012

09:08

When data goes bad

Incorrect-statistics

Image by Lauren York

Data is so central to the decision-making that shapes our countries, jobs and even personal lives that an increasing amount of data journalism involves scrutinising the problems with the very data itself. Here’s an illustrative list of cases where bad data became the story – and the lessons they can teach data journalists:

Deaths in police custody unrecorded

This investigation by the Bureau of Investigative Journalism demonstrates an important question to ask about data: who decides what gets recorded?

In this case, the BIJ identified “a number of cases not included in the official tally of 16 ‘restraint-related’ deaths in the decade to 2009 … Some cases were not included because the person has not been officially arrested or detained.”

As they explain:

“It turns out the IPCC has a very tight definition of ‘in custody’ –  defined only as when someone has been formally arrested or detained under the mental health act. This does not include people who have died after being in contact with the police.

“There are in fact two lists. The one which includes the widely quoted list of sixteen deaths in custody only records the cases where the person has been arrested or detained under the mental health act. So, an individual who comes into contact with the police – is never arrested or detained – but nonetheless dies after being restrained, is not included in the figures.

“… But even using the IPCC’s tightly drawn definition, the Bureau has identified cases that are still missing.”

Cross-checking the official statistics against wider reports was a key technique, as was using the Freedom of Information Act to request the details behind them and the details of those “who died in circumstances where restraint was used but was not necessarily a direct cause of death”.
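As a rough illustration of that cross-check, here is a minimal Python sketch. It assumes you have typed up two lists of names yourself, one from the official tally and one from press or inquest reports; the function names and the placeholder entries are invented for the example and are not the Bureau's method or data.

```python
def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(name.lower().split())


def missing_from_official(official_names, reported_names):
    """Return reported names that have no counterpart in the official list."""
    official = {normalise(n) for n in official_names}
    return [n for n in reported_names if normalise(n) not in official]


# Placeholder entries only: these are not real cases.
official_list = ["Person A", "Person B"]
press_reports = ["person  a", "Person C"]

print(missing_from_official(official_list, press_reports))  # ['Person C']
```

Anything the function returns is only a lead: each unmatched name still needs checking by hand, because spelling variants or differing definitions can produce false gaps.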

Cooking the books on drug-related murders

Drug related murders in Mexico
Cross-checking statistics against reports was also used in this investigation by Diego Valle-Jones into Mexican drug deaths:

“The Acteal massacre committed by paramilitary units with government backing against 45 Tzotzil Indians is missing from the vital statistics database. According to the INEGI there were only 2 deaths during December 1997 in the municipality of Chenalho, where the massacre occurred. What a silly way to avoid recording homicides! Now it is just a question of which data is less corrupt.”

Diego also used Benford’s Law to identify potentially fraudulent data, a technique that has also been used to highlight relationships between dodgy company data and real-world events such as the dotcom bubble and deregulation.
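For readers who have not met it, Benford’s Law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so around 30% of values start with a 1. The Python sketch below is a generic illustration of a first-digit check, not Diego’s own code; the function names are made up for the example and it assumes the input is a list of whole-number counts.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Share of values expected to start with `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)


def first_digit_shares(values):
    """Observed share of each leading digit among non-zero whole numbers."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_deviation(values):
    """Observed minus expected share for each leading digit."""
    observed = first_digit_shares(values)
    return {d: observed[d] - benford_expected(d) for d in range(1, 10)}


# Powers of two are a standard example of numbers that follow the law closely.
sample = [2 ** k for k in range(100)]
for digit, gap in benford_deviation(sample).items():
    print(digit, round(gap, 3))
```

Large gaps are a reason to look more closely rather than proof of fraud, and the test only makes sense for figures that span several orders of magnitude.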

Poor records mean no checks

Detective Inspector Philip Shakesheff exposed a “gap between [local authority] records and police data”, reported The Sunday Times in a story headlined ‘Care home loses child 130 times’:

“The true scale of the problem was revealed after a check of records on police computers. For every child officially recorded by local authorities as missing in 2010, another seven were unaccounted for without their absence being noted.”

Why is it important?

“The number who go missing is one of the indicators on which Ofsted judges how well children’s homes are performing and the homes have a legal duty to keep accurate records.

“However, there is evidence some homes are failing to do so. In one case, Ofsted gave a good report to a private children’s home in Worcestershire when police records showed 1,630 missing person reports in five years. Police stationed an officer at the home and pressed Ofsted to look closer. The home was downgraded to inadequate and it later closed.

“The risks of being missing from care are demonstrated by Zoe Thomsett, 17, who was Westminster council’s responsibility. It sent her to a care home in Herefordshire, where she went missing several times, the final time for three days. She had earlier been found at an address in Hereford, but because no record was kept, nobody checked the address. She died there of a drugs overdose.

“The troubled life of Dane Edgar, 14, ended with a drugs overdose at a friend’s house after he repeatedly went missing from a children’s home in Northumberland. Another 14-year-old, James Jordan, was killed when he absconded from care and was the passenger in a stolen car.”

Interests not registered

When there are no formal checks on declarations of interest, how can we rely on them? In Chile, the Ciudadano Inteligente Fundacion decided to check the Chilean MPs’ register of assets and interests by building a database:

“No-one was analysing this data, so it was incomplete,” explained Felipe Heusser, executive president of the Fundacion. “We used technology to build a database, using a wide range of open data and mapped all the MPs’ interests. From that, we found that nearly 40% of MPs were not disclosing their assets fully.”

The organisation has now launched a database that “enables members of the public to find potential conflicts of interest by analysing the data disclosed through the members’ register of assets.”

Data laundering

Tony Hirst’s post about how dodgy data was “laundered” by Facebook in a consultant’s report is a good illustration of the need to ‘follow the data’.

“We have some dodgy evidence, about which we’re biased, so we give it to an “independent” consultant who re-reports it, albeit with caveats, that we can then report, minus the caveats. Lovely, clean evidence. Our lobbyists can then go to a lazy policy researcher and take this scrubbed evidence, referencing it as finding in the Deloitte report, so that it can make its way into a policy briefing.”

“Things just don’t add up”

In a video, Ellen Miller of the Sunlight Foundation takes the US government to task over the inconsistencies in its transparency agenda and the flawed data published on its USAspending.gov site – so flawed that the Foundation launched the Clearspending website to automate and highlight the discrepancies between two sources of the same data.

Key budget decisions made on useless data

Sometimes data might appear to tell an astonishing story, but this turns out to be a mistake – and that mistake itself leads you to something much more newsworthy, as Channel 4’s FactCheck found when it started trying to find out if councils had been cutting spending on Sure Start children’s centres:

“That ought to be fairly straightforward, as all councils by law have to fill in something called a Section 251 workbook detailing how much they are spending on various services for young people.

“… Brent Council in north London appeared to have slashed its funding by nearly 90 per cent, something that seemed strange, as we hadn’t heard an outcry from local parents.

“The council swiftly admitted making an accounting error – to the tune of a staggering £6m.”

And they weren’t the only ones. In fact, the Department for Education admitted the numbers were “not very accurate”:

“So to recap, these spending figures don’t actually reflect the real amount of money spent; figures from different councils are not comparable with each other; spending in one year can’t be compared usefully with other years; and the government doesn’t propose to audit the figures or correct them when they’re wrong.”

This was particularly important because the S251 form “is the document the government uses to reallocate funding from council-run schools to its flagship academies”:

“The Local Government Association (LGA) says less than £250m should be swiped from council budgets and given to academies, while the government wants to cut more than £1bn, prompting accusations that it is overfunding its favoured schools to the detriment of thousands of other children.

“Many councils’ complaints, made plain in responses to an ongoing government consultation, hinge on DfE’s use of S251, a document it has variously described as “unaudited”, “flawed” and “not fit for purpose”.”
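The kind of sanity check that surfaced the Brent figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the 50% threshold and the council names and amounts are all invented for the example, and it is not a description of how FactCheck actually worked.

```python
def flag_large_changes(previous, current, threshold=0.5):
    """Return councils whose spending moved by at least `threshold` (as a fraction)."""
    flagged = {}
    for council, before in previous.items():
        after = current.get(council)
        # Skip councils with no figure for the later year or a zero baseline.
        if after is None or before == 0:
            continue
        change = (after - before) / before
        if abs(change) >= threshold:
            flagged[council] = change
    return flagged


# Invented figures for illustration only.
last_year = {"Council X": 1000, "Council Y": 1000}
this_year = {"Council X": 100, "Council Y": 950}

print(flag_large_changes(last_year, this_year))  # {'Council X': -0.9}
```

A flagged change is a question to put to the council, not a finding: as the Brent case shows, the explanation may be an accounting error rather than a real cut.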

No data is still a story

Sticking with education, the TES reports on the outcome of an FOI request on the experience of Ofsted inspectors:

“[Stephen] Ball submitted a Freedom of Information request, asking how many HMIs had experience of being a secondary head, and how many of those had led an outstanding school. The answer? Ofsted “does not hold the details”.

““Secondary heads and academy principals need to be reassured that their work is judged by people who understand its complexity,” Mr Ball said. “Training as a good head of department or a primary school leader on the framework is no longer adequate. Secondary heads don’t fear judgement, but they expect to be judged by people who have experience as well as a theoretical training. After all, a working knowledge of the highway code doesn’t qualify you to become a driving examiner.”

“… Sir Michael Wilshaw, Ofsted’s new chief inspector, has already argued publicly that raw data are a key factor in assessing a school’s performance. By not providing the facts to back up its boasts about the expertise of its inspectors, many heads will remain sceptical of the watchdog’s claims.”

Men aren’t as tall as they say they are

To round off, here’s a quirky piece of data journalism by dating site OkCupid, which looked at the height of its members and found an interesting pattern:

Male height distribution on OKCupid

“The male heights on OkCupid very nearly follow the expected normal distribution—except the whole thing is shifted to the right of where it should be.

“Almost universally guys like to add a couple inches. You can also see a more subtle vanity at work: starting at roughly 5′ 8″, the top of the dotted curve tilts even further rightward. This means that guys as they get closer to six feet round up a bit more than usual, stretching for that coveted psychological benchmark.”

Do you know of any other examples of bad data forming the basis of a story? Please post a comment – I’m collecting examples.

UPDATE (April 20 2012): A useful addition from Simon Rogers: Named and shamed: the worst government annual reports explains why government department spending reports fail to support the Government’s claimed desire for an “army of armchair auditors”, with a list of the worst offenders at the end.

Filed under: online journalism Tagged: bad data, benford's law, BIJ, bureau of investigative journalism, Channel 4, Chile, Ciudadano Inteligente Fundacion, Clearspending, data laundering, dating, Deaths in custody, ellen miller, FactCheck, Felipe Heusser, height, IPCC, Lauren York, missing children, OKCupid, Philip Shakesheff, register of interests, S251, sex trafficking, simon rogers, sunday times, sunlight foundation, tony hirst

March 29 2012

13:06

Comparing apples and oranges in data journalism: a case study

A must-read for any data journalist, aspiring or otherwise, is Simon Rogers’ post on The Guardian Datablog where he compares public and private sector pay.

This is a classic apples-and-oranges situation where politicians and government bodies are comparing two things that, really, are very different. Is a private school teacher really comparable to someone teaching in an unpopular school? What is the private sector equivalent of a director of public health or a social worker?

But if these issues are being discussed, journalists must try to shed some light, and Simon Rogers does a great job in unpicking the comparisons. From pay and hours worked, to qualifications and age (big differences in both), and gender and pay inequality (more women in the public sector, more lower- and higher-paid workers in the private sector), Rogers crunches all the numbers:

“[T]he proportion of low skill jobs in the private sector has increased, and the proportion of high skill jobs in the public sector increased to around 31% of all jobs by 2011, compared 26% of all private sector jobs.

“But, at the same time, people who are most highly qualified actually get paid worse in the public sector.

“… Public sector workers tend to be older … Average mean hourly earnings peak in the early 40s in both sectors. They decline slightly approaching retirement although the decline happens earlier in the private sector than in the public sector, possibly because the higher earners in the private sector are more likely to leave the labour market earlier.

“It also shows that if you’re older in the public sector, you get paid better than in the private sector.

“… [T]he bottom 5% of workers in the public sector earn less than £6.91 per hour, whereas in the private sector, 5% of workers earn less than £5.93 per hour.”

When you find yourself in an apples-and-oranges situation you can’t avoid, this is the way to do it. Any other examples?

July 16 2011

08:45

FAQ: How can broadcasters benefit from online communities?

Here’s another set of questions I’m answering in public in case anyone wants to ask the same:

How can broadcasters benefit from online communities?

Online communities contain many individuals who will be able to contribute different kinds of value to news production. Most obviously, expertise, opinion, and eyewitness testimony. In addition, they will be able to more effectively distribute parts of a story to ensure that it reaches the right experts, opinion-formers and eyewitnesses. The difference from an audience is that a community tends to be specialised, and connected to each other.

If you rephrase the question as ‘How can broadcasters benefit from people?’ it may be clearer.

How does a broadcaster begin to develop an engaged online community, any tips?

Over time. Rather than asking about how you develop an online community ask yourself instead: how do you begin to develop relationships? Waiting until a major news event happens is a bad strategy: it’s like waiting until someone has won the lottery to decide that you’re suddenly their friend.

Journalists who do this well do a little bit every so often – following people in their field, replying to questions on social networks, contributing to forums and commenting on blogs, and publishing blog posts which are helpful to members of that community rather than simply being about ‘the story’ (for instance, ‘Why’ and ‘How’ questions behind the news).

If you are aware of networks in the Middle East, do you think they are tapping into online communities and social media adequately?

I don’t know the networks well enough to comment – but I do think it’s hard for corporations to tap into communities; it works much better at an individual reporter level.

Can you mention any models, whether news channels or entertainment television, that have developed successful online communities? Why do they work?

The most successful examples tend to be newspapers: I think Paul Lewis at The Guardian has done this extremely successfully, and I think Simon Rogers’ Data Blog has also developed a healthy community around data and visualisation. Both of these are probably due in part to the work of Meg Pickard there around community in general.

The BBC’s UGC unit is a good example from broadcasting – although that is less about developing a community than about providing platforms for others to contribute, and a way for journalists to quickly find expertise in those communities. More specifically, Robert Peston and Rory Cellan-Jones use their blogs and Twitter accounts well to connect with people in their fields.

Then of course there’s Andy Carvin at NPR, who is an exemplar of how to do it in radio. There’s so much written about what he does that I won’t repeat it here.

What are the reasons that certain broadcasters cannot connect successfully with online communities?

I expect a significant factor is regulation which requires objectivity from broadcasters but not from newspapers. If you can’t express an opinion then it is difficult to build relationships, and if you are more firmly regulated (which broadcasting is) then you take fewer risks.

Also, there are more intermediaries in broadcasting and fewer reporters who are public-facing, which for some journalists in broadcasting makes the prospect of speaking directly to the former audience that much more intimidating.

June 03 2011

09:51

Speaker presentations: Session 2A – Developing the data story

Here are the presentations from Session 2A – ‘Developing the data story’, at last week’s news:rewired conference.

The session featured: Professor Paul Bradshaw, visiting professor, City University and founder, helpmeinvestigate.com; Alastair Dant, lead interactive technologist, the Guardian; Federica Cocco, editor, OWNI.eu; Conrad Quilty-Harper, data reporter, the Telegraph. Moderated by Simon Rogers, editor, Guardian datablog and datastore.

Paul Bradshaw, visiting professor, City University, London


Federica Cocco, editor, OWNI.eu

http://owni.eu/2011/05/25/a-map-to-freedom-the-internet-in-europe/
http://influencenetworks.org/
http://owni.fr/2011/04/18/carte-biens-mal-acquis-kadhafi-ben-ali/
http://wikileaks.owni.fr/
http://app.owni.fr/warlogs/
http://warlogs.owni.fr/
http://statelogs.owni.fr/
http://owni.eu/2011/03/04/app-fortress-europe-a-deadly-exodus/

Conrad Quilty-Harper, data mapping reporter, the Telegraph


Alastair Dant, lead interactive technologist, the Guardian



See the full session on video

May 27 2011

11:58

LIVE: Session 2A – Developing the data story

We have Matt Caines and Ben Whitelaw from Wannabe Hacks liveblogging for us at news:rewired all day. You can follow session 2A ‘Developing the data story’, below.

Session 2A features: Professor Paul Bradshaw, visiting professor, City University and founder, helpmeinvestigate.com; Alastair Dant, lead interactive technologist, the Guardian; Federica Cocco, editor, OWNI.eu; Conrad Quilty-Harper, data reporter, the Telegraph. Moderated by Simon Rogers, editor, Guardian datablog and datastore.

10:46

LIVE: Session 1A – The data journalism toolkit

We have Matt Caines and Ben Whitelaw from Wannabe Hacks liveblogging for us at news:rewired all day. You can follow session 1A ‘The data journalism toolkit’, below.

Session 1A features: Kevin Anderson, data journalism trainer and digital strategist; James Ball, data journalist, Guardian investigations team; Martin Stabe, interactive producer, FT.com; Simon Rogers, editor, Guardian datablog and datastore. Moderated by David Hayward, head of journalism programme, BBC College of Journalism.

news:rewired – Session 1A: The data journalism kit

December 21 2010

15:26

Videos: Linked data and the semantic web

Courtesy of the BBC College of Journalism, we’ve got video footage from all of our sessions at news:rewired – beyond the story, 16 December 2010.

We’ll be grouping the video clips by session – you can view all footage by looking at the multimedia category on this site.

Martin Moore

Martin Belam

Simon Rogers

Silver Oliver

December 16 2010

15:05

LIVE: Linked data and the semantic web

We’ll have Matt Caines and Nick Petrie from Wannabe Hacks liveblogging for us at news:rewired all day. Follow individual posts on the news:rewired blog for up to date information on all our sessions.

We’ll also have blogging over the course of the day from freelance journalist Rosie Niven.

December 04 2010

11:22

FAQ: Data journalism, laziness, information overload & localism

I seem to have lost the habit of publishing interview responses here under the FAQ category for the past year, but the following questions from a journalist, and my answers, were worth publishing in case anyone has the same questions:

Simon Rogers, Editor of the Datablog, said that he thinks in the future simply publishing the raw data will become acceptable journalism. Do you not think that an approach like this to raw data is lazy journalism? And equally, do you think that would be a type of journalism that the public will really be able to engage with?

It’s not lazy at all, and to think otherwise is pure journalistic egoism. We have a tendency to undervalue things because we haven’t invested our own effort in them, but the value lies in their usefulness, not in the effort. Increasingly I think being a journalist will be as much about making journalism possible for other people as it will be about creating that journalism yourself. You have to ask yourself: do I just want to write pretty stories, or allow people to hold power to account?

In a world where we can access information directly I think it’s a central function of journalists to make important information findable. The first level of that is to publish raw data.

It’s interesting to see that this seems to be a key principle for hyperlocal bloggers – making civic information findable.

The second level – if you have the time and resources – is then to analyse that raw data and pull stories out of it. But ultimately there will always be other ‘stories’ in the information that people want to find for themselves, which may be too specific to be of interest to the journalist or publisher.

The third level – which really requires a lot of investment – is to create tools that make it easier for the user to find what they want, to make it easier to understand (e.g. through visualisation), and to share it with others.

Do you think that a lot of the information can be quite overwhelming and sometimes not go anywhere?

Of course, but that isn’t a reason for not publishing the information. It’s natural that when the information is released some of it will attract more attention than other parts – but also, if other questions come up in future there is a dataset that people can go back and interrogate even if they didn’t at the time.

At the moment we have a lot of data but very few tools to interrogate it. That’s going to change – just in the last 6 months we’ve seen some fantastic new tools for filtering data, and the momentum is building in this area. It’s notable how many of the bids for the Knight News Challenge were data-related.

Additionally, do you think The Guardian will continue to pursue stories from the masses of data as consistently as they have done in previous years?

Yes, I think the Guardian has now built a reputation in this field and will want to maintain that, not to mention the fact that its reputation means it will attract more and more data-related stories, and benefit from the work of people outside the organisation who are interrogating data. They’ll also get better and better as they learn from experience.

And why do you think that smaller news resources struggle to use this sort of information as a source for news?

Partly because data has historically been more national than local. Even now I get frustrated when I find a dataset but then discover it’s only broken down into England, Wales, Scotland and Northern Ireland. But we are now finally getting more and more local data.

Also, at a local level journalists tend to be less specialised. On a national newspaper you might have a health or environment or financial reporter who is more used to dealing with figures and data. On a local newspaper that’s less likely – and there’s a high turnover of staff because of the low wages.

August 05 2010

17:00

How The Guardian is pioneering data journalism with free tools

The Guardian takes data journalism seriously. They obtain, format, and publish journalistically interesting data sets on their Data Blog, they track transparency initiatives in their searchable index of world government data, and they do original research on data they’ve obtained, such as their amazing in-depth analysis of 90,000 leaked Afghanistan war documents. And they do most of this with simple, free tools.

Data Blog editor Simon Rogers gave me an action-packed interview in The Guardian’s London newsroom, starting with story walkthroughs and ending with a philosophical discussion about the changing role of data in journalism. It’s a must-watch if you’re wondering what the digitization of the world’s facts means for a newsroom. Here’s my take on the highlights; a full transcript is below.

The technology involved is surprisingly simple, and mostly free. The Guardian uses public, read-only Google Spreadsheets to share the data they’ve collected, which require no special tools for viewing and can be downloaded in just about any desired format. Visualizations are mostly via Many Eyes and Timetric, both free.

Data Blog posts are often related to or supporting of news stories, but not always. Rogers sees the publishing of interesting data as a journalistic act that stands alone, and is clear on where the newsroom adds value:

I think you have to apply journalistic treatment to data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

The Guardian curates far more data than it creates. Some data sets are generated in-house, such as its yearly executive pay surveys, but more often the data already exists in some form, such as a PDF on a government web site. The Guardian finds such documents, scrapes the data into spreadsheets, cleans it, and adds context in a Data Blog post. But they also maintain an index of world government data which scrapes open government web sites to produce a searchable index of available data sets.

“Helping people find the data, that’s our mission here,” says Rogers. “We want people to come to us when they’re looking for data.”

In alignment with their open strategy, The Guardian encourages re-use and mashups of their data. Readers can submit apps and visualizations that they’ve created, but data has proven to be just as popular with non-developers — regular folks who want the raw information.

Sometimes readers provide additional data or important feedback, typically through the comments on each post. Rogers gives the example of a reader who wrote in to say that the Academy schools listed in his area in a Guardian data set were in wealthy neighborhoods, raising the journalistically interesting question of whether wealthier schools were more likely to take advantage of this charter school-like program. Expanding on this idea, Rogers says,

What used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that.

Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute.

So you can get stories back from them, in a way…If you put the information out there, you always get a return. You get people coming back.

Perhaps surprisingly, data also gets pretty good traffic, with the Data Blog logging a million hits a month during the recent election coverage. “In the firmament of Guardian web sites that’s not bad. That’s kind of upper tier,” says Rogers. “And this is only after being around for a year.” (The even younger Texas Tribune also finds its data pages popular, accounting for a third of total page views.)

Rogers and I also discussed the process of getting useful data out of inept or uncooperative governments, the changing role of data specialists in the newsroom, and how the Guardian tapped its readers to produce the definitive database of Doctor Who villains. Here’s the transcript, lightly edited.

JS: All right. So. I’m here with Simon Rogers in the Guardian newsroom in London, and you’re the editor of the Data Blog.

SR: That’s right, and I’m also a news editor so I work across the organization on data journalism, essentially.

JS: So, first of all, can you tell us what the Data Blog is?

SR: Ok, well basically it came about because, as I said I was a news editor working a lot with graphics, and we realized we were just collecting enormous amounts of data. And we thought, well wouldn’t our readers be interested in seeing that? And when the Guardian Open Platform launched, it seemed a good time to think about opening up– we were opening up the Guardian to technical development, so it seemed a good time to open up our data collections as well.

And also it’s the fact that increasingly we’ve found people are after raw information. If you looked– and there’s lots of raw information online, but if you start searching for that information you just get bewildering amounts of replies back. If you’re looking for, say, carbon emissions, you get millions of entries back. So how do you know what the right set of data is? Whereas we’ve already done that set of work for our readers, because we’ve had to find that data, and we’ve had to choose it, and make an editorial selection about it, I suppose. So we thought we were able to cut out the middle man for people.

But also we kind of thought when we launched it, actually, what we’d be doing is creating data for developers. There seemed to be a lot of developers out there at that point who were interested in raw information, and they would be the people who would use the data blog, and the open platform would get a lot more traffic.

And what actually happened, what’s been interesting about it, is that– what’s actually happened is that it’s been real people who have been using the Data Blog, as much as developers. Probably more so than developers.

JS: What do you mean “real people”?

SR: Real people, I suppose what I mean is that, somebody who’s just interested in finding out what a number is. So for instance, here at the moment we’ve got a big story about a government scheme for building schools, which has just been cut by the new government. It was set up by the old government, who invested millions of pounds into building new school buildings. And so, we’ve got the full list of all the schools, but the parliamentary constituency that they’re in, and where they are and what kind of project they were. And that is really, really popular today, that’s one of our biggest things, because there’s a lot of demonstrations about it, it’s a big issue of the day. And so I would guess that 90% of people looking at it are just people who want to find out what the real raw data is.

And that’s the great thing about the internet, it gives you access to the raw, real information. And I think that’s what people really crave. They want the interpretation and the analysis from people, but they also want the veracity of seeing the real thing, without having it aggregated or put together. They just want to see the raw data.

JS: So you publish all of the original numbers that you get from the government?

SR: Well exactly. The only time– with the Data Blog, I try to make it as newsy as possible. So it’s often hooked around news stories of the day. Partly because it helps the traffic, and you’re kind of hooking on to existing requirements.

Obviously we do– it’s just a really eclectic mix of data. And I can show you the screen, for a sec.

JS: All right. Let’s see something.

SR: Okay, so this is the data blog today. So obviously we’ve got Afghanistan at the top. Afghanistan is often at the top at the moment. This is a full list of everybody who’s died, every British casualty who’s died and been wounded over time. So you’ve got this data here. We use, I tend to use a lot of third party services. This is a company called Timetric, who are very good at visualizing time series data. It takes about five minutes to create that, and you can roll over and get more information.

JS: So is that a free service?

SR: Yeah, absolutely free, you just sign up, and you share it. It works a bit like Many Eyes, you know the IBM service.

JS: Yeah.

SR: We’ll embed these Google docs. We use Google docs, Google spreadsheets to share all our information because it’s very easy for people to download it. So say you want to download this data. You click on the link, and it will take you through in a second to, there you go, it’s the full Google spreadsheet. And you’ve got everything on here. You’ve got, these are monthly totals, which you can’t get anywhere else, because nobody else does that information.

JS: What do you mean nobody else does it?

SR: Well nobody else bothers to put it together month by month. You can get totals by year from, iCasualties I think do it, but we’ve just collected some month by month, because often we’ve had to draw graphics where it’s month by month. It’s the kind of thing, actually it’s quite interesting to be able to see which month was the worst for casualties.
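The month-by-month tallying Rogers describes can be sketched in a few lines. This is only an illustration under assumptions: the dates are invented, the function names are hypothetical, and it is not the Guardian's actual spreadsheet or method.

```python
from collections import Counter
from datetime import date

def monthly_totals(event_dates):
    """Count events per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in event_dates)
    return sorted(counts.items())

def worst_month(event_dates):
    """Return the ((year, month), count) pair with the highest count."""
    return max(monthly_totals(event_dates), key=lambda item: item[1])

# Invented dates, for illustration only.
sample = [
    date(2010, 1, 5), date(2010, 1, 20),
    date(2010, 2, 3),
    date(2010, 6, 1), date(2010, 6, 15), date(2010, 6, 30),
]

print(monthly_totals(sample))
print(worst_month(sample))
```

Keeping the year in the key matters: grouping on the month number alone would merge January of one year with January of the next.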

We’ve got lists of names, which obviously are in a few places. We collect Afghanistan wounded statistics which are terribly confused in the UK, because what they do is they try and make them as complicated as possible. So, the most serious ones, NOTICAS is where your next of kin is notified. That’s a serious event, but also you’ve got all those people evacuated. So anyway, this kind of data. We also keep amputation data, which is a new set that the government refused to release until recently, and a Guardian reporter was instrumental in getting this data released. So we kind of thought, maybe we should make this available for people.

So you get all this data, and then what you can do, if you click on “File” there, you can download it as Excel, XML, CSV, or whatever format you want. So that’s why we use Google spreadsheets. It’s the kind of thing that’s a very, very easily accessible format for people.

So really what we do is we try and encourage a community, a community to grow up around data and information. So every post has got a talk facility on it.

Anyway, going through it. So this is today’s Data Blog, where you’ve got Afghanistan, Academy schools in the UK. The schools are run by the state, pretty much.

JS: So just to clarify this for the American audience, what’s an Academy school?

SR: Ok, well basically in the UK most schools are state schools, that most children go to. State schools are, we all pay for them, they’re paid for out of our taxes. And they’re run at a local level, which obviously has its advantages because it means that you are, kind of, working to an area. What the new government’s proposing to do is allow any school that wants to to become an Academy. And what an Academy is is a school that can run its own finances, and own affairs.

And what we’ve got is we’ve got the data, the government’s published the data — as a PDF of course because governments always publish everything as a PDF, in this country anyway — and what they give you, which we’ve scraped here, is a list of every school in the UK which has expressed an interest. So you’ve got the local authority here, the name of the school, type of school, the address, and the post code. Which is great, because that’s good data, and because it’s on a PDF we can get that into a spreadsheet quite easily.

JS: So did you have to type in all of those things from a PDF, or cut and paste them?

SR: Good god no. No, no, we have, luckily we’ve got a really good editorial support team here, who are, thanks to the Data Blog, are becoming very experienced at getting data off of PDFs. Because every government department would much rather publish something as a PDF, so they can act as if they’re publishing the data but really it’s not open.

JS: So that’s interesting, because in the UK and the US there’s this big government publicity about, you know, we’re publishing all this data.

SR: Absolutely.

JS: But you’re saying that actually–

SR: It’s not 100 percent yet. So, I’ll show you in a second that what they tend to do is just publish– most government departments still want to publish stuff as PDFs. They can’t quite get out of that thing. Or want to say, why would somebody want a spreadsheet? They don’t really get it. A lot of people don’t get it.

And, we wanted the spreadsheet so you can do stuff like this, which is, this is a map of schools interested in becoming Academies by area. And so because we have that raw data in spreadsheet form we can work out how many in the area. You can see suddenly that this part of England, Kent, has 99 schools, which is the biggest in the country. And only one area, which is Barking, up here, in London, which is, sorry, is down here in London, but anyway that has no schools applying at all.
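Once the list is in spreadsheet form, the per-area count behind a map like this is a simple tally. The sketch below assumes a CSV export with a column named local_authority; that column name, the sample rows and the helper functions are all hypothetical, not the Guardian's data or tooling.

```python
import csv
import io
from collections import Counter

def schools_per_area(csv_text, area_column="local_authority"):
    """Count rows (schools) per area in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[area_column].strip() for row in reader)

def areas_with_none(counts, all_areas):
    """List areas that appear in the reference list but have no schools counted."""
    return sorted(area for area in all_areas if counts.get(area, 0) == 0)

# Invented rows, for illustration only.
sample_csv = """local_authority,school_name,school_type
Area A,First School,Primary
Area A,Second School,Secondary
Area B,Third School,Primary
"""

counts = schools_per_area(sample_csv)
print(counts.most_common(1))
print(areas_with_none(counts, ["Area A", "Area B", "Area C"]))
```

Finding areas with no applicants needs a separate reference list of every area, because an area with zero schools never appears in the data itself.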

And the government also always said that at the beginning that it would mainly be schools which weren’t “outstanding” would apply. But actually if you look at the figures, which again, we can do, the majority of them are outstanding schools. So they’re already schools which are good, which are applying to become academies. Which kind of isn’t the point. But that kind of analysis, that’s data journalism in a sense. It’s using the numbers to get a story, and to tell a story.

JS: And how long did that story take you to put together? To get the numbers, and do the graphics, and…?

SR: Well, I was helped a bit, because I got, I’ve had one of my helpers who works in editorial support to get the data onto a spreadsheet. And in terms of creating the graphic we have a fantastic tool here, which is set up by one of our technical development team who are over there, and what it does, is it allows you to paste a load of data, geographic data, into this box, and you tell it what kind, is it parliamentary constituency, or local authority, or educational authority, or whatever, however the different regional differentiations we have in the UK, and it will draw a map for you. So this map here was drawn by computer, basically, and then one of the graphics guys help sort out the labels and finesse it and make it look beautiful. But it saves you the hard work of coloring up all those things. So actually that took me maybe a couple of hours. In total.

JS: How about getting the data, how long did that take?

SR: Oh well luckily that data– you know the government makes the data available. But like I say, as a PDF file. So this is the government site, and that’s the list there, and you open it, it opens as a PDF. Because we’ll link to that.

But luckily the guys in the ESD [editorial services department] are very adept now, because of the Data Blog, at getting data into spreadsheets. So, you know they can do that in 20 minutes.

JS: So how many people are working on data overall, then?

SR: Well, in terms of– it’s my full time job to do it. I’m lucky in that I’ve got an awful lot of people around here who have got an interest who I can kind of go and nudge, and ask. It’s a very informal basis, and we’re looking to formalize that, at the moment. We’re working on a whole data strategy, and where it goes. So we’re hoping to kind of make all of these arrangements a bit more formal. But at the moment I have to fit into what other people are doing. But yeah, we’ve got a good team now that can help, and that’s really a unique thing.

So I was going through the Data Blog for you. So this is a typical, a weird day, so schools, and then we’ve got another schools thing because it’s a big schools day today. This is school building projects scrapped by constituency, full list. Now, this is another where the government didn’t make the data easily available. The department for education published a list of all the school projects that were going to be stopped when the government cut the funding, some of which is going towards creating Academy schools, which is why this is a bit of an issue in the country at the moment. And we want to know by constituency how it was working. So which MPs were having the most school projects cut, in their constituency. And we couldn’t get that list out of the department of education, but one MP had lodged it with the House of Commons library. So we managed to get it from the House of Commons library. But it didn’t come in a good form, it came in a PDF again, so again we had to get someone from tech to sort it out for us.

But the great thing is that we can do something like this, which is a map of projects stopped by constituency, by MP. And most of the projects we’ve stopped were in Labour seats. As you know Labour are not in power at the moment. So we can do some of this sort of analysis which is great. So there were 418 projects stopped in Labour constituent seats, and 268 stopped in Conservative seats. So basically 40% of Labour MPs had a project stopped, at least one project stopped in their seat, compared to only 27% of Conservatives, and 24% of the Lib Dems who are in power at the moment.

JS: So would it be accurate to say the data drove this story, or showed this story, or…?

SR: Data showed this story, which is great, but the one thing, the caveat — of course, the raw numbers are never 100% — the caveat was there were more projects going on in Labour areas because Labour government, previous government which is Labour set up the projects, and they gave more projects to Labour areas. So you can read it either way.

JS: And you said this in the story?

SR: We said this in the story. Absolutely. We always try and make the caveats available for people. So that’s a big story today, because there are demonstrations about it in London. You’ve come to us on a very education-centered day today.
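The caveat Rogers raises is a base-rate problem: a party can have more seats affected simply because more projects were under way in its seats. A rough sketch of the two measures follows, using invented seats and hypothetical field names ("party", "projects", "stopped"); none of these figures come from the Guardian's data.

```python
from collections import defaultdict

def share_of_seats_affected(seats):
    """Fraction of each party's seats with at least one stopped project."""
    total = defaultdict(int)
    affected = defaultdict(int)
    for seat in seats:
        total[seat["party"]] += 1
        if seat["stopped"] > 0:
            affected[seat["party"]] += 1
    return {party: affected[party] / total[party] for party in total}

def stop_rate(seats):
    """Stopped projects as a fraction of all projects under way, per party."""
    planned = defaultdict(int)
    stopped = defaultdict(int)
    for seat in seats:
        planned[seat["party"]] += seat["projects"]
        stopped[seat["party"]] += seat["stopped"]
    return {party: stopped[party] / planned[party]
            for party in planned if planned[party] > 0}

# Invented seats, for illustration only.
seats = [
    {"party": "X", "projects": 4, "stopped": 2},
    {"party": "X", "projects": 2, "stopped": 1},
    {"party": "Y", "projects": 2, "stopped": 1},
    {"party": "Y", "projects": 0, "stopped": 0},
]

print(share_of_seats_affected(seats))
print(stop_rate(seats))
```

In this made-up example every seat held by party X is affected against half of party Y's, yet the proportion of projects stopped is the same for both. That is why the same figures can be read either way.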

But there’s other stuff on the blog too. This is a very British thing. We did this because we thought it would be an interesting project to do. I had somebody in for a week and they didn’t have much to do so I got them to make a list of every Doctor Who villain ever.

JS: This was an intern project?

SR: This was an intern project. We kinda thought, yeah, we’ll get a bit of traffic. And we’ve never had so much involvement in a single piece ever. It’s had 500 retweets, and when you think most pieces will get 30 or 40, it’s kind of interesting. The traffic has been through the roof. And the great thing is, so we created–

JS: Ooh, what’s this? This is good.

SR: It’s quite an easy– we use ManyEyes quite a lot, which is very very quick to create lovely little graphics. And this is every single Doctor Who villain since the start of the program, and how many times they appear. So you see the Daleks lead the way in Doctor Who.

JS: Yeah, absolutely.

SR: Followed by the Cybermen, and the Master’s in there a lot. And there are lots of other little things. But we started off with about 106 villains in total, and now we’re up to– we put it out there and we said to people, we know this isn’t going to be the complete list, can you help us? And now we’ve got 212. So my weekend has basically been– I’ll show you the data sheet, it’s amazing. You can see the comments are incredible. You see these kinds of things, “so what about the Sea Devils? The Zygons?” and so on.

And I’ll show you the data set, because it’s quite interesting. So this is the data set. Again Google docs. And you can see over here on the right hand side, this is how many people are looking at it at any one time. So at that moment there are 11 people looking on. There could be 40 or 50 people looking at any one moment. And they’re looking and they’re helping us make corrections.

JS: So, wait– this data set is editable?

SR: No, we haven’t made it editable, because we’ve had a bad experience with people coming to editable ones and mucking around, you know, putting swear words on stuff.

JS: So how do they help you?

SR: Well they’ll put stuff in the comments field and I’ll go in and put it on the spreadsheet. Because I want a sheet that people can still download. So now we’ve got, we’re now up to 203. We’ve doubled the amount of villains thanks to our readers. It’s Doctor Who. And it just shows we’re an eclectic– we’re a broad church on the Data Blog. Everything can be data. And that’s data. We’ve got number of appearances per villain, and it’s a program that people really care about. And it’s about as British as it’s possible to get. But then we also have other stuff too– and there we go, crashed again.

JS: Well let me just ask you a few questions, and take this opportunity to ask you some broader questions. Because we can do this all day. And I have. I’ve spent hours on your data blog because I’m a data geek. But let’s sort of bring it to some general questions here.

SR: Okay. Go for it.

JS: So first of all, I notice you have the Data Blog, you also have the world data index.

SR: Yes. Now the idea of that was that, obviously lots of governments around the world have started to open up their data. And around the time that the British government was– a lot of developers here were involved in that project — we started to think, what can we do around this that would help people, because suddenly we’ve got lots of sites out there that are offering open government data. And we thought, what if we could just gather them all together into one place. So you’ve got a single search engine. And that’s how we set up the world data search. Sorry to point you at the screen again.

JS: No that’s fine, that’s fine.

SR: Basically, so what we did, we started off with just Australia, New Zealand, UK and America. And basically what this site does, is it searches all of these open government data sites. Now we’ve got Australia, Toronto in Canada, New Zealand, the UK, London, California, San Francisco, and data.gov.

So say you search for “crime,” say you’re interested in crime. There you go. So you come back here, you see you’ve got results here from the UK, London, you’ve got results from data.gov in America, San Francisco, New Zealand and Australia. Say you’re interested in just seeing– you live in San Francisco and you’re only interested in San Francisco results. You’ve three results. And there you go, you click on that.

And you’re still within the Guardian site because what we’re asking people to do is help us rank the data, and submit visualizations and applications. So we want people to tell us what they’ve done with the data.

But anyway if you go and click on that, and you click on “download,” and it will start downloading the data for you. Or, what it will do is take you to the terms and conditions. We don’t bypass any T&Cs. The T&C’s come alongside. But you click on that, you agree to that, and then you get the data. So we really try and make it easy for people. There you go. And this is the crime incidence data. Very variable. This is great because it’s KML files, so if you wanted to visualize that you get really great information. It’s all sorts of stuff. Sometimes it’s CSVs.

JS: What’s a KML file?

SR: So, Google Earth.

JS: Okay.

SR: Sorry. So, it’s mapping, a mapping file straight away.

SR: Okay, so one of the things we ask people to do is to submit visualizations and applications they’ve produced. So for instance, London has some very very good open data. If you haven’t looked around the Data Store, it’s really worth going to. And one of these things they do is they provide a live feed of all the London traffic cameras. You can watch them live. And this is a lovely thing, because what somebody’s done is they’ve written an iPad application. So you can watch live TFL, Transport for London, traffic cameras on your iPad.

And you see that data set has been rated. A couple of people have gone in there and rated it. You’ve got a download button, the download is XML. So we try and help people around this data. And this is growing now. Every time somebody launches an open government data site we’re gonna put it on here, and we’re working on a few more at the moment. So we want it to be the place that people go to. Every time you Google “world government data” it pops up at the top, which is what you want. You want people who are just trying to compare different countries and don’t know where to start, to help them find a way through this maze of information that’s out there.

JS: So do you intend to do this for every country in the world?

SR: Every country in the world that launches an open government data site, we’ll whack it on here. And we’re working– at the moment there are about 20 decent open government data sites around the world. We’re picking those up. We’ve got on here now, how many have we got? One, two, three, four, five, six, seven, eight. We’ll have 20 on in the next couple of weeks. We’re really working through them at the moment.

And what this does is, it scrapes them. So basically, we don’t– for us it’s easy to manage because we don’t have to update these data sets all the time. The computer does that for us. But basically, what we do provide people with is context and background information, because you’re part of the data site there.

JS: So let me make sure I have this clear. So you’re not sucking down the actual data, you’re sucking down the list and descriptions of the data sets available?

SR: Absolutely. So we’re providing people, because basically we want it to be as updated as possible. We don’t– if we just uploaded onto our site, that would kind of be pointless, and it would mean it would be out of date. This way, if something pops up on data.gov and stays there, we’ll get it quick on here. We’ll help people find it. Helping people find the data, that’s our mission here. It’s not just generating traffic, it’s to help people find the information, because we want people to come to us when they’re looking for data.
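
The arrangement described here, indexing listings and descriptions while leaving the data itself on the originating site, can be sketched as a small catalogue with a search that optionally filters by source. This is only a sketch under stated assumptions: the `DatasetEntry` fields, the `search` function, the sample entries and the `example.org` URLs are all hypothetical, and nothing here reflects how the Guardian's index was actually built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One listing in the index: metadata only, never the data itself."""
    source: str        # which open-data site the listing came from
    title: str
    description: str
    url: str           # link back to the originating site


# Hypothetical catalogue entries with placeholder URLs.
CATALOGUE = [
    DatasetEntry("San Francisco", "Crime incidents",
                 "Reported crime incidents as KML", "https://example.org/sf/crime"),
    DatasetEntry("London", "Traffic cameras",
                 "Live feed of traffic cameras as XML", "https://example.org/london/cameras"),
    DatasetEntry("data.gov", "Crime statistics",
                 "National crime figures as CSV", "https://example.org/us/crime"),
]


def search(catalogue, term, source=None):
    """Case-insensitive match on title or description, optionally limited to one source."""
    term = term.lower()
    hits = [
        entry for entry in catalogue
        if term in entry.title.lower() or term in entry.description.lower()
    ]
    if source is not None:
        hits = [entry for entry in hits if entry.source == source]
    return hits


for entry in search(CATALOGUE, "crime", source="San Francisco"):
    print(entry.source, "|", entry.title, "|", entry.url)
```

Because each entry stores only a link, a listing stays current as long as the originating site does, which matches the reason Rogers gives for not copying the data sets themselves.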

JS: So, okay. You’ve talked about, it sounds like, two different projects. The Data Blog, where you collect and clean up and present data that you–

SR: That we find interesting. We’re selective.

JS: In the process of the Guardian’s newsgathering.

SR: Yeah, and just things that are interesting anyway. So the Doctor Who post that we were just looking at is just interesting to do. It’s not anything we’re going to do a story about. And often they’ll be things that are in the news, say that day, and I’ll think “oh that’s a good thing to put on the Data Blog.” So it could be crime figures, or it could be– and sometimes, the side effect of that is a great side effect because you end up with a piece in the paper, or a piece on the web site. But often it might be the Data Blog is the only place to get that information.

JS: And you index world government data sites.

SR: Yeah, absolutely.

JS: Does the Guardian do anything else with data?

SR: Yeah, well what we do is, we’re doing a lot of Guardian research with data. So what we want to do is give people a kind of way into that. So for instance, we do do a lot of data-based projects. So for instance we’re doing an executive pay survey of all the biggest companies, how much they pay their bosses and their chief executives. That has always been a thing the paper’s always done for stories. And now what we’ll do is we’ll make that stuff available– that data available for people. So instead of just raw data journalism, it’s quite old data journalism. We’ve been doing it for ten years. But we used to just call it a survey. Now it’s data journalism, because it’s getting stories out of numbers. So we’ll work with that, and we’ll publish that information for people to see. And there are a couple of big projects coming up this week, which I really can’t tell you about, but next week it will be obvious what they are.

JS: Probably by the time this goes up we’ll be able to link to them.

[Simon was referring to the Guardian's data journalism work on the leaked Afghanistan war logs, described in a thorough post on the Data Blog.]

SR: Yeah, I’ll mail you about them. But we’ve got now an area of expertise. So increasingly what I’m finding is that I’m getting people coming to me within The Guardian, saying, so we’ve got this spreadsheet, well how can I do this? So for instance that Academies thing we were just looking at, we were really keen to find out which areas were the most, where the most schools were, for the paper. The correspondent wanted to know that. So actually, because we’ve got this area of expertise now in managing data, we’re becoming kind of a go-to place within The Guardian, for journalists who are just writing stories where they need to know something, or they need to find some information out, which is an interesting side effect. Because it used to be that journalists were kind of scared of numbers, and scared of data. I really think that was the case. And now, increasingly, they’re trying to embrace that, and starting to realize you can get stories out of it.

JS: Well that’s really interesting. Let’s talk for a minute about how this applies to other newsrooms, because it’s– as you say, journalists have been traditionally scared of data.

SR: Yeah, absolutely. You could say they prided themselves, in this country anyway, they prided themselves on lack of mathematical ability. I would say.

JS: Which seems unfortunate in this era.

SR: Yeah, absolutely. Yeah, yeah, absolutely.

JS: But especially a lot of our readers are from smaller newsrooms, and so what kind of technical capability do you need to start tracking data, and publishing data sets?

SR: I think it’s really minimal. I mean, the thing is that actually, what we’re doing is really working with a basic, most of the time just basic spreadsheet packages. Excel or whatever you’ve got. Excel is easy to use, but it could be any package really. And we’re using Google spreadsheets, which again is widely available for people to do information. We’re using visualization tools which are again, ManyEyes or Timetric which are widely available and easy to use. I think what we’re doing is just bringing it together.

I think traditionally that journalists wouldn’t regard data journalism as journalism. It was research. Or, you know, how is publishing data– is that journalism? But I think now, what is happening is that actually, what used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that. Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute. So you can get stories back from them, in a way. So we’re receiving the information much more.

JS: So you publish the data, and then other people build stories out of it, is that what you’re saying?

SR: Other people will let us know– well, we publish say, well that’s an interesting story, or this is a good visualization. We’ve published data for other people to visualize. We thought, that’s quite an interesting thing to mash it up with, we should do that ourselves. So there’s that thing, and there’s also the fact that if you put the information out there, you always get a return. You get people coming back.

So for instance the Academies thing today that we were talking about. We’ve had people come back saying, well I live in Derbyshire and I know that those schools are in quite wealthy areas. So we start to think, well is there a trend towards schools in wealthy areas going to this, and schools in poorer areas not going to this.

So it gives you extra stories or extra angles on stories you wouldn’t think of. And I think that’s part of it. And I think partly there’s just the realization that just publishing data in itself, because it’s interesting, is a journalistic enterprise. Because I think you have to apply journalistic treatment to that data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

JS: So last question here, which is of course going to be on many editors’ and publishers’ minds.

SR: Sure.

JS: Let’s talk about traffic and money. How does this contribute to the business of The Guardian?

SR: Okay, it’s a new– it’s an experiment for us, but traffic-wise it’s been pretty healthy. We’ve had– during the election we were getting a million page impressions in a month. Which is not bad. On the Data Blog. Now, as a whole, out of the 36 million that The Guardian gets, it doesn’t seem like a lot. But actually, in the firmament of Guardian web sites that’s not bad. That’s kind of upper tier. And this is only after being around for a year.

So in terms of what it gives us, it gives the same as producing anything that produces traffic gives us. It’s good for the brand, and it’s good for The Guardian site. In the long run, I think that there is probably canny money to be made out of there, for organizations that can manage and interpret data. I don’t know exactly how, but I think we’d have to be pretty dumb if we don’t come up with something. I’d be very surprised. It’s an area where there’s such a lot of potential. There are people who don’t really know how to manage data and don’t really know how to organize data that– for us to get involved in that area. I really think that.

But also I think that just journalistically, it’s as important to do this as it is to write a piece about a fashion week or anything else we might employ a journalist to do. And in a way it’s more important, because if The Guardian is about open information, which– since the beginning of The Guardian we’ve campaigned for freedom of information and access to information, and this is the ultimate expression of that.

And we, on the site, we use the phrase “facts are sacred.” And this comes from the famous C. P. Scott who said that “comment is free,” which as you know is the name of our comment site, but “facts are sacred” was the second part of the saying. And I kinda think that is– you can see it on the comment site, there you go. “Comment is free, but facts are sacred.” And that’s what The Guardian’s about. I really think that, you know, this says a lot about the web. Interestingly, I think that’s how the web is changing, in the sense that a few years ago it was just about comment. People wanted to say what they thought. Now I think it’s, increasingly, people want to find out what the facts are.

JS: All right, well, thank you very much for a thorough introduction to The Guardian’s data work.

SR: Thanks a lot.

Data Blog posts are often related to or supporting of news stories, but not always. Rogers sees the publishing of interesting data as a journalistic act that stands alone, and is clear on where the newsroom adds value:

I think you have to apply journalistic treatment to data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

The Guardian curates far more data than it creates. Some data sets are generated in-house, such as the Guardian’s yearly executive pay surveys, but more often the data already exists in some form, such as a PDF on a government web site. The Guardian finds such documents, scrapes the data into spreadsheets, cleans it, and adds context in a Data Blog post. But they also maintain an index of world government data which scrapes open government web sites to produce a searchable index of available data sets.

“Helping people find the data, that’s our mission here,” says Rogers. “We want people to come to us when they’re looking for data.”

In alignment with their open strategy, The Guardian encourages re-use and mashups of their data. Readers can submit apps and visualizations that they’ve created, but data has proven to be just as popular with non-developers — regular folks who want the raw information.

Sometimes readers provide additional data or important feedback, typically through the comments on each post. Rogers gives the example of a reader who wrote in to say that the Academy schools listed in his area in a Guardian data set were in wealthy neighborhoods, raising the journalistically interesting question of whether wealthier schools were more likely to take advantage of this charter school-like program. Expanding on this idea, Rogers says,

What used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that.

Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute.

So you can get stories back from them, in a way. … If you put the information out there, you always get a return. You get people coming back.

Perhaps surprisingly, data also gets pretty good traffic, with the Data Blog logging a million hits a month during the recent election coverage. “In the firmament of Guardian web sites that’s not bad. That’s kind of upper tier,” says Rogers. “And this is only after being around for a year.” (The even younger Texas Tribune also finds its data pages popular, accounting for a third of total page views.)

Rogers and I also discussed the process of getting useful data out of inept or uncooperative governments, the changing role of data specialists in the newsroom, and how the Guardian tapped its readers to produce the definitive database of Doctor Who villains. Here’s the transcript, lightly edited.

JS: All right. So. I’m here with Simon Rogers in the Guardian newsroom in London, and you’re the editor of the Data Blog.

SR: That’s right, and I’m also a news editor so I work across the organization on data journalism, essentially.

JS: So, first of all, can you tell us what the Data Blog is?

SR: Ok, well basically it came about because, as I said I was a news editor working a lot with graphics, and we realized we were just collecting enormous amounts of data. And we thought, well wouldn’t our readers be interested in seeing that? And when the Guardian Open Platform launched, it seemed a good time to think about opening up– we were opening up the Guardian to technical development, so it seemed a good time to open up our data collections as well.

And also it’s the fact that increasingly we’ve found people are after raw information. If you looked– and there’s lots of raw information online, but if you start searching for that information you just get bewildering amounts of replies back. If you’re looking for, say, carbon emissions, you get millions of entries back. So how do you know what the right set of data is? Whereas we’ve already done that set of work for our readers, because we’ve had to find that data, and we’ve had to choose it, and make an editorial selection about it, I suppose. So we thought we were able to cut out the middle man for people.

But also we kind of thought when we launched it, actually, what we’d be doing is creating data for developers. There seemed to be a lot of developers out there at that point who were interested in raw information, and they would be the people who would use the data blog, and the open platform would get a lot more traffic.

And what actually happened, what’s been interesting about it, is that– what’s actually happened is that it’s been real people who have been using the Data Blog, as much as developers. Probably more so than developers.

JS: What do you mean “real people”?

SR: Real people, I suppose what I mean is that, somebody who’s just interested in finding out what a number is. So for instance, here at the moment we’ve got a big story about a government scheme for building schools, which has just been cut by the new government. It was set up by the old government, who invested millions of pounds into building new school buildings. And so, we’ve got the full list of all the schools, by the parliamentary constituency that they’re in, and where they are and what kind of project they were. And that is really, really popular today, that’s one of our biggest things, because there’s a lot of demonstrations about it, it’s a big issue of the day. And so I would guess that 90% of people looking at it are just people who want to find out what the real raw data is.

And that’s the great thing about the internet, it gives you access to the raw, real information. And I think that’s what people really crave. They want the interpretation and the analysis from people, but they also want the veracity of seeing the real thing, without having it aggregated or put together. They just want to see the raw data.

JS: So you publish all of the original numbers that you get from the government?

SR: Well exactly. The only time– with the Data Blog, I try to make it as newsy as possible. So it’s often hooked around news stories of the day. Partly because it helps the traffic, and you’re kind of hooking on to existing requirements.

Obviously we do– it’s just a really eclectic mix of data. And I can show you the screen, for a sec.

JS: All right. Let’s see something.

SR: Okay, so this is the data blog today. So obviously we’ve got Afghanistan at the top. Afghanistan is often at the top at the moment. This is a full list of everybody who’s died, every British casualty who’s died and been wounded over time. So you’ve got this data here. We use, I tend to use a lot of third party services. This is a company called Timetric, who are very good at visualizing time series data. It takes about five minutes to create that, and you can roll over and get more information.

JS: So is that a free service?

SR: Yeah, absolutely free, you just sign up, and you share it. It works a bit like Many Eyes, you know the IBM service.

JS: Yeah.

SR: We’ll embed these Google docs. We use Google docs, Google spreadsheets to share all our information because it’s very easy for people to download it. So say you want to download this data. You click on the link, and it will take you through in a second to, there you go, it’s the full Google spreadsheet. And you’ve got everything on here. You’ve got, these are monthly totals, which you can’t get anywhere else, because nobody else does that information.

JS: What do you mean nobody else does it?

SR: Well nobody else bothers to put it together month by month. You can get totals by year from, iCasualties I think do it, but we’ve just collected some month by month, because often we’ve had to draw graphics where it’s month by month. It’s the kind of thing, actually it’s quite interesting to be able to see which month was the worst for casualties.
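
Turning dated records into the kind of month-by-month totals described here is a simple grouping step. The sketch below assumes one date per record; the dates are invented placeholders, not real casualty records, and the `monthly_totals` and `worst_month` helpers are hypothetical names used only for illustration.

```python
from collections import Counter
from datetime import date

# Placeholder dates, invented for illustration only.
EVENT_DATES = [
    date(2009, 7, 1),
    date(2009, 7, 15),
    date(2009, 8, 3),
    date(2010, 7, 9),
]


def monthly_totals(dates):
    """Count records per (year, month) pair."""
    return Counter((d.year, d.month) for d in dates)


def worst_month(dates):
    """Return the ((year, month), count) pair with the highest count."""
    totals = monthly_totals(dates)
    return max(totals.items(), key=lambda item: item[1])


for (year, month), count in sorted(monthly_totals(EVENT_DATES).items()):
    print(f"{year}-{month:02d}: {count}")
```

Keying on the year and month together keeps July 2009 separate from July 2010, which a month-only grouping would merge.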

We’ve got lists of names, which obviously are in a few places. We collect Afghanistan wounded statistics which are terribly confused in the UK, because what they do is they try and make them as complicated as possible. So, the most serious ones, NOTICAS is where your next of kin is notified. That’s a serious event, but also you’ve got all those people evacuated. So anyway, this kind of data. We also keep amputation data, which is a new set that the government refused to release until recently, and a Guardian reporter was instrumental in getting this data released. So we kind of thought, maybe we should make this available for people.

So you get all this data, and then what you can do, if you click on “File” there, you can download it as Excel, XML, CSV, or whatever format you want. So that’s why we use Google spreadsheets. It’s the kind of thing that’s a very, very easily accessible format for people.

So really what we do is we try and encourage a community, a community to grow up around data and information. So every post has got a talk facility on it.

Anyway, going through it. So this is today’s Data Blog, where you’ve got Afghanistan, Academy schools in the UK. The schools are run by the state, pretty much.

JS: So just to clarify this for the American audience, what’s an Academy school?

SR: Ok, well basically in the UK most schools are state schools, that most children go to. State schools are, we all pay for them, they’re paid for out of our taxes. And they’re run at a local level, which obviously has its advantages because it means that you are, kind of, working to an area. What the new government’s proposing to do is allow any school that wants to to become an Academy. And what an Academy is is a school that can run its own finances, and own affairs.

And what we’ve got is we’ve got the data, the government’s published the data — as a PDF of course because governments always publish everything as a PDF, in this country anyway — and what they give you, which we’ve scraped here, is a list of every school in the UK which has expressed an interest. So you’ve got the local authority here, the name of the school, type of school, the address, and the post code. Which is great, because that’s good data, and because it’s on a PDF we can get that into a spreadsheet quite easily.

JS: So did you have to type in all of those things from a PDF, or cut and paste them?

SR: Good god no. No, no, we have, luckily we’ve got a really good editorial support team here, who, thanks to the Data Blog, are becoming very experienced at getting data off of PDFs. Because every government department would much rather publish something as a PDF, so they can act as if they’re publishing the data but really it’s not open.

JS: So that’s interesting, because in the UK and the US there’s this big government publicity about, you know, we’re publishing all this data.

SR: Absolutely.

JS: But you’re saying that actually–

SR: It’s not 100 percent yet. So, I’ll show you in a second that what they tend to do is just publish– most government departments still want to publish stuff as PDFs. They can’t quite get out of that thing. Or want to say, why would somebody want a spreadsheet? They don’t really get it. A lot of people don’t get it.

And, we wanted the spreadsheet so you can do stuff like this, which is, this is a map of schools interested in becoming Academies by area. And so because we have that raw data in spreadsheet form we can work out how many in the area. You can see suddenly that this part of England, Kent, has 99 schools, which is the biggest in the country. And only one area, which is Barking, up here, in London, which is, sorry, is down here in London, but anyway that has no schools applying at all.

And the government also always said at the beginning that it would mainly be schools which weren’t “outstanding” that would apply. But actually if you look at the figures, which again, we can do, the majority of them are outstanding schools. So they’re already schools which are good, which are applying to become academies. Which kind of isn’t the point. But that kind of analysis, that’s data journalism in a sense. It’s using the numbers to get a story, and to tell a story.
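
Once the list is in spreadsheet form, counting schools per area is a short grouping step. The sketch below assumes the data has been exported as CSV with a `local_authority` column; the column names loosely follow the fields Rogers lists, and the sample rows and the `schools_per_authority` helper are hypothetical, invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical extract with placeholder school names and postcodes.
SAMPLE_CSV = """local_authority,school_name,school_type,postcode
Kent,Example Grammar School,Secondary,XX1 1XX
Kent,Example Primary School,Primary,XX1 2XX
Essex,Example High School,Secondary,XX2 1XX
"""


def schools_per_authority(csv_text):
    """Count rows per local authority in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["local_authority"] for row in reader)


counts = schools_per_authority(SAMPLE_CSV)
for authority, total in counts.most_common():
    print(authority, total)
```

An area with no applications simply never appears in the rows, and a `Counter` reports it as zero when looked up, so a map layer can still ask for every authority by name.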

JS: And how long did that story take you to put together? To get the numbers, and do the graphics, and…?

SR: Well, I was helped a bit, because I got, I’ve had one of my helpers who works in editorial support to get the data onto a spreadsheet. And in terms of creating the graphic we have a fantastic tool here, which is set up by one of our technical development team who are over there, and what it does, is it allows you to paste a load of data, geographic data, into this box, and you tell it what kind, is it parliamentary constituency, or local authority, or educational authority, or whatever, however the different regional differentiations we have in the UK, and it will draw a map for you. So this map here was drawn by computer, basically, and then one of the graphics guys helped sort out the labels and finesse it and make it look beautiful. But it saves you the hard work of coloring up all those things. So actually that took me maybe a couple of hours. In total.

JS: How about getting the data, how long did that take?

SR: Oh well luckily that data– you know the government makes the data available. But like I say, as a PDF file. So this is the government site, and that’s the list there, and you open it, it opens as a PDF. Because we’ll link to that.

But luckily the guys in the ESD [editorial services department] are very adept now, because of the Data Blog, at getting data into spreadsheets. So, you know they can do that in 20 minutes.

JS: So how many people are working on data overall, then?

SR: Well, in terms of– it’s my full time job to do it. I’m lucky in that I’ve got an awful lot of people around here who have got an interest who I can kind of go and nudge, and ask. It’s a very informal basis, and we’re looking to formalize that, at the moment. We’re working on a whole data strategy, and where it goes. So we’re hoping to kind of make all of these arrangements a bit more formal. But at the moment I have to fit into what other people are doing. But yeah, we’ve got a good team now that can help, and that’s really a unique thing.

So I was going through the Data Blog for you. So this is a typical, a weird day, so schools, and then we’ve got another schools thing because it’s a big schools day today. This is school building projects scrapped by constituency, full list. Now, this is another where the government didn’t make the data easily available. The Department for Education published a list of all the school projects that were going to be stopped when the government cut the funding, some of which is going towards creating Academy schools, which is why this is a bit of an issue in the country at the moment. And we wanted to know by constituency how it was working. So which MPs were having the most school projects cut, in their constituency. And we couldn’t get that list out of the Department for Education, but one MP had lodged it with the House of Commons library. So we managed to get it from the House of Commons library. But it didn’t come in a good form, it came in a PDF again, so again we had to get someone from tech to sort it out for us.

But the great thing is that we can do something like this, which is a map of projects stopped by constituency, by MP. And most of the projects that were stopped were in Labour seats. As you know Labour are not in power at the moment. So we can do some of this sort of analysis which is great. So there were 418 projects stopped in Labour constituency seats, and 268 stopped in Conservative seats. So basically 40% of Labour MPs had a project stopped, at least one project stopped in their seat, compared to only 27% of Conservatives, and 24% of the Lib Dems, who are in power at the moment.

JS: So would it be accurate to say the data drove this story, or showed this story, or…?

SR: Data showed this story, which is great, but the one thing, the caveat — of course, the raw numbers are never 100% — the caveat was there were more projects going on in Labour areas because the previous government, which was Labour, set up the projects, and they gave more projects to Labour areas. So you can read it either way.
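[Editor's illustration: the caveat here is a base-rate point. Raw counts of stopped projects will favour whichever party's seats had more projects to begin with, so counts and rates can point in different directions. The sketch below is a minimal Python example of that comparison. Every row, field name and number in it is invented for illustration; none of it is the Guardian's data or method.]

```python
from collections import defaultdict

# Hypothetical rows, one per constituency. All values are made up.
ROWS = [
    {"constituency": "A", "party": "Labour", "planned": 6, "stopped": 4},
    {"constituency": "B", "party": "Labour", "planned": 5, "stopped": 0},
    {"constituency": "C", "party": "Conservative", "planned": 2, "stopped": 1},
    {"constituency": "D", "party": "Conservative", "planned": 1, "stopped": 0},
    {"constituency": "E", "party": "Conservative", "planned": 1, "stopped": 1},
]


def summarise(rows):
    """Return raw counts and two rates per party."""
    totals = defaultdict(
        lambda: {"seats": 0, "seats_hit": 0, "planned": 0, "stopped": 0}
    )
    for row in rows:
        t = totals[row["party"]]
        t["seats"] += 1
        if row["stopped"] > 0:
            t["seats_hit"] += 1  # seat with at least one stopped project
        t["planned"] += row["planned"]
        t["stopped"] += row["stopped"]

    summary = {}
    for party, t in totals.items():
        summary[party] = {
            # Raw count: the headline figure.
            "stopped": t["stopped"],
            # Share of the party's seats with at least one stopped project.
            "share_of_seats_hit": t["seats_hit"] / t["seats"],
            # Share of the party's planned projects that were stopped.
            "share_of_projects_stopped": (
                t["stopped"] / t["planned"] if t["planned"] else 0.0
            ),
        }
    return summary


if __name__ == "__main__":
    for party, stats in summarise(ROWS).items():
        print(party, stats)
```

[In this invented table Labour seats lose more projects in absolute terms, while a larger share of the projects planned for Conservative seats is stopped. That is the sense in which the same data can be read either way.]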

JS: And you said this in the story?

SR: We said this in the story. Absolutely. We always try and make the caveats available for people. So that’s a big story today, because there are demonstrations about it in London. You’ve come to us on a very education-centered day today.

But there’s other stuff on the blog too. This is a very British thing. We did this because we thought it would be an interesting project to do. I had somebody in for a week and they didn’t have much to do so I got them to make a list of every Doctor Who villain ever.

JS: This was an intern project?

SR: This was an intern project. We kinda thought, yeah, we’ll get a bit of traffic. And we’ve never had so much involvement in a single piece ever. It’s had 500 retweets, and when you think most pieces will get 30 or 40, it’s kind of interesting. The traffic has been through the roof. And the great thing is, so we created–

JS: Ooh, what’s this? This is good.

SR: It’s quite an easy– we use ManyEyes quite a lot, which is very very quick to create lovely little graphics. And this is every single Doctor Who villain since the start of the program, and how many times they appear. So you see the Daleks lead the way in Doctor Who.

JS: Yeah, absolutely.

SR: Followed by the Cybermen, and the Master’s in there a lot. And there are lots of other little things. But we started off with about 106 villains in total, and now we’re up to– we put it out there and we said to people, we know this isn’t going to be the complete list, can you help us? And now we’ve got 212. So my weekend has basically been– I’ll show you the data sheet, it’s amazing. You can see the comments are incredible. You see these kinds of things, “so what about the Sea Devils? The Zygons?” and so on.

And I’ll show you the data set, because it’s quite interesting. So this is the data set. Again Google Docs. And you can see over here on the right hand side, this is how many people are looking at it at any one time. So at that moment there are 11 people looking on. There could be 40 or 50 people looking at any one moment. And they’re looking and they’re helping us make corrections.

JS: So, wait– this data set is editable?

SR: No, we haven’t made it editable, because we’ve had a bad experience with people coming to editable ones and mucking around, you know, putting swear words on stuff.

JS: So how do they help you?

SR: Well they’ll put stuff in the comments field and I’ll go in and put it on the spreadsheet. Because I want a sheet that people can still download. So now we’ve got, we’re now up to 203. We’ve doubled the amount of villains thanks to our readers. It’s Doctor Who. And it just shows we’re an eclectic– we’re a broad church on the Data Blog. Everything can be data. And that’s data. We’ve got number of appearances per villain, and it’s a program that people really care about. And it’s about as British as it’s possible to get. But then we also have other stuff too– and there we go, crashed again.

JS: Well let me just ask you a few questions, and take this opportunity to ask you some broader questions. Because we can do this all day. And I have. I’ve spent hours on your data blog because I’m a data geek. But let’s sort of bring it to some general questions here.

SR: Okay. Go for it.

JS: So first of all, I notice you have the Data Blog, you also have the world data index.

SR: Yes. Now the idea of that was that, obviously lots of governments around the world have started to open up their data. And around the time that the British government was– a lot of developers here were involved in that project — we started to think, what can we do around this that would help people, because suddenly we’ve got lots of sites out there that are offering open government data. And we thought, what if we could just gather them all together into one place. So you’ve got a single search engine. And that’s how we set up the world data search. Sorry to point you at the screen again.

JS: No that’s fine, that’s fine.

SR: Basically, so what we did, we started off with just Australia, New Zealand, UK and America. And basically what this site does, is it searches all of these open government data sites. Now we’ve got Australia, Toronto in Canada, New Zealand, the UK, London, California, San Francisco, and data.gov.

So say you search for “crime,” say you’re interested in crime. There you go. So you come back here, you see you’ve got results here from the UK, London, you’ve got results from data.gov in America, San Francisco, New Zealand and Australia. Say you’re interested in just seeing– you live in San Francisco and you’re only interested in San Francisco results. You’ve three results. And there you go, you click on that.

And you’re still within the Guardian site because what we’re asking people to do is help us rank the data, and submit visualizations and applications. So we want people to tell us what they’ve done with the data.

But anyway if you go and click on that, and you click on “download,” and it will start downloading the data for you. Or, what it will do is take you to the terms and conditions. We don’t bypass any T&Cs. The T&Cs come alongside. But you click on that, you agree to that, and then you get the data. So we really try and make it easy for people. There you go. And this is the crime incidence data. Very variable. This is great because it’s KML files, so if you wanted to visualize that you get really great information. It’s all sorts of stuff. Sometimes it’s CSVs.

JS: What’s a KML file?

SR: So, Google Earth.

JS: Okay.

SR: Sorry. So, it’s mapping, a mapping file straight away.
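[Editor's illustration: KML is an XML format for geographic features, which is why it opens straight into Google Earth. The sketch below is a minimal Python example of pulling point locations out of a KML document with the standard library. The sample document and its placemark are invented; real files from an open data site may use other geometry types or nesting, which this sketch does not handle.]

```python
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Invented sample: one placemark with a single point.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example incident</name>
      <Point><coordinates>-122.4194,37.7749,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def read_placemarks(kml_text):
    """Return a list of {name, lat, lon} dicts for point placemarks."""
    root = ET.fromstring(kml_text)
    points = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(
            ".//kml:Point/kml:coordinates", default="", namespaces=KML_NS
        )
        if not coords.strip():
            continue  # skip placemarks without a point geometry
        # KML stores coordinates as longitude,latitude[,altitude].
        lon, lat, *_ = (float(part) for part in coords.strip().split(","))
        points.append({"name": name, "lat": lat, "lon": lon})
    return points


if __name__ == "__main__":
    print(read_placemarks(SAMPLE_KML))
```

[Note the coordinate order: KML puts longitude first, which is an easy thing to get backwards when plotting.]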

SR: Okay, so one of the things we ask people to do is to submit visualizations and applications they’ve produced. So for instance, London has some very very good open data. If you haven’t looked around the Data Store, it’s really worth going to. And one of these things they do is they provide a live feed of all the London traffic cameras. You can watch them live. And this is a lovely thing, because what somebody’s done is they’ve written an iPad application. So you can watch live TFL, Transport for London, traffic cameras on your iPad.

And you see that data set has been rated. A couple of people have gone in there and rated it. You’ve got a download button, the download is XML. So we try and help people around this data. And this is growing now. Every time somebody launches an open government data site we’re gonna put it on here, and we’re working on a few more at the moment. So we want it to be the place that people go to. Every time you Google “world government data” it pops up at the top, which is what you want. You want people who are just trying to compare different countries and don’t know where to start, to help them find a way through this maze of information that’s out there.

JS: So do you intend to do this for every country in the world?

SR: Every country in the world that launches an open government data site, we’ll whack it on here. And we’re working– at the moment there are about 20 decent open government data sites around the world. We’re picking those up. We’ve got on here now, how many have we got? One, two, three, four, five, six, seven, eight. We’ll have 20 on in the next couple of weeks. We’re really working through them at the moment.

And what this does is, it scrapes them. So basically, we don’t– for us it’s easy to manage because we don’t have to update these data sets all the time. The computer does that for us. But basically, what we do provide people with is context and background information, because you’re part of the data site there.

JS: So let me make sure I have this clear. So you’re not sucking down the actual data, you’re sucking down the list and descriptions of the data sets available?

SR: Absolutely. So we’re providing people, because basically we want it to be as updated as possible. We don’t– if we just uploaded onto our site, that would kind of be pointless, and it would mean it would be out of date. This way, if something pops up on data.gov and stays there, we’ll get it quick on here. We’ll help people find it. Helping people find the data, that’s our mission here. It’s not just generating traffic, it’s to help people find the information, because we want people to come to us when they’re looking for data.
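[Editor's illustration: the design described here is an index of dataset listings, not a copy of the data. Each source is re-harvested periodically, and search runs over titles and descriptions with an optional filter by source. The sketch below is a minimal Python example of that idea. The class names, field names and example.org URLs are hypothetical, the harvesting step is stubbed out with hand-written listings, and nothing here is the Guardian's actual implementation.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    source: str       # which open data site the listing came from
    title: str
    description: str
    url: str          # link back to the original site; data is not copied


class CatalogIndex:
    """Holds dataset metadata only, grouped by source."""

    def __init__(self):
        self._entries = {}

    def refresh(self, source, listings):
        """Replace everything known about one source with a fresh harvest."""
        self._entries[source] = [
            DatasetEntry(
                source=source,
                title=item["title"],
                description=item.get("description", ""),
                url=item["url"],
            )
            for item in listings
        ]

    def search(self, term, source=None):
        """Case-insensitive match on title or description."""
        needle = term.lower()
        hits = []
        for name, entries in self._entries.items():
            if source is not None and name != source:
                continue
            for entry in entries:
                text = (entry.title + " " + entry.description).lower()
                if needle in text:
                    hits.append(entry)
        return hits


if __name__ == "__main__":
    index = CatalogIndex()
    # Stand-ins for what a scraper might return; all values are invented.
    index.refresh("San Francisco", [
        {"title": "Crime incidents", "url": "https://example.org/sf/crime"},
    ])
    index.refresh("London", [
        {"title": "Recorded crime by borough", "url": "https://example.org/ldn/crime"},
        {"title": "Traffic cameras", "url": "https://example.org/ldn/cameras"},
    ])
    print(len(index.search("crime")))
    print(index.search("crime", source="San Francisco"))
```

[Because `refresh` replaces a source's listings wholesale, a dataset that disappears from the original site also disappears from the index on the next harvest, which is the staleness problem that copying the data would create.]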

JS: So, okay. You’ve talked about, it sounds like, two different projects. The Data Blog, where you collect and clean up and present data that you–

SR: That we find interesting. We’re selective.

JS: In the process of the Guardian’s newsgathering.

SR: Yeah, and just things that are interesting anyway. So the Doctor Who post that we were just looking at is just interesting to do. It’s not anything we’re going to do a story about. And often they’ll be things that are in the news, say that day, and I’ll think “oh that’s a good thing to put on the Data Blog.” So it could be crime figures, or it could be– and sometimes, the side effect of that is a great side effect because you end up with a piece in the paper, or a piece on the web site. But often it might be the Data Blog is the only place to get that information.

JS: And you index world government data sites.

SR: Yeah, absolutely.

JS: Does the Guardian do anything else with data?

SR: Yeah, well what we do is, we’re doing a lot of Guardian research with data. So what we want to do is give people a kind of way into that. So for instance, we do do a lot of data-based projects. So for instance we’re doing an executive pay survey of all the biggest companies, how much they pay their bosses and their chief executives. That has always been a thing the paper’s always done for stories. And now what we’ll do is we’ll make that stuff available– that data available for people. So instead of just raw data journalism, it’s quite old data journalism. We’ve been doing it for ten years. But we used to just call it a survey. Now it’s data journalism, because it’s getting stories out of numbers. So we’ll work with that, and we’ll publish that information for people to see. And there are a couple of big projects coming up this week, which I really can’t tell you about, but next week it will be obvious what they are.

JS: Probably by the time this goes up we’ll be able to link to them.

[Simon was referring to the Guardian's data journalism work on the leaked Afghanistan war logs, described in a thorough post on the Data Blog.]

SR: Yeah, I’ll mail you about them. But we’ve got now an area of expertise. So increasingly what I’m finding is that I’m getting people coming to me within The Guardian, saying, so we’ve got this spreadsheet, well how can I do this? So for instance that Academies thing we were just looking at, we were really keen to find out which areas were the most, where the most schools were, for the paper. The correspondent wanted to know that. So actually, because we’ve got this area of expertise now in managing data, we’re becoming kind of a go-to place within The Guardian, for journalists who are just writing stories where they need to know something, or they need to find some information out, which is an interesting side effect. Because it used to be that journalists were kind of scared of numbers, and scared of data. I really think that was the case. And now, increasingly, they’re trying to embrace that, and starting to realize you can get stories out of it.

JS: Well that’s really interesting. Let’s talk for a minute about how this applies to other newsrooms, because it’s– as you say, journalists have been traditionally scared of data.

SR: Yeah, absolutely. You could say they prided themselves, in this country anyway, they prided themselves on lack of mathematical ability. I would say.

JS: Which seems unfortunate in this era.

SR: Yeah, absolutely. Yeah, yeah, absolutely.

JS: But especially a lot of our readers are from smaller newsrooms, and so what kind of technical capability do you need to start tracking data, and publishing data sets?

SR: I think it’s really minimal. I mean, the thing is that actually, what we’re doing is really working with a basic, most of the time just basic spreadsheet packages. Excel or whatever you’ve got. Excel is easy to use, but it could be any package really. And we’re using Google spreadsheets, which again is widely available for people to do information. We’re using visualization tools which are again, ManyEyes or Timetric which are widely available and easy to use. I think what we’re doing is just bringing it together.

I think traditionally that journalists wouldn’t regard data journalism as journalism. It was research. Or, you know, how is publishing data– is that journalism? But I think now, what is happening is that actually, what used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that. Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute. So you can get stories back from them, in a way. So we’re receiving the information much more.

JS: So you publish the data, and then other people build stories out of it, is that what you’re saying?

SR: Other people will let us know– well, we publish say, well that’s an interesting story, or this is a good visualization. We’ve published data for other people to visualize. We thought, that’s quite an interesting thing to mash it up with, we should do that ourselves. So there’s that thing, and there’s also the fact that if you put the information out there, you always get a return. You get people coming back.

So for instance the Academies thing today that we were talking about. We’ve had people come back saying, well I live in Derbyshire and I know that those schools are in quite wealthy areas. So we start to think, well is there a trend towards schools in wealthy areas going to this, and schools in poorer areas not going to this.

So it gives you extra stories or extra angles on stories you wouldn’t think of. And I think that’s part of it. And I think partly there’s just the realization that just publishing data in itself, because it’s interesting, is a journalistic enterprise. Because I think you have to apply journalistic treatment to that data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

JS: So last question here, which is of course going to be on many editors’ and publishers’ minds.

SR: Sure.

JS: Let’s talk about traffic and money. How does this contribute to the business of The Guardian?

SR: Okay, it’s a new– it’s an experiment for us, but traffic-wise it’s been pretty healthy. We’ve had– during the election we were getting a million page impressions in a month. Which is not bad. On the Data Blog. Now, as a whole, out of the 36 million that The Guardian gets, it doesn’t seem like a lot. But actually, in the firmament of Guardian web sites that’s not bad. That’s kind of upper tier. And this is only after being around for a year.

So in terms of what it gives us, it gives the same as producing anything that produces traffic gives us. It’s good for the brand, and it’s good for The Guardian site. In the long run, I think that there is probably canny money to be made out of there, for organizations that can manage and interpret data. I don’t know exactly how, but I think we’d have to be pretty dumb if we don’t come up with something. I’d be very surprised. It’s an area where there’s such a lot of potential. There are people who don’t really know how to manage data and don’t really know how to organize data that– for us to get involved in that area. I really think that.

But also I think that just journalistically, it’s as important to do this as it is to write a piece about a fashion week or anything else we might employ a journalist to do. And in a way it’s more important, because if The Guardian is about open information, which– since the beginning of The Guardian we’ve campaigned for freedom of information and access to information, and this is the ultimate expression of that.

And we, on the site, we use the phrase “facts are sacred.” And this comes from the famous C. P. Scott who said that “comment is free,” which as you know is the name of our comment site, but “facts are sacred” was the second part of the saying. And I kinda think that is– you can see it on the comment site, there you go. “Comment is free, but facts are sacred.” And that’s what The Guardian’s about. I really think that, you know, this says a lot about the web. Interestingly, I think that’s how the web is changing, in the sense that a few years ago it was just about comment. People wanted to say what they thought. Now I think it’s, increasingly, people want to find out what the facts are.

JS: All right, well, thank you very much for a thorough introduction to The Guardian’s data work.

SR: Thanks a lot.

June 01 2010

07:50