
December 07 2010

08:47

One ambassador’s embarrassment is a tragedy, 15,000 civilian deaths is a statistic

Few things illustrate the challenges facing journalism in the age of ‘Big Data’ better than Cable Gate – and specifically, how you engage people with stories that involve large sets of data.

The Cable Gate leaks have been of a different order to the Afghanistan and Iraq war logs. Not in number (there were 90,000 documents in the Afghanistan war logs and over 390,000 in the Iraq logs; the Cable Gate documents number around 250,000) – but in subject matter.

Why is it that the 15,000 extra civilian deaths estimated to have been revealed by the Iraq war logs did not move the US authorities to shut down Wikileaks’ hosting and PayPal accounts? Why did it not dominate the news agenda in quite the same way?

Tragedy or statistic?

Generally misattributed to Stalin, the quote “The death of one man is a tragedy, the death of millions is a statistic” illustrates the problem particularly well: when you move beyond scales we can deal with on a human level, you struggle to engage people in the issue you are covering.

Research suggests this is a problem that not only affects journalism, but justice as well. In October Ben Goldacre wrote about a study that suggested “People who harm larger numbers of people get significantly lower punitive damages than people who harm a smaller number. Courts punish people less harshly when they harm more people.”

“Out of a maximum sentence of 10 years, people who read the three-victim story recommended an average prison term one year longer than the 30-victim readers. Another study, in which a food processing company knowingly poisoned customers to avoid bankruptcy, gave similar results.”

Salience

This is where journalists play a particularly important role. Kevin Marsh, writing about Wikileaks on Sunday, argues that

“Whistleblowing that lacks salience does nothing to serve the public interest – if we mean capturing the public’s attention to nurture its discourse in a way that has the potential to change something material.”

He is right. But Charlie Beckett, in the comments to that post, points out that Wikileaks is not operating in isolation:

“Wikileaks is now part of a networked journalism where they are in effect, a kind of news-wire for traditional newsrooms like the New York Times, Guardian and El Pais. I think that delivers a high degree of what you call salience.”

This is because last year Wikileaks realised that they would have much more impact working in partnership with news organisations than releasing leaked documents to the world en masse. It was a massive move for Wikileaks, because it meant re-assessing a core principle of openness to all, and taking on a more editorial role. But it was an intelligent move – and undoubtedly effective. The Guardian, Der Spiegel, New York Times and now El Pais and Le Monde have all added salience to the leaks. But could they have done more?

Visualisation through personalisation and humanisation

In my series of posts on data journalism I identified visualisation as one of four interrelated stages in its production. I think that this concept needs to be broadened to include visualisation through case studies: or humanisation, to put it more succinctly.

There are dangers here, of course. Firstly, that humanising a story makes it appear to be an exception (one person’s tragedy) rather than the rule (thousands suffering) – or simply emotive rather than also informative; and secondly, that your selection of case studies does not reflect the more complex reality.

Ben Goldacre – again – explores this issue particularly well:

“Avastin extends survival from 19.9 months to 21.3 months, which is about 6 weeks. Some people might benefit more, some less. For some, Avastin might even shorten their life, and they would have been better off without it (and without its additional side effects, on top of their other chemotherapy). But overall, on average, when added to all the other treatments, Avastin extends survival from 19.9 months to 21.3 months.

“The Daily Mail, the Express, Sky News, the Press Association and the Guardian all described these figures, and then illustrated their stories about Avastin with an anecdote: the case of Barbara Moss. She was diagnosed with bowel cancer in 2006, had all the normal treatment, but also paid out of her own pocket to have Avastin on top of that. She is alive today, four years later.

“Barbara Moss is very lucky indeed, but her anecdote is in no sense whatsoever representative of what happens when you take Avastin, nor is it informative. She is useful journalistically, in the sense that people help to tell stories, but her anecdotal experience is actively misleading, because it doesn’t tell the story of what happens to people on Avastin: instead, it tells a completely different story, and arguably a more memorable one – now embedded in the minds of millions of people – that Roche’s £21,000 product Avastin makes you survive for half a decade.”

Broadcast journalism – with its regulatory requirement for impartiality, often interpreted in practical terms as ‘balance’ – is particularly vulnerable to this. One example is the way the homeopathy debate is given over to one person’s experience for the sake of balance.

Journalism on an industrial scale

The Wikileaks stories are journalism on an industrial scale. The closest equivalent I can think of was the MPs’ expenses story which dominated the news agenda for 6 weeks. Cable Gate is already on Day 9 and the wealth of stories has even justified a live blog.

With this scale comes a further problem: cynicism and passivity; Cable Gate fatigue. In this context online journalism has a unique role to play which was barely possible previously: empowerment.

3 years ago I wrote about 5 Ws and a H that should come after every news story. The ‘How’ and ‘Why’ of that are possibilities that many news organisations have still barely explored. ‘Why should I care?’ is about a further dimension of visualisation: personalisation – relating information directly to me. The Guardian moves closer to this with its searchable database, but I wonder at what point processing power, tools, and user data will allow us to do this sort of thing more effectively.

‘How can I make a difference?’ is about pointing users to tools – or creating them ourselves – where they can move the story on by communicating with others, campaigning, voting, and so on. This is a role many journalists may be uncomfortable with because it raises advocacy issues, but then choosing to report on these stories, and how to report them, raises the same issues; linking to a range of online tools need not be any different. These are issues we should be exploring, ethically.

All the above in one sentence

Somehow I’ve ended up writing over a thousand words on this issue, so it’s worth summing it all up in a sentence.

Industrial scale journalism using ‘big data’ in a networked age raises new problems and new opportunities: we need to humanise and personalise big datasets in a way that does not detract from the complexity or scale of the issues being addressed; and we need to think about what happens after someone reads a story online and whether online publishers have a role in that.

August 05 2010

17:00

How The Guardian is pioneering data journalism with free tools

The Guardian takes data journalism seriously. They obtain, format, and publish journalistically interesting data sets on their Data Blog, they track transparency initiatives in their searchable index of world government data, and they do original research on data they’ve obtained, such as their amazing in-depth analysis of 90,000 leaked Afghanistan war documents. And they do most of this with simple, free tools.

Data Blog editor Simon Rogers gave me an action-packed interview in The Guardian’s London newsroom, starting with story walkthroughs and ending with a philosophical discussion about the changing role of data in journalism. It’s a must-watch if you’re wondering what the digitization of the world’s facts means for a newsroom. Here’s my take on the highlights; a full transcript is below.

The technology involved is surprisingly simple, and mostly free. The Guardian uses public, read-only Google Spreadsheets to share the data they’ve collected, which require no special tools for viewing and can be downloaded in just about any desired format. Visualizations are mostly via Many Eyes and Timetric, both free.
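The “no special tools” point rests on Google’s export endpoints: a public, read-only spreadsheet can be pulled down as CSV, Excel, or other formats with a plain HTTP GET. A minimal sketch of building such a download link; the sheet ID is a placeholder, and the URL pattern is Google’s standard export endpoint for spreadsheets:

```python
# Build a direct-download URL for a public Google Spreadsheet.
# The sheet ID passed in below is a placeholder, not a real Guardian sheet.

def sheet_export_url(sheet_id: str, fmt: str = "csv") -> str:
    """Return a direct-download URL for a public Google Spreadsheet."""
    allowed = {"csv", "tsv", "xlsx", "ods", "pdf"}
    if fmt not in allowed:
        raise ValueError(f"unsupported format: {fmt}")
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format={fmt}"

url = sheet_export_url("PLACEHOLDER_SHEET_ID")
# The file can then be fetched with any HTTP client, e.g.:
#   import urllib.request
#   data = urllib.request.urlopen(url).read()
```

No API key or login is needed for sheets that have been published publicly, which is what makes the format so accessible to non-developers.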

Data Blog posts are often related to, or in support of, news stories, but not always. Rogers sees the publishing of interesting data as a journalistic act that stands alone, and is clear on where the newsroom adds value:

I think you have to apply journalistic treatment to data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

The Guardian curates far more data than it creates. Some data sets are generated in-house, such as its yearly executive pay surveys, but more often the data already exists in some form, such as a PDF on a government web site. The Guardian finds such documents, scrapes the data into spreadsheets, cleans it, and adds context in a Data Blog post. But they also maintain an index of world government data which scrapes open government web sites to produce a searchable index of available data sets.
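The scrape-then-clean workflow described above is mostly library work at the extraction end (a tool such as pdfplumber or Tabula pulls the table out of the PDF); the cleaning stage can be plain Python. A hedged sketch, with invented field names and sample rows rather than the Guardian’s actual schema:

```python
# Clean raw rows scraped from a PDF table into usable records.
# Field names ("authority", "school", "postcode") are illustrative only.

def clean_rows(raw_rows):
    """Strip whitespace, drop blank rows and header rows repeated per page."""
    header = None
    records = []
    for row in raw_rows:
        cells = [("" if c is None else c.strip()) for c in row]
        if not any(cells):                      # skip empty rows
            continue
        if header is None:
            header = [c.lower() for c in cells]  # first non-empty row is the header
            continue
        if [c.lower() for c in cells] == header:
            continue                             # header repeated on each PDF page
        records.append(dict(zip(header, cells)))
    return records

rows = [
    ["Authority", "School", "Postcode"],
    ["Kent", "  Example High ", "CT1 1AA"],
    ["", "", ""],
    ["Authority", "School", "Postcode"],   # page-2 header repeat
    ["Barking", "Sample Academy", "IG11 7LU"],
]
```

Dropping repeated page headers and stray blank rows is usually most of the work in turning a government PDF into a spreadsheet.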

“Helping people find the data, that’s our mission here,” says Rogers. “We want people to come to us when they’re looking for data.”

In alignment with their open strategy, The Guardian encourages re-use and mashups of their data. Readers can submit apps and visualizations that they’ve created, but data has proven to be just as popular with non-developers — regular folks who want the raw information.

Sometimes readers provide additional data or important feedback, typically through the comments on each post. Rogers gives the example of a reader who wrote in to say that the Academy schools listed in his area in a Guardian data set were in wealthy neighborhoods, raising the journalistically interesting question of whether wealthier schools were more likely to take advantage of this charter school-like program. Expanding on this idea, Rogers says,

What used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that.

Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute.

So you can get stories back from them, in a way…If you put the information out there, you always get a return. You get people coming back.

Perhaps surprisingly, data also gets pretty good traffic, with the Data Blog logging a million hits a month during the recent election coverage. “In the firmament of Guardian web sites that’s not bad. That’s kind of upper tier,” says Rogers. “And this is only after being around for a year.” (The even younger Texas Tribune also finds its data pages popular, accounting for a third of total page views.)

Rogers and I also discussed the process of getting useful data out of inept or uncooperative governments, the changing role of data specialists in the newsroom, and how the Guardian tapped its readers to produce the definitive database of Doctor Who villains. Here’s the transcript, lightly edited.

JS: All right. So. I’m here with Simon Rogers in the Guardian newsroom in London, and you’re the editor of the Data Blog.

SR: That’s right, and I’m also a news editor so I work across the organization on data journalism, essentially.

JS: So, first of all, can you tell us what the Data Blog is?

SR: Ok, well basically it came about because, as I said I was a news editor working a lot with graphics, and we realized we were just collecting enormous amounts of data. And we thought, well, wouldn’t our readers be interested in seeing that? And when the Guardian Open Platform launched, it seemed a good time to think about opening up– we were opening up the Guardian to technical development, so it seemed a good time to open up our data collections as well.

And also it’s the fact that increasingly we’ve found people are after raw information. There’s lots of raw information online, but if you start searching for that information you just get bewildering amounts of replies back. If you’re looking for, say, carbon emissions, you get millions of entries back. So how do you know what the right set of data is? Whereas we’ve already done that set of work for our readers, because we’ve had to find that data, and we’ve had to choose it, and make an editorial selection about it, I suppose. So we thought we were able to cut out the middle man for people.

But also we kind of thought when we launched it, actually, what we’d be doing is creating data for developers. There seemed to be a lot of developers out there at that point who were interested in raw information, and they would be the people who would use the data blog, and the open platform would get a lot more traffic.

And what actually happened, what’s been interesting about it, is that– what’s actually happened is that it’s been real people who have been using the Data Blog, as much as developers. Probably more so than developers.

JS: What do you mean “real people”?

SR: Real people, I suppose what I mean is, somebody who’s just interested in finding out what a number is. So for instance, here at the moment we’ve got a big story about a government scheme for building schools, which has just been cut by the new government. It was set up by the old government, who invested millions of pounds into building new school buildings. And so we’ve got the full list of all the schools, with the parliamentary constituency that they’re in, and where they are and what kind of project they were. And that is really, really popular today, that’s one of our biggest things, because there’s a lot of demonstrations about it, it’s a big issue of the day. And so I would guess that 90% of people looking at it are just people who want to find out what the real raw data is.

And that’s the great thing about the internet, it gives you access to the raw, real information. And I think that’s what people really crave. They want the interpretation and the analysis from people, but they also want the veracity of seeing the real thing, without having it aggregated or put together. They just want to see the raw data.

JS: So you publish all of the original numbers that you get from the government?

SR: Well exactly. The only time– with the Data Blog, I try to make it as newsy as possible. So it’s often hooked around news stories of the day. Partly because it helps the traffic, and you’re kind of hooking on to existing requirements.

Obviously we do– it’s just a really eclectic mix of data. And I can show you the screen, for a sec.

JS: All right. Let’s see something.

SR: Okay, so this is the data blog today. So obviously we’ve got Afghanistan at the top. Afghanistan is often at the top at the moment. This is a full list of everybody who’s died, every British casualty who’s died and been wounded over time. So you’ve got this data here. We use, I tend to use a lot of third party services. This is a company called Timetric, who are very good at visualizing time series data. It takes about five minutes to create that, and you can roll over and get more information.

JS: So is that a free service?

SR: Yeah, absolutely free, you just sign up, and you share it. It works a bit like Many Eyes, you know the IBM service.

JS: Yeah.

SR: We’ll embed these Google docs. We use Google docs, Google spreadsheets, to share all our information because it’s very easy for people to download it. So say you want to download this data. You click on the link, and it will take you through in a second to, there you go, it’s the full Google spreadsheet. And you’ve got everything on here. You’ve got, these are monthly totals, which you can’t get anywhere else, because nobody else does that information.

JS: What do you mean nobody else does it?

SR: Well nobody else bothers to put it together month by month. You can get totals by year from, iCasualties I think do it, but we’ve just collected some month by month, because often we’ve had to draw graphics where it’s month by month. It’s the kind of thing, actually it’s quite interesting to be able to see which month was the worst for casualties.

We’ve got lists of names, which obviously are in a few places. We collect Afghanistan wounded statistics which are terribly confused in the UK, because what they do is they try and make them as complicated as possible. So, the most serious ones, NOTICAS, is where your next of kin is notified. That’s a serious event, but also you’ve got all those people evacuated. So anyway, this kind of data. We also keep amputation data, which is a new set that the government refused to release until recently, and a Guardian reporter was instrumental in getting this data released. So we kind of thought, maybe we should make this available for people.

So you get all this data, and then what you can do, if you click on “File” there, you can download it as Excel, XML, CSV, or whatever format you want. So that’s why we use Google spreadsheets. It’s the kind of thing that’s a very, very easily accessible format for people.

So really what we do is we try and encourage a community, a community to grow up around data and information. So every post has got a talk facility on it.

Anyway, going through it. So this is today’s Data Blog, where you’ve got Afghanistan, Academy schools in the UK. The schools are run by the state, pretty much.

JS: So just to clarify this for the American audience, what’s an Academy school?

SR: Ok, well basically in the UK most schools are state schools, that most children go to. State schools are, we all pay for them, they’re paid for out of our taxes. And they’re run at a local level, which obviously has its advantages because it means that you are, kind of, working to an area. What the new government’s proposing to do is allow any school that wants to, to become an Academy. And what an Academy is is a school that can run its own finances, and its own affairs.

And what we’ve got is we’ve got the data, the government’s published the data — as a PDF of course because governments always publish everything as a PDF, in this country anyway — and what they give you, which we’ve scraped here, is a list of every school in the UK which has expressed an interest. So you’ve got the local authority here, the name of the school, type of school, the address, and the post code. Which is great, because that’s good data, and because it’s on a PDF we can get that into a spreadsheet quite easily.

JS: So did you have to type in all of those things from a PDF, or cut and paste them?

SR: Good god no. No, no, we have, luckily we’ve got a really good editorial support team here, who are, thanks to the Data Blog, are becoming very experienced at getting data off of PDFs. Because every government department would much rather publish something as a PDF, so they can act as if they’re publishing the data but really it’s not open.

JS: So that’s interesting, because in the UK and the US there’s this big government publicity about, you know, we’re publishing all this data.

SR: Absolutely.

JS: But you’re saying that actually–

SR: It’s not 100 percent yet. So, I’ll show you in a second that what they tend to do is just publish– most government departments still want to publish stuff as PDFs. They can’t quite get out of that thing. Or want to say, why would somebody want a spreadsheet? They don’t really get it. A lot of people don’t get it.

And we wanted the spreadsheet so you can do stuff like this, which is a map of schools interested in becoming Academies by area. And because we have that raw data in spreadsheet form we can work out how many are in each area. You can see suddenly that this part of England, Kent, has 99 schools, which is the biggest in the country. And only one area, Barking, which is, sorry, down here in London, has no schools applying at all.

And the government also always said that at the beginning that it would mainly be schools which weren’t “outstanding” would apply. But actually if you look at the figures, which again, we can do, the majority of them are outstanding schools. So they’re already schools which are good, which are applying to become academies. Which kind of isn’t the point. But that kind of analysis, that’s data journalism in a sense. It’s using the numbers to get a story, and to tell a story.
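The “99 in Kent, none in Barking” analysis Rogers describes is essentially a frequency count over the cleaned list of applications. A sketch on made-up toy records (the data below is illustrative, not the real figures):

```python
from collections import Counter

# Toy records; in practice these would come from the scraped spreadsheet.
schools = [
    {"authority": "Kent", "school": "A"},
    {"authority": "Kent", "school": "B"},
    {"authority": "Sutton", "school": "C"},
]

# Count applications per local authority.
by_area = Counter(s["authority"] for s in schools)

most = by_area.most_common(1)[0]                      # area with most applicants
fewest = min(by_area.items(), key=lambda kv: kv[1])   # area with fewest
```

With the real spreadsheet, the same one-liner gives the per-area totals that the mapping tool then colours up.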

JS: And how long did that story take you to put together? To get the numbers, and do the graphics, and…?

SR: Well, I was helped a bit, because I got, I’ve had one of my helpers who works in editorial support to get the data onto a spreadsheet. And in terms of creating the graphic we have a fantastic tool here, which is set up by one of our technical development team who are over there, and what it does is it allows you to paste a load of data, geographic data, into this box, and you tell it what kind it is, parliamentary constituency, or local authority, or educational authority, or whatever the different regional differentiations we have in the UK, and it will draw a map for you. So this map here was drawn by computer, basically, and then one of the graphics guys helped sort out the labels and finesse it and make it look beautiful. But it saves you the hard work of coloring up all those things. So actually that took me maybe a couple of hours. In total.

JS: How about getting the data, how long did that take?

SR: Oh well luckily that data– you know the government makes the data available. But like I say, as a PDF file. So this is the government site, and that’s the list there, and you open it, it opens as a PDF. Because we’ll link to that.

But luckily the guys in the ESD [editorial services department] are very adept now, because of the Data Blog, at getting data into spreadsheets. So, you know they can do that in 20 minutes.

JS: So how many people are working on data overall, then?

SR: Well, in terms of– it’s my full time job to do it. I’m lucky in that I’ve got an awful lot of people around here who have got an interest who I can kind of go and nudge, and ask. It’s a very informal basis, and we’re looking to formalize that, at the moment. We’re working on a whole data strategy, and where it goes. So we’re hoping to kind of make all of these arrangements a bit more formal. But at the moment I have to fit into what other people are doing. But yeah, we’ve got a good team now that can help, and that’s really a unique thing.

So I was going through the Data Blog for you. So this is a typical, a weird day, so schools, and then we’ve got another schools thing because it’s a big schools day today. This is school building projects scrapped by constituency, full list. Now, this is another case where the government didn’t make the data easily available. The Department for Education published a list of all the school projects that were going to be stopped when the government cut the funding, some of which is going towards creating Academy schools, which is why this is a bit of an issue in the country at the moment. And we wanted to know by constituency how it was working. So which MPs were having the most school projects cut in their constituency. And we couldn’t get that list out of the Department for Education, but one MP had lodged it with the House of Commons library. So we managed to get it from the House of Commons library. But it didn’t come in a good form, it came in a PDF again, so again we had to get someone from tech to sort it out for us.

But the great thing is that we can do something like this, which is a map of projects stopped by constituency, by MP. And most of the projects stopped were in Labour seats. As you know Labour are not in power at the moment. So we can do some of this sort of analysis which is great. So there were 418 projects stopped in Labour constituencies, and 268 stopped in Conservative seats. So basically 40% of Labour MPs had at least one project stopped in their seat, compared to only 27% of Conservatives, and 24% of the Lib Dems, who are in power at the moment.

JS: So would it be accurate to say the data drove this story, or showed this story, or…?

SR: Data showed this story, which is great, but there’s one caveat — of course, the raw numbers are never 100% — which is that there were more projects going on in Labour areas, because the previous government, which was Labour, set up the projects, and they gave more projects to Labour areas. So you can read it either way.

JS: And you said this in the story?

SR: We said this in the story. Absolutely. We always try and make the caveats available for people. So that’s a big story today, because of there are demonstrations about it in London. You’ve come to us on a very education-centered day today.

But there’s other stuff on the blog too. This is a very British thing. We did this because we thought it would be an interesting project to do. I had somebody in for a week and they didn’t have much to do so I got them to make a list of every Doctor Who villain ever.

JS: This was an intern project?

SR: This was an intern project. We kinda thought, yeah, we’ll get a bit of traffic. And we’ve never had so much involvement in a single piece ever. It’s had 500 retweets, and when you think most pieces will get 30 or 40, it’s kind of interesting. The traffic has been through the roof. And the great thing is, so we created–

JS: Ooh, what’s this? This is good.

SR: It’s quite an easy– we use ManyEyes quite a lot, which is very very quick to create lovely little graphics. And this is every single Doctor Who villain since the start of the program, and how many times they appear. So you see the Daleks lead the way in Doctor Who.

JS: Yeah, absolutely.

SR: Followed by the Cybermen, and the Master’s in there a lot. And there are lots of other little things. But we started off with about 106 villains in total, and now we’re up to– we put it out there and we said to people, we know this isn’t going to be the complete list, can you help us? And now we’ve got 212. So my weekend has basically been– I’ll show you the data sheet, it’s amazing. You can see the comments are incredible. You see these kinds of things, “so what about the Sea Devils? The Zygons?” and so on.

And I’ll show you the data set, because it’s quite interesting. So this is the data set. Again Google docs. And you can see over here on the right hand side, this is how many people looking at it at any one time. So at that moment there are 11 people looking on. There could be 40 or 50 people looking at any one moment. And they’re looking and they’re helping us make corrections.

JS: So, wait– this data set is editable?

SR: No, we haven’t made it editable, because we’ve had bad experiences with people coming to editable ones and mucking around, you know, putting swear words on stuff.

JS: So how do they help you?

SR: Well they’ll put stuff in the comments field and I’ll go in and put it on the spreadsheet. Because I want a sheet that people can still download. So now we’ve got, we’re now up to 203. We’ve doubled the amount of villains thanks to our readers. It’s Doctor Who. And it just shows we’re an eclectic– we’re a broad church on the Data Blog. Everything can be data. And that’s data. We’ve got number of appearances per villain, and it’s a program that people really care about. And it’s about as British as it’s possible to get. But then we also have other stuff too– and there we go, crashed again.

JS: Well let me just ask you a few questions, and take this opportunity to ask you some broader questions. Because we can do this all day. And I have. I’ve spent hours on your data blog because I’m a data geek. But let’s sort of bring it to some general questions here.

SR: Okay. Go for it.

JS: So first of all, I notice you have the Data Blog, you also have the world data index.

SR: Yes. Now the idea of that was that, obviously lots of governments around the world have started to open up their data. And around the time that the British government was– a lot of developers here were involved in that project — we started to think, what can we do around this that would help people, because suddenly we’ve got lots of sites out there that are offering open government data. And we thought, what if we could just gather them all together into one place. So you’ve got a single search engine. And that’s how we set up the world data search. Sorry to point you at the screen again.

JS: No that’s fine, that’s fine.

SR: Basically, so what we did, we started off with just Australia, New Zealand, UK and America. And basically what this site does, is it searches all of these open government data sites. Now we’ve got Australia, Toronto in Canada, New Zealand, the UK, London, California, San Francisco, and data.gov.

So say you search for “crime,” say you’re interested in crime. There you go. So you come back here, you see you’ve got results here from the UK, London, you’ve got results from data.gov in America, San Francisco, New Zealand and Australia. Say you’re interested in just seeing– you live in San Francisco and you’re only interested in San Francisco results. You’ve three results. And there you go, you click on that.

And you’re still within the Guardian site because what we’re asking people to do is help us rank the data, and submit visualizations and applications. So we want people to tell us what they’ve done with the data.

But anyway if you go and click on that, and you click on “download,” it will start downloading the data for you. Or, what it will do is take you to the terms and conditions. We don’t bypass any T&Cs. The T&Cs come alongside. But you click on that, you agree to that, and then you get the data. So we really try and make it easy for people. There you go. And this is the crime incidence data. Very variable. This is great because it’s KML files, so if you want to visualize that you get really great information. It’s all sorts of stuff. Sometimes it’s CSVs.

JS: What’s a KML file?

SR: So, Google Earth.

JS: Okay.

SR: Sorry. So, it’s mapping, a mapping file straight away.
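KML is the XML format Google Earth uses, so the placemark coordinates in a file like the crime data can be pulled out with the standard library alone. The namespace URI below is the real OGC KML 2.2 one; the sample document is invented:

```python
import xml.etree.ElementTree as ET

# A made-up one-placemark KML document for illustration.
KML = """<?xml version="1.0"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Incident 1</name>
      <Point><coordinates>-0.1276,51.5072,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

NS = {"k": "http://www.opengis.net/kml/2.2"}

def placemarks(kml_text):
    """Yield (name, lon, lat) for each Placemark that has a Point."""
    root = ET.fromstring(kml_text)
    for pm in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = pm.findtext("k:name", default="", namespaces=NS)
        coords = pm.findtext("k:Point/k:coordinates", namespaces=NS)
        if coords:
            # KML stores coordinates as lon,lat[,altitude]
            lon, lat, *_ = coords.strip().split(",")
            yield name, float(lon), float(lat)

points = list(placemarks(KML))
```

Note the lon-before-lat ordering, which trips people up when re-plotting KML data on other mapping tools.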

SR: Okay, so one of the things we ask people to do is to submit visualizations and applications they’ve produced. So for instance, London has some very very good open data. If you haven’t looked around the Data Store, it’s really worth going to. And one of these things they do is they provide a live feed of all the London traffic cameras. You can watch them live. And this is a lovely thing, because what somebody’s done is they’ve written an iPad application. So you can watch live TFL, Transport for London, traffic cameras on your iPad.

And you see that data set has been rated. A couple of people have gone in there and rated it. You’ve got a download button, the download is XML. So we try and help people around this data. And this is growing now. Every time somebody launches an open government data site we’re gonna put it on here, and we’re working on a few more at the moment. So we want it to be the place that people go to. Every time you Google “world government data” it pops up at the top, which is what you want. You want people who are just trying to compare different countries and don’t know where to start, to help them find a way through this maze of information that’s out there.

JS: So do you intend to do this for every country in the world?

SR: Every country in the world that launches an open government data site, we’ll whack it on here. And we’re working– at the moment there are about 20 decent open government data sites around the world. We’re picking those up. We’ve got on here now, how many have we got? One, two, three, four, five, six, seven, eight. We’ll have 20 on in the next couple of weeks. We’re really working through them at the moment.

And what this does is, it scrapes them. So basically, we don’t– for us it’s easy to manage because we don’t have to update these data sets all the time. The computer does that for us. But basically, what we do provide people with is context and background information, because you’re part of the data site there.

JS: So let me make sure I have this clear. So you’re not sucking down the actual data, you’re sucking down the list and descriptions of the data sets available?

SR: Absolutely. So we’re providing people– because basically we want it to be as updated as possible. If we just uploaded the data onto our site, that would kind of be pointless, and it would mean it would be out of date. This way, if something pops up on data.gov and stays there, we’ll get it quickly on here. We’ll help people find it. Helping people find the data, that’s our mission here. It’s not just generating traffic, it’s to help people find the information, because we want people to come to us when they’re looking for data.
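Because the index harvests metadata rather than the data itself, the scraper only has to collect titles and links from each catalog page. A toy illustration of that idea using Python’s html.parser (the catalog markup here is invented, not any real government site’s):

```python
from html.parser import HTMLParser

class CatalogParser(HTMLParser):
    """Collect (title, href) pairs from links tagged as datasets."""
    def __init__(self):
        super().__init__()
        self.datasets = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == "dataset":
            self._href = a.get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside a dataset link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.datasets.append(("".join(self._text).strip(), self._href))
            self._href = None

# Invented catalog page for illustration.
PAGE = """
<ul>
  <li><a class="dataset" href="/data/crime-2010.csv">Crime incidents 2010</a></li>
  <li><a class="dataset" href="/data/schools.csv">School spending</a></li>
</ul>
"""

p = CatalogParser()
p.feed(PAGE)
```

A real harvester would fetch each catalog site on a schedule and re-run something like this over the live pages, which is what keeps the index current without manual updates.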

JS: So, okay. You’ve talked about, it sounds like, two different projects. The Data Blog, where you collect and clean up and present data that you–

SR: That we find interesting. We’re selective.

JS: In the process of the Guardian’s newsgathering.

SR: Yeah, and just things that are interesting anyway. So the Doctor Who post that we were just looking at is just interesting to do. It’s not anything we’re going to do a story about. And often they’ll be things that are in the news, say that day, and I’ll think “oh that’s a good thing to put on the Data Blog.” So it could be crime figures, or it could be– and sometimes, the side effect of that is a great side effect because you end up with a piece in the paper, or a piece on the web site. But often it might be the Data Blog is the only place to get that information.

JS: And you index world government data sites.

SR: Yeah, absolutely.

JS: Does the Guardian do anything else with data?

SR: Yeah, well what we do is, we’re doing a lot of Guardian research with data. So what we want to do is give people a kind of way into that. So for instance, we do do a lot of data-based projects. We’re doing an executive pay survey of all the biggest companies, how much they pay their bosses and their chief executives. That’s a thing the paper’s always done for stories. And now what we’ll do is we’ll make that data available for people. So it’s not just new data journalism, it’s quite old data journalism. We’ve been doing it for ten years, but we used to just call it a survey. Now it’s data journalism, because it’s getting stories out of numbers. So we’ll work with that, and we’ll publish that information for people to see. And there are a couple of big projects coming up this week, which I really can’t tell you about, but next week it will be obvious what they are.

JS: Probably by the time this goes up we’ll be able to link to them.

[Simon was referring to the Guardian's data journalism work on the leaked Afghanistan war logs, described in a thorough post on the Data Blog.]

SR: Yeah, I’ll mail you about them. But we’ve got now an area of expertise. So increasingly what I’m finding is that I’m getting people coming to me within The Guardian, saying, we’ve got this spreadsheet, how can I do this? So for instance that Academies thing we were just looking at: the correspondent wanted to know, for the paper, which areas had the most schools applying. So actually, because we’ve got this area of expertise now in managing data, we’re becoming kind of a go-to place within The Guardian for journalists who are writing stories where they need to know something, or they need to find some information out, which is an interesting side effect. Because it used to be that journalists were kind of scared of numbers, and scared of data. I really think that was the case. And now, increasingly, they’re trying to embrace that, and starting to realize you can get stories out of it.

JS: Well that’s really interesting. Let’s talk for a minute about how this applies to other newsrooms, because it’s– as you say, journalists have been traditionally scared of data.

SR: Yeah, absolutely. You could say they prided themselves, in this country anyway, on a lack of mathematical ability.

JS: Which seems unfortunate in this era.

SR: Yeah, absolutely. Yeah, yeah, absolutely.

JS: But especially a lot of our readers are from smaller newsrooms, and so what kind of technical capability do you need to start tracking data, and publishing data sets?

SR: I think it’s really minimal. The thing is that actually, what we’re doing is working, most of the time, with just basic spreadsheet packages. Excel or whatever you’ve got. Excel is easy to use, but it could be any package really. And we’re using Google spreadsheets, which again are widely available. We’re using visualization tools like ManyEyes or Timetric, which are widely available and easy to use. I think what we’re doing is just bringing it together.

I think traditionally that journalists wouldn’t regard data journalism as journalism. It was research. Or, you know, how is publishing data– is that journalism? But I think now, what is happening is that actually, what used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that. Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute. So you can get stories back from them, in a way. So we’re receiving the information much more.

JS: So you publish the data, and then other people build stories out of it, is that what you’re saying?

SR: Other people will let us know – they’ll say, well, that’s an interesting story, or this is a good visualization. We’ve published data for other people to visualize, and thought, that’s quite an interesting thing to mash it up with, we should do that ourselves. So there’s that thing, and there’s also the fact that if you put the information out there, you always get a return. You get people coming back.

So for instance the Academies thing today that we were talking about. We’ve had people come back saying, well I live in Derbyshire and I know that those schools are in quite wealthy areas. So we start to think, well is there a trend towards schools in wealthy areas going to this, and schools in poorer areas not going to this.

So it gives you extra stories or extra angles on stories you wouldn’t think of. And I think that’s part of it. And I think partly there’s just the realization that just publishing data in itself, because it’s interesting, is a journalistic enterprise. Because I think you have to apply journalistic treatment to that data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

JS: So last question here, which is of course going to be on many editors’ and publishers’ minds.

SR: Sure.

JS: Let’s talk about traffic and money. How does this contribute to the business of The Guardian?

SR: Okay, it’s a new– it’s an experiment for us, but traffic-wise it’s been pretty healthy. We’ve had– during the election we were getting a million page impressions in a month. Which is not bad. On the Data Blog. Now, as a whole, out of the 36 million that The Guardian gets, it doesn’t seem like a lot. But actually, in the firmament of Guardian web sites that’s not bad. That’s kind of upper tier. And this is only after being around for a year.

SR: So in terms of what it gives us, it gives us the same as anything that produces traffic. It’s good for the brand, and it’s good for The Guardian site. In the long run, I think there is probably canny money to be made here for organizations that can manage and interpret data. I don’t know exactly how, but I think we’d have to be pretty dumb if we didn’t come up with something. I’d be very surprised. It’s an area with such a lot of potential. There are people who don’t really know how to manage or organize data, and that’s an opening for us to get involved in that area. I really think that.

But also I think that just journalistically, it’s as important to do this as it is to write a piece about a fashion week or anything else we might employ a journalist to do. And in a way it’s more important, because if The Guardian is about open information, which– since the beginning of The Guardian we’ve campaigned for freedom of information and access to information, and this is the ultimate expression of that.

And we, on the site, we use the phrase “facts are sacred.” And this comes from the famous C. P. Scott who said that “comment is free,” which as you know is the name of our comment site, but “facts are sacred” was the second part of the saying. And I kinda think that is– you can see it on the comment site, there you go. “Comment is free, but facts are sacred.” And that’s what The Guardian’s about. I really think that, you know, this says a lot about the web. Interestingly, I think that’s how the web is changing, in the sense that a few years ago it was just about comment. People wanted to say what they thought. Now I think it’s, increasingly, people want to find out what the facts are.

JS: All right, well, thank you very much for a thorough introduction to The Guardian’s data work.

SR: Thanks a lot.

Data Blog posts are often related to or supporting of news stories, but not always. Rogers sees the publishing of interesting data as a journalistic act that stands alone, and is clear on where the newsroom adds value:

I think you have to apply journalistic treatment to data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

The Guardian curates far more data than it creates. Some data sets are generated in-house, such as the Guardian’s yearly executive pay surveys, but more often the data already exists in some form, such as a PDF on a government web site. The Guardian finds such documents, scrapes the data into spreadsheets, cleans it, and adds context in a Data Blog post. But they also maintain an index of world government data which scrapes open government web sites to produce a searchable index of available data sets.

“Helping people find the data, that’s our mission here,” says Rogers. “We want people to come to us when they’re looking for data.”

In alignment with their open strategy, The Guardian encourages re-use and mashups of their data. Readers can submit apps and visualizations that they’ve created, but data has proven to be just as popular with non-developers — regular folks who want the raw information.

Sometimes readers provide additional data or important feedback, typically through the comments on each post. Rogers gives the example of a reader who wrote in to say that the Academy schools listed in his area in a Guardian data set were in wealthy neighborhoods, raising the journalistically interesting question of whether wealthier schools were more likely to take advantage of this charter school-like program. Expanding on this idea, Rogers says,

What used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that.

Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute.

So you can get stories back from them, in a way. … If you put the information out there, you always get a return. You get people coming back.

Perhaps surprisingly, data also gets pretty good traffic, with the Data Blog logging a million hits a month during the recent election coverage. “In the firmament of Guardian web sites that’s not bad. That’s kind of upper tier,” says Rogers. “And this is only after being around for a year.” (The even younger Texas Tribune also finds its data pages popular, accounting for a third of total page views.)

Rogers and I also discussed the process of getting useful data out of inept or uncooperative governments, the changing role of data specialists in the newsroom, and how the Guardian tapped its readers to produce the definitive database of Doctor Who villains. Here’s the transcript, lightly edited.

JS: All right. So. I’m here with Simon Rogers in the Guardian newsroom in London, and you’re the editor of the Data Blog.

SR: That’s right, and I’m also a news editor so I work across the organization on data journalism, essentially.

JS: So, first of all, can you tell us what the Data Blog is?

SR: Ok, well basically it came about because, as I said, I was a news editor working a lot with graphics, and we realized we were just collecting enormous amounts of data. And we thought, well, wouldn’t our readers be interested in seeing that? And when the Guardian Open Platform launched, it seemed a good time to think about opening up– we were opening up the Guardian to technical development, so it seemed a good time to open up our data collections as well.

And also it’s the fact that increasingly we’ve found people are after raw information. There’s lots of raw information online, but if you start searching for it you get bewildering amounts of replies back. If you’re looking for, say, carbon emissions, you get millions of entries back. So how do you know what the right set of data is? Whereas we’ve already done that work for our readers, because we’ve had to find that data, and we’ve had to choose it, and make an editorial selection about it, I suppose. So we thought we were able to cut out the middle man for people.

But also we kind of thought when we launched it, actually, what we’d be doing is creating data for developers. There seemed to be a lot of developers out there at that point who were interested in raw information, and they would be the people who would use the data blog, and the open platform would get a lot more traffic.

And what actually happened, what’s been interesting about it, is that– what’s actually happened is that it’s been real people who have been using the Data Blog, as much as developers. Probably more so than developers.

JS: What do you mean “real people”?

SR: Real people, I suppose what I mean is, somebody who’s just interested in finding out what a number is. So for instance, here at the moment we’ve got a big story about a government scheme for building schools, which has just been cut by the new government. It was set up by the old government, who invested millions of pounds into building new school buildings. And so we’ve got the full list of all the schools, with the parliamentary constituency they’re in, and where they are and what kind of project they were. And that is really, really popular today, that’s one of our biggest things, because there are a lot of demonstrations about it, it’s a big issue of the day. And so I would guess that 90% of people looking at it are just people who want to find out what the real raw data is.

And that’s the great thing about the internet, it gives you access to the raw, real information. And I think that’s what people really crave. They want the interpretation and the analysis from people, but they also want the veracity of seeing the real thing, without having it aggregated or put together. They just want to see the raw data.

JS: So you publish all of the original numbers that you get from the government?

SR: Well exactly. The only time– with the Data Blog, I try to make it as newsy as possible. So it’s often hooked around news stories of the day. Partly because it helps the traffic, and you’re kind of hooking on to existing requirements.

Obviously we do– it’s just a really eclectic mix of data. And I can show you the screen, for a sec.

JS: All right. Let’s see something.

SR: Okay, so this is the data blog today. So obviously we’ve got Afghanistan at the top. Afghanistan is often at the top at the moment. This is a full list of everybody who’s died, every British casualty who’s died and been wounded over time. So you’ve got this data here. We use, I tend to use a lot of third party services. This is a company called Timetric, who are very good at visualizing time series data. It takes about five minutes to create that, and you can roll over and get more information.

JS: So is that a free service?

SR: Yeah, absolutely free, you just sign up, and you share it. It works a bit like Many Eyes, you know the IBM service.

JS: Yeah.

SR: We’ll embed these Google docs. We use Google docs, Google spreadsheets, to share all our information because it’s very easy for people to download it. So say you want to download this data. You click on the link, and it will take you through in a second to, there you go, it’s the full Google spreadsheet. And you’ve got everything on here. These are monthly totals, which you can’t get anywhere else, because nobody else does that information.

JS: What do you mean nobody else does it?

SR: Well nobody else bothers to put it together month by month. You can get totals by year from, iCasualties I think do it, but we’ve just collected some month by month, because often we’ve had to draw graphics where it’s month by month. It’s the kind of thing, actually it’s quite interesting to be able to see which month was the worst for casualties.

We’ve got lists of names, which obviously are in a few places. We collect Afghanistan wounded statistics, which are terribly confused in the UK, because what they do is they try and make them as complicated as possible. So, the most serious ones, NOTICAS, is where your next of kin is notified. That’s a serious event, but also you’ve got all those people evacuated. So anyway, this kind of data. We also keep amputation data, which is a new set that the government refused to release until recently, and a Guardian reporter was instrumental in getting this data released. So we kind of thought, maybe we should make this available for people.

So you get all this data, and then what you can do, if you click on “File” there, you can download it as Excel, XML, CSV, or whatever format you want. So that’s why we use Google spreadsheets. It’s a very, very easily accessible format for people.
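Once a reader has the CSV, any standard CSV reader can work with it. A rough sketch of that downstream use in Python, with invented monthly totals standing in for the real casualty figures:

```python
import csv
import io

# Invented rows standing in for an exported monthly-totals sheet.
CSV_TEXT = """month,casualties
2010-04,3
2010-05,7
2010-06,12
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# The kind of question the month-by-month series answers:
# which month was worst?
worst = max(rows, key=lambda r: int(r["casualties"]))
```

With a real download you would pass the saved file to `open()` instead of `io.StringIO`.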

So really what we do is we try and encourage a community, a community to grow up around data and information. So every post has got a talk facility on it.

Anyway, going through it. So this is today’s Data Blog, where you’ve got Afghanistan, Academy schools in the UK. The schools are run by the state, pretty much.

JS: So just to clarify this for the American audience, what’s an Academy school?

SR: Ok, well basically in the UK most schools are state schools, that most children go to. State schools are– we all pay for them, they’re paid for out of our taxes. And they’re run at a local level, which obviously has its advantages, because it means that you are, kind of, working to an area. What the new government’s proposing to do is allow any school that wants to to become an Academy. And what an Academy is is a school that can run its own finances, and its own affairs.

And what we’ve got is the data. The government’s published the data — as a PDF of course, because governments always publish everything as a PDF, in this country anyway — and what they give you, which we’ve scraped here, is a list of every school in the UK which has expressed an interest. So you’ve got the local authority here, the name of the school, type of school, the address, and the post code. Which is great, because that’s good data, and even though it’s on a PDF we can get that into a spreadsheet quite easily.

JS: So did you have to type in all of those things from a PDF, or cut and paste them?

SR: Good god no. No, no. Luckily we’ve got a really good editorial support team here who, thanks to the Data Blog, are becoming very experienced at getting data off of PDFs. Because every government department would much rather publish something as a PDF, so they can act as if they’re publishing the data but really it’s not open.
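The cleanup work that team does usually comes down to extracting the PDF’s text and re-splitting each line into columns. A very rough sketch of that step in Python, assuming the extractor emits one school per line with columns separated by runs of spaces (the sample lines are invented):

```python
import csv
import io
import re

# Invented lines standing in for text extracted from the PDF.
RAW = """Kent   Sample High School   Secondary   ME1 4TB
Barking   Example Primary School   Primary   IG11 7LU"""

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["authority", "school", "type", "postcode"])
for line in RAW.splitlines():
    # Two or more spaces marks a column boundary; single spaces
    # inside names and postcodes survive.
    writer.writerow(re.split(r"\s{2,}", line.strip()))

csv_text = out.getvalue()
```

Real PDFs are messier than this, so in practice a dedicated table extractor plus a manual check is the safer route.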

JS: So that’s interesting, because in the UK and the US there’s this big government publicity about, you know, we’re publishing all this data.

SR: Absolutely.

JS: But you’re saying that actually–

SR: It’s not 100 percent yet. So, I’ll show you in a second that what they tend to do is just publish– most government departments still want to publish stuff as PDFs. They can’t quite get out of that thing. Or want to say, why would somebody want a spreadsheet? They don’t really get it. A lot of people don’t get it.

And we wanted the spreadsheet so you can do stuff like this, which is a map of schools interested in becoming Academies, by area. Because we have that raw data in spreadsheet form, we can work out how many are in each area. You can see suddenly that this part of England, Kent, has 99 schools, which is the biggest in the country. And only one area, Barking – sorry, down here in London – has no schools applying at all.
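The per-area map starts from a simple aggregation: count the cleaned rows per local authority. A minimal Python sketch over invented rows:

```python
from collections import Counter

# (local authority, school name) rows, invented for illustration.
schools = [
    ("Kent", "School A"),
    ("Kent", "School B"),
    ("Kent", "School C"),
    ("Barnet", "School D"),
]

by_area = Counter(area for area, _ in schools)
ranked = by_area.most_common()  # areas sorted by number of schools
```

Counts like these are what a mapping tool can then color by area.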

And the government also said at the beginning that it would mainly be schools which weren’t “outstanding” that would apply. But actually if you look at the figures, which again we can do, the majority of them are outstanding schools. So it’s schools which are already good which are applying to become Academies. Which kind of isn’t the point. But that kind of analysis, that’s data journalism in a sense. It’s using the numbers to get a story, and to tell a story.

JS: And how long did that story take you to put together? To get the numbers, and do the graphics, and…?

SR: Well, I was helped a bit, because I had one of my helpers who works in editorial support get the data onto a spreadsheet. And in terms of creating the graphic we have a fantastic tool here, which was set up by one of our technical development team, who are over there. What it does is it allows you to paste a load of geographic data into this box, and you tell it what kind it is – parliamentary constituency, or local authority, or educational authority, or whatever the different regional divisions we have in the UK – and it will draw a map for you. So this map here was drawn by computer, basically, and then one of the graphics guys helped sort out the labels and finesse it and make it look beautiful. But it saves you the hard work of coloring up all those things. So actually that took me maybe a couple of hours. In total.

JS: How about getting the data, how long did that take?

SR: Oh, well luckily that data– you know, the government makes the data available. But like I say, as a PDF file. So this is the government site, and that’s the list there, and you open it, it opens as a PDF. And we’ll link to that.

But luckily the guys in the ESD [editorial services department] are very adept now, because of the Data Blog, at getting data into spreadsheets. So, you know they can do that in 20 minutes.

JS: So how many people are working on data overall, then?

SR: Well, in terms of– it’s my full time job to do it. I’m lucky in that I’ve got an awful lot of people around here who have got an interest who I can kind of go and nudge, and ask. It’s a very informal basis, and we’re looking to formalize that, at the moment. We’re working on a whole data strategy, and where it goes. So we’re hoping to kind of make all of these arrangements a bit more formal. But at the moment I have to fit into what other people are doing. But yeah, we’ve got a good team now that can help, and that’s really a unique thing.

So I was going through the Data Blog for you. So this is a typical– a weird day, so schools, and then we’ve got another schools thing because it’s a big schools day today. This is school building projects scrapped by constituency, full list. Now, this is another case where the government didn’t make the data easily available. The Department for Education published a list of all the school projects that were going to be stopped when the government cut the funding, some of which is going towards creating Academy schools, which is why this is a bit of an issue in the country at the moment. And we wanted to know by constituency how it was working. So which MPs were having the most school projects cut in their constituency. And we couldn’t get that list out of the Department for Education, but one MP had lodged it with the House of Commons library. So we managed to get it from the House of Commons library. But it didn’t come in a good form, it came in a PDF again, so again we had to get someone from tech to sort it out for us.

But the great thing is that we can do something like this, which is a map of projects stopped by constituency, by MP. And most of the projects stopped were in Labour seats. As you know, Labour are not in power at the moment. So we can do some of this sort of analysis, which is great. There were 418 projects stopped in Labour-held seats, and 268 stopped in Conservative seats. So basically 40% of Labour MPs had at least one project stopped in their seat, compared to only 27% of Conservatives, and 24% of the Liberal Democrats, who are in power at the moment.
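That comparison boils down to the share of each party’s MPs with at least one stopped project. A sketch of the arithmetic in Python – the constituency records below are invented, not the real House of Commons figures:

```python
from collections import defaultdict

# (party, projects_stopped) per constituency — invented numbers.
seats = [
    ("Labour", 2), ("Labour", 0), ("Labour", 1),
    ("Conservative", 0), ("Conservative", 1),
    ("Lib Dem", 0),
]

totals = defaultdict(int)   # seats held per party
hit = defaultdict(int)      # seats with at least one stopped project
for party, stopped in seats:
    totals[party] += 1
    if stopped > 0:
        hit[party] += 1

# Percentage of each party's MPs affected, rounded to whole points.
share = {p: round(100 * hit[p] / totals[p]) for p in totals}
```

On the real data, this is the calculation behind the 40% / 27% / 24% figures quoted above.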

JS: So would it be accurate to say the data drove this story, or showed this story, or…?

SR: Data showed this story, which is great, but there’s one caveat — of course, the raw numbers are never 100% — the caveat was that there were more projects going on in Labour areas, because the previous government, which was Labour, set up the projects, and they gave more projects to Labour areas. So you can read it either way.

JS: And you said this in the story?

SR: We said this in the story. Absolutely. We always try and make the caveats available for people. So that’s a big story today, because there are demonstrations about it in London. You’ve come to us on a very education-centered day today.

But there’s other stuff on the blog too. This is a very British thing. We did this because we thought it would be an interesting project to do. I had somebody in for a week and they didn’t have much to do so I got them to make a list of every Doctor Who villain ever.

JS: This was an intern project?

SR: This was an intern project. We kinda thought, yeah, we’ll get a bit of traffic. And we’ve never had so much involvement in a single piece ever. It’s had 500 retweets, and when you think most pieces will get 30 or 40, it’s kind of interesting. The traffic has been through the roof. And the great thing is, so we created–

JS: Ooh, what’s this? This is good.

SR: It’s quite easy– we use ManyEyes quite a lot, which is very, very quick for creating lovely little graphics. And this is every single Doctor Who villain since the start of the program, and how many times they appear. So you see the Daleks lead the way in Doctor Who.

JS: Yeah, absolutely.

SR: Followed by the Cybermen, and the Masters in there a lot. And there are lots of other little things. But we started off with about 106 villains in total, and now we’re up to– we put it out there and we said to people, we know this isn’t going to be the complete list, can you help us? And now we’ve got 212. So my weekend has basically been– I’ll show you the data sheet, it’s amazing. You can see the comments are incredible. You see these kinds of things, “so what about the Sea Devils? The Zygons?” and so on.

And I’ll show you the data set, because it’s quite interesting. So this is the data set. Again Google docs. And you can see over here on the right hand side, this is how many people looking at it at any one time. So at that moment there are 11 people looking on. There could be 40 or 50 people looking at any one moment. And they’re looking and they’re helping us make corrections.

JS: So, wait– this data set is editable?

SR: No, we haven’t made it editable, because we’ve had bad experiences with people coming to editable ones and mucking around, you know, putting swear words on stuff.

JS: So how do they help you?

SR: Well, they’ll put stuff in the comments field and I’ll go in and put it on the spreadsheet. Because I want a sheet that people can still download. So now we’re up to 203. We’ve doubled the number of villains thanks to our readers. It’s Doctor Who. And it just shows we’re an eclectic– we’re a broad church on the Data Blog. Everything can be data. And that’s data. We’ve got number of appearances per villain, and it’s a program that people really care about. And it’s about as British as it’s possible to get. But then we also have other stuff too– and there we go, crashed again.

JS: Well let me just ask you a few questions, and take this opportunity to ask you some broader questions. Because we can do this all day. And I have. I’ve spent hours on your data blog because I’m a data geek. But let’s sort of bring it to some general questions here.

SR: Okay. Go for it.

JS: So first of all, I notice you have the Data Blog, you also have the world data index.

SR: Yes. Now the idea of that was that, obviously lots of governments around the world have started to open up their data. And around the time that the British government was– a lot of developers here were involved in that project — we started to think, what can we do around this that would help people, because suddenly we’ve got lots of sites out there that are offering open government data. And we thought, what if we could just gather them all together into one place. So you’ve got a single search engine. And that’s how we set up the world data search. Sorry to point you at the screen again.

JS: No that’s fine, that’s fine.

SR: Basically, so what we did, we started off with just Australia, New Zealand, UK and America. And basically what this site does, is it searches all of these open government data sites. Now we’ve got Australia, Toronto in Canada, New Zealand, the UK, London, California, San Francisco, and data.gov.

So say you search for “crime,” say you’re interested in crime. There you go. So you come back here, you see you’ve got results here from the UK, London, you’ve got results from data.gov in America, San Francisco, New Zealand and Australia. Say you’re interested in just seeing– you live in San Francisco and you’re only interested in San Francisco results. You’ve three results. And there you go, you click on that.

And you’re still within the Guardian site because what we’re asking people to do is help us rank the data, and submit visualizations and applications. So we want people to tell us what they’ve done with the data.

But anyway if you go and click on that, and you click on “download,” it will start downloading the data for you. Or it will take you to the terms and conditions. We don’t bypass any T&Cs; the T&Cs come alongside. But you click on that, you agree to that, and then you get the data. So we really try and make it easy for people. There you go. And this is the crime incidence data. Very variable. This is great because it’s KML files, so if you wanted to visualize that you get really great information. It’s all sorts of stuff. Sometimes it’s CSVs.

JS: What’s a KML file?

SR: So, Google Earth.

JS: Okay.

SR: Sorry. So, it’s mapping, a mapping file straight away.
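[Editor’s note: KML is an XML dialect that Google Earth and most mapping tools read directly, which is why Simon calls it great material for visualization. As a purely illustrative sketch — the file contents and function below are invented for this note, not anything The Guardian uses — here is how placemark coordinates can be pulled out of a KML file with Python’s standard library:]

```python
import xml.etree.ElementTree as ET

# KML is plain XML in the http://www.opengis.net/kml/2.2 namespace.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# A tiny hand-made KML document standing in for a downloaded crime-data file.
sample_kml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Incident 1</name>
      <Point><coordinates>-122.42,37.77,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

def placemark_points(kml_text):
    """Return (name, lon, lat) for each Placemark that contains a Point."""
    root = ET.fromstring(kml_text)
    points = []
    for pm in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = pm.find("kml:name", KML_NS)
        coords = pm.find(".//kml:coordinates", KML_NS)
        if coords is not None:
            lon, lat = coords.text.strip().split(",")[:2]
            points.append((name.text if name is not None else "",
                           float(lon), float(lat)))
    return points

print(placemark_points(sample_kml))  # → [('Incident 1', -122.42, 37.77)]
```

[Each Placemark becomes a plottable point, which is what makes KML immediately map-ready.]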

SR: Okay, so one of the things we ask people to do is to submit visualizations and applications they’ve produced. So for instance, London has some very very good open data. If you haven’t looked around the Data Store, it’s really worth going to. And one of these things they do is they provide a live feed of all the London traffic cameras. You can watch them live. And this is a lovely thing, because what somebody’s done is they’ve written an iPad application. So you can watch live TFL, Transport for London, traffic cameras on your iPad.

And you see that data set has been rated. A couple of people have gone in there and rated it. You’ve got a download button, the download is XML. So we try and help people around this data. And this is growing now. Every time somebody launches an open government data site we’re gonna put it on here, and we’re working on a few more at the moment. So we want it to be the place that people go to. Every time you Google “world government data” it pops up at the top, which is what you want. You want people who are just trying to compare different countries and don’t know where to start, to help them find a way through this maze of information that’s out there.

JS: So do you intend to do this for every country in the world?

SR: Every country in the world that launches an open government data site, we’ll whack it on here. And we’re working– at the moment there are about 20 decent open government data sites around the world. We’re picking those up. We’ve got on here now, how many have we got? One, two, three, four, five, six, seven, eight. We’ll have 20 on in the next couple of weeks. We’re really working through them at the moment.

And what this does is, it scrapes them. So basically, we don’t– for us it’s easy to manage because we don’t have to update these data sets all the time. The computer does that for us. But basically, what we do provide people with is context and background information, because you’re part of the data site there.

JS: So let me make sure I have this clear. So you’re not sucking down the actual data, you’re sucking down the list and descriptions of the data sets available?

SR: Absolutely. So we’re providing people, because basically we want it to be as updated as possible. We don’t– if we just uploaded onto our site, that would kind of be pointless, and it would mean it would be out of date. This way, if something pops up on data.gov and stays there, we’ll get it quick on here. We’ll help people find it. Helping people find the data, that’s our mission here. It’s not just generating traffic, it’s to help people find the information, because we want people to come to us when they’re looking for data.
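[Editor’s note: a minimal sketch of “scraping the listings, not the data.” Everything here is hypothetical — the catalog markup and the class="dataset" convention are invented for illustration; each real open-data site needs its own parsing rules, and many now offer proper APIs instead:]

```python
from html.parser import HTMLParser

class CatalogParser(HTMLParser):
    """Collect (title, href) pairs from anchors marked as dataset links.

    Assumes a hypothetical catalog page that tags dataset links with
    class="dataset" -- real sites each need their own rules."""

    def __init__(self):
        super().__init__()
        self.datasets = []
        self._href = None   # href of the anchor we are currently inside
        self._text = []     # text fragments of that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "dataset":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.datasets.append(("".join(self._text).strip(), self._href))
            self._href = None

sample_page = '<ul><li><a class="dataset" href="/data/crime.csv">Crime incidents</a></li></ul>'
p = CatalogParser()
p.feed(sample_page)
# Only titles and links are kept -- the data itself stays at the source.
print(p.datasets)  # → [('Crime incidents', '/data/crime.csv')]
```

[The index stores descriptions and links only, so re-scraping the catalogs keeps the listings fresh without ever copying the underlying data sets.]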

JS: So, okay. You’ve talked about, it sounds like, two different projects. The Data Blog, where you collect and clean up and present data that you–

SR: That we find interesting. We’re selective.

JS: In the process of the Guardian’s newsgathering.

SR: Yeah, and just things that are interesting anyway. So the Doctor Who post that we were just looking at is just interesting to do. It’s not anything we’re going to do a story about. And often they’ll be things that are in the news, say that day, and I’ll think “oh that’s a good thing to put on the Data Blog.” So it could be crime figures, or it could be– and sometimes, the side effect of that is a great side effect because you end up with a piece in the paper, or a piece on the web site. But often it might be the Data Blog is the only place to get that information.

JS: And you index world government data sites.

SR: Yeah, absolutely.

JS: Does the Guardian do anything else with data?

SR: Yeah, well what we do is, we’re doing a lot of Guardian research with data. So what we want to do is give people a kind of way into that. So for instance, we do do a lot of data-based projects. So for instance we’re doing an executive pay survey of all the biggest companies, how much they pay their bosses and their chief executives. That has always been a thing the paper’s always done for stories. And now what we’ll do is we’ll make that stuff available– that data available for people. So instead of just raw data journalism, it’s quite old data journalism. We’ve been doing it for ten years. But we used to just call it a survey. Now it’s data journalism, because it’s getting stories out of numbers. So we’ll work with that, and we’ll publish that information for people to see. And there are a couple of big projects coming up this week, which I really can’t tell you about, but next week it will be obvious what they are.

JS: Probably by the time this goes up we’ll be able to link to them.

[Simon was referring to the Guardian's data journalism work on the leaked Afghanistan war logs, described in a thorough post on the Data Blog.]

SR: Yeah, I’ll mail you about them. But we’ve got now an area of expertise. So increasingly what I’m finding is that I’m getting people coming to me within The Guardian, saying, so we’ve got this spreadsheet, well how can I do this? So for instance that Academies thing we were just looking at, we were really keen to find out which areas were the most, where the most schools were, for the paper. The correspondent wanted to know that. So actually, because we’ve got this area of expertise now in managing data, we’re becoming kind of a go-to place within The Guardian, for journalists who are just writing stories where they need to know something, or they need to find some information out, which is an interesting side effect. Because it used to be that journalists were kind of scared of numbers, and scared of data. I really think that was the case. And now, increasingly, they’re trying to embrace that, and starting to realize you can get stories out of it.

JS: Well that’s really interesting. Let’s talk for a minute about how this applies to other newsrooms, because it’s– as you say, journalists have been traditionally scared of data.

SR: Yeah, absolutely. You could say that, in this country anyway, they prided themselves on a lack of mathematical ability.

JS: Which seems unfortunate in this era.

SR: Yeah, absolutely. Yeah, yeah, absolutely.

JS: But especially a lot of our readers are from smaller newsrooms, and so what kind of technical capability do you need to start tracking data, and publishing data sets?

SR: I think it’s really minimal. I mean, the thing is that actually, what we’re doing is really working with basic spreadsheet packages most of the time. Excel or whatever you’ve got. Excel is easy to use, but it could be any package really. And we’re using Google spreadsheets, which again are widely available. We’re using visualization tools, again ManyEyes or Timetric, which are widely available and easy to use. I think what we’re doing is just bringing it together.
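[Editor’s note: for readers wondering what “basic spreadsheet” work looks like in code, here is an invented miniature example — totaling a small crime table by area, the kind of question a spreadsheet pivot table or this short Python script answers equally well:]

```python
import csv
import io
from collections import defaultdict

# An invented miniature "crime figures" table of the kind that might land
# in a newsroom spreadsheet; real releases are messier.
raw = """area,offence,count
Derbyshire,burglary,120
Derbyshire,theft,300
London,burglary,950
London,theft,2100
"""

# Sum the counts per area -- the spreadsheet equivalent of a pivot table.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["area"]] += int(row["count"])

# The story-finding question: which area tops the table?
for area, n in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(area, n)
```

[Nothing beyond the standard library is needed, which is the point Simon is making: the barrier to entry for this kind of work is low.]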

I think traditionally that journalists wouldn’t regard data journalism as journalism. It was research. Or, you know, how is publishing data– is that journalism? But I think now, what is happening is that actually, what used to happen is that we were the kind of gatekeepers to this information. We would keep it to ourselves. So we didn’t want our rivals to get ahold of it, and give them stories. We’d be giving stories away. And we wouldn’t believe that people out there in the world would have any contribution to make towards that. Now, that’s all changed now. I think now we’ve realized that actually, we’re not always the experts. Be it Doctor Who or Academy schools, there’s somebody out there who knows a lot more than you do, and can thus contribute. So you can get stories back from them, in a way. So we’re receiving the information much more.

JS: So you publish the data, and then other people build stories out of it, is that what you’re saying?

SR: Other people will let us know– well, we publish say, well that’s an interesting story, or this is a good visualization. We’ve published data for other people to visualize. We thought, that’s quite an interesting thing to mash it up with, we should do that ourselves. So there’s that thing, and there’s also the fact that if you put the information out there, you always get a return. You get people coming back.

So for instance the Academies thing today that we were talking about. We’ve had people come back saying, well I live in Derbyshire and I know that those schools are in quite wealthy areas. So we start to think, well is there a trend towards schools in wealthy areas going to this, and schools in poorer areas not going to this.

So it gives you extra stories or extra angles on stories you wouldn’t think of. And I think that’s part of it. And I think partly there’s just the realization that just publishing data in itself, because it’s interesting, is a journalistic enterprise. Because I think you have to apply journalistic treatment to that data. You have to choose the data in a selective, editorial fashion. And I think you have to process it in a way that makes it easy for people to use, and useful to people.

JS: So last question here, which is of course going to be on many editors’ and publishers’ minds.

SR: Sure.

JS: Let’s talk about traffic and money. How does this contribute to the business of The Guardian?

SR: Okay, it’s a new– it’s an experiment for us, but traffic-wise it’s been pretty healthy. We’ve had– during the election we were getting a million page impressions in a month. Which is not bad. On the Data Blog. Now, as a whole, out of the 36 million that The Guardian gets, it doesn’t seem like a lot. But actually, in the firmament of Guardian web sites that’s not bad. That’s kind of upper tier. And this is only after being around for a year.

So in terms of what it gives us, it gives us the same as anything that produces traffic. It’s good for the brand, and it’s good for The Guardian site. In the long run, I think there is probably canny money to be made out there for organizations that can manage and interpret data. I don’t know exactly how, but I think we’d have to be pretty dumb if we don’t come up with something. I’d be very surprised. It’s an area with such a lot of potential. There are people who don’t really know how to manage or organize data, and that’s an area for us to get involved in. I really think that.

But also I think that just journalistically, it’s as important to do this as it is to write a piece about a fashion week or anything else we might employ a journalist to do. And in a way it’s more important, because if The Guardian is about open information, which– since the beginning of The Guardian we’ve campaigned for freedom of information and access to information, and this is the ultimate expression of that.

And we, on the site, we use the phrase “facts are sacred.” And this comes from the famous C. P. Scott who said that “comment is free,” which as you know is the name of our comment site, but “facts are sacred” was the second part of the saying. And I kinda think that is– you can see it on the comment site, there you go. “Comment is free, but facts are sacred.” And that’s what The Guardian’s about. I really think that, you know, this says a lot about the web. Interestingly, I think that’s how the web is changing, in the sense that a few years ago it was just about comment. People wanted to say what they thought. Now I think it’s, increasingly, people want to find out what the facts are.

JS: All right, well, thank you very much for a thorough introduction to The Guardian’s data work.

SR: Thanks a lot.

July 30 2010

14:15

This Week in Review: WikiLeaks’ new journalism order, a paywall’s purpose, and a future for Flipboard

[Every Friday, Mark Coddington sums up the week’s top stories about the future of news and the debates that grew up around them. —Josh]

WikiLeaks, data journalism and radical transparency: I’ll be covering two weeks in this review because of the Lab’s time off last week, but there really was only one story this week: WikiLeaks’ release of The War Logs, a set of 90,000 documents on the war in Afghanistan. There are about 32 angles to this story and I’ll try to hit most of them, but if you’re pressed for time, the essential reads on the situation are Steve Myers, C.W. Anderson, Clint Hendler, and Janine Wedel and Linda Keenan.

WikiLeaks released the documents on its site on Sunday, cooperating with three news organizations — The New York Times, The Guardian, and Der Spiegel — to allow them to produce special reports on the documents as they were released. The Nation’s Greg Mitchell ably rounded up commentary on the documents’ political implications (one tidbit from the documents for newsies: evidence of the U.S. military paying Afghan journalists to write favorable stories), as the White House slammed the leaks and the Times for running them, and the Times defended its decision in the press and to its readers.

The comparison that immediately came to many people’s minds was the publication of the Pentagon Papers on the Vietnam War in 1971, and two Washington Post articles examined the connection. (The Wall Street Journal took a look at both cases’ First Amendment angles, too.) But several people, most notably ProPublica’s Richard Tofel and Slate’s Fred Kaplan, quickly countered that the War Logs don’t come close to the Pentagon Papers’ historical impact. They led a collective yawn that emerged from numerous political observers after the documents’ publication, with ho-hums coming from Foreign Policy, Mother Jones, the Washington Post, and even the op-ed page of the Times itself. Slate media critic Jack Shafer suggested ways WikiLeaks could have planned its leak better to avoid such ennui.

But plenty of other folks found a lot that was interesting about the entire situation. (That, of course, is why I’m writing about it.) The Columbia Journalism Review’s Joel Meares argued that the military pundits dismissing the War Logs as old news are forgetting that this information is still putting an often-forgotten war back squarely in the public’s consciousness. But the most fascinating angle of this story to many of us future-of-news nerds was that this leak represents the entry of an entirely new kind of editorial process into mainstream news. That’s what The Atlantic’s Alexis Madrigal sensed early on, and several others sussed out as the week moved along. The Times’ David Carr called WikiLeaks’ quasi-publisher role both a new kind of hybrid journalism and an affirmation of the need for traditional reporting to provide context. Poynter’s Steve Myers made some astute observations about this new kind of journalism, including the rise of the source advocate and WikiLeaks’ trading information for credibility. NYU j-prof Jay Rosen noted that WikiLeaks is the first “stateless news organization,” able to shed light on the secrets of the powerful because of freedom provided not by law, but by the web.

Both John McQuaid and Slate’s Anne Applebaum emphasized the need for data to be, as McQuaid put it, “marshaled in service to a story, an argument,” with McQuaid citing that as reason for excitement about journalism and Applebaum calling it a case for traditional reporting. Here at the Lab, CUNY j-prof C.W. Anderson put a lot of this discussion into perspective with two perceptive posts on WikiLeaks as the coming-out party for data journalism. He described its value well: “In these recent stories, it’s not the presence of something new, but the ability to tease a pattern out of a lot of little things we already know that’s the big deal.”

As for WikiLeaks itself, the Columbia Journalism Review’s Clint Hendler provided a fascinating account of how its scoop ended up in three of the world’s major newspapers, including differences in WikiLeaks’ and the papers’ characterization of WikiLeaks’ involvement, which might help explain its public post-publication falling-out with the Times. The Times profiled WikiLeaks and its enigmatic founder, Julian Assange, and several others trained their criticism on WikiLeaks itself — specifically, on the group’s insistence on radical transparency from others but extreme secrecy from itself. The Washington Post’s Howard Kurtz said WikiLeaks is “a global power unto itself,” not subject to any checks and balances, and former military reporter Jamie McIntyre called WikiLeaks “anti-privacy terrorists.”

Several others were skeptical of Assange’s motives and secrecy, and Slate’s Farhad Manjoo wondered how we could square public trust with such a commitment to anonymity. In a smart Huffington Post analysis of that issue, Janine Wedel and Linda Keenan presented this new type of news organization as a natural consequence of the new cultural architecture (the “adhocracy,” as they call it) of the web: “These technologies lend themselves to new forms of power and influence that are neither bureaucratic nor centralized in traditional ways, nor are they generally responsive to traditional means of accountability.”

Keeping readers out with a paywall: The Times and Sunday Times of London put up their online paywall earlier this month, the first of Rupert Murdoch’s newspapers to set off on his paid-content mission (though some other properties, like The Wall Street Journal, have long charged for online access). Last week, we got some preliminary figures indicating how life behind the wall is going so far: Former Times media reporter Dan Sabbagh said that 150,000 of the Times’ online readers (12 percent of its pre-wall visitors) had registered for free trials during the paywall’s first two weeks, with 15,000 signing on as paying subscribers and 12,500 subscribing to the iPad app. PaidContent also noted that the Times’ overall web traffic is down about 67 percent, adding that the Times will probably tout these types of numbers as a success.

The Guardian did its own math and found that the Times’ online readership is actually down about 90 percent — exactly in line with what the paper’s leaders and industry analysts were expecting. Everyone noted that this is exactly what Murdoch and the Times wanted out of their paywall — to cut down on drive-by readers and wring more revenue out of the core of loyal ones. GigaOM’s Mathew Ingram explained that rationale well, then ripped it apart, calling it “fundamentally a resignation from the open web” because it keeps readers from sharing (or marketing) it with others. SEOmoz’s Tom Critchlow looked at the Times’ paywall interface and gave it a tepid review.

Meanwhile, another British newspaper that charges for online access, the Financial Times, is boasting strong growth in online revenue. The FT’s CEO, John Ridding, credited the paper’s metered paid-content system and offered a moral argument for paid access online, drawing on Time founder Henry Luce’s idea that an exclusively advertising-reliant model weakens the bond between a publication and its readers.

Flipboard and the future of mobile media: In just four months, we’ve already seen many attention-grabbing iPad apps, but few have gotten techies’ hearts racing quite like Flipboard, which was launched last week amid an ocean of hype. As Mashable explained, Flipboard combines social media and news sources of the user’s choosing to create what’s essentially a socially edited magazine for the iPad. The app got rave reviews from tech titans like Robert Scoble and ReadWriteWeb, which helped build up enough demand that it spent most of its first few post-release days crashed from being over capacity.

Jen McFadden marveled at Flipboard’s potential for mobile advertising, given its ability to merge the rich advertising experience of the iPad with the targeted advertising possibilities through social media, though Martin Belam wondered whether the app might end up being “yet another layer of disintermediation that took away some of my abilities to understand how and when my content was being used, or to monetise my work.” Tech pioneer Dave Winer saw Flipboard as one half of a brilliant innovation for mobile media and challenged Flipboard to encourage developers to create the other half.

At the tech blog Gizmodo, Joel Johnson broke in to ask a pertinent question: Is Flipboard legal? The app scrapes content directly from other sites, rather than through RSS, like the Pulse Reader. Flipboard’s defense is that it only offers previews (if you want to read the whole thing, you have to click on “Read on Web”), but Johnson delved into some of the less black-and-white scenarios and legal issues, too. (Flipboard, for example, takes full images, and though it is free for now, its executives plan to sell their own ads around the content under revenue-sharing agreements.) Stowe Boyd took those questions a step further and looked at possible challenges down the road from social media providers like Facebook.

A new perspective on content farms: Few people had heard of the term “content farms” about a year ago, but by now there are few issues that get blood boiling in future-of-journalism circles quite like that one. PBS MediaShift’s eight-part series on content farms, published starting last week, is an ideal resource to catch you up on what those companies are, why people are so worked up about them, and what they might mean for journalism. (MediaShift defines “content farm” as a company that produces online content on a massive scale; I, like Jay Rosen, would define it more narrowly, based on algorithm- and revenue-driven editing.)

The series includes an overview of some of the major players on the online content scene, pictures of what writing for and training at a content farm is like, and two posts on the world of large-scale hyperlocal news. It also features an interesting defense of content farms by Dorian Benkoil, who argues that large-scale online content creators are merely disrupting an inefficient, expensive industry (traditional media) that was ripe for a kick in the pants.

Demand Media’s Jeremy Reed responded to the series with a note to the company’s writers that “You are not a nameless, faceless, soul-less group of people on a ‘farm.’ We are not a robotic organization that’s only concerned about numbers and data. We are a media company. We work together to tell stories,” and Yahoo Media’s Jimmy Pitaro defended the algorithm-as-editor model in an interview with Forbes. Outspoken content-farm critic Jason Fry softened his views, too, urging news organizations to learn from their algorithm-driven approach and let their audiences play a greater role in determining their coverage.

Reading roundup: A few developments and ideas to take a look at before the weekend:

— We’ve written about the FTC’s upcoming report on journalism and public policy earlier this summer, and Google added its own comments to the public record last week, urging the FTC to move away from “protectionist barriers.” Google-watcher Jeff Jarvis gave the statement a hearty amen, and The Boston Globe’s Jeff Jacoby chimed in against a government subsidy for journalism.

— Former equity analyst Henry Blodget celebrated The Business Insider’s third birthday with a very pessimistic forecast of The New York Times’ future, and, by extension, the traditional media’s as well. Meanwhile, Judy Sims targeted a failure to focus on ROI as a cause of newspapers’ demise.

— The Columbia Journalism Review devoted a feature to the rise of private news, in which news organizations are devoted to a niche topic for an intentionally limited audience.

— Finally, a post to either get you thinking or, judging from the comments, foaming at the mouth: Penn professor Eric Clemons argues on TechCrunch that advertising cannot be our savior online: “Online advertising cannot deliver all that is asked of it.  It is going to be smaller, not larger, than it is today.  It cannot support all the applications and all the content we want on the internet. And don’t worry. There are other things that can be done that will work well.”

July 29 2010

14:00

WikiLeaks and a failure of transparency

In all the kerfuffle this week around WikiLeaks and its disclosure of 91,000+ documents in its Afghan War Diary, it seems to me that a fundamental irony has been overlooked: A nonprofit journalism organization dedicated to imposing transparency on reluctant governments seems to think the rules don’t apply at home.

Go to the WikiLeaks “about” page, and you can see what I mean. There’s lots of rah-rah about rooting out corruption, freedom of the press, and why the site is “so important.” But there’s not a peep about organizational governance, where its money comes from, or where it goes.

In some cases, such opacity is accidental. But in WikiLeaks’ case, it is by design. Just two weeks before the Afghan War Diary was released, Wired published an enterprising story on WikiLeaks’ finances. The reporter, Kim Zetter, tracked down a vice president of the Berlin-based Wau Holland Foundation, which apparently handles most of WikiLeaks’ contributions. The story provided some idea of the scale of the WikiLeaks budget — the group needs about $200,000 a year for basic operations — but the vice president offered only a promise of more disclosure next month. And from WikiLeaks founder Julian Assange? No comment.

I understand the need to protect whistleblowers and other sources. But when it comes to the group’s finances, can’t they cut out all the James Bond stuff? I don’t need names and addresses of donors, but can’t we have a little more transparency and accountability?

This isn’t just a matter of idle curiosity. Love or hate WikiLeaks, the organization is doing more than its share to transform journalism. And it is doing so in dramatic fashion by fully unharnessing the power and creativity of the nonprofit model. As Ruth McCambridge noted in the Nonprofit Quarterly earlier this week, WikiLeaks “may be the soul of nonprofithood.”

If that’s the case, then the stakes involved in WikiLeaks’ own willingness to operate with transparency are quite high.

Perhaps the most-repeated criticism of the nonprofit model in journalism is that an organization that relies in whole or in part on philanthropy will become beholden to its funders and will compromise its journalistic principles in order to ensure continued funding.

That’s simply not the case — not any more than the newsroom of a for-profit newspaper would have a self-imposed ban on negative stories about car dealers, department stores, and other (remaining) major advertisers.

But the secrecy invites speculation. A July 3 post at Cryptome.org from a “WikiLeaks insider” alleges that the organization had become overly dependent on “keep alive donations” from left-wing politicians in Iceland. It warns ominously: “Sooner or later it will be payback time. And payback will be in the form of political bias in WIKILEAKS output.”

WikiLeaks does its part to fuel the speculation and undercut its credibility as well. In the Q&A on its “about” page, WikiLeaks raises this question: “Is WikiLeaks a CIA front?” I’ll save you a click and tell you that the answer is no. But do we really need this kind of drama from an organization that presents itself as an honest broker of information? Of course not.

If WikiLeaks really wants to promote transparency, it should start with its own operations.

July 28 2010

06:46

A War Logs interactive – with a crowdsourcing bonus

Owni war logs interface

French data journalism outfit Owni have put together an impressive app (also in English) that attempts to put a user-friendly interface on the intimidating volume of War Logs documents.

The app allows you to filter the information by country and category, and also allows you to limit results to incidents involving the deaths or wounding of civilians, allies or enemies.

Clicking on an individual incident brings up the raw text, but also a map of the location and the details split into an easier-to-read table.

War Logs results detail

But key to the whole project is the ability to comment on documents, making this genuinely interactive. Once you’ve commented, you can choose to receive updates on “this investigation.”

This could be fleshed out more, however (UPDATE: it’s early days – see below). “So that we can investigate a war that does not tell its name” is about as much explanation as we get – indeed, Afghanistan is not mentioned on the site at all (which presents SEO problems). In this sense the project suffers from a data-centric perspective which overlooks that not everyone has the same love of data for data’s sake.

A second weakness is an assumption that users are familiar with the story. While the project is linked with Slate.fr and Monde Diplomatique there are no links to any specifically related journalism on those sites, leaving the data without any particular context. Users visiting the site as a result of social media sharing (which is built into the site) might therefore not know what they’re dealing with.

Technically, however, this is an excellent solution to the scale problem that War Logs presents. It just needs an editorial solution to support it.

UPDATE: Nicolas Kayser-Bril, the man behind the project (disclosure: a former OJB contributor) explains the background:

“We contacted several outlets on Monday to coproduce the app. (we’re still in talks with several others in Italy, Belgium, Germany). What we offered them was an all-inclusive solution that gives them visibility and image gains and a way for them to engage with their audience.

“You’re right to say that the app lacks an editorial perspective as such. We’re implementing a feature called ‘contextualization’ that will offer users links to backgrounder stories published on partner websites according to several criteria (year, civil/military report, region, nationality of the engaged forces).

“Moreover, we’ve crowdsourced a huge work that considerably expanded the glossary published by Wikileaks and the Guardian. We launched a call for help on Monday morning. In 36 hours, we had 30% more entries related to unexplained abbreviations or details about equipment, as well as a French translation. Something we want to provide is a way for everyone with a low level of English to decipher the documents.”

July 27 2010

16:30

When do 92,000 documents trump an off-the-record dinner? A few more thoughts about Wikileaks

Sometimes you can spend an entire morning racing the clock to put together the perfect blog post, and once you’re done, find a quote or two that would have let you sum up the entire thing in a lot less time. Such is the case with this great exchange between veteran reporter Tom Ricks (now blogging at Foreign Policy magazine) and David Corn at Mother Jones. Ricks pretty much trashed the “War Logs“/Wikileaks story that has been the buzz of the journalism world for the past few days, and dropped this gem:

A huge leak of U.S. reports and this is all they get? I know of more stuff leaked at one good dinner on background.

David Corn responded with a thoughtful post that is worth reading in full. The essence of it, however, is this:

These documents — snapshots from a far-away war — show the ground truth of Afghanistan. This is not what Americans receive from US officials. And with much establishment media unable (or unwilling) to apply resources to comprehensive coverage of the war, the public doesn’t see many snapshots like these. Any information that illuminates the realities of Afghanistan is valuable.

This captures the essence of the question I was trying to get at in the fifth point of yesterday’s post (“journalism in the era of big data”). I noted the similarities between “War Logs” and last week’s big bombshell, “Top Secret America.” The essence of the similarity, I said, was that they were based on reams of data, which, in sum, might not tell us anything shockingly new but that brought home, in Ryan Sholin’s excellent phrase, “the weight of failure.” And this gets me excited because I think it represents something new in journalism, or something old enough to be new again: a focus on the aggregation of a million “on the ground reports” that might sometimes get us closer to the truth than three well-placed sources over a nice off-the-record dinner. And I’m fascinatedated by this because it is an approach that I, as a qualitative social scientist, have always seen as a particularly valid way to learn about the world.

Ricks’ quote, on the other hand, captures a certain strain of more traditional thinking: the point of journalism is to learn something shockingly new, hopefully from those elites in a position to really know what’s going on. Your job, as a journalist, is to get close enough to those elites so that they’ll tell you what’s really going on (a “nice” dinner, now, not just any old dinner!), and your skill as a journalist lies in your ability to hone your bullshit detector so that you can separate the self-serving goals of your sources from “the truth.” Occasionally, those elites will drop a big stack of documents on your desk, but that’s a rare occurrence.

I want to be clear: I don’t think one “new” type of journalism is going to displace the traditional way. Obviously, both journalistic forms will work together in tandem; indeed, it seems like most of what The New York Times did with “War Logs” was to run the data dump by its network of more elite sources for verification and context. But we are looking at something different here, and I think the Ricks-Corn exchange captures an important tension at the heart of this transition.

To conclude, two more reading links for you. In the first, “A Speculative Post on the Idea of Algorithmic Authority,” Clay Shirky wrote late last year that the authority system he sees emerging in a Google-dominated world values crap as much as it does quality.

Algorithmic authority is the decision to regard as authoritative an unmanaged process of extracting value from diverse, untrustworthy sources, without any human standing beside the result saying “Trust this because you trust me.”

This notion gets at the fact that a lot of the documents contained in the “War Logs” trove might have been biased, or partial, or flat-out wrong. But it doesn’t matter, Shirky might argue, in the same way that it might in the world that Ricks describes — a world where, in Shirky’s terms, an elite source is “standing beside the result saying ‘Trust this because you trust me.’”

The second link is a little more obscure. In her book How We Became Posthuman, N. Katherine Hayles argues that one of the major consequences of digitization is that we, as an informational culture, no longer focus as much on the distinction between presence and absence (“being there,” or not “being there”) as we do on the difference between pattern and randomness. In other words, “finding something new” (being there, being at dinner, getting the source to say something we didn’t know before) may not always be as important as finding the pattern in what is there already.

This is a deep point, and I can’t go into it much more in this post. But I’m thinking a lot about it these days as I ponder new forms of online journalism, and I’ll probably write about it more in the months and years ahead.

July 26 2010

17:00

Data, diffusion, impact: Five big questions the Wikileaks story raises about the future of journalism

Whenever big news breaks that’s both (a) exciting and (b) relevant to the stuff I research, I put myself through a little mental exercise. I pretend I have an army of invisible Ph.D. students at my beck-and-call and ask them to research the three most important “future of news” items that I think emerge out of the breaking news. That way, I figure out for myself what’s really important amidst all the chaos.

The Wikileaks-Afghanistan story is big. It’s big for the country, it’s big for NATO soldiers and Afghan civilians, and (probably least importantly) it’s big for journalism. And a ton of really smart commentary has been written about it already. So all I want to do here is chime in on what I’d be focusing on if I wanted to understand the Wikileaks story in a way that will still be relevant one year, five years, even twenty years from now. I want to briefly mention three quick assignments I’d give my hypothetical Ph.D. students, and two assignments I’d keep for myself.

Watch the news diffuse: The release of the Wikileaks stories yesterday was a classic case study of the new ecosystem of news diffusion. More complex than the usual stereotype of “journalists report, bloggers opine,” in the case of the Wikileaks story we got to see a far more nuanced (and, I would say, far more real) series of news decisions unfold: from new fact-gatherers, to news organizations in a different position in the informational chain, all the way to the Twittersphere in which conversation about the story was occurring in real-time, back to the bloggers, the opinion makers, the partisans, the politicians, and the hacks. This is how news works in 2010; let’s try to map it.

What’s the frame?: This one’s simple, but interesting because of that simplicity. With the simultaneous release of the same news story by three different media organizations, all in different countries (The New York Times, The Guardian, and Der Spiegel), all drawing on the same set of 92,000 documents, we’ve got almost a lab-quality case study here of how different national news organizations talk about the news differently. Why did The Guardian headline civilian casualties while the Times chose to talk about the U.S. relationship with Pakistan? And what do these differences in framing say about how the rest of the world sees the U.S. military adventure in Afghanistan?

What’s the impact?: Will the “War Logs” release have the same impact that the Pentagon Papers did, either in the short or long term? And why will the stories have the impact they do? Like Jay Rosen, I’m sadly skeptical that this huge story will change the course of the war in the way the Ellsberg leaks did. And like Rosen, I think a lot of the reasons lie beyond journalism — they lie in the nature of politics and the way society and the political elite process huge challenges to our assumed, stable world views.

I might make one addition to Jay’s list about the impact of this story though — one that has to do with the speed of the news cycle. As I noted already, there’s nothing more exciting than watching these sorts of stories unfold in real time. But I wonder if the “meme-like” nature of their distribution — and the fact that there will always be another meme, another bombshell — blunts their impact. You don’t have to be Nicholas Carr to get the feeling that we’re living in a short-attention-span, media-saturated society; I wonder what it would take for a story like the “War Logs” bombshell to stick around in the public mind long enough for it to mean something.

So those are stories I’d give my grad students. Here are the topics I’d be keeping for myself:

Why Wikileaks?: I talked about this a bit over in my column today at NPR, so I’ll just summarize my main points from there. Looking rationally at the architecture of the news ecosystem, it doesn’t make a lot of sense that Wikileaks would have been tapped to serve as the intermediary for this story. After all, they just turned around and fed it to three big, traditional, national newspapers. There is, of course, Wikileaks’ technical expertise; what Josh Young called their “focus lower in the journalism stack…on the logistics of anonymity.” But I think there’s more to it than that. I think to understand “why Wikileaks,” you have to think in terms of organizational culture as well as network architecture and technical skills. In short, I think Wikileaks has an organizational affinity with folks who are most likely to be on the leaking end of the news in today’s increasingly wired societies. To understand the world of Wikileaks, and what it means for journalism, you have to understand the world of geeks, of hackers, and of techno-dissidents. Understanding reporting and reporters isn’t enough.

Journalism in the era of big data: Finally, it’s here where I’d start to draw the links between the “War Logs,” the Washington Post “Top Secret America” series, and even the New York Times front page story on the increasing conservatism of the Roberts Supreme Court. What do they all have in common? Databases, big data, an attempt to get at “the whole picture” — and maybe even a slight sense of letdown. The Washington Post story took years to write and came with a giant database. The Afghanistan story was based on 92,000 documents, many of which might have been largely inaccurate. And the Roberts story unapologetically quoted “an analysis of four sets of political science data.”

We’re seeing here the full-throated emergence of what a lot of smart people have been talking about for years now: data-driven journalism, but data in the service of somehow getting to the “big picture” about what’s really going on in the world. And this attempt to get at the big picture carries with it the risk of a slight letdown, not because of journalism, but because of us. As Ryan Sholin noted on Twitter, “Much like the massive WaPo story on secrecy, I don’t see much new [in the Wikileaks story], other than the sheer weight of failure.”

Part of what we’ve been trained, as a society, to expect out of the Big Deal Journalistic Story is something “new,” something we didn’t know before. Nixon was a crook! Osama Bin Laden was found by the CIA and then allowed to escape! But in these recent stories, it’s not the presence of something new, but the ability to tease a pattern out of a lot of little things we already know that’s the big deal. It’s not the newness of failure; as Sholin might put it, it’s the weight of failure. It remains to be seen how this new focus on “the pattern” will change our political culture, our news culture, and the expectations we have of journalism. And it will be interesting to see what the focus on data leaves out. This week, however, big-data journalism proved its mettle.

10:58

White House seeks to advise reporters over Wikileaks Afghanistan release

Last night Wikileaks, the Guardian, the New York Times and Der Spiegel simultaneously published more than 90,000 classified military documents relating to the war in Afghanistan. Read our report on the publication at this link.

The New York Times has published a statement sent to reporters by the White House entitled “Thoughts on Wikileaks”. The statement advises journalists of some things to bear in mind when reporting on the leak, and offers help “to put these documents in context”.

4) As you report on this issue, it’s worth noting that wikileaks is not an objective news outlet but rather an organization that opposes US policy in Afghanistan.

The email quotes from the Guardian’s report, looking to stress the unreliability of Wikileaks and the information it has released.

From the Guardian:

But for all their eye-popping details, the intelligence files, which are mostly collated by junior officers relying on informants and Afghan officials, fail to provide a convincing smoking gun for ISI complicity. Most of the reports are vague, filled with incongruent detail, or crudely fabricated.

(…)

If anything, the jumble of allegations highlights the perils of collecting accurate intelligence in a complex arena where all sides have an interest in distorting the truth.

The Times has explained its reasons for publishing the classified files in “a note to readers” entitled “Piecing together the reports and deciding what to publish“.

Full story at this link… (see entry at 6:46pm)
