
July 01 2013

14:57

Monday Q&A: Denise Malan on the new data-driven collaboration between INN and IRE

Every news organization wishes it could have more reporters with data skills on staff. But not every news organization can afford to make data a priority — and even those that do can sometimes struggle to find the right candidates.

A new collaboration between two journalism nonprofits — the Investigative News Network and Investigative Reporters and Editors — aims to address this allocation issue. Denise Malan, formerly an investigative and data reporter at the Corpus Christi Caller-Times, will fill the new role of INN director of data services, offering “dedicated data-analysis services to INN’s membership of more than 80 nonprofit investigative news organizations,” many of them three- or four-person teams that can’t find room or funding for a dedicated data reporter.

It’s a development that could both strengthen the investigative work being done by these institutions and advance skill building around data analysis in journalism. Malan has experience training journalists to procure, clean, and analyze data, and she has high hopes for the kinds of stories and networked reporting this collaboration will produce. We talked about IRE’s underutilized data library, potentially disruptive Supreme Court decisions around freedom of information, the unfortunate end for wildlife wandering onto airplane runways, and what it means to translate numbers into stories.

O’Donovan: How does someone end up majoring in physics and journalism?
Malan: My freshman year they started a program to do a bachelor of arts in physics. Physics Lite. And you could pair that with business or journalism or English — something that was really your major focus of study, but the B.A. in physics would give you a good science background. So you take physics, you take calculus, you take statistics, and that really gives you the good critical thinking and data background to pair with something else — in my case, journalism.
O’Donovan: I guess it’s kind of easy to see how that led into what you’re doing now. But did you always see them going hand in hand? Or is that something that came later?
Malan: In college, I thought I was going to be a science writer. That was the main reason I paired those. When I got into news and started going down the path of data journalism, I was very glad to have that background, for sure. But I started getting more into the data journalism world when the Caller-Times in Corpus Christi sent me to the IRE bootcamp, a weeklong intensive where you concentrate on learning Excel and Access, the different pitfalls you can face in data, and some basic cleaning skills. That’s really what got me started in the data journalism realm. And then the newspaper continued to send me to training — to the CAR conferences every year and local community college classes to beef up my skills.
O’Donovan: So, how long were you at the Caller-Times?
Malan: I was there seven years. I started as a reporter in June 2006, and then moved up into editing in May of 2010.
O’Donovan: And in the time that you were there as their data person, what are some stories that you were particularly proud of, or that made you feel like this was a burgeoning field?
Malan: We focused on intensely local projects at the Caller-Times. One of the ones that I was really proud of, I worked on with our city hall reporter, Jessica Savage. City streets are a huge issue in Corpus Christi. If you’ve ever driven here, you know they are just horrible — a disaster. And the city is trying to find a billion dollars to fix them.

So our city hall reporter found out that the city keeps a database of scores called the Pavement Condition Index. Basically, it’s the condition of your street. So we got that database, merged it with a file of streets, and color-coded it so people could see clearly what the condition of their street was, and we put it in a database for people to find their exact block. This was something the city did not want to give us at first, because if people know their street’s condition score, they’re going to demand that something be done about it. We’re like, “Yeah, that’s kind of the idea.” But that database became the basis for an entire special section on our streets. We used it to find people on streets that scored a 0 and talked about how it affects their lives — how often they have to repair their cars, how often they walk through giant puddles.
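As a rough illustration of that kind of merge, here is a minimal sketch in Python with pandas; the file names, column names, and color thresholds are hypothetical stand-ins, not the Caller-Times’ actual data.

    import pandas as pd

    # Hypothetical inputs: the city's Pavement Condition Index scores and a
    # separate file describing each street segment/block.
    scores = pd.read_csv("pci_scores.csv")          # e.g. segment_id, street_name, pci_score
    segments = pd.read_csv("street_segments.csv")   # e.g. segment_id, block_range

    streets = segments.merge(scores, on="segment_id", how="left")

    # Bucket scores into color bands for a map; thresholds are illustrative only.
    def color_band(score):
        if pd.isna(score):
            return "no data"
        if score >= 70:
            return "green"
        if score >= 40:
            return "yellow"
        return "red"

    streets["band"] = streets["pci_score"].apply(color_band)
    print(streets[["street_name", "block_range", "pci_score", "band"]].head())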

And then we paired it with a breakout box of every city council member and their score. We did a map online, which, for over a year, actually, has been a big hit while the city is discussing how they’re going to find this money. People have been using it as a basis for the debate that they’re having, which, to me, is really kind of how we make a difference. Using this data that the city had, bringing it to light, making it accessible, I think, has really just changed the debate here for people. So that’s one thing I’m really proud of — that we can give people information to make informed decisions.

O’Donovan: Part of your new position is going to be facilitating and assisting other journalists in starting to understand how to do this kind of work. How do you tell reporters that this isn’t scary — that it’s something they can do or they can learn? How do you begin that conversation?
Malan: [At the Caller-Times] we adopted the philosophy that data journalism isn’t just something that one nerdy person in the office does, but something that everyone in the newsroom should have in their toolbox. It really enhances every beat at the newspaper.

I would do training sessions occasionally on Excel, Google Fusion Tables, and Caspio to show everyone in the newsroom, “Here’s what’s possible.” Some people really pick up on it and take it and run with it. Some people are not as math-oriented and are not going to be able to take it and run with it themselves, but at least they know those tools are available and what it’s possible to do with them.

So some of the reporters would be just aware of how we could analyze data and they would keep their eyes open for databases on their beats, and other reporters would run with it. That philosophy is very important in any newsroom today. A lot of what I’m going to be doing with IRE and INN is working with the INN members in helping them to gather the data and analyze it and inform their local reporting. So a lot of the same roles, but in a broader context.

O’Donovan: So a lot of it is understanding that everyone is going to come at it with a different skill level.
Malan: Yes, absolutely. All our members have different levels of skills. Some of our members have very highly skilled data teams, like ProPublica, Center for Public Integrity — they’re really at the forefront of data journalism. Other members are maybe one- or two-person newsrooms that may not have the training and don’t have any reporters with those skills. So the skill sets are all over the board. But it will be my job to help members, especially smaller newsrooms, plug into those resources — especially the resources at IRE — as best they can, with the data library there and the training available there. We’ll help them bring up their own skills and enhance their own reporting.
O’Donovan: When a reporter comes to you and says, “I just found this dataset or I just got access to it” — how do you dive into that information when it comes to looking for stories? How do you take all of that and start to look for what could turn into something interesting?
Malan: A lot of it depends on the data set. Just approach every set of data as a source that you’re interviewing. What is available there? What is maybe missing from the data is something you want to think about, too. And you definitely want to narrow it down: A lot of data sets are huge, especially these federal data sets that might have records containing, I don’t know, 120 fields, but maybe you’re only interested in three of them. So you want to get to know the data set and what is interesting in it, and you want to really narrow your focus.
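A minimal sketch of that first “interview” with a wide data set, in Python with pandas; the file name and column names below are hypothetical.

    import pandas as pd

    df = pd.read_csv("federal_extract.csv", low_memory=False)

    print(df.shape)                # how many records and fields are there?
    print(df.columns.tolist())     # what is actually in the data?
    print(df.isna().mean().sort_values(ascending=False).head(10))  # what's missing?

    # Narrow the focus to the handful of fields the story actually needs.
    focus = df[["state", "incident_date", "incident_type"]]
    print(focus["incident_type"].value_counts().head(20))

Counting values in a single field is often the quickest way to see whether a story is hiding in it.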

One collaboration that INN did was using data gathered by NASA for the FAA, essentially a database of near misses and incidents at airports, like hitting deer on the runway, all these little things that can happen but aren’t necessarily reported. They all get compiled in this database, and pilots write narratives about each incident, so that field is very interesting. There were four or five INN members who collaborated on that, and they all came away with different stories because they each found something that was interesting for them locally.

O’Donovan: This position you’ll hold is about bringing the work of INN and IRE together. What’s that going to look like? We talk all the time about how journalism is moving in a more networked direction — where do you see this fitting into that?
Malan: IRE and INN have always had a very close relationship, and I think that this position just kind of formalizes that. I will be helping INN members plug into the resources of IRE, especially the data library. I’ll be working closely with Liz Lucas, the database director at IRE, and I’m actually going to be living near IRE so I can work more closely with them. Some of the data there is very underutilized, and it’s really interesting and maybe hasn’t been used in any projects, especially on a national level.

So we can take that data and I can kind of help analyze it, help slice it for the various regions we might be looking at, and help the INN members use that data for their stories. I’ll basically be acting almost as a translator, getting this data from IRE and helping the INN members use it.

Going the other way, INN members might come up with a project idea where the data isn’t available from the data library, or it might be something where we have to gather data from every state individually. We would compile that, and whatever we end up with would be sent back to the IRE library and made available to other IRE members. So it’s a two-way relationship.

O’Donovan: So in terms of managing this collaboration, what are the challenges? Are you thinking of building an interface for sharing data or documents?
Malan: We’re going to be setting up a kind of committee of data people within INN to have probably monthly calls and just discuss what they’re working on and brainstorm possible ideas. I want it to be a very organic, ground-up process — I don’t want to be dictating what the projects should be. I want the members to come up with their own ideas. So we’ll be brainstorming and coming up with things, and we’ll be managing the group through Basecamp and communicating that way. A lot of the members are already on Basecamp and communicate that way through INN.

We’ll be communicating through this committee and coming up with ideas, and I’ll be working with other members, reaching out to them. If we come up with an idea that deals with health care, for example, I might reach out to some of the members that are especially focused on health care and try to bring in other members on it.

O’Donovan: Do you foresee collaborations between members, like shared reporting and that kind of thing?
Malan: Yeah, depending on the project. Some of it might be shared reporting; some of it might be someone doing a main interview. If we’re doing a crime story dealing with the FBI’s Uniform Crime Report, maybe, instead of having one reporter from every property do it, we nominate one person to do the interview with the FBI that everyone can use in their own story, which they localize with their own data. So, yeah, depending on the project, we’ll have to kind of see how the reporting would shake out.
O’Donovan: Do you have any specific goals or types of stories you want to tell, or even just specific data sets you’re eager to get a look at?
Malan: I think there are several interesting sets in the IRE data library that we might go after at first. There are really interesting health sets, for example, from the FDA — one of them is a database of adverse effects from drugs, complaints that people make that drugs have had adverse effects. So yeah, some of those can be right off the bat, ready to go and parse and analyze.

Some other data sets we might be looking at will be a little harder to get, will take some FOIs and some time to get. There are several major areas that our members focus on and that we’ll be looking at projects for. Environment, for example — fracking is a large issue, and how the environment affects public health. Health care, especially with the Affordable Care Act coming into effect next year, is going to be a large one. Politics, government, how money influences politicians is a huge area as we come up on the 2016 elections and the 2014 midterms. And education is another issue, with achievement gaps, graduation rates, charter schools — those are all large issues that our members follow. Finding those commonalities, dealing with the data sets, and digging into that is going to be my first priority.

O’Donovan: The health question is interesting. Knight announced its next round of News Challenge grants is going to be all around health.
Malan: I’m excited about that. We have several members that are really specifically focused on health, so I feel like we might be able to get something good with that.
O’Donovan: Health care stuff or more public health stuff?
Malan: It’s a mix, but a lot of stuff is geared toward the Affordable Care Act now.
O’Donovan: Gathering these data sets must often involve a lot of coordination across states and jurisdictions.
Malan: Yeah, absolutely. One thing I am a little nervous about is the Supreme Court’s recent ruling in the Virginia case, which says a state can now require you to live there to put in an FOI request. That might complicate things a little bit. I know there are several groups working on lists of people who will put an FOI in for you in various states. But that can kind of slow down the process, put a little kink in it, and add to the timeline. I’m concerned, of course, that now that it’s been ruled constitutional, every state might make that the law. It could be a huge thing. A management nightmare.
O’Donovan: What kind of advice do you normally give to reporters who are struggling to get information that they know they should be allowed to have?
Malan: That’s something we encountered a lot here, especially getting data in the proper format, too. Laws on that can vary from state to state. A lot of governments will give you paper or PDF format, instead of the Excel or text file that you asked for. It’s always a struggle.

The advice is to know the law as best you can, know what exceptions are allowed under your state law, be able to quote — you don’t have to have the law memorized, but be able to quote specific sections that you know are on your side. Be prepared with your requests, and be prepared to fight for it. And in a lot of cases, it is a fight.

O’Donovan: That’s an interesting intersection of technical and legal skill. That’s a lot of education dollars right there.
Malan: Yeah, no kidding.
O’Donovan: When you do things like attend the NICAR conference and assess the scene more broadly, where do you see the most urgent gaps in the data journalism field? Is it that we need more data analysts? More computer scientists? More reporters with fluency in communicating with government? More legal aid? If you could allocate more resources, where would you put them right now?
Malan: There’s always going to be a need for more very highly skilled data journalists who can gather these national sets, analyze them, clean them, get them into a digestible format, visualize them online, and inform readers. I would like to see more general beat reporters interested in data and at least getting skills in Excel and even Access — because the beat reporters are the ones on the ground, using their sources, finding these data sets, or not finding them if they’re not aware of what data is out there. I would really like to see a bigger push to educate most general beat reporters to at least a certain level.
O’Donovan: Where do you see the data journalism movement headed over the next couple years? What would your next big hope for the field be?
Malan: Well, of course I hope for it to go kind of mainstream, and that all reporters will have some sort of data skills. It’s of course harder with fewer and fewer resources, and reporters are learning how to tweet and Instagram, and there are demands on their time that have never been there.

But I would hope it would become just a normal part of journalism, that there would be no more “data journalism” — that it just becomes part of what we do, because it’s invaluable to reporting and to really helping ferret out the truth and to give context to stories.

September 20 2010

14:00

L.A. Times’ controversial teacher database attracted traffic and got funding from a nontraditional source

Not so long ago, a hefty investigative series from the Los Angeles Times might have lived its life in print, starting on a Monday and culminating with a big package in the Sunday paper. But the web creates the potential for long-form and in-depth work to not just live on online, but to do so in a more useful way than a print-only story could. That’s certainly the case for the Times’ “Grading the Teachers,” a series based on the “value-added” performance of individual teachers and schools. On the Times’ site, users can review the value-added scores of 6,000 3rd- through 5th-grade teachers — by name — in the Los Angeles Unified School District, as well as individual schools. The decision to run the names of individual teachers alongside their performance was controversial.

The Times calculated the value-added scores from the 2002-2003 school year through 2008-2009 using standardized test data provided by the school district. The paper hired a researcher from RAND Corp. to run the analysis, though RAND itself was not involved. From there, in-house data expert and longtime reporter Doug Smith figured out how to present the information in a way that was usable for reporters and understandable to readers.

As might be expected, the interactive database has been a big traffic draw. Smith said that since the database went live, more than 150,000 unique visitors have checked it out. Some 50,000 went right away and now the Times is seeing about 4,000 users per day. And those users are engaged. So far the project has generated about 1.4 million page views — which means a typical user is clicking on more than 9 pages. That’s sticky content: Parents want to compare their child’s teacher to the others in that grade, their school against the neighbor’s. (I checked out my elementary school alma mater, which boasts a score of, well, average.)
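A quick back-of-the-envelope check of those engagement figures, using the numbers cited in the paragraph above:

    page_views = 1_400_000
    unique_visitors = 150_000
    print(round(page_views / unique_visitors, 1))  # roughly 9.3 pages per visitor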

To try to be fair to teachers, the Times gave their subjects a chance to review the data on their page and respond before publication. But that’s not easy when you’re dealing with thousands of subjects, in a school district where email addresses aren’t standardized. An early story in the series directed interested teachers to a web page where they were asked to prove their identity with a birth date and a district email address to get their data early. About 2,000 teachers did before the data went public. Another 300 submitted responses or comments on their pages.

“We moderate comments,” Smith said. “We didn’t have any problems. Most of them were immediately postable. The level of discourse remained pretty high.”

All in all, it’s one of those great journalism moments at the intersection of important news and reader interest. But that doesn’t make it profitable. Even with the impressive pageviews, the story was costly from the start and required serious resource investment on the part of the Times.

To help cushion the blow, the newspaper accepted a grant from the Hechinger Report, the education nonprofit news organization based at Columbia’s Teachers College. [Disclosure: Lab director Joshua Benton sits on Hechinger's advisory board.] But aside from doing its own independent reporting, Hechinger also works with established news organizations to produce education stories for their own outlets. In the case of the Times, it was a $15,000 grant to help get the difficult data analysis work done.

I spoke with Richard Lee Colvin, editor of the Hechinger Report, about his decision to make the grant. Before Hechinger, Colvin covered education at the Times for seven years, and he was interested in helping the newspaper work with a professional statistician to score the 6,000 teachers using the “value-added” metric that was the basis for the series.

“[The L.A. Times] understood that was not something they had the capacity to do internally,” Colvin said. “They had already had conversations with this researcher, but they needed financial support to finish the project.” (Colvin wanted to be clear that he was not involved in the decision to run individual names of teachers on the Times’ site, just in analyzing the testing data.) In exchange for the grant, the L.A. Times allowed Hechinger to use some of its content and gave them access to the data analysis, which Colvin says could have future uses.

At The Hechinger Report, Colvin is experimenting with how the organization can best carry out its mission of supporting in-depth education coverage — producing content for the Hechinger website, placing its articles with partner news organizations, or giving direct subsidies, as in the L.A. Times series. They’re currently sponsoring a portion of the salary of a blogger at the nonprofit MinnPost whose beat includes education. “We’re very flexible in the ways we’re working with different organizations,” Colvin said. But, to clarify, he said, “we’re not a grant-making organization.”

As for the L.A. Times’ database, will the Times continue to update it every year? Smith says the district has not yet handed over the 2009-10 school year data, which isn’t a good sign for the Times. The district is battling with the union over whether to use value-added measurements in teacher evaluations, which could make it more difficult for the paper to get its hands on the data. “If we get it, we’ll release it,” Smith said.

September 06 2010

20:35

Charities data opened up – journalists: say thanks.

Having made significant inroads in opening up council and local election data, Chris Taggart has now opened up charities data from the less-than-open Charity Commission website. The result: a new website – Open Charities.

The man deserves a round of applause. Charity data is enormously important in all sorts of ways – and is likely to become more so as the government leans on the third sector to take on a bigger role in providing public services. Making it easier to join the dots between charitable organisations, the private and public sector, contracts and individuals – which is what Open Charities does – will help journalists and bloggers enormously.

A blog post by Chris explains the site and its background in more depth. In it he explains that:

“For now, it’s just the simplest of things, a web application with a unique URL for every charity based on its charity number, and with the basic information for each charity available as data (XML, JSON and RDF). It’s also searchable, and sortable by most recent income and spending, and for linked data people there are dereferenceable Resource URIs.

“The entire database is available to download and reuse (under an open, share-alike attribution licence). It’s a compressed CSV file, weighing in at just under 20MB for the compressed version, and should probably only be attempted by those familiar with manipulating large datasets (don’t try opening it up in your spreadsheet, for example). I’m also in the process of importing it into Google Fusion Tables (it’s still churning away in the background) and will post a link when it’s done.”
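For readers who do want to work with a dump like that, here is a minimal sketch in Python with pandas that processes a large compressed CSV in chunks rather than opening it in a spreadsheet; the file name and the income column are hypothetical.

    import pandas as pd

    band_counts = {"1m_plus": 0, "under_1m": 0}

    # chunksize keeps memory use modest; compression is inferred from the extension
    for chunk in pd.read_csv("opencharities.csv.zip", chunksize=100_000, low_memory=False):
        income = chunk["income"].dropna()
        band_counts["1m_plus"] += int((income >= 1_000_000).sum())
        band_counts["under_1m"] += int((income < 1_000_000).sum())

    print(band_counts)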

Chris promises to add more features “if there’s any interest”.

Well, go on…

August 16 2010

14:30

The Guardian launches governmental pledge-tracking tool

Since it came to office nearly 100 days ago, Britain’s coalition government — a team-up between Conservatives and Liberal Democrats that had the potential to be awkward and ineffective, but has instead (if The Economist’s current cover story is to be believed) emerged as “a radical force” on the world stage — has made 435 pledges, big and small, to its constituents.

In the past, those pledges might have gone the way of so many campaign promises: broken. But no matter — because they were also largely forgotten.

The Guardian, though, in keeping with its status as a data journalism pioneer, has released a tool that tries to solve the former problem by way of the latter. Its pledge-tracker, a sortable database of the coalition’s various promises, monitors the myriad pledges made according to their individual status of fulfillment: “In Progress,” “In Trouble,” “Kept,” “Not Kept,” etc. The pledges tracked are sortable by topic (civil liberties, education, transport, security, etc.) as well as by the party that initially proposed them. They’re also sortable — intriguingly, from a future-of-context perspective — according to “difficulty level,” with pledges categorized as “difficult,” “straightforward,” or “vague.”

Status is the key metric, though, and assessments of completion are marked visually as well as in text. The “In Progress” note shows up in green, for example; the “Not Kept” shows up in red. Political accountability, meet traffic-light universality.
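As an illustration of the kind of faceted filtering and traffic-light coding described above, here is a minimal Python sketch; the pledge records and the status-to-colour mapping are invented for the example, not The Guardian’s data.

    STATUS_COLOURS = {"Kept": "green", "In Progress": "green",
                      "In Trouble": "amber", "Not Kept": "red"}

    pledges = [
        {"text": "24/7 urgent care service with a single number", "topic": "health",
         "party": "Conservative", "difficulty": "difficult", "status": "In Progress"},
        {"text": "Scrap identity cards", "topic": "civil liberties",
         "party": "Liberal Democrat", "difficulty": "straightforward", "status": "Kept"},
    ]

    def filter_pledges(records, **criteria):
        """Return pledges matching every supplied field, e.g. topic='health'."""
        return [p for p in records if all(p.get(k) == v for k, v in criteria.items())]

    for p in filter_pledges(pledges, topic="health"):
        print(STATUS_COLOURS[p["status"]], "-", p["text"])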

The tool “needs to be slightly playful,” notes Simon Jeffery, The Guardian’s story producer, who oversaw the tool’s design and implementation. “You need to let the person sitting at the computer actually explore it and look at what they’re interested in — because there are over 400 things in there.”

The idea was inspired, Jeffery wrote in a blog post explaining the tool, by PolitiFact’s Obameter, which uses a similar framework for keeping the American president accountable for individual promises made. Jeffery came up with the idea of a British-government version after May’s general election, which not only gave the U.S.’s election a run for its money in terms of political drama, but also occasioned several interactive projects from the paper’s editorial staff. They wanted to keep that multimedia trajectory going. And when the cobbled-together new government’s manifesto for action — a list of promises agreed to and offered by the coalition — was released as a single document, the journalists had, essentially, an instant data set.

“And the idea just came from there,” Jeffery told me. “It seemed almost like a purpose-made opportunity.”

Jeffery began collecting the data for the pledge-tracker at the beginning of June, cutting and pasting from the joint manifesto’s PDF documents. Yes, manually. (“That was…not much fun.”) In a tool like this — which, like PolitiFact’s work, merges subjective and objective approaches to accountability — context is crucial. Which is why the pledge-tracking tool includes with each pledge a “Context” section: “some room to explain what this all means,” Jeffery says. That allows for a bit of gray (or, since we’re talking about The Guardian, grey) to seep, productively, into the normally black-and-white constraints that define so much data journalism. One health care-related pledge, for example — “a 24/7 urgent care service with a single number for every kind of care” — offers this helpful context: “The Department of Health draft structural reform plan says preparations began in July 2010 and a new 111 number for 24/7 care will be operational in April 2012.” It also offers, for more background, a link to the reform plan.

To aggregate that contextual information, Jeffery consulted with colleagues who, by virtue of daily reporting, are experts on immigration, the economy, and the other topics covered by the manifesto’s pledges. “So I was able to work with them and just say, ‘Do you know about this?’ ‘Do you know about that?’ and follow things up.”

The tool isn’t perfect, Jeffery notes; it’s intended to be “an ongoing thing.” The idea is to provide accountability that is, in particular, dynamic: a mechanism that allows journalists and everyone else to “go back to it on a weekly or fortnightly basis and look at what has been done — and what hasn’t been done.” Metrics may change, he says, as the political situation does. In October, for example, the coalition government will conclude an external spending review that will help crystallize its upcoming budget, and thus political, priorities — a perfect occasion for tracker-based follow-up stories. But the goal for the moment is to gather feedback and work out bugs, “rather than having a perfectly finished product,” Jeffery says. “So it’s a living thing.”

August 03 2010

12:57

5 tips on data journalism projects from ProPublica

A few months ago I heard ProPublica’s Olga Pierce and Jeff Larson speak at the Digital Editors Network Data Meet, giving their advice on data journalism projects. I thought I’d publish my notes on five of their tips here for the record:

1. Three-quarters of the top 10 stories on the site were news apps

Online applications prove very popular with users – but they are more often a landing page for further exploration via stories.

2. When you publish your story, ask for data

Publication is not the end of the process. If you invite users to submit their own information, it can lead to follow-ups and useful contacts.

3. Have both quantitative and qualitative fields in your forms

In other words, ask for basic details such as location, age, etc. but also ask for ‘their story’ if they have one (a rough schema sketch follows after these tips).

4. Aim for a maximum of 12 questions

That seems to be the limit that people will realistically respond to. Use radio buttons and dropdown menus to make it easier for people to complete. At the end, ask whether it is okay for the organisation to contact them to ensure you’re meeting data protection regulations.

5. Share data left over from your investigation

Just because you didn’t use it doesn’t mean someone else can’t find something interesting in it.
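As a rough illustration of tips 3 and 4, here is a minimal, hypothetical form schema in Python; the field names and options are invented.

    FORM_FIELDS = [
        {"name": "location",    "type": "dropdown", "options": ["North", "South", "East", "West"]},
        {"name": "age_range",   "type": "radio",    "options": ["18-29", "30-49", "50-64", "65+"]},
        {"name": "affected",    "type": "radio",    "options": ["Yes", "No"]},
        {"name": "your_story",  "type": "text"},                                 # the qualitative field
        {"name": "may_contact", "type": "radio",    "options": ["Yes", "No"]},   # consent to follow up
    ]

    assert len(FORM_FIELDS) <= 12, "keep the form short enough that people will finish it"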

July 22 2010

07:00

The New Online Journalists #6: Conrad Quilty-Harper

As part of an ongoing series on recent graduates who have gone into online journalism, The Telegraph’s new Data Mapping Reporter Conrad Quilty-Harper talks about what got him the job, what it involves, and what skills he feels online journalists need today.

I got my job thanks to Twitter. Chris Brauer, head of online journalism at City University, was impressed by my tweets and my experience, and referred me to the Telegraph when they said they were looking for people to help build the UK Political database.

I spent six weeks working on the database, at first manually creating candidate entries, and later mocking up design elements and cleaning the data using Freebase Gridworks, Excel and Dabble DB. At the time the Telegraph was advertising for a “data juggler” role, and I interviewed for the job and was offered it.

My job involves three elements:

  • Working with reporters to add visualisations to stories based on numbers,
  • Covering the “open data” beat as a reporter, and
  • Creating original stories with visualisations based on data from FOI and other sources.

For my job I need to know how to select and scrape good data, clean it, pick out the stories and visualise it. (P.S. you may have noticed that I’m a “data is singular” kinda guy).
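A minimal sketch, in Python, of the select-scrape-clean part of that workflow; the URL and table layout are hypothetical, and the example assumes a page containing a plain HTML table.

    import io
    import requests
    import pandas as pd

    url = "https://example.gov.uk/spending-table"   # hypothetical page with an HTML table
    html = requests.get(url, timeout=30).text

    tables = pd.read_html(io.StringIO(html))        # parse every <table> on the page
    df = tables[0]

    # basic cleaning: normalise the column headers
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    print(df.head())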

The “data” niche is greatly exciting to me. Feeding into this is the #opendata movement, the new Government’s plan to release more data and the understanding that data driven journalism as practised in the United States has to come here. There’s clearly a hunger for more data driven stories, a point well illustrated by a recent letter to the FT.

The mindset you need to have as an online journalist today is to become familiar with and proficient at using tools that make you better at your job. You have to be an early adopter. Get on the latest online service, get the latest gadget and get it before your colleagues and competitors. Find the value in those tools, integrate it into your work and go and find another tool.

When I blogged for Engadget our team had built an automated picture watermarker for liveblogging. I played with it for a few hours and made a new script that downloaded the pictures from a card, applied the watermark, uploaded the pictures and ejected the SD card. Engadget continues to try out new tools that enable them to do their job faster and better. There are endless innovations being churned out every day from the world of technology. Make time to play with them and make them work for you.
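A minimal sketch of the watermarking step only (the card download, upload and eject steps are left out); the paths are hypothetical and the example assumes the Pillow imaging library.

    from pathlib import Path
    from PIL import Image

    WATERMARK = Image.open("watermark.png").convert("RGBA")
    Path("watermarked").mkdir(exist_ok=True)

    for photo_path in Path("card_photos").glob("*.jpg"):
        photo = Image.open(photo_path).convert("RGBA")
        # paste the watermark into the bottom-right corner, using its alpha channel as the mask
        x = photo.width - WATERMARK.width - 10
        y = photo.height - WATERMARK.height - 10
        photo.paste(WATERMARK, (x, y), WATERMARK)
        photo.convert("RGB").save(Path("watermarked") / photo_path.name, "JPEG")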

If you know of anyone else who should be featured in this series, let me know in the comments.

June 28 2010

10:12

Video: BBC at the 2012 Olympics: visualisations, maps and augmented reality

With 2 years to go to the 2012 Olympics, the BBC are already starting to plan their online coverage of the event. With a large, creative team at hand who have experimented with maps, visualisations and interactive content in the past, the pressure is on them to keep the standards high.

At the recent News:Rewired event, OJB caught up with Olympics Reporter Ollie Williams, himself a visualisation guru, to find out exactly what they were planning for 2012.

June 04 2010

11:32

Get used to reading this…

“We have a team of developers going through the data now – and we’ll let you know here what we learn as and when we learn it.”

If you had any doubt over the concept of ‘programmer as journalist’, that quote above from The Guardian’s liveblog of the opening of the COINS database gives you a preview of things to come. While you’re at it, you might as well add in ‘statistician as journalist‘ and ‘information designer as journalist‘ – or look at my post from 2008 on New Journalists for New Information Flows. Are we there yet?

09:25

Coins Expenditure Database Published by Government – Open Data

(Cross-posted from the Wardman Wire.)

This looks like an excellent start. The Coalition Government has just published the COINS database, which is the detailed database of Government spending:

The release of COINS data is just the first step in the Government’s commitment to data transparency on Government spending.

You can get the database from the data.gov website here. There are explanations to help you get to grips with it here.

Tim Almond notes (via chat) that it is a 68MB zipped file which extracts to 4GB, i.e., huge. It will require significant database tools to get to grips with this, but I’m predicting that easier ways of querying may be created by someone within 48 hours. Here is the full statement:

COINS: publishing data from the database

The release of COINS data is just the first step in the Government’s commitment to data transparency on Government spending.

The data is available from Data.gov.uk (opens in new window) but the following guidance explains more about the release.

What is COINS?

COINS – the Combined On-line Information System – is used by the Treasury to collect financial data from across the public sector to support fiscal management, the production of Parliamentary Supply Estimates and public expenditure statistics, the preparation of Whole of Government Accounts (WGA) and to meet data requirements of the Office for National Statistics (ONS).

Up to nine years of data can be actively maintained – five historic (or outturn) years, the current year and up to three future (or plan) years depending on the timing of the latest spending review. COINS is a consolidation system rather than an accounts application, and so it does not hold details of individual financial transactions by departments.

Why are you doing this?

The coalition agreement made clear that this Government believes in removing the cloak of secrecy from government and throwing open the doors of public bodies, enabling the public to hold politicians and public bodies to account. Nowhere is this truer than in being transparent about the way in which the Government spends your money. The release of COINS data is just the first step in the Government’s commitment to data transparency on Government spending.

As the Prime Minister has made clear, by November, all new items of central government spending over £25,000 will be published online and by January of next year, all new items of local government spending over £500 will be published on a council-by-council basis.

Who might find the data useful?

COINS contains millions of rows of data; as a consequence the files are large and the data held within the files complex. Using these download files will require some degree of technical competence and expertise in handling and manipulating large volumes of data. It is likely that these data will be most easily used by organisations that have the relevant expertise, rather than individuals. Having access to this data, institutions and experts will be able to process it and present it in a way that is more accessible to the general public. In addition, subsets of data from the COINS database will also be made available in more accessible formats by August 2010.

Downloading the data

The COINS data are provided in two files for each financial year; the ‘fact table’ (fact table extract 200x xx.txt) and the ‘adjustment table’ (adjustment table extract 200x xx .txt). The contents of these two files are explained in ‘What is COINS data?’.

The ‘fact tables’ are approximately 70MB. With a fast broadband link of 8mbps, it will take approximately 10 minutes to download this file. The ‘adjustment tables’ are approximately 40MB, and this will take approximately 5 minutes to download. Both these files have been compressed using ZIP archival. When unzipped the file sizes will decompress and expand significantly to sizes of around 5GB and 0.5GB respectively.

The data are provided in a txt file format. The structure of the data is similar to a csv (comma separated variable) file with a string of characters being formed to represent each row, with each field separated by an ‘@’. We estimate that there are around 3.5 million rows in the ‘fact tables’, and around 500,000 rows in the ‘adjustment tables’. While the contents of the latter can be loaded into Excel 2007, the former is too large for the Excel software. In order to read the data, they will need to be uploaded into appropriate database software.
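A minimal sketch of reading one of those ‘@’-delimited extracts in chunks with Python and pandas; the file name and the columns being aggregated are hypothetical, since the real field names are set out in the accompanying guidance.

    import pandas as pd

    total_rows = 0
    spend_by_dept = {}

    for chunk in pd.read_csv("facttable_extract_2009_10.txt", sep="@",
                             chunksize=250_000, low_memory=False):
        total_rows += len(chunk)
        grouped = chunk.groupby("Department")["Value"].sum()   # hypothetical column names
        for dept, value in grouped.items():
            spend_by_dept[dept] = spend_by_dept.get(dept, 0.0) + float(value)

    print(total_rows, "rows processed")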

We need some crowd-sourced hackery here to prevent everyone reinventing the wheel. If you want to be in on the start of this conversation, try chatting to Tim on twitter.

The next step will need to be a regular and reliable process to allow meaningful continuing analysis.

April 20 2010

08:56

Telegraph launches powerful election database

The Telegraph have finally launched – in beta – the election database I’ve been waiting for since the expenses scandal broke. And it’s rather lovely.

Starting with the obvious part (skip to the next section for the really interesting bit): the database allows you to search by postcode, candidate or constituency, or to navigate by zooming, moving and clicking on a political map of the UK.

Searches take you to a page on an individual candidate or a constituency. For the former you get a biography, details on their profession and education (for instance, private or state, oxbridge, redbrick or neither), as well as email, website and Twitter page. Not only is there a link to their place in the Telegraph’s ‘Expenses Files’ – but also a link to their allowances page on Parliament.uk.

Constituency pages feature a raft of stats, the names of candidates (not many at the moment), and the swing needed to change control.

At the moment both have ‘Related stories’ but these are only related in the loosest sense for the moment. And there is a link to the election map and swingometer that The Telegraph built previously.

Advanced search

All of which is nice but not earth-shattering. Where the database really comes into its own is with the Advanced Search feature.

This is so powerful that the main issue may turn out to be usability. I’m not sure myself of everything it can do at the moment, but apart from the fundamentals of actually finding a candidate, this allows you to filter all the candidates in the database based on everything from what type of education they had to their age, gender, profession, county and role (i.e. contesting, defending, standing for the first time or again). The Swingometer filter also appears to let you filter based on who wins as a result of predicted swings (not just Lab-Con but Con-Lib and Lab-Lib).

The site is still rough around the edges – it appears that the Shadow Secretary of State for Justice Dominic Grieve went to “Lycée Français Charles de Gaulle” and ‘Oxbridge University’, while the link to his website is missing a ‘http://’ and so doesn’t work.

Data geeks will be disappointed that the data doesn’t appear to be mashable, and there obviously isn’t an API. The Telegraph’s Marcus Warren tells me that they are looking at mashups for after the election, but for the moment are focusing on researching candidates.

That seems a sensible move. The MPs’ expenses scandal may turn out not just to be the biggest story of the last decade, but the foundation of a political database to rival any other news organisation. The Telegraph have a real strength here and it’s good to see them building on it.

March 31 2010

13:53

Sourcemap Makes Data Visualizations Transparent

Yesterday colleagues of mine at MIT were brainstorming plenaries for an upcoming media conference. Data visualization came up, but each of us grumbled. "Overdone," one of us said, to nodding heads. We'd done a session on that at every one of our conferences and forums, as had others at theirs. Data visualization had become tragically hip, as if we were in charge of a music festival and one of us had just proffered Coldplay.

But as we teased out our reservations, we realized that it wasn't visualization that we had an issue with; yes, we agreed, it's an overdone topic, but it's still incredibly useful. Rather, data was the problem. Despite the great leaps made in representing information, we were disappointed in the relatively teeny steps taken to explain how that information is collected, organized, and verified. (In the meeting, I half-jokingly suggested adding a credit card company executive to a plenary, given their companies' dependence on accurate data. It wasn't dismissed out of hand.)

The key issue, it turns out, is transparency throughout the entire data-collecting process (something we wouldn't expect to get from a credit card exec). Matt Hockenberry and Leo Bonanni here at MIT's Center for Future Civic Media try to address this issue with their project Sourcemap.

While pitched as a way to create and visualize "open supply chains," Sourcemap's real virtue is that the data itself is fully sourced. Like the links at the bottom of a Wikipedia article and the accompanying edit history, you know exactly who added the data and where that data came from. You can take that data and make counter-visualizations if you feel the data isn't correctly represented. Sourcemap's very structure acknowledges that visualization is an editorial process and gives others a chance to work with the original data. One example is a Sourcemap for an Ikea bed.

In another example, the Washington Post yesterday published a piece about food fraud -- food whose contents or origins are misrepresented. You could use Sourcemap to out companies that lie about their food products. But using the same data, a food producer could use Sourcemap to show how consumer prices are lowered by using certain substitute ingredients and not others. The same goes for visualizations of campaign contributions, federal spending on hospitals, rural broadband penetration, your mayor's ability to get potholes filled, etc. The key, though, isn't necessarily good visualization but good data.

I don't want to understate the genius of good visualizations. But as they say: garbage in, garbage out. Without well collected, well organized, transparent data, you never know if you're looking at a mountain of trash.

March 19 2010

09:25

Interview: Nicolas Kayser-Bril, head of datajournalism at Owni.fr

Past OJB contributor Nicolas Kayser-Bril is now in charge of datajournalism at Owni.fr, a recently launched news site that defines itself as an “open think-tank”.

“Acting as curators, selecting and presenting content taken deep in the immense and self-expanding vaults of the internet,” explains Nicolas, “the Owni team links to the best and does the rest.”

I asked Nicolas 2 simple questions on his work at Owni. Here are his responses:

What are you trying to do?

What we do is datajournalism. We want to use the whole power of online and computer technologies to bring journalism to a new height, to a whole new playing field. The definition remains vague because so little has been made until now, but we don’t want to limit ourselves to slideshows, online TV or even database journalism.

Take the video game industry, for instance. In the late 1970’s, a personal computer could be used to play Pong clones or text-based games. Since then, a number of genres have flourished, taking action games to 3D, building an ever-more intelligent AI for strategy games, etc. In the age of the social web, games were quick to use Facebook and even Twitter.

Take the news industry. In the late 1970’s, you could read news articles on your terminal. In the early 2010’s you can, well… read articles online! How innovative is that? (I’m not overlooking the innovations you’ll be quick to think of, but the fact remains that most online news content is articles.)

We want to enhance information with the power of computers and the web. Through software, databases, visualizations, social apps, games, whatever, we want to experiment with news in ways traditional and online media haven’t done yet.

What have you achieved?

We started to get serious about this in February, when I joined the mother company (22mars) full-time. In just a month, we have completed 2 projects.

The first one, dubbed Photoshop Busters (see it here), gives users digital forensics tools to assess the authenticity of an image. It was made as a widget for one of our partners, LesInrocks.com.

More importantly, we made a Facebook app, Where do I vote? There, users can find their polling station and their friends’ for the upcoming regional election in France.

It might sound underwhelming, but it required finding and locating the addresses of more than 35,000 polling stations.

On top of convincing a reluctant administration to hand over their files, we set up a large crowdsourcing effort to convert the documents from badly scanned PDFs to computer-readable data. More than 7,000 addresses have been treated that way.
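For the locating step, here is a minimal sketch of geocoding cleaned addresses in Python; the CSV name and column are hypothetical, and geopy’s free Nominatim geocoder is used only as an example (at this volume a batch geocoding service would be more realistic).

    import csv
    import time
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="polling-station-demo")

    with open("polling_stations.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            location = geolocator.geocode(row["address"])
            if location:
                print(row["address"], location.latitude, location.longitude)
            time.sleep(1)  # stay well within the free service's rate limit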

Dozens of other ideas are in the works. Within Owni.fr, we want to keep the ratio of developers/non-developers to 1, so as to be able to go from idea to product very quickly. I code most of my ideas myself, relying on the team for help, ideas and design.

In the coming months, we’ll expand our datajournalism activities to include another designer, a journalist and a statistician. Expect more cool stuff from Owni.fr.

January 05 2010

18:00

California Watch: The latest entrant in the dot-org journalism boom

“Ten years ago,” says Mark Katches, editorial director of California Watch, “there were 85 reporters covering the California state house; today there are fewer than 25.”

Katches sees California Watch, which officially launched yesterday after a soft launch period and months of preparation, as stepping into a “big void in doing investigative work in California.” Katches has assembled the largest investigative team in the state: seven reporters, two multimedia producers, and two editors.

The site is focused on investigative watchdog journalism. It won’t cover the ins and outs of the California legislature or other governmental minutiae, aiming instead to “expose injustice, waste, mismanagement, wrongdoing, questionable practices and corruption, so that those responsible can be held to account and the public is armed with the information it needs to debate solutions and spark change.” Besides political topics, the site will cover higher education, health and welfare, and criminal justice.

Assembling the team

Based in Berkeley, California Watch has a four-person team in Sacramento, and hopes to open a Los Angeles office as well. 

The team’s credentials are impressive. Katches is a California native who has lived in the state most of his life; he directed investigative teams at The Orange County Register and, for the past two years, at the Milwaukee Journal Sentinel. The team’s director is Louis Freedberg, a longtime reporter on California affairs for the San Francisco Chronicle and other state and national publications. Senior editor Robert Salladay is a veteran of the L.A. Times; senior reporter Lance Williams has 32 years of California coverage experience and was one of the two Chronicle reporters who uncovered the Barry Bonds-BALCO steroid doping scandal. Web entrepreneur Susan Mernit, a veteran of AOL, Netscape and Yahoo, supplies web strategy. Multimedia guru Mark Luckie (of 10,000 Words fame) is producing content. And longtime Philadelphia Inquirer journalist Robert Rosenthal, director of CIR, and others on the CIR staff supply development and administrative support.

I asked Katches whether California Watch is doling out the kind of salaries reported to be going to the top talent at recent nonprofit startup Texas Tribune ($315,000 to CEO Evan Smith, $90,000 to top reporter Brian Thevenot). “Not even close,” he said. Top California Watch executives are paid closer to what Texas Tribune reporters get, but Katches says the pay scales are competitive and appropriate for the levels of talent and scope of management involved.

The model

The site aims for up to a dozen updates every weekday, including daily blog entries by most staffers. A rotation of four top stories is featured front and center, followed by the “WatchBlog” and an inside-the-newsroom feature. Like The Texas Tribune, the site offers an extensive data center, currently featuring information about stimulus-funding distribution, campaign finance, educational costs, and wildfires. It’s not as extensive or interactive as the Texas Trib databases and document collection, but the intent is to build up its contents over time.

California Watch is a project of the Center for Investigative Reporting, the oldest nonprofit investigative news organization in the country (founded 1977), and joins a growing list of state and regional nonprofits that have in common a serious journalistic mission but take a variety of approaches to funding, coverage and distribution. The highest profile, best-funded members of that list now include The Texas Tribune, MinnPost, the St. Louis Beacon, Voice of San Diego, and (at a national level) ProPublica. “The dot-org boom” is really one of the top journalism stories of 2009, Katches says.

CIR garnered about $3.5 million in funding to start California Watch (roughly the same amount as The Texas Tribune), enough for more than two years of operations at its $1.5 million annual budget. Major funding came from the John S. and James L. Knight Foundation [also a supporter of this site —Ed.], the William and Flora Hewlett Foundation, and the James Irvine Foundation.

Going forward, California Watch plans to develop a business model that includes continued philanthropic support, along with revenue from sponsorship, individual memberships, advertising, and licensing. The site is offering its content to the state’s newspapers and other media on a fee basis. One of its first stories during the development period was carried by 25 of the state’s papers, all on the front page. (This fee-based model differs from The Texas Tribune, which is offering its content free to Texas media outlets for now; Texas Tribune also covers day-to-day politics in addition to doing investigative journalism.) California Watch partners with KQED in San Francisco for radio and TV distribution; with the Associated Press for distribution through its Exchange marketplace; and with New America Media for distribution of translated versions to ethnic media.

December 16 2009

17:10

KNC 2010: NewsGraf wants to slap a search box on journalists’ brains

[EDITOR'S NOTE: The Knight News Challenge closed submissions for the 2010 awards last night at midnight, which means that another batch of great ideas, interesting concepts, and harebrained schemes got their chance to convince the Knight Foundation they deserve funding. (Trust us — great, interesting, and harebrained are all well represented at this stage each year.) We've been picking through the applications available for public inspection the past few weeks, and over the next few days Mac is going to highlight some of the ideas that struck us as worthy of a closer look — starting today with NewsGraf, below.

But we also want your help. Do you know of a really interesting News Challenge application? Did you submit one yourself? Let us know about it. Either leave a comment on this post or email Mac Slocum. In either case, keep your remarks brief — 200 words or less. We'll run some of the ones you think are noteworthy in a post later this week. —Josh]

The most eye-catching thing about the NewsGraf proposal is its price tag: $950,000 over two years. That stands out in a sea of $50,000 and $100,000 requests.

But if you spend a little time digging into the intricacies of NewsGraf, that big price becomes downright reasonable. Cheap even. That’s because with NewsGraf, Mike Aldax and John Marshall want to digitally duplicate the knowledge, connections and synapses of a veteran journalist. That kind of audacity doesn’t come cheap.

Technologically speaking, NewsGraf ventures into the murky world of semantic tagging and social graphs. Unless you’ve got a computer science degree, it’s hard to get a handle on exactly what NewsGraf is. It’s a database, it’s a search engine, but it’s also a connectivity machine.

It’s easier to compare NewsGraf to a person — think of it as a veteran reporter, someone who carries around a vast collection of interviews, research, and general knowledge gleaned from years working a beat. All this info is tucked neatly into her memory, and she taps this personal database whenever she’s assembling a story, searching it for red flags, patterns, and relationships. It’s an editorial sixth sense.

But there’s a big problem with this brain-based model: It disappears when the brain — and its associated owner — get laid off. With news organizations already running smaller and faster, how can they possibly overcome this growing knowledge gap?

Enter NewsGraf. The project is still on the drawing board, but the idea is to capture all that connective information in a format that’s accessible to anyone with a web browser. A visitor can enter the name of a local newsmaker and see the threads that bind that person to others in the community. It’s like Facebook, as designed by a beat reporter.
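A toy illustration, in Python with networkx, of the kind of connection graph the proposal describes; the people and relationships here are invented.

    import networkx as nx

    g = nx.Graph()
    g.add_edge("Council member", "Development firm", relation="campaign donation")
    g.add_edge("Development firm", "Zoning board chair", relation="former employer")
    g.add_edge("Council member", "Zoning board chair", relation="appointed")

    # "Enter the name of a local newsmaker and see the threads that bind that
    # person to others in the community."
    for neighbor in g.neighbors("Council member"):
        relation = g.edges["Council member", neighbor]["relation"]
        print("Council member --", relation, "-->", neighbor)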

Data will come from government databases, local newspapers, blogs, and other sources. After running a query, a user can click through to the originating stories for deeper information. NewsGraf is merely the conduit here; Marshall said they want to send users to the information, not keep them locked within NewsGraf’s walls. As the application puts it:

As newspapers find it increasingly difficult to send reporters to monitor local politics and public discourse, communities will need alternative mechanisms to ensure transparency and good government. Local journalists and citizens will be able to draw upon NewsGraf’s data as a starting point for further investigation, uncovering important relationships that may be influencing decisions being made in their community.

The team behind the idea combines journalism (Aldax covers city hall for The San Francisco Examiner) and tech (Marshall is a software developer and a former VP at AOL). NewsGraf will focus on San Francisco and the Bay Area if it wins a News Challenge grant. But if funding doesn’t come through, Aldax hopes someone else runs with the idea. “We just want to see this happen,” he said.
