
July 01 2013

14:57

Monday Q&A: Denise Malan on the new data-driven collaboration between INN and IRE

Every news organization wishes it could have more reporters with data skills on staff. But not every news organization can afford to make data a priority — and even those that do can sometimes find the right candidates hard to come by.

A new collaboration between two journalism nonprofits — the Investigative News Network and Investigative Reporters and Editors — aims to address this allocation issue. Denise Malan, formerly an investigative and data reporter at the Corpus Christi Caller-Times, will fill the new role of INN director of data services, offering “dedicated data-analysis services to INN’s membership of more than 80 nonprofit investigative news organizations,” many of them three- or four-person teams that can’t find room or funding for a dedicated data reporter.

It’s a development that could both strengthen the investigative work being done by these institutions and promote skill building around data analysis in journalism. Malan has experience training journalists to procure, clean, and analyze data, and she has high hopes for the kinds of stories and networked reporting that will be produced by this collaboration. We talked about IRE’s underutilized data library, potentially disruptive Supreme Court decisions around freedom of information, the unfortunate end for wildlife wandering onto airplane runways, and what it means to translate numbers into stories.

O’Donovan: How does someone end up majoring in physics and journalism?
Malan: My freshman year they started a program to do a bachelor of arts in physics. Physics Lite. And you could pair that with business or journalism or English — something that was really your major focus of study, but the B.A. in physics would give you a good science background. So you take physics, you take calculus, you take statistics, and that really gives you the good critical thinking and data background to pair with something else — in my case, journalism.
O’Donovan: I guess it’s kind of easy to see how that led into what you’re doing now. But did you always see them going hand in hand? Or is that something that came later?
Malan: In college, I thought I was going to be a science writer. That was the main reason I paired those. When I got into news and started going down the path of data journalism, I was very glad to have that background, for sure. But I started getting more into the data journalism world when the Caller-Times in Corpus Christi sent me to the IRE bootcamp, a weeklong intensive where you concentrate on learning Excel and Access and the different pitfalls you can face in data — some basic cleaning skills. That’s really what got me started in the data journalism realm. And then the newspaper continued to send me to training — to the CAR conferences every year and local community college classes to beef up my skills.
O’Donovan: So, how long were you at the Caller-Times?
Malan: I was there seven years. I started as a reporter in June 2006, and then moved up into editing in May of 2010.
O’Donovan: And in the time that you were there as their data person, what are some stories that you were particularly proud of, or made you feel like this was a burgeoning field?
Malan: We focused on intensely local projects at the Caller-Times. One of the ones that I was really proud of I worked on with our city hall reporter Jessica Savage. She found out that the city streets are a huge issue in Corpus Christi. If you’ve ever driven here, you know they are just horrible — a disaster. And the city is trying to find a billion dollars to fix them.

So our city hall reporter found out that the city keeps a database of scores called the Pavement Condition Index. Basically, it’s the condition of your street. So we got that database and we merged it with a file of streets and color-coded it so people could fully see what the condition of their street was, and we put it in a database for people to find their exact block. This was something the city did not want to give us at first, because if people know the condition of their street scores, they’re going to demand that we do something about it. We’re like, “Yeah, that’s kind of the idea.” But that database became the basis for an entire special section on our streets. We used it to find people on streets that scored a 0, and talked about how it affects their lives — how often they have to repair their cars, how often they walk through giant puddles.
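A merge like the one Malan describes can be sketched in a few lines of pandas. All of the column names, street names, and score thresholds here are hypothetical, invented for illustration:

```python
# Join a table of Pavement Condition Index scores to a street list and
# bucket each street into a colour band for a map. Everything here is
# illustrative; real PCI data would have many more fields.
import pandas as pd

scores = pd.DataFrame({
    "street_id": [101, 102, 103],
    "pci_score": [0, 45, 88],        # 0 = worst, 100 = best
})
streets = pd.DataFrame({
    "street_id": [101, 102, 103],
    "name": ["Ayers St", "Staples St", "Ocean Dr"],
})

def pci_band(score):
    """Map a PCI score to a colour band (hypothetical cutoffs)."""
    if score < 40:
        return "red"      # failing: candidate for reconstruction
    elif score < 70:
        return "yellow"   # fair: needs maintenance
    return "green"        # good

merged = streets.merge(scores, on="street_id")
merged["band"] = merged["pci_score"].apply(pci_band)
print(merged[["name", "pci_score", "band"]])
```

From a table like `merged`, a lookup page lets readers find their own block, which is the "find your exact block" feature described above.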

And then we paired it with a breakout box of every city council member and their score. We did a map online, which, for over a year, actually, has been a big hit while the city is discussing how they’re going to find this money. People have been using it as a basis for the debate that they’re having, which, to me, is really kind of how we make a difference. Using this data that the city had, bringing it to light, making it accessible, I think, has really just changed the debate here for people. So that’s one thing I’m really proud of — that we can give people information to make informed decisions.

O’Donovan: Part of your new position is going to be facilitating and assisting other journalists in starting to understand how to do this kind of work. How do you tell reporters that this isn’t scary — that it’s something they can do or they can learn? How do you begin that conversation?
Malan: [At the Caller-Times] we adopted the philosophy that data journalism isn’t just something that one nerdy person in the office does, but something that everyone in the newsroom should have in their toolbox. It really enhances every beat at the newspaper.

I would do training sessions occasionally on Excel, Google Fusion Tables, Caspio to show everyone in the newsroom, “Here’s what’s possible.” Some people really pick up on it and take it and run with it. Some people are not as math oriented and are not going to be able to take it and run with it themselves, but at least they know those tools are available and what it’s possible to do with them.

So some of the reporters would be just aware of how we could analyze data and they would keep their eyes open for databases on their beats, and other reporters would run with it. That philosophy is very important in any newsroom today. A lot of what I’m going to be doing with IRE and INN is working with the INN members in helping them to gather the data and analyze it and inform their local reporting. So a lot of the same roles, but in a broader context.

O’Donovan: So a lot of it is understanding that everyone is going to come at it with a different skill level.
Malan: Yes, absolutely. All our members have different levels of skills. Some of our members have very highly skilled data teams, like ProPublica, Center for Public Integrity — they’re really at the forefront of data journalism. Other members are maybe one- or two-person newsrooms that may not have the training and don’t have any reporters with those skills. So the skill sets are all over the board. But it will be my job to help members, especially smaller newsrooms, plug into those resources — especially the resources at IRE — the best they can, with the data library there and the training available there. We’ll help them bring up their own skills and enhance their own reporting.
O’Donovan: When a reporter comes to you and says, “I just found this dataset or I just got access to it” — how do you dive into that information when it comes to looking for stories? How do you take all of that and start to look for what could turn into something interesting?
Malan: A lot of it depends on the data set. Just approach every set of data as a source that you’re interviewing. What is available there? What might be missing from the data is something you want to think about, too. And you definitely want to narrow it down: A lot of data sets are huge, especially these federal data sets that might have records containing, I don’t know, 120 fields, but maybe you’re only interested in three of them. So you want to get to know the data set, and what is interesting in it, and you want to really narrow your focus.
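That "interview the data" process can be sketched in pandas. The incident fields below are invented for illustration, loosely inspired by the wildlife-strike example later in the interview:

```python
# "Interviewing" a dataset: see what fields it holds, check what's
# missing (gaps can be a story too), then narrow a wide file down to
# the few fields you actually need.
import io
import pandas as pd

raw = io.StringIO(
    "incident_id,state,date,animal,damage,narrative,runway,phase\n"
    "1,TX,2013-01-04,deer,minor,Struck on rollout,17L,landing\n"
    "2,TX,2013-02-11,coyote,,Seen near threshold,35R,takeoff\n"
)
df = pd.read_csv(raw)

print(df.columns.tolist())   # what fields does the source hold?
print(df.isna().sum())       # what's missing?
subset = df[["state", "animal", "damage"]]   # keep only what you need
```

The same three steps scale up: a federal file with 120 fields shrinks to the three columns a local story actually turns on.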

One collaboration that INN did was using data gathered by NASA for the FAA, and it was essentially near misses — incidents at airports like hitting deer on the runway, and all these little things that can happen but aren’t necessarily reported. They all get compiled in this database, and pilots write these narratives about it, so that field is very interesting to them. There were four or five INN members who collaborated on that, and they all came away with different stories because they all found something else that was interesting for them locally.

O’Donovan: This position you’ll hold is about bringing the work of INN and IRE together. What’s that going to look like? We talk all the time about how journalism is moving in a more networked direction — where do you see this fitting into that?
Malan: IRE and INN have always had a very close relationship, and I think that this position just kind of formalizes that. I will be helping INN members plug into the resources of IRE, especially the data library, I’ll be working closely with Liz Lucas, the database director at IRE, and I’m actually going to be living near IRE so I can work more closely with them. Some of that data there is very underutilized and it’s really interesting and maybe hasn’t been used in any projects, especially on a national level.

So we can take that data and I can kind of help analyze it, help slice it for the various regions we might be looking at, and help the INN members use that data for their stories. I’ll basically be acting as almost a translator to get this data from the IRE and help the INN members use it.

Going the other way, with INN members, they might come up with some project idea where data isn’t available from the database library, or it might be something where we have to gather data from every state individually, so we might compile that and whatever we end up with will be sent back to the IRE library and made available to other IRE members. So it’s a two-way relationship.

O’Donovan: So in terms of managing this collaboration, what are the challenges? Are you thinking of building an interface for sharing data or documents?
Malan: We’re going to be setting up a kind of committee of data people with INN to have probably monthly calls and just discuss what they’re working on and brainstorm possible ideas. I want it to be a very organic, ground-up process — I don’t want to be dictating what the projects should be. I want the members to come up with their own ideas. So we’ll be brainstorming and coming up with things, and we’ll be managing the group through Basecamp, which a lot of the members already use to communicate through INN.

We’ll be communicating through this committee and coming up with ideas, and I’ll be working with other members and reaching out to them. If we come up with an idea that deals with health care, for example, I might reach out to some of the members that are especially focused on health care and try to bring them in on it.

O’Donovan: Do you foresee collaborations between members, like shared reporting and that kind of thing?
Malan: Yeah, depending on the project. Some of it might be shared reporting; some of it might be someone doing a main interview. If we’re doing a crime story dealing with the FBI’s Uniform Crime Report, maybe instead of having one reporter from every property do the same interview, we nominate one person to do the interview with the FBI that everyone can use in their own story, which they localize with their own data. So, yeah, depending on the project, we’ll have to kind of see how the reporting would shake out.
O’Donovan: Do you have any specific goals or types of stories you want to tell, or even just specific data sets you’re eager to get a look at?
Malan: I think there are several interesting sets in the IRE data library that we might go after at first. There are really interesting health sets, for example, from the FDA — one of them is a database of adverse effects from drugs, complaints that people make that drugs have had adverse effects. So yeah, some of those can be ready to go right off the bat, ready to parse and analyze.

Some other data sets we might be looking at will be a little harder to get and will take some FOIs and some time. There are several major areas that our members focus on and that we’ll be looking at projects for. Environment, for example: fracking is a large issue, and so is how the environment affects public health. Health care, especially with the Affordable Care Act coming into effect next year, is going to be a large one. Politics and government, and how money influences politicians, is a huge area as we come up on the 2014 midterms and the 2016 elections. And education is another issue, with achievement gaps, graduation rates, charter schools — those are all large issues that our members follow. Finding those commonalities, dealing with the data sets, and digging into them is going to be my first priority.

O’Donovan: The health question is interesting. Knight announced its next round of News Challenge grants is going to be all around health.
Malan: I’m excited about that. We have several members that are really specifically focused on health, so I feel like we might be able to get something good with that.
O’Donovan: Health care stuff or more public health stuff?
Malan: It’s a mix, but a lot of stuff is geared toward the Affordable Care Act now.
O’Donovan: Gathering these data sets must often involve a lot of coordination across states and jurisdictions.
Malan: Yeah, absolutely. One thing I am a little nervous about is the Supreme Court’s recent ruling in the Virginia case, where states can now require you to live in the state to put in an FOI. That might complicate things a little bit. I know there are several groups working on lists of people who will put in an FOI for you in various states. But that can slow down the process, put a little kink in it, and add to the timeline. I’m concerned, of course, that now that it’s been ruled constitutional, every state might make that the law. It could be a huge thing. A management nightmare.
O’Donovan: What kind of advice do you normally give to reporters who are struggling to get information that they know they should be allowed to have?
Malan: That’s something we encountered a lot here, especially getting data in the proper format, too. Laws on that can vary from state to state. A lot of governments will give you paper or PDF format, instead of the Excel or text file that you asked for. It’s always a struggle.

The advice is to know the law as best you can, know what exceptions are allowed under your state law, be able to quote — you don’t have to have the law memorized, but be able to quote specific sections that you know are on your side. Be prepared with your requests, and be prepared to fight for it. And in a lot of cases, it is a fight.

O’Donovan: That’s an interesting intersection of technical and legal skill. That’s a lot of education dollars right there.
Malan: Yeah, no kidding.
O’Donovan: When you do things like attend the NICAR conference and assess the scene more broadly, where do you see the most urgent gaps in the data journalism field? Is it that we need more data analysts? More computer scientists? More reporters with the fluency in communicating with government? More legal aid? If you could allocate more resources, where would you put them right now?
Malan: There’s always going to be a need for more very highly skilled data journalists who can gather these national sets, analyze them, clean them, get them into a digestible format, visualize them online, and inform readers. I would like to see more general beat reporters interested in data and at least getting skills in Excel and even Access — because the beat reporters are the ones on the ground, using their sources, finding these data sets, or missing them if they’re not aware of what data is available. I would really like to see a bigger push to educate most general beat reporters to a certain level.
O’Donovan: Where do you see the data journalism movement headed over the next couple years? What would your next big hope for the field be?
Malan: Well, of course I hope for it to go kind of mainstream, and that all reporters will have some sort of data skills. It’s of course harder with fewer and fewer resources, and reporters are learning how to tweet and Instagram, and there are demands on their time that have never been there before.

But I would hope it would become just a normal part of journalism, that there would be no more “data journalism” — that it just becomes part of what we do, because it’s invaluable to reporting and to really helping ferret out the truth and to give context to stories.

February 11 2012

15:49

The best tools for converting printouts to spreadsheets

Spreadsheets aren't very useful when they're locked inside PDFs. But after some rigorous testing over at the Review Lab, we've got a few suggestions that can help you wrangle your data and free you up for more reporting. Read More »

January 05 2012

15:20

Feed Your PANDA With New APIs and Excel Import


Last time I wrote it was to solicit ideas for PANDA's API. We've since implemented those ideas, and we've just released our third alpha, which includes a complete writable API, demo scripts showing how to import from three different data sources, and the ability to import data from Excel spreadsheets.

The PANDA project aims to make basic data analysis quick and easy for news organizations, and make data sharing simple.

Try Alpha 3 now.

Hello, Write API

Our new write API is designed to be as simple and consistent as possible. We've gone to great lengths to illustrate how it works in our new API documentation. We've also provided three example scripts showing how to populate PANDA with data from three different sources.

Using these scripts as a starting point, any programmer with a little bit of Python knowledge should be able to easily import data from an SQL database, local file or any other arcane data source they can conjure up in the newsroom.
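Since the example scripts themselves aren't reproduced here, the following is a hypothetical sketch of what an import script for a writable JSON API might look like, using only the Python standard library. The endpoint path, payload shape, and auth header are assumptions made for illustration, not PANDA's documented schema; consult the API documentation for the real field names.

```python
# Build (but don't send) a POST request that adds rows to a dataset
# over a writable JSON API. The route and payload below are guesses
# for illustration only.
import json
import urllib.request

API_KEY = "example-key"   # hypothetical credential

def build_import_request(base_url, dataset, rows):
    """Construct a JSON POST request appending rows to a dataset."""
    body = json.dumps({"objects": rows}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/1.0/dataset/{dataset}/data/",  # assumed route
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Key {API_KEY}",
        },
        method="POST",
    )

req = build_import_request(
    "http://localhost:8000", "city-salaries",
    [{"data": ["Jane Doe", "Analyst", "54000"]}],
)
# urllib.request.urlopen(req) would actually send it
```

The point of the write API is exactly this shape: anything that can produce rows, whether an SQL query, a CSV on disk, or a scraper, can be looped into such a request.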

Excel support

Also included in this release is support for importing from Excel .xls and .xlsx files. It's difficult to promise that this support will work for every Excel file anyone can find to throw at it, but we've had good results with files produced from both Windows and Mac versions of Excel, as well as from OpenOffice on Mac and Linux.

Our Alpha 4 release will be coming at the end of January, followed quickly by Beta 1 around the time of NICAR. To see what we have planned, check out our Release schedule.

December 08 2011

09:22

4 ways to publish your data online

I’ve written a post on the Help Me Investigate blog on a number of different ways to publish data online, from converting Excel spreadsheets into HTML tables, to using Google Docs, or using data-sharing platforms like BuzzData. You may find it useful.
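As one illustration of the first route mentioned, a spreadsheet export can be turned into an HTML table with Python's standard library alone. This is a generic sketch, not the specific method from the linked post:

```python
# Render CSV text as an HTML table, escaping cell contents so that
# stray angle brackets in the data can't break the page.
import csv
import html
import io

def csv_to_html_table(csv_text):
    """Convert CSV text (first row = header) into an HTML <table>."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    out = ["<table>"]
    out.append("  <tr>" + "".join(f"<th>{html.escape(h)}</th>" for h in header) + "</tr>")
    for row in body:
        out.append("  <tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>")
    out.append("</table>")
    return "\n".join(out)

print(csv_to_html_table("name,score\nAyers St,0\nOcean Dr,88"))
```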

January 28 2011

16:44

Ruby screen scraping tutorials

Mark Chapman has been busy translating our Python web scraping tutorials into Ruby.

They now cover three tutorials on how to write basic screen scrapers, plus extra ones on using .ASPX pages, Excel files and CSV files.

We’ve also installed some extra Ruby modules – spreadsheet and FasterCSV – to make them possible.

These Ruby scraping tutorials are made using ScraperWiki, so you can of course do them from your browser without installing anything.

Thanks Mark!


December 01 2010

07:31

Data journalism training – some reflections

[Image: OpenHeatMap showing the percentage increase in fraud crimes in London since 2006/7]

I recently spent two days teaching the basics of data journalism to trainee journalists on a broadsheet newspaper. It’s a pretty intensive course that follows a path I’ve explored here previously – from finding data and interrogating it to visualising it and mashing it up – and I wanted to record the results.

My approach was both practical and conceptual. Conceptually, the trainees need to be able to understand and communicate with people from other disciplines, such as designers putting together an infographic, or programmers, statisticians and researchers.

They need to know what semantic data is, what APIs are, the difference between a database and open data, and what is possible with all of the above.

They need to know what design techniques make a visualisation clear, and the statistical quirks that need to be considered – or looked for.

But they also need to be able to do it.

The importance of editorial drive

The first thing I ask them to do (after a broad introduction) is come up with a journalistic hypothesis they want to test (a process taken from Mark E Hunter’s excellent ebook Story Based Inquiry). My experience is that you learn more about data journalism by tackling a specific problem or question – not just the trainees but, in trying to tackle other people’s problems, me as well.

So one trainee wants to look at the differences between supporters of David and Ed Miliband in that week’s Labour leadership contest. Another wants to look at authorization of armed operations by a police force (the result of an FOI request following up on the Raoul Moat story). A third wants to look at whether ethnic minorities are being laid off more quickly, while others investigate identity fraud, ASBOs and suicides.

Taking those as a starting point, then, I introduce them to some basic computer assisted reporting skills and sources of data. They quickly assemble some relevant datasets – and the context they need to make sense of them.

For the first time I have to use OpenOffice’s spreadsheet software, which turns out to be not too bad. The DataPilot tool is a worthy free alternative to Excel’s pivot tables, allowing journalists to quickly aggregate and interrogate a large dataset.

Formulae like CONCATENATE and ISNA turn out to be particularly useful in cleaning up data or making it compatible with similar datasets.

The ‘Text to columns’ function comes in handy in breaking up full names into title, forename and surname (or addresses into constituent parts), while find and replace helps in removing redundant information.
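Those spreadsheet moves have direct equivalents in pandas, sketched below with invented column names: concatenation, an ISNA-style check for failed lookups, a 'text to columns' split, and find-and-replace cleanup.

```python
# Pandas equivalents of the spreadsheet cleaning steps described above.
import pandas as pd

df = pd.DataFrame({"full_name": ["Mr John Smith", "Ms Jane Doe"],
                   "ward": ["Ladywood ", "Ladywood "]})

# 'Text to columns': split full names into title / forename / surname
df[["title", "forename", "surname"]] = df["full_name"].str.split(" ", expand=True)

# CONCATENATE: build a key compatible with another dataset
df["key"] = df["surname"] + ", " + df["forename"]

# Find and replace: strip redundant trailing whitespace
df["ward"] = df["ward"].str.strip()

# ISNA: flag rows where a lookup against another table failed
lookup = pd.DataFrame({"key": ["Smith, John"], "salary": [30000]})
merged = df.merge(lookup, on="key", how="left")
merged["missing_salary"] = merged["salary"].isna()
```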

It’s not long before the journalists raise statistical issues – which is reassuring. The trainee looking into ethnic minority unemployment, for example, finds some large increases – but the numbers in those ethnicities are so small as to undermine the significance.

Scraping the surface of statistics

Still, I put them through an afternoon of statistical training. Notably, not one of them has studied a maths or science-related degree. History, English and Law dominate – and their educational history is pretty uniform. At a time when newsrooms need diversity to adapt to change, this is a little worrying.

But they can tell a mean from a mode, and deal well with percentages, which means we can move on quickly to standard deviations, distribution, statistical significance and regression analysis.
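Those measures can all be run through Python's statistics module; the borough crime counts below are made up, chosen to show how a single outlier separates the mean from the median.

```python
# Descriptive statistics on a small, invented set of borough crime counts.
import statistics as st

counts = [12, 15, 15, 18, 90]    # note the outlier

mean = st.mean(counts)      # dragged up by the outlier
median = st.median(counts)  # more robust to the outlier
mode = st.mode(counts)      # most common value
stdev = st.stdev(counts)    # sample standard deviation

print(mean, median, mode, round(stdev, 1))
```

The gap between mean and median is itself a quick test worth teaching: when the two diverge sharply, the distribution is skewed and an "average" claim in a story needs a second look.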

Even so, I feel like we’ve barely scraped the surface – and that there should be ways to make this more relevant in actively finding stories. (Indeed, a fortnight later I come across a great example of using Benford’s law to highlight problems with police reporting of drug-related murders.)

One thing I do is ask one trainee to toss a coin 30 times and the others to place bets on the largest number of heads to fall in a row. Most plump for around 4 – but the longest run is 8 heads in a row.

The point I’m making is regarding small sample sizes and clusters. (With eerie coincidence, one of them has a map of Bridgend on her screen, which made the news after a cluster of suicides).
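The coin-toss exercise is easy to reproduce in a few lines of Python: a helper that measures the longest unbroken run of heads, plus a simulated sequence of 30 fair tosses.

```python
# Simulate the classroom exercise: 30 fair coin tosses, then measure
# the longest unbroken run of heads. Runs of 5 or more are common,
# which is the small-sample/cluster point being made above.
import random

def longest_run(tosses, face="H"):
    """Length of the longest unbroken run of `face` in a sequence."""
    best = current = 0
    for t in tosses:
        current = current + 1 if t == face else 0
        best = max(best, current)
    return best

random.seed(1)  # fixed seed so the demo is repeatable
tosses = [random.choice("HT") for _ in range(30)]
print("".join(tosses), "->", longest_run(tosses))
```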

That’s about as engaging as this section got – so if you’ve any ideas for bringing statistical subjects to life and making them relevant to journalists, particularly as a practical tool for spotting stories, I’m all ears.

Visualisation – bringing data to life, quickly

Day 2 is rather more satisfying, as – after an overview of various chart types and their strengths and limitations – the trainees turn their hands to visualization tools – Many Eyes, Wordle, Tableau Public, Open Heat Map, and Mapalist.

Suddenly the data from the previous day comes to life. Fraud crime in London boroughs is shown on a handy heat map. A pie chart, and then bar chart, shows the breakdown of Labour leadership voters; and line graphs bring out new possible leads in suicide data (female suicide rates barely change in 5 years, while male rates fluctuate more).

It turns out that Mapalist – normally used for plotting points on Google Maps from a Google spreadsheet – now also does heat maps based on the density of occurrences. ManyEyes has also added mapping visualizations to its toolkit.

Looking through my Delicious bookmarks I rediscover a postcodes API with a hackable URL to generate CSV or XML files with the lat/long, ward and other data from any postcode (also useful on this front is Matthew Somerville’s project MaPit).

Still a print culture

Notably, the trainees bring up the dominance of print culture. “I can see how this works well online,” says one, “but our newsroom will want to see a print story.”

One of the effects of convergence on news production is that a tool traditionally left to designers, after the journalist had finished their role in the production line, is now used by the journalist as part of their newsgathering role: visualizing data to see the story within it, and possibly publishing that online to involve users in that process too.

A print news story – in this instance – may result from the visualization process, rather than the other way around.

More broadly, it’s another symptom of how news production is moving from a linear process involving division of labour to a flatter, more overlapping organization of processes and roles – which involves people outside of the organization as well as those within.

Mashups

The final session covers mashups. This is an opportunity to explore the broader possibilities of the technology, how APIs and semantic data fit in, and some basic tools and tutorials.

Clearly, a well-produced mashup requires more than half a day and a broader skillset than exists in journalists alone. But by using tools like Mapalist the trainees have actually already created a mashup. Again, like visualization, there is a sliding scale between quick and rough approaches to find stories and communicate them – and larger efforts that require a bigger investment of time and skill.

As the trainees are already engrossed in their own projects, I don’t distract them too much from that course.

You can see what some of the trainees produced at the links below:

  • Matt Holehouse: Rate of deaths in industrial accidents in the EU (per 100k) (Many Eyes)
  • Raf Sanchez
  • Rosie Ensor: Places with the highest rates for ASBOs
  • Sarah Rainey

July 22 2010

07:00

The New Online Journalists #6: Conrad Quilty-Harper

As part of an ongoing series on recent graduates who have gone into online journalism, The Telegraph’s new Data Mapping Reporter Conrad Quilty-Harper talks about what got him the job, what it involves, and what skills he feels online journalists need today.

I got my job thanks to Twitter. Chris Brauer, head of online journalism at City University, was impressed by my tweets and my experience, and referred me to the Telegraph when they said they were looking for people to help build the UK Political database.

I spent six weeks working on the database, at first manually creating candidate entries, and later mocking up design elements and cleaning the data using Freebase Gridworks, Excel and Dabble DB. At the time the Telegraph was advertising for a “data juggler” role, and I interviewed for the job and was offered it.

My job involves three elements:

  • Working with reporters to add visualisations to stories based on numbers,
  • Covering the “open data” beat as a reporter, and
  • Creating original stories with visualisations based on data from FOI and other sources.

For my job I need to know how to select and scrape good data, clean it, pick out the stories and visualise it. (P.S. you may have noticed that I’m a “data is singular” kinda guy).

The “data” niche is greatly exciting to me. Feeding into this is the #opendata movement, the new Government’s plan to release more data and the understanding that data driven journalism as practised in the United States has to come here. There’s clearly a hunger for more data driven stories, a point well illustrated by a recent letter to the FT.

The mindset you need as an online journalist today is to become familiar with, and proficient at using, tools that make you better at your job. You have to be an early adopter. Get on the latest online service, get the latest gadget, and get it before your colleagues and competitors. Find the value in those tools, integrate them into your work, and go and find another tool.

When I blogged for Engadget our team had built an automated picture watermarker for liveblogging. I played with it for a few hours and made a new script that downloaded the pictures from a card, applied the watermark, uploaded the pictures and ejected the SD card. Engadget continues to try out new tools that enable them to do their job faster and better. There are endless innovations being churned out every day from the world of technology. Make time to play with them and make them work for you.
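As a rough illustration of the pipeline shape described (not Engadget's actual script), here is a standard-library sketch. The watermark step is a stub, since a real version would use an image library such as Pillow, and all paths and names are hypothetical:

```python
# Pipeline sketch: pull JPEGs off a card directory, "watermark" them,
# and stage them in an output directory ready for upload.
import shutil
from pathlib import Path

def watermark(src: Path, dst: Path) -> None:
    # Stub: a real implementation would composite a logo onto the image.
    shutil.copy(src, dst)

def process_card(card_dir: Path, out_dir: Path) -> list:
    """Watermark every JPEG on the card into out_dir; return staged files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sorted(card_dir.glob("*.jpg")):
        dst = out_dir / f"wm_{src.name}"
        watermark(src, dst)
        staged.append(dst)
    return staged
```

The upload and card-eject steps would bolt onto the end of `process_card`; the point is simply that a repetitive liveblogging chore collapses into one scripted pass.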

If you know of anyone else who should be featured in this series, let me know in the comments.
