Tumblelog by Soup.io

August 26 2011

17:32

PANDA Aims to Make Data Analysis Easier for Journalists (And We'll Be at ONA!)

What's got rows and columns and sucks at data? Excel. Though to be fair, we misuse it. Excel was built for spreadsheets, but it's become most folks' go-to kit for poking at data. It's installed on your computer. It opens CSV files. It's what you know.

Of course, databases are great at data, but they're hard. Microsoft Access is limiting, and real databases like MySQL and PostgreSQL aren't the easiest things for a non-hacker to get up and running, let alone query. Learning a little SQL will make you a better reporter, but digging through many datasets from different sources can take more than "a little SQL."

Plus, it doesn't matter if you're an Excel maniac or a database jockey -- either way, the data is just sitting on your PC, invisible to your peers. Hidden data is sad data.

We live in a data-soaked sci-fi future. It's awesome. And in this future, every journalist must be a data journalist. But to get there, we need a better kit.

PANDA will help

We're trying to do two things with PANDA: make basic data analysis quick and easy for news organizations, and make data sharing simple. I'll explain by example:

Let's say you've got an Excel spreadsheet of city employees with columns for first name, last name, department, position and salary. You visit your PANDA, upload the spreadsheet, give it a name, and tell PANDA it's a list of people. Once the data's in, you'll be able to search and sort and filter -- whatever you need.

Each news organization will have their own PANDA, so your data stays private while you work. And every time you add a new spreadsheet, you'll be building your newsroom's data library. So next time one of your peers is scrubbing a name, they'll be able to simultaneously search this and all the other lists of names your newsroom has collected.
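To make the idea concrete, here is a minimal Python sketch of that workflow. It is not PANDA's code, only an illustration: the sample rows, the column names and the helper names (`load`, `search_names`) are all made up, and a real tool would need fuzzier name matching than this.

```python
import csv
import io

# Hypothetical sample data standing in for two uploaded spreadsheets.
CITY_EMPLOYEES = """first_name,last_name,department,position,salary
Ana,Lopez,Parks,Director,98000
Ben,Smith,Police,Officer,61000
Cara,Smith,Finance,Analyst,72000
"""

CAMPAIGN_DONORS = """first_name,last_name,amount
Ben,Smith,500
Dee,Nguyen,250
"""


def load(name, text):
    """Parse CSV text into a list of row dicts, each tagged with its dataset name."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["_dataset"] = name
    return rows


def search_names(library, last_name):
    """Return every row, in any dataset, whose last name matches (case-insensitive)."""
    wanted = last_name.strip().lower()
    return [
        row
        for rows in library.values()
        for row in rows
        if row.get("last_name", "").lower() == wanted
    ]


# The "data library": every uploaded list, keyed by the name it was given.
library = {
    "city_employees": load("city_employees", CITY_EMPLOYEES),
    "campaign_donors": load("campaign_donors", CAMPAIGN_DONORS),
}

# Filter and sort within one dataset: employees earning over 70,000, highest first.
well_paid = sorted(
    (r for r in library["city_employees"] if int(r["salary"]) > 70000),
    key=lambda r: int(r["salary"]),
    reverse=True,
)

# Search one name across every dataset in the library at once.
smith_hits = search_names(library, "Smith")
```

Here `well_paid` holds the filtered, sorted salary rows, and `smith_hits` collects matches from both lists, each tagged with the dataset it came from.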

That's just the baseline. We've got many more ideas, but we'd like to discuss them with you! So...

Hello, ONA!

The PANDA Gang is going to be at ONA 2011 in Boston in a few weeks, and we need to hear from you! We'll be roaming the halls, camping in the lobbies and crawling the bars -- furiously taking notes about your newsroom data needs.

Find us!

We'll be in the red PANDA T-shirts.

17:32

PANDA Aims to Make Data Analysis Easier for Journalists

Excel sucks. Though to be fair, we misuse it. Excel was built for spreadsheets, but it's become most folks' go-to kit for poking at data. It's installed on your computer. It opens CSV files. It's what you know.

Of course, databases are great at data, but they're hard. Microsoft Access is limiting, and real databases like MySQL and PostgreSQL aren't the easiest things for a non-hacker to get up and running, let alone query. Learning a little SQL will make you a better reporter, but digging through many datasets from different sources can take more than "a little SQL."

Plus, it doesn't matter if you're an Excel maniac or a database jockey -- either way, the data is just sitting on your PC, invisible to your peers. Hidden data is sad data.

We live in a data-soaked future. It's awesome. And in this future, every journalist must be a data journalist. But to get there, we need a better kit.

PANDA will help

With PANDA (which is a recursive acronym for "A News Data Application"), we're trying to do two things: make basic data analysis quick and easy for news organizations, and make data sharing simple. I'll explain by example:

Let's say you've got an Excel spreadsheet of city employees with columns for first name, last name, department, position and salary. You visit your PANDA, upload the spreadsheet, give it a name, and tell PANDA it's a list of people. Once the data's in, you'll be able to search and sort and filter -- whatever you need.

Each news organization will have their own PANDA, so your data stays private while you work. And every time you add a new spreadsheet, you'll be building your newsroom's data library. So next time one of your peers is scrubbing a name, they'll be able to simultaneously search this and all the other lists of names your newsroom has collected.

That's just the baseline. We've got many more ideas, but we'd like to discuss them with you! So...

(Video: PANDA, from Knight Foundation on Vimeo)

Hello, ONA!

The PANDA Gang is going to be at ONA 2011 in Boston (http://ona11.journalists.org/) in a few weeks, and we need to hear from you! We'll be roaming the halls, camping in the lobbies and crawling the bars -- furiously taking notes about your newsroom data needs.

Find us!

We'll be in the red PANDA T-shirts.

December 06 2010

02:35

UN Global Pulse Camp 1.0


(Photo credit: Christopher Fabian of UNICEF & Global Pulse)

Just got back from the UN "Pulse Camp 1.0".

Global Pulse is a new and quite ambitious UN initiative "to improve evidence-based decision-making and close the information gap between the onset of a global crisis and the availability of actionable information to protect the vulnerable" (Full overview at http://www.unglobalpulse.org/about).


November 16 2010

15:27

Extractiv: crawl webpages and make semantic connections

(Image: Extractiv screenshot)

Here’s another data analysis tool which is worth keeping an eye on. Extractiv “lets you transform unstructured web content into highly-structured semantic data.” Eyes glazing over? Okay, over to ReadWriteWeb:

“To test Extractiv, I gave the company a collection of more than 500 web domains for the top geolocation blogs online and asked its technology to sort for all appearances of the word ‘ESRI.’ (The name of the leading vendor in the geolocation market.)

“The resulting output included structured cells describing some person, place or thing, some type of relationship it had with the word ESRI and the URL where the words appeared together. It was thus sortable and ready for my analysis.

“The task was partially completed before being rate limited due to my submitting so many links from the same domain. More than 125,000 pages were analyzed, 762 documents were found that included my keyword ESRI and about 400 relations were discovered (including duplicates). What kinds of patterns of relations will I discover by sorting all this data in a spreadsheet or otherwise? I can’t wait to find out.”

What that means in even plainer language is that Extractiv will crawl thousands of webpages to identify relationships and attributes for a particular subject.

This has obvious applications for investigative journalists: give the software a name (of a person or company, for example) and a set of base domains (such as news websites, specialist publications and blogs, industry sites, etc.) and set it going. At the end you’ll have a broad picture of what other organisations and people have been connected with that person or company.

It won’t answer your questions, but it will suggest some avenues of enquiry, and potential sources of information. And all within an hour.
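Once you have rows like the ones ReadWriteWeb describes (an entity, its relationship to your keyword, and the URL where both appear), the first sorting pass is simple. The Python sketch below is only an illustration under assumptions: the `Relation` record, its field names and the sample rows are hypothetical and do not reflect Extractiv's actual export format.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Relation(NamedTuple):
    entity: str    # person, place or thing found near the keyword
    relation: str  # how the entity relates to the keyword
    url: str       # page where the two appeared together


# Hypothetical rows; a real export would supply these.
ROWS = [
    Relation("Acme Mapping", "partner of", "http://example.com/a"),
    Relation("Jane Doe", "employee of", "http://example.com/b"),
    Relation("Acme Mapping", "partner of", "http://example.com/a"),  # duplicate
    Relation("Acme Mapping", "customer of", "http://example.org/c"),
]


def summarize(rows):
    """Drop exact duplicates, then rank entities by how many relations mention them."""
    unique = set(rows)
    counts = Counter(r.entity for r in unique)
    sources = defaultdict(set)
    for r in unique:
        sources[r.entity].add(r.url)
    # Most-mentioned entities first, plus the pages to go and read for each.
    return counts.most_common(), {e: sorted(u) for e, u in sources.items()}


ranked, sources = summarize(ROWS)
```

The ranked list suggests who to look at first, and the per-entity URLs are the potential sources to follow up.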

Time and cost

ReadWriteWeb reports that the process above took around an hour “and would have cost me less than $1, after a $99 monthly subscription fee. The next level of subscription would have been performed faster and with more simultaneous processes running at a base rate of $250 per month.”

As they say, the tool represents “commodity level, DIY analysis of bulk data produced by user generated or other content, sortable for pattern detection and soon, Extractiv says, sentiment analysis.”

Which is nice.

September 17 2010

17:00

Yeah, but what does it mean for journalism? A visual rhetoric guide

It’s become something of a Twitter joke. A new gadget appears, or a dramatic development takes place on the world stage, and the cry goes up: But what does it mean for journalism? I’m guilty of it myself. And a lot of the time, it’s a meaningful question to ask; we are in the future-of-journalism business, after all. What would we spend our day doing if not inquiring about what it -- all of it -- means for journalism?

That said, I wanted to try a little experiment. And so using Wordle, some time-delimited Google searches, and quick-and-dirty cutting and pasting, I decided to take a look at how the conversation about “what it means for journalism” might have changed, or not changed, since 2008.

The results are below. But first, a little bit about what I did. I plugged a few searches into Google, namely “what” AND “future of journalism.” I time-delimited the search, looking only for results from 2008, then only from 2009, then only 2010. I scraped the text from all my results and dropped it into OpenOffice. I then deleted all mentions of “journalism,” “media,” and “news,” figuring they’d be the most common and least interesting words, and wanting Wordle to weigh the remaining words without them. And here’s what I got.

2008 [full-size version here]: Words that jump out: “public,” “interest,” “material,” “interactivity,” “information.” The combination of “public” and “interest” is the most interesting to me here. It was an election year, after all; perhaps there was a bit more discussion of that amorphous body we call “the public,” and how it relates to changes in journalism. There’s a little about journalists, though not as much as we’ll see in 2009.

2009 [full size]: “Public” has disappeared, as has “information.” They’ve been replaced by “people,” “journalist,” “online,” “world,” “web,” “paper,” and “think.” There’s some question about medium at play here; this was the year of “what comes after newspapers die,” after all. I have to admit I was a little surprised there weren’t more words having to do with “morbidity” here, stuff like “death,” “dying,” “disappearing,” or “crisis.” But I think the focus on “journalist” here reflects the industry crisis in its own way -- as in, what about all those people losing their jobs?

2010 [full size]: Now here’s the “what does it mean for journalism” conversation I remember — iPad and WikiLeaks. Will either of them save journalism? We’ll see what the rest of the year brings, but for now, it looks to me like a fairly abstract conversation about journalism and the public has been replaced by a debate over particular types of mediums (paper and web), which has itself been supplanted by a focus on particular organizations and devices.

Now, all of this is incredibly crude measurement, and there’s a ton wrong with it. (Let’s just say my methodology wouldn’t pass peer review.) Time-limited Google searching is imperfect, and of course I’ve totally left out stuff like Twitter and Facebook. But I think there’s a germ of potential here for mapping particular forms of dialog around particular key phrases. I’d love to work with any data-happy, data-mining Twitter scholars or smart Google engineers to pursue this line of work further. Drop me a line if you’re interested.
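For anyone who wants to repeat the counting step without the cutting and pasting, here is a rough Python sketch of it. It assumes you already have each year's scraped text as a plain string; the function name `word_counts` and the sample sentence are invented for illustration, and the tokenizing is much simpler than whatever Wordle does internally.

```python
import re
from collections import Counter

# The words deleted by hand before the text went into Wordle.
EXCLUDED = {"journalism", "media", "news"}


def word_counts(text, excluded=EXCLUDED):
    """Count lowercase words in text, skipping the excluded ones."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in excluded)


# Hypothetical stand-in for one year's scraped search results.
SAMPLE_2008 = "News in the public interest: journalism serves the public."

counts = word_counts(SAMPLE_2008)
top_words = counts.most_common(5)
```

Running it once per year and comparing the top words would give the same kind of crude year-over-year picture as the word clouds above.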
