
April 03 2013

22:05

Intercontinental collaboration: How 86 journalists in 46 countries can work on a single investigation


On Thursday morning, the International Consortium of Investigative Journalists will begin releasing detailed reports on the workings of offshore tax havens. A little over a year ago, 260 gigabytes of data were leaked to ICIJ executive director Gerard Ryle; they contained information about the finances of individuals in over 170 countries.

Ryle was a media executive in Australia at the time he received the data, says deputy director Marina Walker Guevara. “He came with the story under his arm.” Walker Guevara says the ICIJ was surprised Ryle wanted a job in their small office in Washington, but soon realized that it was only through their international scope and experience with cross-border reporting that the Offshore Project could be executed. The result is a major international collaboration that has to be one of the largest in journalism history.

“It was a huge step. As reporters and journalists, the first thing you think is not ‘Let me see how I can share this with the world.’ You think: ‘How can I scoop everyone else?’ The thinking here was different.” Walker Guevara says the ICIJ seriously considered keeping the team to a core five or six members, but ultimately decided to go with the “most risky” approach when they realized the enormous scope of the project: Journalists from around the world were given lists of names to identify and, if they found interesting connections, were given access to Interdata, the secure, searchable, online database built by the ICIJ.

Just as the rise of information technology has allowed new competition for the attention of audiences, it’s also enabled traditional news organizations to partner in what can sometimes seem like dizzyingly complex relationships. The ICIJ says this is the largest collaborative journalism project it has ever organized; the most comparable previous effort involved a team of 25 cross-border journalists.

In the end, the Offshore Project brings together 86 journalists from 46 countries into an ongoing reporting collaboration. German and Canadian news outlets (Süddeutsche Zeitung, Norddeutscher Rundfunk, and the CBC) will be among the first to report their findings this week, with The Washington Post beginning their report on April 7, just in time for Tax Day. Reporters from more than 30 other publications also contributed, including Le Monde, the BBC and The Guardian. (The ICIJ actually published some preliminary findings in conjunction with the U.K. publications as a teaser back in November.)

“The natural step wasn’t to sit in Washington and try to figure out who is this person and why this matters in Azerbaijan or Romania,” Walker Guevara said, “but to go to our members there — or a good reporter if we didn’t have a member — give them the names, invite them into the project, see if the name mattered, and involve them in the process.”

Defining names that matter was a learning experience for the leaders of the Offshore Project. Writes Duncan Campbell, an ICIJ founder and current data journalism manager:

ICIJ’s fundamental lesson from the Offshore Project data has been patience and perseverance. Many members started by feeding in lists of names of politicians, tycoons, suspected or convicted fraudsters and the like, hoping that bank accounts and scam plots would just pop out. It was a frustrating road to follow. The data was not like that.

The data was, in fact, very messy and unstructured. Between a bevy of spreadsheets, emails, PDFs without OCR, and pictures of passports, the ICIJ still hasn’t finished mining all the data from the raw files. Campbell details the complicated process of cleaning the data and sorting it into a searchable database. Using NUIX software licenses granted to the ICIJ for free, it took a British programmer two weeks to build a secure database that would allow all of the far-flung journalists not only to safely search and download the documents, but also to communicate with one another through an online forum.
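The ICIJ’s own pipeline ran on commercial NUIX and dtSearch licenses, but the underlying pattern Campbell describes (extract text from mixed files, then load it into a full-text index that reporters can query) can be sketched with standard tools. The fragment below is only an illustration of that general idea using Python and SQLite’s FTS5 extension, not a reproduction of the ICIJ’s system; the folder and table names are invented, and it assumes text has already been extracted from the raw files.

    import sqlite3
    from pathlib import Path

    # Build a small full-text index over already-extracted plain-text files.
    conn = sqlite3.connect("offshore_index.db")
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, body)")

    for txt_file in Path("extracted_text").rglob("*.txt"):
        conn.execute(
            "INSERT INTO docs (path, body) VALUES (?, ?)",
            (str(txt_file), txt_file.read_text(errors="ignore")),
        )
    conn.commit()

    # A reporter could then search for a name and get back matching documents.
    for (path,) in conn.execute("SELECT path FROM docs WHERE docs MATCH ?", ("Mugabe",)):
        print(path)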

“Once we went to these places and gathered these reporters, we needed to give them the tools to function as a team,” Walker Guevara said.

Even so, some were so overwhelmed by the amount of information available, and so unaccustomed to hunting for stories in a database, that the ICIJ ultimately hired a research manager to do searches for reporters and send them the documents via email. “We do have places like Pakistan where the reporters didn’t have much Internet access, so it was a hassle for him,” says Walker Guevara, adding that there were also security concerns. “We asked him to take precautions and all that, and he was nervous, so I understand.”

They also had to explain to each of the reporting teams that they weren’t simply on the lookout for politicians hiding money and people who had broken the law. “First, you try the name of your president. Then, your biggest politician, former presidents — everybody has to go through that,” Walker Guevara says. While a few headline names did eventually appear — Imelda Marcos, Robert Mugabe — she says some of the most surprising stories came from observing broader trends.

“Alongside many usual suspects, there were hundreds of thousands of regular people — doctors and dentists from the U.S.,” she says. “It made us understand a system that is a lot more used than what you think. It’s not just people breaking the law or politicians hiding money, but a lot of people who may feel insecure in their own countries. Or hiding money from their spouses. We’re actually writing some stories about divorce.”

In the 2 million records they accessed, ICIJ reporters began to get an understanding of the methods account holders use to avoid association with these accounts. Many use “nominee directors,” a process which Campbell says is similar to registering a car in the name of a stranger. But in their post about the Offshore Project, the ICIJ team acknowledges that, to a great extent, most of the money being channeled through offshore accounts and shell companies is actually not being used for illegal transactions. Defenders of the offshore banks say they “allow companies and individuals to diversify their investments, forge commercial alliances across national borders, and do business in entrepreneur-friendly zones that eschew the heavy rules and red tape of the onshore world.”

Walker Guevara says that, while that can be true, the “parallel set of rules” that governs the offshore world so disproportionately favor the elite, wealthy few as to be unethical. “Regulations, bureaucracy, and red tape are bothersome,” she says, “but that’s how democracy works.”

Perhaps the most interesting question surrounding the Offshore Project, however, is how to get traditional shoe-leather journalists up to speed on an international story that involves intensive data crunching. Walker Guevara says it’s all about recognizing when the numbers cease to be interesting on their own and putting them in global context. Ultimately, while it’s rewarding to be able to trace dozens of shell companies to a man accused of stealing $5 billion from a Russian bank, someone has to be able to connect the dots.

“This is not a data story. It was based on a huge amount of data, but once you have the name and you look at your documents, you can’t just sit there and write a story,” says Walker Guevara. “That’s why we needed reporters on the ground. We needed people checking courthouse records. We needed people going and talking to experts in the field.”

All of the stories that result from the Offshore Project — some of which could take up to a year to be published — will live on a central project page at ICIJ.org. The team is also considering creating a web app that will allow users to explore some (though probably not all) of the data. In terms of the unique tools they built, Walker Guevara says most are easily replicable by anyone using NUIX or dtSearch software, but they won’t be open sourced. Other lessons from the project, like the inherent vulnerability of PGP encryption and “other complex cryptographic systems popular with computer hackers,” will endure.

“I think one of the most fascinating things about the project was that you couldn’t isolate yourself. It was a big temptation — the data was very addictive,” Walker Guevara says. “But the story worked because there was a whole other level of traditional reporting that was going and checking public records, going and seeing — going places.”

Photo by Aaron Shumaker used under a Creative Commons license.

September 04 2012

18:47

Top 2 plug-ins for scraping data right in your browser

Scraping information from the Web can be a complicated affair -- but that's not always the case. If you've got a simple data extraction job to do for a story, check out these two inexpensive browser extensions that can help you get the job done. Read More »

August 07 2012

14:32

Archiving tweets isn’t as easy as it seems

Between deleted tweets and posts that disappear into the timeline void, it's a pain to keep track of and find information more than a few days old on Twitter. So how hard could it really be to build a capable Twitter archiver for reporters? Read More »

June 15 2011

14:43

Tanzania Media Copes with Wild Success of Feedback via SMS

For the largest civil society media platform in Tanzania, back talk is good. 



In fact, talking back is the objective of a new service at Femina HIP called Speak Up! The service aims to increase access for marginalized youth and rural communities and to promote a participatory, user-driven media scene in Tanzania.




Femina HIP is the largest civil society media platform in the country, outside of commercial mainstream media. Products include print magazines, television shows, a radio program, and an interactive website. Fema magazine, for example, has a print run of over 170,000 copies and is distributed to every rural region in the country.



Over the last few years, Femina HIP has encouraged its audience to connect and comment by sending letters, email, and SMS messages -- and comment people did. Dr. Minou Fuglesang, executive director of Femina HIP, said the platform was nearly drowning in messages.

It became clear to the team that SMS needed to be handled more systematically. Speak Up! offers an automated, organized way to receive and respond to incoming SMS messages, leaving Femina HIP better equipped to respond to comments and queries. A more automated system also helps Femina HIP embrace the young community -- one that feels a growing need to organize and participate, Fuglesang said.



How It Works

Femina HIP uses an application built by Starfish Mobile, a wireless application service provider. All SMS messages are sent to the same shortcode (15665) and the Starfish application sorts messages according to key word. (Senders have to begin the message with the key word of the product they wish to address, be it Fema magazine or the Ruka Juu na Fema TV Talk Show.) 
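Starfish Mobile’s application is proprietary, but the routing rule described here (read the first word of each SMS and file the message under the matching product) is easy to sketch. The Python fragment below is a hypothetical illustration, not Femina HIP’s or Starfish’s code; the keyword list and return format are assumptions.

    # Hypothetical keyword-based routing for messages sent to the 15665 shortcode.
    PRODUCTS = {"fema": "Fema magazine", "rukajuu": "Ruka Juu na Fema TV Talk Show"}

    def route_sms(text):
        """Return (bucket, remainder) for an incoming message."""
        parts = text.strip().split(None, 1)
        keyword = parts[0].lower() if parts else ""
        if keyword not in PRODUCTS:
            return ("invalid", text)  # a missed space or misspelled keyword lands in the trash box
        return (PRODUCTS[keyword], parts[1] if len(parts) > 1 else "")

    print(route_sms("FEMA Nimependa makala ya afya"))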



Femina HIP staff members access the application from a web-based dashboard, where they can view all incoming messages across products. Virtually all messages received are in Swahili. "It is very rare to get a message in English, let alone other languages," said Diana Nyakyi of Femina HIP. "Though if we do receive something in English, it is considered just as much as any other SMS in Swahili in terms of feedback value."



The Speak Up! service works in collaboration with local mobile providers, because the shortcode is "bound" to the providers, Nyakyi said. "However, we are keen on having a more engaging and beneficial relationship with them [the mobile operators] as partners, and some have shown interest."



Two-Way Communications

Femina HIP wants to talk back to its audience, too.

When an individual sends an SMS to the Femina HIP shortcode, he or she receives an automatic confirmation. Senders' phone numbers are automatically entered in a database, which allows Femina HIP staff to further respond to individuals. Often, this is to simply say thank you for the message. But staff can also access and respond to urgent or serious messages, including questions on issues of health, sex, suicide, or requests for advice. Currently, Femina HIP has a list of about 30,000 active mobile numbers.




The Speak Up! database can also be sorted by categories such as key word, time submitted (date, week, month), or by phone network. Statistics are available, including which phone numbers have had the most interactions with the system, and whether the interactions were via SMS vote or SMS comment. The ability to sort allows the staff to group SMS messages around content themes and inform people about relevant, upcoming programs. 
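The sorting and statistics features described above map naturally onto a few database queries. As a rough sketch only (the table and column names are invented, not Speak Up!’s actual schema):

    import sqlite3

    conn = sqlite3.connect("speakup.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (phone TEXT, keyword TEXT, kind TEXT, received TEXT)"
    )

    # Messages per keyword over the past month.
    monthly = conn.execute(
        "SELECT keyword, COUNT(*) FROM messages "
        "WHERE received >= date('now', '-1 month') GROUP BY keyword"
    ).fetchall()

    # Phone numbers with the most interactions, split by SMS vote vs. SMS comment.
    top_numbers = conn.execute(
        "SELECT phone, kind, COUNT(*) FROM messages "
        "GROUP BY phone, kind ORDER BY COUNT(*) DESC LIMIT 10"
    ).fetchall()

    print(monthly, top_numbers)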



Speak Up! wants the audience to become agenda setters, and claims to achieve "a more inclusive public debate and a more investigative reporting that mirrors everyday life in Tanzania." 


Challenges and Lessons

Femina HIP and the Speak Up! service have faced a learning curve. For example, it's been challenging to help the audience understand how to send an SMS to an automated service. "It's not as easy as it sounds because people have to understand how to use the shortcode and our key words," Fuglesang said. 



If someone misses a space or spells the key word incorrectly, for example, the SMS is marked "invalid" and ends up in the trash box. 



Similarly, if people send a message that's over the 160-character limit of a text message, the second half of it is also marked invalid. Currently, Starfish Mobile does not support these so-called concatenated SMS messages. "This is causing a problem, even though we ask our listeners to send us short messages," Fuglesang said. "People write long messages." 



For example, Speak Up! had 900 responses to a recent question, but nearly 500 ended up in the trash bin because of error or length. While the messages can be retrieved, and the team is trying to do just that, "it does pose a bit of a headache," Fuglesang said.



Another issue may be cost. While any text message costs money to send, an SMS to a shortcode carries a slightly higher cost than a standard message, Fuglesang said. "We are trying to monitor this to see if it affects the flow."



February 07 2011

15:10

Google query language question

Good afternoon.

I'm trying to analyse some data in a Google spreadsheet using the query language, but I've hit a brick wall and hope someone can help.

Here's the data: https://spreadsheets.google.com/ccc?key=0ApL1zT2P00q5dE1tQVhLcHdiZ1hQRXBvc0tFdFV1Zmc&hl=en_GB&authkey=CISto5cD

I've created a query that allows me to create a table listing the number of crimes in a local neighbourhood in ascending order, thus:

https://spreadsheets.google.com/tq?tqx=out:html&tq=select C, D order by D&key=tMmAXKpwbgXPEposKEtUufg&hl=en_GB#gid=1

All I want to do is reorder the table so that the highest are listed first.

Can anyone help?
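For anyone else who hits the same wall: in the Google Visualization query language, appending the keyword desc to the order by clause sorts in descending order, so changing the query portion of the URL above to the line below should, in principle, list the highest counts first (untested against this particular spreadsheet):

    select C, D order by D desc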

December 01 2010

15:30

Keeping track of political candidates online: Web archiver Perpetually follows the digital campaign trail

There is one huge, almost infinitely wide memory gap in our culture that can be summed up with this question: Where does the Internet go when it dies? Not the whole Internet, but the individual websites and pages that every day are modified and deleted, discarded and cached. Who can a journalist turn to when needing to look up the older version of a website, a retired blog, or a deleted Facebook post?

It turns out, not many people. The hole that Nexis plugs for academic papers and the newspapers of the world has few equivalents online. The Internet Archive’s once-excellent Wayback Machine — an attempt at a complete, Library of Congress-worthy web archive — is now fairly useless in today’s social-media-driven web world, storing a slipshod record of photos, multimedia, and basically anything that’s not Web 1.0 and, on top of that, taking up to a year for updates to appear in its index after its spider has crawled a site.

This election season, as candidates propped up their digital campaign booths online with Twitter feeds and new, snazzy websites, Darrell Silver, founder of the Perpetually Public Data Project, realized this was actually kind of terrifying. For all the thousands of reporters following candidates’ buses and rallies, there was no mechanism to follow the campaign trail online. Anything pledged on a candidate’s website could be wiped out with the click of a mouse — and without so much as a peep.

To fill this collective memory hole, for the 2010 midterms, Perpetually archived the websites — and the Facebook, MySpace, and Twitter accounts — of every politician it could find: all the major candidates for all 435 House and 37 Senate races. And it archived every change at every second of every minute: Flash, blog posts, photos, whatever — with the exception of YouTube videos, which posed a copyright conflict and which Silver decided to discard.

The result has been a great experiment that’s made at least one news splash and brought the technology onto the Huffington Post. After the election, Silver added every congressperson — newly elected or not — and every governor to Perpetually’s database. Imagine the difference at some point in the future: Anyone will be able to zoom into any point in the past, load up a politician’s website, and see how things stood on any given day in any given year. And then, with a few clicks more, to scroll through the politician’s history and build a larger story about the politician over a wide, career-length timespan. “We’re trying to be the undebatable reference point for the source material and the proof of what happened when,” Silver said.

The site, thus far, has given that goal its best shot, although it has a way to go. In terms of the breadth of his archive and the depth of its storage, Silver’s peerless. ProPublica’s Versionista is perhaps his closest competitor, but for now it doesn’t track candidate sites, only the Whitehouse.gov site. Moreover, the Versionista platform shows only specific HTML-coded changes, so it monitors mostly text and lacks a screenshot archive, a complete record of images, and interactive elements. Silver had many of the same critiques — lack of interactive elements, a generally superficial archiving — for the Wayback Machine (not to mention its dinosaur lag time in updating its archive).

If his database is the gold standard for Internet archiving, on Perpetually’s front end — the site visitors use to navigate the database — the story was less impressive. In the rush to get things up, a shaky vision for the project created the odd mess of creaky widgets, bridge-to-nowhere links, and brilliant data archiving that was the site for the few weeks it was live.

The site as it existed is a good case study in how a great concept with poor execution can crash and burn — and then potentially redeem itself. In Silver’s defense, he had little time to get things together. He began archiving candidates’ sites in June — not knowing exactly what he would do with the data — and, with a team of only five, managed to have a website up for the general public by early October.

But it was painful to use. You could see that some idea, some vision, was at work, but it was hard to see how whoever was behind the thing actually thought they could pull it off. Links broke, videos gave errors, and community was non-existent. The annotations page — an absurd Tumblr-style page with no limit on entries — would, with a larger user base, have sent an average laptop crashing to its knees, and text-diff mode gave an HTML page read-out: a fairly frightening chunk of words and symbols specializing in alienation and confusion.

The good news, though, is that as far as Perpetually’s future is concerned, its history doesn’t matter: Perpetually has gone into hibernation for a complete overhaul and redesign. “One of the things I learned is that there’s a huge amount of interest in tracking politicians who are nationally or locally interesting,” said Silver. “But you have to provide a lot better and more immediate goals and feedback.”

Silver’s looked at the Guardian’s expense-scandal tracker for ideas on how to use better crowdsourcing mechanisms, like promoting what’s interesting and highlighting top users. And he likes Versionista’s feed-subscription service that gives users instant notification of changes made by a specific candidate. Silver — who is far more of a tech geek than politico — just did not understand a political junkie’s motivations, but he’s clearly getting there, and it is likely that his redesign will showcase a savvy pastiche of social media tools he culls from around the Internet.

If these changes make the site user-friendly, journalists should rejoice. As it stands, the tools available to journalists to retrieve information about a candidate’s online campaign trail are unreliable and incomplete, jeopardizing online accountability. We’ve already seen how easily that can happen. Perpetually provides a common resource to circumvent this problem. “That ability to see, to go beyond the Wikipedia summary is vital to…the history to what this person is saying,” as Silver put it.

Non-journalists — whoever these people might be — have reason to celebrate, too. It’s easy to imagine a day when early website incarnations have Americana value, like the presidential TV ads The Museum of the Moving Image has rediscovered and archived online. The White House itself seems to be getting in on the idea: it has created “frozen in time” portraits of previous administrations’ websites, anointing them with the exclusive “.gov” extension along with the program.

These are big ideas — an institutional memory hole, the making of a blog into classic memorabilia — and the opportunity is there for Silver to make them a reality. But before any of that happens, he still has to get the details right. He has set forth three things he believes his audience wants and that a remade Perpetually must do for them:

“People want to know about significant changes and want to research the candidates they don’t know about. [They] want to be kept up to date and want a way to do that really easily. The third thing they want is to participate. They all want to improve the election process and want to discuss and do it in an efficient way.”

News organizations, take note: Leading up to 2012, Perpetually’s a site to watch.

October 16 2010

13:39

ScraperWiki: Hacks and Hackers day, Manchester.

If you’re not familiar with ScraperWiki, it’s “all the tools you need for Screen Scraping, Data Mining & visualisation”.

These guys are working really hard at convincing journos that data is their friend by staging a steady stream of events that bring journos and programmers together to see what happens.

So I landed at NWVM’s offices, amid what seemed like a mountain of laptops, fried food, coke and biscuits, to be one of the judges of their latest Hacks and Hackers day in Manchester (#hhhmcr). I was expecting some interesting stuff. I wasn’t disappointed.

The winners

We had to pick three prizes from the six or so projects started that day, and here’s what we (Tom Dobson, Julian Tait and me) ended up with.

The three winners, in reverse order:

Quarternote: A website that would ‘scrape’ Myspace for band information. The idea was that you could put a location and style of music into the system and it would compile a line-up of bands.

A great idea (although more hacker than hack), and if I was a dragon I would consider investing. These guys also won the ScraperWiki ‘cup’ award for being brave enough to have a go at scraping data from Myspace. Apparently Myspace content has less structure than custard! The collective gasps from the geeks in the room when they said that was what they wanted to do underlined that.

Second was Preston’s summer of spend. Local councils are supposed to make details of any invoice over 500 pounds available, and many have. But many don’t make the data very usable. Preston City Council is no exception: PDFs!

With a little help from ScraperWiki the data was scraped, tidied, put in a spreadsheet and then organised. It threw up some fun stuff – 1,000 pounds to The Bikini Beach Band! – and some really interesting areas for exploration, like a single payment of over 80,000 to one person (why?), and I’m sure we’ll see more from this as the data gets a good running through. A really good example of how a journo and a hacker can work together.

The winner was one of a number of projects that took the tweets from the GMP 24hr tweet experiment; what one group titled ‘Genetically modified police’ tweeting :). Enrico Zini and Yuwei Lin built a searchable GMP24 tweet database (and a great write-up of the process) which allowed searching by location, keyword, all kinds of things. It was a great use of the data, and the working prototype was impressive given the time they had.

Credit should go to Michael Brunton-Spall of the Guardian, who turned the tweets into a usable dataset, saving a lot of work for those groups using the tweets as the raw data for their projects.

Other projects included mapping deprivation in Manchester and a legal website that, if it comes off, will really be one to watch. All brilliant stuff.

Hacks and hackers we need you

Given the increasing amount of raw data that organisations are pumping out, journalists will find themselves vital in making sure those organisations stay accountable. But as I said in an earlier post, good journalists don’t need to know how to do everything; they just need to know who to ask.

The day proved to me, and I think to lots of people there, that asking a hacker to help sort data out is really worth it.

I’m sure there will be more blogs etc about the day appearing over the next few days.

Thanks to everyone concerned for asking me along.

September 08 2010

04:28

Any open source database tools like Google Fusion Tables out there?

A previous question pointed me to Google Fusion Tables, and I've been hooked. I'd like to give my audience a small fraction of that power, or at least give it to our newsroom, but I've been unable to find any open source work being done on similar projects (super friendly UI wrapped around a database).

Our site is processing a lot of datasets, usually in Excel format, and I'm wondering if anyone has found some FOSS that could help us do a better job presenting it and opening it up to others.

We're doing some jQuery formatting, but anything deeper than that would be much appreciated.

August 05 2010

22:58

How do you keep track of FOI requests, and what do you do with the responses after?

Do you have an organization-wide tracking system for FOI requests? A personal one? Do you just mark your calendar, "Bug state department for them darn files"?

And what about the results of your FOI requests? I'm always surprised how few news orgs put the actual documents online (documentcloud.org notwithstanding) and so wonder: Do you have an organization-wide or personal filing system? Does it go into your news org library? Is this just lost to an internal memory hole?

July 23 2010

21:24

Automatically geotagging longitude and latitude on addresses in Google Doc spreadsheet?

I have a spreadsheet of addresses and some other information I'd like to throw onto a Google map using this tool, but it requires the data have a longitude and latitude. Anyone know an easy way to do this in Google docs, since I've got some 4,000 rows?

Here's a sample row:

XAVIER'S MARKET, INC | 290 N FRONT ST | NEW BEDFORD | MA | $129,692.44
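One possible route, offered as a sketch rather than a tested recipe for this sheet: export the addresses, geocode them outside the spreadsheet with Python's geopy library and the Nominatim geocoder, and paste the latitude/longitude columns back in. Nominatim's usage policy asks for roughly one request per second, so 4,000 rows would take over an hour; the file names below are made up.

    import csv
    import time

    from geopy.geocoders import Nominatim  # pip install geopy

    geocoder = Nominatim(user_agent="address-geocoding-example")

    with open("addresses.csv", newline="") as src, open("geocoded.csv", "w", newline="") as dst:
        reader = csv.reader(src, delimiter="|")
        writer = csv.writer(dst)
        for row in reader:
            # Row layout: name | street | city | state | amount
            address = ", ".join(part.strip() for part in row[1:4])
            location = geocoder.geocode(address)
            lat, lon = (location.latitude, location.longitude) if location else ("", "")
            writer.writerow(row + [lat, lon])
            time.sleep(1)  # stay within Nominatim's rate limit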

June 17 2010

09:28

TechSoup Webinar: Harness the Power of Your Data with CRM

Techsoup Talks LogoWe all know that it’s important to collect information about our donors, vendors, volunteers, partners, and members. But we don’t always have a good system for capturing this information. In the perfect world our data would be stored all in the same place, in the same way, and be easily accessible by key staff and board members. It would also show relationships and connections.

read more

May 29 2010

19:08

Spreadsheet or database of state-by-state, county-by-county government agencies?

I'm looking for a downloadable/scrape-able database or spreadsheet document that has a list of state and local government offices, alongside (fingers crossed) address and contact info.

Some Googling didn't turn it up, but thought someone on here might have some pointers.

04:58

Ideas or Examples of NoSQL for News?

Normally, I like my data structured, and that tends to mean some flavor of SQL. But I've been seeing glimpses of NoSQL popping up lately and I'm wondering if anyone here can share a few good uses of NoSQL in the newsroom. A few I've seen:

What else is out there? And what code can you share?
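By way of one small, hedged illustration of the document-store flavor of NoSQL (not drawn from any particular newsroom's code): storing articles as schemaless documents in MongoDB through the pymongo driver looks roughly like this, with the database, collection, and field names invented for the example.

    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("mongodb://localhost:27017")
    articles = client.newsroom.articles  # hypothetical database and collection

    # Documents need not share a schema; add fields as a story requires them.
    articles.insert_one({
        "slug": "city-budget-2010",
        "headline": "City budget passes after late-night vote",
        "tags": ["local", "budget"],
        "word_count": 850,
    })

    for doc in articles.find({"tags": "budget"}):
        print(doc["headline"])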

May 23 2010

18:48

Storing video in mysql?

What's the best way to store videos for an app? Database or filesystem? How?
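One common answer, offered here as a sketch rather than a definitive recommendation: keep the video files themselves on the filesystem (or object storage) and store only their paths and metadata in MySQL, since pushing large binaries into the database as BLOBs tends to bloat it and complicate backups. The table, credentials, and directory below are placeholders.

    import shutil
    import uuid
    from pathlib import Path

    import mysql.connector  # pip install mysql-connector-python

    VIDEO_DIR = Path("/var/media/videos")  # hypothetical storage location

    def save_video(upload_path, title):
        """Copy the uploaded file into the media directory and record its path in MySQL."""
        dest = VIDEO_DIR / (str(uuid.uuid4()) + Path(upload_path).suffix)
        shutil.copy(upload_path, dest)

        conn = mysql.connector.connect(
            host="localhost", user="app", password="secret", database="newsapp"
        )
        cur = conn.cursor()
        cur.execute("INSERT INTO videos (title, path) VALUES (%s, %s)", (title, str(dest)))
        conn.commit()
        video_id = cur.lastrowid
        conn.close()
        return video_id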

May 04 2010

19:36

What are your caching (and cache-busting) tips and tricks?

Many of us work on database-intensive web apps and rely on caching for site performance. So what are the tips, tricks and tweaks you've developed to keep your apps humming along while avoiding database meltdowns?

From the big things (got a great strategy for caching server config?) down to little plugin approaches (like the example I'll put below).

April 10 2010

12:13

TechSoup Webinar: Finding the Perfect Donor Database in an Imperfect World

There are nearly two hundred donor databases on the market. Each has its own strengths and weaknesses, fans and foes. The challenge is to find a system with strengths that meet your needs, weaknesses that won’t get in your way, at a price you can afford.

This workshop will cover the basic concepts you will need to make a decision.

Topics will include:

  • What to expect from a fundraising database.
  • When to consider a change.
  • How to make the decision. 
  • Why not build your own database?

This webinar is appropriate for Executive Directors, fundraisers, donor database managers, and anyone else who needs to help their organization choose a new fundraising database.

read more

March 19 2010

20:12

Resurrecting Unstructured Data to Help Small Newspapers

Unstructured data is typically said to account for up to 80 percent of information stored on business computer systems. While this is a widely accepted notion, I'm inclined to agree with Seth Grimes that this 80 percent rule is inflated, depending on the type of business. Still, if we could structure even a fraction of that data, it would create significant value for small newspapers.

The type of data that has my attention is free-form text. Small newspapers in particular have computers full of text files containing information about their communities. Often, these files lie dormant, left on the hard drive of a dusty computer somewhere in the back of the newsroom, inaccessible to the public. Compounding this problem is the fact that newspapers realize no additional value from content they paid journalists to produce. The information is gathered, and then much of it sits somewhere, unused and untouched. Only parts of it end up being published.

To further understand the potential of resurrecting unstructured data, one must realize the workflow of traditional small newspapers.

Newspaper Workflow

It surprised me several years ago when I learned that most community newspapers utilize a very low-tech workflow when managing their data. A typical newspaper might organize their content in hierarchical folders as shown in the example below. Files are grouped by month, then named with the day of publication:

[Screenshot: a newsroom folder tree with one folder per month, containing text files named by day of publication]

The workflow is simple, effective and has served its purpose for many years. Once a file's publication date has passed, it is ignored forever. At best, a selection of these files is copied and pasted into a content management system for publication online. But this process seldom happens until after the newspaper's print edition has been completed. At this point the newspaper has little incentive to process these files further, as attention must now be focused on the next day's edition.

This reality helps illustrate the potential for the CMS Upload Utility, my Knight News Challenge project. It's an inexpensive way to move text files into a web-accessible database. Once inside a database, possibilities abound for how value can be created from this data. In my next post, I'll share several sample use cases to help explain how the application works.

For now, though, think about all of that unstructured data, and how we can make better use of it.
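The post doesn't include the utility's code, but the core move (walk the month-by-month folders and load each day's text file into a web-accessible database) can be sketched in a few lines of Python. The folder layout and table schema below are assumptions for illustration, not the CMS Upload Utility itself.

    import sqlite3
    from pathlib import Path

    ARCHIVE_ROOT = Path("newsroom_archive")  # e.g. newsroom_archive/2010-03/19.txt (hypothetical layout)

    conn = sqlite3.connect("stories.db")
    conn.execute("CREATE TABLE IF NOT EXISTS stories (month TEXT, day TEXT, body TEXT)")

    for month_dir in sorted(ARCHIVE_ROOT.iterdir()):
        if not month_dir.is_dir():
            continue
        for day_file in sorted(month_dir.glob("*.txt")):
            conn.execute(
                "INSERT INTO stories (month, day, body) VALUES (?, ?, ?)",
                (month_dir.name, day_file.stem, day_file.read_text(errors="ignore")),
            )
    conn.commit()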


March 13 2010

02:47

Digg Says Yes To NoSQL Cassandra DB, Bye To MySQL

donadony writes "After twitter, now it's Digg who's decided to replace MySQL and most of their infrastructure components and move away from LAMP to another architecture called NoSQL that is based in Cassandra, an open source project that develops a highly scalable second-generation distributed database. Cassandra was open sourced by Facebook in 2008 and is licensed under the Apache License. The reason for this move, as explained by Digg, is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth has forced them into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead."

Read more of this story at Slashdot.

Tags: database

December 17 2009

16:30

KNC 2010: FollowIndy tries to marry aggregation and geography

[EDITOR'S NOTE: We're highlighting a few of the entries in this year's Knight News Challenge, which just closed Tuesday night. Did you know of an entry worth looking at? Email Mac or leave a brief comment on this post. —Josh]

Former Indianapolis Star software developer Chris Vannoy brings something unusual to his News Challenge application: a fully functional site already built on nights and weekends.

FollowIndy is a hyperlocal aggregator, tapping into the vast web of information published through Twitter, Flickr, news sites, and blogs. Its value is in the limits of its geography: The site only targets news and information relevant to Indianapolis. “Unlike a lot of aggregators that sort of cast a wide net, the idea is to get a very small net that’s aiming for a specific area,” Vannoy said. “It’s about getting a full picture of what’s going on in Indianapolis and then providing some context around what people are talking about.”

Once the sources are pulled into FollowIndy, content is automatically tagged and aggregated, which makes it possible to pull together all material related to arson, apartment complexes, or Peyton Manning.
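Vannoy doesn't describe his tagging code, and FollowIndy's real tagger is presumably more sophisticated, but a naive version of automatic tagging (matching each incoming item against a list of known topics) can be sketched like this; the topic list and sample item are invented for the example.

    import re

    TOPICS = ["arson", "apartment complex", "Peyton Manning", "flu"]  # hypothetical topic list

    def auto_tag(text):
        """Return the topics mentioned in an item's text."""
        return [t for t in TOPICS if re.search(re.escape(t), text, re.IGNORECASE)]

    item = "Fire crews suspect arson at an apartment complex on the east side."
    print(auto_tag(item))  # ['arson', 'apartment complex']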

Aggregating both professional and personal feeds means Vannoy has data to track how stories are pushed by each — if mainstream media is pushing a story that’s then being picked up by personal users, or vice versa. That’s similar to the Media Cloud project of our friends down the street here at Harvard. For instance, in a visualization of mentions of the word “flu” in the sources FollowIndy tracks, mentions spike after The Indianapolis Star’s mention, at around 24 seconds in.

So if FollowIndy is already up and running, what does Vannoy need $100,000 of Knight money for? Vannoy wants to expand the network of sources it tracks (it doesn’t yet include local blogs, for instance), and it needs a variety of infrastructure improvements, such as better autotagging of content. But he believes FollowIndy is a model that can be duplicated in other markets, as he writes in his application:

By focusing on a single geographic area, you can go deep: pulling in blogs…alternative news weeklies, business journals, television stations and Twitter to try to grab every last speck of news. It’s that volume of data that suddenly makes things interesting and makes some things possible. Small-scale trends by geography, what a geographic area is linking to and the like become not just possible, but relatively easy.
