August 09 2012

12:19

Two reasons why every journalist should know about scraping (cross-posted)

This was originally published on Journalism.co.uk – cross-posted here for convenience.

Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver, and a great source of stories no one else has.

Scraping is, simply, getting a computer to capture information from online sources. They might be a collection of webpages, or even just one. They might be spreadsheets or documents which would otherwise take hours to sift through. In some cases, it might even be information on your own newspaper website (I know of at least one journalist who has resorted to this as the quickest way of getting information that the newspaper has compiled).

In May, for example, I scraped over 6,000 nomination stories from the official Olympic torch relay website. It allowed me to quickly find both local feelgood stories and rather less positive national angles. Continuing to scrape also led me to a number of stories which were being hidden, while having the dataset to hand meant I could instantly pull together the picture of a single day on which one unsuccessful nominee would have run, and I could test the promises made by organisers.

ProPublica scraped payments to doctors by pharma companies; the Ottawa Citizen ran stories based on its scrape of health inspection reports. In Tampa Bay they run an automatically updated page on mugshots. And it’s not just about the stories: last month local reporter David Elks was using Google spreadsheets to compile a table from a Word document of turbine applications for a story which, he says, “helped save the journalist probably four or five hours of manual cutting and pasting.”

The problem is that most people imagine that you need to learn a programming language to start scraping – but that’s not true. It can help – especially if the problem is complicated. But for simple scrapers, something as easy as Google Docs will work just fine.

I tried an experiment with this recently at the News:Rewired conference. With just 20 minutes to introduce a room full of journalists to the complexities of scraping, and get them producing instant results, I used some simple Google Docs functions. Incredibly, it worked: by the end The Independent’s Jack Riley was already scraping headlines (the same process is outlined in the sample chapter from Scraping for Journalists).
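
To give a sense of how little is involved – this is an illustrative formula, not necessarily the one used on the day – typing something like =IMPORTHTML("http://www.example.com/page", "table", 1) into a cell of a Google spreadsheet pulls the first table on that page straight into the sheet, while =IMPORTFEED("http://www.example.com/rss") does the same for the latest items in an RSS feed.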

And Google Docs isn’t the only tool. Outwit Hub is a must-have Firefox plugin which can scrape through thousands of pages of tables, and even Google Refine can grab webpages too. Database scraping tool Needlebase was recently bought by Google, too, while Datatracker is set to launch in an attempt to grab its former users. Here are some more.

What’s great about these simple techniques, however, is that they can also introduce you to concepts which come into play with faster and more powerful scraping tools like Scraperwiki. Once you’ve become comfortable with Google spreadsheet functions (if you’ve ever used =SUM in a spreadsheet, you’ve used a function) then you can start to understand how functions work in a programming language like Python. Once you’ve identified the structure of some data on a page so that Outwit Hub could scrape it, you can start to understand how to do the same in Scraperwiki. Once you’ve adapted someone else’s Google Docs spreadsheet formula, then you can adapt someone else’s scraper.
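
For anyone curious what that next step looks like, here is a minimal sketch of the same headline-scraping idea in Python – an illustration only, not code from Scraperwiki or the book. It assumes the freely available requests and BeautifulSoup libraries are installed and uses example.com as a stand-in address:

# A minimal illustrative headline scraper – not the book's code or Scraperwiki's.
# Assumes the third-party libraries requests and beautifulsoup4 are installed
# (pip install requests beautifulsoup4); example.com is a placeholder URL.
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url):
    """Fetch a page and return the text of every <h2> element on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early if the page did not load
    soup = BeautifulSoup(response.text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

if __name__ == "__main__":
    for headline in scrape_headlines("http://example.com"):
        print(headline)

The point is the shape of it: a function that takes an address and hands back a list, just as a spreadsheet function takes a cell and hands back a value.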

I’m saying all this because I wrote a book about it. But, honestly, I wrote a book about this so that I could say it: if you’ve ever struggled with scraping or programming, and given up on it because you didn’t get results quickly enough, try again. Scraping is faster than FOI, can provide more detailed and structured results than a PR request – and allows you to grab data that organisations would rather you didn’t have. If information is a journalist’s lifeblood, then scraping is becoming an increasingly key tool to get the answers that a journalist needs, not just the story that someone else wants to tell.

May 27 2011

16:27

#Newsrw: ‘It’s about creating a party and making it rock’

Connecting with the readership and being the centre of things is key to social media strategy, the news:rewired audience heard.

“You have to acknowledge that audience is far larger than editorial team, and they’ll outperform you,” said Jack Riley, head of digital audience and content development at the Independent.

Riley presented some impressive stats – a combined total of 100,000 Likes across all the Independent’s Facebook pages – but said recognising audience needs was far more important than sheer numbers.

[Photo: @_JackRiley speaking at #newsrw, by JosephStash]

This was something that Mark Johnson of the Economist agreed with, pointing to the organisation’s total of 1.2 million Twitter followers across various accounts but saying that a more important metric is the number of people who come to the main site from social web sources.

A popular feature called Ask The Economist covers specialist topics and lets readers take part in live question and answer sessions with experts.

“The long and short of it is to work out what’s special about your brand and rather than change that, work out how it can work in a social environment,” he said.

Riley looked at the importance of the social graph, and harnessing the engagement that publications can get from existing communities like Facebook and Twitter.

He mentioned Trove, a project by the Washington Post that aims to create a personalised news feed based on readers’ social activities along with content from the newspaper.

Further on in the discussion were some interesting ideas by Suw Charman-Anderson and Stefan Stern.

Charman-Anderson mentioned the 1/9/90 rule – for every one user that heavily engages with content there are nine who moderately engage and 90 who simply view content and don’t engage at all.

Stern also said that it was “a big ask for journalists to open up,” and that traditionally there are many journalists and columnists who don’t want to engage with the public at all.

As Mark Jones of Reuters said: “It’s not about being the centre of attention any more, it’s about creating a party and making it rock” – publications need to enter into open collaboration with other organisations (like Reuters have with Tweetminster) as well as making themselves an essential part of the social web.

11:53

LIVE: Session 2B – Social media strategy

We have Matthew Caines and Ben Whitelaw from Wannabe Hacks liveblogging for us at news:rewired all day. You can follow session 2B ‘Social media strategy’, below.

Session 2B features: Jack Riley, head of digital audience and content development, the Independent; Stefan Stern, director of strategy, Edelman; Mark Jones, global communities editor, Reuters News; and Mark Johnson, community editor, the Economist. Moderated by Suw Charman-Anderson, social technologist.

January 12 2011

20:23

The Independent’s Facebook revolution

[Image: Like Robert Fisk]

The Independent newspaper has introduced a fascinating new feature on the site that allows users to follow articles by individual writers and news about specific football teams via Facebook.

It’s one of those ideas so simple you wonder why no one else appears to have done it before: instead of just ‘liking’ individual articles, or having to trudge off to Facebook to see if there’s a relevant page you can become a fan of, the Indie have applied the technology behind the ‘Like’ button to make the process of following specific news feeds more intuitive.

To that end, you can pick your favourite football team from this page or click on the ‘Like’ button at the head of any commentator’s homepage. The Independent’s Jack Riley says that the feature will be rolled out to columnists next, followed by public figures, places, political parties, and countries.

The move is likely to pour extra fuel on the overblown ‘RSS is dying’ discussion that has been taking place recently. The Guardian’s hugely impressive hackable RSS feeds (with full content) are somewhat put in the shade by this move – but then the Guardian have generated enormous goodwill in the development community for that, and continue to innovate. Both strategies have benefits.

At the moment the Independent’s new Facebook feature is plugged at the end of each article by the relevant commentator or about a particular club. Given how few people read articles through to the end, that’s not the best placement, nor the best designed to catch the eye, and it will be interesting to see whether the placement and design change as the feature is rolled out.

It will also be interesting to see how quickly other news organisations copy the innovation.

More coverage at Read Write Web and Future of Media.

August 17 2010

13:54

#followjourn: @_JackRiley – Jack Riley/digital media editor

#followjourn: Jack Riley

Who? Jack Riley, digital media editor at the Independent.

Where? Jack has a blog on the Independent website, older posts from which are archived at this link. You can also find him on Journalisted and LinkedIn.

Contact? @_JackRiley

Just as we like to supply you with fresh and innovative tips every day, we’re recommending journalists to follow online too. They might be from any sector of the industry: please send suggestions (you can nominate yourself) to laura at journalism.co.uk, or to @journalismnews.

May 25 2010

15:43

Independent integrates article comments with Twitter and Facebook

The Independent has installed a new commenting system on its website in the shape of Disqus – the same as we use on Journalism.co.uk no less.

The system allows users to log in to leave a comment using a Disqus profile, but also, and more importantly, with their Twitter username and password, Facebook login or OpenID identification.

With the Twitter and Facebook logins there’s also the option to share your article comment via these sites.

Jack Riley, digital media editor at the Independent, explains in a blog post that the new system has been trialled on the site’s sport section for the past week and has improved the level of “constructive debate”.

We’re encouraging people to use credentials linked to their personal profiles not just because openness and accountability are great, fundamental things which underpin good journalism as well as good commenting (and why should the two be different?), but also because by introducing accountability into the equation, we’re hoping the tone and standard of the comments will go up (…) It’s about first of all letting people authenticate their commenting using systems with which they’re already familiar (in Facebook’s case, that’s 400 million people worldwide and counting), and secondly, it’s about restoring your trust in our comments section, so that some of the really great submissions we get on there rise to the top, the bad sink to the bottom, and the ugly – the spam and abuse that are an inevitable adjunct of any commenting system – don’t appear at all.
