
August 09 2012

12:19

Two reasons why every journalist should know about scraping (cross-posted)

This was originally published on Journalism.co.uk – cross-posted here for convenience.

Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver and a great source of stories no one else has.

Scraping is, simply, getting a computer to capture information from online sources. They might be a collection of webpages, or even just one. They might be spreadsheets or documents which would otherwise take hours to sift through. In some cases, it might even be information on your own newspaper website (I know of at least one journalist who has resorted to this as the quickest way of getting information that the newspaper has compiled).

In May, for example, I scraped over 6,000 nomination stories from the official Olympic torch relay website. It allowed me to quickly find both local feelgood stories and rather less positive national angles. Continuing to scrape also led me to a number of stories which were being hidden, while having the dataset to hand meant I could instantly pull together the picture of a single day on which one unsuccessful nominee would have run, and I could test the promises made by organisers.

ProPublica scraped payments to doctors by pharma companies; the Ottawa Citizen ran stories based on its scrape of health inspection reports. In Tampa Bay they run an automatically updated page on mugshots. And it’s not just about the stories: last month local reporter David Elks was using Google spreadsheets to compile a table from a Word document of turbine applications for a story which, he says, “helped save the journalist probably four or five hours of manual cutting and pasting.”

The problem is that most people imagine that you need to learn a programming language to start scraping – but that’s not true. It can help – especially if the problem is complicated. But for simple scrapers, something as easy as Google Docs will work just fine.
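To give a flavour of how simple it can be: in a Google Docs spreadsheet, a single formula can pull a table from a webpage straight into your cells. The sketch below uses the built-in ImportHTML function; the Wikipedia address is just an illustrative target, not one from the original piece.

    =ImportHTML("http://en.wikipedia.org/wiki/List_of_countries_by_population", "table", 1)

The three ingredients are the address of the page, the kind of element you want ("table" or "list"), and which one on the page to grab, counting from 1. Change any of them and the spreadsheet fetches the data again for you.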

I tried an experiment with this recently at the News:Rewired conference. With just 20 minutes to introduce a room full of journalists to the complexities of scraping, and get them producing instant results, I used some simple Google Docs functions. Incredibly, it worked: by the end The Independent’s Jack Riley was already scraping headlines (the same process is outlined in the sample chapter from Scraping for Journalists).

And Google Docs isn’t the only tool. Outwit Hub is a must-have Firefox plugin which can scrape through thousands of pages of tables, and even Google Refine can grab webpages too. Database scraping tool Needlebase was recently bought by Google, too, while Datatracker is set to launch in an attempt to grab its former users. Here are some more.

What’s great about these simple techniques, however, is that they can also introduce you to concepts which come into play with faster and more powerful scraping tools like Scraperwiki. Once you’ve become comfortable with Google spreadsheet functions (if you’ve ever used =SUM in a spreadsheet, you’ve used a function) then you can start to understand how functions work in a programming language like Python. Once you’ve identified the structure of some data on a page so that Outwit Hub could scrape it, you can start to understand how to do the same in Scraperwiki. Once you’ve adapted someone else’s Google Docs spreadsheet formula, then you can adapt someone else’s scraper.
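To make that last step concrete, here is a minimal sketch of a simple scraper in Python. It assumes the third-party requests and BeautifulSoup libraries are installed; the URL and the h3 tag are illustrative placeholders, not details of any site mentioned above.

    import requests
    from bs4 import BeautifulSoup

    def scrape_headlines(url):
        """Return the text of every <h3> element on the page at url."""
        response = requests.get(url)
        response.raise_for_status()  # stop early if the page didn't load
        soup = BeautifulSoup(response.text, "html.parser")
        return [tag.get_text(strip=True) for tag in soup.find_all("h3")]

    print(scrape_headlines("http://example.com/news"))

The shape should look familiar from the spreadsheet version: a function that takes a page address, finds a repeated structure on the page (here, h3 tags), and hands back the data. Adapting someone else’s scraper usually starts with nothing more than changing the URL and the tag name.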

I’m saying all this because I wrote a book about it. But, honestly, I wrote a book about this so that I could say it: if you’ve ever struggled with scraping or programming, and given up on it because you didn’t get results quickly enough, try again. Scraping is faster than FOI, can provide more detailed and structured results than a PR request – and allows you to grab data that organisations would rather you didn’t have. If information is a journalist’s lifeblood, then scraping is becoming an increasingly key tool to get the answers that a journalist needs, not just the story that someone else wants to tell.

September 17 2010

11:27

Financial protection for NCTJ courses

Rachel McAthy at Journalism.co.uk chips in to the recent NCTJ debate, asking “NCTJ accreditation: essential or an outdated demand?” She reports on the recent meeting of the NCTJ’s cross-media accreditation board, where the answer is an emphatic, if predictable, yes.

Most interesting for me, though, was a quote from the report of the meeting by Professor Richard Tait, director of the Centre for Journalism Studies at Cardiff University:

While the NCTJ is quite right to insist on sufficient resources and expertise so that skills are properly taught and honed, education is a competitive market, and NCTJ courses are expensive to run. In the likely cuts ahead, it is vital for accredited courses to retain their funding so that they are not forced to charge students exorbitant fees; otherwise, diversity will be further compromised.

On the face of it, a reasonable demand – but one that in turn demands a lot more clarification. Who should be offering that financial security? The universities, the industry, or the NCTJ, which takes a fee?

Some more NCTJ bursaries, perhaps…

September 10 2010

13:35

#jpod: The week’s biggest news stories from Journalism.co.uk, 10 September 2010

August 27 2010

11:00

#jpod: The week’s biggest news stories from Journalism.co.uk, 27 August 2010

August 13 2010

13:32

Podcast: The week’s biggest media stories on Journalism.co.uk

August 06 2010

13:50

Podcast: The week’s biggest media stories on Journalism.co.uk

July 22 2010

19:08

Some other online innovators for some other list

Journalism.co.uk have a list of this year’s “leading innovators in journalism and media”. I have some additions. You may too.

Nick Booth

I brought Nick in to work with me on Help Me Investigate, a project for which he doesn’t get nearly enough credit. It’s his understanding of and connections with local communities that lie behind most of the successful investigations on the site. In addition, Nick helped spread the idea of the social media surgery, where social media savvy citizens help others find their online voice. The idea has spread as far as Australia and Africa.

Matt Buck and Alex Hughes

Matt and Alex have been busily reinventing news cartoons for a digital age with a number of projects, including Drawnalism (event drawing), animated illustrations, and socially networked characters such as Tobias Grubbe.

Pete Cashmore

Mashable.

Tony Hirst

Tony has been blogging about mashups for longer than most at OUseful.info, providing essential help for journalists getting to grips with Yahoo! Pipes, Google spreadsheets, scraping, and – this week – Google App Inventor.

Adrian Holovaty and Simon Willison

I’m unfairly bunching these two together because they were responsible – with others – for the Django web framework, which has been the basis for some very important data journalism projects including The Guardian’s experiment in crowdsourcing analysis of MPs’ redacted expenses, and Holovaty’s Everyblock.

Philip John

Behind the Lichfield Blog and, equally importantly, Journal Local – the platform for hyperlocal publishers which comes with a raft of useful plugins pre-installed. He also runs the West Midlands Future of News Group.

Christian Payne

Documentally has been innovating and experimenting with mobile journalism for years in the UK, with a relaxed-but-excitable on-screen/on-audio presence that suits the medium perfectly. And he really, really knows his kit.

Meg Pickard

Meg is an anthropologist by training, a perfect background for community management, especially when combined with blogging experience that pre-dates most in the UK. The practices she has established on the community management front at The Guardian’s online operations are an exemplar for any news organisation – and she takes lovely photos too.

Chris Taggart

Chris has been working so hard on open data in 2010 I expect steam to pour from the soles of his shoes every time I see him. His ambition to free up local government data is laudable and, until recently, unfashionable. And he deserves all the support and recognition he gets.

Rick Waghorn

One of the first regional newspaper reporters to take the payoff and try to go it alone online – first with his Norwich City website, then the MyFootballWriter network, and more recently with the Addiply self-serve ad platform. Rick is still adapting and innovating in 2010 with some promising plans in the pipeline.

I freely admit that these are based on my personal perspective and knowledge. And yes, lists are pointless, and linkbait.

June 11 2010

07:30

#VOJ10: Follow the Value of Journalism conference

The BBC College of Journalism and media thinktank Polis are hosting a one-day conference today to discuss the value of networked journalism, free newspapers, political and government reporting and ‘grassroots’ journalism. Keynote speakers include Channel 4 News presenter Jon Snow and BBC Global News director Peter Horrocks, interviewed by Journalism.co.uk at this link.

Journalism.co.uk is hosting the session on ‘grassroots journalism’ and we will be discussing what new ‘hyperlocal’ start-ups are up to, how sustainable these ventures are, and what opportunities this trend could in turn create for ‘big’ media groups in the local space. In keeping with the title of the conference, we’re hoping to move the discussion away from what counts as hyperlocal, or definitions of ‘citizen journalism’, and talk about the value of ‘grassroots journalism’ to the public and the media in the UK as a whole.

For our updates you can follow @journalism_live on Twitter – there’s also a hashtag of #VOJ10 and tweets from the conference in the liveblog below:

June 10 2010

11:48

#newsrw: 10 tickets left – get yours before the price goes up

There are just 10 tickets left for news:rewired – the nouveau niche, Journalism.co.uk’s one-day event on 25 June for journalists working within a specialist beat or patch.

If you want one now – here’s the link to book: http://www.journalism.co.uk/195/

The price is currently discounted at £80 (+VAT), but will return to the full price of £100 (+VAT) tomorrow, Friday 11 June.

If you need more convincing, full details of the day are at this link. In summary we’ve got speakers from MSN UK, the Financial Times, Reed Business Information and the BBC discussing paid content, mobile, social media, data journalism and much, much more.

If you’re not able to attend you’ll be able to follow proceedings on @newsrewired and http://www.newsrewired.com.
