This was originally published on Journalism.co.uk – cross-posted here for convenience.
Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver, and a great source of stories no one else has.
Scraping is, simply, getting a computer to capture information from online sources. They might be a collection of webpages, or even just one. They might be spreadsheets or documents which would otherwise take hours to sift through. In some cases, it might even be information on your own newspaper website (I know of at least one journalist who has resorted to this as the quickest way of getting information that the newspaper has compiled).
In May, for example, I scraped over 6,000 nomination stories from the official Olympic torch relay website. It allowed me to quickly find both local feelgood stories and rather less positive national angles. Continuing to scrape also led me to a number of stories which were being hidden, while having the dataset to hand meant I could instantly pull together the picture of a single day on which one unsuccessful nominee would have run, and I could test the promises made by organisers.
ProPublica scraped payments to doctors by pharma companies; the Ottawa Citizen ran stories based on its scrape of health inspection reports. In Tampa Bay they run an automatically updated page on mugshots. And it’s not just about the stories: last month local reporter David Elks was using Google spreadsheets to compile a table from a Word document of turbine applications for a story which, he says, “helped save the journalist probably four or five hours of manual cutting and pasting.”
The problem is that most people imagine that you need to learn a programming language to start scraping - but that’s not true. It can help - especially if the problem is complicated. But for simple scrapers, something as easy as Google Docs will work just fine.
I tried an experiment with this recently at the News:Rewired conference. With just 20 minutes to introduce a room full of journalists to the complexities of scraping, and get them producing instant results, I used some simple Google Docs functions. Incredibly, it worked: by the end The Independent’s Jack Riley was already scraping headlines (the same process is outlined in the sample chapter from Scraping for Journalists).
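To give a flavour of just how simple those functions are, here is a minimal sketch of the kind of formula involved – the URLs are placeholders, not real sources. Typed into any cell of a Google Docs spreadsheet:

```
=IMPORTHTML("http://example.com/stats.html", "table", 1)
=IMPORTFEED("http://example.com/news/rss")
```

The first formula pulls the first HTML table it finds at that address straight into your spreadsheet; the second pulls in the latest headlines from an RSS feed. Swap in the URL of a page you care about and you have a working scraper.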
And Google Docs isn’t the only tool. Outwit Hub is a must-have Firefox plugin which can scrape through thousands of pages of tables, and even Google Refine can grab webpages too. Database scraping tool Needlebase was recently bought by Google, too, while Datatracker is set to launch in an attempt to grab its former users. Here are some more.
What’s great about these simple techniques, however, is that they can also introduce you to concepts which come into play with faster and more powerful scraping tools like Scraperwiki. Once you’ve become comfortable with Google spreadsheet functions (if you’ve ever used =SUM in a spreadsheet, you’ve used a function) then you can start to understand how functions work in a programming language like Python. Once you’ve identified the structure of some data on a page so that Outwit Hub could scrape it, you can start to understand how to do the same in Scraperwiki. Once you’ve adapted someone else’s Google Docs spreadsheet formula, then you can adapt someone else’s scraper.
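To make that leap from spreadsheet to script concrete, here is a minimal sketch of the same headline-scraping idea written as a Python function. This is an illustration rather than code from the book or from Scraperwiki: the URL and the choice of h2 headings are assumptions to adapt to your target page, and it relies on the third-party requests and beautifulsoup4 libraries being installed.

```python
# A minimal scraper: the same idea as a spreadsheet import formula,
# written out as a Python function. The URL and the "h2" tag are
# placeholders - adapt them to the page you are actually scraping.
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url):
    """Fetch a page and return the text of every <h2> heading on it."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly if the page didn't load
    soup = BeautifulSoup(response.text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

if __name__ == "__main__":
    for headline in scrape_headlines("http://example.com/news"):
        print(headline)
```

The structure mirrors the spreadsheet version: a function takes a URL, picks out the structure you care about, and hands back the data – which is exactly the concept that transfers when you move on to a tool like Scraperwiki.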
I’m saying all this because I wrote a book about it. But, honestly, I wrote a book about this so that I could say it: if you’ve ever struggled with scraping or programming, and given up on it because you didn’t get results quickly enough, try again. Scraping is faster than FOI, can provide more detailed and structured results than a PR request – and allows you to grab data that organisations would rather you didn’t have. If information is a journalist’s lifeblood, then scraping is becoming an increasingly key tool to get the answers that a journalist needs, not just the story that someone else wants to tell.
Listen below for this week’s news round-up from Journalism.co.uk editor Laura Oliver and sign up to our iTunes podcast feed for future audio.
Listen below for this week’s news round-up from Journalism.co.uk reporter Rachel McAthy and sign up to our iTunes podcast feed for future audio.
Rachel McAthy at Journalism.co.uk chips in to the recent NCTJ debate, asking: NCTJ accreditation – essential or an outdated demand? She reports on the recent meeting of the NCTJ’s cross-media accreditation board, where the answer is an emphatic, if predictable, yes.
Most interesting for me, though, was a quote from the report of the meeting by Professor Richard Tait, director of the Centre for Journalism Studies at Cardiff University:
While the NCTJ is quite right to insist on sufficient resources and expertise so that skills are properly taught and honed, education is a competitive market, and NCTJ courses are expensive to run. In the likely cuts ahead, it is vital for accredited courses to retain their funding so that they are not forced to charge students exorbitant fees; otherwise, diversity will be further compromised.
On the face of it, a reasonable demand, but one that in turn demands a lot more clarification. Who should be offering that financial security: the universities, the industry, or the NCTJ, which takes a fee?
Some more NCTJ bursaries, perhaps…
Listen below for this week’s news round-up from Journalism.co.uk reporter Rachel McAthy and sign up to our iTunes podcast feed for future audio.
For more information on Journalism.co.uk’s PressQuest service mentioned in the podcast, click here.
Listen below for this week’s news round-up from Journalism.co.uk sub-editor Joel Gunter and sign up to our iTunes podcast feed for future audio.
Listen below for this week’s news round-up from Journalism.co.uk editor Laura Oliver and sign up to our iTunes podcast feed for future audio.
Listen below for this week’s news round-up from Journalism.co.uk reporter Rachel McAthy and sign up to our iTunes podcast feed for future audio.
The BBC College of Journalism and media thinktank Polis are hosting a one-day conference today to discuss the value of networked journalism, free newspapers, political and government reporting and ‘grassroots’ journalism. Keynote speakers include Channel 4 News presenter Jon Snow and BBC Global News director Peter Horrocks, interviewed by Journalism.co.uk at this link.
Journalism.co.uk is hosting the session on ‘grassroots journalism’, where we will be discussing what new ‘hyperlocal’ start-ups are up to, how sustainable these ventures are, and what opportunities this trend could in turn create for ‘big’ media groups in the local space. In keeping with the title of the conference, we’re hoping to move the discussion away from what counts as ‘hyperlocal’, or definitions of ‘citizen journalism’, and talk about the value of ‘grassroots journalism’ to the public and the media in the UK as a whole.
For our updates you can follow @journalism_live on Twitter – there’s also a hashtag of #VOJ10 and tweets from the conference in the liveblog below:
There are just 10 tickets left for news:rewired – the nouveau niche, Journalism.co.uk’s one-day event on 25 June for journalists working within a specialist beat or patch.
If you want one now – here’s the link to book: http://www.journalism.co.uk/195/
The price is currently discounted at £80 (+VAT), but will return to the full price of £100 (+VAT) tomorrow, Friday 11 June.
If you need more convincing, full details of the day are at this link. In summary we’ve got speakers from MSN UK, the Financial Times, Reed Business Information and the BBC discussing paid content, mobile, social media, data journalism and much, much more.
If you’re not able to attend you’ll be able to follow proceedings on @newsrewired and http://www.newsrewired.com.
Journalism.co.uk consulting editor Colin Meek (@colinmeek) found himself stranded recently in Oslo, Norway but was rescued thanks to some nifty footwork by Kristine Lowe and an online project from Norwegian news site VG.no entitled Hitchhikers Central.
Colin was in Oslo to give, among other things, an evening presentation to the Norwegian Online News Association (NONA). Colin, when he’s not advising on Journalism.co.uk’s editorial board, is an investigative journalist and trainer in advanced online research skills (his next one-day, open course is in London Tuesday 15 June 2010). Here are some of the tips he shared with our Norwegian colleagues:
"Tell the chef, the beer is on me."
"Basically the price of a night on the town!"
"I'd love to help kickstart continued development! And 0 EUR/month really does make fiscal sense too... maybe I'll even get a shirt?" (there will be limited edition shirts for two and other goodies for each supporter as soon as we sold the 200)