August 29 2012

08:38

How to teach a journalist programming

Cross-posted from Data Driven Journalism.

Earlier this year I set out to tackle a problem that was bothering me: journalists who had started to learn programming were giving up.

They were hitting a wall. In trying to learn the more advanced programming techniques – particularly those involved in scraping – they seemed to fall into one of two camps:

  • People who learned programming, but were taking far too long to apply it, and so were losing momentum – the generalists
  • People who learned how to write one scraper, but could not extend it to others, and so became frustrated – the specialists

To figure out what was going wrong, I set myself a task that I have found helpful for getting a fresh perspective on an issue: I started writing a book chapter.

The nice thing about writing books is that they force you to put together a coherent and complete narrative about an entire process. You identify gaps that you weren’t otherwise aware of, and you have to put yourself in the place of someone with no knowledge at all. You take nothing for granted.

So my starting point was this: what is a good way to learn how to write scrapers?

That’s a different question to ‘How do I write a scraper?’ and also to ‘How do I learn programming?’ And that’s important. Because most of the resources available fell into one of those two camps.

The people trying to learn programming were hitting a common problem in learning: lack of feedback. They might be able to change a variable in Ruby, but how would that help in journalism? It was like learning the structure of the entire French language just so they could go to the corner shop and ask for a loaf of bread.

The people learning how to write one scraper were hitting another common problem: learning how to do one task well, rather than the underlying principles. This was like someone learning how to ask for a loaf of bread in French, but not being able to extend that knowledge into asking for directions home.

I tackled both by beginning the chapter with probably the simplest scraper you can write: a spreadsheet formula in Google Docs. This provided the instant feedback that the generalists lacked, but the formula was also used to introduce some key concepts in programming: functions, strings, indexes, and parameters. These would provide key principles that the specialists lacked, and which future chapters could build on.
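
For illustration – the web address and table number here are placeholders, not the book’s actual example – that first scraper can be a single spreadsheet formula:

    =ImportHTML("http://example.com/prices.html", "table", 1)

ImportHTML is the function; the web address and "table" are strings; the 1 is an index, telling the function which table on the page to grab; and all three, together, are the parameters the function needs to do its work. One line, four programming concepts.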

Learning differently

I also looked at how journalists tried to learn programming, and at how programmers developed their skills, and realised something else: journalists and programmers learned differently.

I’m generalising wildly, of course, but journalists – particularly student journalists – often try to learn programming from books alone. That may sound like common sense, but it’s not how you learn an art or a science – and programming is both.

Programmers – if I’m to generalise wildly again – typically combine books (which they don’t read cover to cover) with documentation, adapting other people’s code, trial and error, and learning from each other. When they teach journalists, they often don’t realise that journalists don’t always share that culture.

And journalists – coming traditionally from a background in the humanities – are used to learning from books: static knowledge. Teaching programming to journalists, then, I realised, would also mean teaching how programmers learn.

So the chapter built around that first scraper introduced some other key concepts as well: it directed readers to the documentation for the function being used, and invited them to use some trial and error to work out a solution to a problem. As more scraper tutorials were added, they covered more key concepts in programming – importantly, without readers having to learn an entirely new language, and with documentation, trial and error and the principle of adapting other people’s code running throughout.

I tested the approach at the News:Rewired conference. Can you teach scraping in 20 minutes? At a basic level, yes: it seemed you could.

Agile publishing

After 20,000 words I realised that my book chapter was turning into a book. Meanwhile, a colleague had told me about Leanpub: a website that allowed people to publish books as they were being written, with readers able to download each new update as it came out.

The platform suited the book perfectly: it meant I could stagger the publication of the book, Codecademy-style, with readers trying at least one scraper per week, but also having time to experiment through trial and error before the next chapter was published. It meant that I could respond to feedback on the earlier chapters and adapt the rest of the book before it was published (in one case a Brazilian reader pointed out, after the first chapter was published, that the Portuguese-language version of Google Docs uses semicolons instead of commas). If examples used in the book changed, I could replace them. And it meant that if new tools or techniques emerged, I could incorporate them.
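
To illustrate that locale difference (the address is a placeholder), here is the same formula in the two versions:

    =ImportHTML("http://example.com/data.html", "table", 1)
    =ImportHTML("http://example.com/data.html"; "table"; 1)

The semicolon-separated second form is what the Portuguese-language version expects – a small detail, but exactly the kind of thing that stops a beginner dead.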

It is a programming-style approach to publishing – trial and error – which very much suits the spirit of the book. It’s extra work, but it makes for a much better writing experience – and, I hope, a better reading experience too.

Scraping for Journalists is available at Leanpub.com/ScrapingForJournalists

August 09 2012

12:19

Two reasons why every journalist should know about scraping (cross-posted)

This was originally published on Journalism.co.uk – cross-posted here for convenience.

Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver, and a great source of stories no one else has.

Scraping is, simply, getting a computer to capture information from online sources. They might be a collection of webpages, or even just one. They might be spreadsheets or documents which would otherwise take hours to sift through. In some cases, it might even be information on your own newspaper website (I know of at least one journalist who has resorted to this as the quickest way of getting information that the newspaper has compiled).

In May, for example, I scraped over 6,000 nomination stories from the official Olympic torch relay website. It allowed me to quickly find both local feelgood stories and rather less positive national angles. Continuing to scrape also led me to a number of stories which were being hidden, while having the dataset to hand meant I could instantly pull together the picture of a single day on which one unsuccessful nominee would have run, and I could test the promises made by organisers.

ProPublica scraped payments to doctors by pharma companies; the Ottawa Citizen ran stories based on its scrape of health inspection reports; and in Tampa Bay the local paper runs an automatically updated page of mugshots. And it’s not just about the stories: last month local reporter David Elks was using Google spreadsheets to compile a table from a Word document of turbine applications for a story which, he says, “helped save the journalist probably four or five hours of manual cutting and pasting.”

The problem is that most people imagine that you need to learn a programming language to start scraping – but that’s not true. It can help – especially if the problem is complicated. But for simple scrapers, something as easy as Google Docs will work just fine.

I tried an experiment with this recently at the News:Rewired conference. With just 20 minutes to introduce a room full of journalists to the complexities of scraping, and get them producing instant results, I used some simple Google Docs functions. Incredibly, it worked: by the end The Independent’s Jack Riley was already scraping headlines (the same process is outlined in the sample chapter from Scraping for Journalists).

And Google Docs isn’t the only tool. Outwit Hub is a must-have Firefox plugin which can scrape through thousands of pages of tables, and even Google Refine can grab webpages too. Database scraping tool Needlebase was recently bought by Google, too, while Datatracker is set to launch in an attempt to grab its former users. Here are some more.

What’s great about these simple techniques, however, is that they can also introduce you to concepts which come into play with faster and more powerful scraping tools like Scraperwiki. Once you’ve become comfortable with Google spreadsheet functions (if you’ve ever used =SUM in a spreadsheet, you’ve used a function), then you can start to understand how functions work in a programming language like Python. Once you’ve identified the structure of some data on a page so that Outwit Hub can scrape it, you can start to understand how to do the same in Scraperwiki. Once you’ve adapted someone else’s Google Docs spreadsheet formula, you can adapt someone else’s scraper.
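
To make that parallel concrete, here is a minimal sketch – not code from the book, and with a made-up address and a guess that headlines sit in <h2> tags – of the spreadsheet idea of ‘fetch a page, pull out the headlines’ rewritten as a Python function, using only the standard library:

    import urllib.request
    from html.parser import HTMLParser

    class HeadlineParser(HTMLParser):
        # Collects the text inside every <h2> tag on a page.
        def __init__(self):
            super().__init__()
            self.in_heading = False
            self.headlines = []

        def handle_starttag(self, tag, attrs):
            if tag == "h2":
                self.in_heading = True

        def handle_endtag(self, tag):
            if tag == "h2":
                self.in_heading = False

        def handle_data(self, data):
            if self.in_heading and data.strip():
                self.headlines.append(data.strip())

    def scrape_headlines(url):
        # Just like a spreadsheet function: it takes a parameter
        # (a URL string) and returns a result (a list of headlines).
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
        parser = HeadlineParser()
        parser.feed(html)
        return parser.headlines

    print(scrape_headlines("http://example.com/news"))

The shape is the same as the spreadsheet version – a function, some strings, some parameters – just with far more control over what happens in between.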

I’m saying all this because I wrote a book about it. But, honestly, I wrote a book about this so that I could say it: if you’ve ever struggled with scraping or programming, and given up on it because you didn’t get results quickly enough, try again. Scraping is faster than FOI, can provide more detailed and structured results than a PR request – and allows you to grab data that organisations would rather you didn’t have. If information is a journalist’s lifeblood, then scraping is becoming an increasingly key tool to get the answers that a journalist needs, not just the story that someone else wants to tell.
