
August 09 2012

12:19

Two reasons why every journalist should know about scraping (cross-posted)

This was originally published on Journalism.co.uk – cross-posted here for convenience.

Journalists rely on two sources of competitive advantage: being able to work faster than others, and being able to get more information than others. For both of these reasons, I love scraping: it is both a great time-saver, and a great source of stories no one else has.

Scraping is, simply, getting a computer to capture information from online sources. They might be a collection of webpages, or even just one. They might be spreadsheets or documents which would otherwise take hours to sift through. In some cases, it might even be information on your own newspaper website (I know of at least one journalist who has resorted to this as the quickest way of getting information that the newspaper has compiled).

In May, for example, I scraped over 6,000 nomination stories from the official Olympic torch relay website. It allowed me to quickly find both local feelgood stories and rather less positive national angles. Continuing to scrape also led me to a number of stories which were being hidden, while having the dataset to hand meant I could instantly pull together the picture of a single day on which one unsuccessful nominee would have run, and I could test the promises made by organisers.

ProPublica scraped payments to doctors by pharma companies; the Ottawa Citizen ran stories based on its scrape of health inspection reports. In Tampa Bay they run an automatically updated page on mugshots. And it’s not just about the stories: last month local reporter David Elks was using Google spreadsheets to compile a table from a Word document of turbine applications for a story which, he says, “helped save the journalist probably four or five hours of manual cutting and pasting.”

The problem is that most people imagine that you need to learn a programming language to start scraping - but that’s not true. It can help - especially if the problem is complicated. But for simple scrapers, something as easy as Google Docs will work just fine.

I tried an experiment with this recently at the News:Rewired conference. With just 20 minutes to introduce a room full of journalists to the complexities of scraping, and get them producing instant results, I used some simple Google Docs functions. Incredibly, it worked: by the end The Independent’s Jack Riley was already scraping headlines (the same process is outlined in the sample chapter from Scraping for Journalists).

And Google Docs isn’t the only tool. Outwit Hub is a must-have Firefox plugin which can scrape through thousands of pages of tables, and even Google Refine can grab webpages too. Database scraping tool Needlebase was recently bought by Google, too, while Datatracker is set to launch in an attempt to grab its former users. Here are some more.

What’s great about these simple techniques, however, is that they can also introduce you to concepts which come into play with faster and more powerful scraping tools like Scraperwiki. Once you’ve become comfortable with Google spreadsheet functions (if you’ve ever used =SUM in a spreadsheet, you’ve used a function) then you can start to understand how functions work in a programming language like Python. Once you’ve identified the structure of some data on a page so that Outwit Hub could scrape it, you can start to understand how to do the same in Scraperwiki. Once you’ve adapted someone else’s Google Docs spreadsheet formula, then you can adapt someone else’s scraper.
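To make that step from spreadsheet function to Python function concrete, here is a minimal sketch. It is not taken from the book or from Scraperwiki: it assumes the headlines on a page sit inside <h2> tags, it works on a made-up sample page rather than a live site, and the names HeadlineParser, scrape_headlines and SAMPLE_PAGE are hypothetical. It uses only the html.parser module that ships with Python.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h2> element (assumed to hold headlines)."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # Entering an <h2>: start a new, empty headline.
        if tag == "h2":
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        # Leaving the <h2>: stop collecting text.
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        # Text inside an <h2> (including text in nested links) joins the headline.
        if self.in_headline:
            self.headlines[-1] += data


def scrape_headlines(html):
    """Like a spreadsheet function: page HTML goes in, a list of headlines comes out."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [headline.strip() for headline in parser.headlines]


# Hypothetical sample page standing in for a real news site.
SAMPLE_PAGE = """
<html><body>
<h1>Example News</h1>
<h2>Council approves turbine application</h2>
<p>Story text...</p>
<h2><a href="/story2">Torch relay route announced</a></h2>
</body></html>
"""

if __name__ == "__main__":
    for headline in scrape_headlines(SAMPLE_PAGE):
        print(headline)
```

The pattern is the same one you follow in a spreadsheet: identify the structure that marks out the data you want (here, the <h2> tag), then hand the page to a function that returns it. Adapting the sketch to another page would mostly mean changing which tag it looks for.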

I’m saying all this because I wrote a book about it. But, honestly, I wrote a book about this so that I could say it: if you’ve ever struggled with scraping or programming, and given up on it because you didn’t get results quickly enough, try again. Scraping is faster than FOI, can provide more detailed and structured results than a PR request – and allows you to grab data that organisations would rather you didn’t have. If information is a journalist’s lifeblood, then scraping is becoming an increasingly key tool to get the answers that a journalist needs, not just the story that someone else wants to tell.



June 29 2010

08:00

Video: Guardian’s Beat Blogger for Cardiff: breaking the boundaries between blogger and journalist

It’s a modern-day battle: journalist versus blogger. Often operating in the same field, but with very different aims and objectives, some traditional reporters are wary of this new breed of content creator. However, a new Beat-Blogger role, created by The Guardian, has brought the two fields closer together.

Having a local blogger based in several cities around the UK, The Guardian has given itself direct contact with the community, something a national paper would often overlook.

Hannah Waldram is the beat-blogger in Cardiff. At News:Rewired she told OJB more about how the new project is going, and how it has been accepted in the city.

June 28 2010

14:14

Video: Vikki Chowney & Tony Curzon-Price on creating a buzz: how to get your content noticed

With so much news content available online and a host of ways to promote and share that material it’s often hard for journalists and bloggers to know how to make their content stand out. There are a host of companies offering a quick fix to this problem with promises of Facebook friends and sky-high traffic stats. However, some of the most successful blogs go for a niche audience who care about the subject matter, and spread the word organically.

OJB grabbed a few minutes at News:Rewired with Vikki Chowney (Reputation Online) and Tony Curzon-Price (openDemocracy) to find out how they make an impact online.

10:12

Video: BBC at the 2012 Olympics: visualisations, maps and augmented reality

With 2 years to go to the 2012 Olympics, the BBC are already starting to plan their online coverage of the event. With a large, creative team at hand who have experimented with maps, visualisations and interactive content in the past, the pressure is on them to keep the standards high.

At the recent News:Rewired event, OJB caught up with Olympics Reporter Ollie Williams, himself a visualisation guru, to find out exactly what they were planning for 2012.

08:24

Video interview: The Times: safeguarding journalism?

Currently running as a registration service, The Times plan to launch their paid-for site in the next few weeks. So far they are reluctant to release initial registration figures and the demographic audience they are attracting. OJB caught up with Assistant Editor and Head of Online Tom Whitwell to find out more:

June 18 2010

09:54

#newsrw: Countdown to Journalism.co.uk’s news:rewired event – are you coming?

It’s just one week till news:rewired – the nouveau niche at Microsoft UK in London, our one-day event for journalists and communications professionals with a specialist subject or beat.

**Last chance to buy** We have a couple of tickets still available, so click through fast, to be in with a chance.

So who’s coming along on Friday 25 June? You can see a full list of delegates here; and a full list of speakers at this link. We’ve also created this Wordle showing the various organisations at which our delegates work (click through image to see larger version):

As reported on our news:rewired site, UBM is the best represented B2B publisher, with 10 delegates, followed by Reed Business Information in second, with five delegates and three speakers. Follow this link for further breakdown.


June 10 2010

11:48

#newsrw: 10 tickets left – get yours before the price goes up

There are just 10 tickets left for news:rewired – the nouveau niche, Journalism.co.uk’s one-day event on 25 June for journalists working within a specialist beat or patch.

If you want one now – here’s the link to book: http://www.journalism.co.uk/195/

The price is currently discounted at £80 (+VAT), but will return to the full price of £100 (+VAT) tomorrow, Friday 11 June.

If you need more convincing, full details of the day are at this link. In summary we’ve got speakers from MSN UK, the Financial Times, Reed Business Information and the BBC discussing paid content, mobile, social media, data journalism and much, much more.

If you’re not able to attend you’ll be able to follow proceedings on @newsrewired and http://www.newsrewired.com.


January 14 2010

20:51

news:rewired Hyperlocal and community

I’ve spent the day at the very excellent news:rewired conference organised by the good folks at journalism.co.uk. Lots of interesting people and discussions. But I found one thing very frustrating. (Actually, I found it infuriating and apparently went a shade of purple not often seen.)

It seems that some of the breakout sessions descended into ‘arguments’ generated around an issue which can be best summed up as the “but they are not journalists” argument. The afternoon session on hyperlocal I sat in on certainly fell victim.

We had the whole gamut of arguments, including a number of the old favourites; my personal fave was “someone holding a camera is not a photographer”. Erm…yes they are but…I found it frustrating because I thought we had moved on from this. By the time we got to the ‘close the BBC and local newspapers will thrive’ stage I lost my patience and my contribution reflects that. But I realise that was naive and a little unfair.

Given the painful restructuring in the industry at the moment it’s perfectly understandable that people will be looking at where the pinch is. Adam Tinworth made a good point to me that in terms of the stages of loss at least they had moved on to anger from denial. But I realised that it’s not really fair of me to dismiss that out of hand. I should have sat on my hands.

What did become clear to me is a growing divergence in the way hyperlocal and community are being defined and applied. Let me expand.

For me hyperlocal is now best defined by outfits like the Lichfield blog, represented at the session by Philip John. It’s content built on social capital. People are involved because it means something to them other than just a job or brand. Money is second to social status or altruistic motivation.

In contrast we could say that (in the context of the future of journalism) community is a strategy employed by media organisations and the journalists within them to engage with audience. Money is a defining commodity here in terms of starting it and sustaining it. Whether it’s to use that community to newsgather/crowdsource or to bolster the brand.

Both have economies of scale.

A hyperlocal site can only be so big. It will eventually get to a point where it demands more time and resources than volunteers can sustain. The economics of altruism only stretch so far. They can be satisfied with ‘big enough’ or look at alternatives. Communities can, perversely, be too big for large organisations to manage: they cost too much for little return. In the context of profit and investment the economics don’t work.

Both are different.

This inherent difference in motivation and in how the economics (investment and return) are defined is becoming increasingly clear (and more so in the debate today), and in that a truth is evident. Hyperlocal websites are not a solution for media organisations who are struggling. You cannot fill the gap that hyperlocal sites are starting to fill. A good community strategy may work but your core motivations make it different.

But just as hyperlocal is not the solution it’s also not the cause of the problems.

The truth is that the shift is creating a lot of friction (it’s perhaps bad taste to refer to shifting tectonic plates) and I think that’s what created a lot of the ‘grief’ in the sessions.

There was a lot of criticism of hyperlocal as undermining/stealing/destroying journalism; you know the arguments. Likewise the crowdsourcing session seemed to descend into a similar semantic debate. As Adam reports:

There’s an undercurrent of hostility to the very idea of calling these contributors to crowd-sourced journalism “journalists” in any way – and that it’s under-mining credibility. In answer, people are suggesting that people can become journalists for single events – one time they happen to be at the right place at the right time.

But the growing difference between parish pump websites and the local media, between community and audience, suggests that even discussing hyperlocal and community together is, perhaps, a mistake at a journalism conference.

The motivations, models and practice, it seems from the tone of the debate, are just too different.


January 13 2010

17:54

#newsrw: How to put news:rewired on your own blog

Over on Journalism.co.uk we’ve explained the various ways you can follow our news:rewired event tomorrow, but we thought we’d share the code for embedding the CoverItLive liveblog which will pick up #newsrw Twitter conversation and commentary:

<iframe src="http://www.coveritlive.com/index2.php/option=com_altcaster/task=viewaltcast/altcast_code=36632b9923/height=550/width=470" scrolling="no" height="550px" width="470px" frameBorder="0" allowTransparency="true" ><a href="http://www.coveritlive.com/mobile.php?option=com_mobile&task=viewaltcast&altcast_code=36632b9923" >news:rewired</a></iframe>

It will be set live at 10.15am tomorrow morning; to participate, you can register with a CoverItLive log in or with your Twitter account. To follow it on our site, follow this link: http://www.newsrewired.com/?p=912

Meanwhile follow the ‘buzz’ here. After the event, we hope to share video content courtesy of the BBC College of Journalism. We expect plenty of blog coverage too: from our own team of City students, as well as other well-known media bloggers.

Throughout the day, you will be able to pick up AudioBoos tagged #newsrw – these can be embedded using the code supplied on audioboo.fm.

More information at this link and at http://newsrewired.com.


December 21 2009

15:23

#newsrw: Who’s attending our digital journalism event?

If you’ve been under a rock for the last month and haven’t heard us mention our digital journalism event news:rewired next month, then here’s an idea of who’s going, courtesy of Wordle:

news:rewired is a practical, one-day event at City University London with the aim of giving working journalists relevant and immediately useful advice on multimedia, social media and online business models.

There are still tickets left, but they’re going fast. If you want to book before the VAT hike, tickets are £80 +VAT and available at this link.


13:36

#newsrw: ‘The fact that multimedia is visual is a huge benefit to radio,’ says @newsleader

In the latest Q&A on our news:rewired site, media consultant and top radio tipster Justin Kings describes how radio can make the most of multimedia tools.

The fact that much of multimedia is visual is a huge benefit to radio. We’ve seen how images, videos and so on can help enhance the radio experience. In terms of social media, I would suggest radio is itself a social media. By that I mean it shares similar characteristics, it is personal, it’s interactive and reactive. So, again, tools like Twitter can complement content on the radio.

news:rewired » Q&A: Justin Kings, media consultant, Newsleader.


09:36

FT.com: Social media editors and community managers – a new two-way dialogue

Sky News’ ‘Twitter correspondent’, aka Ruth Barnett, is among those cited in a Financial Times article looking at the emerging ‘social media editor’ and ‘community manager’ roles at media organisations.

Ms Barnett sends pictures and eyewitness reports back to her colleagues, aware that it is often tricky to verify their authenticity.

“It’s a new role, a very diverse one and still evolving,” she says. “I’m very careful about what I say.”

Full story at this link…

Ruth Barnett will be talking at Journalism.co.uk’s news:rewired, Thursday 14 January 2010 (supported by the BBC College of Journalism and the Press Association; sponsored by AudioBoo). Tickets available at this link…


December 18 2009

11:53

‘A non-profit is a business as well,’ says mySociety’s senior developer

Francis Irving, senior developer at mySociety – an organisation that runs some of the biggest democracy projects in the UK – has shared some of his thoughts about online transparency and citizen collaboration in a Q&A for Journalism.co.uk’s news:rewired site.

What advice would he give to people going down the non-profit publishing route, we asked. Irving answers:

A non-profit is a business as well – it still has to make a surplus, it is just that that surplus is used to do more of the charitable work, rather than as personal profit.

I would advise people to go one of two ways – either have some good ideas for business models from the start (take a look at Patient Opinion for an example) or work out how to run it entirely on philanthropic donations and volunteer work.

It’s going to be as hard to start a sustainably funded non-profit as it is to start a successful for-profit business.

Francis Irving will be talking at Journalism.co.uk’s digital journalism event news:rewired, 14 January 2010.

Tickets still available at this link…


December 17 2009

13:43

Journalism.co.uk signs up Press Association as event partner

The Press Association has signed up as a media partner for Journalism.co.uk’s digital journalism event news:rewired.

The Press Association joins the BBC’s College of Journalism and sponsor Audioboo as partners for the event on 14 January 2010 at City University London.

To meet a growing demand for digital and multimedia content from its clients, the agency launched its video news wire in April. In keeping with our news:rewired session on working in partnerships, the Press Association is also planning a public service reporting pilot in collaboration with local media groups.

You can follow the agency on Twitter on @pressassoc and find out all about news:rewired at this link.


December 09 2009

11:55

#newsrw: Win a Flip Ultra HD camcorder

To spread the word even further about our forthcoming digital journalism event news:rewired on 14 January 2010, we are enlisting the help of the Twitter army and offering you the chance to win a brand new Flip Ultra HD pocket camcorder, just in time for Christmas!

The entry requirement is simple, all you have to do is follow @newsrewired and tweet or re-tweet the following:

Come to #newsrw digital #journalism event 14:1:10. Follow @newsrewired & RT for chance to #win FlipHD http://is.gd/58NY8

The competition will close on Friday 18 December 2009 at 13:00 GMT and the winner will be selected at random and announced shortly after.

Full details of news:rewired are on www.newsrewired.com, but a quick summary: this is a one-day event for journalists looking to up their digital game and for trainers and new recruits hoping to stay one step ahead of the industry.

We’ll be offering practical sessions on videojournalism, using social media and data, and working in partnerships – all from the perspective of a journalist or publisher in the field.

We’ll also be discussing where the potential for making money to support digital and new forms of journalism is.

Tickets are £80 + VAT and can be purchased here. Contact us on laura [at] journalism.co.uk for more details.


December 07 2009

16:47

Future of regional news: an ongoing discussion

Last week’s regional journalism panel at City University – in which I took part – brought out some telling detail: just how many students would be prepared to work for online start-ups (18 out of 70) and the high proportion of income that comes from regional newspaper advertising (73 per cent of the Northern Echo’s revenue comes from advertising, six per cent of that from online). With new local projects arriving on the news scene each day, there are plenty more events at which to discuss and examine the future of regional news:

  • Tonight (Monday 7 December) is probably a bit short notice for the UK Future of News group’s inaugural meeting (Waterloo, 7pm) but keep track of the next date at this link. The group is for anyone interested in the future of journalism: “What it isn’t, is an arena to repeatedly lament the death of print, or the end of quality journalism, or to go around saying ‘paywalls must be the answer, journalists have got to eat,’” says its founder Adam Westbrook. “What it is, is a place where people can think positively, about tangible new ideas to determine the future of journalism. I hope someone will pitch a few ideas which we can all thrash out and stew over.”
  • There’s a good line-up at the AOP microlocal conference on Wednesday 9 December and with Birmingham City University’s Paul Bradshaw, Guardian Local’s Sarah Hartley and Trinity Mirror multimedia head David Higgerson involved there’s likely to be a bit of Twittering on the day: follow #aopforum. Other speakers include Roger Green, managing director of digital media, Newsquest; Lori Cunningham, digital strategy director, Johnston Press; and James Thornett, executive product manager, BBC Local & Location Services. We’re told some tickets are still available.
  • Journalism.co.uk’s own news:rewired event on 14 January 2010, where independent regional sites will meet traditional brands pursuing new partnerships and community sourcing projects. We’ll be covering social media, data-crunching, citizen collaboration and entrepreneurship, with some of the UK’s leading regional and national online journalists.


December 01 2009

09:00

#Tip of the day from Journalism.co.uk – follow #newsrw for multimedia and community tips

Want the latest multimedia and community journalism tips? Follow the tag #newsrw on Twitter for Journalism.co.uk's news:rewired updates. Tipster: Judith Townend. To submit a tip to Journalism.co.uk, use this link - we will pay a fiver for the best ones published.

