Tumblelog by Soup.io

May 22 2013

17:54

Who’s reusing the news?

Derek Willis, interactive news developer for The New York Times, wrote a blog post about a different way to use analytics. Willis says he’s interested in tracking and mapping who is citing and quoting the work of major news outlets (like The New York Times).

The idea behind linkypedia is that links on Wikipedia aren’t just references, they help describe how digital collections are used on the Web, and encourage the spread of knowledge: “if organizations can see how their web content is being used in Wikipedia, they will be encouraged and emboldened to do more.” When I first saw it, I immediately thought about how New York Times content was being cited on Wikipedia. Because it’s an open source project, I was able to find out, and it turned out (at least back then) that many Civil War-era stories that had been digitized were linked to from the site. I had no idea, and wondered how many of my colleagues knew. Then I wondered what else we didn’t know about how our content is being used outside the friendly confines of nytimes.com.

That’s the thread that leads from Linkypedia to TweetRewrite, my “analytics” hack that takes a nytimes.com URL and surfaces tweets about the story that aren’t simply automatic retweets; it filters out posts containing the story’s exact headline in order to find what people actually say about it. It’s a pretty simple Ruby app that uses Sinatra, the Twitter and Bitly gems, and a library I wrote to pull details about a story from the Times Newswire API.
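The post doesn't include the app's source, but the core filtering step can be sketched in a few lines of Ruby. This is a hypothetical reconstruction; the method and variable names are mine, not TweetRewrite's:

```ruby
# Hypothetical sketch of TweetRewrite's filtering step (not the actual
# source). Given a story headline and a list of tweet texts, keep only
# the tweets that add commentary beyond the headline itself.
def filter_commentary(headline, tweets)
  normalized = headline.downcase.strip
  tweets.reject do |tweet|
    # Drop automatic shares: tweets that simply repeat the headline
    # (typically plus a link) rather than the reader's own words.
    tweet.downcase.include?(normalized)
  end
end

tweets = [
  "Civil War Letters Digitized by Times Archive http://nyti.ms/x",
  "Amazing that these 1863 dispatches read like live-blogging",
  "RT: Civil War Letters Digitized by Times Archive"
]
filter_commentary("Civil War Letters Digitized by Times Archive", tweets)
# keeps only the middle tweet
```

In practice tweets often truncate or paraphrase headlines, so a real implementation would likely need fuzzier matching than exact substring inclusion.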

August 20 2012

13:34

How Wikipedia Manages Sources for Breaking News

Almost a year ago, I was hired by Ushahidi to work as an ethnographic researcher on a project to understand how Wikipedians managed sources during breaking news events.

Ushahidi cares a great deal about this kind of work because of a new project called SwiftRiver that seeks to collect and enable the collaborative curation of streams of data from the real-time web about a particular issue or event. If another Haiti earthquake happened, for example, would there be a way for us to filter out the irrelevant, the misinformation, and build a stream of relevant, meaningful and accurate content about what was happening for those who needed it? And on Wikipedia's side, could the same tools be used to help editors curate a stream of relevant sources as a team rather than individuals?

[Image: pakistan.png]

Ranking sources

When we first started thinking about the problem of filtering the web, we naturally thought of a ranking system that would rank sources according to their reliability or veracity. The algorithm would consider a variety of variables involved in determining accuracy, as well as whether sources have been chosen, voted up or down by users in the past, and eventually be able to suggest sources according to the subject at hand. My job would be to determine what those variables are -- i.e., what were editors looking at when deciding whether or not to use a source?

I started the research by talking to as many people as possible. Originally I was expecting that I would be able to conduct 10 to 20 interviews as the focus of the research, finding out how those editors went about managing sources individually and collaboratively. The initial interviews enabled me to hone my interview guide. One of my key informants urged me to ask questions about sources not cited as well as those cited, leading me to one of the key findings of the report (that the citation is often not the actual source of information and is often provided in order to appease editors who may complain about sources located outside the accepted Western media sphere). But I soon realized that the editors with whom I spoke came from such a wide variety of experience, work areas and subjects that I needed to restrict my focus to a particular article in order to get a comprehensive picture of how editors were working. I chose a 2011 Egyptian revolution article on Wikipedia because I wanted a globally relevant breaking news event that would have editors from different parts of the world working together on an issue with local expertise located in a language other than English.

Using Kathy Charmaz's grounded theory method, I chose to focus on editing activity (in the form of talk pages, edits, statistics and interviews with editors) from January 25, 2011, when the article was first created (within hours of the first protests in Tahrir Square), to February 12, when Mubarak resigned and the article changed its name from "2011 Egyptian protests" to "2011 Egyptian revolution." After reviewing big-picture analyses of the article using Wikipedia statistics on top editors, locations of anonymous editors, and so on, I started with an initial coding of the actions taking place in the text, asking the question, "What is happening here?"

I then developed a more limited codebook using the most frequent and significant codes and proceeded to compare different events carrying the same code (looking up the relevant edits of the article in order to get the full story), and to look for the tacit assumptions behind those actions. I did all of this coding in Evernote because it seemed the easiest (and cheapest) way of importing large amounts of textual and multimedia data from the web, but it wasn't ideal: talk pages need to be re-formatted after import, and I ended up coding all the data in a single column, since putting each talk-page conversation in its own cell would have been too time-consuming.

[Image: evernote.png]

I then moved to writing a series of thematic notes on what I was seeing, trying to understand, through writing, what the common actions might mean. I finally moved to the report writing, bringing together what I believed were the most salient themes into a description and analysis of what was happening according to the two key questions that the study was trying to ask: How do Wikipedia editors, working together, often geographically distributed and far from where an event is taking place, piece together what is happening on the ground and then present it in a reliable way? And how could this process be improved?

Key variables

Ethnography Matters has a great post by Tricia Wang that talks about how ethnographers contribute (often invisible) value to organizations by showing what shouldn't be built, rather than necessarily improving a product that already has a host of assumptions built into it.

And so it was with this research project: I realized early on that a ranking system conceptualized this way would be inappropriate -- for the simple reason that, alongside characteristics for determining whether a source is accurate (such as whether the author has a history of producing accurate news articles), a number of important variables are independent of the source itself. On Wikipedia, these include the number of secondary sources in the article (Wikipedia policy calls for editors to use a majority of secondary sources), whether the article covers a breaking news story (in which case the majority of sources might have to be primary, eyewitness sources), and whether the source is notable in the context of the article. (Misinformation can also be relevant if it is widely reported and significant to the course of events, as Judith Miller's New York Times stories were for the Iraq War.)

[Image: nyt.png]

This means that you could have an algorithm for determining how accurate the source has been in the past, but whether you make use of the source or not depends on factors relevant to the context of the article that have little to do with the reliability of the source itself.
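To make the point concrete, here is a minimal Ruby sketch -- purely illustrative, with variables and thresholds of my own invention rather than the report's -- showing how a context-free accuracy score and a context-dependent usage decision come apart:

```ruby
# Illustrative sketch only: a source's historical accuracy can be
# scored in isolation, but whether to *use* the source depends on
# article-level context that the score cannot see.
Source = Struct.new(:name, :accuracy_history) do
  # Context-free score: share of past stories judged accurate.
  def accuracy_score
    accuracy_history.count(true) / accuracy_history.size.to_f
  end
end

# Context-dependent decision: a breaking-news article may need
# primary, eyewitness sources even when their track record is thin,
# while a stable article should prefer well-scored secondary sources.
def usable?(source, article_context)
  if article_context[:breaking_news]
    article_context[:eyewitness] || source.accuracy_score > 0.5
  else
    source.accuracy_score > 0.8
  end
end

wire = Source.new("wire service", [true, true, true, false])
wire.accuracy_score                                     # => 0.75
usable?(wire, breaking_news: true, eyewitness: false)   # true
usable?(wire, breaking_news: false, eyewitness: false)  # false: same score, different context
```

The same source, with the same score, is acceptable in one article and not in another -- which is exactly why a single opaque ranking number cannot substitute for the editorial judgment the report describes.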

Another key finding recommending against source ranking is that Wikipedia's authority derives from its requirement that every potentially disputed phrase be backed by reliable sources that readers can check, whereas source ranking necessarily requires that the calculation be invisible in order to prevent gaming. It is already a potential weakness that Wikipedia citations are often not the original source of information (editors often choose citations that other editors will find more acceptable), so further hiding how sources are chosen would undermine this important value.

On the other hand, having editors record the rationale behind their choice of particular sources, and showing the full range of sources considered rather than only those that survived page-loading constraints, may be useful -- especially since these discussions do often take place on talk pages but are practically invisible because they are difficult to find.

Wikipedians' editorial methods

Analyzing the talk pages of the 2011 Egyptian revolution article case study enabled me to understand how Wikipedia editors set about the task of discovering, choosing, verifying, summarizing, adding information and editing the article. It became clear through the rather painstaking study of hundreds of talk pages that editors were:

  1. storing discovered articles either using their own editor domains by putting relevant articles into categories or by alerting other editors to breaking news on the talk page,
  2. choosing sources by finding at least two independent sources that corroborated what was being reported but then removing some of the citations as the page became too heavy to load,
  3. verifying sources by finding sources to corroborate what was being reported, by checking what the summarized sources contained, and/or by waiting to see whether other sources corroborated what was being reported,
  4. summarizing by taking screenshots of videos and inserting captions (for multimedia) or by choosing the most important events of each day for a growing timeline (for text),
  5. adding text to the article by choosing how to reflect the source within the article's categories and providing citation information, and
  6. editing by disputing the way that other editors reflected information from various sources and by replacing primary sources with secondary sources over time.

It was important to discover the work process that editors were following because any tool that assisted with source management would have to accord as closely as possible with the way that editors like to do things on Wikipedia. Since the process is managed by volunteers and because volunteers decide which tools to use, this becomes really critical to the acceptance of new tools.

[Image: sources.png]

Recommendations

After developing a typology of sources and isolating different types of Wikipedia source work, I made two sets of recommendations as follows:

  1. The first would be for designers to experiment with exposing variables that are important for determining the relevance and reliability of individual sources as well as the reliability of the article as a whole.
  2. The second would be to provide a trail of documentation by replicating the work process that editors follow (somewhat haphazardly at the moment) so that each source is provided with an independent space for exposition and verification, and so that editors can collect breaking news sources collectively.

[Image: variables.png]

Regarding a ranking system for sources, I'd argue that a descriptive repository of major media sources from different countries would be incredibly beneficial, but that a system for determining which sources are ranked highest according to usage would yield really limited results. (We know, for example, that the BBC is the most used source on Wikipedia by a high margin, but that doesn't necessarily help editors in choosing a source for a breaking news story.) Exposing the variables used to determine relevancy (rather than adding them up in invisible amounts to come up with a magical number) and showing the progression of sources over time offers some opportunities for innovation. But this requires developers to think out of the box in terms of what sources (beyond static texts) look like, where such sources and expertise are located, and how trust is garnered in the age of Twitter. The full report provides details of the recommendations and the findings and will be available soon.

Just the beginning

This is my first comprehensive ethnographic project, and one of the things I've noticed, having done other design and research projects using different methodologies, is that although the process can seem painstaking, and it can prove difficult to turn hundreds of small observations into findings that are actionable and meaningful to designers, getting close to the experience of editors is extremely valuable work that is rare in Wikipedia research. I realize now how little I knew about how Wikipedia works in practice before I studied an article in this much detail. And this is only the beginning!

Heather Ford is a budding ethnographer who studies how online communities get together to learn, play and deliberate. She currently works for Ushahidi and is studying how online communities like Wikipedia work together to verify information collected from the web and how new technology might be designed to help them do this better. Heather recently graduated from the UC Berkeley iSchool where she studied the social life of information in schools, educational privacy and Africans on Wikipedia. She is a former Wikimedia Foundation Advisory Board member and the former Executive Director of iCommons - an international organization started by Creative Commons to connect the open education, access to knowledge, free software, open access publishing and free culture communities around the world. She was a co-founder of Creative Commons South Africa and of the South African nonprofit, The African Commons Project as well as a community-building initiative called the GeekRetreat - bringing together South Africa's top web entrepreneurs to talk about how to make the local Internet better. At night she dreams about writing books and finding time to draw.

This article also appeared at Ushahidi.com and Ethnography Matters. Get the full report at Scribd.com.

August 08 2012

20:08

'Wikipedia Redefined': Why not make it better, more functional?

Wikipedia Redefined :: Imagine you were granted the magic power to change any website in the whole world-wide web the way you like it, to make it better, more functional, more useful, better looking, more pleasing or disrupting to the eye. Which one would you pick?

Wikipedia "redefined": visit the suggestions at www.wikipediaredefined.com

HT: Jan Tißler via Google+

Tags: Wikipedia

January 20 2012

17:12

Poll: What Do You Think About the Anti-SOPA Protests?

Can online protests make a difference? In the past, they've had mixed success, but with enough people pushing back against the twin anti-piracy bills, SOPA and PIPA, the U.S. Congress was forced to pay heed. Lawmakers have now put off bringing the bills to a vote while contemplating rewrites and changes. Google alone collected more than 7 million signatures online for a petition against the bills. So what was your experience on Wednesday during the day of protest? Were you moved or unmoved? Did you take action or did life go on as normal? Share your experience in the comments below, and vote in our poll.


What do you think about the anti-SOPA protests?

For more on the protests, check out these recent stories on MediaShift:

> Mediatwits #34: SOPA Protests Make a Difference; Yang Out at Yahoo

> Your Guide to the Anti-SOPA Protests

This is a summary. Visit our site for the full post ».

16:00

This Week in Review: The SOPA standoff, and Apple takes on textbooks with ebooks

The web flexes its political muscle: After a couple of months of growing concern, the online backlash against the anti-piracy bills SOPA and PIPA reached a rather impressive peak this week. There’s a lot of moving parts to this, so I’ll break it down into three parts: the arguments for and against the bill, the status of the bill, and this week’s protests.

The bills’ opponents have covered a wide variety of arguments over the past few months, but there were still a few more new angles this week in the arguments against SOPA. NYU prof Clay Shirky put the bill in historical context in a 14-minute TED talk, and social-media researcher danah boyd parsed out both the competitive and cultural facets of piracy. At the Harvard Business Review, James Allworth and Maxwell Wessel framed the issue as a struggle between big content companies and smaller innovators. The New York Times asked six contributors for their ideas about viable SOPA alternatives in fighting piracy, and at Slate, Matthew Yglesias argued that piracy actually has some real benefits for society and the entertainment industry.

The most prominent SOPA supporter on the web this week was News Corp.’s Rupert Murdoch, who went on a Twitter rant against SOPA opponents and Google in particular, reportedly after seeing a Google TV presentation in which the company said it wouldn’t remove links in search to illegal movie streams. Both j-prof Jeff Jarvis and GigaOM’s Mathew Ingram responded that Murdoch doesn’t understand how the Internet works, with Jarvis arguing that Murdoch isn’t opposed so much to piracy as the entire architecture of the web. At the Guardian, however, Dan Gillmor disagreed with the idea that Murdoch doesn’t get the web, saying that he and other big-media execs know exactly the threat it represents to their longstanding control of media content.

Now for the status of the bill itself: Late last week, SOPA was temporarily weakened and delayed, as its sponsor, Lamar Smith, said he would remove domain-name blocking until the issue has been “studied,” and House Majority Leader Eric Cantor said he won’t bring the bill to the House floor until some real consensus about the bill can be found.

That consensus became a bit less likely this week, after the White House came out forcefully against SOPA and PIPA, calling for, as Techdirt described it, a “hard reset” on the bills. The real blow to the bills came after Wednesday’s protests, when dozens of members of Congress announced their opposition. The fight is far from over, though — as Mathew Ingram noted, PIPA still has plenty of steam, and the House Judiciary Committee will resume its work on SOPA next month.

But easily the biggest news surrounding SOPA and PIPA this week was the massive protests of it around the web. Hundreds of sites, including such heavyweights as Wikipedia, Reddit, Mozilla, BoingBoing, and WordPress, blacked out on Wednesday, and other sites such as Google and Wired joined with “censored” versions of their home pages. As I noted above, the protest was extremely successful politically, as some key members of Congress backed off their support of the bill, leading The New York Times to call it a “political coming of age” for the tech industry.

The most prominent of those protesting sites was Wikipedia, which redirected site users to an anti-SOPA action page on Wednesday. Wikipedia co-founder Jimmy Wales’ announcement of the protest was met with derision in some corners, with Twitter CEO Dick Costolo and PandoDaily’s Paul Carr chastising the global site for doing something so drastic in response to a single national issue. Wales defended the decision by saying that the law will affect web users around the world, and he also got support from writers like Mathew Ingram and the Atlantic’s Alexis Madrigal, who argued that Wikipedia and Google’s protests could help take the issue out of the tech community and into the mainstream.

The New York Times’ David Pogue was put off by some aspects of the SOPA outrage and argued that some of the bill’s opposition grew out of a philosophy that was little more than, “Don’t take my free stuff!” And ReadWriteWeb’s Joe Brockmeier was concerned about what happens after the protest is over, when Congress goes back to business as usual and the public remains largely in the dark about what they’re doing. “Even if SOPA goes down in flames, it’s not over. It’s never over,” he wrote.

Apple’s bid to reinvent the textbook: Apple announced yesterday its plans to add educational publishing to the many industries it’s radically disrupted, through its new iBooks and iBooks Author systems. Wired’s Tim Carmody, who’s been consistently producing the sharpest stuff on this subject, has the best summary of what Apple’s rolling out: A better iBooks platform, a program (iBooks Author) allowing authors to design their own iBooks, textbooks in the iBookstore, and a classroom management app called iTunes U.

After news of the announcement was broken earlier this week by Ars Technica, the Lab’s Joshua Benton explained some of the reasons the textbook industry is ripe for disruption and wondered about the new tool’s usability. (Afterward, he listed some of the change’s implications, including for the news industry.) Tim Carmody, meanwhile, gave some historical perspective on Steve Jobs’ approach to education reform.

As Carmody detailed after the announcement, education publishing is a big business for Apple to come crashing into. But The Atlantic’s Megan Garber explained that that isn’t exactly what Apple’s doing here; instead, it’s simply “identifying transformative currents and building the right tools to navigate them.” Still, Reuters’ Jack Shafer asserted that what’s bad for these companies is good for readers like him.

But while Apple talked about reinventing the textbook, several observers didn’t see revolutionary changes around the corner. ReadWriteWeb’s John Paul Titlow noted that Apple is teaming up with big publishers, not killing them, and Paul Carr of PandoDaily argued that iBooks Author’s self-made ebooks won’t challenge the professionally produced and marketed ones. All Things Digital’s Peter Kafka did the math to show that publishers should still get plenty from the new revenue streams.

The news still brought plenty of concerns: At CNET, Lindsey Turrentine wondered how many schools will have the funds to afford the hardware for iBooks, and David Carnoy and Scott Stein questioned how open Apple’s new platforms would be. That theme was echoed elsewhere, especially by developer Dan Wineman, who found that through its user agreement, Apple will essentially assert rights to anything produced with its iBooks file format. That level of control gave some, like GigaOM’s Mathew Ingram, pause, but Paul Carr said we shouldn’t be surprised: This is what Apple does, he said, and we all buy its products anyway.

Making ‘truth vigilantes’ mainstream: The outrage late last week over New York Times public editor Arthur Brisbane’s column asking whether the paper’s reporters should challenge misleading claims by officials continued to yield thoughtful responses this week. After his column last week voicing his support for journalism’s “truth vigilantes,” j-prof Robert Niles created a site to honor them, pointing out instances in which reporters call out their sources for lying. Salon’s Gene Lyons, meanwhile, said that attitudes like Brisbane’s are a big part of what’s led to the erosion of trust in the Times and the mainstream press.

The two sharpest takes on the issue this week came from The Atlantic’s Conor Friedersdorf and from Columbia Ph.D. student Lucas Graves here at the Lab. Friedersdorf took on journalists’ argument that people should read the news section for unvarnished facts and the opinion section for analysis: That argument doesn’t work, he said, because readers don’t consume a publication as a bundle anymore.

Graves analyzed the issue in light of both the audience’s expectations for news and the growth of the fact-checking movement. He argued for fact-checking to be incorporated into journalists’ everyday work, rather than remaining a specialized form of journalism. Reuters’ Felix Salmon agreed, asserting that “the greatest triumph of the fact-checking movement will come when it puts itself out of work, because journalists are doing its job for it as a matter of course.” At the Lab, Craig Newmark of Craigslist also chimed in, prescribing more rigorous fact-checking efforts as a way for journalists to regain the public’s trust.

Reading roundup: Not a ton of other news developments per se this week, but plenty of good reads nonetheless. Here’s a sample:

— There was one major development on the ongoing News Corp. phone hacking case: The company settled 36 lawsuits by victims, admitting a cover-up of the hacking. Here’s the basic story from Reuters and more in-depth live coverage from the Guardian.

— Rolling Stone published a long, wide-ranging interview with WikiLeaks’ Julian Assange as he awaits his final extradition hearing. Reuters’ Jack Shafer also wrote a thoughtful piece on the long-term journalistic implications of WikiLeaks, focusing particularly on the continued importance of institutions.

— Two interesting pieces of journalism-related research: Slate’s Farhad Manjoo described a Facebook-based study that throws some cold water on the idea of the web as a haven for like-minded echo chambers, and the Lab’s Andrew Phelps wrote about a study that describes and categorizes the significant group of people who stumble across news online.

— In a thorough feature, Nick Summers of Newsweek/The Daily Beast laid out the concerns over how big ESPN is getting, and whether that’s good for ESPN itself and sports media in general.

— Finally, for those thinking about how to develop the programmer-journalists of the future, j-prof Matt Waite has a set of thoughts on the topic that functions as a great jumping-off point for more ideas and discussion.

15:20

Mediatwits #34: SOPA Protests Make a Difference; Yang Out at Yahoo

[Image: danny telegram.jpg]

Welcome to the 34th episode of "The Mediatwits," the weekly audio podcast from MediaShift. The co-hosts are MediaShift's Mark Glaser and Rafat Ali. This week the show is mainly focused on the huge day of protest online Wednesday against the Stop Online Piracy Act (SOPA) and Protect IP Act (PIPA) before the U.S. Congress. After Wikipedia, Reddit and other sites went black, and millions signed petitions and called lawmakers, at least 40 representatives and Senators said they wouldn't support the bills in their current form. It was a breathtaking display of online organization that got results.

Special guest Danny Sullivan of Search Engine Watch discussed the role that Google played in educating people and helping them take action. Plus, Sullivan created one of the more creative memes by sending a telegram to Sen. Dianne Feinstein (D-Calif.) because she didn't have an active Twitter or Facebook page. In other news, Chief Yahoo and company co-founder Jerry Yang announced he was stepping down as Yahoo tries again to turn the tanker around. Special guest Eric Jackson, an activist investor in Yahoo, talks about the brightened prospects for the web giant now that Yang has departed.

Check it out!

[Audio: mediatwits34.mp3]

Subscribe to the podcast here

Subscribe to Mediatwits via iTunes

Follow @TheMediatwits on Twitter here

Intro and outro music by 3 Feet Up; mid-podcast music by Autumn Eyes via Mevio's Music Alley.

Here are some highlighted topics from the show:

[Image: danny_sullivan headshot.jpg]

Intro

1:10: Rafat is going away to get married and to take a long honeymoon trip

3:00: There are more serious issues that should get this much attention

5:00: A clear explanation of the SOPA and PIPA bills before Congress

7:15: Rundown of topics on the podcast

Huge day of protesting SOPA online

8:00: Special guest Danny Sullivan

11:10: Sullivan: Big media companies should make content easier to find, buy

13:00: Should be an easier way to pull down infringing sites

15:10: Sullivan explains why he did the telegram for Sen. Feinstein

19:00: Obama comes out against the bills in their current form

Yang out at Yahoo

[Image: Eric Jackson head.jpg]

20:20: Special guest Eric Jackson

22:40: Jackson: Investors have shied away from Yahoo stock

25:40: Jackson is heartened by new CEO Scott Thompson

28:00: Jackson: Shareholders could get a special dividend

More Reading

SOPA protest by the numbers: 162M pageviews, 7 million signatures at Ars Technica

Your Guide to the Anti-SOPA Protests at MediaShift

Put Down the Pitchforks on SOPA at NY Times

Where Do Your Members of Congress Stand on SOPA and PIPA? at ProPublica

Protect IP Act Senate whip count at OpenCongress

Senator Ron Wyden To The Internet: Thank You For Speaking Up... But We're Not Done Yet at TechDirt

With Twitter, Blackouts and Demonstrations, Web Flexes Its Muscle at NY Times

Google Blackens Its Logo To Protest SOPA/PIPA, While Bing & Yahoo Carry On As Usual at Search Engine Land

Protests lead to weakening support for Protect IP, SOPA at CNET

Jerry Yang's Departure Means Major Transformations for Yahoo! at Forbes.com

Yahoo's Yang is gone. That was the easy part at CNET

With Yahoo co-founder Jerry Yang departed from board, Yahoo seeks a new course at Mercury News

Weekly Poll

Don't forget to vote in our weekly poll, this time about the anti-SOPA protests:


What do you think about the anti-SOPA protests?

Mark Glaser is executive editor of MediaShift and Idea Lab. He also writes the bi-weekly OPA Intelligence Report email newsletter for the Online Publishers Association. He lives in San Francisco with his son Julian. You can follow him on Twitter @mediatwit and circle him on Google+.

January 18 2012

23:10

Your Guide to the Anti-SOPA Protests

Today was an important day in the history of the Internet and activism. While the U.S. Congress expected to quickly pass two bills, the Stop Online Piracy Act (SOPA) and Protect IP Act (PIPA), mounting opposition online has led them to reconsider. That all came to a head today when various sites such as Wikipedia and Reddit decided to black out their content, and others such as Google put up anti-SOPA messages on their sites. The following is a Storify aggregation of all those efforts, including explainers, stories, tweets, parody videos and more.

[View the story "A Guide to the Anti-SOPA Protests" on Storify]

Mark Glaser is executive editor of MediaShift and Idea Lab. He also writes the bi-weekly OPA Intelligence Report email newsletter for the Online Publishers Association. He lives in San Francisco with his son Julian. You can follow him on Twitter @mediatwit and circle him on Google+.

16:27

Three alternate ways to access Wikipedia

Wikipedia is among a number of sites blacking out today in protest of SOPA and PIPA, two bills before Congress that many in the tech community fear will infringe upon free expression and do serious harm to the Internet. You can read more about Wikipedia’s position and the bill here. And here...

06:54

SOPA Rep. Lamar Smith blasts Wikipedia blackout, says law to go forward in February

paidContent :: Rep. Lamar Smith (R-Tx), who is leading a push to pass a controversial anti-piracy bill, issued a statement today scolding Wikipedia over its plan to go dark with its English-language website for 24 hours in protest of the legislation. In a separate release, Smith said the House Judiciary Committee would go forward with a mark-up of the legislation in February.

Continue to read Jeff Roberts, paidcontent.org

Tags: SOPA Wikipedia

January 17 2012

20:16

Jan 18th, "blackout" day: Google will protest SOPA using popular home page

CNet :: Google, the Web's top search company and one of technology's most influential powers in Washington, will post a link on its home page tomorrow to notify users of Google's opposition to controversial antipiracy bills being debated in Congress. The company confirmed in a statement that it will join Wikipedia, Reddit, and other influential tech firms

Continue to read Declan McCullagh | Greg Sandov, news.cnet.com

06:57

SOPA: CEO Dick Costolo calls a blackout of Twitter 'foolish'

The original title of this article was "Twitter’s Dick Costolo calls Wikipedia’s SOPA blackout ‘foolish’," but Costolo later wrote to Wikipedia founder Jimmy Wales that he was referring only to a potential blackout of Twitter when he made the ‘foolish’ remark.

The Next Web :: We’ve already reported on Wikipedia founder Jimmy Wales announcing that the site will be going dark, effectively closing up for business, for 24 hours on Wednesday in protest of SOPA (Stop Online Piracy Act). Well, after being goaded about whether Twitter would be joining the protest with a blackout as well, Twitter CEO Dick Costolo responded by saying that such a decision would be foolish for Twitter (wording updated to reflect his later clarification).

Continue to read Matthew Panzarino, thenextweb.com

January 16 2012

20:12

January 18th "blackout day": Wikipedia to shut down on Wednesday to protest SOPA

The Next Web :: Today, Jimmy Wales, founder of the non-profit behind the information archive Wikipedia, announced that the site will go dark for 24 hours on Wednesday, January 18th, in protest of the Stop Online Piracy Act (SOPA).


Continue to read Drew Olanoff, thenextweb.com

Tags: SOPA Wikipedia

12:16

Sockpuppetry and Wikipedia – a PR transparency project

Wikipedia image by Octavio Rojas

Wikipedia image by Octavio Rojas

Last month you may have read the story of lobbyists editing Wikipedia entries to remove criticism of their clients and smear critics. The story was a follow-up to an undercover report by the Bureau of Investigative Journalism and The Independent on claims of political access by Bell Pottinger, written as a result of investigations by SEO expert Tim Ireland.

Ireland was particularly interested in reported boasts by executives that they could “manipulate Google results to ‘drown out’ negative coverage of human rights violations and child labour”. His subsequent digging resulted in the identification of a number of Wikipedia edits made by accounts that he was able to connect with Bell Pottinger, an investigation by Wikipedia itself, and the removal of edits made by suspect accounts (also discussed on Wikipedia itself here).

This month the story reverted to an old-fashioned he-said-she-said report on conflict between Wikipedia and the PR industry as Jimmy Wales spoke to Bell Pottinger employees and was criticised by co-founder Tim (Lord) Bell.

More insightfully, Bell’s lack of remorse has led Tim Ireland to launch a campaign to change the way the PR industry uses Wikipedia, by demonstrating directly to Lord Bell the dangers of trying to covertly shape public perception:

“Mr Bell needs to learn that the age of secret lobbying is over, and while it may be difficult to change the mind of someone as obstinate as he, I think we have a jolly good shot at changing the landscape that surrounds him in the attempt.

“I invite you to join an informal lobbying group with one simple demand; that PR companies/professionals declare any profile(s) they use to edit Wikipedia, name and link to them plainly in the ‘About Us’ section of their website, and link back to that same website from their Wikipedia profile(s).”

The lobbying group will be drawing attention to Bell Pottinger’s techniques by displacing some of the current top ten search results for ‘Tim Bell’ (“absurd puff pieces”) with “factually accurate and highly relevant material that Tim Bell would much rather faded into the distance” – specifically, the contents of an unauthorised biography of Bell, currently “largely invisible” to Google.

Ireland writes that:

“I am hoping that the prospect of dealing with an unknown number of anonymous account holders based in several different countries will help him to better appreciate his own position, if only to the extent of having him revise his policy on covert lobbying.”

…and from there to the rest of the PR industry.

It’s a fascinating campaign (Ireland’s been here before, using Google techniques to demonstrate factual inaccuracies to a Daily Mail journalist) and one that we should be watching closely. The PR industry is closely tied to the media industry, and sockpuppetry in all its forms is something journalists should do more than merely complain about.

It also highlights again how distribution has become a role of the journalist: if a particular piece of public interest reporting is largely invisible to Google, we should care about it.

January 14 2012

19:21

Wikipedia considering joining SOPA blackout protest

CNet :: As anger towards the proposed Stop Online Piracy Act grows, more and more people and organizations are joining the fight against the bipartisan Congressional legislation. Earlier this week, the news site Reddit announced it would shut down for 12 hours on January 18 in a bid to make its displeasure known about SOPA and its Senate counterpart, the Protect IP Act. And now, there are strong signs that Wikipedia may express its community's protest sentiment, although it's not yet known in what form.

Continue to read Daniel Terdiman, news.cnet.com

Tags: SOPA Wikipedia

January 04 2012

16:51

Daily Must Reads, Jan. 4, 2012

The best stories across the web on media and technology, curated by Nathan Gibbs


1. Yahoo announces PayPal president Scott Thompson as its new CEO (TechCrunch)

2. Wikipedia raises $20 million in its annual donation drive, from 1 million donors  (Wikimedia Foundation)

3. No warrant needed for GPS monitoring, judge rules (Wired)

4. Why Twitter's "verified account" failure matters (GigaOM)

5. It is now illegal to visit a foreign website in Belarus (The Next Web)

6. Slate partners with YouTube to bring its Explainer to video (Nieman Journalism Lab)



Subscribe to our daily Must Reads email newsletter and get the links in your inbox every weekday!



January 02 2012

17:26

Wikimedia Foundation raises $20m, serving more than 470 million people per month

Wikimedia :: The Wikimedia Foundation’s annual fundraising campaign reached a successful conclusion on Sunday, January 1, having raised a record-breaking USD 20 million from more than one million donors in nearly every country in the world. It is the Wikimedia Foundation’s most successful campaign ever, continuing an unbroken streak in which donations have risen every year since the campaigns began in 2003. Wikimedia websites serve more than 470 million people every month. It is the only major website supported not by advertising, but by donations from readers.

Continue to read wikimediafoundation.org

Tags: Wikipedia

December 07 2011

14:50

How to scrape and parse Wikipedia

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right hand side of the page).

The first job was for me to design a Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion between the immovable object (American cavers who believe cave locations are secret) and the irresistible force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Aquamole_Pot").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the Wikipedia API and get the raw source in XML form without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())
url = "http://en.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia API doesn’t accept.
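You can see the encoding difference directly (shown here with the Python 3 spellings; the Python 2 code above uses urllib.urlencode and urllib.quote for the same calls):

```python
from urllib.parse import urlencode, quote

params = {"action": "query", "rvprop": "timestamp|user|comment|content"}

# urlencode percent-escapes the pipe separators...
encoded = urlencode(params)
assert "%7C" in encoded and "|" not in encoded

# ...while quoting each value with '|' marked as safe keeps them intact,
# which is what the manual "&".join(...) above achieves
qs = "&".join("%s=%s" % (k, quote(v, safe="|")) for k, v in params.items())
assert "rvprop=timestamp|user|comment|content" in qs
```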

The result of running the code above is:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = [http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]
}}
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}
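The wikipedia_utils library isn't reproduced here, but the two steps just described, pulling fields out of a flat infobox and converting the numeric strings, can be sketched like this (a toy stand-in: unlike the author's recursive parser, it won't handle nested {{...}} templates):

```python
import re

def parse_simple_infobox(wikitext, name):
    # Pull "| key = value" fields out of one flat (non-nested) template.
    m = re.search(r"\{\{" + re.escape(name) + r"(.*?)\}\}", wikitext, re.DOTALL)
    if m is None:
        return None
    fields = {}
    for chunk in m.group(1).split("\n|"):
        key, eq, value = chunk.partition("=")
        if eq:
            fields[key.strip()] = value.strip()
    return fields

sample = """{{Infobox ukcave
| name = Aquamole Pot
| depth_metres = 113
| length_metres = 142
| geology = [[Limestone]]
}}
'''Aquamole Pot''' is a cave..."""

cave = parse_simple_infobox(sample, "Infobox ukcave")
# convert the numeric fields from strings, tolerating blank values
for key in ("depth_metres", "length_metres"):
    cave[key] = float(cave[key]) if cave.get(key) else None
# cave["depth_metres"] -> 113.0
```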

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire, which is inside Category:Caves_of_England, which is finally inside Category:Caves_of_the_United_Kingdom.

So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")

All of this adds up to my current scraper wikipedia_longest_caves, which extracts those infobox tables for caves in the UK and puts them into a form where I can sort them by length to create this table, based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres DESC:

name                   location_area                   length_metres  depth_metres
Ease Gill Cave System  United Kingdom Yorkshire Dales  66000.0        137.0
Dan-yr-Ogof            Wales                           15500.0
Gaping Gill            United Kingdom Yorkshire Dales  11600.0        105.0
Swildon’s Hole         Somerset                        9144.0         167.0
Charterhouse Cave      Somerset                        4868.0         228.0
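The ORDER BY query can be tried against an in-memory SQLite table (a minimal sketch; the scraper's real store also has a link column, omitted here):

```python
import sqlite3

# in-memory stand-in for the scraper's SQLite store
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE caveinfo
               (name TEXT, location_area TEXT,
                length_metres REAL, depth_metres REAL)""")
con.executemany("INSERT INTO caveinfo VALUES (?, ?, ?, ?)", [
    ("Swildon's Hole", "Somerset", 9144.0, 167.0),
    ("Ease Gill Cave System", "United Kingdom Yorkshire Dales", 66000.0, 137.0),
    ("Gaping Gill", "United Kingdom Yorkshire Dales", 11600.0, 105.0),
])
rows = con.execute("""SELECT name, location_area, length_metres, depth_metres
                      FROM caveinfo ORDER BY length_metres DESC""").fetchall()
# rows[0] -> ('Ease Gill Cave System', 'United Kingdom Yorkshire Dales', 66000.0, 137.0)
```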

If I were being smart I could make the scraping adaptive, that is, only update pages that have changed since they were last scraped, using the data returned by GetWikipediaCategoryRecurse(), but the dataset is small enough at the moment.

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t mind your updates arriving no more often than every six months, which prevents you from getting any feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t want to be stuck with the naïve semantic web notion that the boundaries between entities are a simple, straightforward and general concept, rather than what they really are: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as some sort of combined listing. And that’s why you’ll get further treating knowledge domains as special cases.


October 06 2011

16:00

The Newsonomics of f8

Editor’s Note: Each week, Ken Doctor — author of Newsonomics and longtime watcher of the business side of digital news — writes about the economics of news for the Lab.

Is it declaration of war, or of peace, or is Mark Zuckerberg saying he just really Likes us all very, very much?

“No activity is too big or too small to share,” the 27-year-old proclaimed at the recent f8 announcement. “All your stories, all your life…. This is going to make it easy to share orders of magnitude more things than before.” (f8 sounds, oddly, like FATE, but I think my paranoia is kicking in.)

“Excuse me, have we met?” is one response.

Another response to Facebook’s Ticker, Timeline, and News Feed initiatives is to go dating. Some quite influential publishers are road-testing the new features, while others ponder a light commitment.

In 2011, U.S. dailies’ digital ad take will be about $3 billion and Facebook’s $2 billion.

They should be aware that Facebook is bent on world domination — having targeted businesses now run by Amazon, Apple, Google, LinkedIn, Wikipedia, Flipboard, Pulse, Pandora, Last.fm, and Flickr, as well as legacy news and information providers — in the latest move. (Forget debating Google’s “do no evil” mantra; Google’s sin may have been that it thought too small.) That’s audience, though not business, domination, as Facebook’s EMEA platform partnerships director, Christian Hernandez, told PaidContent. “[f8] is not a commercial decision.” Got it. And Google just wants to help us better organize our info.

Facebook’s f8 signals a next round of digital disruption. Remember Microsoft’s decade-old bid to become the hub of our entertainment lives, as evidenced by its futuristic Consumer Electronics Show displays? Facebook has taken that metaphor — and updated and socialized it.

This unabashed push to remake the digital world in its own image would seem like laughable megalomania coming from many other sources in the world. But it’s not megalomania if others act like you’re not crazy. In fact, our story takes strange turns as this megalomania, so far, seems quite magnanimous to publishers, as Facebook looks to some like the best available date, compared to the other ascendant audience resellers (Apple, Amazon, and Google).

As leading-edge publishers move away from destination-only strategies, they seek to colonize other habitable web environments; Facebook now looks like the friendliest clime, allowing publishers to keep all the revenue from ads they are selling within their Facebook apps. In addition, Facebook is providing aggregated data on user engagement — active users, likes, comments, post views, and post feedback.

Buy-in from such brands as the Washington Post, The Economist, the Wall Street Journal, The Guardian, and Yahoo helps to place Facebook’s push into the “normal” scale of corporate behavior.

Why are news players playing along? What do they think is in it for them?

Let’s look at the newsonomics of f8 and of the new social whirl.

“Rather than incorporate Facebook features into our site, we’ve looked at incorporating our content into Facebook.”

Let’s start with the stark, Willie Sutton reason: you work with Facebook because that’s where the audience is. In the U.S., Facebook claims as much as seven hours of average monthly usage; globally, that number is four hours plus. It’s where would-be readers hang out.

Worldwide, it claims an audience of 800 million.

If Facebook is the hang-out mall, newspaper and magazine sites are grocery stores. People go there when they need something — to find out what’s new — and then leave. By comparison, average monthly usage of news sites runs five to 20 minutes.

So exposure to audience is the no-brainer, here. The question is: to what end?

Step back from the flurry of news company announcements, or from the behind-the-scenes 2012 strategies-in-the-making, and publishers cite three top goals:

  • Lower-cost development of audience, especially audience that may become core customers.
  • Digital advertising revenue growth.
  • Establishing a robust, growing stream of digital reader revenue.

So how might f8 innovations help those?

Let’s start with brand awareness. It’s a digital din out there, a survival-of-the-feistiest time. Consumers will come to rely on a handful or two of news brands, goes the theory. So best to be high in their consciousness, and Facebook omnipresence in people’s lives offers that possibility.

Adam Freeman, executive director of Commercial for Guardian News and Media, explains Guardian’s digital-first strategy here this way:

Our digital audience has grown to a phenomenal 50m+, but, with the best will in the world, chances are we are never going to outpace and outstrip Facebook’s audience size. So we see an opportunity in that — rather than incorporate Facebook features into our site, we’ve looked at incorporating our content into Facebook. There is an untapped audience within Facebook who may not be regularly encountering Guardian and Observer content, and we think our app increases the visibility of our content in that space.

Of course that brand consciousness needs to be acted on, which leads us to…

Lower-cost traffic acquisition. Online, publishers have invested in search engine optimization and search engine marketing. SEO makes them more findable in organic search; SEM pays for high-level brand placement. In addition, they’ve done deals with portals over the years; the current Yahoo arrangement of swapping news stories for links is a major one for many.

Working with Facebook, though, is simply social media optimization (“The newsonomics of social media optimization”).

It’s another route to pouring newer customers into the top end of news publishers’ audience funnel, hoping a few tumble out the bottom as paying, regular readers. And any readers can be monetized with advertising.

SMO’s relative economics are better than SEO or SEM. Not only is SMO cheaper than SEM, some publishers say it “performs” better. That performance is best measured by conversions (registrations, more pages read, digital sub buying), while for others the jury is still out. And, at best, audience development multiplies off these new relationships.

“These new Facebook users aren’t necessarily finding the brand in traditional ways, nor do they necessarily hold longstanding brand affinity,” says Jed Williams, analyst at BIA/Kelsey.

Their social graphs, curators/editors, recommendations, etc. are doing the pointing for them. So they do arrive at the very top of the proverbial funnel. And, as they interact with the publisher, with them in turn comes their social network. Potentially, the exponential network effects take off, and new audience continues to breed even more new audience. Original audience targets emerge, and the funnel continually expands. At least in the best case scenario, it does.

Sale of paid products: If you are now selling digital subscriptions, you’re doubly interested in customer acquisition. Now publishers can discover the percentage of new audience they can convert to paying customers, though that’s not an easy proposition to figure out. That percentage will be tiny, but it may be meaningful.

Out of the chute, digital circulation efforts have focused strongly on longstanding customers. Publishers have wanted to keep their print customers paying. They want to reduce print churn by taking away customers’ ability to get the news they get in the paper for free online. They want to change the psychology of long-term readers, giving them a new understanding: You pay for news, in print or digitally.

Facebook looks like it may become a top media-selling marketplace, along with Amazon and Apple.

That’s round one, 2011-2012, of the digital circulation wars. Round two necessitates bringing in new customers, especially younger ones who don’t have print habits and may not have much news brand loyalty.

That’s a key place Facebook fits in. It’s a potential hothouse of new, younger customers.

“It isn’t obvious that we can be successful with premium content on social,” notes Alisa Bowen, general manager of WSJ Digital Network. The Journal, while not participating in the f8 launch, already has a significant trial in place. The same holds true of the spate of other recent WSJ innovations, like WSJ Live and its iPad apps. “WSJ Everywhere,” Bowen says, “tests what we’re doing for people who never come to the website.”

As publishers create more one-off tablet and smartphone products (“The newsonomics of Kindle Singles”), Facebook looks like it may become a top media-selling marketplace, along with Amazon and Apple.

Advertising revenue: Facebook is still so bent on building audience that it is providing publishers their best ad deals. Publishers can sell ads for display within their Facebook apps — and keep all the revenue. No revenue share, thank you. (At least for now.)

Data: “In addition to serving adverts from our own partners in the app, we have highly detailed but anonymized data from Facebook covering demographics and usage,” says Freeman. “We also have our own analytics embedded in the pages on the app, which will help us understand how our content is used and shared within the Facebook Open Graph.”

Learning about social curation. Social filtering will be a standard feature of all news (unless we opt out) by 2015. It’s not hard to see why. It’s old village world-of-mouth, jet-propelled by technology. How social curation will work is a huge question; how can it best co-exist with editorial curation, for instance? That kind of learning is one other benefit f8 partners tell me they hope to gain.

The Facebook dance is a cautious one. News publishers’ experiences with web wunderkinds have not, in general, been great ones. Witness the ongoing battles over revenue share percentages, customer relationships, and customer data access that have characterized the soap-opera-like Apple/publisher public spats. Amazon’s new Kindle tablet re-lights the question of publisher/Amazon rev share and data sharing.
