Tumblelog by Soup.io

February 23 2011

15:56

Six questions about semantic data and news innovation

If there is no clear guidance on essential building blocks of the open Web, like rich semantic data, and every organization is left to draw its own conclusions, I ask myself: is this the fertile ground where innovation takes root?

I've started interviewing our news partners in an effort to sketch the outlines of the first series of news-technology innovation challenges. There are a number of consistent themes emerging in these conversations: mobile & new devices, large datasets & presenting data in useful ways, HTML5 video & audio, and so on. The theme that I'm most curious about today is semantic data, and the question of what standards are going to lead to the most new, and interesting, innovations, e.g.: microformats, or microdata, or RDFa, or...

Specifically, I'm interested in asking:

  • What would it take to see a move toward one standard for semantic data on the open Web?
  • Would a broad adoption of one standard make new innovations more likely?
  • Is the choice of a standard simply an issue of matching project needs to the features provided by the standard, thus validating the necessity of several different options?

This curiosity stems from my excitement for the HTML5 specification and the aspirations it sets for the future of the open Web: bendable, programmable, and accessible (in other words, awesomesauce). It's also exciting because people are actually working with what is available from the HTML5 specification today -- it's not only possible, but practical. However, there is very little guidance provided (currently) on how to implement semantic data in the HTML5 universe.

From my (admittedly cursory) investigation, the situation exists because the HTML5 community hasn't agreed on the "one true way" to implement semantic data (perhaps that's not a realistic possibility). There are at least three competing semantic data standards that seem to frame the debate: microformats, microdata, and RDFa.
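To give a rough sense of how microformats, microdata, and RDFa differ, here is a small sketch that states the same fact (an organization's name) in each syntax and then lists which HTML attributes each one leans on. The organization name and the example.org vocabulary URLs are placeholders invented for illustration, not real vocabularies, and the markup is simplified.

```python
from html.parser import HTMLParser

# The same fact in three syntaxes. "Example News" and the example.org
# vocabulary URLs are hypothetical placeholders.
SNIPPETS = {
    "microformats": '<span class="vcard"><span class="fn org">Example News</span></span>',
    "microdata": '<span itemscope itemtype="http://example.org/Organization">'
                 '<span itemprop="name">Example News</span></span>',
    "rdfa": '<span vocab="http://example.org/" typeof="Organization">'
            '<span property="name">Example News</span></span>',
}


class AttributeCollector(HTMLParser):
    """Records every attribute name seen on any start tag."""

    def __init__(self):
        super().__init__()
        self.attribute_names = set()

    def handle_starttag(self, tag, attrs):
        self.attribute_names.update(name for name, _ in attrs)


def attributes_used(markup):
    """Return the set of attribute names that appear in the markup."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.attribute_names


for syntax, markup in SNIPPETS.items():
    print(syntax, sorted(attributes_used(markup)))
```

Microformats reuse the existing class attribute, while microdata and RDFa each bring their own attributes. That is part of why a tool written for one syntax will not pick up data published in another.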

Last week I read on the Microformats blog that Facebook added hCalendar and hCard microformats to millions of events. This type of scenario is a good example of what I was referring to last week when I wrote about the decisions that news organizations are making today, and what impact those decisions could have on the future of the Web, e.g.:

  • On one side of the Internet seesaw (also known as a teeter-totter) are companies like Facebook, Twitter, and Google that have the massive "weight" of large user communities and immense volumes of data;
  • On the other side are news organizations, which still have, I would argue, equivalent weight in terms of their reach, attention, and the trust they've earned over time.

So what happens if one side moves to the other? Or -- if that doesn't happen -- which side will be the first to convince a majority of developers to hop on their end and change the balance?

For example, the BBC has already made significant investments in RDF and actively advocates for other organizations to embrace Linked Data. Other news organizations are, no doubt, using Microdata and hoping to leverage Google's ability to turn that data into "rich snippets" that drive traffic. More still, and this is my point, are probably sitting on the fence waiting to see what happens.

So, as I rush to make my next connection en route to Raleigh, I'll finish off this post with these questions:

  • How can the Knight-Mozilla News Technology Partnership play a role here? (And should it?)
  • What are the opportunities to work with news organizations, and the broader news innovation community, to explore the far edges of possibility for a semantically-rich Web?
  • Would broad adoption of an open standard for semantic data by large news organizations create new opportunities for innovation that have not been explored thoroughly yet?
  • Would this be the type of challenge that would pique your curiosity, and -- possibly -- entice you to get involved?

If you have thoughts on the matter, speak up or drop me a line. :)

November 18 2010

18:06

Google News Meta Tags Fail to Give Credit Where Credit Is Due

Far be it from me to question the brilliance of Google, but in the case of its new news meta tagging scheme, I'm struggling to work out why it is brilliant or how it will be successful.

First, we should applaud the sentiment. Most of us would agree that it is a Good Thing that we should be able to distinguish between syndicated and non-syndicated content, and that we should be able to link back to original sources. So it is important to recognize that both of these are -- in theory -- important steps forward from the perspective of both news and the public.

But there are a number of problems with the meta tag scheme that Google proposes.

Problems With Google's Approach

Meta tags are clunky and likely to be gamed. They are clunky because they cover the whole page, not just the article. As such, if the page contains more than one article or, more likely, contains lots of other content besides the article (e.g. links, promos, ads), the meta tag will not distinguish between them. More important is that meta tags are, traditionally, what many people have used to game the web. Put in lots of meta tags about your content, the theory goes, and you will get bumped up the search engine results. Rather than address this problem, the new Google system is likely to make it worse, since there will be assumed to be a material value to adding the "original source" meta tag.
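To illustrate the page-level problem, here is a minimal sketch that reads the meta tags from a page holding two articles. The tag names `syndication-source` and `original-source` are the ones I recall from Google's announcement, so treat them as an assumption; the URLs and the `article-1`/`article-2` ids are invented for the example.

```python
from html.parser import HTMLParser

# A hypothetical page with two articles. The meta tag names follow my
# recollection of Google's announcement; URLs and ids are made up.
PAGE = """
<html><head>
<meta name="syndication-source" content="http://example.com/wire/story-1">
<meta name="original-source" content="http://example.com/report">
</head><body>
<div id="article-1">First article</div>
<div id="article-2">Second article</div>
</body></html>
"""


class MetaCollector(HTMLParser):
    """Collects named meta tags and the ids of article containers."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.article_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content")
        elif tag == "div" and (attrs.get("id") or "").startswith("article-"):
            self.article_ids.append(attrs["id"])


def collect(page):
    """Return (page-level meta tags, ids of the articles on the page)."""
    collector = MetaCollector()
    collector.feed(page)
    collector.close()
    return collector.meta, collector.article_ids


meta, article_ids = collect(PAGE)
print(meta)         # one set of values for the whole page
print(article_ids)  # two articles, with nothing tying either to the values
```

The parser finds one `original-source` value and two articles, and nothing in the markup says which article the value describes.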

Though there is a clear value in being able to identify sources, distinguishing between an "original source" as opposed to a source is fraught with complications. This is something that those of us working on hNews, a microformat for news, have found when talking with news organizations. For example, if a journalist attends a press conference then writes up that press conference, is that the original source? Or is it the press release from the conference with a transcript of what was said? Or is it the report written by another journalist in the room published the following day? Google appears to suggest they could all be "original sources"; if this extends too far then it is hard to see what use it is.

Even when there is an obvious original source, like a scientific paper, news organizations rarely link back to it (even though it's easy to use a hyperlink). The BBC -- which is generally more willing to source than most -- has historically tended to link to the front page of a scientific publication or website rather than to the scientific paper itself (something the Corporation has sought to address in its more recent editorial guidelines). It is not even clear, in the Google meta-tagging scheme, whether a scientific paper is an original source, or the news article based on it is an original source.

And what about original additions to existing news stories? As Tom Krazit wrote on CNET:

The notion of 'original source' doesn't take into account incremental advances in news reporting, such as when one publication advances a story originally broken by another publication with new important details. In other words, if one publication broke the news of Prince William's engagement while another (hypothetically) later revealed exactly how he proposed, who is the "original source" for stories related to "Prince William engagement," a hot search term on Google today?

Differences with hNews

Something else Google's scheme does not acknowledge is that there are already methodologies out there that do much of what it is proposing, and are in widespread use (ironic given Google's blog post title "Credit where credit is due"). For example, our News Challenge-funded project, hNews, already addresses the question of syndicated/non-syndicated, and in a much simpler and more effective way. Google's meta tags do not clash with hNews (both conventions can be used together), but neither do they build on its elements or work in concert with them.

One of the key elements of hNews is "source-org" or the source organization from which the article came. Not only does this go part-way toward the second tag Google suggests, "original source," it also cleverly avoids the difficult question of how to credit a news article that may be based on wire copy but has been adapted since -- a frequent occurrence in journalism. The Google syndication method does not capture this important difference. hNews is also already the standard used by the largest American syndicator of content, the Associated Press, and is also used by more than 500 professional U.S. news organizations.

It's also not clear if Google has thought about how this will fit into the workflow of journalists. Every journalist we spoke to when developing hNews said they did not want to have to do things that would add time and effort to what they already do to gather, write up, edit and publish a story. It was partly for this reason that hNews was made easy to integrate into publishing systems; it's also why hNews marks information up automatically.

Finally, the new Google tags only give certain aspects of credit. They give credit to the news agency and the original source but not to the author, or to when the piece was first published, or how it was changed and updated. As such, they are a poor cousin to methodologies like hNews and linked data/RDFa.
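By way of contrast, here is a rough sketch of per-article markup in the hNews style and a small parser that pulls credit fields out of each article separately. The class names (`hentry`, `author`, `source-org`, `published`, `updated`) follow my understanding of the hNews and hAtom drafts and should be checked against them; the people, organizations, and dates are invented, and the markup is simplified.

```python
from html.parser import HTMLParser

# Credit fields to extract. Class names assumed from the hNews/hAtom drafts.
FIELDS = ("source-org", "author", "published", "updated")

# Hypothetical page with two separately credited articles.
PAGE = """
<div class="hnews hentry">
  <span class="author vcard"><span class="fn">A. Reporter</span></span>
  <span class="source-org vcard"><span class="fn org">Example Wire</span></span>
  <span class="published">2010-11-16</span>
  <span class="updated">2010-11-18</span>
</div>
<div class="hnews hentry">
  <span class="author vcard"><span class="fn">B. Writer</span></span>
  <span class="source-org vcard"><span class="fn org">Example Gazette</span></span>
  <span class="published">2010-11-17</span>
</div>
"""


class HNewsParser(HTMLParser):
    """Builds one dict of credit fields per hentry element."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self.field = None  # field currently being read, if any
        self.depth = 0     # nesting depth inside that field's element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.field is not None:
            self.depth += 1            # nested element inside a field
        elif "hentry" in classes:
            self.articles.append({})   # a new article begins
        elif self.articles:
            for name in FIELDS:
                if name in classes:
                    self.field, self.depth = name, 1
                    break

    def handle_endtag(self, tag):
        if self.field is not None:
            self.depth -= 1
            if self.depth == 0:
                self.field = None

    def handle_data(self, data):
        if self.field is not None and data.strip():
            record = self.articles[-1]
            record[self.field] = record.get(self.field, "") + data.strip()


def parse(page):
    """Return a list with one dict of credit fields per article."""
    parser = HNewsParser()
    parser.feed(page)
    parser.close()
    return parser.articles


for article in parse(PAGE):
    print(article)
```

Because the fields sit inside each article's own element, the author, source organization, and publication and update dates stay attached to the right story. The sketch does not track where an article ends, so it assumes every field appears inside an `hentry`.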

Ways to Improve

In theory Google's initiative could be, as this post started by saying, a good thing. But there are a number of things Google should do if it is serious about encouraging better sourcing and wants to create a system that works and is sustainable. It should:

  • Work out how to link its scheme to existing methodologies -- not just hNews but linked data and other meta tagging methods.
  • Start a dialogue with news organizations about sourcing information in a more consistent and helpful way.
  • Clarify what it means by original source and how it will deal with different types of sources.
  • Explain how it will prevent its meta-tagging system from being misused such that the term "original source" becomes useless.
  • Use its enormous power to encourage news organizations to include sources, authors, etc. by ranking properly marked-up news items over plain-text ones.

It is not clear whether the Google scheme -- as currently designed -- is more focused on helping Google with some of its own problems sorting news or with nurturing a broader ecology of good practice.

One cheer for intention, none yet for collaboration or execution.
