Wednesday, April 29, 2009

RDF and Twitter: Compare and Contrast

As I wrote previously, RDF was developed with the idea that it would be the backbone of something called the "semantic web", which was supposed to be different from the world-wide web in that machines would be able to transmit and "understand" information from a global network. In contrast, Twitter was developed with the idea that people would need to document their sad, pathetic lives in 140-character chunks. On this date, however, the Twitterverse seems to be an intelligent global network that can transmit and understand almost anything, and the RDF-based semantic web seems to still be convinced of a need for agents to transmit dribbles of sad, pathetic knowledge in an endless stream of subject-object-predicate triples.

It's interesting to compare the core data models for RDF and for Twitter. In RDF, the fundamental particles are, as I've said, subject-object-predicate triples. To recast that last sentence into the RDF model, we would proceed as follows:
 Assertion:
subject: RDF
object: subject-object-predicate triples
predicate: has fundamental particles of type
That's probably too self-referential for most people to wrap their heads around, so instead I'll change the example:
 Assertion:
subject: The United States
object: Barack Obama
predicate: has a president named
I usually have trouble remembering which is the predicate and which is the object. If you think about it, however, you can express the same particle of knowledge in ways that swap the roles of predicate and object, or even subject and predicate. For example:
 Assertion:
subject: Barack Obama
object: President of the United States
predicate: has the office of
In your copious spare time, you can work out the other 4 permutations.
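
For readers who think better in code, here is a minimal Python sketch of the two Obama assertions above. The Triple type and the ROLES tuple are hypothetical names made up for illustration, not part of any RDF library, and the fields are listed in the subject, predicate, object order that RDF tooling conventionally uses.

```python
from itertools import permutations
from typing import NamedTuple


class Triple(NamedTuple):
    """One assertion; a hypothetical type, not from any RDF library."""
    subject: str
    predicate: str
    obj: str


# The two phrasings of the same particle of knowledge from the text.
us_view = Triple("The United States", "has a president named", "Barack Obama")
obama_view = Triple("Barack Obama", "has the office of",
                    "President of the United States")

# Three roles can be ordered in 3! = 6 ways, which is where
# "the other 4 permutations" comes from once two have been shown.
ROLES = ("subject", "predicate", "object")
role_orders = list(permutations(ROLES))

for order in role_orders:
    print(" / ".join(order))
```

The code only counts the orderings; rephrasing the sentence to fit each one is still up to you.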

Now let's look at Twitter. The particle of information in Twitter, the tweet, seems also to be a triple:
 Tweet:
tweeter: gluejar
message: going to bed now!
time: Wed, 29 Apr 2009 06:58:01 +0000
The tweeter in turn has associated with it sets of followed users and followers as well as profile information. There's a lot to talk about here, and in a previous post I pointed out that Twitter message content is becoming richer and more linguistically complex. But the point I'd like to make for now is that Twitter doesn't care so much about what the message is saying as who is saying it and when it was said. The more we look at the RDF examples above, the more the subject-object-predicate representation of knowledge seems limiting. An assertion may be true or false depending on when it was said, and assertions removed from the context of who is making them are for the most part useless, because machines have no way to know whether to trust them.
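
To make the contrast concrete, here is a rough Python sketch, assuming two hypothetical record types: Tweet, which mirrors the who/what/when triple above, and AttributedAssertion, which shows what an RDF-style assertion would look like if it carried the same who and when. Neither type comes from Twitter's API or from any RDF library, and the attribution of the Obama assertion to gluejar at the tweet's timestamp is invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime


@dataclass(frozen=True)
class Tweet:
    """Who, what, when: the tweet triple from the text (hypothetical type)."""
    tweeter: str
    message: str
    time: datetime


@dataclass(frozen=True)
class AttributedAssertion:
    """A bare assertion plus the context a plain triple leaves out."""
    subject: str
    predicate: str
    obj: str
    asserted_by: str       # who is making the assertion
    asserted_at: datetime  # when it was said


# The date string is the RFC 2822 style timestamp shown in the tweet.
when = parsedate_to_datetime("Wed, 29 Apr 2009 06:58:01 +0000")

tweet = Tweet("gluejar", "going to bed now!", when)

# Illustrative only: the same assertion as before, now with who and when.
claim = AttributedAssertion(
    "The United States", "has a president named", "Barack Obama",
    asserted_by=tweet.tweeter, asserted_at=tweet.time,
)
```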

Friend-of-the-blog Jeff Young asserts that the OpenURL data model can be thought of as answering 6 questions: Who, What, Where, When, Why and How. Whatever success Twitter has achieved can be thought of as an argument that the most important of these are the Who, What and When.

Sanity Alert! The following may be mind-blowing to certain susceptible individuals: the data model that Twitter REALLY uses to propagate tweets is RSS and Atom. These formats are descended from what was originally called "Meta Content Format", which became "RDF Site Summary" (Yes, the very same RDF!), which became "Really Simple Syndication" or maybe something else; I'm not entirely sure. Here's how Twitter REALLY feeds into the semantic web:
  tweet:
title: gluejar: going to bed now!
description: gluejar: going to bed now!
pubDate: Wed, 29 Apr 2009 06:58:01 +0000
guid: http://twitter.com/gluejar/statuses/1649740567
link: http://twitter.com/gluejar/statuses/1649740567
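
As a sketch of what that looks like on the wire, the Python below builds the five fields above into a single RSS `<item>` element using the standard library's ElementTree and then reads it back. It assumes each field maps directly to an RSS 2.0 item child element of the same name; the real Twitter feed wraps items in a channel and may carry extra elements not shown here.

```python
import xml.etree.ElementTree as ET

# The five fields listed in the text, in the same order.
fields = {
    "title": "gluejar: going to bed now!",
    "description": "gluejar: going to bed now!",
    "pubDate": "Wed, 29 Apr 2009 06:58:01 +0000",
    "guid": "http://twitter.com/gluejar/statuses/1649740567",
    "link": "http://twitter.com/gluejar/statuses/1649740567",
}

# Build <item> with one child element per field.
item = ET.Element("item")
for name, value in fields.items():
    ET.SubElement(item, name).text = value

xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)

# Round trip: parse the serialized item and pull the "when" back out.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("pubDate"))
```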

Exercise for the reader: how does this look in Atom?

Does anyone but me think that there's something weird going on here?

Friday, April 17, 2009

I have seen the Semantic Web and it tweets "Temba, his arms wide!".

We actually know quite a lot about how human languages develop. If you've not read "The Language Instinct" or something in that direction, then you really need to make some time for it. One thing that is known is that if you want to develop a human language, the main thing you need is a bunch of children. If you put a bunch of adults together who don't speak a common language, they start communicating using pidgin, fragments of languages mixed together with very simple grammars: "Me Tarzan, you Jane" sorts of things that never develop into true languages with complexity and expressive power and the ability to have Shakespeares or Tolstoys. But if you add children to the mix, something completely different happens. Their brains seem to be wired to invent the complexity missing from the pidgins of the adults, and the result is a creole language with all the complexity and expressiveness of any other human language.

I've recently decided that I need to understand what's going on with Twitter. If you've not tried it, don't worry too much, because I've been there and I can assure you that it's every bit as dumb an idea as it sounds like, as dumb as putting together a bunch of adults who don't speak any languages in common and expecting something useful to come out. But despite the dumbness of the Twitter idea, it's really interesting what is happening there. Yes, there are lots of adults and corporate entities making "Me Tarzan, you Jane" noises, but a lot of people manage to approach Twitter with the child-like approach that is resulting in more complexity and expressiveness.

Twitter asks for status messages of quite a short length, 140 characters. You would think that this is a severe limit on what you can do with it, but it turns out to be a great blessing, because it forces people to creatively seek ways to build linguistic complexity and expressive power into their tweets. The result is the emergence of a new human language. In addition to an entirely new set of vocabulary imported from texting (OMG, ROTFL and the like), there are three grammatical constructs that are widely used to build the expressiveness of the 140-character tweet.
1. The "@username" construct. Used to address and reference another user.
2. The "#topic" construct, or hashtag. Used to tie in your tweet with a wider conversation.
3. The embedded hyperlink. Used to point to something on the web.
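
The three constructs above are regular enough that a machine can pick them out. Here is a minimal Python sketch, assuming deliberately simple regular expressions; these are my own rough patterns, not Twitter's actual parsing rules, and the sample tweet and its example.org link are made up.

```python
import re

# Rough patterns; real-world tweets have edge cases these ignore.
MENTION = re.compile(r"(?<!\w)@(\w+)")   # 1. the "@username" construct
HASHTAG = re.compile(r"(?<!\w)#(\w+)")   # 2. the "#topic" construct
LINK = re.compile(r"https?://\S+")       # 3. the embedded hyperlink


def parse_tweet(text):
    """Return the mentions, hashtags and links found in a tweet."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "links": LINK.findall(text),
    }


# A hypothetical tweet that uses all three constructs.
sample = "@gluejar reading about #rdf at http://example.org/triples"
parsed = parse_tweet(sample)
print(parsed)
```

The lookbehind in the first two patterns keeps an "@" or "#" in the middle of a word, such as an email address, from being counted as a construct.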

All three of these pieces of grammatical machinery are worth further discussion (I'm not sure if I'll get a chance to write about them for a week or so due to vacation!!!!) but I need to tie all this into Star Trek and the Semantic Web so I'll focus on the hyperlink part for now.

In the episode "Darmok", the USS Enterprise-D is on a mission to establish communications between the Federation and the Tamarians after several previous attempts had failed. The difficulty was that the output of the Federation's universal translators was a stream of words that didn't make sense. A typical message was "Darmok and Jalad at Tanagra". As the plot unfolds, Captain Picard and a Tamarian, Dathon, are beamed down to a planet, and after a violent interlude Picard realizes that the Tamarian language is composed entirely of references to episodes in the culture's oral history. "Darmok and Jalad at Tanagra", for example, is meant to express "Let's cooperate to face a common enemy".

Believe it or not, there is an analog to the Tamarian language that is being promoted as the next great internet revolution. Serious people are promoting the idea that this "Semantic Web" will represent the emergence of a new kind of intelligent data network of great power. The core of the Semantic Web is a data model called "RDF", which stands for Resource Description Framework. RDF is a beautiful thing. It's based on the idea that all knowledge can be represented as a bunch of data triples, each of which is an assertion: SUBJECT-OBJECT-PREDICATE. It also goes a step further, by saying that all subjects, all objects, and all predicates can be represented by URIs (Uniform Resource Identifiers), which are essentially hyperlinks, or references to other things.
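
To see what "nothing but references" looks like, here is a small Python sketch that writes one assertion as a line of N-Triples, a plain-text RDF serialization that puts each URI in angle brackets and ends the statement with a period. The to_ntriples helper and all three example.org URIs are hypothetical, invented for illustration; note that N-Triples lists the parts in subject, predicate, object order.

```python
def to_ntriples(subject_uri, predicate_uri, object_uri):
    """Format three URIs as one N-Triples statement (hypothetical helper)."""
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


# Made-up URIs standing in for "The United States has a president
# named Barack Obama". Nothing here resolves to a real vocabulary.
line = to_ntriples(
    "http://example.org/country/US",
    "http://example.org/terms/hasPresident",
    "http://example.org/person/BarackObama",
)
print(line)
```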

The problem with RDF is the same problem that Picard's universal translator had with Tamarian. Both Tamarian and RDF can seem to be nothing but references.

So what does Darmok have to do with Twitter? Well, it was only through intense (and ultimately fatal) interaction and imitation between Picard and Dathon that the two were able to connect the reference to the concept. Children create creole languages only by intense imitation and play in a group. And the Semantic Web will only happen in the presence of the intense community messaging and back-and-forth networking provided in an environment like Twitter.

Temba, his arms wide!

Thursday, April 16, 2009

Optimism about OpenURL

Four weeks ago, there was a thread on the OpenURL listserv with the wonderful title "OpenURL listserv still not accepting my mail". I was mentioned by Herbert van de Sompel, so I thought I should reply. The problem was, I had been unsubscribed by virtue of having left my email address when I left OCLC. I figured, no problem, I should be able to get resubscribed. With a bit of help from Phil Norman, I eventually got resubscribed, only to find out that the OpenURL listserv was not accepting my mail either!

I've never had much urge to start blogging, but I've known for a while what the name of my blog would be!

The following is a horrible way to start a blog, but I fully intend for this to be a horrible blog. It will be incomprehensible, arcane, obscure, indirect and never poetic. Here's what I had to say that the OpenURL listserv won't publish:

It's a bit unkind to talk about dynamical OpenURL formats as "false fantasy". Having said that, I think most of the committee that worked on the OpenURL standard would plead guilty to being optimistic about the future. The present situation, however, is that if you use a metadata format that a resolver is unfamiliar with, there are no resolvers, either in production or in the lab, that will understand enough about the ContextObject to do anything other than validate it.

If you're a glass-half-full guy like me, you'll say, "Wow, you mean an OpenURL link resolver can actually validate a metadata format that it's never seen before???" and you'll admire the practicality of the group that worked on the standard.

If you're a glass-half-empty person, you'll say, "That's completely useless, the resolver has no hope of doing anything useful for a user unless somebody goes and does some work on the format" and you'll be muttering about the false fantasies and delusions of the group that worked on the standard.

As Herbert pointed out, the standard is written so that metadata formats that are not in the registry (and thus validated as being important in some way by real live human beings) must be described by either an XML schema or a matrix file. At the time we worked on the standard, the most that could be accomplished with this rule was that a resolver machine would be able to validate a ContextObject. We thought that was a realistic and sensible goal. There was always the hope that semantic web technology would advance to the point that self-describing metadata formats would also be possible. And in fact, some very interesting annotation technologies have since been developed that would make that possible, if you really needed to do it.

The bottom line for the question that began the discussion four weeks ago is that registering the metadata formats that are thought to be important is a Good Thing. Those formats will not be self-describing in common use (because they are successful without self-description!).

(Thanks to Paul Moss who alerted me to the discussion while my listserv subscription had lapsed since leaving OCLC, and to Phil Norman, who helped resubscribe me.)

Sunday, April 5, 2009

Does Google *really* get an Orphan Monopoly?

After posting on the Bits blog that "monopoly" is too strong a word to describe the rights to orphan works that Google would acquire if the proposed Google Book Search Settlement agreement is approved, I started to worry that my interpretation of the settlement agreement was incorrect. The key question is this: Would the Book Rights Registry have the ability to authorize a Google competitor to copy and use "Orphan works"? On further study, and after incanting "I Am Not A Lawyer" ten times fast, I've come to the conclusion that definitely, maybe it can. The relevant section of the settlement agreement is 3.8 (a), also known in the commentary as the "Most Favored Nation" section.
Effect of Other Agreements. The Registry (and any substantially similar entity organized by Rightsholders that is using any data or resources that Google provides, or that is of the type that Google provides, to the Registry relating to this Settlement) will extend economic and other terms to Google that, when taken as a whole, do not disfavor or disadvantage Google as compared to any other substantially similar authorizations granted to third parties by the Registry (or any substantially similar entity organized by Rightsholders that is using any data or resources that Google provides, or that is of the type that Google provides, to the Registry relating to this Settlement) when such authorizations (i) are made within ten (10) years of the Effective Date and (ii) include rights granted from a significant portion of Rightsholders other than Registered Rightsholders. With respect to any such authorization, the Registry promptly will provide Google with notice that an authorization has been granted with sufficient detail of the terms to allow Google to obtain the benefits of such authorization pursuant to this Section 3.8(a) (Effect of Other Agreements).
That's a lot of clauses and legalese. Here are my notes. IANAL!
  1. The settlement agreement clearly anticipates that the Registry would enter into other agreements with regard to orphan works. The phrase "rights granted from a significant portion of Rightsholders other than Registered Rightsholders" can be translated into the vernacular as "rights to orphan works".
  2. "Most Favored Nation" applies only to orphan works, and expires after 10 years.
  3. "(and any substantially similar entity organized by Rightsholders that is using any data or resources that Google provides, or that is of the type that Google provides, to the Registry relating to this Settlement)" just says that the Registry can't use a puppet to evade the MFN.
  4. I gather that lawyers are not universally agreed that it would be legal for the Registry to release copyright infringement claims by "Rightsholders other than Registered Rightsholders". IANAL. Certainly the Registry could act on behalf of Registered Rightsholders. But supposing a law firm filed a class action suit to enjoin the Open Content Alliance from doing digitization of orphan works. How could the court block a settlement agreement of this new lawsuit after approving a settlement for Google?
Bottom line: I think I was right to say that "monopoly" is too strong a word. "Murkopoly", maybe. Below the bottom line: Mike Shatzkin's posts (with help from Michael Cairns) and the commentary on the Shatzkin Files are very much worth reading, if you, like me, are trying to understand the implications of the Google Book Search Settlement.