Thursday, May 28, 2009

Part 3: Reification Considered Harmful

About two years ago, we had some landscaping done on our yard. Part of the work was to replace the crumbling walkways. I suggested making some of the walkways curved, to create more esthetic shapes for the garden beds in front of our house. The landscape designer we were working with suggested that I reconsider, because it is well known in the landscape design world that people never, ever, follow a curved path. Studies have been made using hidden cameras showing that people always walk in straight paths, no matter what the landscape design tries to coax them into doing. The usual result of a curved pathway is the creation of a footworn path that makes the curved path straight. As you can see, we took our designer's advice.

This is the third part in a series of posts on reification. In Part 1, I tried to explain what reification is; in my second post I gave some examples of how to use reification using RDFa. In a philosophical interlude on truth on the internet, I made it pretty clear why I think it's really important to include and retain sourcing and provenance information whenever you try to collect information from the internet. In this part 3, I promised to discuss the pros and cons of reification. I lied. RDF Reification has been nothing but disastrous for the semantic web. The problem is that RDF tries to lead implementers along a strangely curved path if they want to do the "right" thing and keep track of sourcing and provenance of the knowledge loaded into a triple-store. I have a strong suspicion that no one, anywhere, ever in the history of RDF, has made significant use of the reification machinery. I have asked a fair number of semantic web implementers and none of them have ever used reification.

Semantic Web implementers certainly don't ignore the imperatives of sourcing and provenance, but instead of using reification they make the equivalent of straight worn dirt paths. Typically they won't use pure triple stores, instead treating triples as first-class data objects that can be joined to separate tables with provenance information, or else they build knowledge models that make the provenance and source explicit, as Google's RDFa review vocabulary does.

Alternatively, Semantic Web implementers may choose to ignore the retention of provenance and sourcing and treat their RDF triple-store as a pristine, never-changing collection of truth. For many applications, this works quite well, but it rapidly becomes unworkable when many sources of information must be merged. RDF works great for the collection, transmission and processing of unchanging, unpolluted, uncontroversial knowledge; on this blog, I will from now on refer to this sort of information as UnKnowledge.

To my mind, there is a deeper problem with reification, and that relates to what an RDF triple really means. My view is that an RDF triple means absolutely nothing, and that it is only the action of asserting a triple that has meaning. The deep problem with reification is that it's hard to do, and thus nobody does it. It also forces implementers to think too much about semantics, and thinking too much about semantics is always a bad thing. Too often you end up dizzy like a dog chasing its tail.

The RDF working group has produced an entire document trying to clarify what the semantics of RDF are. Here is an example paragraph to study:
The semantic extension described here requires the reified triple that the reification describes - I(_:xxx) in the above example - to be a particular token or instance of a triple in a (real or notional) RDF document, rather than an 'abstract' triple considered as a grammatical form. There could be several such entities which have the same subject, predicate and object properties. Although a graph is defined as a set of triples, several such tokens with the same triple structure might occur in different documents. Thus, it would be meaningful to claim that the blank node in the second graph above does not refer to the triple in the first graph, but to some other triple with the same structure. This particular interpretation of reification was chosen on the basis of use cases where properties such as dates of composition or provenance information have been applied to the reified triple, which are meaningful only when thought of as referring to a particular instance or token of a triple.
I've read that paragraph over and over again; I've finally concluded that it is an example of steganography. Here is how I have decoded it:
the semantic extension described HEre requires the reified tripLe that the reification describes - i(_:xxx) in the above example - to be a Particular token or InstAnce of a triple in a (real or notional) rdf docuMent, rAther than an 'abstract' triPle consideRed as a grammatIcal form. there could be Several such entities which have the same subject, predicate and Object properties. although a graph is defiNed as a set Of tRiples, several such tokens wIth the same triple structure might occur i different documents. thus, it would be meNAningful to Claim thAt the blank node in the second Graph abovE does not refer to the triPLE in the first grAph, but to Som other tERiplE with the Same struCtUrE. THIS Particular interpretatiOn Of Reification was choSen On the basis of Use cases where properties such as dates of composition or provenance information have been appLied to the reified triple, whicH are mEaningfuL only when thought of as referring to a Particular instance or token of a triple.
I'll try to suggest some ways that we might rescue RDF and the Semantic Web in a future post.

Wednesday, May 27, 2009

Twitterdata and How Chinese Could Be the Future of Tweeting

True confession time: I love Unicode. I think that Unicode was one of the most important achievements of 20th century civilization. I am so much of a Unicode wonk that one of my first thoughts when I heard about Twitter was "I wonder whether it's 140 characters or 140 bytes?" If I were a true Unicode geek rather than a Unicode wonk, it wouldn't have taken until today for me to do the test to see for sure. In case you're wondering, it really is 140 characters; if it were bytes, you'd only get to send 70 Chinese characters in a tweet. But there's a catch- the SMS network restricts a message to 70 characters if it contains any characters outside the basic GSM alphabet, because such messages have to be sent in the two-byte UCS-2 encoding. So somehow international tweets are sent as two SMS messages if Chinese characters are used. There's also the catch that tweet recipients may not be equipped to handle the full suite of Unicode characters that you might want to send. I had to change a setting on TweetDeck before I could see my Chinese tweet; I was unsuccessful in sending a legible Chinese SMS from Skype to my iPhone- not sure where that problem comes from!
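The character-vs-byte distinction is easy to check for yourself. Here's a quick Python sketch (the three characters are the same easy ones I bring up at the end of this post):

```python
# Python's len() counts Unicode characters; encoding to bytes counts bytes.
tweet = "中山水"  # three Chinese characters

chars = len(tweet)                           # characters, as Twitter counts
utf8_bytes = len(tweet.encode("utf-8"))      # 3 bytes per CJK character in UTF-8
ucs2_bytes = len(tweet.encode("utf-16-be"))  # 2 bytes each, as SMS UCS-2 counts

print(chars, utf8_bytes, ucs2_bytes)  # → 3 9 6
```

So a 140-character all-Chinese tweet is a perfectly legal tweet, but at 280 bytes of UCS-2 it overflows a single 70-character SMS.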

I've been spurred into Unicode tweeting because of a recent proposal called Twitterdata. If you've been reading my blog for a while, which I know you haven't been, you'll know that I've been interested in the way that Twittering seems to be developing in ways that resemble the development of human languages. I'm certainly not the first one to notice that Twitter has many semantic-web-like features, and there has been discussion about ways to add semantics to the Twitter stream. The Twitterdata people have made a very interesting proposal: they suggest some very simple additions to tweet grammar that would make tweets more meaningful to machines. They suggest using the "$" character to denote the name in a machine-readable name-value pair. I think this proposal is brilliant, but my thoughts on the matter are entirely irrelevant, because the Twitterdata proposal has an approximately zero chance of being widely adopted. My prediction is that one year from now, there will have been more human-generated tweets in Klingon than in Twitterdataese. Here are the reasons I think that:
  1. Twitterdataese is ugly. Example: "@bdelacretaz: #wmodata $id DW1428 $temp 69F $wangle 232 $wspeed 4.0mph $rh 50% $dew 49F $press 1015.2mb http://bit.ly/lxvlh #twitterdata". I rest my case.
  2. Twitterdataese doesn't lend itself well to imitation. In a previous post, I discussed the importance of imitation in the establishment of languages. Without reading the Twitterdata documentation, can you figure out what the "$" does in the tweet "@toddfast $likes movies $likes Twitter"? I don't think I would have been able to.
  3. Twitterdata doesn't relieve pain. When I started my first company, a more seasoned entrepreneur gave me some great advice: "People will spend a lot of money to relieve a toothache- but they're much more reluctant to spend money on toothache prevention. Make sure the product you're selling relieves someone's pain." Somehow I doubt that @toddfast would be suffering much if he just liked movies as opposed to $liking them.
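To be fair, the machine's side of the bargain is genuinely easy. A few lines of Python (my own guess at a parser, not anything from the Twitterdata proposal itself) can pull the name-value pairs out of the tweet in reason 1:

```python
import re

# A minimal sketch of how a machine might extract $name-value pairs from a
# Twitterdata-style tweet: a "$"-prefixed key followed by one
# whitespace-delimited value. The regex is my own invention.
def parse_twitterdata(tweet: str) -> dict:
    return {key: value for key, value in re.findall(r"\$(\w+)\s+(\S+)", tweet)}

tweet = ("@bdelacretaz: #wmodata $id DW1428 $temp 69F $wangle 232 "
         "$wspeed 4.0mph $rh 50% $dew 49F $press 1015.2mb "
         "http://bit.ly/lxvlh #twitterdata")
print(parse_twitterdata(tweet))
# → {'id': 'DW1428', 'temp': '69F', 'wangle': '232', 'wspeed': '4.0mph',
#    'rh': '50%', 'dew': '49F', 'press': '1015.2mb'}
```

Easy for the machine; the trouble, as the reasons above suggest, is on the human side.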
So what is causing Twitterers pain? Or to cast the question in terms of language evolution, what are the competitive pressures on Tweet vocabulary and syntax? So far I have been able to discern two strong competitive pressures.
  1. findability- Twitterers want their tweets to be found and read. This pressure is addressed by hashtags.
  2. terseness- Twitterers want to say more in one tweet than permitted by 140 characters. This pressure has led to the proliferation of URL shorteners (another thing I would like to write about) and to innumerable ROTFL and LOL inventions.
Which brings me back to Unicode tweeting. Chinese characters are much terser in terms of character count than any nonideogrammatic language. So maybe, just maybe, we'll start seeing Chinese characters creep into our tweeting to help us say more with fewer characters- first the really easy characters like 中 for China or 山 for mountain or 水 for water. 好吗?

Tuesday, May 26, 2009

There is no truth on the internet

In his retirement, my father took up genealogy as a hobby, and after he died, his database of thousands of ancestors (most of them in northern Sweden) passed to me. If you're interested, you can browse through them on the hellman.net website. Having all this data up on the web has been rather entertaining. Every month or so, I get an e-mail from some sixth cousin or such who has discovered a common ancestor through a google search, and the resulting exchanges of data allow me to make occasional corrections and additions.

Since I've taken the database on, huge amounts of genealogical information have become available on the internet. When I first started finding this information, I made the mistake of trying to suck it into my database, since I had become more or less a professional data sucker and spewer in my work life. Once I had spent hour after hour pulling data in, I started to wonder what the point of it all was. Could I really determine, and did I really care, whether Erik Eriksson, born 1837 in Backfors, was really my fourth cousin thrice removed or not? What is the relationship between the data I sucked in and the truth about all the real people listed in the database? I quickly regretted my data gluttony.

Traditional genealogists focusing on Sweden use a variety of material as primary sources of information. Baptismal records typically give a child's name and birthdate along with the names of the parents; burial and marriage records similarly give names and dates. The genealogist's job is to connect names on different records to construct a family tree. But things are not always simple. Probably 20% of males in the Backfors region were named Erik, and since patronymics were used, 20% of those males were also named Eriksson, though the name might be abbreviated in the records as "Ersson". To judge whether a girl named Hanna, whose 1877 birth record lists "Erik Eriksson" as the father, is really the daughter of the Erik Eriksson born in 1837 in Backfors, the genealogist must weigh all the available information together with conditional probabilities.

The internet genealogist (e.g., me) has a different task. Rather than looking at the birth records and assessing the likelihood of name coincidences, the internet genealogist looking at the same question searches the internet and finds that the web site "sikhallan.se" lists Hanna as Erik's daughter. The internet genealogist then makes a judgement about the reliability of the Sikhallan website. For example, how do we know that Sikhallan's source for Erik's birthdate isn't just the hellman.net website? If the two databases disagree, who should be believed? In my case, I just look at my father's meticulous notes about where his information comes from and if he noted some uncertainty, then I'm much more likely to believe the other sources available to me. Unless of course my data has come from one of my data sucking binges, in which case the source of my data has been lost and I can no longer judge its reliability.

In my last two posts on reification (Part 1, Part 2), I promised that I would have a third post evaluating whether the reification machinery in RDF was worth the trouble. This is not that third post; this is more of a philosophical interlude. You see, another way to look at genealogical information on the internet is to think of it as a web of RDF triples. For example, imagine if Sikhallan made its data available as a set of triples, e.g. (subject: Erik Eriksson; predicate: had daughter; object: Hanna). Then we could load up all the triples into an RDF-enabled genealogy database, and all our problems would be solved, right? Well, yes, unless of course we wanted to retain all the supporting information behind the data, the data provenance, all the extra care in citation of sources taken by my father and ignored by me in my data-sucking orgies. In reality, the triple itself is worthless, devoid of assessable truth. If the triple were associated with provenance information, its truth would become assessable, and thus valuable. The mechanism that RDF provides for doing things like this is... reification.
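To make the contrast concrete, here is a minimal sketch (plain Python, no RDF library; the facts and source labels are just the example above) of what keeping provenance alongside each triple looks like, instead of sucking in bare triples:

```python
# Each imported fact is stored as (triple, source) instead of a bare triple,
# so that conflicting claims can still be weighed against each other later.
def import_triples(store, triples, source):
    for t in triples:
        store.append((t, source))

store = []
import_triples(store, [("Erik Eriksson", "had daughter", "Hanna")],
               "sikhallan.se")
import_triples(store, [("Erik Eriksson", "born", "1837 in Backfors")],
               "hellman.net (father's notes)")

# Every fact can now be traced back to whoever asserted it:
for (s, p, o), source in store:
    print(f"{source} asserts: {s} {p} {o}")
```

A pure triple store throws away the second element of each pair, and with it any hope of judging reliability when two sources disagree.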

Wikipedia is the most successful knowledge aggregation on the internet today and is also, not coincidentally, the best example of the value of comprehensive retention of provenance and attribution. Wikipedia keeps track of the date and author of every change in its database, and relentlessly purges anything which is not properly cited. Wikipedia is, in my opinion, the best embodiment of my view that there is no truth on the internet- there are only reified assertions.

Wednesday, May 20, 2009

Reif#&cation Part 2: The Future of the RDF, RDFa, and the Semantic Web is Behind Us

In Reif#&cation Part 1, I introduced the concept of reification and its role in RDF and the Semantic Web. In Part 3, I'll discuss the pros and cons of reification. Today, I'll show some RDFa examples.

I've spent the last couple of days catching up on lots of things that have happened over the last few years while the semantic web part of my brain was on vacation. I was hoping to be able to give some examples of reification in RDFa using the vocabulary that Google announced it was supporting, but I'm not going to be able to do that, because the Google vocabulary is structured so that you can't do anything useful with reification. There are some useful lessons to draw from this little fact. First of all, you can usually avoid reification by designing your domain model to avoid it, and you probably should if you can. In the Google vocabulary, a Review is a first-class object with a reviewer property. The assertion that a product has a rating of 3 stars is not made directly by a reviewer, but indirectly by a review created by a reviewer.

Let's take a look at the html snippet presented by Google on their help page for RDFa (It's permissible to skip past the code if you like.):


<div xmlns:v="http://rdf.data-vocabulary.org/#"
typeof="v:Review">
<p><strong><span property="v:itemReviewed">
Blast 'Em Up</span>
Review</strong></p>
<p>by <span rel="v:reviewer">
<span typeof="v:Person">
<span property="v:name">Bob Smith</span>,
<span property="v:title">Senior
Editor</span> at ACME Reviews
</span>
</span></p>
<p><span property="v:description">This is a great
game. I enjoyed it from the opening battle to the final
showdown with the evil aliens.</span></p>

</div>

(Note that I've corrected a bunch of Google's sloppy mistakes here- the help page erroneously had "v:person", "v:itemreviewed" and "v:review" where "v:Person", "v:itemReviewed" and "v:Review" would have been correct according to their published documentation. I've also removed an affiliation assertion that is hard to fix for reasons that are not relevant to this discussion, and I've fixed the non-well-formedness of the Google example.)

The six RDF triples embedded here are:

subject: this block of html (call it "ThisReview")
predicate: is of type
object: google-blessed-type "Review"

subject: ThisReview
predicate: is reviewing the item
object: "Blast 'Em Up"

subject: ThisReview
predicate: has reviewer
object: a google-blessed-type "Person"

subject: a thing of google-blessed-type "Person"
(call it BobSmith)
predicate: is named
object: "Bob Smith"

subject: BobSmith
predicate: has title
object: "Senior Editor"

subject: ThisReview
predicate: gives description
object: "This is a great game. I enjoyed it from the
opening battle to the final showdown with the evil
aliens."

Notice that in Google's favored vocabulary, Person and Review are first-class objects and the item being reviewed is not (though they defined a class that might be appropriate). An alternate design would be to make the item a first-class object and the review a predicate that could be applied to RDF statements. The seven triples for that would be:

subject: a thing of google-blessed-type "Product"
(call it BlastEmUp)
predicate: is named
object: "Blast 'Em Up"

subject: BobSmith
predicate: is named
object: "Bob Smith"

subject: BobSmith
predicate: has title
object: "Senior Editor"

subject: an RDF statement (call it TheReview)
predicate: has creator
object: BobSmith

subject: TheReview
predicate: has subject
object: BlastEmUp

subject: TheReview
predicate: has predicate
object: gives description

subject: TheReview
predicate: has object
object: "This is a great game. I enjoyed it from the
opening battle to the final showdown with the evil
aliens."

To put those triples in the same HTML, I do this:


<div xmlns:v="http://rdf.data-vocabulary.org/#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
typeof="rdf:Statement"
rel="dc:creator"
href="#BobSmith">
<p><strong>
<span property="rdf:subject">
<span typeof="v:Product">
<span property="v:name">Blast 'Em Up</span>
</span>
</span> Review</strong></p>
<p>by <span typeof="v:Person" id="BobSmith">
<span property="v:name">Bob Smith</span>,
<span property="v:title">Senior Editor</span>
at ACME Reviews
</span></p>
<p><span property="rdf:predicate"
resource="v:description"/>
<span property="rdf:object">This is a great
game. I enjoyed it from the opening battle
to the final showdown with the evil
aliens.</span></p>
</div>

I've drawn one extra bit of vocabulary from the venerable "Dublin Core" vocabulary, "dc:creator", to do this.

Some observations:
  1. Reification requires a bit of gymnastics even for something simple; if I wanted to reify more than one triple, it would start to look really ugly.
  2. By using a well-thought-out knowledge model, I can avoid the need for reification.
  3. The knowledge model has a huge impact on the way I embed the information.

This last point is worth thinking about further. It means that for you and me to exchange knowledge using RDFa or RDF, we need to share more than a vocabulary; we need to share a knowledge model. It reminds me of another story I heard on NPR, about the Aymara people of the Andean highlands, whose language expresses the future as being behind them, whereas in English and other western languages the future is thought of as being in front of us. We could learn the Aymara vocabulary for front and back, but because we don't share the same knowledge model, we wouldn't be able to speak successfully with an Aymara speaker about the past and the future.

Friday, May 15, 2009

Reif#&cation Part 1: RDF and the dry martini

A man walks into a bar. The bartender asks him what he wants. "Nothing," he says.
"So why did you come in here for nothing?" asks the bartender.
"Because nothing is better than a dry martini."

This joke is an example of reification. An abstract concept, "nothing", is linguistically twisted into a real object, resulting in a humorous absurdity. I first encountered the concept when, 10 years ago, I learned RDF (Resource Description Framework), the data model which was designed to be the fundamental underpinning of the semantic web. At that time, I was sure that "reification" was a completely made-up word used as jargon borrowed from the knowledge representation community. It's only this week that I learned that in fact, "reification" is a "macaronic calque" translation of a completely made-up German word used prominently by Karl Marx, "Verdinglichung". Somehow that doesn't make me feel much better about the word. If you learn nothing else from reading this, remember that you can use "reification" as a code word to gain admittance to any gathering of the Semantic Webnoscenti.

In RDF, reification is necessary so that stores of triples can avoid self-contradiction. Let me translate that into English. RDF is just a way to say things about other things so that machines can understand. The model is simple enough that machines can gather together large numbers of RDF statements, apply mathematical machinery to the lot of them, and then spit out new statements that make it seem as though the machines are reasoning. The problem is that machines are really stupid, so if you tell them that the sky is blue, and also that the sky is not blue, they can't resolve the contradiction and they start emitting greenhouse gases out the wazoo and millions of people in low-lying countries lose their homes to flooding. What you need to do instead is to "reify" the contradictory statements and tell the machine "Eric said the 'the sky is blue'" and "Bruce said 'the sky is not blue'". RDF, as a system, can't talk about the assertions that it contains without doing the extra step of reifying them.

So let's see how the RDF model accomplishes this (remember, RDF represents assertions as a set of (subject, predicate, object) triples). We start with:
Subject: The sky
Predicate: is colored
Object: blue
And after reification, we have:
Subject: statement x
Predicate: has Subject
Object: The sky

Subject: statement x
Predicate: has Predicate
Object: is colored

Subject: statement x
Predicate: has Object
Object: blue

Subject: Eric
Predicate: said
Object: statement x
So now the statement about the color of the sky has become a real thing within the RDF model, and I can do all sorts of things with it, such as compare it to a dry martini. The downside is that this comes at the cost of turning one triple into three (plus a fourth to say who asserted it).
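Written out as plain Python tuples (just a sketch of the data shapes, not any particular RDF library's API), the transformation above looks like this:

```python
# Before reification: a single bare triple.
original = ("The sky", "is colored", "blue")

# After reification: the statement itself becomes a thing ("x") that
# other triples can talk about, such as the claim that Eric said it.
reified = [
    ("x", "has Subject", "The sky"),
    ("x", "has Predicate", "is colored"),
    ("x", "has Object", "blue"),
    ("Eric", "said", "x"),
]
print(len(reified))  # → 4
```

Note that "The sky is colored blue" no longer appears anywhere as a triple; only the description of the statement does.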

Reification has analogs in other disciplines. Software developers familiar with object-oriented programming may want to think of reification as making the assertion into a first-class object. Physicists and people who just want their minds blown may want to compare reification to "second quantization". At this point, I'll don my ex-physicist hat (even though I never wore a hat while doing physics!) and tell you that second quantization is the mathematical machinery of field theory that allows field theory to treat bundles of waves as if they were real particles that can be created and annihilated.

Whether you're doing linked open data or quantum field theory, it's a good idea to focus on things that behave as if they were real. Otherwise, no dry martinis for you!

This is the first part of three articles on reification. In Part 2, I'll show how reification is applied in a real example, using the newly trendy RDFa. In Part 3, I'll write about whether reification is a good idea.

Wednesday, May 13, 2009

My Pathetic Life (#mpl) and Collaborative Intelligence

I'm still pretty new to Twitter, but it feels pretty familiar to me. Part of the reason for that is that some of my Facebook friends have been parallel posting their status on Facebook and on Twitter. This mostly annoyed me, because Twitterers update their statuses more often than facebookers, and a lot of the Twitter vernacular is totally inexplicable when viewed on Facebook. On Twitter, by contrast, I find that I'm annoyed when people that I follow, but don't really know, mix details from their personal lives into their otherwise interesting Twitter streams. For example, I follow dchud because I know that he will throw out some very interesting ideas. (He was the very first person to follow me on Twitter; I was the very first person to comment on his blog way back when). But I'm not really interested in his reports on the Washington Capitals. Somehow I find that Facebook is a much better place to get to know details like that- if you friend me there, you'll find that I'm a rabid fan of the Philadelphia Phillies, and I won't mind it if you update me on the triumphs of your Columbus Blue Jackets. We both knew they would lose eventually.

The thing that intrigues me about Twitter is that it does so little so well that it's really a lot easier to fix the problems that it has. The past two days I've been writing about the challenges of propagating vocabulary (and grammar for that matter!) for use in the semantic web. Yesterday, Google demonstrated one way of propagating vocabulary- be big and powerful and just tell the world what vocabulary to use. Ian Davis called Google's approach to implementing RDFa "a damp squib" which is what Americans would less colorfully call a "dud" or a wet firecracker. He lamented that Google had chosen to use their own limited vocabulary rather than adopt vocabulary already in use. R.V. Guha, who I mentioned in yesterday's post, commented on Davis' blog that we shouldn't judge too soon what Google is doing. A lot of us are hoping that "igniter fuse" will turn out to be an apter pyrotechnical analogy.

The other strategy for vocabulary propagation is based on community-based collaboration. In my post yesterday, I complained that it was hard to find vocabulary that I might use to attach an ISBN to a resource. In contrast, Twitter, together with the accessories that are built around it, seems to enable rapid propagation of vocabulary and grammar. So back to my complaint about how Twitter streams seem to annoyingly mix tweets of varying interest. One way that people deal with this is to use multiple accounts to organize their tweets into different genres, the same way you might want to have a business email and a personal email. Perhaps a better, more flexible way to address this would be to adopt a special hashtag to signal that a tweet is not a product of one's brilliant intellect, but rather just a status message about "my personal life" (#mpl). That way, your mom can easily filter out your irrelevant work stuff and your boss can filter out your irrelevant personal stuff. Anyway, if you think this is a good idea, see if you can help propagate the use of #mpl (let's call the idea #mplIdea). If you don't think it's a good idea, or if there's some better way to fix this Twitter deficiency, leave a comment. Let's see if we can demonstrate the power of the collaborative approach to vocabulary propagation.
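As a sketch of what #mpl would buy us (the tweets here are invented for illustration), the filter your mom or your boss would need becomes nearly a one-liner:

```python
# Drop personal-life tweets marked with the proposed #mpl hashtag.
def is_personal(tweet: str) -> bool:
    return "#mpl" in tweet.lower()

tweets = [
    "Reification considered harmful: new blog post",
    "Watching the Phillies game with the kids #mpl",
]
work_stream = [t for t in tweets if not is_personal(t)]
print(work_stream)  # → ['Reification considered harmful: new blog post']
```

The machinery is trivial; as with Twitterdata, the hard part is getting humans to adopt the convention.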

The adoption of RDFa by Google and their centralized approach to vocabulary may be a turning point in the first stage of the semantic web- that of using the web to aggregate data. I think this approach is not going to take us very far. We need to start building the second stage of the semantic web. We should be thinking about collaborative intelligence rather than about accumulating distributed sets of data. I don't expect that machines will be able to come up with ideas like #mplIdea, but I do think it's reasonable for machines to be able to help us judge whether ideas like #mplIdea are inspired (and should be propagated) or whether they're just stupid.

Tuesday, May 12, 2009

Google, RDFa, and Reusing Vocabularies

Yesterday, I wrote about one difficulty of having machines talk to other machines- propagation and re-use of vocabularies is not something that machines being used today know how to do on their own. I thought it would be instructive to work out a real example of how I might find and reuse vocabulary to express that a work has a certain ISBN (international standard book number). What I found (not to my great surprise) was that it wasn't that easy for me, a moderately intelligent human with some experience at RDF development, to find RDF terminology to use. I tried Knoodl, Google, and SchemaWeb to help me.

Before I complete that thought, I should mention that today Google announced that they've begun supporting RDFa and microformats in what they call "rich snippets". RDFa is a mechanism for embedding RDF in static HTML web pages, while microformats are a simpler and less formalized way to embed metadata in web pages. Using either mechanism, web page authors can hide information in structures meant to be read by machines in the same web pages that humans can read.

Concentrating on just the RDFa mechanism, it's interesting to see how Google expects that vocabulary will be propagated to agents that want to contribute to the semantic web: Google will announce the vocabulary that it understands, and everyone else will use that vocabulary. Resistance is futile. Not only does Google have the market power to set a de facto standard, but it has the intellectual power to do a good job of it- one of the engineers on the Google team working on "rich snippets" is Ramanathan V. Guha, who happens to be one of the inventors of RDF.

You would think it would be easy to find an RDF property declared for use in assertions like "the ISBN of 'Digital Copyright' is 1-57392-889-5". No such luck. Dublin Core, a schema developed in part by the library community, has an "identifier" element which can be qualified to indicate that the element contains an ISBN, but no isbn property. Maybe I just couldn't find it. Similarly, MODS, which is closely related to library standards, has an identifierType element type that can contain an ISBN, but you have to add type="isbn" to the element to make it an ISBN. Documentation for RDFa wants you to use the ISBN to make a URN and to make this the subject of your assertion, not an attribute (ignoring the fact that an ISBN identifies a thing you sell in a bookstore, for example the paperback version of a book, rather than what most humans think of as a book). I also found entries for isbn in schemes like The Agricultural Metadata Element Set v.1.1 and a mention in the IMS Learning Resource Meta-Data XML Binding. Finally, I should note that while OpenURL (a standard that I worked on) provides an XML format which includes an ISBN element, it's defined in such a way that it can't be used in other schemas.
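To make the contrast concrete, here are the two main styles written as bare (subject, predicate, object) tuples. The example URI is invented; the prefixes stand for the Dublin Core and RDFa-documentation conventions described above:

```python
# Dublin Core style: the ISBN is a qualified identifier literal on the book.
dc_style = ("http://example.org/digital-copyright",
            "dc:identifier", "URN:ISBN:1-57392-889-5")

# RDFa-documentation style: the urn:isbn URI itself is the subject.
rdfa_style = ("urn:isbn:1-57392-889-5",
              "dc:title", "Digital Copyright")

print(dc_style[2], "|", rdfa_style[0])
```

In one style the ISBN is a value you attach; in the other it is the name of the thing itself. Neither gives you a simple isbn property to reuse.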

The case of ISBN illustrates some of the barriers to vocabulary reuse, and although there are those who are criticizing Google for not reusing vocabulary, you can see why Google thinks it could work better if they just define vocabulary by fiat.

Monday, May 11, 2009

Dancing Parrots and how the Semantic Web will Happen

Last week there was a story on NPR about a dancing parrot. A neuroscientist in San Diego discovered a sulfur-crested cockatoo named Snowball dancing to the Backstreet Boys on a YouTube video. There were a number of interesting aspects to the story, including the fact that YouTube is now being used as a research corpus for animal behavior. What caught my ear, however, was the mention of follow-on YouTube research by a graduate student in the psychology department at Harvard, which found that the only animals that exhibited dancing skills on YouTube were 14 species of parrot and an elephant. The graduate student notes that like humans — and unlike dogs or cats — parrots and elephants are both known to be vocal mimics. They can imitate sounds. The hypothesis, then, is that our ability to dance is a byproduct of our ability to vocally mimic others.

In the development of language, mimicry is crucial. We learn to speak by repeating what others are saying. We acquire vocabulary by hearing the words that others use. We almost never acquire vocabulary by looking for words in a dictionary.

Last week, I attended a "Semantic Web Meet-Up" in NYC. One of the speakers was describing Knoodl.com, which is described as facilitating "community-oriented development of OWL based ontologies and RDF knowledgebases." I think it's fair to say that Knoodl would like to be a sort of dictionary for the semantic web. To my mind, the semantic web just hasn't happened yet because it's been very hard to connect data from different knowledge silos. Vocabulary used in one silo tends not to get used in other silos. Ironically, the twitterers attending the meeting couldn't even arrive at a common hashtag - at least three tags for the meeting were used. I'm not sure if that fact says much about Twitter hashtags or about the people attending the Meetup. One of the most intriguing things to me about Twitter has been to observe how hashtags are propagated. I find myself mimicking others as I slowly learn the vocabulary and grammar of the new environment. It struck me that it is this quality of Twitter that makes me want to anoint it as a substrate for semantic web actualization.

Maybe the semantic web needs more than just dictionaries and registries and authorities and linked data to become the next big thing. Maybe what it really needs is some dancing parrots. Software agents that have the capacity to mimic the semantics of the other software agents in a global environment.

Friday, May 8, 2009

Death, another unsolved problem. Wait... Facebook says it's a "Known Problem"

Last September, the library world lost one of its really nice people and great resources. Johan van Halm was one of those people I would see at practically every conference I went to. I remember meeting him a bit more than 10 years ago when I had just started in the library and information business. He went out of his way to introduce me to many of his friends, and I soon realized that he knew just about everyone in the library automation world. We would make it a point to get together for a drink at least once every ALA. I really miss him.

I'm not writing about Johan today, though - I had to get that out of the way. I wanted to write about LinkedIn. You see, Johan is a "connection" of mine on LinkedIn. Every time I look at the list of my connections on LinkedIn, there I see Johan, living on forever in social network space. I could remove him as a connection, of course, but somehow that just doesn't seem right. Also, LinkedIn encourages you to treat your connections and the network they expose to you as valuable assets, and it protects them accordingly. In contrast, your Facebook and Twitter "friends" are by default exposed to just about everyone. So on LinkedIn, it would not be to one's advantage to unconnect with someone just because they've left the corporeal parts of this life.

I hope you're sitting down, dear reader, because I have some serious things to discuss with you. Neither you nor I will forever avoid "becoming deceased". However, to make this bleak situation a bit easier to swallow, let's assume, just for fun, that both you and I are going to live forever. In that case, I can pretty much guarantee you that everyone who is following you now on Twitter, all your Facebook Friends, all your LinkedIn Connections, all the friends of your friends, and even all the email accounts that you send jokes to, will all either pass away into inactivity or become zombies controlled by someone else who is probably not your friend or connection.

Now it might just be that all the social networks we have become so enthralled with have simply assumed that they themselves will have gone belly-up or through their liquidity event before the death problem becomes severe. Or more likely, it's just that they experience such a large volume of accounts fading away into inactivity that having users die is only a small perturbation on their services. But it is certainly the case that as these services become more important to our lives, the fact that we are not immortal must increasingly be addressed.

LinkedIn has done it this way:

What if I see a Profile of someone who is deceased?

Unfortunately, there may be a time when you come across a Profile of a deceased colleague, classmate or connection. If this occurs you are welcome to notify Customer Service that the Profile still exists and may need to be removed. We ask that you provide any important information about the deceased member that may aid our Privacy Department in their investigations and act on the account accordingly. Items to provide in your email would be one or two of the following:
  1. An Obituary Link.
  2. A Death Notice.
  3. Consular Report of Death.
  4. Death Certificate.
The profile "may need to be removed"?????

Facebook is also ready with a policy:
Profile: Bugs and Known Problems

I’d like to report a deceased user or an account that needs to be memorialized.

Please report this information here so that we can memorialize this person’s account. Memorializing the account removes certain more sensitive information like status updates and restricts profile access to confirmed friends only. Please note that in order to protect the privacy of the deceased user, we cannot provide login information for the account to anyone. We do honor requests from close family members to close the account completely.
"Bugs and Known Problems"???? There are some other odd results when you search Facebook help for "death", and following one of these results gets you a link to Facebook's "Deceased" page.

GMail has an elaborate paper based procedure to access a deceased person's mail:
Accessing a deceased person's mail

If an individual has passed away and you need access to the content of his or her mail, please fax or mail us the following information:

  1. Your full name and contact information, including a verifiable email address.
  2. The Gmail address of the individual who passed away.
  3. a. The full header from an email message that you have received at your verifiable email address, from the Gmail address in question. (To obtain the header from a message in Gmail, open the message, click 'More options,' then click 'Show original.' Copy everything from 'Delivered-To:' through the 'References:' line. To obtain headers from other webmail or email providers, please refer to http://mail.google.com/support/bin/answer.py?hl=en&answer=22454#)
     b. The entire contents of the message.
  4. Proof of death.
  5. One of the following: a) if the decedent was 18 or older, please provide a proof of authority under local law that you are the lawful representative of the deceased or his or her estate or b) if the decedent was under the age of 18 and you are the parent of the individual, please provide a copy of the decedent’s birth certificate.
Postal Mail:

Google Inc.
Attention: Gmail User Support
1600 Amphitheatre Parkway
Mountain View, CA 94043
Fax: 650-644-0358

After we've received the above information, we'll need 30 days to process and validate the documents that you've provided. If you need access to the address sooner, in accordance with state and federal law, it is Google's policy to only provide information pursuant to a valid third party court order or other appropriate legal process. Please note that our ability to comply with these requests varies according to applicable law.
I'll bet that even if you have made a last will and testament, you've not included any instructions there about what your executor is to do with your email accounts, your blog passwords, your websites, or your social networks. Your family would probably want access to your Flickr and YouTube accounts. Also, if you think about it, you don't really want your executor poking around in your e-mails, especially that address you use only for illicit activity.

Part of my interest in the death problem stems from my interest in the Google Book Search settlement. You see, book authors and publishers have been ignoring the death problem for much longer than the Facebooks and Gmails of the world have been. The result is that many works are "orphaned", which means that the rights holders cannot be found or died without leaving instructions or documentation about what to do with their intellectual property. It's worse outside the US, because the duration of copyright protection frequently depends on the death date of the author, which can be rather difficult to ascertain. Now that we have the technology to make out-of-print books in libraries generally available through the internet, the corpus of orphan works is once again important, but copyright law presents barriers to many uses, particularly those with economic value less than the cost of finding rights holders and obtaining permissions.

Do you think that perhaps Google is saving its deceased-person access requests for the day 50 years from now when they will become relevant to copyright status? I'll bet the answer is NO.

Tuesday, May 5, 2009

Smart Social Networks (consult a health professional before reading)

Different social networks have different properties with respect to transmission of information. Physicists have studied these sorts of networks in a variety of contexts- the problems of electronic conduction in random media and oil propagation in porous rock are deeply connected to the problems of virus transmission between people or between computers, for that matter.

A phenomenon that occurs in many of these networks is that propagation through the network can depend exponentially on parameters such as connection strength, connectedness of nodes, dimensionality of the network, etc. To use Twitter as an example, a link might propagate to ten times as many people if the retweet rate changes by 10% (numbers completely fabricated). Or it might be that Macs are a million times less likely to be infected by a virus even though they are only 10 times less likely to retransmit one - the product of a factor of 5 lower density and a factor of 2 intrinsic security (numbers that might not be true, but should be).
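The numbers above are fabricated, but the arithmetic behind them is easy to sketch. Treat retweeting as a simple branching process: if each recipient passes a link on to r new people on average, the expected total audience per seed (for r below the critical value 1) is the geometric series 1 + r + r² + ... = 1/(1 - r). Near the critical point, a small change in r makes an outsized change in reach:

```python
# Expected total audience of a subcritical cascade in which each
# recipient forwards to r new people on average (r < 1):
#   1 + r + r^2 + ... = 1 / (1 - r)
def expected_reach(r):
    assert 0 <= r < 1, "formula only holds below the critical point r = 1"
    return 1.0 / (1.0 - r)

# A 10% bump in the forwarding rate, from 0.90 to 0.99,
# multiplies the expected audience by ten (10 vs. ~100 per seed):
print(expected_reach(0.90))
print(expected_reach(0.99))
```

The same sensitivity runs in reverse: shaving even a little off the transmission rate collapses the expected size of an outbreak, which is why unglamorous measures like hand-washing can matter far more than intuition suggests.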

I've been thinking a lot about propagation on networks this past week because of the news on influenza A (H1N1), AKA swine flu. You see, I got back from a week's vacation in Mexico a week ago, and I was just amazed at the outbreak of flu hysteria. The amount of sheer stupidity out there is just appalling. I am reminded of Ionesco's Rhinocéros when I hear how common sense has also taken a vacation and gotten the flu. It's particularly striking when raw prejudice is accepted by otherwise rational people as prudence. One of many amazing bits of non-thought that I've heard or seen this past week: perfectly healthy students at Slippery Rock State University who went to Mexico are being prevented from attending graduation.

Lost in this are many shreds of rationality: Let's suppose a pandemic is really going to happen. That means that all of us will get exposed sooner or later. From the public health point of view, the spread needs to be slowed sufficiently that public health resources can be marshalled in plenty of time. From the individual point of view, however, the worst thing for you is to contract the disease at the peak of the pandemic, when society's resources are most strained. There are two ways to avoid this- one is to become a hermit and hope you are not exposed until after the peak. The other is to contract the disease early, before the public health system comes under strain (making sure not to give it to anyone else). It's just the flu!

Maybe part of the reason for hysteria is that people don't understand the exponentials I mentioned above. When confronted with a potentially deadly pandemic, it's hard to imagine that something as simple as washing your hands could stop the scary monster in its tracks. You know intuitively that washing your hands will only protect you a little bit, so how can that be the answer?

For the purposes of this blog, the swine flu is just a chunk of information that has the ability to be transferred from one node of a social network to another. It's a reminder that social transmission of information can have bad consequences as well as good ones, and that the ultimate health of the social network is determined by how well it can discriminate between the two. Twitter seems to be a very useful medium for transmission of memes. The network of twitterers seems to act as an intelligent filter which exponentially removes things I'm not interested in while delivering the stuff that I am interested in. Facebook has recently seemed very effective at transmitting "quizzes" to me, which I'm not thrilled about, but the network seems to be evolving in a more self-aware way.

And in case you wondered, there is
  1. no chance you'll get swine flu just by reading this
  2. ...because I am perfectly healthy and showing no symptoms
  3. ...but you may want to contact your local health authorities just to be sure
  4. ...and since you've read it anyway, go wash your hands.