The Semantic Web Versus The Fallacies Of Distributed Computing

Back in the early days of the web, Peter Deutsch from Sun penned a classic list: The Fallacies of Distributed Computing. Peter took a long, hard look at dozens of networked systems that failed, and realized that almost every failure made one or more catastrophic assumptions:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Any time you make an assumption along the lines of the fallacies above, your project will almost certainly fail. These fallacies are best explained in an article by Arnon Rotem-Gal-Oz, but today I will focus on fallacy #5: Topology doesn't change, and how the semantic web will fail partially because its creators made this fatal assumption.

As I mentioned before, proponents of the "Semantic Web" are trying to dial down their more grandiose claims, and focus on items with more concrete value. The term that Tim Berners-Lee is using these days is Linked Data. The core idea is to encourage people to put highly structured data on the web, and not just unstructured HTML documents, so the data is easier for machines to read and understand.

ummmmm.... ok...

Funny thing, people have been doing this for decades. Tons of folks make structured data available as "scrapable" HTML tables, as formatted XML files, or even as plain ol' Comma Separated Value (CSV) files that you can open in Excel. Not to mention the dozens of open web services and APIs... allowing you to do anything from checking stock quotes to building a Google Maps mashup. There really is nothing groundbreaking here... and I find it painfully disingenuous for somebody to claim that such an obvious step was "their magic idea."

Well, not so fast... in an attempt to breathe relevance back into the "Semantic Web," Tim claims that "Real Linked Data" needs to follow three basic rules:

  1. URLs should not just go to documents, but structured data describing what the document is about: people, places, products, events, etc.
  2. The data should be important and meaningful, and should be in some kind of standard format.
  3. The returned structured data has relationships to other kinds of structured data. If a person was born in Germany, the data about that person should contain a link to the data about Germany.

OK... so your data has to not only be in a standard format... but it needs links to other data objects in standard formats. And this is exactly where they fail to heed the warnings about the fallacies of distributed computing! Your topology will always change... not only physical topology, but also the logical topology.

Or, more succinctly, what the heck is the URL to Germany?!?!?

Look... links break all the time. People move servers. People shut down servers. People go out of business. People start charging for access to their data. People upgrade their architecture, and choose a different logical hierarchy for their data. Companies get acquired, or go out of business. Countries merge, or get conquered. Embarrassing content is removed from the web. Therefore, if you use links for identifiers, don't expect your identifiers to work for very long. You will need to spend a lot of time and energy maintaining broken links, when quite frankly you could do quite fine without them in the first place.

An identifier says what something is. A link says where you can find it. These concepts should be kept absolutely separated. It's a bad bad bad bad bad idea to blend the "where" with the "what" into one single identifier... even the much touted Dereferenceable URIs won't cut it, especially from a long-term data maintenance perspective... because the data they dereference to might no longer be there!

So, where does that leave us? Exactly where we are. There are plenty of ways to create a system of globally unique IDs, whether you are a major standards body, or a small company with your own product numbers. But we shouldn't use brittle links... we should use scoped identifiers instead. We need a simple, terse way to describe what something is, that in no way, shape, or form looks like a URL. The identifier is the "what." We need a secondary web service -- like Google -- to tell us the most likely "where." At most, data pages should contain a link to a "suggested web service" to translate the "what" into the "where." Of course... that web service might not exist in 5 years, so proceed with caution.

For example, we could use something similar to Java package names, so that anybody with a DNS name can create their own identifiers... And there's already a perfectly good ISO standard for country codes: ISO 3166-1. So you tell me, which is a better identifier for Germany?

  • http://en.wikipedia.org/wiki/Germany
  • http://de.wikipedia.org/wiki/Deutschland
  • http://linkeddata.openlinksw.com/about/Germany#this
  • http://dbpedia.org/resource/Germany
  • org.iso.3166-1.de

I don't know... Openlinksw.com and DBPedia might not be around in 3 years, and data is supposed to be permanent. Wikipedia will probably be around for a while, but should it go to the English page or the German page? The ISO 3166 identifier may not be clickable, but at least it works for both German and English speakers! Also, if you remove the dots and Google it, the first hit gives you exactly the info you need. Plus, these ISO codes will exist forever, even if the ISO itself gets overrun by self-aware semantic web agents.
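To make the Java-package-style idea a bit more concrete, here is a minimal sketch in Python of how such an identifier might be taken apart. The four-part layout (top-level domain, organization, scheme, code) is purely my assumption for illustration, and the names ScopedId and parse_scoped_id are hypothetical. Nothing here is a standard.

```python
import re
from typing import NamedTuple


class ScopedId(NamedTuple):
    authority: str  # reversed DNS name of whoever issues the codes, e.g. "org.iso"
    scheme: str     # the code list within that authority, e.g. "3166-1"
    code: str       # the local code, e.g. "de"


def parse_scoped_id(text: str) -> ScopedId:
    """Split an identifier like 'org.iso.3166-1.de' into its parts.

    Assumed (hypothetical) layout: <tld>.<organization>.<scheme>.<code>,
    exactly four dot-separated parts made of letters, digits and hyphens.
    """
    parts = text.strip().lower().split(".")
    if len(parts) != 4 or not all(re.fullmatch(r"[a-z0-9-]+", p) for p in parts):
        raise ValueError(f"not a scoped identifier: {text!r}")
    tld, org, scheme, code = parts
    return ScopedId(authority=f"{tld}.{org}", scheme=scheme, code=code)


germany = parse_scoped_id("org.iso.3166-1.de")
print(germany.authority, germany.scheme, germany.code)  # org.iso 3166-1 de
```

Notice that a URL such as http://dbpedia.org/resource/Germany is rejected outright by this parser: the identifier carries a "who says so" and a "what," but no "where."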

I just can't shake the feeling that using links for identifiers leads to a false sense of reliability. Your identifiers are some of the most important parts of your data: they should be something globally unique and permanent... and the web is anything but permanent.

Let's accept the fact that the topology will change, create a system of globally unique identifiers that are independent of topology, and go from there.
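Here is a rough sketch, in Python, of what keeping the "what" apart from the "where" could look like. The Resolver class and the sample person record are hypothetical, made up only for illustration; in real life the lookup would be some external web service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resolver:
    """Maps a stable identifier (the "what") to candidate URLs (the "where")."""
    locations: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, identifier: str, url: str) -> None:
        # The newest location goes first; older ones are kept as fallbacks.
        self.locations.setdefault(identifier, []).insert(0, url)

    def retire(self, url: str) -> None:
        # A server went away: drop the location, keep every identifier.
        for urls in self.locations.values():
            if url in urls:
                urls.remove(url)

    def resolve(self, identifier: str) -> List[str]:
        # Return a copy so callers cannot modify the registry by accident.
        return list(self.locations.get(identifier, []))


# The data record stores only the identifier, never a URL.
person = {"name": "Example Person", "born_in": "org.iso.3166-1.de"}

resolver = Resolver()
resolver.register("org.iso.3166-1.de", "http://dbpedia.org/resource/Germany")
resolver.register("org.iso.3166-1.de", "http://en.wikipedia.org/wiki/Germany")
print(resolver.resolve(person["born_in"]))
```

When a site disappears, you call retire() once in the resolver and not a single data record has to be touched. If the identifier were the URL itself, every record pointing at that site would need repair.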

I dug up some great examples

I dug up some great examples of "links break all the time"

http://www.kibu.com/
http://www.kozmo.com/

taken from this list
http://www.cnet.com/1990-11136_1-6278387-1.html

According to your analysis,

According to your analysis, the WWW could never work. It conflates naming and location, it has broken links everywhere, and its identifiers (URIs) are much less stable than older identifier schemes such as ISBNs. Nevertheless, it appears to be pretty useful.

Re: According to your analysis,

The web is primarily useful because of the existence of web browsers and search engines, not the HTTP protocol ;-)

In general, the fast-and-loose web protocols and hyperlinks work great for unstructured information, because the effort is on the end user to extract meaning from the noise. Embedding structure and meaning can sometimes be helpful, but if you do so, you need to use an ID that's a little more permanent. Otherwise, you'll have a system that can only describe what's currently a document on the web... which pretty much defeats the whole goal of a "linked data" web.

For example... if there ever were an embedded microformat that says "this is a book," it would be wiser to have an ISBN for the identifier, rather than a link to Amazon. That way, both Barnes & Noble and Borders would use it as well, and web services could understand that these microformats describe the exact same logical entity... which again, is pretty much a requirement.
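As a quick sketch of why the ISBN works as a shared identifier: two stores can write the same ISBN differently (ISBN-10 or ISBN-13, with or without hyphens), yet both forms normalize to one key. The Python below uses the standard ISBN-13 check digit rule; the two store records are hypothetical, and the function does not verify the old ISBN-10 check digit.

```python
def isbn13_check_digit(first12: str) -> str:
    # ISBN-13 rule: weight the digits 1, 3, 1, 3, ... and pad up to a multiple of 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_isbn(raw: str) -> str:
    """Return the 13-digit form of an ISBN-10 or ISBN-13, without separators."""
    digits = raw.replace("-", "").replace(" ", "").upper()
    if len(digits) == 10 and digits[:9].isdigit():
        # ISBN-10 to ISBN-13: prefix 978, drop the old check digit, recompute.
        # (The old ISBN-10 check digit itself is not verified in this sketch.)
        digits = "978" + digits[:9]
        digits += isbn13_check_digit(digits)
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError(f"not an ISBN: {raw!r}")
    if isbn13_check_digit(digits[:12]) != digits[12]:
        raise ValueError(f"bad check digit: {raw!r}")
    return digits


# Hypothetical records from two different stores describing the same book.
store_a = {"store": "Amazon", "isbn": "0-306-40615-2"}
store_b = {"store": "Barnes & Noble", "isbn": "978-0-306-40615-7"}

print(normalize_isbn(store_a["isbn"]) == normalize_isbn(store_b["isbn"]))  # True
```

Neither record says anything about where the book can be bought, which is exactly the point: the match survives even if one of the stores goes away.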

Identifying people becomes an even bigger problem... every time I try to design a system of unique IDs for people, it always winds up being a massive security problem... Even with OpenID, I just don't think it will take off.

Dude

There is so much coming in Semantic Web, you haven't even come close to articulating what will drive sustainable value, versus what will cause those like you who miss the critical issues - to fail. Good luck.

Re: Dude

On many matters, I prefer to be an actor as opposed to a critic... However in this case, I must vehemently criticize: the semantic web has been vastly oversold, and will leave massive disappointment in its wake.

"There is so much coming in Semantic Web, you haven't even come close to articulating what will drive sustainable value, versus what will cause those like you who miss the critical issues - to fail. Good luck."

That's what people said about both Esperanto and XML. I am deeply familiar with how it could be valuable, but I am also deeply familiar with how human nature has ensured that similar projects in the past have been utter failures. I believe the burden of proof is on the boosters; not myself. If you have bought into what semantic web vendors are selling, then I believe it is you who needs the luck.

Please note: I am fully open to the possibility that I am wrong. However, I will change my opinion if -- and ONLY if -- the proponents of the semantic web can achieve anything of value that is measurably better than what can be achieved with ordinary relational logic. Since the past 10 years have seen zero actual progress with the "semantic web", I think my opinions are perfectly defensible.
