Back in the early days of the web, Peter Deutsch from Sun penned a classic list: The Fallacies of Distributed Computing. Peter took a long, hard look at dozens of networked systems that failed, and realized that almost every failure made one or more catastrophic assumptions:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
Any time you make an assumption along the lines of the fallacies above, your project will almost certainly fail. These fallacies are best explained in an article by Arnon Rotem-Gal-Oz, but today I will focus on fallacy #5: Topology doesn't change, and how the semantic web will fail partially because its creators made this fatal assumption.
As I mentioned before, proponents of the "Semantic Web" are trying to dial down their more grandiose claims, and focus on items with more concrete value. The term that Tim Berners-Lee is using these days is Linked Data. The core idea is to encourage people to put highly structured data on the web, and not just unstructured HTML documents, so the data is easier for machines to read and understand.
Funny thing, people have been doing this for decades. Tons of folks make structured data available as "scrapable" HTML tables, as formatted XML files, or even as plain ol' Comma Seperated Value (CSV) files that you can open in Excel. Not to mention the dozens of open web services and APIs... allowing you to do anything from check stock quotes, to doing a Google Maps mashup. There really is nothing groundbreaking here... and I find it painfully disingenuous for somebody to claim that such an obvious step was "their magic idea."
Well, not so fast... in an attempt to breath relevance back into the "Semantic Web," Tim claims that "Real Linked Data" needs to follow three basic rules:
- URLs should not just go to documents, but structured data describing what the document is about: people, places, products, events, etc.
- The data should be important and meaningful, and should be in some kind of standard format.
- The returned structured data has relationships to other kinds of structured data. If a person was born in Germany, the data about that user should contain a link to the data about Germany.
OK... so your data has to not only be in a standard format... but it needs links to other data objects in standard formats. And this is exactly where they fail to heed the warnings about the fallacies of distributed computing! Your topology will always change... not only physical topology, but also the logical topology.
Or, more succinctly, what the heck is the URL to Germany?!?!?
Look... links break all the time. People move servers. People shut down servers. People go out of business. People start charging for access to their data. People upgrade their architecture, and choose a different logical hierarchy for their data. Companies get acquired, or go out of business. Countries merge, or get conquered. Embarrassing content is removed from the web. Therefore, if you use links for identifiers, don't expect your identifiers to work for very long. You will need to spend a lot of time and energy maintaining broken links, when quite frankly you could do quite fine without them in the first place.
An identifier says what something is. A link says where you can find it. These concepts should be kept absolutely separated. Its a bad bad bad bad bad idea to blend the "where" with the "what" into one single identifier... even the much touted Dereferenceable URIs won't cut it, especially from a long-term data maintenance perspective... because the data they deference to might no longer be there!
So, where does that leave us? Exactly where we are. There are plenty of ways to create a system of globally unique IDs, whether you are a major standards body, or a small company with your own product numbers. But we shouldn't use brittle links... we should use scoped identifiers instead. We need a simple, terse way to describe what something is, that in no way, shape, or form looks like a URL. The identifier is the "what." We need a secondary web service -- like Google -- to tell us the most likely "where." At most, data pages should contain a link to a "suggested web service" to translate the "what" into the "where." Of course... that web service might not exist in 5 years, so proceed with caution.
For example, we could use something similar to Java package names to make sure anybody with a DNS name can create their own identifier... For example, there's a perfectly good ISO standard for country names. So you tell me, which is a better identifier for Germany?
I don't know... Openlinsw.com and DBPedia might not be around in 3 years, and data is supposed to be permanent. Wikipedia will probably be around for a while, but should it go to the English page or the German page? The ISO 3166 identifier may not be clickable, but at least it works for both German and English speakers! Also, if you remove the dots and Google it, the first hit gives you exactly the info you need. Plus, these ISO codes will exist forever, even if the ISO itself gets overrun by self-aware semantic web agents.
I just can't shake the feeling that using links for identifiers leads to a false sense of reliability. Your identifiers are some of the most important parts of your data: they should be something globally unique and permanent... and the web is anything but permanent.
Lets' accept the fact that the topology will change, create a system of globally unique identifiers that are independent of topology, and go from there.