Sunday, March 28, 2010

URI Fragments

I learned something interesting when reading the Architecture of the World Wide Web, Volume One. It turns out that URI fragments (the part of a URI after a '#' character) are not interpreted as part of a URI:

Note that the HTML implementation in Emma's browser did not need to understand the syntax or semantics of the SVG fragment (nor does the SVG implementation have to understand HTML, WebCGM, RDF ... fragment syntax or semantics; it merely had to recognize the # delimiter from the URI syntax [URI] and remove the fragment when accessing the resource). This orthogonality (§5.1) is an important feature of Web architecture; it is what enabled Emma's browser to provide a useful service without requiring an upgrade.

The semantics of a fragment identifier are defined by the set of representations that might result from a retrieval action on the primary resource. The fragment's format and resolution are therefore dependent on the type of a potentially retrieved representation, even though such a retrieval is only performed if the URI is dereferenced. If no such representation exists, then the semantics of the fragment are considered unknown and, effectively, unconstrained. Fragment identifier semantics are orthogonal to URI schemes and thus cannot be redefined by URI scheme specifications.

Interpretation of the fragment identifier is performed solely by the agent that dereferences a URI; the fragment identifier is not passed to other systems during the process of retrieval. This means that some intermediaries in Web architecture (such as proxies) have no interaction with fragment identifiers and that redirection (in HTTP [RFC2616], for example) does not account for fragments.

While I had an intuitive understanding of how browsers work with URI fragments in HTML documents (retrieve the document, find the fragment, display the document starting at the fragment), I hadn't considered the semantic split, nor the implications of URI fragments being applied to other kinds of representations.

I find the handling of fragments interesting in a number of ways. For one thing, it means that as new content types become part of the web, the creators of those content types are free to map URI fragments into that content type. So in HTML the format of URI fragments is generally a textual name that appears in the html. But in a 3D modeling format, it might take the form of [position,orientation,scale] to define a location from which the model is being viewed, the direction the camera is facing, and the scaling factor. That's nice because it allows URI fragments to be structured in a manner most appropriate for the kind of representation being retrieved.

One possibly surprising consequence of this split is that URI fragments are not considered in URI resolution activities such as interacting with proxies or redirection. It also means that you shouldn't try to use URI's with fragments as if they represented actual resources, since the web isn't allowed to cache individual fragments and no semantic interpretation is allowed from the '#' character onward.

Wednesday, March 24, 2010

Why you should care about adhering to the Architecture of the World Wide Web

I was reading more of the Architecture of the World Wide Web, Volume One", and RFC 2616 and the definition of "safe" methods:

9.1.1 Safe Methods
Implementors should be aware that the software represents the user in their interactions over the Internet, and should be careful to allow the user to be aware of any actions they might take which may have an unexpected significance to themselves or others.
In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

This matches quite nicely with my intuition about how the web works (at least the normal web).  That might seem like something trivial, but in fact it is something profound.

Unfortunately, many, many web services violate the principle of safe operations. For example, they frequently have all interactions occur via GET, and use query parameters or headers or other conventions to stipulate whether the result is a resource retrieval, creation, update, or delete. This is (unfortunately) only one example of aberrant service design.

At first, this might not seem like such a problem. But as you start to think about consuming such services or providing support for services you've built, you realize that the conventions of HTTP and the architecture of the web represent a sort of "lingua franca". Rather than making your services do more, you are making your services easier for (potential) consumers of your service to understand and use. If you adhere to the architecture of the web, your services don't do any more than they would have done otherwise. Rather (and this is crucial) you have implemented your services in such a way that users of your services can very easily and rapidly understand their features, capabilities, and limitations. That is a huge advantage in today's fast-moving technology world.

Monday, March 08, 2010

Parsing "The Architecture of the World Wide Web, Volume 1"; URI allocation

As I continue to read Architecture of the World Wide Web, Volume I, I keep running into material that is completely outside of what I would have expected, yet valuable.

For example, Section 2.2.2 talks about URI allocation. Since URIs are supposed to identify a single resource, it becomes important to make sure that the social organizations which allocate and assign URIs are organized so that they don't allocate the same URI to refer to more than one resource. In other words, we want to make sure that we give organizations authority to assign URIs that don't overlap, so that different organizations don't assign the same URI to different resources (sort of like giving the same Social Security number or driver's license number or bank account number to two different people).

This sort of material may sound obvious when we read it. But it is frequently not obvious to everyone involved in building, deploying, managing, and evolving software systems. In fact, I think failure to make these sorts of issues clear at the architectural and administrative levels is quite possibly the single greatest cause of problems in managing software systems in the real world.

Tuesday, March 02, 2010

Principals of the Web

As I noted in a previous entry, I've been reading Architecture of the World Wide Web, Volume One and am finding it a great read. For example, take this little gem:
Constraint: URIs Identify a Single Resource

Assign distinct URIs to distinct resources.

In a nutshell the authors have made it clear that a URI should refer to a particular resource. And just a bit further on they point out that URI's can be aliases for a single resource:
Just as one might wish to refer to a person by different names (by full name, first name only, sports nickname, romantic nickname, and so forth), Web architecture allows the association of more than one URI with a resource. URIs that identify the same resource are called URI aliases. The section on URI aliases (§2.3.1) discusses some of the potential costs of creating multiple URIs for the same resource.

They even offer thoughts on the performance consequences of aliases.

Ya gotta love it... :-)

Monday, March 01, 2010

The Architecture of the World Wide Web - Volume 1

As part of my foray into RESTful services, I've been reading The Architecture of the World Wide Web - Volume 1 and find it refreshingly informative. For example:

The choice of syntax for global identifiers is somewhat arbitrary; it is their global scope that is important. The Uniform Resource Identifier, [URI], has been successfully deployed since the creation of the Web. There are substantial benefits to participating in the existing network of URIs, including linking, bookmarking, caching, and indexing by search engines, and there are substantial costs to creating a new identification system that has the same properties as URIs.


And:

A resource should have an associated URI if another party might reasonably want to create a hypertext link to it, make or refute assertions about it, retrieve or cache a representation of it, include all or part of it by reference into another representation, annotate it, or perform other operations on it. Software developers should expect that sharing URIs across applications will be useful, even if that utility is not initially evident.


There is so much packed into each of these brief statements, and they are in equivalent of the first 10 pages of the document.

I find it both amazing and sad that this document was published in 2004 yet I've found very few references to it in the six years since it's publication. Maybe I just haven't been looking in the right places?

I will share additional passages that I find enlightening in the days (and weeks?) to come.