Factsheet
DOI® System and Internet Identifier Specifications
Version 2.0
A standard represents an agreement by a community to do things in a specified way to address a common problem. Whilst the DOI community has developed the DOI System, it has also ensured conformance with relevant generic external formal standards. This factsheet discusses those relevant in the Internet communities (IETF and W3C). There is currently considerable debate here on the issue of generic standards for naming objects.
Comparing generic identifier standards
A DOI® name differs from commonly used Internet pointers to material such as the URL, because it identifies an object as a first-class entity, not simply the place where the object is located. A DOI name also differs from identifiers such as the International Standard Book Number (ISBN), International Standard Recording Code (ISRC), etc., because it can be associated with defined services and is immediately actionable on a network.
The comparison of persistent identifier approaches is difficult because they are not all doing the same thing. Imprecisely referring to a set of schemes as 'identifiers' doesn't mean that they can be compared easily. Similarly, when any two technologies (e.g., two web browsers) are compared, the criteria used for comparison must be defined.
URI, URL, and URN
As noted by W3C and IETF (in RFC 3305), there is fundamental confusion about the relationship between URL, URN, and URI. This cannot easily be rectified because there are two incompatible views, which are irretrievably confused in the documentation (which is, in addition, poorly version controlled); the W3C web site is out of date on the topic. The following gives the common consensus as far as it exists; where a misunderstanding arises from the incompatible views, one must determine which view is being used. The main problems are confusion among identifier, representation, and access mechanism; lack of appreciation of identifier usage outside the WWW; use for non-digital referents; and failure to perceive the web as only part of the Internet, and the Internet as only part of information. In one view, URIs have two subclasses: URN (identifying names) and URL (identifying single locations, and therefore used incorrectly, in the absence of anything else, as a shorthand for the identifier of the resource at that location). In the other view, web-identifier schemes are all URI schemes, and a given URI scheme may define subspaces; some of these may be access mechanisms (e.g., "http:") whilst others may be namespaces (e.g., "urn:").
URI
Uniform Resource Identifier (RFC 3986) provides an extensible means for identifying a resource within the World Wide Web. Each URI begins with a scheme name that refers to a specification for assigning identifiers within that scheme; each scheme's specification may further restrict the syntax and semantics of identifiers using that scheme.
The URI specification defines (1) an implementation to access a location on a file server, commonly accessed using the HTTP protocol, though other protocols are allowed; and (2) a syntax for referencing, through which, for example, ISBNs can be specified as URIs. The network path of the URI is implicitly DNS based; original URI specifications that assume the URI to be opaque have been overtaken by practical usage, which assumes that the initial URI parser will look for meaningful characters (such as dot and slash).
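To illustrate the point that every URI begins with a scheme name, the following minimal sketch uses Python's standard `urllib.parse.urlsplit` to separate the scheme from the rest of three strings that appear in this factsheet. The helper name `split_uri` is hypothetical; this is an illustration under those assumptions, not part of any specification.

```python
from urllib.parse import urlsplit


def split_uri(uri: str) -> tuple[str, str, str]:
    """Return (scheme, authority, path) for a URI string."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


if __name__ == "__main__":
    for uri in (
        "http://dx.doi.org/10.1000/182",  # access-mechanism style URI
        "urn:isbn:1234567891234",         # name-style URI
        "doi:10.1000/182",                # DOI written with a "doi:" label
    ):
        print(split_uri(uri))
```

Only the http example yields an authority component (a DNS host name), which reflects the remark above that the network path of a URI is implicitly DNS based.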
RFC 3305 (which attempts to clarify the URI, URN, and URL concepts) lists as an unanswered problem: "The use of URIs as identifiers that don't actually identify network resources" (for example, they identify an abstract object, or a physical object). This is important in any semantic application. To address this, the info URI scheme (RFC 4452; http://info-uri.info) was developed by library and publishing communities for "URIs of information assets that have identifiers in public namespaces but have no representation within the URI allocation". OpenURL adopts it and was a key motivation for it. Info URI registrations can be made by anyone, not necessarily the authority for a particular namespace. DOI is registered in the info URI scheme.
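As a rough illustration of the info URI form, RFC 4452 writes such an identifier as the scheme name "info", a colon, a namespace, a slash, and the identifier itself. The sketch below assumes the namespace token "doi" and ignores the percent-encoding and normalization rules that the RFC and individual namespace registrations may impose; the helper names are hypothetical.

```python
def to_info_uri(namespace: str, identifier: str) -> str:
    """Compose an info URI of the form info:<namespace>/<identifier>."""
    return f"info:{namespace}/{identifier}"


def split_info_uri(uri: str) -> tuple[str, str]:
    """Split an info URI into (namespace, identifier)."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "info" or "/" not in rest:
        raise ValueError(f"not an info URI: {uri!r}")
    # Only the first slash separates namespace from identifier;
    # later slashes belong to the identifier (as in a DOI name).
    namespace, _, identifier = rest.partition("/")
    return namespace, identifier


if __name__ == "__main__":
    uri = to_info_uri("doi", "10.1000/182")
    print(uri)
    print(split_info_uri(uri))
```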
URN
Uniform Resource Name (RFC 2141) is a specification for defining names (identifiers) of resources for use on the Internet. Locations are assumed to be independent of names. RFC 2141 defines (1) a formal registration process as a URN namespace, and (2) accompanying specifications to implement a series of functional requirements for such namespaces. Any existing identifier may be specified as a URN (e.g., urn:isbn:1234567891234). Such identifiers may be implemented using a specially written URN plug-in and resolved to URLs, but functionally this gives nothing beyond what is achieved by coherent management of the corresponding URLs.
URN architecture assumes a DNS-based Resolution Discovery Service (RDS) to find the service appropriate to the given URN scheme. However, no widely deployed RDS schemes currently exist: browsers cannot action URN strings without some additional programming in the form of a "plug-in". Such plug-ins carry no guarantee of ready interoperability with other deployments, which may require a different plug-in for each implementation and may use conflicting data approaches.
The set of URNs of the form "urn:nid:nnnnnn" is a URN namespace. ("nid" is here a URN namespace identifier, not a "URN scheme", nor a "URI scheme".) The official IANA list of registered NIDs lists 40 registered NIDs; many of these are not widely used as URNs (e.g., ISSN, ISBN).
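The "urn:nid:nnnnnn" form described above can be sketched as a small parser. This is a simplified illustration, assuming only the basic RFC 2141 shape (a case-insensitive "urn:" prefix, a namespace identifier of up to 32 letters, digits, or hyphens, then a namespace-specific string); it does not validate the characters of the namespace-specific string or check the NID against the IANA registry. The function name `parse_urn` is hypothetical.

```python
import re

# "urn:" <NID> ":" <NSS>; the NID starts with a letter or digit and is
# at most 32 characters long. The NSS is accepted without validation.
_URN_RE = re.compile(r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE)


def parse_urn(urn: str) -> tuple[str, str]:
    """Return (nid, nss) for a URN string, with the NID lowercased."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    nid, nss = match.groups()
    return nid.lower(), nss


if __name__ == "__main__":
    print(parse_urn("urn:isbn:1234567891234"))
    print(parse_urn("urn:doi:10.1000/1"))
```

Parsing only recovers the namespace identifier and the name; finding a service that can act on the result is the separate resolution problem discussed above.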
DOI is not registered as a URN namespace, despite fulfilling all the functional requirements, since URN registration appears to offer no advantage to the DOI System. It requires an additional layer of administration for defining DOI as a URN namespace (the string urn:doi:10.1000/1 rather than the simpler doi:10.1000/1) and an additional step of unnecessary redirection to access the resolution service, already achieved through either http proxy or native resolution. If RDS mechanisms supporting URN specifications become widely available, DOI will be registered as a URN.
URL
Uniform Resource Locator (RFC 1738) is a location on a file server in the WWW; it was more recently (and less clearly) redefined as "a type of URI that identifies a resource via a representation of its primary access mechanism (e.g., its network "location"), rather than by some other attributes it may have. URL is a useful but informal concept. ..." (RFC 3305). In practice, it identifies a single location, and is therefore widely used incorrectly as a (mutable) identifier of the resource at that location (so the same resource at two URLs would have two URL "identifiers"). This bad practice arose from the failure to distinguish name and location in early WWW development. Adding to the problem, URLs carry the semantics of the Domain Name they are based on and are therefore unsuitable as opaque identifiers; they may also be contextually qualified. URLs are pervasive as the foremost mechanism throughout the WWW, but are less useful outside it.
Attempts to circumvent the problem of using URLs as citable identifiers by developing persistent identifier alternatives are well documented (PURL, DOI, ARK, etc.).
A DOI name may be represented as a URL (http string) by prefixing the DOI name with the string http://dx.doi.org/ (e.g., to resolve the DOI name 10.1000/182, enter into a browser the address http://dx.doi.org/10.1000/182). Web pages or other hypertext documents can include hypertext links in this form.
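The prefixing step described above can be sketched in a few lines. The proxy address and the example DOI name are taken from the text; the function name `doi_to_url`, the handling of an optional "doi:" label, and the percent-encoding of characters that are not URL-safe are assumptions made for illustration.

```python
from urllib.parse import quote

DOI_PROXY = "http://dx.doi.org/"  # proxy address given in the text


def doi_to_url(doi_name: str, proxy: str = DOI_PROXY) -> str:
    """Build an http URL for a DOI name by prefixing the proxy address."""
    name = doi_name.strip()
    # Assumption: accept an optional "doi:" label and drop it.
    if name.lower().startswith("doi:"):
        name = name[4:]
    # Assumption: percent-encode characters that are not URL-safe,
    # keeping "/" so the prefix/suffix separator stays readable.
    return proxy + quote(name, safe="/")


if __name__ == "__main__":
    print(doi_to_url("10.1000/182"))
```

The same DOI name can be paired with a different proxy address by passing the `proxy` argument, which keeps the name itself independent of any one access mechanism.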
DOI functional requirements
The DOI system is designed to fulfil several additional functional requirements which we believe offer significant advantages in generic naming, notably:
- Neutral as to implementation. DOI allows but does not require http or other protocols. The design principle is that DOIs are not specific to the web or any other implementation (e.g., information may be delivered in non-web platforms such as PDAs). DOI is designed to be applicable in any environment on the Internet (the global information system linked by a globally unique address space based on the Internet Protocol (IP) using the Transmission Control Protocol/Internet Protocol (TCP/IP) suite).
- Granularity of naming and administration at the object level. Allows but does not mandate coarser level granularity tools such as domain names. Specifically, DOI resolution in native resolver form does not require the use of the DNS (Domain Name System): the DNS administrative model argues against using it as a general-purpose name system and has well-recognised problems of security and updating.
- Neutral as to language/character set. Compatible with, but not restricted to, the ASCII character set. DOI names can use the Unicode capability of the Handle System to develop DOI names in Japanese, Chinese, and other characters. The current DOI syntax restricts initial implementations to ASCII simply for ease of adoption, but is intended to be widened (with backward compatibility) to Unicode in a future revision.