The Digital Object Identifier System

Handling of Special Characters Within the DOI Syntax

The following explains why and how certain characters must be hex encoded when DOIs are embedded in URLs. The rules are summarized in a set of tables which can be used as a quick reference.

The Handle System, the technology which underlies the DOI, is designed to accept any string of any length. (Note that earlier implementations did have a 128 character limit, but this limitation has been removed.) A more formal definition of Handle System syntax, including internationalization, can be found in the recent IETF Internet Draft "Handle System Namespace and Service Definition".

But the DOI application of the Handle System doesn't exist in isolation. Specifically, much of the current and near future use of DOIs will be in the context of the Web. Using DOIs of the form

http://dx.doi.org/10.1000/1

means using DOIs embedded in URLs, and so attention must be paid to URL syntax, particularly to characters which have special meanings and so are excluded from URLs, e.g., URL delimiters. Any DOI which contains one of these excluded characters, e.g., the # mark, cannot be embedded in a URL without encoding the excluded characters to hide them from the URL mechanisms. An example is given below.

In August 1998, the standards track protocol document "Uniform Resource Identifiers (URI): Generic Syntax" (RFC 2396) was published. It updates and merges "Uniform Resource Locators" [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order to define a single, generic syntax for all URIs, and can be found at

http://www.ics.uci.edu/pub/ietf/uri/rfc2396.txt

In our opinion this is currently the best statement of URL (now described as a subset of URI) syntax. We will try to keep this link updated, but please check for later versions as time goes on. And of course there are numerous reference books and guides that address the subject.

In addition to formal specifications, the DOI team had to also consider current practice, primarily as defined by the behavior of current Web browsers. As is frequently the case, current implementations are somewhat more forgiving than the formal specifications, although there is never any guarantee that this will continue. In our recommendations below we do, however, distinguish between those encodings which are absolutely required to function today versus those which are deemed at least unwise by the specifications but which are not required by current browsers.

The URI syntax document referred to above (RFC 2396) defines excluded characters within the framework of US-ASCII. Within that framework, it categorizes excluded characters as follows. (Note that the | symbol is used to separate alternatives, such that "#" | "%" is a list of two characters.)

control = <US-ASCII coded characters 00-1F and 7F hexadecimal>

Control characters are sometimes defined as not printable, although note that horizontal tab falls into this class.

space = <US-ASCII coded character 20 hexadecimal>

The space character is frequently used to show the end of a string and is also sometimes spuriously introduced in transcription or cutting and pasting.

delims = "<" | ">" | "#" | "%" | <">

URL delimiters are used to define the start and finish of a URL in various contexts. Note, however, that "%" is included in this list not because it is a delimiter but because it is used for hex encoding and so must always be encoded itself.

unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

RFC 2396 defines these as a variety of characters which are regarded as troublesome in various contexts, such as gateways.

A series of experiments with current browsers indicates that the characters which MUST be avoided or encoded in all contexts in the current web environment are control, space, hash mark (#), percent (%), and double quote ("). If used in DOIs embedded in URLs, these characters must be hex encoded in order to be correctly interpreted by browsers. Additionally, any other characters may be hex encoded. Please see the aforementioned RFCs and Internet Draft for the rationales on why certain characters should be encoded in certain contexts.
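The categories above can be written down as character sets in order to check which characters in a given DOI need attention. The following is a minimal Python sketch and not part of the DOI system; the names `classify` and `must_encode` are hypothetical and chosen only for illustration.

```python
# Excluded US-ASCII character classes, as listed in the text above.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")

# Characters the browser experiments found MUST be encoded in URLs.
MUST_ENCODE = CONTROL | SPACE | {"#", "%", '"'}


def classify(doi):
    """Group the excluded characters found in a DOI by category."""
    categories = {"control": CONTROL, "space": SPACE,
                  "delims": DELIMS, "unwise": UNWISE}
    return {name: [ch for ch in doi if ch in chars]
            for name, chars in categories.items()}


def must_encode(doi):
    """Return the characters in a DOI that must be hex encoded in a URL."""
    return [ch for ch in doi if ch in MUST_ENCODE]


print(classify("10.1000/456#789"))
print(must_encode("10.1000/a<b c#d"))
```

In the second call, the "<" is an excluded delimiter but is not in the required set, so only the space and the hash mark are reported.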

Hex encoding consists of replacing the given character with a percent sign followed by its two-digit hexadecimal value. Thus, # becomes %23 and

http://dx.doi.org/10.1000/456#789

is encoded as

http://dx.doi.org/10.1000/456%23789

The browser does not now encounter the bare #, which it would normally treat as the end of the URL and the start of a fragment, and so sends the entire string off to the DOI network of servers for resolution, instead of stopping at the #.
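The encoding step and its effect on URL parsing can be sketched in Python as follows. This is an illustration under assumptions, not the DOI system's own code: the helper names `hex_encode` and `doi_to_url` are hypothetical, and the standard library's `urlsplit` stands in for a browser's URL parser.

```python
from urllib.parse import urlsplit

# Characters that must be encoded: controls, space, '#', '%', and '"'.
MUST_ENCODE = {chr(c) for c in range(0x20)} | {chr(0x7F), " ", "#", "%", '"'}


def hex_encode(doi, chars=MUST_ENCODE):
    """Replace each character in `chars` with % plus its two-digit hex value."""
    return "".join("%{:02X}".format(ord(ch)) if ch in chars else ch
                   for ch in doi)


def doi_to_url(doi, proxy="http://dx.doi.org/"):
    """Embed a DOI in a proxy URL, hex encoding where required."""
    return proxy + hex_encode(doi)


raw = "http://dx.doi.org/" + "10.1000/456#789"
encoded = doi_to_url("10.1000/456#789")

# Unencoded, the parser ends the path at '#' and treats the rest as a
# fragment; encoded, the whole DOI stays in the path.
print(urlsplit(raw).path, urlsplit(raw).fragment)
print(encoded, urlsplit(encoded).path)
```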

Note that the DOI is not changed, only its representation within the URL. The DOI is still 10.1000/456#789, and is so stored in the DOI system. Thus, the hex encoding must be reversed before the resolution takes place. That decoding is done in the various forms and proxy servers that surround the DOI resolution system.
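Reversing the encoding can be illustrated with a short Python sketch. It only shows the idea of the decoding step; how the proxy servers actually implement it is not described here. The function name `doi_from_url` is hypothetical, and the sketch assumes the URL carries nothing but the DOI in its path.

```python
from urllib.parse import unquote, urlsplit


def doi_from_url(url):
    """Recover the stored DOI from a proxy URL by reversing the hex encoding."""
    path = urlsplit(url).path
    return unquote(path[1:])  # drop the leading "/" and decode %XX sequences


print(doi_from_url("http://dx.doi.org/10.1000/456%23789"))
```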

The DOI web forms and batch submission procedures have been built to accept either hex encoded or raw characters in all cases but one. Percent (%) will always be interpreted as the start of a hex encoded character and so must always be hex encoded itself, as %25.
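The reason for the percent exception can be seen in a small Python sketch of input handling that accepts both raw and hex encoded characters. This is an assumption about behavior consistent with the rule above, not the actual code behind the forms; the function name `normalize_form_input` is hypothetical, as is the choice to reject a percent sign that is not followed by two hex digits.

```python
import re
from urllib.parse import unquote

# A percent sign that does not start a two-digit hex sequence.
_BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def normalize_form_input(text):
    """Accept raw or hex encoded input and return the DOI as stored."""
    if _BARE_PERCENT.search(text):
        raise ValueError("a literal % must be entered as %25")
    return unquote(text)


# Raw and hex encoded forms of the same DOI give the same result.
print(normalize_form_input("10.1000/456#789"))
print(normalize_form_input("10.1000/456%23789"))

# A DOI that literally contains "%23" must be entered with %25,
# otherwise it would be read as an encoded "#".
print(normalize_form_input("10.1000/abc%2523"))
```

Because every percent sign is read as the start of a hex sequence, input such as 10.1000/abc%23 can only mean the DOI 10.1000/abc#, which is why a literal percent has no raw form.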

DOI creators and maintainers thus need to keep the following rules in mind:

  • There are no character restrictions for DOI schemes per se.
  • When DOIs are embedded in URLs, they must follow the URL syntax conventions, but the same DOIs need not follow those conventions in other contexts, e.g., inventory databases.
  • The percent character (%) must always be hex encoded (%25) in any web form, batch input, or URL. Other excluded characters must always be hex encoded in URLs, but may be entered either raw or hex encoded in web forms and batch jobs.

These rules are summarized in the quick reference table.


International  DOI Foundation
Updated: 21 Aug 00