
[Metadata] Reference Linking: A Note on Syntax





Appendix 2 of the paper "DOIs used for reference linking" introduces a
syntactic convention which purports to simplify the use of DOIs. Leaving aside
the thornier issue of whether we really need to assign DOIs to Creations (or
Works) - virtual things (ghosts?) which have no tangible existence and hence
cannot be experienced/consumed/enjoyed or in any sense known - I would like to
suggest that the proposed syntax is imperfectly conceived.

What we are trying to implement with the DOI is a 21st century identifier (and
beyond?). Instead we seem to be harking back to the baroque majesty of the SICI
code. Why W/P/D/R as a type code? Why the presumption of English in a URN? Why
capitalized? (I know that the Handle system may be indifferent to case but the
DOI is surely not limited to the Handle technology. And anyway capitals
themselves are an older technology superseded by the lowercase, cursive style -
note also that the default case on most keyboards is lowercase.) Why the
addition of a pair of parentheses? One (or none) is sufficient. We already have
the slash as a delimiter between prefix and suffix.

I would further suggest that if we need to inspect the identifier it will be at
the machine level. No user is going to gaze at the DOI string to elicit
semantic evidence. If we really do need to incorporate this inline intelligence
then a single digit will suffice. (We have anyway always loosely talked about
the DOI as a "number".) A digit would also be kinder on I18N. And a digit lends
itself more readily to extension as more categories may be conceptualized later.
(Of course, maintaining this intelligence in the associated metadata is the more
obvious route. Two years ago we didn't know about Works and Manifestations. Two
years hence, what other base types will we have discovered? Metadata can always
be augmented, the persistent identifier - the DOI - never.)

For background it may be useful to consider Academic Press's experience over two
years with the DOI, which has been to decisively reject any intelligence in the
DOI suffix and to focus instead on metadata. In particular, the SICI string was
discovered to be a non-viable identifier. For resolution discovery purposes it is
flawed both semantically and syntactically.

Semantically, the identifier carries its associated metadata inline and each
and every piece of metadata must be known. A user cannot generate the string
from standard bibliographic citations. To accommodate this shortcoming AP
initially opted for minimal SICI codes where we retained only that minimal set
of metadata that could be derived from a citation. The next (and final) step was
to externalize the metadata. This allows us to make a citation match using only
a subset of the associated metadata.
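To make the point about matching on a subset concrete, here is a minimal
sketch in Python. It is not AP's actual system: the record store, the field
names, the DOIs under the made-up prefix 10.9999 and the function
match_citation are all hypothetical, invented purely for illustration.

```python
# Hypothetical externalized metadata, keyed by DOI.
# All identifiers, field names and values here are invented for illustration.
RECORDS = {
    "10.9999/jex.1999.0001": {
        "journal": "J. Example", "year": 1999, "volume": 12, "first_page": 345,
    },
    "10.9999/jex.1999.0002": {
        "journal": "J. Example", "year": 1999, "volume": 12, "first_page": 360,
    },
}


def match_citation(citation, records=RECORDS):
    """Return the DOIs whose metadata agrees with every field the citation gives.

    The citation may supply only a subset of the stored fields; fields it
    omits are simply not compared.
    """
    return [
        doi
        for doi, meta in records.items()
        if all(meta.get(field) == value for field, value in citation.items())
    ]


# A citation that knows only journal, volume and first page still matches.
partial_citation = {"journal": "J. Example", "volume": 12, "first_page": 345}
matches = match_citation(partial_citation)
print(matches)
```

The identifier itself plays no part in the match. It is only the key under
which the metadata is filed, which is the design choice argued for here.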

Syntactically, the SICI is a disaster. While version 2 was standardized in 1996,
it really belongs to an older time. It is based on ASCII. It is written in
English. It is true that it can be transported via SMTP, but it requires hex
encoding if used as a URI in HTTP, and it requires entifying if packaged within
SGML/XML instances. The SICI is over-specialized.
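The encoding burden is easy to demonstrate. The sketch below uses Python's
standard library; the SICI-style string is an illustrative value assumed for
this example, not one taken from AP's production data.

```python
from html import escape
from urllib.parse import quote

# Illustrative SICI-style string (an assumption for this sketch).
sici = "0095-4403(199502/03)21:3<12:WATIIB>2.0.TX;2-J"

# Percent-encode everything outside the URI "unreserved" set, as would be
# needed to carry the string safely inside a URI component.
as_uri_component = quote(sici, safe="")

# Replace markup-significant characters with entities, as would be needed
# to carry the string as text inside an SGML/XML instance.
as_xml_text = escape(sici)

print(as_uri_component)
print(as_xml_text)
```

Parentheses, slash, colon, angle brackets and semicolon all have to be
escaped for the URI case, and the angle brackets again for the markup case,
so the same identifier ends up with a different spelling in each environment.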

This has led AP to adopt their own production identifier as a viable suffix
string, i.e.

     10.1006/jmbi.1999.1234

This is a robust identifier, primitive enough that it can survive in a wide
range of environments without encoding. What it refers to will be evident from
its usage context. We have accepted that resolution discovery must be metadata
driven. The only intelligence in the "number" is that it is a URN (or will be
when registered as a NID) and that should be sufficient.
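As a rough check of the "survives without encoding" claim, the following
Python sketch splits the DOI above at the first slash and passes the suffix
through the standard URI and markup escaping routines. The variable names
are my own, and the check covers only these two environments.

```python
from html import escape
from urllib.parse import quote

doi = "10.1006/jmbi.1999.1234"

# The slash is the delimiter between prefix and suffix; split on the first one.
prefix, suffix = doi.split("/", 1)

# The suffix uses only letters, digits and dots, so neither escaping
# routine has anything to change.
survives_uri = quote(suffix, safe="") == suffix
survives_markup = escape(suffix) == suffix

print(prefix, suffix, survives_uri, survives_markup)
```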






------------------------------------------------------
Metadata maillist  -  Metadata@doi.org
http://www.doi.org/mailman/listinfo/metadata