Appendix 4 indecs Data Dictionary
This appendix describes the indecs Data Dictionary which forms a key part of the DOI®
Data Model. The indecs Data Dictionary (iDD) contains all Terms used in DOI® System AP
Metadata Declarations, ONIX messages and other schemes, and formal mappings of the
relationships between them. A detailed understanding of the iDD is not required by DOI® name
assigners or developers; mapping and related services may be made available through
The indecs Data Dictionary (iDD) is an ontology which exists to support semantic analysis and metadata interoperability for all DOI®-AP and ONIX metadata schemes. It is the repository of all Terms used in DOI® Kernel Metadata Declarations and Resource Metadata Declarations (RMDs), and other Terms that are required to establish the semantic relationships between them. The iDD is managed on behalf of IDF by a Maintenance Agency.
The iDD is jointly controlled by the IDF and EDItEUR, managers of the ONIX message formats which are the generally used standards in the international publishing industry. It is also fully integrated with the MPEG-21 Rights Data Dictionary (due to become a full International Standard in 2004).
All Terms used in public DOI Kernel and Resource Metadata Declarations must be mapped into the iDD, creating a network of equivalences and other relationships which can support different metadata functions, including the transformation of metadata from the Terms of one AP to another and the use of Terms from different APs together in cross-domain applications.
The iDD is the IDF's tool for semantic analysis. It assists DOI System Registration Agencies (RAs) in developing and validating their Metadata Declarations, and it completes the mapping of those Declarations into the iDD itself, creating the semantic compatibility that enables interoperability with other schemes.
The iDD will be viewed online by RAs, and semantic analysis and mapping are managed by the iDD Maintenance Agency (Ontologyx).
The iDD has developed concepts which are also being worked on elsewhere (e.g. ABC, CIDOC, MetaNet). We believe that the semantic mapping in some of these other developments is not rich or contextual enough to deliver full interoperability. Their vocabulary base is too limited, and growing it is a non-trivial matter: one-to-one schema mapping of fairly simple terms will start to break down as the terms become multi-schema and contextualized, unless there is a comprehensive underlying ontology to map onto. The "thesaurus" approach is not robust enough. Approaches such as the ABC model are geared to describing resources in an event-based way rather than to modelling contexts per se, which limits their scope.
The architecture of the iDD has been developed from the indecs framework analysis. IDF has been closely involved in its development since the beginning of the first indecs project in 1998. IDF was a partner in the Contecs:DD consortium which backed the second stage of the framework's development as the basis of the International Standard MPEG-21 Rights Data Dictionary in 2001-3. Through this process the point was reached in 2003 where an operational indecs dictionary supporting lookup, mapping and data transformation was implemented, although the transformation tools remain under development by Ontologyx.
iDD is an ontology based on the indecs ContextModel, a simple but powerful data model which takes activity as the basis of its semantics. Fundamental terms of the iDD -- Verb, Context, Agent, Resource, Time, Place, Property, Quality and Relator -- provide a framework by which, through the process of Subtyping, Terms of any type or complexity can be defined and related.
The iDD is housed in an SQL database and its Terms are available for online lookup by RAs. It has a highly generic table structure and the bulk of its knowledge is contained in subject-predicate-object "triples", which can be grouped in chains and sets of any complexity and granularity. These triples are compatible with those used in the emerging Semantic Web standards RDF and OWL, and the iDD is expressible in both.
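As an illustration of the subject-predicate-object structure only, the following minimal sketch holds triples in memory and queries them by any combination of fields. The `Triple` type and the `match` function are hypothetical names introduced here, not part of the iDD; the two sample Relationships are the examples quoted later in this appendix, and the actual SQL table layout is not described in this document.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One subject-predicate-object statement (illustrative layout only)."""
    subject: str
    predicate: str
    obj: str


# Sample data: the two example Relationships quoted in this appendix.
TRIPLES = [
    Triple("onix:Author", "IsSubClassOf", "idd:Creator"),
    Triple("cal:journal", "IsSameAs", "idd:Journal"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]
```

A call such as `match(TRIPLES, predicate="IsSameAs")` returns only the equivalence Relationship, which is the kind of pattern query on which chains and sets of triples can be built.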
Mapping of Terms into iDD is, of necessity, a process requiring human analysis and validation (see A 4.6 Semantic Analysis and Mapping of Terms). However, once mapped, the structure of iDD is capable of supporting metadata queries and transformations to a high level of complexity, including the generation of scheme-to-scheme maps. Though rich, the iDD is equally at home with simple or complex schemes and structures. It explicitly establishes all the relationships required for semantic engineering of Metadata Declarations, enabling Declarations to be designed and mapped at the level of granularity appropriate to an AP.
Currently the IDF and EDItEUR community are considering the appropriate format to use for iids (indecs identifiers, the unique identifiers of each element in the dictionary). No final decision has yet been reached, but the discussion is straightforward. The URIs will take the form of DOI names. A DOI name consists of a prefix and a suffix separated by a slash, e.g. 10.1234/1234567
To manage DOI names for elements easily, as opposed to those for first class content objects, IDF uses straightforward numeric prefixes for all content objects (such as 10.1234/) and reserves specific prefixes for administrative purposes. These will take the form of 10.ap/???? for Application Profiles, 10.ra/???? for Registration Agencies, and so on (where ???? indicates the variable suffix string). Therefore 10.iid/ is a likely prefix for a metadata element in the indecs Data Dictionary.
The string would be published as a URI available to those who have access to the iDD, as a DOI name.
In future, these identifiers would be used for establishing interoperability between different IDF and ONIX schemas (and third parties), enabling data transformation and integration between them.
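The prefix convention described above can be sketched as follows. This is a minimal illustration, not a DOI System tool: the function names and the `RESERVED_PREFIXES` table are assumptions made here, the 10.iid prefix is only the "likely" form mentioned above since no final decision has been reached, and the suffix `example` used in the usage note is hypothetical.

```python
from typing import Tuple

# Reserved administrative prefixes as described in the text (assumed table;
# "10.iid" is the proposed, not confirmed, prefix for iDD elements).
RESERVED_PREFIXES = {
    "10.ap": "Application Profile",
    "10.ra": "Registration Agency",
    "10.iid": "iDD metadata element (proposed)",
}


def split_doi_name(doi_name: str) -> Tuple[str, str]:
    """Split a DOI name into prefix and suffix at the first slash."""
    prefix, slash, suffix = doi_name.partition("/")
    if not slash or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi_name!r}")
    return prefix, suffix


def classify(doi_name: str) -> str:
    """Say whether a DOI name uses a reserved prefix or a content prefix."""
    prefix, _ = split_doi_name(doi_name)
    return RESERVED_PREFIXES.get(prefix, "content object")
```

For instance, `split_doi_name("10.1234/1234567")` yields the prefix `10.1234` and the suffix `1234567`, and `classify("10.iid/example")` reports the proposed iDD element category.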
Each Term in the iDD has Attributes as shown below. A Term itself is not the TermName but the underlying abstract meaning. Each Term has a unique identifier within the iDD and a public "iDDtag" by which it may be referenced in XML and other schemas. A DOI-AP will be established for iDD Terms so that DOI names can be used for lookup and other functions.
A Term may have any number of different names (in different Languages where appropriate) for use in different APs and schemes. For example, a Writer in one AP may be the same as an Author in another, and both these names can be maintained in iDD under their respective Authorities.
A Term may also have any number of expressions of its Definition, also in different Languages if required, appropriate to each RA.
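The separation between a Term and its names can be illustrated with a small data structure. This is a sketch under stated assumptions: the `Term` class, its field names, the identifier `example-term-1` and the authority labels `AP-one` and `AP-two` are all hypothetical, chosen only to mirror the Writer/Author example above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Term:
    """An abstract meaning with any number of names and definitions."""
    term_id: str  # unique identifier within the dictionary (hypothetical form)
    # Keys are (authority, language) pairs.
    names: Dict[Tuple[str, str], str] = field(default_factory=dict)
    definitions: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_name(self, authority: str, language: str, name: str) -> None:
        """Record the name one Authority uses for this Term in one Language."""
        self.names[(authority, language)] = name

    def name_for(self, authority: str, language: str = "en") -> Optional[str]:
        """Return the name a given Authority uses, or None if it has none."""
        return self.names.get((authority, language))


# One Term, two names maintained under their respective Authorities.
creator_of_words = Term(term_id="example-term-1")
creator_of_words.add_name("AP-one", "en", "Writer")
creator_of_words.add_name("AP-two", "en", "Author")
```

Both names resolve to the same `term_id`, which is the point of the design: relationships are recorded against the identifier, never against a name.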
A Term is mapped into the iDD by groups of "triple" Relationships which define its parents and other "family" relationships, and the specific constraints under which it operates. Attributes of an iDD Term are shown in the table below.
The iDD contains various TermSets:
All these Terms are available for use by any RA in any combination, along with new Terms which the RA has mapped into iDD. Other TermSets will be added from time to time as required.
The iDD is maintained for IDF by an appointed Maintenance Agency. Its functions are:
The relationship between IDF and the Maintenance Agency is to be managed through a Service Contract. The costs of semantic analysis, mapping and data querying and transformation services will be borne directly by users of the system.
All Terms used in Kernel Metadata and Resource Metadata Declarations should be mapped into the iDD. This mapping establishes the relationship between a Term and all other Terms used in these formats, and is the way in which semantic integrity of the DOI System is achieved.
The unusual aspect of mapping to the iDD is that a mapped term becomes a part of the Dictionary itself. The iDD structure is capable of recognizing any number of contextual meanings, and as new ones are identified in the course of mapping, they are placed in their appropriate place in the dictionary through Type hierarchies and RelationshipSets.
Mapping is a consensual exercise. It requires agreement between the organization responsible for managing the AP and the Maintenance Agency of the iDD that a given mapping is a correct interpretation of the meaning of a Term. This consent is registered as authorizations on a Term-by-Term basis in the iDD. Mappings cannot be added arbitrarily to the iDD by either the Maintenance Agency or an RA. Explicit authority is essential because third parties such as other RAs and DOI System users will be reliant on iDD mapping (it follows the third principle of the indecs framework -- "Designated Authority").
Mapping is a formal exercise with several required steps for each Term. However, Terms are not mapped in isolation. The first step is that the structure of the whole scheme to be mapped is drawn roughly into a "tree" or set of "trees" whose branches descend from the high-level elements of the iDD (including Context, Verb, Agent, Resource, Time, Place, Quality, and Relator) so that its own hierarchy can be understood and Terms mapped in the most appropriate order (starting at the top of each tree).
Each Term is allocated a unique identity as it is "plugged into" the iDD using one or more "triple" relationships with existing iDD Terms. The set of these Relationships is called the Term's Genealogy. The Genealogy identifies those Relationships by which the Term inherits meaning, or acquires global constraints upon its inherited meaning. The most important of these are the equivalence Relationships (for example cal:journal IsSameAs idd:Journal) and parent-child Relationships (for example onix:Author IsSubClassOf idd:Creator). A global constraint on this might be onix:Author IsCreatorOf idd:Words. Note that although these examples are shown here for convenience with Names, within the iDD structure each of these Terms is represented not by a Name but by a unique Term Identifier in the form of a DOI name, so the iDD is in essence a complex network of DOI names.
Many Terms have contextual as well as global constraints, and these are then described in sets of Relationships known as ContextualConstraint Sets. These allow dependencies to be shown. For example, a Term such as onix:replacesISBN in the ONIX for Books message set refers to the "International Standard Book Number of a former product which the current product replaces". Semantic analysis against the iDD deconstructs this into its component elements, and represents them in a formal chain of Relationships like this:
#1 IsA onix:replacesISBN
In these Relationships the numbers #1, #2 and #3 represent values of the Terms in any particular instance. These "arbitrary values" re-appear in different relationships, providing the contextual links with which the chain is built. The chain thus described provides all the necessary logic for queries or transformations based on knowledge of the actual value in any circumstances of any two of the three values (#1, #2 or #3).
A MeaningType of Derived or PartlyDerived is given to the Term. A Derived Term is one whose semantics are fully explained by its axiomatic Relationships (such as onix:replacesISBN). A PartlyDerived Term is one which relies to some extent on an external meaning defined in language.
Names, Definitions and Comments are added to the Term as required.
This is a painstaking process, but it is a once-off for each Term or scheme, with subsequent maintenance required only when new Terms are added or amendments made. The iDD provides mechanisms for modifying mappings and for adding and deleting Terms (although of course the consequences of such changes can be serious for legacy data).
Mappings are concerned fundamentally with meanings, not names. Terms can have different names in different APs, and the same word can mean different things in different APs. The DOI Data Model (DDM) does not mandate the use of any standardized vocabulary outside of the Kernel metadata. All relationships within the DMS and iDD are described with unique identifiers for each Term.
The level of granularity described above is unnecessary if only two or three schemes are being mapped. However, the fundamental assumption underlying the DOI Data Model is that in time there will be many APs whose metadata requires integrating at various levels, whether simply at the Kernel level or to support data interchange or more complex searching and processing. Semantic integrity on such a scale is unachievable without a central tool such as the iDD, for two simple reasons.
First, precise mapping depends upon at least one of the mapped schemes having a richer underlying model in which to precisely locate the others' terms. To give a very simple example, if one scheme has a Term "Author" (meaning "a creator of words") and the other has a Term "Composer" (meaning "a creator of music"), there is no direct relationship between the two: they are not equivalent, and neither is a subtype of the other. To establish a relationship, both need to be mapped to another element (say, "Creator") of which they are both subtypes. Then to distinguish them, the elements "Words" and "Music" need also to be identified. A common underlying model or ontology is needed, in which the new Terms already exist or can be added, to establish the explicit relationships which the individual schemes lack the richness to express.
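The Author and Composer example can be sketched in a few lines. This is an illustration only, using plain names rather than Term Identifiers; the `PARENTS` and `CREATES` tables and both function names are assumptions introduced here and do not reflect how the iDD itself stores or queries Relationships.

```python
from typing import Dict, Set

# Parent links in the richer underlying model (illustrative data).
PARENTS: Dict[str, Set[str]] = {
    "Author": {"Creator"},
    "Composer": {"Creator"},
}

# The elements that distinguish the two subtypes from one another.
CREATES: Dict[str, str] = {"Author": "Words", "Composer": "Music"}


def supertypes(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every ancestor of a term by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_supertypes(a: str, b: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the ancestors shared by two terms."""
    return supertypes(a, parents) & supertypes(b, parents)
```

Neither term appears among the other's supertypes, so no direct relationship exists, yet `common_supertypes("Author", "Composer", PARENTS)` finds the shared element "Creator", and the `CREATES` table records what tells the two apart.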
In general, schemes adopt data models which are tailored to meet their own particular requirements, and these are normally not rich enough to support unambiguous mapping. For example, a trial mapping recently between ONIX and a major metadata scheme from the educational world showed many approximate, unresolvable and ambiguous relationships. This is not because of any failings of either scheme, but because they were being tested beyond their original scope. Of course some one-to-one mappings can be very successful, if the schemes are well designed and operating in similar domains, but even here it is rarely adequate to support generalized automated processing. The iDD is designed for the purpose of supporting unambiguous, contextual mapping: that is its primary job.
Secondly, the more schemes come into play, the more one-to-one mappings will be required, each of which is costly in resources and likely to be less than adequate for the reasons just given. With the rapid growth of metadata schemes this is becoming an increasing problem.
The diagram below illustrates what happens if six schemes need to map to each other. Each scheme must do five one-to-one mappings: a total of fifteen mappings, probably with very mixed results. Any further scheme which joins this community then has to be mapped to every existing scheme, so the number of mappings added with each new scheme grows by arithmetic progression. When there are n schemes, n(n-1)/2 one-to-one mappings are needed. With twenty-five schemes, that is 300 possible one-to-one maps.
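The arithmetic can be checked with a short calculation. The function name below is an assumption introduced for illustration.

```python
def pairwise_mappings(n: int) -> int:
    """Number of one-to-one mappings needed so that n schemes all map to each other."""
    if n < 0:
        raise ValueError("number of schemes cannot be negative")
    # Each of the n schemes maps to the other n - 1, and each mapping is shared
    # by the two schemes it connects, hence the division by two.
    return n * (n - 1) // 2


print(pairwise_mappings(6))   # 15
print(pairwise_mappings(25))  # 300
```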
The next diagram shows the same mappings carried out through a central point. Each scheme requires mapping once (n schemes require n mappings) and thereafter it should be possible to create any required one-to-one mappings making use of the iDD ContextModel structure.
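How a one-to-one map might be derived from two mappings to a central point can be sketched as follows. This is a deliberately simplified illustration that handles equivalence only, whereas the iDD also records subtype and contextual Relationships. The scheme contents, the `term:` identifiers and the `derive_map` function are all hypothetical.

```python
from typing import Dict, List

# Each scheme is mapped once, to identifiers in the central dictionary.
# All names and identifiers here are hypothetical.
SCHEME_A_TO_HUB: Dict[str, str] = {
    "Writer": "term:1",
    "Periodical": "term:2",
    "Illustrator": "term:3",
}
SCHEME_B_TO_HUB: Dict[str, str] = {
    "Author": "term:1",
    "Journal": "term:2",
}


def derive_map(source_to_hub: Dict[str, str],
               target_to_hub: Dict[str, str]) -> Dict[str, List[str]]:
    """Derive a source-to-target map from two scheme-to-hub mappings."""
    # Invert the target mapping so hub identifiers lead back to target names.
    hub_to_target: Dict[str, List[str]] = {}
    for name, hub_id in target_to_hub.items():
        hub_to_target.setdefault(hub_id, []).append(name)
    # A source name with no counterpart in the target maps to an empty list.
    return {
        name: hub_to_target.get(hub_id, [])
        for name, hub_id in source_to_hub.items()
    }
```

With n schemes, only the n scheme-to-hub tables are authored by hand. A call such as `derive_map(SCHEME_A_TO_HUB, SCHEME_B_TO_HUB)` pairs "Writer" with "Author" and leaves "Illustrator" without a counterpart, making the gap explicit rather than hiding it.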
There are two important health warnings to make on this model.
First, iDD cannot produce unambiguous or precise mappings if the Terms used by an RA are themselves ambiguous or imprecise. iDD can accurately describe the ambiguity and leave the resolution to users. What iDD can achieve is accurate mapping as far as the source data allows, producing better results than a host of many-to-many mappings based on more limited models and varying techniques.
Secondly, the iDD contains the logic and data to support many kinds of processing, such as data transformations or the creation of scheme-to-scheme maps, but these require the development of application software and business processes. Contextual mappings provide one of the necessary bases for semantic interoperability, but do not provide everything.