DOI or LOD or DOI and LOD

From CETAF ISTC Wiki
bwf>Gregor Hagedorn
Revision as of 00:21, 4 November 2013

The pro-iBiosphere project has investigated how stable identifier methods are used in the biodiversity community. In the course of several workshops, participants from Europe, the US, and Australia almost unanimously agreed that Life Science Identifiers (LSIDs), a technology driven only by the biodiversity community for the past 8 years, should be abandoned. It was also widely agreed that the preferred form of any identifier, be it Linked Open Data URIs (LOD-URIs), DOIs, or ARKs, should be the Semantic Web compatible HTTP form (including the DOI or ARK resolver).

However, the question of whether stable URIs should be managed in a decentralized way (at multiple institutions, each using the standard URI-stability technology provided by most web servers) or whether a special, centralized technology such as DOI should be mandatory continues to be discussed. The following image shows a comparison of the Semantic Web / Linked Open Data and DOI technologies:

DOI vs LOD

The top scenario shows how a DOI resolution service functions. A separate server, the DOI resolution provider, accepts the request, consults its internal Stability Mapping Definitions (where the DOI is mapped to the final URI), differentiates between RDF and HTML requests, and forwards the request to the ultimate destination. The client (machine or human) no longer sees the stable DOI, but the redirected, potentially unstable URI.
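
The central lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DOI, the target URLs, the `DOI_TABLE` structure, and the use of a 302 status are all invented for the example and do not describe any particular resolver.

```python
# Hypothetical central mapping table: DOI -> current locations at the provider.
# The DOI and all URLs are invented for illustration.
DOI_TABLE = {
    "10.9999/specimen.123": {
        "rdf": "http://data.example.org/v2/rdf/specimen/123",
        "html": "http://www.example.org/v2/specimen/123.html",
    },
}


def wants_rdf(accept_header):
    """Rough check: does the client ask for RDF/XML or Turtle?"""
    accept = accept_header.lower()
    return "application/rdf+xml" in accept or "text/turtle" in accept


def resolve_doi(doi, accept_header):
    """Return (status, location) as a central resolver might.

    The answer is a redirect, so the client ends up at the provider's
    URI and no longer sees the DOI form of the identifier.
    """
    targets = DOI_TABLE.get(doi)
    if targets is None:
        return 404, None
    kind = "rdf" if wants_rdf(accept_header) else "html"
    return 302, targets[kind]  # 302 is an assumption of this sketch
```

Calling `resolve_doi("10.9999/specimen.123", "text/turtle")` returns the RDF location, while a browser-style `Accept: text/html` request is sent to the HTML page. In both cases the address the client finally works with is the provider's, not the DOI.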

The bottom scenario shows the same situation in a linked data webserver setup within one institution. The webserver itself differentiates between RDF requests from machines (red dot on the left side) and HTML requests from humans using a web browser. Using content negotiation, requests to the same URI are directed to RDF data or HTML web pages, respectively. The webserver also consults its internal mapping definitions to keep the URIs stable. One advantage of this situation is that the client (machine or human) continues to see a stable URI.
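
A minimal sketch of such content negotiation is given below. It assumes a hypothetical `INTERNAL` mapping table and invented paths, and it serves the chosen representation directly under the requested URI. Real web servers normally provide this through configuration rather than custom code.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


RDF_TYPES = {"application/rdf+xml", "text/turtle"}

# Hypothetical internal mapping: stable public path -> current internal files
INTERNAL = {
    "/object/123": {
        "rdf": "/store/2013/obj123.rdf",
        "html": "/cms/pages/obj123.html",
    },
}


def serve(path, accept_header):
    """Pick a representation for a stable URI without changing that URI."""
    entry = INTERNAL.get(path)
    if entry is None:
        return 404, None, None
    for media_type, q in parse_accept(accept_header):
        if q > 0 and media_type in RDF_TYPES:
            return 200, media_type, entry["rdf"]
        if q > 0 and media_type in ("text/html", "*/*"):
            return 200, "text/html", entry["html"]
    return 200, "text/html", entry["html"]  # default: page for humans
```

The public path `/object/123` never changes. Only the right-hand side of `INTERNAL` is edited when files move inside the institution, which is what keeps the URI stable from the client's point of view.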

Technically, both scenarios work very similarly. The DOI scenario has minor advantages with respect to stability, which matter almost exclusively if the domain is lost by accident or if the domain transfer is neglected after a merger. By introducing the additional redirection layer, only a single domain name is needed (which is a single point of failure, but which can also reasonably be expected to be managed to the highest standards). The DOI has the disadvantage that the URI as seen from the client side changes, because the redirect goes to a different server and is not handled within a single system.

The main distinction between the two scenarios is therefore between centralized and decentralized stability management.

Biodiversity community DOI system

The slide shows a scenario in which millions of requests from millions of clients have to be forwarded by a central resolver infrastructure to a large number of data providers. The service requirements of a biodiversity DOI service that provides the canonical identifiers for all living things in the Semantic Web (including humans, their parasites, crops, pets, etc.) may in fact be several orders of magnitude higher than those of CrossRef or DataCite DOI redirection. For data relations involving organisms, including those from medicine, agriculture, etc., these DOIs would have to be resolved with every query or reasoning step.

Some additional comments on the slide above:

  1. The central redirection table can grow very large. Technically this is manageable, but it requires resources.
  2. The large number of data providers involved may require substantial human resources.
  3. A provider that updates the redirection table for, e.g., 30 objects can only do so through scripting. This requires the provider (e.g. a natural history collection) to learn the API of the central redirection service.
  4. Because major current DOI systems such as CrossRef or DataCite provide identifiers for digitally published objects, and define some metadata expectations accordingly, it is rather doubtful whether they are suitable for physical specimens or abstract taxon concepts. The slide therefore assumes a community infrastructure owned and maintained by the biodiversity community, run, e.g., by GBIF. Who exactly runs the infrastructure is, however, secondary. The primary argument is that load can be high and that management and resources need to be adequate and sustainably financed.

A central system has some advantages, especially with respect to additional services such as quality control and centralized, reliable global statistics. Whether these advantages are decisive depends on one's needs. Unfortunately, it may be time-consuming to reach a consensus on this.

However, there is some good news: the substantial concerns raised above about the management resources required to maintain the mapping between DOIs and URIs, both at the central redirection provider and at each data provider, can be reduced by first implementing well-managed, locally stable URIs. Doing so is straightforward, does not require additional technology, and drastically reduces the frequency, or even the likelihood, of changes being necessary at an additional central redirection service.
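
The effect of layering a central table on top of locally stable URIs can be illustrated with a small sketch. Both tables, the DOI, and all URLs are hypothetical. The point is only that an internal reorganisation touches the local table and leaves the central one unchanged.

```python
# Central table (hypothetical): DOI -> stable institutional URI
central = {
    "10.9999/specimen.123": "http://id.example.org/specimen/123",
}

# Local table at the institution: stable URI -> current internal location
local = {
    "http://id.example.org/specimen/123": "http://old-cms.example.org/item?id=123",
}


def resolve(doi):
    """Follow both layers: central DOI lookup, then local stable-URI lookup."""
    stable_uri = central[doi]
    return stable_uri, local[stable_uri]


def migrate_local(stable_uri, new_location):
    """An internal reorganisation only edits the institution's own table."""
    local[stable_uri] = new_location
```

After `migrate_local("http://id.example.org/specimen/123", "http://new-cms.example.org/specimens/123")`, the `central` dictionary is untouched, yet `resolve` already returns the new internal location. No call to the central provider's API was needed.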

Thus, whether or not a central DOI system is adopted over time, investing today in good management practices for stable, Semantic Web compatible identifiers at each institution will not be wasted effort. It may be that the solution is LOD-URIs and DOIs:

DOI and LOD