Best practices for stable URIs
Revision as of 10:19, 20 March 2014
Recommended citation: Gregor Hagedorn, Terry Catapano, Anton Güntsch, Daniel Mietchen, Dag Endresen, Soraya Sierra, Quentin Groom, Jordan Biserkov, Falko Glöckler & Robert Morris, 2013. Best practices for stable URIs. http://wiki.pro-ibiosphere.eu/wiki/Best_practices_for_stable_URIs
Introduction
1. It is important to keep mission-critical URIs (or URLs, IRIs, or web addresses) stable. Make a deliberate choice about which pages and which classes of objects you want to manage as stable. Do not aim to keep all your URIs stable forever: this may become unmanageable.
2. The primary purpose of this discussion is to support others in finding good URI patterns. The secondary purpose is to assess whether some institutions could voluntarily share the same pattern, to ease recognition and set a recognizable example for others to follow.
3. Linked Open Data and the Semantic Web in particular use http-URIs to identify resources as well as to retrieve information about them. The Semantic Web works with any kind of http-URIs, including those that do not follow these best practices. However, it works best if URIs are kept stable. This can be difficult for some URI patterns; the present discussion suggests ways to make it reasonably likely that you will be able to keep your URIs stable.
4. While the present discussion may be useful when looking for stable URI patterns for purposes other than Linked Open Data and the Semantic Web, it largely focuses on these, and some aspects are specific to the Semantic Web.
5. Keep the URI very simple right from the start. In the face of changing technology, at some point you will have to use the webserver's rewrite module to keep URIs stable. The simpler the URI pattern is, the easier this becomes. Thus the first recommendation is: Create a simple URI and use rewriting right from the start. Define simple URI patterns (= no ports, no extensions like .php or .aspx, no parameters with ? or &) that are being rewritten to your current technology.
6. If several different URIs exist within a particular dereferencing service (e.g. two http-URIs) that point to exactly the same resource:
- Declare one as the "preferred" (canonical) URI.
- Inform about the equivalence either through redirects (e.g. http status 301) or through owl:sameAs or skos:closeMatch statements in associated rdf.
7. Highly recommended references:
- Sauermann & Cyganiak 2008, Cool URIs for the Semantic Web.
- Hyam, R.D., Drinkwater, R.E. & Harris, D.J. 2012. Stable citations for herbarium specimens on the internet: an illustration from a taxonomic revision of Duboscia (Malvaceae). Phytotaxa 73: 17–30.
- Stable Identifiers for Specimens Workshop (Roger Hyam, Edinburgh).
- Kevin Richards, Richard White, Nicola Nicolson, Richard Pyle 2011 A Beginner’s Guide to Persistent Identifiers (good general discussion, but most solutions discussed would not function in a Linked Data world).
- OBO Foundry Identifier Policy.
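The rewriting recommended in point 5 above is normally configured in the web server's rewrite module. The following Python sketch only illustrates the idea of mapping a simple, technology-free public pattern onto whatever the current backend happens to be. The backend URL, the `/specimen/` path, and the names `REWRITE_RULES` and `rewrite` are hypothetical and not taken from any institution's setup.

```python
import re
from urllib.parse import quote

# Hypothetical rewrite table: stable public pattern -> current backend URL.
# Only the right-hand side should ever change when the technology changes.
REWRITE_RULES = [
    (
        re.compile(r"^/specimen/([A-Za-z0-9_-]+)$"),
        "http://backend.example.org/app/show.php?class=specimen&id={id}",
    ),
]


def rewrite(stable_path):
    """Return the internal URL for a stable path, or None if no rule matches."""
    for pattern, template in REWRITE_RULES:
        match = pattern.match(stable_path)
        if match:
            # Percent-encode the identifier before placing it in the query.
            return template.format(id=quote(match.group(1), safe=""))
    return None


if __name__ == "__main__":
    print(rewrite("/specimen/M-2361318"))
    print(rewrite("/specimen/123.php"))  # extensions are not part of the pattern
```

Because the public pattern has no port, extension, or query parameters, replacing the backend later means editing a single template while every published URI stays valid.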
Recommended patterns for stable URIs
A generally recommended URI pattern is the following:
- http://subdomain.yourdomain.org/path/variable-identifier
For the Semantic Web and Linked Open Data, the URI for the abstract concept/physical object and the URI for the related information resource (html, RDF, json) may be two independent URIs of the form above, connected by an http 303 (“See Other”) redirect. Alternatively, a URI like the above may be used for the information resource, plus a “hash-URI” for the abstract concept/physical object. A hash URI appends a fragment identifier at the end:
- http://subdomain.yourdomain.org/path/variable-identifier#hash
- http:// is called a "scheme". The semantic web requires http here (nothing else, not even https!).
- subdomain: If the stable URIs use a general-purpose domain with many different services, it may be desirable to add a dedicated subdomain for specific services. Subdomains offer the flexibility that several institutions can, in the future, share or merge their operations for a set of subdomains without affecting the stability of these URIs. If the main domain is already dedicated to a specific service, the use of a subdomain is irrelevant.
- yourdomain.org/: The main domain name, like rbge.org.uk, zoobank.org, ipni.org, naturalis.nl, nhm.ac.uk. This provides global uniqueness of any locally unique string following it, anchors trust and authority of the information, and provides branding and traceability of citations.
- Note: The scheme and domain part of URIs (e.g. http://subdomain.yourdomain.org/) is case-insensitive. For example, three URIs containing "subdomain", "Subdomain", and "SUBDOMAIN" would point to the same resource. For the semantic web, however, this part should always be entirely in lowercase letters. All other parts (path, identifier) are required to be case-sensitive.
 
- path: The part that remains constant for different identifiers of the same class (e.g., taxa, specimens). Similar to a subdomain, this increases the ease with which identifiers can be kept stable over decades (using web server rewrite modules).
- The path may consist of several parts like “/specimen/id/”.
- It is not required, but best practice, that the path should not contain a colon (:). While in principle legal, the colon creates problems when a relative URI starts with a colon in the first part of the path. Relative URIs may be unavoidable if both the http and https schemes are to be supported for human-readable html. Also, some bugs in relation to a colon in the path may exist in particular software.
- When using a pattern without a path like http://zoobank.org/7D39CAAA-4B4B-4588-A372-D4097162B1CD, the variable-identifier part after the domain must have a form which can always be distinguished from any other possible path or service on the same domain. In the example above, the homogeneous length of a UUID, the formatting with hyphens, and the absence of any other punctuation make this likely. In most circumstances it is not recommended to omit a path on institutional domains (for which the number of other services may be very large).
 
- variable-identifier: The part that changes for each object. It will usually be a number or code you also use otherwise, like a simple locally unique number or code (123, a123, M-2361318, ...), or it may be a UUID like 1C4EDC178AD79DD7F1A5AB856E8C5BCA.
- Best practice for choosing this identifier is to use an existing scheme for which uniqueness within a project or institution is already managed. If an object or concept code exists, and for specimens is perhaps already attached to them as a QR code or barcode, this should be used. The following explains some advantages and disadvantages of certain choices:
- Short incremental identifiers or codes have some advantages if you expect use cases in which codes must be compared with each other by humans for identity, entered via a keyboard, or encoded in barcodes or QR codes (the physical size of the encoding increases with the number of characters). The disadvantage of these codes is that they can generally only be guaranteed to be unique if they are created while a data connection to a central code registry exists.
- Long UUID codes (as in the Zoobank example) have advantages if you need to create identifiers on devices that are not connected to a central instance (e.g. for mobile field recordings). Even in this case it is possible to keep them out of the stable URI, by using them only for a field collector's number but not for the final accession number. Another advantage of UUIDs is that misreading one or two letters of the code will usually be detected (creation is non-sequential, well distributed, and only a tiny portion of the codes will ever be used in a given institution). The disadvantages that come with the length are the mirror image of the advantages of short codes above.
 
 
- #hash: This is necessary for the Semantic Web when using the “hash-method” to distinguish between the abstract concept or concrete object (e.g. Formica rufa or a specific physical specimen, which cannot be transmitted through the internet, only described) and the web pages (html, pdf, rdf-data).
- The URI containing the hash must always refer to the abstract concept or physical object which, unlike the information resource, cannot be transmitted through the internet.
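The parts described above can be made concrete with a small Python sketch built on the standard `urllib.parse` module. It splits a URI into the pieces named in this section, lowercases only the scheme and domain, and checks whether an identifier has the hyphenated UUID form used in the Zoobank example. The function names are hypothetical, and the sketch assumes URIs without ports or user information, as recommended here.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hyphenated UUID form, as in the Zoobank example.
UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def split_stable_uri(uri):
    """Split a URI into the parts named in the recommended pattern."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,              # subdomain plus main domain
        "path": "/".join(segments[:-1]),     # constant part, may be empty
        "identifier": segments[-1] if segments else "",
        "hash": parts.fragment,
    }


def normalize_case(uri):
    """Lowercase scheme and domain only; path, identifier and hash keep their case."""
    parts = urlsplit(uri)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path, parts.query, parts.fragment)
    )


def looks_like_uuid(identifier):
    """True if the identifier has the hyphenated UUID form."""
    return bool(UUID_RE.match(identifier))


if __name__ == "__main__":
    print(split_stable_uri("http://specimen.example.org/permanent/123#obj"))
    print(normalize_case("HTTP://Specimen.Example.ORG/permanent/ABC123#Obj"))
```

A check like `looks_like_uuid` is one way a server could tell a path-less identifier apart from other services on the same domain.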
 
Examples of words or strings to use in the parts of the URI pattern above
For the subdomain and the path at least two strings are needed, with the hash method another one after the "#". The choice of words or strings does not depend on technical or stability criteria, but largely on social considerations.
Widely used terms may be sorted into these categories:
- Generic terms like: "resource", "portal", "content", "object", "concept", "thing", "topic", "citable". NOTE: ADDITIONAL PROPOSAL WELCOME!
- Terms for classes of objects or concepts like: "taxon"/"taxa", "taxonconcept", "name", "term", "sample", "specimen", "treatment", "description", "morphology", "collection", "person"/"people", "organisation"/"institution", "locality", "herbarium".
- An indicator of stability like "stable", "permanent", "permalink", "stable-id", "purl" (= permanent URL). NOTE: ADDITIONAL PROPOSAL WELCOME!
- Terms expressing only the already known fact, that this is about identifiers (which is redundant, but at the same time an advantage if one is seeking words with no relevant semantics): "id", "identifier", "guid".
- Other terms with no or reduced semantics like: "dx", "zb" (e.g. abbreviation of Zoobank), "res", "it", "o", "t", "s", "p".
Recommendations:
1. Most humans find repetitions like http://object.example.org/object/123#object or concatenations of closely overlapping terms like http://object.example.org/concept/123#topic confusing.
2. In the semantic web, the word "data" should be avoided where referring to the concept or thing itself (as opposed to the data about it). A URI like data.organisation.org/specimen/123 for a specimen itself (but redirected to another URI when the data are being returned) is easily misinterpreted as referring to the data rather than the object.
3. In principle, a similar concern may be raised over the use of "id" or "identifier" (the semantic web would speak about the thing by means of an identifier, not about the identifier), but these concerns are probably negligible.
4. Terms from the categories above can probably be used interchangeably for subdomain and path, i.e. specimen.example.org/object/123 and object.example.org/specimen/123 work similarly well.
- If you foresee that operations for different object classes may in the future be consolidated within different consortia, it may be desirable to put the object class (like specimen) in the subdomain.
5. For the fragment after the hash, which signals that the URI with the hash is the real thing and the one without it is the data, the choices are more limited. Examples:
- specimen.example.org/res/123#specimen
- specimen.example.org/res/123#object
- specimen.example.org/res/123#obj
- specimen.example.org/res/123#id
- specimen.example.org/res/123#itself
- PLEASE ADD YOUR EXAMPLES!
- (The above applies only to the hash method, not to 303 redirection.)
Examples:
- http://object.example.org/res/123#specimen
- http://specimen.example.org/stable-id/123#physical
- http://id.example.org/specimen/123#obj
- http://res.example.org/specimen/123#id
- http://permanent.example.org/specimen/123#id
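With the hash method, an HTTP client removes the fragment before sending the request, so the server only ever sees the URI of the information resource. The short Python sketch below uses the standard `urllib.parse.urldefrag` function to show that relationship for one of the example URIs. The helper names are hypothetical.

```python
from urllib.parse import urldefrag


def document_uri(thing_uri):
    """Return the URI that is actually requested for a hash URI."""
    return urldefrag(thing_uri).url


def identifies_thing(uri):
    """True if the URI carries a fragment, i.e. it names the thing itself."""
    return bool(urldefrag(uri).fragment)


if __name__ == "__main__":
    thing = "http://id.example.org/specimen/123#obj"
    print(document_uri(thing))                    # information resource
    print(identifies_thing(thing))                # URI with hash
    print(identifies_thing(document_uri(thing)))  # URI without hash
```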
YOUR Preferred pattern for specimen or scientific names
Gregor Hagedorn:
- object at http://specimen.example.org/permanent/123#obj
- rdf/html at http://specimen.example.org/permanent/123
Falko Glöckler: The Museum für Naturkunde Berlin will use
- object at http://coll.mfn-berlin.org/u/ZMB_123
- rdf at http://coll.mfn-berlin.org/u/ZMB_123.rdf
- json at http://coll.mfn-berlin.org/u/ZMB_123.json
- xml at http://coll.mfn-berlin.org/u/ZMB_123.xml
- turtle at http://coll.mfn-berlin.org/u/ZMB_123.turtle
- html at http://coll.mfn-berlin.org/u/ZMB_123.html
- Note: the URI is constrained by the need to keep it short to be able to use a small QR-code model for tiny labels. Else we would have preferred to not abbreviate collection to coll or unit to u.
 
For images we will use:
- media itself at, e.g.
 http://media.mfn-berlin.org/u/ZMB_123__dorsal.jpg,
 http://media.mfn-berlin.org/u/ZMB_123__frontal_1.dng,
 http://media.mfn-berlin.org/u/ZMB_123__dorsal.png
- rdf metadata at http://media.mfn-berlin.org/u/ZMB_123__dorsal.rdf
- html (media context page with human-readable metadata plus, where possible, the embedded media item) at http://media.mfn-berlin.org/u/ZMB_123__dorsal.html
- Note: The pattern above assumes that media that are present in different mime types are always converted from each other, and that no accidental id conflict exists such that ZMB_123__dorsal.jpg and ZMB_123__dorsal.png actually refer to different abstract "works" (e.g. by different photographers).
- Note: Media may have complex relations. For images some of these facets are:
- A single specimen may be present in different views (standard ones like frontal, dorsal, ventral, or non-standard ones)
- The same view of one specimen may be present with different focus, or captured by a different photographer. Focus series used to create stacked images may be present.
- A single image may be present in different media formats (dng, png, tiff, jpg) at the same resolution
- A single image may be present in different levels of post-capture processing (sharpened, background removed or changed, etc.)
- A single image may be present at different resolution (max, web, thumbs, etc.)
 
- Note: All blanks (%20-characters) in media file names will be replaced by an underscore before publishing
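The media naming described above can be sketched in a few lines of Python. The double underscore between unit identifier and view name is inferred from the examples given here, and the function name `media_uri` and its parameters are hypothetical, not part of the museum's actual tooling.

```python
def media_uri(unit_id, view, extension,
              base="http://media.mfn-berlin.org/u/"):
    """Build a media URI, replacing blanks in the file name by underscores."""
    # Unit identifier and view name are joined by a double underscore,
    # as in the examples; any blank becomes a single underscore.
    name = f"{unit_id}__{view}".replace(" ", "_")
    return f"{base}{name}.{extension}"


if __name__ == "__main__":
    print(media_uri("ZMB_123", "dorsal", "jpg"))
    print(media_uri("ZMB_123", "frontal 1", "dng"))
```

Replacing blanks before publishing avoids %20 escapes in the stable URI, which keeps it shorter and easier to read and type.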
 
Richard Pyle (from gplus discussion, "{UUID-identifier}" is a concrete UUID): 
- object at http://zoobank.org/{UUID-identifier}
- rdf/html at http://zoobank.org/NomenclaturalAct/{UUID-identifier}
Peter DeVries:
- object at http://ocs.taxonconcept.org/ocs/0da685c9-9cdc-4dff-baf3-38d1bdbc6552
- rdf/html at http://ocs.taxonconcept.org/ocs/0da685c9-9cdc-4dff-baf3-38d1bdbc6552.html
Roger Hyam:
- object at http://data.rbge.org.uk/herb/E00435912
- rdf/html at http://elmer.rbge.org.uk/bgbase/vherb/bgbasevherb.php?cfg=bgbase/vherb/bgbasevherb.cfg&specimens_barcode=E00435912
Quentin Groom:
- object at http://herbariumspecimen.belgium.museum/permanent/BR5030008086350#id
- rdf/html at http://herbariumspecimen.belgium.museum/permanent/BR5030008086350
Terry Catapano (for Plazi treatments):
- object at http://treatment.plazi.org/id/503DD3E082B645B18CFE08E3C03580ED
- html/xml/rdf/json representations at http://treatment.plazi.org/id/503DD3E082B645B18CFE08E3C03580ED.[html|xml|rdf|json]
Jordan Biserkov:
- object at http://stable.example.org/specimens/7D39CAAA-4B4B-4588-A372-D4097162B1CD#concept
- rdf/html at http://stable.example.org/specimens/7D39CAAA-4B4B-4588-A372-D4097162B1CD
Anton Güntsch:
- object at http://herbarium.bgbm.org/object/BW16684010
- rdf at http://herbarium.bgbm.org/data/rdf/BW16684010
- html at http://herbarium.bgbm.org/data/page/BW16684010
Dag Endresen (prototype under development):
- Pattern: resolver + UUID [+ suffix extension, or MIME type]
- physical object at http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3
- redirected to the information object found at http://gbif.no/resolver/41d9cbb4-4590-4265-8079-ca44d46d27c3.[html|rdf|json|n3|txt|csv]
Brian Fisher, antweb:
- object at http://www.antweb.org/specimen/CASENT0104542
- rdf/html at (tbd)
YOUR NAME:
- object at
- rdf/html at
Please add in your preferred pattern based on the notes above as well as new ideas. Can we achieve a set of patterns (not a single one) that others could mimic? I think this might help to spread the idea...
All accounts of biowikifarm instances work here; if you have no account, please request one.
Further reading:
- Web server 303 redirection for the semantic web
- Roger Hyam's summary of the "Stable Identifiers for Specimens Workshop" (4th & 5th of June 2013): http://stories.rbge.org.uk/archives/3846
- Anton Güntsch's presentation on the stable URI initiative during the CETAF34 meeting in Edinburgh (10th and 11th of September 2013): http://wiki.pro-ibiosphere.eu/wiki/File:CETAF34_ISTC_Guentsch.pdf
- DOI or LOD or DOI and LOD?


