Questions, problem solutions and further discussions (Guide of best practices): Difference between revisions
bwf>Andreas Plank m (→ CETAF Specimen Preview Profile (CSPP): added) |
m (Adjusted links to Import issues with CETAF identifiers (which sits now in the default namespace)) |
(33 intermediate revisions by one other user not shown) | |||
{{Alert box|content=This is the ''main page'' for all further questions and discussions. Some discussions already have their own talk pages, but questions can also be asked here. If needed, add new questions as a section. For ''nested'' discussion, start a line with <code><nowiki>:</nowiki></code>; it indents the text of that line. If you add a comment or question, please mark it with your Wiki signature using 4 tildes <code><nowiki>~~~~</nowiki></code> (the Wiki replaces them; advanced usage: 5 tildes <code><nowiki>~~~~~</nowiki></code> give the date only, 3 tildes <code><nowiki>~~~</nowiki></code> give the user name only). If you have questions, please feel free to add an entirely new section at the appropriate place or add a subsection to an existing section.
<hr style="margin-top:2ex;" />
To be ''notified'' of changes on any particular page (via the e-mail options in [[Special:Preferences|your user preferences]]), click the star [[File:Vector skin - page not in the watchlist.png|link=|Page not in the watchlist]] so that it changes to [[File:Vector skin - page in the watchlist.png|link=|Page in the watchlist]], or tick » [✓] Watch this page « below in the Wiki Editor.
}}
== What Institution has which Identifiers or Implementations? ==
See
* [[Standards compliance dashboard]]
* [[IDs and LOD Discussion]]
* [[Talk: CETAF Specimen Preview Profile (CSPP)]]
== <u>C</u>ETAF <u>S</u>pecimen <u>P</u>review <u>P</u>rofile (CSPP) ==
A set of standard data components for data exchange, see:
* [[Talk: CETAF Specimen Preview Profile (CSPP)]]
* (older) [[Talk: CETAF Specimen Preview Profile]]
== Splitting of collection specimens{{anchor|FAQ - Splitting of collection specimens}} ==
{{Talk:Splitting of collection specimens (Guide best practices)}}
== Common Technical Problems ==
Issues of particular institutes are listed separately; please refer to [[Import issues with CETAF identifiers]].
=== Redirection and Issuing of RDF vs. Human Readable Page ===
If I [Andreas Plank] understand it correctly, on CETAF Level 2 (having machine-readable RDF metadata):
… the bare <code>http<nowiki></nowiki>://our-institution.org/any-specimen/123CETAF-ID</code> does not necessarily need a {{abbr|URL}} redirect:
# it shall issue a web page to be read by humans (default)
# it shall issue {{abbr|RDF}} ''if'' requested via the HTTP header <tt>Accept: application/rdf+xml</tt>;
… but ''if'' …
: … you have implemented a redirect of the original <code>http<nowiki></nowiki>://our-institution.org/any-specimen/123CETAF-ID</code> to any other resource, let’s say <tt>http<nowiki></nowiki>://our-institution.org/any-specimen/'''rdf'''/123CETAF-ID</tt> or <tt>http<nowiki></nowiki>://our-institution.org/''nice-webpage/specimen/bellis-perennis/some-yx456-ID''</tt> …
: … ''then'' implement the redirect with the HTTP response status code 303 “See Other” instead of the sometimes used code 302 “Found” (or “Moved Temporarily”), because it is hard to tell how a client will interpret a 302 response.
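The redirect rule above can be sketched with Python's standard library. This is only a minimal illustration, not a reference implementation: the path prefixes are hypothetical and modelled on the example URLs above, and the <tt>Accept</tt> header is checked with a simple substring test instead of full q-value parsing.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical path layout, modelled on the example URLs above.
SPECIMEN_PREFIX = "/any-specimen/"
RDF_PREFIX = "/any-specimen/rdf/"
HTML_PREFIX = "/nice-webpage/specimen/"


def negotiate(path: str, accept: str) -> tuple[int, str]:
    """Return (status code, Location value) for a request to a CETAF-ID path."""
    if not path.startswith(SPECIMEN_PREFIX) or path.startswith(RDF_PREFIX):
        return 404, ""
    specimen_id = path[len(SPECIMEN_PREFIX):]
    if not specimen_id:
        return 404, ""
    # Simplification: substring test instead of parsing media ranges and q-values.
    if "application/rdf+xml" in accept:
        return 303, RDF_PREFIX + specimen_id   # machine-readable RDF document
    return 303, HTML_PREFIX + specimen_id      # human-readable page (default)


class CetafIdHandler(BaseHTTPRequestHandler):
    """Answers GET requests with 303 'See Other', never with 302."""

    def do_GET(self):
        status, location = negotiate(self.path, self.headers.get("Accept", ""))
        self.send_response(status)
        if location:
            self.send_header("Location", location)
            self.send_header("Vary", "Accept")  # the response depends on the Accept header
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start a local test server (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), CetafIdHandler).serve_forever()
```

Calling <code>serve()</code> starts a local test server; a request with <tt>Accept: application/rdf+xml</tt> should then be answered with status 303 and a <tt>Location</tt> header pointing to the RDF path.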
=== Developing RDF or Proposal of Lightweight Data File Storage (TriG format){{anchor|develop RDF via TriG format or use it as dump data storage}} ===
RDF/XML is complicated to read and perhaps also to develop when nested data have to be mapped. A more readable approach is to write the data in the TriG format<ref group="reference" name="Bizer and Cyganiak 2014">''Bizer, C. and Cyganiak, R.'' 2014. ‘RDF 1.1 TriG: RDF Dataset Language. W3C Recommendation 25 February 2014’. Edited by Gavin Carothers and Andy Seaborne. https://www.w3.org/TR/trig/.</ref> and convert it eventually to RDF/XML or to any other data format needed. The TriG format is easy to read: it reads like a sentence whose data elements are separated by semicolons (;) and which ends with a dot (.); then comes the next “sentence”. The minimum example of the [[CETAF Specimen Preview Profile (CSPP)]] looks like this in TriG format (of course without line numbers):
<blockquote>
<syntaxhighlight class="lineno-gray" lang="text" style="font-size:smaller;" line highlight="6,9" >
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dc: <http://purl.org/dc/terms/> .

<http://herbarium.bgbm.org/data/rdf/B100068798>
  dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
  dc:created "2019-11-11T15:41:25+01:00" .

<http://herbarium.bgbm.org/object/B100068798>
  dc:title "Erysimum salangense Polatschek & Rech.f." ;
  dc:created "1967-07-14" ;
  dc:type "Specimen" ;
  dc:publisher "BGBM" ;
  dwc:scientificName "Erysimum salangense Polatschek & Rech.f." ;
  dwc:previousIdentifications "Erysimum salangense Polatschek & Rech.f." ;
  dwc:family "CRUCIFERAE" ;
  dwc:countryCode "AF" ;
  dwc:decimalLongitude "69.033332824707" ;
  dwc:decimalLatitude "35.366664886475" ;
  dwc:recordedBy "Rechinger,K.H." ;
  dwc:fieldNumber "37047" ;
  dwc:associatedMedia <http://ww2.bgbm.org/herbarium/images/B/10/00/68/79/B_10_0068798.jpg> .
</syntaxhighlight>
One can see that line 6 declares a <tt>dc:subject</tt>, which is then defined from line 9 on; this becomes nested automatically later, when the conversion to RDF/XML is done. Lines 5 to 7 describe the RDF document itself, because it is delivered under a different URI than the one the client requested (the client requested the actual CETAF-ID URI (line 9) and got redirected to this document). Lines 9 and following describe the CETAF-ID URI, which identifies the actual herbarium specimen.
</blockquote>
<div class="mw-collapsible mw-collapsed">
When the record is enriched with more data, the TriG format becomes more nested (click “expand” on the right to see the example).
<div style="padding-left:1em;border-left:1px solid gray" class="mw-collapsible-content">
Line numbers are added to illustrate the relations:
: line 8 declares a <tt>dc:subject</tt> which is further detailed from line 11 on;
:: the description starting at line 11 contains, in line 36, a Wikidata entry (<tt>dwciri:recordedBy</tt>) whose own details are stated from line 48 on (the same pattern applies to the IIIF resource for additional media, lines 38 and 40) …<br />Note: <code>dwciri:recordedBy</code> has the same meaning as <tt>dwc:recordedBy</tt>, but as an RDF predicate <tt>dwciri:recordedBy</tt> is intended to be repeatable and to have an {{abbr|IRI}} reference as its object.
<syntaxhighlight class="lineno-gray" lang="text" style="font-size:smaller;" line highlight="8,11,36,38,40,48" >
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://herbarium.bgbm.org/data/rdf/B100068798>
  dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
  dc:created "2019-11-11T15:41:25+01:00" .

<http://herbarium.bgbm.org/object/B100068798>
  dc:title "Erysimum salangense Polatschek & Rech.f." ;
  dc:description "A herbarium specimen of Erysimum salangense Polatschek & Rech.f. collected by Rechinger,K.H." ;
  dc:creator "Rechinger, K.H." ;
  dc:created "1967-07-14" ;
  dc:type "Specimen" ;
  dc:publisher "BGBM" ;
  dwc:materialSampleID "http://herbarium.bgbm.org/object/B100068798" ;
  dwc:basisOfRecord "PreservedSpecimen" ;
  dwc:collectionCode "B" ;
  dwc:catalogNumber "B 10 0068798" ;
  dwc:scientificName "Erysimum salangense Polatschek & Rech.f." ;
  dwc:previousIdentifications "Erysimum salangense Polatschek & Rech.f." ;
  dwc:family "CRUCIFERAE" ;
  dwc:genus "Erysimum" ;
  dwc:specificEpithet "salangense" ;
  dwc:country "Afghanistan" ;
  dwc:countryCode "AF" ;
  dwc:locality "\n Afghanistan: NE-Afghanistan, Kathagan. Sar-i Hauz, in declivibus borealibus jugi Salang. substr. granit. Alt.: 2600 m. 14.07.1967, Leg.: K. H. Rechinger 37047.\n " ;
  dwc:decimalLongitude "69.033332824707" ;
  dwc:decimalLatitude "35.366664886475" ;
  dwc:eventDate "1967-07-14" ;
  dwc:recordNumber "37047" ;
  dwc:recordedBy "Rechinger,K.H." ;
  dwc:fieldNumber "37047" ;
  dwciri:recordedBy <http://www.wikidata.org/entity/Q78738> ;
  dwc:associatedMedia <http://ww2.bgbm.org/herbarium/images/B/10/00/68/79/B_10_0068798.jpg> ;
  dc:relation <http://herbarium.bgbm.org/iiif/B100068798> .

<http://herbarium.bgbm.org/iiif/B100068798>
  dc:identifier <http://herbarium.bgbm.org/iiif/B100068798> ;
  dc:type <http://iiif.io/api/presentation/3#Manifest> ;
  dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
  dc:format "application/ld+json" ;
  dc:description "A IIIF resource for this specimen."@en ;
  dc:created "" .

<http://www.wikidata.org/entity/Q78738>
  owl:sameAs <http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/d5fea488-5786-4106-af90-396ef452c3aa> ;
  owl:sameAs <https://viaf.org/viaf/100383596/> .
</syntaxhighlight>
</div>
</div>
You can convert RDF/XML to TriG or N-Triples statements, back and forth, with command line tools; this is illustrated here with the {{abbr|CLI}} binaries of Apache Jena (https://jena.apache.org/):
<syntaxhighlight lang="bash" style="font-size:smaller;">
#!/bin/bash
########## validate RDF or test conversion into TriG
rdfparse -t -s -R cetafid_123456.rdf
  # parse RDF in test mode (-t), strict (-s: most warnings are errors), assume the RDF is embedded in an XML document (-R)
ntriples --validate cetafid_123456.rdf > cetafid_123456.rdf.ttl.log # or
turtle --validate cetafid_123456.rdf > cetafid_123456.rdf.ttl.log
  # validate the conversion to triples (<subject> <predicate> <object> .); errors go to the log file
########## convert RDF to TriG, n-triples (back and forth)
ntriples --quiet cetafid_123456.rdf > cetafid_123456.rdf.ttl # or
turtle --quiet cetafid_123456.rdf > cetafid_123456.rdf.ttl
  # convert an RDF/XML document to the triples format: <subject> <predicate> <object> .
  # Note that --quiet does not suppress errors
turtle --output=trig cetafid_123456.rdf > cetafid_123456.rdf.trig # not formatted with property prefixes (streams data)
turtle --formatted=trig cetafid_123456.rdf > cetafid_123456.rdf.formatted.trig # formats data with property prefixes (needs more memory)
  # convert an RDF/XML document to the TriG format
turtle --output=trig --compress cetafid_123456.rdf > cetafid_123456.rdf.trig.gz # not formatted with property prefixes (streams data)
turtle --formatted=trig --compress cetafid_123456.rdf > cetafid_123456.rdf.formatted.trig.gz # formats data with property prefixes (needs more memory)
  # convert an RDF/XML document to the TriG format and compress it to gz
turtle --output=rdfxml cetafid_123456.rdf.trig > cetafid_123456.rdf.trig.rdf
  # convert the TriG format back to RDF/XML
</syntaxhighlight>
=== Mistakes or Errors in RDF ===
Some errors break the import of RDF into the {{abbr|SPARQL}} interface and must be fixed beforehand (see also the detailed import issues on [[Import issues with CETAF identifiers]]). Common mistakes or errors are:
: '''Proper XML Encoding''': Follow the XML rules when encoding data into RDF, e.g. the ampersand <code>&amp;</code> must be written as <code>&amp;amp;</code>; if data fields contain tag elements, <code>&lt;</code> or <code>&gt;</code> must be encoded as <code>&amp;lt;</code> or <code>&amp;gt;</code>, and so forth (perhaps use https://www.w3.org/RDF/Validator/ in general, or a software or command line tool that can check it properly)
: '''RDF Data Elements Conforming to [[CSPP]]'''
:* (<s><tt>dc:kindOfMaterial</tt></s>) When reading the [[CSPP|CSPP-elements documentation]] superficially, you might think that a CSPP element is exactly the same as the data element in RDF. Please take care to distinguish the two: the CSPP elements are just for communication purposes and ''are not'' the data elements themselves ;-). For instance, element <tt>kindOfMaterial</tt> shall be mapped to <syntaxhighlight lang="xml" inline><dcterms:type></dcterms:type></syntaxhighlight>, element <tt>collectorName</tt> shall be mapped to <syntaxhighlight lang="xml" inline><dwc:recordedBy></dwc:recordedBy></syntaxhighlight>, etc.; see the table in that documentation.
:* <tt>dc:relation</tt> nesting mistake: it is meant to be used only inside <syntaxhighlight lang="xml" inline><rdf:Description rdf:about="..." ><!-- data --><dc:relation><!-- related rdf:Description nests here --></dc:relation><!-- data --></rdf:Description></syntaxhighlight>
: '''URI Encoding''': Encode {{abbr|URIs}} correctly, e.g. no bare ''spaces'' as in this (technically wrongly encoded) example of the attribute <code>rdf:about</code>:<br /><syntaxhighlight lang="xml" inline><rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/0U 0281519"><!-- … data omitted … --></rdf:Description></syntaxhighlight><br />In a URI, each ''space'' must be encoded as <code>%20</code>:<br /><syntaxhighlight lang="xml" inline><rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/0U%20%200281519"><!-- … data omitted … --></rdf:Description></syntaxhighlight>
:* for further reading, see Section ‘Percent-Encoding’ ([https://tools.ietf.org/html/rfc3986#section-2.1 rfc3986#section-2.1]) in ''Berners-Lee et al.''<ref group="reference" name="Berners-Lee et al. 2005">''Berners-Lee et al.'' 2005. ‘Uniform Resource Identifier (URI): Generic Syntax’. https://tools.ietf.org/html/rfc3986</ref>.
:* for further reading, see the {{abbr|IRI}} specifications ([https://www.w3.org/TR/rdf11-concepts/#section-IRIs ‘3.2. IRIs’ #section-IRIs]) in ''Klyne et al.''<ref group="reference" name="Klyne et.al. 2014">''Klyne et al.'' 2014. In RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation. https://www.w3.org/TR/rdf11-concepts/</ref>.
: '''Unicode / UTF-8 Problems''': Sometimes odd characters cannot be read or encoded as UTF-8 characters, for example:
:* in http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 (the ? marks the position of the odd character) in <syntaxhighlight lang="xml" inline><dwc:municipality>Szirdokpisp?Ki</dwc:municipality></syntaxhighlight>
:* <code>rdfparse</code> reported: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document
:* see [[Import issues with CETAF identifiers#issue Unicode and UTF-8 (MNHN)|Unicode issues of many {{abbr|MNHN}}-CETAF-IDs]]
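The XML and URI encoding rules listed above can be tried out with Python's standard library. The following is only a sketch under assumptions: the variable names are made up for illustration, the specimen identifier is written with a single space, and real export code would normally use an RDF library instead of string formatting.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape, quoteattr

# XML encoding: '&', '<' and '>' in element content must be escaped.
name = "Erysimum salangense Polatschek & Rech.f."
escaped_name = escape(name)
print(f"<dwc:scientificName>{escaped_name}</dwc:scientificName>")

# URI encoding: percent-encode the identifier part before building the URI.
# safe="" makes quote() encode every reserved character in this path segment.
base = "http://data.biodiversitydata.nl/naturalis/specimen/"
specimen_uri = base + quote("0U 0281519", safe="")

# quoteattr() adds the surrounding quotes and escapes the attribute value.
print(f"<rdf:Description rdf:about={quoteattr(specimen_uri)}/>")
```

Only the variable part of the URI is passed to <code>quote()</code>; encoding the complete URI with <code>safe=""</code> would also encode the <code>:</code> and <code>/</code> of the scheme and path.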
==== Unicode Characters not in normal Form C ====
Querying Unicode string data may fail to return the expected characters, because one character can be encoded in several ways (often the warning <tt>String not in Unicode Normal Form C</tt> is given, see https://en.wikipedia.org/wiki/Unicode_equivalence). Example: "Верховинський" in normal form C and "Верховинський" with a decomposed character only ''appear'' identical when reading; they actually use different code point sequences (illustrated here in the JSON representation): <code>'...\u0439'</code> й vs <code>'...\u0438\u0306'</code> й.
'''Best practice:''' If a single (precomposed) Unicode character is available, favour it over the equivalent composed character sequence. There are technical helper functions to account for this, but the problem also shows up in a simple search, and one then wonders why nothing appears to be found.
With Apache Jena one can circumvent this character encoding problem by using <code>fn:normalize-unicode("unicode string")</code>, as the following SPARQL query illustrates:
<syntaxhighlight lang="SPARQL" style="font-size:smaller;">
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
# it means: s = subject; p = predicate; o = object
SELECT ?s ?p ?o
WHERE
{ ?s ?p ?o ;
     dwc:locality ?locality
  FILTER (
    ( ?s = <http://wu.jacq.org/object/WU0107989> )
    && contains(fn:normalize-unicode(?locality), "Верховинський")
  )
}
</syntaxhighlight>
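The difference between the two encodings can also be checked outside of SPARQL, for example before data are exported. The following sketch uses Python's standard <code>unicodedata</code> module; the helper name <code>to_nfc</code> is made up for illustration, and <code>unicodedata.is_normalized</code> requires Python 3.8 or later.

```python
import unicodedata

composed = "\u0439"          # й as one precomposed code point
decomposed = "\u0438\u0306"  # и followed by a combining breve


def to_nfc(text: str) -> str:
    """Return the text in Unicode Normal Form C (precomposed characters)."""
    return unicodedata.normalize("NFC", text)


# Both strings look the same when displayed, but they compare as different.
print(composed == decomposed)

# After normalization to NFC the comparison succeeds.
print(to_nfc(decomposed) == composed)

# Check whether a string is already in Normal Form C.
print(unicodedata.is_normalized("NFC", decomposed))
```

Normalizing all string values to NFC before the RDF is written should avoid the warning mentioned above, although the import tool may still have to cope with data from other sources.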
==== Missing Linkage of RDF Redirect Document to CETAF-ID ====
If you have set up a URL redirection from the CETAF-ID to an RDF document, make sure that this RDF contains a semantic link from the redirected RDF document to the CETAF-ID via <code>dcterms:subject</code><ref group="remark">often also written as <code>dc:subject</code>; according to the prefix definitions of the RDF, both should eventually resolve to <code><nowiki><http://purl.org/dc/terms/></nowiki></code></ref>. To illustrate what is meant, see the following minimal example, written on the left in TriG format<ref group="reference" name="Bizer and Cyganiak 2014" /> and on the right in RDF/XML. The highlighted CETAF-ID is, so to say, a subject (<tt>dcterms:subject</tt>) of the redirect RDF document:
<table class="vertical-align-top" ><!--
--><tr><!--
--><td><syntaxhighlight class="lineno-gray" lang="text" style="font-size:smaller;" line highlight="6,9" >
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234>
  dcterms:subject <https://id.of.CETAF-ID.de/collection_XYZ/1234> ;
  dcterms:title "rdf document for XYZ Collection Specimen XYZ-1234" .

<https://id.of.CETAF-ID.de/collection_XYZ/1234>
  dcterms:title "Specimen XYZ-1234 (XYZ Collection)" ;
  dcterms:created "2013-09-20" ;
  … …
  dwc:scientificName "Geophilus electricus (Linnaeus, 1758)" .
</syntaxhighlight></td><!--
--><td><syntaxhighlight lang="xml" style="font-size:smaller;" highlight="6,9" >
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dwc="http://rs.tdwg.org/dwc/terms/" >
  <rdf:Description rdf:about="https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234">
    <dcterms:subject rdf:resource="https://id.of.CETAF-ID.de/collection_XYZ/1234"/>
    <dcterms:title>rdf document for XYZ Collection Specimen XYZ-1234</dcterms:title>
  </rdf:Description>
  <rdf:Description rdf:about="https://id.of.CETAF-ID.de/collection_XYZ/1234">
    <dcterms:title>Specimen XYZ-1234 (XYZ Collection)</dcterms:title>
    <dcterms:created>2013-09-20</dcterms:created>
    <!-- … -->
    <dwc:scientificName>Geophilus electricus (Linnaeus, 1758)</dwc:scientificName>
  </rdf:Description>
</rdf:RDF>
</syntaxhighlight></td><!--
--></tr><!--
--></table>
Note the following (left side, TriG format):
* line 5 contains the URI of the RDF redirect document itself
* line 6 declares a (related) <tt>dcterms:subject</tt> which is further described from line 9 on; this is the actual CETAF-ID specimen
* from line 9 on, the CETAF-ID specimen is described in detail
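Whether an RDF/XML document contains this linkage can be checked automatically. The sketch below uses only Python's standard XML parser; the function name <code>links_to_cetaf_id</code> and the shortened sample document are assumptions for illustration. It only recognises the plain <code>rdf:Description</code> / <code>rdf:about</code> form shown above, so a full RDF library would be needed for other RDF/XML serialisations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"


def links_to_cetaf_id(rdf_xml: str, document_uri: str, cetaf_id: str) -> bool:
    """True if the description of document_uri has dcterms:subject = cetaf_id."""
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        if description.get(f"{{{RDF_NS}}}about") != document_uri:
            continue  # this block describes another resource
        for subject in description.findall(f"{{{DCTERMS_NS}}}subject"):
            if subject.get(f"{{{RDF_NS}}}resource") == cetaf_id:
                return True
    return False


# Shortened version of the RDF/XML example above.
SAMPLE = """<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234">
    <dcterms:subject rdf:resource="https://id.of.CETAF-ID.de/collection_XYZ/1234"/>
  </rdf:Description>
  <rdf:Description rdf:about="https://id.of.CETAF-ID.de/collection_XYZ/1234">
    <dcterms:title>Specimen XYZ-1234 (XYZ Collection)</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

DOCUMENT_URI = "https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234"
CETAF_ID = "https://id.of.CETAF-ID.de/collection_XYZ/1234"

print(links_to_cetaf_id(SAMPLE, DOCUMENT_URI, CETAF_ID))
```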
== References ==
<references group="reference" />
== Remarks ==
<references group="remark" />
[[Category: Guide for CETAF Stable Identifiers]]
[[Category: Discussion]]
Latest revision as of 13:35, 5 March 2025
Splitting of collection specimens
This part is from Talk:Splitting of collection specimens (Guide best practices) (see on that talk page for perhaps further details):
Q1. What happens to the NSId when a physical specimen is split into parts?
[Alex Hardisty] The original DS and NSId is retained and updated to point to each of the new parts, with a relation (see below). Each new part gets its own DS/NSId. Each new DS is linked back to its parent.
- [Anton Güntsch (BGBM)] This is what we recommend as well. In addition, I think that the original specimen record needs to know its successors (and provide links to them). This might sound redundant but one cannot rely on the presence of the reverse relationship and a performant inference.
Q2. What happens to the NSId after the physical specimen ceases to exist?
[Alex Hardisty] The general approach is that once created a Digital Specimen and its corresponding NSId exists permanently. When the corresponding physical specimen ceases to exist (e.g., because it was destroyed, lost, etc.) change in status should be recorded by the insertion of a new status information element into the Digital Specimen. Possible statuses are: extant, lost, destroyed, split. <any more?>
- [Anton Güntsch (BGBM)] Exactly. Digital records of specimens have to be kept forever in the CMS and get a meaningful status (‘unclear‘ might be an additional one). We need to agree a controlled terminology for this and we need to find an element representing this status. I believe that neither DwC nor ABCD has this already. Will check.
Q3. How do I represent relationships between specimens (e.g., duplicates) in a standardized way?
[Alex Hardisty] What is the list of standard relations that must be supported? isDuplicateOf, isParatypeOf, hasHolotype, …
- [Anton Güntsch (BGBM)] Again, the terminology needs to be agreed/developed.
Common Technical Problems
Issues of particular institutes are listed separately please refer to: Import issues with CETAF identifiers
Redirection and Issuing of RDF vs. Human Readable Page
If I [Andreas Plank] understand it correctly, on CETAF-Level 2—having machine-readable RDF metadata:
… the bare http://our-institution.org/any-specimen/123CETAF-ID
does not need a URL redirect necessarily
- it shall issue a web page to be read by humans (default)
- it shall issue RDF if requested via HTTP Header Accept: 'application/rdf+xml';
… but if …
- … you have implemented a redirect of the original
http://our-institution.org/any-specimen/123CETAF-ID
to any other resource, let’s say http://our-institution.org/any-specimen/rdf/123CETAF-ID or http://our-institution.org/nice-webpage/specimen/bellis-perennis/some-yx456-ID … - … then implement a redirect of HTTP response status code 303 “See Other” instead of the sometimes used 302 code “Found” or “Moved Temporarily” which is hard to tell how the client would interpret the 302 response code.
Developing RDF or Proposal of Lightweight Data File Storage (TriG format)
RDF/XML is complicate to read and perhaps to develop in the mapping of nested data. A more readable approach is using the TriG format[reference 1] and convert it eventually to RDF/XML or to any other needed data format. The TriG format is easy to read, it reads like a sentence which has segmented data elements (semicolons ;) and ending with a dot (.); then comes the next “sentence”. Looking at the minimum example of the CETAF Specimen Preview Profile (CSPP) in TriG format, it gets formatted (of course without line numbers) like:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix dwc: <http://rs.tdwg.org/dwc/terms/> . @prefix dc: <http://purl.org/dc/terms/> . <http://herbarium.bgbm.org/data/rdf/B100068798> dc:subject <http://herbarium.bgbm.org/object/B100068798> ; dc:created "2019-11-11T15:41:25+01:00" . <http://herbarium.bgbm.org/object/B100068798> dc:title "Erysimum salangense Polatschek & Rech.f." ; dc:created "1967-07-14" ; dc:type "Specimen" ; dc:publisher "BGBM" ; dwc:scientificName "Erysimum salangense Polatschek & Rech.f." ; dwc:previousIdentifications "Erysimum salangense Polatschek & Rech.f." ; dwc:family "CRUCIFERAE" ; dwc:countryCode "AF" ; dwc:decimalLongitude "69.033332824707" ; dwc:decimalLatitude "35.366664886475" ; dwc:recordedBy "Rechinger,K.H." ; dwc:fieldNumber "37047" ; dwc:associatedMedia <http://ww2.bgbm.org/herbarium/images/B/10/00/68/79/B_10_0068798.jpg> .One can see that line 6 declares a dc:subject which is then defined from line 9 on, this becomes nested automatically later when conversion to RDF/XML is done. Lines 5 to 7 explain the very RDF document, because it is delivered under a different URI then the client has requested it (he requested the actual CETAF-ID URI (line 9) and got redirected to this document). Lines 9 and following explain the CETAF-ID URI which explain the actual herbarium specimen.
Going further and enrich more data to it, then the TriG format becomes more nested (you should see the example by click “expand“ on the right)
Line numbers are added to illustrate the relations
- line 8 describes a dc:subject which is further detailed from line 11 on;
- from line 11 on it contains in line 36 a wiki base entry (dwciri:recordedBy) that itself has details stated from line 48 on (and so forth also with the triple iiif for additional media) …
Note: thedwciri:recordedBy
has the same meaning as dwc:recordedBy, but as an RDF predicate dwciri:recordedBy is intended to be repeatable and have an IRI-reference object
- from line 11 on it contains in line 36 a wiki base entry (dwciri:recordedBy) that itself has details stated from line 48 on (and so forth also with the triple iiif for additional media) …
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
<http://herbarium.bgbm.org/data/rdf/B100068798>
dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
dc:created "2019-11-11T15:41:25+01:00" .
<http://herbarium.bgbm.org/object/B100068798>
dc:title "Erysimum salangense Polatschek & Rech.f." ;
dc:description "A herbarium specimen of Erysimum salangense Polatschek & Rech.f. collected by Rechinger,K.H." ;
dc:creator "Rechinger, K.H." ;
dc:created "1967-07-14" ;
dc:type "Specimen" ;
dc:publisher "BGBM" ;
dwc:materialSampleID "http://herbarium.bgbm.org/object/B100068798" ;
dwc:basisOfRecord "PreservedSpecimen" ;
dwc:collectionCode "B" ;
dwc:catalogNumber "B 10 0068798" ;
dwc:scientificName "Erysimum salangense Polatschek & Rech.f." ;
dwc:previousIdentifications "Erysimum salangense Polatschek & Rech.f." ;
dwc:family "CRUCIFERAE" ;
dwc:genus "Erysimum" ;
dwc:specificEpithet "salangense" ;
dwc:country "Afghanistan" ;
dwc:countryCode "AF" ;
dwc:locality "\n Afghanistan: NE-Afghanistan, Kathagan. Sar-i Hauz, in declivibus borealibus jugi Salang. substr. granit. Alt.: 2600 m. 14.07.1967, Leg.: K. H. Rechinger 37047.\n " ;
dwc:decimalLongitude "69.033332824707" ;
dwc:decimalLatitude "35.366664886475" ;
dwc:eventDate "1967-07-14" ;
dwc:recordNumber "37047" ;
dwc:recordedBy "Rechinger,K.H." ;
dwc:fieldNumber "37047" ;
dwciri:recordedBy <http://www.wikidata.org/entity/Q78738> ;
dwc:associatedMedia <http://ww2.bgbm.org/herbarium/images/B/10/00/68/79/B_10_0068798.jpg> ;
dc:relation <http://herbarium.bgbm.org/iiif/B100068798> .
<http://herbarium.bgbm.org/iiif/B100068798>
dc:identifier <http://herbarium.bgbm.org/iiif/B100068798> ;
dc:type <http://iiif.io/api/presentation/3#Manifest> ;
dc:subject <http://herbarium.bgbm.org/object/B100068798> ;
dc:format "application/ld+json" ;
dc:description "A IIIF resource for this specimen."@en ;
dc:created "" .
<http://www.wikidata.org/entity/Q78738>
owl:sameAs <http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/d5fea488-5786-4106-af90-396ef452c3aa> ;
owl:sameAs <https://viaf.org/viaf/100383596/> .
Of course you can convert RDF/XML to TriG or n-Triple statements back and forth by using some command line tools; it is illustrated here by applying CLI binaries of Apache Jena (https://jena.apache.org/):
#!/bin/bash
########## validate RDF or test conversion into TriG
rdfparse -t -s -R cetafid_123456.rdf
# parse RDF in test mode (-t), strict (-s most warnings are errors), assume RDF embedded XML document (-R)
ntriples --validate cetafid_123456.rdf > cetafid_123456.rdf.ttl.log # or
turtle --validate cetafid_123456.rdf > cetafid_123456.rdf.ttl.log
# validate conversion to triples: <subject> <predicate> <object>. errors to log file
########## convert RDF to TriG, n-triples (back and forth)
ntriples --quiet cetafid_123456.rdf > cetafid_123456.rdf.ttl # or
turtle --quiet cetafid_123456.rdf > cetafid_123456.rdf.ttl
# convert to triples format: <subject> <predicate> <object>. based on a RDF/XML document
# Note that --quiet does not suppresses errors
turtle --output=trig cetafid_123456.rdf > cetafid_123456.rdf.trig # not formatted with property prefixes (streams data)
turtle --formatted=trig cetafid_123456.rdf > cetafid_123456.rdf.formatted.trig # formats data with property prefixes (needs more memory)
# convert to TriG format based on a RDF/XML document
turtle --output=trig --compress cetafid_123456.rdf > cetafid_123456.rdf.trig.gz # not formatted with property prefixes (streams data)
turtle --formatted=trig --compress cetafid_123456.rdf > cetafid_123456.rdf.formatted.trig.gz # formats data with property prefixes (needs more memory)
# convert to TriG format based on a RDF/XML document and compress it to gz
turtle --output=rdfxml cetafid_123456.rdf.trig > cetafid_123456.rdf.trig.rdf
# convert back to RDF/XML based on TriG format
Mistakes or Errors in RDF
For importing RDF to the SPARQL interface there are some errors that break the import process and must be fixed beforehand (see also detailed import issues on Import issues with CETAF identifiers). Common mistakes or errors are:
- Proper XML Encoding—Make sure to follow the XML rules to encode data into RDF, e.g. the ampersand
&
must be&
; or if data fields contain tag-elements the<
or>
must be encoded as<
or>
and so forth (perhaps use https://www.w3.org/RDF/Validator/ in general or a software, command line tool that can check it properly)
- RDF Data Elements Conforming to CSPP
  - (dc:kindOfMaterial) When reading the CSPP-elements documentation superficially, you might think that a CSPP element is exactly the same as the data element in RDF. Please take care to distinguish the two: the CSPP elements are for communication purposes only and are not the data elements themselves. For instance, element kindOfMaterial shall be mapped into <dcterms:type></dcterms:type>, and element collectorName shall be mapped into <dwc:recordedBy></dwc:recordedBy>, etc.; see the corresponding table of that documentation.
  - dc:relation nesting mistake: it is meant to be used only inside <rdf:Description rdf:about="..." ><!-- data --><dc:relation><!-- related rdf:Description nests here --></dc:relation><!-- data --></rdf:Description>
- URI Encoding: Encode URIs correctly, e.g. no bare spaces as in this (technically wrongly encoded) example of the attribute rdf:about:
  <rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/0U  0281519"><!-- … data omitted … --></rdf:Description>
  In a properly encoded URI, each space must be written as %20:
  <rdf:Description rdf:about="http://data.biodiversitydata.nl/naturalis/specimen/0U%20%200281519"><!-- … data omitted … --></rdf:Description>
  - for further reading see Section ‘Percent-Encoding’ (rfc3986#section-2.1) in Berners-Lee et al.[reference 2].
  - for further reading see the IRI specifications (‘3.2. IRIs’ #section-IRIs) in Klyne et al.[reference 3].
- Unicode / UTF-8 Problems: Sometimes odd characters cannot be read or encoded as UTF-8 characters, for example:
  - in http://coldb.mnhn.fr/catalognumber/mnhn/f/dac98.2 (the ? marks where the odd character is) in <dwc:municipality>Szirdokpisp?Ki</dwc:municipality>, rdfparse found: [line: …, col: 34] An invalid XML character (Unicode: 0x19) was found in the element content of the document
  - see also Issues of Unicode of many MNHN-CETAF-IDs
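The encoding rules listed above can be checked before RDF is published. The following is only a minimal sketch in Python using the standard library; the function names (xml_text, find_invalid_xml_chars, specimen_uri) and the sample value "Smith & Jones <leg.>" are made up for illustration and are not part of any CETAF tooling.

```python
import re
from urllib.parse import quote
from xml.sax.saxutils import escape

# Characters that XML 1.0 does not allow in a document: everything outside
# Tab, LF, CR and the code point ranges listed in the XML 1.0 specification.
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_text(value):
    """Escape &, < and > so the value can be used as XML element content."""
    return escape(value)


def find_invalid_xml_chars(value):
    """Return (position, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group())))
            for m in INVALID_XML_CHARS.finditer(value)]


def specimen_uri(base, local_id):
    """Percent-encode a local identifier and append it to a base URI.

    Assumption: local_id is a single path segment, so no character
    (not even "/") is left unencoded.
    """
    return base + quote(local_id, safe="")


if __name__ == "__main__":
    print(xml_text("Smith & Jones <leg.>"))
    print(specimen_uri(
        "http://data.biodiversitydata.nl/naturalis/specimen/", "0U  0281519"))
    print(find_invalid_xml_chars("Szirdokpisp\x19Ki"))
```

Running such checks on every data field before export would catch the ampersand, space, and control character problems described above; it does not replace a full RDF validation.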
Unicode Characters Not in Normal Form C
Querying Unicode string data may fail to return the expected characters, because one and the same character can be encoded in several ways (often the warning is given: String not in Unicode Normal Form C, see https://en.wikipedia.org/wiki/Unicode_equivalence). Example: "Верховинський" written in normal form C is not the same string as "Верховинський" written with a decomposed last letter; the two only appear identical when reading, but different encodings are used. The JSON representation illustrates it: '...\u0439' (й as a single character) vs '...\u0438\u0306' (й composed of и and a combining breve).
Best practice: If a single Unicode character is available, favour it over the composed character equivalent. There are technical helper functions to account for this, but the problem also shows up in a simple search, and one wonders why nothing appears to be found.
Using Apache Jena one can circumvent this character encoding problem with the function fn:normalize-unicode("unicode string"); the following SPARQL query illustrates it:
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
# it means: s = subject; p = predicate; o = object
SELECT ?s ?p ?o
WHERE
  { ?s ?p ?o ;
       dwc:locality ?locality
    FILTER (
      ( ?s = <http://wu.jacq.org/object/WU0107989> )
      && contains(fn:normalize-unicode(?locality), "Верховинський")
    )
  }
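Outside of SPARQL, the same normalization can be applied when the data is prepared, so that strings are already stored in Normal Form C. The following is a minimal sketch in Python using the standard unicodedata module; the helper names to_nfc and is_nfc are made up for illustration.

```python
import unicodedata


def to_nfc(text):
    """Return the text in Unicode Normal Form C (composed characters)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text):
    """True if the text is already in Normal Form C."""
    return text == unicodedata.normalize("NFC", text)


# The same place name, ending once in the single character U+0439 and once
# in U+0438 followed by the combining breve U+0306.
composed = "Верховинськи\u0439"
decomposed = "Верховинськи\u0438\u0306"

if __name__ == "__main__":
    print(composed == decomposed)          # the raw strings differ
    print(to_nfc(decomposed) == composed)  # they agree after normalization
    print(is_nfc(composed), is_nfc(decomposed))
```

Normalizing once at export time avoids having to wrap every query in a normalization function later.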
Missing Linkage of RDF Redirect Document to CETAF-ID
If you have set up a URL redirection to the CETAF-ID RDF, make sure that the describing RDF contains a semantic link from the redirected RDF document to the CETAF-ID RDF via dcterms:subject[remark 1]. The following minimal example illustrates what is meant, first in TriG format[reference 1] and then in RDF/XML; the CETAF-ID is, so to speak, a subject (dcterms:subject) of the redirect RDF document:
TriG format:
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234>
    dcterms:subject <https://id.of.CETAF-ID.de/collection_XYZ/1234> ;
    dcterms:title "rdf document for XYZ Collection Specimen XYZ-1234" .

<https://id.of.CETAF-ID.de/collection_XYZ/1234>
    dcterms:title "Specimen XYZ-1234 (XYZ Collection)" ;
    dcterms:created "2013-9-20" ;
    … …
    dwc:scientificName "Geophilus electricus (Linnaeus, 1758)" .
RDF/XML:
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dwc="http://rs.tdwg.org/dwc/terms/" >
  <rdf:Description rdf:about="https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234">
    <dcterms:subject rdf:resource="https://id.of.CETAF-ID.de/collection_XYZ/1234"/>
    <dcterms:title>rdf document for XYZ Collection Specimen XYZ-1234</dcterms:title>
  </rdf:Description>
  <rdf:Description rdf:about="https://id.of.CETAF-ID.de/collection_XYZ/1234">
    <dcterms:title>Specimen XYZ-1234 (XYZ Collection)</dcterms:title>
    <dcterms:created>2013-9-20</dcterms:created>
    <!-- … -->
    <dwc:scientificName>Geophilus electricus (Linnaeus, 1758)</dwc:scientificName>
  </rdf:Description>
</rdf:RDF>
Note the following (TriG format):
- line 5 contains the URI of the RDF redirect document itself (<https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234>)
- line 6 states a (related) dcterms:subject, which is further described from line 9 on; this is the actual CETAF-ID specimen
- from line 9 on (<https://id.of.CETAF-ID.de/collection_XYZ/1234>) the CETAF-ID specimen is described in detail
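Whether a redirect RDF document carries this linkage can be checked automatically. The following is only a minimal sketch in Python using the standard library XML parser; it assumes the plain rdf:Description / rdf:about form of RDF/XML shown in the example and does not cover other RDF/XML syntax variants (a full RDF parser would be needed for those). The function name linked_cetaf_ids and the sample constant are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Reduced version of the example document (placeholder URIs as in the example).
SAMPLE_RDF_XML = """<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234">
    <dcterms:subject rdf:resource="https://id.of.CETAF-ID.de/collection_XYZ/1234"/>
    <dcterms:title>rdf document for XYZ Collection Specimen XYZ-1234</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""


def linked_cetaf_ids(rdf_xml, document_uri):
    """Return the dcterms:subject targets stated for the given document URI."""
    root = ET.fromstring(rdf_xml)
    targets = []
    for description in root.iter("{%s}Description" % RDF_NS):
        if description.get("{%s}about" % RDF_NS) != document_uri:
            continue
        for subject in description.findall("{%s}subject" % DCTERMS_NS):
            resource = subject.get("{%s}resource" % RDF_NS)
            if resource:
                targets.append(resource)
    return targets


if __name__ == "__main__":
    redirect = "https://redirect.of.CETAF-ID.de/whatever-path/collection_XYZ-1234"
    links = linked_cetaf_ids(SAMPLE_RDF_XML, redirect)
    # An empty list means the linkage to the CETAF-ID is missing.
    print(links if links else "missing dcterms:subject linkage")
```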
References
- [reference 1] Bizer, C. and Cyganiak, R. 2014. ‘RDF 1.1 TriG: RDF Dataset Language. W3C Recommendation 25 February 2014’. Edited by Gavin Carothers and Andy Seaborne. https://www.w3.org/TR/trig/
- [reference 2] Berners-Lee et al. 2005. ‘Uniform Resource Identifier (URI): Generic Syntax’. https://tools.ietf.org/html/rfc3986
- [reference 3] Klyne et al. 2014. ‘RDF 1.1 Concepts and Abstract Syntax’. W3C Recommendation. https://www.w3.org/TR/rdf11-concepts/
Remarks
- [remark 1] often also written as dc:subject; according to the RDF’s prefix definition, both should eventually resolve to <http://purl.org/dc/terms/>