IDs and LOD Discussion: Difference between revisions
| bwf>Simon Chagnoux No edit summary | bwf>Andreas Plank  | ||
Latest revision as of 12:07, 15 June 2020
In 2016 and 2017, the ISTC decided that improving LOD capabilities of CETAF Stable Identifiers for collection objects should become a priority. This involves primarily
- activities for improving links from collection metadata to external resources and concepts and
- implementation of a working CETAF collection data index prototype as a basis for advanced inference mechanisms.
Ideas, discussions, and outcomes linked to these targets will be documented on this page. Please feel free to add your thoughts / comments / results below. More information about the CETAF identifier initiative is available on the main wikipage.
Improving links to external resources
Linking to people
[BGBM]: We started discussing how to enrich our RDF metadata and concluded that we will begin by looking more closely at collectors. Our first step will be to export collector names and collector IDs from our herbarium management system (JACQ) and sort them by frequency of use. We will then set up a spreadsheet with columns for ...
- collector name
- local collector ID BGBM
- link to example specimen(s)
- external resource: wikidata
- external resource: HUH
- external resource VIAF
- problem flag
... and ask a student assistant to search for collectors in wikidata / HUH / VIAF and enter the (URI) identifiers.
By starting with frequent collectors we hope to achieve wide coverage with reasonable effort. It would be great if other herbaria could also start contributing to this spreadsheet. In this case we would probably just have to add more fields for local collector IDs.
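The spreadsheet preparation described above could be sketched as follows. This is only an illustration under assumptions: the input is taken to be a list of (collector name, local collector ID, specimen URI) tuples, the column spellings are invented, and `specimen_count` is an extra column added here to make the frequency ordering visible.

```python
import csv
import io
from collections import Counter

# Column names follow the list in the text; the spellings and the extra
# "specimen_count" column are illustrative assumptions.
COLUMNS = [
    "collector_name", "local_collector_id_bgbm", "specimen_count",
    "example_specimen", "wikidata", "huh", "viaf", "problem_flag",
]


def build_worksheet(records):
    """Build one row per collector, most frequently used collectors first.

    records: iterable of (collector_name, collector_id, specimen_uri) tuples.
    """
    counts = Counter()
    collector_ids = {}
    examples = {}
    for name, collector_id, specimen_uri in records:
        counts[name] += 1
        collector_ids.setdefault(name, collector_id)  # keep the first ID seen
        examples.setdefault(name, specimen_uri)       # keep the first example
    rows = []
    for name, count in counts.most_common():
        rows.append({
            "collector_name": name,
            "local_collector_id_bgbm": collector_ids[name],
            "specimen_count": count,
            "example_specimen": examples[name],
            # Left empty: these are filled in by hand by the student assistant.
            "wikidata": "", "huh": "", "viaf": "", "problem_flag": "",
        })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Other herbaria could be accommodated by adding one local-ID column per institution to `COLUMNS`.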
The same workflow could be implemented for geographic features.
[MNHN] For the difficult task of disambiguating homonyms, David Shorthouse proposed at TDWG2017 [1] (https://doi.org/10.3897/tdwgproceedings.1.19829) to use co-collectors as a heuristic.
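One possible reading of the co-collector idea is sketched below. It is not taken from the cited proposal: the Jaccard overlap, the threshold value, and the data layout (candidate person IDs mapped to sets of known co-collector names) are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap of two name collections: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_candidate(co_collectors, candidates, threshold=0.2):
    """Pick the candidate whose known co-collectors best match the specimen's.

    co_collectors: names recorded alongside the ambiguous collector name.
    candidates: dict mapping a candidate person ID to a set of known co-collectors.
    Returns a candidate ID, or None when the evidence is weak or tied.
    """
    scored = sorted(
        ((jaccard(co_collectors, known), person_id)
         for person_id, known in candidates.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < threshold:
        return None  # no candidate shares enough co-collectors
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # two candidates fit equally well: leave it to a human
    return scored[0][1]
```

Returning `None` for weak or tied matches mirrors the advice given to the student assistant to skip ambiguous mappings.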
BGBM Results
As a basis for semantic annotation of collectors and collector teams we exported 201685 specimen records from our local collection database (we have more records in JACQ, but wanted to start with something local). We then extracted collector names from the table and imported them into an Excel file sorted by the number of specimens mentioning the respective collector. We also added links to example specimens for each record.
We then asked a student to go through the list and add links to collector pages in WikiData, HUH, and VIAF. We also advised her to skip records if the mapping to a semantic resource seemed ambiguous or too difficult to find. Within one person-week we achieved the following results:
- 1082 collector names linked up to semantic resources (WikiData, HUH, or VIAF)
- 71320 specimens where all collectors have been linked to a semantic resource
- 63417 specimens with at least one collector linked to a semantic resource
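Counts of this kind could be derived roughly as in the sketch below. The data layout (a dict from specimen ID to its list of collector names, plus a set of linked names) is an assumption. Since the text does not state whether "at least one" is meant to include the fully linked specimens, the sketch reports both the inclusive count and the partial (some but not all) count.

```python
def coverage(specimens, linked):
    """Count specimens by how many of their collectors are linked.

    specimens: dict mapping specimen ID -> list of collector names.
    linked: set of collector names that have a link to a semantic resource.
    """
    all_linked = 0
    any_linked = 0
    for collectors in specimens.values():
        flags = [name in linked for name in collectors]
        if flags and all(flags):
            all_linked += 1
        if any(flags):
            any_linked += 1
    return {
        "total": len(specimens),
        "all_linked": all_linked,
        "at_least_one_linked": any_linked,           # inclusive reading
        "partially_linked": any_linked - all_linked,  # some, but not all
    }
```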
Linking to institutions
The stable identifiers include a domain name that is the property of the institution responsible for the specimens. The first identifier workshops therefore expressed no need to link to the institution. But as this kind of "implicit link" is not usable by machines on the Web of Data, an explicit link could be worth adding.
The current version of CSPP only links a specimen to a web page, which could be an institutional website. Adding an http://rs.tdwg.org/dwc/terms/#institutionID attribute could be useful.
For the values of this attribute we have two candidates:
1) http://grbio.org/
- http://biocol.org/urn:lsid:biocol.org:col:34988 for MNHN Paris
- http://biocol.org/urn:lsid:biocol.org:col:15605 for Botanic Garden Meise
- Pro: community-managed, CETAF- and TDWG-related
- Con: URI does not resolve, and there seems to be no data attached
2) https://www.wikidata.org/
- https://www.wikidata.org/wiki/Q838691 for MNHN Paris
- https://www.wikidata.org/wiki/Q3052500 for Botanic Garden Meise
- Pro: a lot of related information; URI resolves (HTML only?)
- Con: out of date in the case of Meise, but editable
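To make the explicit link concrete, the sketch below emits an N-Triples statement connecting a specimen to its institution, using the Wikidata candidates listed above as values. The specimen URI in the usage example is hypothetical, and the property is written as the Darwin Core term IRI http://rs.tdwg.org/dwc/terms/institutionID (without the `#`). Whether CSPP would adopt exactly this form is an open question.

```python
# Darwin Core term IRI for institutionID.
DWC_INSTITUTION_ID = "http://rs.tdwg.org/dwc/terms/institutionID"

# Candidate values taken from the Wikidata list above.
INSTITUTION_URIS = {
    "MNHN Paris": "https://www.wikidata.org/wiki/Q838691",
    "Botanic Garden Meise": "https://www.wikidata.org/wiki/Q3052500",
}


def institution_triple(specimen_uri, institution):
    """Return one N-Triples line linking a specimen to its institution."""
    institution_uri = INSTITUTION_URIS[institution]
    return f"<{specimen_uri}> <{DWC_INSTITUTION_ID}> <{institution_uri}> ."


# Usage with a made-up specimen URI:
print(institution_triple("http://example.org/specimen/123", "MNHN Paris"))
```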
CETAF collection data index
[BGBM]: As a first step, we created a list of CETAF identifiers found in GBIF.
As of October 17th, 2017, the 13 institutions listed on http://cetaf.org/cetaf-stable-identifiers shared
- 33,177,510 occurrences with GBIF, of which
- 30,679,787 used a GUID (http://rs.tdwg.org/dwc/terms/occurrenceID), of which
- 22,040,872 are HTTP URIs starting with http://,
- 21,812,600 URIs conform with the base URLs listed on http://cetaf.org/cetaf-stable-identifiers.
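The nested counts above amount to a simple filter cascade over occurrenceID values. A minimal sketch, assuming the IDs are available as a list of strings (or `None` where missing) and that the base URLs have been copied by hand from the CETAF page; the URLs in the usage note are placeholders:

```python
def classify_occurrence_ids(occurrence_ids, base_urls):
    """Count occurrences at each stage of the cascade described in the text."""
    prefixes = tuple(base_urls)
    counts = {"occurrences": 0, "with_guid": 0, "http_uri": 0, "cetaf_conformant": 0}
    for occurrence_id in occurrence_ids:
        counts["occurrences"] += 1
        if not occurrence_id:
            continue  # no GUID supplied at all
        counts["with_guid"] += 1
        if not occurrence_id.startswith("http://"):
            continue  # a GUID, but not an HTTP URI
        counts["http_uri"] += 1
        if occurrence_id.startswith(prefixes):
            counts["cetaf_conformant"] += 1  # matches a registered base URL
    return counts
```

For example, calling it with `["urn:catalog:X:1", "http://herbarium.example.org/object/B1"]` and the single base URL `"http://herbarium.example.org/object/"` counts two GUIDs, one HTTP URI, and one conformant URI.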
Several CETAF organisations have implemented IDs but not the redirection function. Others do have redirection but do not return RDF. This complies with the level 1/2/3 specification of CETAF URIs but should definitely be harmonised. We should also continue to work on a preferred target (RDF) format (perhaps as a level 4).
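The three situations described above could be told apart automatically once an identifier has been requested with an RDF `Accept` header. The sketch below only classifies the observed outcome; it performs no HTTP requests itself, the labels are descriptive rather than the official CETAF level numbers, and the list of RDF media types is an assumption about what would count as "returning RDF".

```python
# Media types treated as RDF in this sketch (an assumption, not a CETAF rule).
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "application/ld+json",
}


def classify_implementation(redirected, final_content_type):
    """Classify what an identifier did when RDF was requested.

    redirected: True if the server answered the identifier with a redirect.
    final_content_type: Content-Type header of the final response, or None.
    """
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = (final_content_type or "").split(";")[0].strip().lower()
    if not redirected:
        return "identifier only"
    if media_type in RDF_MEDIA_TYPES:
        return "redirect with RDF"
    return "redirect without RDF"
```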
We also found that several implementations apparently did not map the new IDs in their GBIF provider software installations, so the IDs are not visible on the GBIF portal. This should be fixed.
Based on the assessment of existing CETAF ID implementations, we investigated potential platforms for a central catalogue of CETAF IDs. Options would be a simple (MySQL) database or some kind of advanced RDF triple store. A good compromise seems to be Blazegraph, which is fairly easy to use and at the same time a capable triple store (https://www.blazegraph.com/).
Our experiments with Blazegraph showed that harvesting from large providers (e.g. RBGE with roughly 1M records) works well.
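Loading harvested records into a triple store in batches might look like the sketch below. It does not describe the setup actually used: the endpoint URL is an assumed default for a local Blazegraph installation, the batch size is arbitrary, and the code only builds the HTTP requests (posting RDF with a `text/turtle` content type to the SPARQL endpoint) without sending them.

```python
import urllib.request
from itertools import islice

# Assumed endpoint of a local Blazegraph namespace; adjust to the real setup.
ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def build_upload_request(triple_lines, endpoint=ENDPOINT):
    """Build (but do not send) a POST request carrying triples as Turtle.

    N-Triples lines are also valid Turtle, so they can be joined as they are.
    """
    body = "\n".join(triple_lines).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )


def upload_requests(triple_lines, batch_size=10000):
    """One request per batch, so a large harvest is not sent in one go."""
    return [build_upload_request(chunk) for chunk in batches(triple_lines, batch_size)]
```

Each request could then be sent with `urllib.request.urlopen(request)`.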
The situation at the BR herbarium, Meise
We have linked our top 900 collectors to HUH. This was done manually to ensure that the biographical details in our database matched those in HUH. In the process we identified about 230 collectors that were not in HUH. We have since given details of these collectors to HUH so that they can improve their data and we can complete the links for these additional collectors. We are currently digitising a very large number of specimens (>1,000,000), so the number of collectors will increase and their frequencies will change. We will therefore do more linking once these data are available.
Our new specimen portal [2] has stable identifiers and provides a machine-readable RDF version of each specimen. This RDF contains the link to the HUH database.
...
<rdf:Description rdf:about="Glaziou A.">
  <owl:sameAs rdf:resource="http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/0832e613-7879-4f72-89f9-78e55c6ac1a9"/>
  <dwc:recordedBy>Glaziou A.</dwc:recordedBy>
</rdf:Description>
...
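A consumer could extract the collector link from such RDF as sketched below. The excerpt above omits the enclosing element, so the `rdf:RDF` wrapper and the namespace declarations in `SAMPLE` are assumptions added to make the fragment parseable; the namespace IRIs themselves are the standard RDF, OWL, and Darwin Core ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}
# ElementTree addresses namespaced attributes as "{namespace}localname".
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# The excerpt from the text, wrapped in an assumed rdf:RDF root element.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <rdf:Description rdf:about="Glaziou A.">
    <owl:sameAs rdf:resource="http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/0832e613-7879-4f72-89f9-78e55c6ac1a9"/>
    <dwc:recordedBy>Glaziou A.</dwc:recordedBy>
  </rdf:Description>
</rdf:RDF>"""


def collector_links(rdf_xml):
    """Map each recorded collector name to its owl:sameAs target URIs."""
    root = ET.fromstring(rdf_xml)
    links = {}
    for description in root.findall("rdf:Description", NS):
        name = description.findtext("dwc:recordedBy", namespaces=NS)
        targets = [el.get(RDF_RESOURCE) for el in description.findall("owl:sameAs", NS)]
        if name and targets:
            links[name] = targets
    return links


print(collector_links(SAMPLE))
```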
[Anton Güntsch (BGBM)]: Once we have completed our top (say) 500 collectors we would be very interested in organising a shared list of collectors with links to external resources. For example, our list will have the HUH ID and also IDs for VIAF and WikiData. Meise could then easily retrieve VIAF and WikiData IDs using the HUH IDs.
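Retrieving VIAF and WikiData IDs through the HUH ID is essentially a join on that shared key. A minimal sketch, assuming both lists are available as rows with `huh`, `viaf`, and `wikidata` fields (hypothetical field names):

```python
def enrich_by_huh(local_rows, shared_rows):
    """Add VIAF and WikiData IDs to local rows by matching on the HUH ID.

    local_rows: list of dicts that contain at least a "huh" key.
    shared_rows: list of dicts with "huh", "viaf" and "wikidata" keys.
    """
    by_huh = {row["huh"]: row for row in shared_rows if row.get("huh")}
    enriched = []
    for row in local_rows:
        match = by_huh.get(row.get("huh"), {})
        enriched.append({
            **row,
            "viaf": match.get("viaf"),          # None when there is no match
            "wikidata": match.get("wikidata"),
        })
    return enriched
```

Rows without a matching HUH ID keep `None` in both fields, so they remain easy to spot for manual follow-up.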


