suggest to add reference to Elliott et al. 2020 #84
Comments
Thanks @jhpoelen for the pointer. I hadn't caught that reference earlier. I just read it; nice paper. I agree with the stance the paper takes and think we should cite it. Nevertheless, it also raised some points that would be nice to clarify. See below if you are interested.

TL;DR: I agree contentid approaches are great and would improve identifier systems a lot. I'm not sure the paper accounted for the DataONE PID (Persistent Identifier) system and its use of a resolver service to find the ephemeral URI locations for persistently identified, immutable objects in the DataONE network.

Long ramble about DataONE Persistent Identifiers and Resolver service

I'd love to discuss some of the nuances of the assertions and conclusions in that paper. In DataONE, we worked hard to provide a means for location-independent identifiers, and to ensure that the identifier system in DataONE accommodates versioning and provenance relationships, even though many of the contributing repositories don't provide that information. As an aggregator, we have only a limited degree of influence over what the data providers do with the content they hold. Our recommended stance is that, regardless of the identifier system used, repositories should 1) mint a location-agnostic identifier (PID) for their content, which we treat as an opaque reference to a checksum-immutable object; 2) provide versioning and provenance relationships linking these immutable objects; and 3) replicate the objects across multiple repositories in the network for both backup and high availability.

The DataONE resolver service can then provide the current locations for any given object in the network identified by a PID. At no point should people rely upon historically cached URIs for those locations, as those are subject to rapid change. In other words, the service URIs for accessing DataONE registered objects are ephemeral, but the object identifiers are persistent.
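To make the "persistent PID, ephemeral URL" split concrete, here is a minimal sketch of how a client might build the resolver request from a PID instead of caching an object URL. The endpoint is the one used in the curl calls in this thread; the helper name `resolve_url` is hypothetical, and percent-encoding DOI-style PIDs this way is an assumption based on the encoded form in the search.dataone.org link, not a confirmed requirement of the resolver.

```python
from urllib.parse import quote

# Resolver endpoint as it appears in the curl examples in this thread.
RESOLVE_BASE = "https://cn.dataone.org/cn/v2/resolve/"


def resolve_url(pid: str) -> str:
    """Build a resolver URL for a PID (hypothetical helper).

    safe="" makes quote() encode ":" and "/" as well, so DOI-style
    PIDs become a single path segment.
    """
    return RESOLVE_BASE + quote(pid, safe="")


# Store the PID; derive the lookup URL only when you need the object.
print(resolve_url("ess-dive-0d52dba18c3904f-20220125T193640771"))
print(resolve_url("doi:10.15485/1842334"))
```

The point of the design is that only the PID is recorded; the object locations are looked up fresh each time.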
In your tests, did you use the resolver service, and did you try the multiple replica locations for a given object PID when determining whether it is available?

Another point concerns the concept of a "Dataset", in the "dcat:Dataset" sense. As in DCAT, in DataONE we use "Dataset" as synonymous with "Data Package", which represents an aggregation (in the ORE sense) of individually identified digital objects that forms a scientifically useful (and citable) collection. Most repositories provide DOI identifiers for these data packages, rather than at the individual object level. So, when you are looking at persistence in your study, were you looking at "Dataset" persistence? And how did you account for the idea that any given dataset might be composed of dozens (or hundreds of thousands) of individually identifiable digital objects, each with its own unique hash checksum and varying levels of persistence?

In your paper, I did not understand what the contentid would be for a "Dataset" such as the one that is viewable on DataONE here: https://search.dataone.org/view/doi%3A10.15485%2F1842334

That dataset consists of many digital files, each with its own persistent identifier and checksum. The metadata file in that Dataset has its own PID, which resolves as follows:
$ curl https://cn.dataone.org/cn/v2/resolve/ess-dive-0d52dba18c3904f-20220125T193640771
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
<identifier>ess-dive-0d52dba18c3904f-20220125T193640771</identifier>
<objectLocation>
<nodeIdentifier>urn:node:CN</nodeIdentifier>
<baseURL>https://cn.dataone.org/cn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://cn.dataone.org/cn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:KNB</nodeIdentifier>
<baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:ESS_DIVE</nodeIdentifier>
<baseURL>https://data.ess-dive.lbl.gov/catalog/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:UIC</nodeIdentifier>
<baseURL>https://dataone.lib.uic.edu/metacat/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://dataone.lib.uic.edu/metacat/d1/mn/v2/object/ess-dive-0d52dba18c3904f-20220125T193640771</url>
</objectLocation>
</ns2:objectLocationList>
In contrast, the CSV data object in that data package has its own PID, which resolves as follows:
$ curl https://cn.dataone.org/cn/v2/resolve/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
<identifier>ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</identifier>
<objectLocation>
<nodeIdentifier>urn:node:KNB</nodeIdentifier>
<baseURL>https://knb.ecoinformatics.org/knb/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:ESS_DIVE</nodeIdentifier>
<baseURL>https://data.ess-dive.lbl.gov/catalog/d1/mn</baseURL>
<version>v1</version>
<version>v2</version>
<url>https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</url>
</objectLocation>
</ns2:objectLocationList>
Did you check all of these locations when you were assessing the reliability of the PIDs? And did you account for the fact that, from day to day, any of the service URLs are ephemeral in DataONE, but the PIDs are persistent? How does this affect the stability percentages you reported?

I would argue that the best form for these PIDs would be as contentid hashes (rather than UUIDs or any of the other formats typically in use in the network). In that case, I think the DataONE network closely matches the design of hash-archive and similar systems. I'm fully aligned with your conclusions about the utility of contentids as one of the best formats for persistent identifiers of digital objects, but I think there are still barriers to deploying them at scale across a highly heterogeneous network of repositories that have widely varying views on the utility of immutability. But I'm looking forward to trying.
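One way to picture "checking all of these locations" is the sketch below: parse the `objectLocationList` response, try each replica in turn, and hash whatever bytes come back. It is only an illustration under stated assumptions. The XML is a trimmed copy of the CSV-object response quoted above, `fake_fetch` is a hypothetical stand-in for an HTTP GET (so the example runs offline), and the `hash://sha256/...` form of the content identifier is assumed to match what contentid-style tools use.

```python
import hashlib
import xml.etree.ElementTree as ET

# Trimmed copy of the resolver response for the CSV object quoted above.
SAMPLE = """<ns2:objectLocationList xmlns:ns2="http://ns.dataone.org/service/types/v1">
<identifier>ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</identifier>
<objectLocation>
<nodeIdentifier>urn:node:KNB</nodeIdentifier>
<url>https://knb.ecoinformatics.org/knb/d1/mn/v2/object/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</url>
</objectLocation>
<objectLocation>
<nodeIdentifier>urn:node:ESS_DIVE</nodeIdentifier>
<url>https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-9ffb26bf9b0a2a7-20220112T000129646599</url>
</objectLocation>
</ns2:objectLocationList>"""


def replica_urls(xml_text):
    """Return every replica URL listed in a resolver response."""
    root = ET.fromstring(xml_text)
    # Child elements are unqualified in the response, so no namespace map is needed.
    return [loc.findtext("url") for loc in root.findall("objectLocation")]


def content_id(data: bytes) -> str:
    """SHA-256 content identifier (assumed hash://sha256/ URI form)."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def fetch_from_any(urls, fetch):
    """Try each replica in order; return (url, bytes) for the first that works."""
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            continue
    return None


def fake_fetch(url):
    # Hypothetical stand-in for an HTTP GET: the first replica is "down".
    if "knb.ecoinformatics.org" in url:
        raise OSError("replica unavailable")
    return b"site,value\nA,1\n"


urls = replica_urls(SAMPLE)
hit = fetch_from_any(urls, fake_fetch)
if hit is not None:
    url, data = hit
    print(url)
    print(content_id(data))
```

Under this scheme an object counts as unavailable only when every listed replica fails, which is a different measure from testing a single cached URL.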
Thanks @jhpoelen for pointing us to this! Really nice paper. As I think you both know, I ran a similar experiment on 4,047,485 of the DataONE PIDs returned by the DataONE API (I didn't get around to the largest objects) in this R script. Here's the resulting dataone.tsv. From this I see:
library(readr)
library(dplyr)
d <- read_csv("https://minio.thelio.carlboettiger.info/shared-data/dataone-hashes.tsv")
d %>% count(status)
#> # A tibble: 2 × 2
#> status n
#> <int> <int>
#> 1 200 3644732
#> 2 404 402753
d %>% summarise(missing = mean(is.na(sha256)))
#> # A tibble: 1 × 1
#> missing
#> <dbl>
#> 1 0.0995
d2 <- d %>% mutate(domain = urltools::domain(source))
d2 %>% filter(is.na(sha256)) %>% count(domain, sort=TRUE)
#> # A tibble: 28 × 2
#> domain n
#> <chr> <int>
#> 1 datadryad.org 289817
#> 2 dataone.tdar.org 69145
#> 3 usgs.ornl.gov 21082
#> 4 cn.dataone.org 8009
#> 5 mn-unm-1.dataone.org 3278
#> 6 dataone-prod.pop.umn.edu 3054
#> 7 arcticdata.io 1803
#> 8 mn-orc-1.dataone.org 1641
#> 9 gstore.unm.edu 1065
#> 10 mn-ucsb-1.dataone.org 965
#> # … with 18 more rows
Created on 2022-03-12 by the reprex package (v2.0.1)
As you see, just shy of 10% of the 4 million PIDs couldn't be retrieved to compute a hash. The majority of those came from datadryad.org, which, as @mbjones already told me, was a known issue at that time. I recall a variety of issues accounted for the other PIDs for which I could not resolve content, ranging from some outdated HTTPS certs to some old servers (e.g. running HTTP/0.9, which recent versions of […]
I think the URLs recorded in my tsv are all of the format […]
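As a quick sanity check on the figures in the reprex output, the counts can be recombined by hand. This sketch uses only numbers printed above. Treating the 404 count as the denominator for the datadryad.org share is an assumption: the output reports missing hashes only as a rounded fraction (0.0995), so the two counts may differ slightly.

```python
# Counts copied from the reprex output above.
status_200 = 3_644_732
status_404 = 402_753
dryad_missing = 289_817  # rows with no sha256 whose source is datadryad.org

total = status_200 + status_404
share_404 = status_404 / total

# Assumption: the number of missing hashes is close to the number of 404s.
dryad_share = dryad_missing / status_404

print(total)
print(f"{share_404:.4f}")
print(f"{dryad_share:.2f}")
```

So the two status counts add up to the 4,047,485 PIDs tested, roughly one in ten could not be retrieved, and on the stated assumption a bit over seven in ten of those failures trace back to datadryad.org.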
Great to see that you and Matt are working on a contentid paper, and thanks for mentioning Preston in:
contentid/paper/paper.Rmd
Line 768 in 665f0e9
Suggest to cite:
MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132
by @mielliott for context and related work.