Publication date:
2022
Abstract:
Background
In the past decade, many national, EU and global projects have succeeded in raising awareness of Open Science and of the importance of making data findable and accessible, as stated in the FAIR principles (Wilkinson et al. 2016). As a result, there have been many advances in the options for discovering data. A multitude of thematic and general catalogues provide faceted browsing interfaces for humans and Application Programming Interfaces (APIs) for machines; similarly, data-citations in publications offer references to resources hosted by repositories. However, such catalogues and data-citations do not guarantee that researchers obtain access to the data itself. In most cases the resource link in the catalogue (and in the metadata) or in the citation points to a "landing-page", a description of the resource meant for human consumption. The landing-page may contain instructions on how to access or download the resource itself, but it is usually difficult for machines to parse.
FAIR data access
Thus the approach sketched above does not meet the requirements of scenarios where applications need assured and quick access to data. The GO FAIR interpretation of the FAIR principles also states*1 that these "emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data." The requirement to provide a Persistent Identifier (PID) for a resource*2 is mostly interpreted as meaning a PID for the resource's metadata or landing-page only.
Note that we ignore the need for user authentication and authorization prior to accessing data; here we consider only data that is 'freely' accessible. To improve machine access to data, a number of technologies and approaches that have been discussed in the CLARIN and Social Sciences and Humanities (SSH) infrastructure domain can be useful. We present some of them and comment on their suitability.
Signposting
Signposting*3 is a technology proposed by van de Sompel (Sompel and Nelson 2015) to expose relevant technical and bibliographical attributes from a resource URI. It is well described and uses the HTTP protocol to provide additional information via HTTP Link headers*4. Alternatively, for HTML resources, the information may be provided in HTML link elements. The CLARIN community accepted the Signposting concept, but its proposed implementation deviates from van de Sompel's and is less dependent on the HTTP protocol (Arnold et al. 2021). On the downside, the Signposting information is embedded in the CLARIN-specific Component Metadata (CMDI) (Broeder et al. 2012), which makes it CLARIN-specific, or at least requires clients to have specific knowledge of CMDI.
CLARIN Digital Object Gateway (DOG)
One approach currently being worked on for the CLARIN research infrastructure is the creation of a DOG library*5 and (later) a service that provides a proxy gateway from the resource PID to the actual data. DOG uses implicit knowledge about the different repository solutions used by the CLARIN B-type centres*6 and by some repositories outside the CLARIN infrastructure. DOG works in two steps: first it obtains the metadata from the resource PID, and then it extracts resource links from the metadata. Each repository registered with DOG has a minimal configuration specifying how to parse the fields of interest from the resource's metadata.
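The second DOG step, extracting resource links from metadata under a per-repository configuration, can be illustrated with a minimal Python sketch. This is not the DOG library's actual code or API: the `REPO_CONFIG` registry, its key names, the repository label `example-repo`, the function `extract_resource_links` and the sample record are all hypothetical. The sketch only assumes that the CMDI record lists its resources as `ResourceProxy` elements carrying a `ResourceType` and a `ResourceRef`.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-repository configuration: where resource links live in the
# metadata. The registry layout and key names are illustrative assumptions,
# not the configuration format used by DOG.
REPO_CONFIG = {
    "example-repo": {
        "namespaces": {"cmd": "http://www.clarin.eu/cmd/"},
        "proxy_path": ".//cmd:ResourceProxy",
        "type_tag": "cmd:ResourceType",
        "ref_tag": "cmd:ResourceRef",
        "wanted_type": "Resource",
    },
}

# Illustrative metadata record (not taken from a real repository).
SAMPLE = """<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/">
  <cmd:Resources>
    <cmd:ResourceProxyList>
      <cmd:ResourceProxy id="lp">
        <cmd:ResourceType>LandingPage</cmd:ResourceType>
        <cmd:ResourceRef>https://repo.example.org/item/1</cmd:ResourceRef>
      </cmd:ResourceProxy>
      <cmd:ResourceProxy id="r1">
        <cmd:ResourceType>Resource</cmd:ResourceType>
        <cmd:ResourceRef>https://repo.example.org/item/1/data.zip</cmd:ResourceRef>
      </cmd:ResourceProxy>
    </cmd:ResourceProxyList>
  </cmd:Resources>
</cmd:CMD>"""


def extract_resource_links(metadata_xml: str, repo: str) -> list[str]:
    """Return the links to the data itself, skipping landing pages."""
    cfg = REPO_CONFIG[repo]
    ns = cfg["namespaces"]
    root = ET.fromstring(metadata_xml)
    links = []
    for proxy in root.findall(cfg["proxy_path"], ns):
        rtype = proxy.findtext(cfg["type_tag"], default="", namespaces=ns).strip()
        ref = proxy.findtext(cfg["ref_tag"], default="", namespaces=ns).strip()
        # Keep only entries typed as actual resources with a non-empty link.
        if rtype == cfg["wanted_type"] and ref:
            links.append(ref)
    return links
```

Calling `extract_resource_links(SAMPLE, "example-repo")` returns only the link typed as `Resource`, which is the distinction between a landing-page and the data itself that the abstract draws.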
For B-type CLARIN centres, DOG uses content negotiation as the primary way of obtaining the metadata in CMDI format.
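Content negotiation of this kind can be sketched with Python's standard library alone. The sketch below is an assumption-laden illustration, not DOG's implementation: the media type `application/x-cmdi+xml` is assumed to be what the target repository serves for CMDI, and the helper names `build_cmdi_request` and `fetch_cmdi` are hypothetical.

```python
import urllib.request

# Assumed media type for CMDI records; verify what a given repository serves.
CMDI_MEDIA_TYPE = "application/x-cmdi+xml"


def build_cmdi_request(pid_url: str) -> urllib.request.Request:
    """Prepare a GET request asking the PID's target for CMDI metadata."""
    return urllib.request.Request(
        pid_url,
        headers={"Accept": CMDI_MEDIA_TYPE},
        method="GET",
    )


def fetch_cmdi(pid_url: str, timeout: float = 10.0) -> str:
    """Resolve the PID and return the CMDI record as text.

    Raises ValueError when the server ignores the Accept header and
    answers with something else (for example an HTML landing-page).
    """
    request = build_cmdi_request(pid_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get_content_type()
        if content_type != CMDI_MEDIA_TYPE:
            raise ValueError(f"expected {CMDI_MEDIA_TYPE}, got {content_type}")
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Checking the returned content type matters here because a server that does not support content negotiation will typically answer with the landing-page, which is exactly the case the client needs to detect.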
CRIS type:
04.02 Abstract in conference proceedings
Keywords:
data management; metadata; CLARIN; PIDs; repositories; Signposting; FAIR Digital Objects
List of authors:
Concordia, Cesare
Link to the full record: