I spoke with Sebastiaan Fluitsma about Ineo and the role of software metadata today. I have also often spoken with @JanOdijk about this in the past. I believe software metadata curation fits the theme of this interest group, although it also heavily involves Linked Open Data (poking @rlzijdeman), so I'll post my thoughts on software metadata and automatic harvesting here:
Ineo is a portal for researchers that aims to present various CLARIAH resources (tools/services and data). One of the concerns we acknowledged was the need to keep the tool metadata in Ineo up-to-date: descriptions should be accurate, version numbers correct, and links valid. This may seem obvious, but it is something that often goes wrong, so I'm advocating clear update and automatic harvesting procedures for metadata.
I have been using codemeta as a solution for all my software metadata needs. Codemeta is a linked open data scheme for describing software metadata, and it is especially focused on providing so-called *crosswalks* to various other existing software metadata standards. These crosswalks link metadata description fields from, for example, the Python Package Index, CRAN, Maven, and Debian to fields that are included in schema.org. Tools like codemetapy and codemetar perform such conversions.
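To make the crosswalk idea concrete, here is a small sketch (not codemetapy itself) that maps a few PKG-INFO fields from the Python packaging ecosystem onto codemeta/schema.org terms. The mapping shown is a simplified, hand-picked subset for illustration; the real crosswalk tables cover many more fields and ecosystems:

```python
from email.parser import Parser

# Simplified subset of a PyPI -> codemeta crosswalk (illustrative only;
# the actual codemeta crosswalk covers far more fields).
PYPI_TO_CODEMETA = {
    "Name": "name",
    "Version": "version",
    "Summary": "description",
    "Home-page": "url",
    "License": "license",
}

def pkginfo_to_codemeta(pkginfo: str) -> dict:
    """Convert PKG-INFO style metadata into a codemeta (JSON-LD) dict."""
    fields = Parser().parsestr(pkginfo)  # PKG-INFO uses RFC 822 headers
    doc = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
    }
    for src, target in PYPI_TO_CODEMETA.items():
        if fields.get(src):
            doc[target] = fields[src]
    return doc

# A made-up PKG-INFO snippet, just to show the conversion:
example = """Name: mytool
Version: 1.2.0
Summary: An example NLP tool
License: GPL-3.0
"""

codemeta = pkginfo_to_codemeta(example)
print(codemeta["name"])     # mytool
print(codemeta["version"])  # 1.2.0
```

The point is that the developer only maintains the upstream metadata (here, the Python package metadata); the codemeta representation is derived from it mechanically.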
I'm a big supporter of storing the metadata as close to the software as possible, and of automatically harvesting and converting it where possible, thereby avoiding unnecessary data duplication. A certain amount of fundamental metadata can be harvested from the software repositories where the software is deposited (Python Package Index, CRAN, Maven, Debian, etc.). Alternatively, codemeta metadata can be explicitly provided in the software's source code repository by including a codemeta.json file, as for example here.
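For reference, a minimal codemeta.json might look something like this (the `@context` is the published codemeta 2.0 context; the tool name, repository, and author values are invented for illustration):

```json
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "mytool",
    "version": "1.2.0",
    "description": "An example tool description",
    "codeRepository": "https://github.com/example/mytool",
    "license": "https://spdx.org/licenses/GPL-3.0",
    "author": [
        { "@type": "Person", "givenName": "Jane", "familyName": "Doe" }
    ]
}
```

Because this lives in the source repository, it is versioned along with the code and stays under the developer's control.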
For instance, for all software installed in a LaMachine installation, a codemeta registry is automatically compiled that describes all the software it contains. This is in turn used to present a simple portal page like the one at https://webservices.cls.ru.nl (a LaMachine installation on a production server in Nijmegen). Of course Ineo is going to be more elaborate than this, but I would still be in favour of letting it pull metadata from a registry that is, as much as possible, compiled in part by automatic harvesting from other metadata sources. I want to prevent all kinds of different versions of duplicated metadata from coming into existence, especially if those are independently and manually curated.
The codemeta initiative is limited to the more fundamental metadata that describes all software, and that alone is not enough. There has been an effort by @JanOdijk to compile official CMDI metadata for various CLARIN/CLARIAH WP3 tools, which takes more elaborate domain-specific metadata into account. This has been a manual curation effort. That is great, but my main concern is that there seems to be no proper update & maintenance mechanism here; currently the raw CMDI files are put on surfdrive. I'd much rather see them maintained in a git repository here in the CLARIAH group, so we have 1) a clear update procedure, 2) proper version control, and 3) transparency & community interaction.
I think metadata collection/curation could be a layered approach in which we combine data from multiple sources when needed. We first grab the basic metadata from as close to the source as possible (converting it from whatever repository it is stored in to codemeta); this usually contains metadata provided directly by the developers. On top of that we can have a manual curation effort that adds extra CLARIAH domain-specific fields. The final result could be expressed as linked open data in some form, preferably as the JSON-LD that I use for codemeta, which I think is more flexible, but even as CMDI if that is still preferred. (I believe there are existing initiatives within CLARIAH that treat CMDI as Linked Open Data, like cmd2rdf?) Tools like Ineo can in turn pull from some kind of central CLARIAH metadata registry to always present accurate metadata.
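The layered approach could be as simple as an overlay merge in which manually curated fields take precedence over automatically harvested ones. A minimal sketch (the CLARIAH-specific field names in the curation layer are hypothetical, purely to illustrate the idea):

```python
def layer_metadata(harvested: dict, curated: dict) -> dict:
    """Merge automatically harvested codemeta with a manual curation
    layer. Curated fields win on conflict, so curators can correct or
    enrich upstream metadata without touching the source of truth."""
    merged = dict(harvested)
    for key, value in curated.items():
        merged[key] = value
    return merged

# Base layer: harvested from the software's own repository/registry
harvested = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "mytool",
    "version": "1.2.0",
    "description": "Original developer-written description",
}

# Curation layer: CLARIAH domain-specific additions and corrections
# (field names here are invented for illustration)
curated = {
    "applicationCategory": "Linguistics",
    "description": "A corrected, curator-written description",
}

merged = layer_metadata(harvested, curated)
print(merged["version"])      # 1.2.0 (from harvesting, untouched)
print(merged["description"])  # the curated value wins
```

A fresh harvest then only refreshes the base layer, so version numbers and links stay current automatically, while the curated layer persists on top.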
This is just my view on things, of course, which I want to throw out here for debate because I think there are gains to be made. The LOD crowd probably has more to say on this too.