jhpoelen / name-alignment-ecdysis

Attempt to align all names in Ecdysis Biodiversity Datasets using Nomer and Preston

License: Creative Commons Zero v1.0 Universal

datasets taxonomies
id     name
col    Catalogue of Life
gbif   GBIF Backbone Taxonomy
itis   Integrated Taxonomic Information System
ncbi   NCBI Taxonomy
globi  GloBI Taxon Graph

Name Alignment

Name Alignment by Nomer

Aligning taxonomic names is a common task in biodiversity informatics.

This template repository offers an automated method to align scientific names in CSV/TSV files and Darwin Core Archives (DwC-A) with common taxonomic name lists such as the Catalogue of Life, NCBI Taxonomy, the Integrated Taxonomic Information System (ITIS), and the GBIF Backbone Taxonomy.

To re-use:

  1. Create your own repository using this repository as a template.
  2. Edit README.md and add the URLs or filenames of the resources you'd like to review. At the time of writing (June 2022), only the following types are supported: text/csv, text/tab-separated-values, application/dwca.
  3. For now, only names in the column "scientificName" (TSV/CSV) or "http://rs.tdwg.org/dwc/terms/scientificName" (DwC-A) are aligned.
  4. Commit the changes to GitHub.
  5. Inspect the results of the name alignment in "GitHub Actions" (e.g., sample results).
  6. Download the results from the single-use https://file.io link provided in the alignment report (look for "Download the name alignment results with the single-use, and expiring, file.io link at: https://file.io/[something]").
  7. To re-create the results, change your name list on GitHub or select "re-run jobs" in GitHub Actions.
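
Because only the "scientificName" column is aligned, it can help to check a file for that column before committing it. The following is a minimal sketch, not part of this repository's scripts: the function name extract_scientific_names and the sample data are hypothetical, and it handles only simple TSV input (a CSV with quoted fields would need a proper CSV parser).

```shell
# Hypothetical helper: print the values of the "scientificName" column of a TSV file.
# Returns a non-zero status if the header row has no such column.
extract_scientific_names() {
  awk -F'\t' '
    NR == 1 {
      # Locate the column by its header name.
      for (i = 1; i <= NF; i++) if ($i == "scientificName") col = i
      if (!col) { print "no scientificName column found" > "/dev/stderr"; exit 1 }
      next
    }
    # Print non-empty names from the data rows.
    $col != "" { print $col }
  ' "$1"
}

# Tiny made-up sample, for illustration only.
sample="$(mktemp)"
printf 'catalogNumber\tscientificName\n1\tLasioglossum\n2\tHomo sapiens\n' > "$sample"
extract_scientific_names "$sample"
```

If the function prints the names you expect, the file has the column the alignment workflow looks for.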

Origin

This repository was conceived on 2022-03-08 during the Alien CSI Hack-a-thon in Romania by Christina, Quentin, Jorrit, Jasmijn, .... For more information see https://github.com/alien-csi/alien-csi-hackathon .

Contributors

name affiliation orcid
Jorrit Poelen GloBI; Ronin Institute https://orcid.org/0000-0003-3138-4118
your name your affiliation your orcid

Feedback / issues

This repository uses scripts from https://github.com/globalbioticinteractions/globinizer. These scripts use command-line tools such as GloBI's nomer, cut, and sed.

Misc Notes

Install nomer (needs Java 8 or Java 11) from:

https://github.com/globalbioticinteractions/nomer

e.g., Carl Boettiger's taxadb R package

Prefix each name with a tab, leaving the id column empty, to prepare the list for nomer. The \t escape in the replacement assumes GNU sed.

sed 's/^/\t/' foodorganisms.txt > foodorganisms.tsv

Nomer expects the format to be:

[id][tab][name]

e.g., id\tname, as in NCBI:9606\tHomo sapiens
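
The [id][tab][name] shape can be produced and checked with awk alone. This is a rough sketch under stated assumptions: names_to_nomer_tsv and is_nomer_tsv are hypothetical helper names, and the check for exactly two columns simply mirrors the format described above, which may be stricter than what nomer itself requires.

```shell
# Hypothetical helper: turn a plain list of names (one per line) into
# [id][tab][name] lines with an empty id, skipping blank lines.
names_to_nomer_tsv() {
  awk 'NF > 0 { printf "\t%s\n", $0 }' "$@"
}

# Hypothetical helper: succeed only if every line has exactly
# two tab-separated columns.
is_nomer_tsv() {
  awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad }' "$@"
}

# Illustration with made-up names read from standard input.
printf 'Homo sapiens\n\nApis mellifera\n' | names_to_nomer_tsv | is_nomer_tsv && echo "input looks ok"
```

Unlike the sed one-liner, the awk version does not rely on GNU-specific escape handling, since printf in awk interprets \t itself.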

Append the ITIS taxonomic interpretation to each name and redirect the output to the file 'name-itis.tsv':

cat foodorganisms.tsv | nomer append itis > name-itis.tsv

Open name-itis.tsv in LibreOffice Calc to inspect the results.

Repeat with 'gbif' instead of 'itis' to align against the GBIF Backbone Taxonomy.

Provenance of DwC-A Names

The context of each name extracted from a DwC-A is captured in an unusual-looking identifier:

line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10

extracted from a generated names-aligned.tsv:

$ cat names-aligned.tsv | grep hash | grep occurrence | head -n1
line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10	Lasioglossum	SAME_AS	line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10	Lasioglossum								HAS_ACCEPTED_NAME	COL:5B4P	Lasioglossum	genus		Biota | Animalia | Arthropoda | Insecta | Hymenoptera | Apoidea | Halictidae | Halictinae | Halictini | Lasioglossum	COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HYM | COL:625GP | COL:625H4 | COL:JMV | COL:KV7 | COL:5B4P	unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus	https://www.catalogueoflife.org/data/taxon/5B4P	

This text identifies the row from which the name was extracted: in this case, line 10 of the file occurrences.csv, contained in the zip file with content id hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb . If you retain the tracked dataset (in this case the UC Santa Barbara Invertebrate Zoology Collection, accessed on 2022-06-30) provided in the data/ folder of the name alignment archive, you can use Preston to dig up the original record using:

$ preston cat 'line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10' 
881449,UCSB,IZC,,b03a3f0c-bfa5-4e02-b5d3-56ff38626302,PreservedSpecimen,a8a4f8b1-38f1-4e10-9b75-b2e86ac196fc,UCSB-IZC00038312,,Animalia|Arthropoda|Hexapoda|Insecta|Pterygota|Neoptera|Hymenoptera|Apocrita|Aculeata|Apoidea|Halictidae|Halictinae|Halictini,Animalia,Arthropoda,Insecta,Hymenoptera,Halictidae,Lasioglossum,186125,"Curtis, 1833",Lasioglossum,,,,,Genus,"EEMB/ENV S 96",24-May-2022,,,,,,"Sophie Cameron",,2022-04-26,2022,4,26,116,,,,"Newly restored salt marsh",PAN2,,,,,,,"on flower of Eschscholzia californica",,,Adult,Female,1,Pinned,"United States",California,"Santa Barbara",,"University of California Santa Barbara North Campus Open Space",,34.42174,-119.87186,WGS84,10,,,,GPS,,,,,,,,,,,,"2022-05-31 10:52:55",http://creativecommons.org/publicdomain/zero/1.0/,"The Regents of the University of California",https://www.ccber.ucsb.edu/collections/databases-searching-specimen-data-and-images,urn:uuid:a8a4f8b1-38f1-4e10-9b75-b2e86ac196fc,https://serv.biokic.asu.edu/ecdysis/collections/individual/index.php?occid=881449

which links to a preserved specimen with occurrenceId b03a3f0c-bfa5-4e02-b5d3-56ff38626302 and landing page at https://serv.biokic.asu.edu/ecdysis/collections/individual/index.php?occid=881449 . Also see screenshot made on 2022-06-30.

With this context, you can trace the origin and context of the name in great detail. This detail can be used to troubleshoot bugs in the name alignment process, or provide granular feedback to those that maintain the dataset or taxonomy.
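
To make the structure of such an identifier concrete, the sketch below splits it into content id, file name, and line number using plain shell parameter expansion. The function name parse_line_ref is hypothetical, and the sketch assumes the exact "line:zip:<content id>!/<file>!/L<number>" layout of the example above; it is not how Preston itself resolves these identifiers.

```shell
# Hypothetical helper: split a "line:zip:<content id>!/<file>!/L<number>"
# identifier into its three parts, printed tab-separated.
parse_line_ref() {
  ref="${1#line:zip:}"        # drop the "line:zip:" prefix
  content_id="${ref%%!/*}"    # text before the first "!/"
  rest="${ref#*!/}"           # text after the first "!/"
  file="${rest%!/L*}"         # text before the trailing "!/L<number>"
  line="${rest##*!/L}"        # digits after the last "!/L"
  printf '%s\t%s\t%s\n' "$content_id" "$file" "$line"
}

parse_line_ref 'line:zip:hash://sha256/fe63af46ed66abd253ee148e383fb51da6695ce3848d0bde39af18aa77d364fb!/occurrences.csv!/L10'
```

For the example identifier, the three fields are the sha256 content id of the zip file, occurrences.csv, and 10.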
