
nfdi4plants / arctokenization


Definition of controlled vocabulary tokens and library to tokenize ARC metadata into these tokens

Home Page: https://nfdi4plants.github.io/ARCTokenization/

License: MIT License

Batchfile 0.02% Shell 0.03% F# 96.56% GLSL 3.39%

arctokenization's Introduction

ArcGraphModel

Library structure

CvTokens

classDiagram
    ICvBase <|-- IParam : Inherits
    IParamBase <|-- IParam : Inherits
    ICvBase <|.. CvObject : Implements
    ICvBase <|.. CvContainer : Implements
    IParam <|.. UserParam : Implements
    IParam <|.. CvParam : Implements
    <<Interface>> ICvBase
    <<Interface>> IParamBase
    <<Interface>> IParam
    class ICvBase{
        + CvTerm     
    }
    class IParamBase{
        + CvValue
        + WithValue()
    }
    class IParam{
        + CvTerm 
        + CvValue
        + WithValue()
    }
    class CvObject{
        + Attributes
        + CvTerm
        + Generic Value
    }
    class CvContainer{
        + Attributes
        + CvTerm
        + Children        
        + GetSingle()       
        + SetSingle()
        + GetMany()
        + SetMany()

    }
    class CvParam{
        + Attributes
        + CvTerm
        + CvValue
        + WithValue()
    }
    class UserParam{
        + Attributes
        + Term
        + CvValue
        + WithValue()
    }
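As a rough illustration of the hierarchy in the diagram above, the interfaces can be sketched in F# with toy types (this is not the library's actual code; member types are simplified assumptions):

```fsharp
// Toy sketch of the CvToken hierarchy shown in the class diagram.
// CvValue is simplified to string; the real library likely uses richer types.
type CvTerm = { Accession: string; Name: string; RefUri: string }

type ICvBase =
    abstract member CvTerm : CvTerm

type IParamBase =
    abstract member CvValue : string
    abstract member WithValue : string -> IParamBase

// IParam combines both: a controlled-vocabulary term plus a value.
type IParam =
    inherit ICvBase
    inherit IParamBase
```

Concrete types like `CvParam` and `UserParam` then implement `IParam`, while `CvObject` and `CvContainer` implement only `ICvBase`, matching the diagram.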

Develop

Prerequisites

  • .NET 6 SDK
  • nodejs (tested with ~v16)

Setup

  • dotnet tool restore
  • npm install

Build whole project

Linux/macOS

  • make build.sh executable
  • run build.sh

Windows

run build.cmd

or run the build project directly:

dotnet run --project ./build/build.fsproj

Build ontologies (YAML to OBO)

Linux/macOS

  • make build.sh executable
  • run build.sh buildOntologies

Windows

run build.cmd buildOntologies

or run the build project directly:

dotnet run --project ./build/build.fsproj buildOntologies

Test

Linux/macOS

  • run build.sh runTests

Windows

  • run build.cmd runTests

or run the build project directly:

dotnet run --project ./build/build.fsproj runTests


arctokenization's Issues

Use new equals overrides for testing CvParam equality in ARCTokenization tests

Currently, equality is only tested by comparing some member values:

let structuralEquality (cvpExpected : CvParam) (cvpActual : CvParam) =
    termNamesEqual cvpExpected cvpActual
    accessionsEqual cvpExpected cvpActual
    refUrisEqual cvpExpected cvpActual
    valuesEqual cvpExpected cvpActual

Since #32 adds custom equality overrides, we should use those when it is merged.
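Once those overrides are in place, the helper could collapse to a single comparison; a hedged sketch (assuming the override compares term name, accession, ref URI, and value):

```fsharp
// Sketch only: relies on the custom equality override from #32 being
// structural over the relevant members.
let structuralEquality (cvpExpected : CvParam) (cvpActual : CvParam) =
    cvpExpected.Equals(cvpActual)
```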

Feature Request - Enhanced Tokenization for Specific Folders and Files

Description:
I would like to request a new feature for the GitHub repository that involves the enhancement of the tokenization tool to handle specific folders and files differently. The goal is to provide more flexibility and customization for handling top-level folders and their corresponding subfolders, as well as specific file types within those folders.

Features Requested:

  1. Folder Handling:

    • Top-level folders named "studies," "assays," "runs," and "workflows" should be treated differently during tokenization.
    • Subfolders directly beneath each top-level folder, named "study," "assay," "run," and "workflow," should also be handled differently.
  2. File Type Handling:

    • The tokenization tool should recognize and handle files with the following extensions differently:
      • CWL files (*.cwl)
      • IML files (*.iml)
      • ISA files related to the top-level folders mentioned above.

Expected Behavior:

  • Files within "study," "assay," "run," and "workflow" subfolders should be treated differently based on their file types.
  • ISA files within the top-level folders should also be handled differently during tokenization.

Rationale:
This feature will greatly benefit users working with structured data in the specified domain, allowing for more precise and customized tokenization based on the folder and file context.

Additional Notes:
Feel free to reach out if more information or clarification is needed. Thank you for considering this feature request.

Investigation Ontology: Investigation Person ORCID misses `follows` relationship

[Term]
id: INVMSO:00000094
name: Comment[Investigation Person ORCID]
def: ""
synonym: "Comment[<Investigation Person ORCID>]" EXACT []
relationship: part_of INVMSO:00000021 ! INVESTIGATION CONTACTS

misses a follows relationship to INVMSO:00000093. This is needed in arc-validate to check the Investigation file's schema.

Same goes for

[Term]
id: INVMSO:00000093
name: Comment[<Investigation Person ORCID>]
def: ""
synonym: "Comment[Investigation Person ORCID]" EXACT []
is_obsolete: true
relationship: part_of INVMSO:00000021 ! INVESTIGATION CONTACTS

[Discussion] AnnotationTable.TokenizedAnnotationTable

I think we should reconsider the current design of this type, as it is in kind of an awkward state:

Currently it is split into a list of IO columns and a list of term columns. This has two problems according to the currently proposed state of the ARC specification 1.2:

  1. What about non-term and non-IO columns like Protocol REF?
  2. There MUST be at most 1 Input and 1 Output Column, so a list seems counterintuitive.

As an alternative to trying to design this in some specific way, we could also keep it more naive and just have a single list of columns (including terms, IOs, and whatever else)?

#25
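The naive alternative could look roughly like this (a sketch with illustrative case names, not a concrete proposal for the library's types):

```fsharp
// Toy sketch of the "just a list of columns" design: one union covers
// IO columns, Protocol REF, and term columns alike, so non-term/non-IO
// columns have a natural home and cardinality rules (at most one Input
// and one Output) become a validation concern rather than a type shape.
type Column =
    | Input of values: string list
    | Output of values: string list
    | ProtocolRef of values: string list
    | TermColumn of name: string * values: string list

type TokenizedAnnotationTable = { Columns: Column list }
```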

[BUG] Study structural ontology has incorrect ID links in some cases

"STUDY METADATA" has ID "[...]62":

[Term]
id: STDMSO:00000062
name: STUDY METADATA
def: ""
synonym: "STUDY" EXACT []
is_obsolete: true
relationship: part_of STDMSO:00000001 ! Study Metadata

Yet, in some cases, it is referred to as having ID "[...]51":

[Term]
id: STDMSO:00000003
name: Study Identifier
def: ""
relationship: part_of STDMSO:00000002 ! STUDY
relationship: part_of STDMSO:00000051 ! STUDY METADATA
relationship: follows STDMSO:00000002 ! STUDY
relationship: follows STDMSO:00000051 ! STUDY METADATA

[Term]
id: STDMSO:00000004
name: Study Title
def: ""
relationship: part_of STDMSO:00000002 ! STUDY
relationship: part_of STDMSO:00000051 ! STUDY METADATA
relationship: follows STDMSO:00000003 ! Study Identifier

and so forth.

We need to fix that in order to have relations between terms!

Add integration tests

We definitely need more unit tests, and especially integration tests: we have some functionality tested at the most atomic level, but this does not help us if the whole pipeline functions cannot parse, e.g., an Investigation file.

[BUG] Comment <Investigation Person ORCID> is parsed as UserParam and has no CvTerm

But it would definitely be better if it were parsed as a CvParam. Should Comments be allowed at all?

Also: ORCID misses a CvTerm in the respective ontology.

The workaround...

let cvparams = 
    params      // this is an IParam list, parsed from an ISA.XLSX file
    |> List.map (
        fun p -> 
            match CvParam.tryCvParam p with
            | Some cvp -> cvp
            | None -> CvParam(p.ID, p.Name, p.RefUri, p.Value, p :?> CvAttributeCollection)
    )

[BUG] `parseTableColumn` and `parseColumn` naming is interchanged

https://github.com/nfdi4plants/ArcGraphModel/blob/74c57e61aa5cd798cf0702f9daab053b14af32b2/src/ArcGraphModel.IO/ISA/Worksheet.fs#L24-L37

The name is `parseColumn`, but it parses the columns of the worksheet's first table, while

https://github.com/nfdi4plants/ArcGraphModel/blob/74c57e61aa5cd798cf0702f9daab053b14af32b2/src/ArcGraphModel.IO/ISA/Worksheet.fs#L39-L52

parses the columns of the whole worksheet (ignoring tables), even though the name is `parseTableColumns`.

In short, the names are swapped.

Improve parsers to cover more CvTerms

The changes seem fine for getting just a CvParam list. But don't forget to extend the list of KeyMappers if you want to transform all fields to CvParams. I didn't finish it up completely.
https://github.com/nfdi4plants/ArcGraphModel/blob/74c57e61aa5cd798cf0702f9daab053b14af32b2/src/ArcGraphModel.IO/ISA/KeyParser.fs#L38

Originally posted by @HLWeil in #17 (comment)

Edit by kMutagene

Things that need to happen for this (for structural terms in metadata sheet sections):

  • Port obo parser to separate library, reference it (done, see https://github.com/CSBiology/FsOboParser)
  • Create structural ontologies for
    • Investigation metadata sheets
    • Study metadata sheets
    • Assay metadata sheets
  • Create CvParams from all terms in a structural ontology
  • Add these terms to the matching logic that creates the flat token lists for metadata sheets
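The "create CvParams from all terms" step could be sketched like this (toy types; the real terms would come from the FsOboParser output and feed the library's token types):

```fsharp
// Toy sketch: turn each structural-ontology term into an (accession, name)
// token pair that the metadata-sheet matching logic can consume.
type OboTerm = { Id: string; Name: string }

let toStructuralTokens (terms: OboTerm list) =
    terms |> List.map (fun t -> t.Id, t.Name)
```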

Investigation Ontology: STUDY misses some `follows` relationships

[Term]
name: STUDY
id: INVMSO:00000033
def: ""
relationship: part_of INVMSO:00000001 ! Investigation Metadata
relationship: follows INVMSO:00000032 ! Investigation Person Roles Term Source REF

As seen in the attached screenshot (omitted here), STUDY can also follow Study Person Roles Term Source REF (for every Study added to an Investigation after the first one). This is currently not mapped in the Investigation Metadata ontology.

Image's source file: https://git.nfdi4plants.org/muehlhaus/ArcPrototype/-/blob/main/isa.investigation.xlsx


STUDY also misses a `follows` relationship to Investigation Person ORCID (screenshot omitted; same source as above).

Move ControlledVocabulary to own repo

ControlledVocabulary is not only needed in ARCTokenization but also in other projects that may rely on CV types (like, possibly, OBO.NET) or that could integrate the CV model in any other way.
Therefore, ControlledVocabulary should imo be in its own repo to circumvent circular dependencies with other repos on the one hand and to thin out the bloated ARCTokenization repo on the other.

What's your opinion on this, @kMutagene?

[BUG] Annotation table parsing can result in CvParam lists of incorrect length when parsing incorrect building blocks

See the code here:

Every time this loop parses incorrect building blocks (e.g., when `rest` still contains something even though a correct set of building blocks was parsed before), it yields the next parsing results into the same list.

This is problematic for the following reason: this function is used on pre-grouped lists of headers, and is expected to return a group of columns of an annotation table as a list of CvParams.

Example:

Parsing

Parameter[Time] Unit Term Source REF (PATO:0000165) Term Accession Number (PATO:0000165)
5 hour some TSR some TAN
2 hour some TSR some TAN

returns a list of 2 CvParams, correctly representing the 2 rows of this building block. However, when there is some wrong value that is not caught by grouping, e.g.

Parameter[Time] Unit Term Source REF (PATO:0000165) Term Accession Number (PATO:0000165) Hello XD
5 hour some TSR some TAN uhm
2 hour some TSR some TAN yea

the loop will return a list of 4 CvParams: the 2 correct ones, plus 2 from the incorrect column matching this case.

I know that we want a least-assumption approach, but we can be absolutely certain that the input table has the same number of cells per row. This bug would break transposition of the 2D CvParam list that represents the table columns in some cases, because the inner collections would not all have the same dimensions, which in turn prevents accessing the rows of the table, which represent the process graphs. We need to reflect this knowledge in the output somehow, e.g. by not using `yield!` here:

yield! cvPars
yield! loop false rest

and aggregating the results in the calling function instead.
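The aggregation idea can be sketched with toy types (the real code works on CvParams inside a recursive loop; here the point is only the shape of the result):

```fsharp
// Toy sketch: return one inner list per parsed building block instead of
// flattening with `yield!`, so the 2D shape of the table survives and the
// caller can validate dimensions before transposing.
let rec parseBlocks (blocks: string list list) : string list list =
    match blocks with
    | [] -> []
    | block :: rest ->
        block :: parseBlocks rest   // caller decides whether/how to flatten
```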

`Study.parseMetadataSheetfromFile` does not parse metadata sheet: says worksheet is not present

System.Exception: No worksheet named 'Study' or 'isa_study' found in the workbook
   at ARCTokenization.Workbook.getStudyMetadataSheet(Boolean useLastSheetOnIncorrectName, FsWorkbook study)
   at ARCTokenization.Study.parseMetadataRowsFromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ARCTokenization.Study.parseMetadataSheetfromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ArcValidation.ArcGraph.fromXlsxFile(Dictionary`2 onto, FSharpFunc`2 xlsxParsing, String xlsxPath) in C:\Repos\nfdi4plants\arc-validate\src\ARCValidation\ArcGraph.fs:line 229
   at <StartupCode$FSI_0010>.$FSI_0010.main@() in C:\Repos\nfdi4plants\arc-validate\prototype.fsx:line 194
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Stopped due to error

On the attached file, Study.parseMetadataSheetfromFile does not work, although a worksheet with that name is present.

isa.study.xlsx

(File taken from https://git.nfdi4plants.org/muehlhaus/ArcPrototype)

Unravel tokenization

Currently, a lot of the tokenization already packs information into the CvAttribute collection, among other things. There are also some bugs in parsing keys associated with this.
We need to disentangle this structural muddle and add more testing to make this mess more stable than it is now.

Add CodeGenerator for structural ontologies

Currently, the F# code for structural ontologies is written by hand from the respective OBO/YAML files. We need to automate this to make updating the structural ontology libraries easier.

[Discussion] Is it really necessary to nest static ontology terms?

currently, creating an arc-validate test with the nested static ontology structure looks like this:

ARCExpect.test (TestID.Name INVMSO.``Investigation Metadata``.INVESTIGATION.``Investigation Title``.Name) {
    cvParams
    |> ARCExpect.ByTerm.contains INVMSO.``Investigation Metadata``.INVESTIGATION.``Investigation Title``
}

This is a LOT of noise. INVMSO has < 100 terms. I think it would be both easier AND more discoverable to create a flat class, so that it would just be

ARCExpect.test (TestID.TermName INVMSO.``Investigation Title``) {
    cvParams
    |> ARCExpect.ByTerm.contains INVMSO.``Investigation Title``
}

This way, you would have all possible terms available once you write `INVMSO.`, without needing to know about any nested structure.
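Flattening could be as simple as re-exposing each nested binding at the top level; a toy sketch (module paths and the accession value are illustrative stand-ins, not the generated code):

```fsharp
// Toy sketch of flattening a nested static ontology module.
module NestedOntology =
    module ``Investigation Metadata`` =
        module INVESTIGATION =
            let ``Investigation Title`` = "INVMSO:00000000" // illustrative accession

// Flat companion: every term reachable directly under one module name.
module FlatOntology =
    let ``Investigation Title`` =
        NestedOntology.``Investigation Metadata``.INVESTIGATION.``Investigation Title``
```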

Check how missing values are handled

Example: we need some way of grouping the flat token list into column-based objects, e.g. persons (screenshot omitted).

A first idea would be to use cell address information.
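The cell-address idea can be sketched with a toy token type (the real tokens presumably carry their worksheet address as an attribute):

```fsharp
// Toy sketch: group a flat token list into columns by cell address,
// then sort each column by row so missing values show up as gaps.
type AddressedToken = { Column: int; Row: int; Value: string }

let toColumns (tokens: AddressedToken list) =
    tokens
    |> List.groupBy (fun t -> t.Column)
    |> List.sortBy fst
    |> List.map (fun (_, col) -> col |> List.sortBy (fun t -> t.Row))
```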

`Assay.parseMetadataSheetfromFile` does not parse metadata sheet: says worksheet is not present

System.Exception: No worksheet named 'Assay' or 'isa_assay' found in the workbook
   at ARCTokenization.Workbook.getAssayMetadataSheet(Boolean useLastSheetOnIncorrectName, FsWorkbook assay)
   at ARCTokenization.Assay.parseMetadataRowsFromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ARCTokenization.Assay.parseMetadataSheetFromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ArcValidation.ArcGraph.fromXlsxFile(Dictionary`2 onto, FSharpFunc`2 xlsxParsing, String xlsxPath) in C:\Repos\nfdi4plants\arc-validate\src\ARCValidation\ArcGraph.fs:line 229
   at <StartupCode$FSI_0011>.$FSI_0011.main@() in C:\Repos\nfdi4plants\arc-validate\prototype.fsx:line 198
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Stopped due to error

On the attached file, Assay.parseMetadataSheetfromFile does not work, although a worksheet with that name is present.

isa.assay.xlsx

(File taken from https://git.nfdi4plants.org/muehlhaus/ArcPrototype)
