
nfdi4plants / arctokenization


Definition of controlled vocabulary tokens and library to tokenize ARC metadata into these tokens

Home Page: https://nfdi4plants.github.io/ARCTokenization/

License: MIT License

Batchfile 0.02% Shell 0.03% F# 96.56% GLSL 3.39%

arctokenization's Introduction

ArcGraphModel

Library structure

CvTokens

classDiagram
    ICvBase <|-- IParam : Inherits
    IParamBase <|-- IParam : Inherits
    ICvBase <|.. CvObject : Implements
    ICvBase <|.. CvContainer : Implements
    IParam <|.. UserParam : Implements
    IParam <|.. CvParam : Implements
    <<Interface>> ICvBase
    <<Interface>> IParamBase
    <<Interface>> IParam
    class ICvBase{
        + CvTerm     
    }
    class IParamBase{
        + CvValue
        + WithValue()
    }
    class IParam{
        + CvTerm 
        + CvValue
        + WithValue()
    }
    class CvObject{
        + Attributes
        + CvTerm
        + Generic Value
    }
    class CvContainer{
        + Attributes
        + CvTerm
        + Children        
        + GetSingle()       
        + SetSingle()
        + GetMany()
        + SetMany()

    }
    class CvParam{
        + Attributes
        + CvTerm
        + CvValue
        + WithValue()
    }
    class UserParam{
        + Attributes
        + Term
        + CvValue
        + WithValue()
    }
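As a rough illustration of the hierarchy in the diagram above, the interfaces can be sketched in F# with toy types (this is not the library's actual code; member types are simplified assumptions):

```fsharp
// Toy sketch of the CvToken hierarchy shown in the class diagram.
// CvValue is simplified to string; the real library likely uses richer types.
type CvTerm = { Accession: string; Name: string; RefUri: string }

type ICvBase =
    abstract member CvTerm : CvTerm

type IParamBase =
    abstract member CvValue : string
    abstract member WithValue : string -> IParamBase

// IParam combines both: a controlled-vocabulary term plus a value.
type IParam =
    inherit ICvBase
    inherit IParamBase
```

Concrete types like `CvParam` and `UserParam` then implement `IParam`, while `CvObject` and `CvContainer` implement only `ICvBase`, matching the diagram.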

Develop

Prerequisites

  • .NET 6 SDK
  • nodejs (tested with ~v16)

Setup

  • dotnet tool restore
  • npm install

Build whole project

Linux/macOS

  • make build.sh executable
  • run build.sh

Windows

run build.cmd

or run the build project directly:

dotnet run --project ./build/build.fsproj

Build ontologies (YAML to OBO)

Linux/macOS

  • make build.sh executable
  • run build.sh buildOntologies

Windows

run build.cmd buildOntologies

or run the build project directly:

dotnet run --project ./build/build.fsproj buildOntologies

Test

Linux/macOS

  • run build.sh runTests

Windows

  • run build.cmd runTests

or run the build project directly:

dotnet run --project ./build/build.fsproj runTests


arctokenization's Issues

Use new equals overrides for testing CvParam equality in ARCTokenization tests

Currently, equality is only tested by comparing some member values:

let structuralEquality (cvpExpected : CvParam) (cvpActual : CvParam) =
    termNamesEqual cvpExpected cvpActual
    accessionsEqual cvpExpected cvpActual
    refUrisEqual cvpExpected cvpActual
    valuesEqual cvpExpected cvpActual

Since #32 adds custom equality overrides, we should use those when it is merged.
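Once those overrides are in place, the helper could collapse to a single comparison; a hedged sketch (assuming the override compares term name, accession, ref URI, and value):

```fsharp
// Sketch only: relies on the custom equality override from #32 being
// structural over the relevant members.
let structuralEquality (cvpExpected : CvParam) (cvpActual : CvParam) =
    cvpExpected.Equals(cvpActual)
```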

Feature Request - Enhanced Tokenization for Specific Folders and Files

Description:
I would like to request a new feature for the GitHub repository that involves the enhancement of the tokenization tool to handle specific folders and files differently. The goal is to provide more flexibility and customization for handling top-level folders and their corresponding subfolders, as well as specific file types within those folders.

Features Requested:

  1. Folder Handling:

    • Top-level folders named "studies," "assays," "runs," and "workflows" should be treated differently during tokenization.
    • Subfolders directly beneath each top-level folder, named "study," "assay," "run," and "workflow," should also be handled differently.
  2. File Type Handling:

    • The tokenization tool should recognize and handle files with the following extensions differently:
      • CWL files (*.cwl)
      • IML files (*.iml)
      • ISA files related to the top-level folders mentioned above.

Expected Behavior:

  • Files within "study," "assay," "run," and "workflow" subfolders should be treated differently based on their file types.
  • ISA files within the top-level folders should also be handled differently during tokenization.

Rationale:
This feature will greatly benefit users working with structured data in the specified domain, allowing for more precise and customized tokenization based on the folder and file context.

Additional Notes:
Feel free to reach out if more information or clarification is needed. Thank you for considering this feature request.

Investigation Ontology: Investigation Person ORCID misses `follows` relationship

[Term]
id: INVMSO:00000094
name: Comment[Investigation Person ORCID]
def: ""
synonym: "Comment[<Investigation Person ORCID>]" EXACT []
relationship: part_of INVMSO:00000021 ! INVESTIGATION CONTACTS

misses a follows relationship to INVMSO:00000093. This is needed in arc-validate to check the Investigation file's schema.

Same goes for

[Term]
id: INVMSO:00000093
name: Comment[<Investigation Person ORCID>]
def: ""
synonym: "Comment[Investigation Person ORCID]" EXACT []
is_obsolete: true
relationship: part_of INVMSO:00000021 ! INVESTIGATION CONTACTS

[Discussion] AnnotationTable.TokenizedAnnotationTable

I think we should reconsider the current design of this type, as it is in kind of an awkward state:

Currently it is split into a list of IO columns and a list of term columns. This has two problems according to the currently proposed state of the ARC specification 1.2:

  1. What about non-term and non-IO columns like Protocol REF?
  2. There MUST be at most 1 Input and 1 Output Column, so a list seems counterintuitive.

As an alternative to trying to design this in some specific way, we could also keep it more naive and just have a single list of columns (including terms, IOs, and whatever else)?

#25
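The naive alternative could look roughly like this (a sketch with illustrative case names, not a concrete proposal for the library's types):

```fsharp
// Toy sketch of the "just a list of columns" design: one union covers
// IO columns, Protocol REF, and term columns alike, so non-term/non-IO
// columns have a natural home and cardinality rules (at most one Input
// and one Output) become a validation concern rather than a type shape.
type Column =
    | Input of values: string list
    | Output of values: string list
    | ProtocolRef of values: string list
    | TermColumn of name: string * values: string list

type TokenizedAnnotationTable = { Columns: Column list }
```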

[BUG] Study structural ontology has incorrect ID links in some cases

"STUDY METADATA" has ID "[...]62":

[Term]
id: STDMSO:00000062
name: STUDY METADATA
def: ""
synonym: "STUDY" EXACT []
is_obsolete: true
relationship: part_of STDMSO:00000001 ! Study Metadata

Yet, in some cases, it is referred to as having ID "[...]51":

[Term]
id: STDMSO:00000003
name: Study Identifier
def: ""
relationship: part_of STDMSO:00000002 ! STUDY
relationship: part_of STDMSO:00000051 ! STUDY METADATA
relationship: follows STDMSO:00000002 ! STUDY
relationship: follows STDMSO:00000051 ! STUDY METADATA

[Term]
id: STDMSO:00000004
name: Study Title
def: ""
relationship: part_of STDMSO:00000002 ! STUDY
relationship: part_of STDMSO:00000051 ! STUDY METADATA
relationship: follows STDMSO:00000003 ! Study Identifier

and so forth.

We need to fix that in order to have relations between terms!

Add integration tests

We definitely need more unit tests, and especially integration tests: we have some functionality tested at the most atomic level, but this does not help us if the whole pipeline functions cannot parse, e.g., an Investigation file.

[BUG] Comment <Investigation Person ORCID> is parsed as UserParam and has no CvTerm

But it would definitely be better if it were parsed as a CvParam. Should Comments be allowed at all?

Also: ORCID misses a CvTerm in the respective ontology.

The workaround...

let cvparams = 
    params      // this is an IParam list, parsed from an ISA.XLSX file
    |> List.map (
        fun p -> 
            match CvParam.tryCvParam p with
            | Some cvp -> cvp
            | None -> CvParam(p.ID, p.Name, p.RefUri, p.Value, p :?> CvAttributeCollection)
    )

[BUG] `parseTableColumn` and `parseColumn` naming is interchanged

https://github.com/nfdi4plants/ArcGraphModel/blob/74c57e61aa5cd798cf0702f9daab053b14af32b2/src/ArcGraphModel.IO/ISA/Worksheet.fs#L24-L37

The name is `parseColumn`, but it parses the columns of the worksheet's first table, while

https://github.com/nfdi4plants/ArcGraphModel/blob/74c57e61aa5cd798cf0702f9daab053b14af32b2/src/ArcGraphModel.IO/ISA/Worksheet.fs#L39-L52

parses the columns of the whole worksheet (ignoring tables), even though the name is `parseTableColumns`.

In short, the names are swapped.

Improve parsers to cover more CvTerms

The changes seem fine for getting just a CvParam list. But don't forget to extend the list of KeyMappers if you want to transform all fields to CvParams. I didn't finish it up completely.
https://github.com/nfdi4plants/ArcGraphModel/blob/74c57e61aa5cd798cf0702f9daab053b14af32b2/src/ArcGraphModel.IO/ISA/KeyParser.fs#L38

Originally posted by @HLWeil in #17 (comment)

Edit by kMutagene

Things that need to happen for this (for structural terms in metadata sheet sections):

  • Port obo parser to separate library, reference it (done, see https://github.com/CSBiology/FsOboParser)
  • Create structural ontologies for
    • Investigation metadata sheets
    • Study metadata sheets
    • Assay metadata sheets
  • Create CvParams from all terms in a structural ontology
  • Add these terms to the matching logic that creates the flat token lists for metadata sheets
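The "create CvParams from all terms" step could be sketched like this (toy types; the real terms would come from the FsOboParser output and feed the library's token types):

```fsharp
// Toy sketch: turn each structural-ontology term into an (accession, name)
// token pair that the metadata-sheet matching logic can consume.
type OboTerm = { Id: string; Name: string }

let toStructuralTokens (terms: OboTerm list) =
    terms |> List.map (fun t -> t.Id, t.Name)
```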

Investigation Ontology: STUDY misses some `follows` relationships

[Term]
name: STUDY
id: INVMSO:00000033
def: ""
relationship: part_of INVMSO:00000001 ! Investigation Metadata
relationship: follows INVMSO:00000032 ! Investigation Person Roles Term Source REF

As seen in the attached screenshot (omitted here), STUDY can also follow Study Person Roles Term Source REF (for every Study added to an Investigation after the first one). This is currently not mapped in the Investigation Metadata ontology.

Image's source file: https://git.nfdi4plants.org/muehlhaus/ArcPrototype/-/blob/main/isa.investigation.xlsx


STUDY also misses a `follows` relationship to Investigation Person ORCID (screenshot omitted; same source as above).

Move ControlledVocabulary to own repo

ControlledVocabulary is not only needed in ARCTokenization but also in other projects that may rely on CV types (like, possibly, OBO.NET) or that could integrate the CV model in any other way.
Therefore, ControlledVocabulary should imo be in its own repo to circumvent circular dependencies with other repos on the one hand and to thin out the bloated ARCTokenization repo on the other.

What's your opinion on this, @kMutagene?

[BUG] Annotation table parsing can result in CvParam lists of incorrect length when parsing incorrect building blocks

See the code here:

Every time this loop parses incorrect building blocks (e.g., when `rest` still contains something even though a correct set of building blocks was parsed before), it yields the next parsing results into the same list.

This is problematic for the following reason: this function is used on pre-grouped lists of headers, and is expected to return a group of columns of an annotation table as a list of CvParams.

Example:

Parsing

Parameter[Time] Unit Term Source REF (PATO:0000165) Term Accession Number (PATO:0000165)
5 hour some TSR some TAN
2 hour some TSR some TAN

returns a list of 2 CvParams, correctly representing the 2 rows of this building block. However, when there is some wrong value that is not caught by grouping, e.g.

Parameter[Time] Unit Term Source REF (PATO:0000165) Term Accession Number (PATO:0000165) Hello XD
5 hour some TSR some TAN uhm
2 hour some TSR some TAN yea

the loop will return a list of 4 CvParams: the 2 correct ones, plus 2 from the incorrect column matching this case.

I know that we want a least-assumption approach, but we can be absolutely certain that the input table has the same number of cells per row. This bug would break transposition of the 2D CvParam list that represents the table columns in some cases, because the inner collections would not all have the same dimensions, which in turn prevents accessing the rows of the table, which represent the process graphs. We need to reflect this knowledge in the output somehow, e.g. by not using `yield!` here:

yield! cvPars
yield! loop false rest

and aggregating the results in the calling function instead.
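The aggregation idea can be sketched with toy types (the real code works on CvParams inside a recursive loop; here the point is only the shape of the result):

```fsharp
// Toy sketch: return one inner list per parsed building block instead of
// flattening with `yield!`, so the 2D shape of the table survives and the
// caller can validate dimensions before transposing.
let rec parseBlocks (blocks: string list list) : string list list =
    match blocks with
    | [] -> []
    | block :: rest ->
        block :: parseBlocks rest   // caller decides whether/how to flatten
```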

`Study.parseMetadataSheetfromFile` does not parse metadata sheet: says worksheet is not present

System.Exception: No worksheet named 'Study' or 'isa_study' found in the workbook
   at ARCTokenization.Workbook.getStudyMetadataSheet(Boolean useLastSheetOnIncorrectName, FsWorkbook study)
   at ARCTokenization.Study.parseMetadataRowsFromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ARCTokenization.Study.parseMetadataSheetfromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ArcValidation.ArcGraph.fromXlsxFile(Dictionary`2 onto, FSharpFunc`2 xlsxParsing, String xlsxPath) in C:\Repos\nfdi4plants\arc-validate\src\ARCValidation\ArcGraph.fs:line 229
   at <StartupCode$FSI_0010>.$FSI_0010.main@() in C:\Repos\nfdi4plants\arc-validate\prototype.fsx:line 194
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Stopped due to error

On the attached file, Study.parseMetadataSheetfromFile does not work, although a worksheet with that name is present.

isa.study.xlsx

(File taken from https://git.nfdi4plants.org/muehlhaus/ArcPrototype)

Unravel tokenization

Currently, a lot of the tokenization already packs information into the CvAttribute collection, among other things. There are also some bugs in parsing keys associated with this.
We need to disentangle this structural muddle and add more testing to make this mess more stable than it is now.

Add CodeGenerator for structural ontologies

Currently, the F# code for structural ontologies is written by hand from the respective OBO/YAML files. We need to automate this to make updating the structural ontology libraries easier.

[Discussion] Is it really necessary to nest static ontology terms?

currently, creating an arc-validate test with the nested static ontology structure looks like this:

ARCExpect.test (TestID.Name INVMSO.``Investigation Metadata``.INVESTIGATION.``Investigation Title``.Name) {
    cvParams
    |> ARCExpect.ByTerm.contains INVMSO.``Investigation Metadata``.INVESTIGATION.``Investigation Title``
}

This is a LOT of noise. INVMSO has < 100 terms. I think it would be both easier AND more discoverable to create a flat class, so that it would just be

ARCExpect.test (TestID.TermName INVMSO.``Investigation Title``) {
    cvParams
    |> ARCExpect.ByTerm.contains INVMSO.``Investigation Title``
}

This way, you would have all possible terms available once you write `INVMSO.`, without needing to know about any nested structure.
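Flattening could be as simple as re-exposing each nested binding at the top level; a toy sketch (module paths and the accession value are illustrative stand-ins, not the generated code):

```fsharp
// Toy sketch of flattening a nested static ontology module.
module NestedOntology =
    module ``Investigation Metadata`` =
        module INVESTIGATION =
            let ``Investigation Title`` = "INVMSO:00000000" // illustrative accession

// Flat companion: every term reachable directly under one module name.
module FlatOntology =
    let ``Investigation Title`` =
        NestedOntology.``Investigation Metadata``.INVESTIGATION.``Investigation Title``
```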

Check how missing values are handled

Example: we need some way of grouping the flat token list into column-based objects, e.g. persons (screenshot omitted).

A first idea would be to use cell address information.
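The cell-address idea can be sketched with a toy token type (the real tokens presumably carry their worksheet address as an attribute):

```fsharp
// Toy sketch: group a flat token list into columns by cell address,
// then sort each column by row so missing values show up as gaps.
type AddressedToken = { Column: int; Row: int; Value: string }

let toColumns (tokens: AddressedToken list) =
    tokens
    |> List.groupBy (fun t -> t.Column)
    |> List.sortBy fst
    |> List.map (fun (_, col) -> col |> List.sortBy (fun t -> t.Row))
```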

`Assay.parseMetadataSheetfromFile` does not parse metadata sheet: says worksheet is not present

System.Exception: No worksheet named 'Assay' or 'isa_assay' found in the workbook
   at ARCTokenization.Workbook.getAssayMetadataSheet(Boolean useLastSheetOnIncorrectName, FsWorkbook assay)
   at ARCTokenization.Assay.parseMetadataRowsFromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ARCTokenization.Assay.parseMetadataSheetFromFile(String path, FSharpOption`1 UseLastSheetOnIncorrectName)
   at ArcValidation.ArcGraph.fromXlsxFile(Dictionary`2 onto, FSharpFunc`2 xlsxParsing, String xlsxPath) in C:\Repos\nfdi4plants\arc-validate\src\ARCValidation\ArcGraph.fs:line 229
   at <StartupCode$FSI_0011>.$FSI_0011.main@() in C:\Repos\nfdi4plants\arc-validate\prototype.fsx:line 198
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Stopped due to error

On the attached file, Assay.parseMetadataSheetfromFile does not work, although a worksheet with that name is present.

isa.assay.xlsx

(File taken from https://git.nfdi4plants.org/muehlhaus/ArcPrototype)
