knowledge-graph-hub / kg-obo

A package to transform all OBO ontologies into KGX TSV format and OBO json, and put the transformed graph in KGhub

Home Page: https://knowledge-graph-hub.github.io/kg-obo/getting_started.html

License: GNU General Public License v3.0

Python 99.14% Makefile 0.37% Batchfile 0.47% Shell 0.02%
obo obofoundry knowledge-graph ontology-infrastructure biolink monarchinitiative

kg-obo's People

Contributors

caufieldjh, justaddcoffee

Forkers

smartniz seandavi

kg-obo's Issues

Upload tar-compressed nodes and edges

Describe the desired behavior

Currently, node and edge files are uploaded without compression. Compressing to .tar.gz format before uploading would save space and bandwidth.
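
A minimal sketch of the compression step, using only the Python standard library. The function name `compress_outputs` and the flat archive layout are assumptions for illustration, not the project's actual code.

```python
import tarfile
from pathlib import Path


def compress_outputs(paths, archive_path):
    """Bundle the given files into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            # arcname drops leading directories so the archive is flat
            tar.add(path, arcname=Path(path).name)
    return archive_path
```

Something like `compress_outputs(["bfo_nodes.tsv", "bfo_edges.tsv"], "bfo.tar.gz")` would then produce a single file to upload in place of the two TSVs.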

A few lingering transform errors

Describe the bug

Looks like there are 165 transformed ontologies (which is great!), but there are still a few lingering transforms with problems:

To Reproduce

See links above

Expected behavior

Ontology transforms should either have links to version/[ontology name].tar.gz or not appear in index.html

Version

11133da

Additional context

These are sort of long tail of problematic ontologies, but they aren't needed for our main use case, kg-idg

Build hangs unexpectedly at various times while running transforms

Describe the bug

As of Jenkins build 65, the build hangs with no further output or errors upon reaching this point:

...
17:43:17  Looking for kg-obo/fao/index.html
17:43:17  Found kg-obo/fao/index.html
17:43:17  Looking for kg-obo/hso/index.html
17:43:17  Found kg-obo/hso/index.html
17:43:17  Looking for kg-obo/cmo/index.html
17:43:17  Found kg-obo/cmo/index.html
17:43:17  Refreshed root index at kg-obo

The corresponding point in transform.py is:

kg-obo/kg_obo/transform.py

Lines 516 to 535 in 0fa1250

# If requested, refresh the root index.html
if force_index_refresh and not s3_test:
    print(f"Refreshing root index on {bucket}, {remote_path}")
    if kg_obo.upload.update_index_files(bucket, remote_path, data_dir, update_root=True):
        kg_obo_logger.info(f"Refreshed root index at {remote_path}")
        print(f"Refreshed root index at {remote_path}")
    else:
        kg_obo_logger.info(f"Failed to refresh root index at {remote_path}")
        print(f"Failed to refresh root index at {remote_path}")
elif force_index_refresh and s3_test:
    print(f"Mock refreshing root index on {bucket}, {remote_path}")
    if kg_obo.upload.mock_update_index_files(bucket, remote_path, data_dir, update_root=True):
        kg_obo_logger.info(f"Mock refreshed root index at {remote_path}")
        print(f"Mock refreshed root index at {remote_path}")
    else:
        kg_obo_logger.info(f"Failed to mock refresh root index at {remote_path}")
        print(f"Failed to mock refresh root index at {remote_path}")
# Get the OBO Foundry list YAML and process each
yaml_onto_list_filtered = retrieve_obofoundry_yaml(skip=skip, get_only=get_only)

To Reproduce

Run the following on Jenkins, on the main branch:

python3.8 run.py --bucket kg-hub-public-data --no_dl_progress --force_index_refresh --robot_path /var/lib/jenkins/workspace/knowledge-graph-hub_kg-obo_main/gitrepo/venv/bin/robot

Expected behavior

The next output should resemble the following:

processing ontologies:   0%|                                                                                                                                                    | 0/1 [00:00<?, ?it/s]
bfo
http://purl.obolibrary.org/obo/bfo.owl

Version

12fbcc7

Additional context

An identical run with the --s3_test option on appears to run as expected.

When version of ontology isn't found in OWL, transform code thinks it has been transformed already

Describe the bug

If version info isn't found in OWL, version is set to "NA", which makes sense. When the code then checks tracking.yaml, if the ontology hasn't been transformed already, it sees NA, and thinks this "version" has already been transformed.

Example is aro:

processing ontologies:   8%|▊         | 16/193 [1:32:59<3:34:37, 72.75s/it] INFO:kg-obo:Loading aro
21:11:06  aro
21:11:06  http://purl.obolibrary.org/obo/aro.owl
21:11:06  
21:11:06  
  0%|          | 0.00/358k [00:00<?, ?B/s]
7.05MB [00:00, 135MB/s]                   
21:11:06  INFO:kg-obo:Current VersionIRI for aro: NA
21:11:07  INFO:kg-obo:Have already transformed aro: NA

To Reproduce

See here

Expected behavior

We should differentiate between the NA that means "couldn't find a version" (which is what get_owl_iri means by NA) and the NA that means "this hasn't been transformed yet" (which is what tracking.yaml means by NA).

Simple fix would be for get_owl_iri to return something other than NA when it can't find a version. Say, "unknown_version", or the ontology name, or a timestamp.

(nb: a timestamp would mean a new version would be created in kg-obo on every pipeline run, which might be okay)
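
One way to keep the two meanings apart might look like the sketch below. The constant names and both helper functions are hypothetical; only the "NA" value used by tracking.yaml and the suggested "unknown_version" placeholder come from this issue.

```python
NOT_TRANSFORMED = "NA"               # meaning used by tracking.yaml
UNKNOWN_VERSION = "unknown_version"  # placeholder suggested above (assumption)


def resolve_version(parsed_version):
    """Return the parsed version, or a placeholder when none was found."""
    return parsed_version if parsed_version else UNKNOWN_VERSION


def already_transformed(tracked_version, current_version):
    """True only if tracking.yaml records this exact version."""
    if tracked_version == NOT_TRANSFORMED:
        return False
    return tracked_version == current_version
```

With a fixed placeholder, an ontology without version info is transformed once and then skipped on later runs; a timestamp placeholder would instead produce a new version on each run, as noted above.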

Version

3a24fdb

Transforms for which KGX reports errors and warnings are still uploaded to s3

Describe the bug

Transforms for which KGX reports errors and warnings are still uploaded to s3 - for example DOID

~ $ s3cmd ls s3://kg-hub-public-data/kg-obo/doid/2021-08-17/
2021-09-14 21:19      1494347  s3://kg-hub-public-data/kg-obo/doid/2021-08-17/doid.tar.gz

To Reproduce

Expected behavior

Should not upload transformed ontologies when there is an error or warning

Version

5652f33

NewConnectionError produces unhandled exception

In the most recent build:

21:53:53  fideo
21:53:53  http://purl.obolibrary.org/obo/fideo.owl
21:53:53  HTTPSConnectionPool(host='gitub.u-bordeaux.fr', port=443): Max retries exceeded with url: /erias/fideo/-/raw/master/fideo.owl (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7efd52f76280>: Failed to establish a new connection: [Errno -2] Name or service not known'))
21:53:53  Removing lock due to error...
21:53:53  deleting lock file s3_bucket:kg-hub-public-data, s3_path:kg-obo/lock
21:53:53  Lock removed.

This causes the build to fail.
Failing to access an OBO URL is certainly worth logging but should be treated like transform errors.
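
A rough sketch of the intended handling follows. It uses the standard library's urllib rather than whatever HTTP client kg-obo actually uses, and the function name `download_or_skip` and the `failed` list are assumptions for illustration.

```python
import urllib.error
import urllib.request


def download_or_skip(name, url, failed, opener=urllib.request.urlopen, timeout=30):
    """Return the response body, or None if the URL cannot be reached."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, DNS failures and socket timeouts all derive from OSError
        print(f"Could not retrieve {name} from {url}: {err}")
        failed.append(name)
        return None
```

The caller can then move on to the next ontology and report the contents of `failed` at the end of the run, the same way transform errors are reported.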

Option to disable progress bars

Describe the desired behavior

TQDM output progress bars make Jenkins console output hard to read. Provide a click option to disable/enable them.

Some ontologies aren't being transformed fully because some OWL files contain imports to other OWL files

Describe the desired behavior

Some OWL files contain imports to other OWL files, and KGX does not seem to follow these imports. For example, here is the OWL representation of Upheno:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
    <!ENTITY obo "http://purl.obolibrary.org/obo/" >
    <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" >
    <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#" >
    <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
    <!ENTITY oboInOwl "http://www.geneontology.org/formats/oboInOwl#" >
]>


<rdf:RDF xmlns="&obo;x-bfo.owl#"
     xml:base="&obo;x-bfo.owl"
     xmlns:obo="http://purl.obolibrary.org/obo/"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
    <owl:Ontology rdf:about="&obo;upheno.owl">
        <owl:imports rdf:resource="&obo;upheno/metazoa.owl"/>
    </owl:Ontology>
</rdf:RDF>

Note this block:

    <owl:Ontology rdf:about="&obo;upheno.owl">
        <owl:imports rdf:resource="&obo;upheno/metazoa.owl"/>
    </owl:Ontology>

which points to upheno/metazoa.owl, where all the good stuff is.

Because of this, kg-obo currently transforms upheno to this JSON, which is not terribly useful:

{
    "nodes": [
        {
            "id": "OBO:upheno.owl",
            "type": "owl:Ontology",
            "category": [
                "biolink:NamedThing"
            ],
            "provided_by": [
                "uphenolm7m33re"
            ]
        },
        {
            "id": "OBO:upheno/metazoa.owl",
            "category": [
                "biolink:NamedThing"
            ],
            "provided_by": [
                "uphenolm7m33re"
            ]
        }
    ],
    "edges": [
        {
            "subject": "OBO:upheno.owl",
            "predicate": "owl:imports",
            "object": "OBO:upheno/metazoa.owl",
            "relation": "owl:imports",
            "knowledge_source": [
                "uphenolm7m33re"
            ]
        }
    ]
}

Additional context

I don't think support for this is critical for our immediate use case that is driving development, i.e. kg-idg.

For now, we should possibly look for imports like this in the XML and abandon the transform with an error if they are present.

Eventually, we will want to parse the XML, find these imports, download these OWL files, and feed these to KGX in addition to the "main" OWL file.
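
As a starting point for either approach, the imports could be collected from the ontology header roughly like this. It is a sketch using the standard library's ElementTree; the function name `find_owl_imports` is hypothetical, and downloading the imported files and passing them to KGX is left out.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def find_owl_imports(owl_text):
    """Return the IRIs named by owl:imports inside any owl:Ontology element."""
    root = ET.fromstring(owl_text)
    imports = []
    for ontology in root.iter(f"{{{OWL}}}Ontology"):
        for imported in ontology.findall(f"{{{OWL}}}imports"):
            resource = imported.get(f"{{{RDF}}}resource")
            if resource:
                imports.append(resource)
    return imports
```

An empty list means the file can be transformed as-is; a non-empty list is the signal to either abandon the transform with an error or fetch the listed files.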

@cmungall @matentzn @caufieldjh

FAO version not parsed as expected

Describe the bug

The OBO fao has a slightly different IRI structure from most, causing its version to be parsed incorrectly.

To Reproduce

$ python3 run.py --s3_test --bucket kg-obo --get_only fao
Output will be in 'data/fao/fao/' rather than 'data/fao/2020-05-07/'.

In the tracking.yaml, the relevant lines are:

  fao:
    current_iri: http://purl.obolibrary.org/obo/fao/releases/2020-05-07/fao/fao-base.owl
    current_version: fao

Compare to bfo:

  bfo:
    current_iri: http://purl.obolibrary.org/obo/bfo/2019-08-26/bfo.owl
    current_version: '2019-08-26'

Expected behavior

The version for fao should be '2020-05-07'.
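
One possible fix is to search the IRI for a date rather than relying on a fixed path position. The sketch below assumes versions are ISO-style dates (YYYY-MM-DD), which holds for the fao and bfo IRIs shown above but may not hold for every OBO; the function name is hypothetical.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def version_from_iri(iri):
    """Return the first YYYY-MM-DD date found in the IRI, or None."""
    match = DATE_PATTERN.search(iri)
    return match.group(0) if match else None
```

This returns '2020-05-07' for the fao IRI and '2019-08-26' for the bfo IRI, and None when the IRI carries no date, so the caller can fall back to its existing logic.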

Rename .tar.gz files to be more descriptive

Describe the desired behavior

Per convo with Chris, we probably should rename .tar.gz files to better explain what exactly they contain - e.g. _kgx_tsv.tar.gz. Alternatively, and maybe less painfully, we could add this information to the README somewhere.

Additional context

See here for example

lock file should be checked for on target s3 bucket, not locally

Describe the bug

Right now we are checking for a lock file before we proceed with a kg-obo run, see here. This prevents two runs of kg-obo at the same time on the same machine, but two separate runs can and will still proceed on separate machines.

This will be a problem: in Jenkins multiple runs will be started when several PRs are merged in quick succession, and these will all run at the same time and possibly upload at the same time to the same S3 bucket, which will be a mess.

To Reproduce

python run.py --get_only bfo # for example

Expected behavior

I'd suggest that we instead look for a file named lock on the target S3 bucket, and if it exists, die. If it doesn't exist, create it on the S3 bucket and proceed with transform, then delete it after the run is finished. (We'll need to invalidate the cache too so it really doesn't show up when the next transform starts.)

Version

46a8768

Write index.html to provide file navigation

Upon upload to S3 bucket, files should be navigable using simple index.html.
Will need index for all files and for individual OBOs.
Will also need to update upon new uploads (i.e., don't update if no changes).

Data validation two ways

Describe the desired behavior

Would like to verify transforms in at least the following two manners:

  • Check for changes above a certain threshold, e.g., too many changes to new version
  • Do unit tests to verify certain axiom-dependent patterns are being represented as expected (but will require pre-processing to relax first)
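
The first check could be as simple as comparing node or edge counts between versions. This is only a sketch: the function name and the 50% default threshold are assumptions, and a real check would probably look at more than raw counts.

```python
def exceeds_change_threshold(old_count, new_count, max_fraction=0.5):
    """True if the count changed by more than max_fraction of the old count."""
    if old_count == 0:
        # Any growth from an empty graph counts as a large change
        return new_count > 0
    return abs(new_count - old_count) / old_count > max_fraction
```

A transform flagged this way could be held back from upload and logged for manual review.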

index.html includes href's to ontologies that didn't transform correctly and aren't uploaded to s3

Describe the bug

The index.html at https://kg-hub.berkeleybop.io/kg-obo/index.html contains many (possibly 30, see below) bad links, for example cvdo. I think these are caused by upload_index_html_files including links to ontologies that didn't transform correctly.

To Reproduce

I count 163 transformed ontologies on s3:
~ $ s3cmd ls s3://kg-hub-public-data/kg-obo/ | grep -v index.html | grep -v tracking.yaml | grep -c ''
163
~ $

But 193 transformed ontologies in the kg-obo/index.html file:

~ $ s3cmd get --force s3://kg-hub-public-data/kg-obo/index.html
~ $ grep href index.html | grep -v '../' | grep -v tracking.yaml | grep -c ''
193

Expected behavior

Ontologies that don't transform correctly shouldn't be in the root index.html

Version

Jenkins build 42, commit df32ea0

Validate links during Jenkins build

One of the primary functions of this project is to provide stable URLs to transformed ontologies.
Consequently, we should keep track of all new URLs we create, and do the following:

  • Append new URLs to a specific tracking file, independent of the tracking.yaml used to list OBO versions
  • Check URL validity and resolvability during Jenkins build
  • Create redirects?

Build fails on "unhashable type: 'dict'"

Describe the bug

On build 46, the run fails immediately after transforming go with the following output:

21:20:19  INFO:kg-obo:Encountered errors in transforming or parsing to json: {'BNode Errors': 528, 'Other Errors': 528}
21:20:19  INFO:kg-obo:go_kgx_tsv.tar.gz 5466794 bytes
21:20:19  INFO:kg-obo:go_kgx.json 65302825 bytes
21:20:19  INFO:kg-obo:json_transform.log 128450 bytes
21:20:19  INFO:kg-obo:tsv_transform.log 256900 bytes
21:20:19  INFO:kg-obo:Successfully completed transform of go
21:20:19  
processing ontologies:   2%|▏         | 3/193 [13:08<13:51:47, 262.67s/it]
21:20:19  unhashable type: 'dict'
21:20:19  Removing lock due to error...
21:20:19  deleting lock file s3_bucket:kg-hub-public-data, s3_path:kg-obo/lock
21:20:19  Lock removed.

To Reproduce

As part of Jenkins build:
python3.8 run.py --bucket kg-hub-public-data --no_dl_progress

Expected behavior

Build should continue.

Suspect this is due to replacing the previous OBO version, entitled "release", with the new one, entitled "2021-09-01".

Version

9b97aff

Parse the owl:Ontology rdf:about attribute to retrieve IRI

Describe the bug

Some IRIs are defined within a tag attribute and aren't parsed by the IRI retrieval function.
e.g., chmo:

    <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/chmo.owl">
        <dc:description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CHMO, the chemical methods ontology, describes methods used to collect data in chemical experiments, such as mass spectrometry and electron microscopy prepare and separate material for further analysis, such as sample ionisation, chromatography, and electrophoresis synthesise materials, such as epitaxy and continuous vapour deposition It also describes the instruments used in these experiments, such as mass spectrometers and chromatography columns. It is intended to be complementary to the Ontology for Biomedical Investigations (OBI).</dc:description>
        <dc:title rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Chemical Methods Ontology</dc:title>
        <terms:license rdf:resource="http://creativecommons.org/licenses/by/4.0/"/>
        <oboInOwl:hasOBOFormatVersion rdf:datatype="http://www.w3.org/2001/XMLSchema#string">1.2</oboInOwl:hasOBOFormatVersion>
        <oboInOwl:saved-by rdf:datatype="http://www.w3.org/2001/XMLSchema#string">batchelorc</oboInOwl:saved-by>
    </owl:Ontology>

http://purl.obolibrary.org/obo/chmo.owl is a valid IRI and can be used in the tracking file, even without a version number being provided.

Right now, this attribute is not parsed, so the IRI becomes the default of "release".

To Reproduce

Run the following:
python3 run.py --get_only chmo --s3_test --bucket test-bucket --save_local

Note that output will include:

Version IRI not found.                                                                                                                                                             
Release date not found.
Current VersionIRI for chmo: release

Expected behavior

IRIs without version numbers are still valid IRIs, and should be included in the tracking.yaml.
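
Reading the attribute could look something like the following sketch, which uses the standard library's ElementTree. The function name `ontology_iri` is hypothetical and this is not the project's existing IRI retrieval function.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def ontology_iri(owl_text):
    """Return the rdf:about value of the top-level owl:Ontology, or None."""
    root = ET.fromstring(owl_text)
    ontology = root.find(f"{{{OWL}}}Ontology")
    if ontology is None:
        return None
    return ontology.get(f"{{{RDF}}}about")
```

For the chmo header above this would yield http://purl.obolibrary.org/obo/chmo.owl, which could be used as a fallback when no version IRI is present.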

Version

a365645

Parsing micro fails due to incorrect version

Describe the bug

Parsing micro fails with a FileNotFoundError, apparently due to having a very complicated VersionInfo value that can't be turned into a valid filename.

This wasn't encountered until now because micro requires imports, so it was previously ignored.

To Reproduce

See Jenkins build 62:

13:22:03  micro
13:22:03  http://purl.obolibrary.org/obo/micro.owl
13:22:03  Current VersionIRI for micro: &obo;MicrO.owl
13:22:03  Current version for micro: MicrO%20%28An%20Ontology%20of%20Prokaryotic%20Phenotypic%20and%20Metabolic%20Characters%29.%20%20Version%201.5.1%20released%206/14/2018.%0A%0AIncludes%20terms%20and%20term%20synonyms%20extracted%20from%20%26gt%3B%203000%20prokaryotic%20taxonomic%20descriptions%2C%20collected%20from%20a%20large%20number%20of%20taxonomic%20descriptions%20from%20Archaea%2C%20Cyanobacteria%2C%20Bacteroidetes%2C%20Firmicutes%20and%20Mollicutes.%0A%0AThe%20ontology%20and%20the%20synonym%20lists%20were%20developed%20to%20facilitate%20the%20automated%20extraction%20of%20phenotypic%20data%20and%20character%20states%20from%20prokaryotic%20taxonomic%20descriptions%20using%20a%20natural%20language%20processing%20algorithm%20%28MicroPIE%29.%20%20MicroPIE%20was%20developed%20by%20Hong%20Cui%2C%20Elvis%20Hsin-Hui%20Wu%2C%20and%20Jin%20Mao%20%28University%20of%20Arizona%29%20in%20collaboration%20with%20Carrine%20E.%20Blank%20%28University%20of%20Montana%29%20and%20Lisa%20R.%20Moore%20%28University%20of%20Southern%20Maine%29.%0A%0ADescriptions%20and%20links%20to%20MicroPIE%20can%20be%20found%20at%20http%3A//avatol.org/ngp/nlp/overview-2/.%0Ahttps%3A//github.com/biosemantics/micropie2%0A%0AThe%20most%20current%20version%20of%20MicrO%20can%20be%20downloaded%20from%20https%3A//github.com/carrineblank/MicrO.
13:22:03  Header for micro requests these imports: https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/BFO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/BSPO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/CHMO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/CL_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/ChEBI_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/DRON_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/ENVO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/FMA_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/GO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/IAO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/IDO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/NCBITax_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/NDF-RT_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/OBI_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/PATO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/PO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/PR_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/REO_imports.owl, https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/RO_imports.owl, 
https://raw.githubusercontent.com/carrineblank/MicrO/master/MicrOandImportModules/Uberon_imports.owl
13:22:03  Encountered unresolvable error: <class 'FileNotFoundError'> - [Errno 2] No such file or directory: 'data/micro/MicrO%20%28An%20Ontology%20of%20Prokaryotic%20Phenotypic%20and%20Metabolic%20Characters%29.%20%20Version%201.5.1%20released%206/14/2018.%0A%0AIncludes%20terms%20and%20term%20synonyms%20extracted%20from%20%26gt%3B%203000%20prokaryotic%20taxonomic%20descriptions%2C%20collected%20from%20a%20large%20number%20of%20taxonomic%20descriptions%20from%20Archaea%2C%20Cyanobacteria%2C%20Bacteroidetes%2C%20Firmicutes%20and%20Mollicutes.%0A%0AThe%20ontology%20and%20the%20synonym%20lists%20were%20developed%20to%20facilitate%20the%20automated%20extraction%20of%20phenotypic%20data%20and%20character%20states%20from%20prokaryotic%20taxonomic%20descriptions%20using%20a%20natural%20language%20processing%20algorithm%20%28MicroPIE%29.%20%20MicroPIE%20was%20developed%20by%20Hong%20Cui%2C%20Elvis%20Hsin-Hui%20Wu%2C%20and%20Jin%20Mao%20%28University%20of%20Arizona%29%20in%20collaboration%20with%20Carrine%20E.%20Blank%20%28University%20of%20Montana%29%20and%20Lisa%20R.%20Moore%20%28University%20of%20Southern%20Maine%29.%0A%0ADescriptions%20and%20links%20to%20MicroPIE%20can%20be%20found%20at%20http%3A//avatol.org/ngp/nlp/overview-2/.%0Ahttps%3A//github.com/biosemantics/micropie2%0A%0AThe%20most%20current%20version%20of%20MicrO%20can%20be%20downloaded%20from%20https%3A//github.com/carrineblank/MicrO.' ((2, 'No such file or directory'))

In the OWL:

<owl:versionInfo>MicrO (An Ontology of Prokaryotic Phenotypic and Metabolic Characters).  Version 1.5.1 released 6/14/2018.

Includes terms and term synonyms extracted from &gt; 3000 prokaryotic taxonomic descriptions, collected from a large number of taxonomic descriptions from Archaea, Cyanobacteria, Bacteroidetes, Firmicutes and Mollicutes.

The ontology and the synonym lists were developed to facilitate the automated extraction of phenotypic data and character states from prokaryotic taxonomic descriptions using a natural language processing algorithm (MicroPIE).  MicroPIE was developed by Hong Cui, Elvis Hsin-Hui Wu, and Jin Mao (University of Arizona) in collaboration with Carrine E. Blank (University of Montana) and Lisa R. Moore (University of Southern Maine).

Descriptions and links to MicroPIE can be found at http://avatol.org/ngp/nlp/overview-2/.
https://github.com/biosemantics/micropie2

The most current version of MicrO can be downloaded from https://github.com/carrineblank/MicrO.</owl:versionInfo>

Expected behavior

VersionInfo of this form could be naïvely truncated or regex'd for the word "version".
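
Both ideas can be combined in one helper, sketched below. The function name, the regex and the 32-character cutoff are assumptions; note that the truncated fallback may still contain characters that are not safe in a path.

```python
import re

VERSION_PATTERN = re.compile(r"version\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def short_version(version_info, max_length=32):
    """Extract a dotted number after the word 'version', else truncate."""
    match = VERSION_PATTERN.search(version_info)
    if match:
        return match.group(1)
    return version_info.strip()[:max_length]
```

Applied to the MicrO versionInfo above, this returns '1.5.1'.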

Version

59afe77

Root index is not updating

Describe the bug

The root index.html does not appear to be updated to display links to newly uploaded directories and files.

To Reproduce

https://kg-hub.berkeleybop.io/kg-obo/ contains only the following:

<!DOCTYPE html>
<html>
<head><title>Index of data</title></head>
<body>
    <h2>Index of data</h2>
    <hr>
    <ul>
        <li>
            <a href='../'>../</a>
        </li>
		<li>
			<a href=bfo>bfo</a>
		</li>
		<li>
			<a href=tracking.yaml>tracking.yaml</a>
		</li>

    </ul>
</body>
</html>

Expected behavior

The root index.html should include all of the most recently uploaded directories (at least, those beyond bfo).

kg-obo doing transforms of existing ontologies

Describe the bug

Running on Jenkins, kg-obo should skip BFO, since it already exists:

(venv) ~/PycharmProjects/kg-obo fix_jenkins_run_jenkins $ s3cmd ls s3://kg-hub-public-data/kg-obo/*
                          DIR  s3://kg-hub-public-data/kg-obo/bfo/
2021-09-03 22:28          301  s3://kg-hub-public-data/kg-obo/index.html
2021-09-03 22:28        10100  s3://kg-hub-public-data/kg-obo/tracking.yaml
(venv) ~/PycharmProjects/kg-obo fix_jenkins_run_jenkins $ 

But it actually does the BFO transform and sees that it exists and doesn't upload it:

14:30:35  BRANCH_NAME=fix_jenkins_run_jenkins
14:30:35  + python3.8 run.py --get_only bfo --bucket kg-hub-public-data
14:30:40  
processing ontologies:   0%|          | 0/1 [00:00<?, ?it/s]bfo
14:30:40  http://purl.obolibrary.org/obo/bfo.owl
14:30:40  
14:30:40  
  0%|          | 0.00/20.4k [00:00<?, ?B/s]
154kB [00:00, 28.2MB/s]                    Current VersionIRI for bfo: http://purl.obolibrary.org/obo/bfo/2019-08-26/bfo.owl
14:30:40  [KGX][cli_utils.py][    transform_source] INFO: Processing source 'bfol7pepmv9'
14:30:46  
14:30:46  WARNING:ToolkitGenerator:Range of slot 'in taxon' (organism taxon) does not line with the domain of its inverse (taxon of)
14:30:46  WARNING:ToolkitGenerator:Range of slot 'taxon of' (named thing) does not line with the domain of its inverse (in taxon)
14:30:46  [KGX][owl_source.py][               parse] INFO: Parsing /tmp/bfol7pepmv9 with 'xml' format
14:30:46  [KGX][owl_source.py][               parse] INFO: /tmp/bfol7pepmv9 parsed with 1221 triples
14:30:46  [KGX][owl_source.py][               parse] INFO: Done parsing /tmp/bfol7pepmv9
14:30:47  INFO:kg-obo:bfo_edges.tsv 8519 bytes
14:30:47  INFO:kg-obo:bfo_nodes.tsv 55465 bytes
14:30:47  INFO:kg-obo:Successfully completed transform of bfo
14:30:47  INFO:kg-obo:Uploading...
14:30:48  WARNING:root:Existing file {s3_path} found on S3! Skipping.
14:30:48  WARNING:root:Existing file {s3_path} found on S3! Skipping.
14:30:48  WARNING:root:Existing file {s3_path} found on S3! Skipping.
14:30:48  WARNING:root:Existing file {s3_path} found on S3! Skipping.
14:30:48  WARNING:root:Existing file {s3_path} found on S3! Skipping.
14:30:48  WARNING:root:Existing file {s3_path} found on S3! Skipping.
14:30:48  
processing ontologies: 100%|██████████| 1/1 [00:09<00:00,  9.72s/it]
processing ontologies: 100%|██████████| 1/1 [00:09<00:00,  9.72s/it]
14:30:48  INFO:kg-obo:Successfully transformed 1: ['bfo']
14:30:48  Searching kg-obo/tracking.yaml in kg-hub-public-data
14:30:48  Searching kg-obo/index.html in kg-hub-public-data
14:30:48  Searching kg-obo/bfo/index.html in kg-hub-public-data
14:30:48  Searching kg-obo/bfo/2019-08-26/bfo_edges.tsv in kg-hub-public-data
14:30:48  Searching kg-obo/bfo/2019-08-26/index.html in kg-hub-public-data
14:30:48  Searching kg-obo/bfo/2019-08-26/bfo_nodes.tsv in kg-hub-public-data

To Reproduce

See above

Expected behavior

Should skip BFO

Version

74ee58c

Pre-process OBOs to avoid axiom conflicts

Would like to retain hierarchies within OBOs when present (i.e., going beyond base versions, or using "maximal" versions when available) but want to avoid axiom conflicts.

Describe the desired behavior

Final TSV node and edge lists should contain reasoned relationships inherited from imported OBOs but should avoid reliance upon axioms likely to conflict upon graph assembly.

ROBOT can do this when used as a pre-processing step:
See https://github.com/INCATools/ubergraph/blob/0bcc3864d5bb90b02029ef59147351e190188d11/Makefile#L19-L25

But this may not handle everything?

Additional context

Phenotype ontologies (e.g., upheno, hpo) may require specific concerns re: reasoning.

See also INCATools/ontology-development-kit#454

Ignore axioms without semantic value when transforming

Describe the desired behavior

Some OBOs, e.g. foodon, lead kgx to encounter blank nodes. This produces errors like the following:

00:52:41  [KGX][owl_source.py][          load_graph] WARNING: http://purl.obolibrary.org/obo/FOODON_00000000 http://www.w3.org/2000/01/rdf-schema#subClassOf N5c00c78cc5f44cae8b48e71a8fb410db has OWL.onProperty http://purl.obolibrary.org/obo/RO_0002180 and OWL.someValuesFrom None
00:52:41  [KGX][owl_source.py][          load_graph] WARNING: Do not know how to handle BNode: N5c00c78cc5f44cae8b48e71a8fb410db

In this case, a similar error appears ~15 times.

These are the result of axioms without semantic contributions (i.e., they're just part of a data model). We can ignore them when transforming to graphs.

Some evaluation of other OBOs may be necessary before we decide to ignore all errors of this type due to the potential for hierarchy-related strangeness (like in #21), but it is likely safe to skip them in general.

Additional context

See FoodOntology/foodon#159 for more details.

Broken links to OBO versions made into safe HTML

Describe the bug

OBO download links containing version names transcoded from invalid strings (e.g., "04%3A07%3A2013%2020%3A21") do not resolve to valid file locations on the S3 bucket.

To Reproduce

Attempt to download transforms from any of the following:
https://kg-hub.berkeleybop.io/kg-obo/pr/
https://kg-hub.berkeleybop.io/kg-obo/ehdaa2/ (Note that one version is a valid link while the other is a 404)

Expected behavior

Links should resolve to the intended file.
Version names should follow S3 key naming guidelines, as per https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
(In short, don't use % to represent special characters, but underscores are fine)
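
A sketch of one possible clean-up: decode any percent-escapes, then replace everything outside a conservative character set with underscores. The function name and the exact allowed set (letters, digits, dot, underscore, hyphen) are assumptions; the S3 guidelines permit a few more characters than this.

```python
import re
from urllib.parse import unquote

UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def safe_version_name(version):
    """Turn a raw or percent-encoded version string into an S3-friendly name."""
    decoded = unquote(version)
    return UNSAFE.sub("_", decoded).strip("_")
```

The example string from this issue, "04%3A07%3A2013%2020%3A21", becomes "04_07_2013_20_21", while a version such as "2019-08-26" passes through unchanged.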

ROBOT relax

Describe the desired behavior

This has essentially been mentioned in #21 and #29 - to reduce some OBOs axioms to a more consistent relation type, we will need to run ROBOT relax as something like the following:
robot relax --input cteno.owl --output cteno_relaxed.owl

We can run ROBOT through Py4j, doing the actual relax right around here:

This should render most OBOs more graph-compatible in general.

Lock file not removed when run fails

Describe the bug

When a run fails due to a runtime error, the lock file persists. See this run, which failed due to a runtime error, and this subsequent run, which fails because the lock file is still there.

To Reproduce

See above

Expected behavior

We should maybe catch runtime errors in run_transform (that are NOT due to the presence of a lock file), and remove the lock file
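
The pattern could be sketched with a context manager, as below. Here a plain Python set stands in for the lock storage, which is an assumption made to keep the example self-contained; the name `run_lock` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def run_lock(locks, name="kg-obo/lock"):
    """Hold a lock for the duration of a run and always release it."""
    if name in locks:
        # Raised before the try block, so an existing lock is left in place
        raise RuntimeError(f"Lock {name} already exists; another run may be in progress")
    locks.add(name)
    try:
        yield
    finally:
        # Runs on success and on any exception raised inside the with block
        locks.discard(name)
```

Wrapping the transform loop in `with run_lock(locks):` removes the lock after runtime errors, while a run that stops because a lock already exists leaves that lock untouched.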

Version

21e4d9a

Incorporate versioning info

We'd like to be able to know versions of ingests, so we can 1) avoid transforming stuff we've already transformed and 2) support retrieving historical versions of each ontology, not just the most current version

Per Jim Balhoff in this ticket, we can use the version IRI to reliably retrieve version

Consider running Jenkins build more frequently than once a month

Describe the bug

Not a bug per se, but we are only running a kg-obo build on Jenkins every month, so if an ontology changes twice in a month, we probably will miss one of them.

To Reproduce

Cron is specified in Jenkinsfile here

Expected behavior

Could just change the cron to do a weekly build

Version

8554da2

Crash on tracking new fypo version - all new version tracking may fail

Describe the bug

In the most recent build run (not testing):

12:46:24  INFO:kg-obo:Current VersionIRI for fypo: http://purl.obolibrary.org/obo/fypo/releases/2021-09-21/fypo-base.owl
12:46:24  Current VersionIRI for fypo: http://purl.obolibrary.org/obo/fypo/releases/2021-09-21/fypo-base.owl
12:46:24  [KGX][cli_utils.py][    transform_source] INFO: Processing source 'fypogsytx2f3'
12:46:24  [KGX][owl_source.py][               parse] INFO: Parsing /tmp/fypogsytx2f3 with 'xml' format
12:47:32  [KGX][owl_source.py][               parse] INFO: /tmp/fypogsytx2f3 parsed with 413249 triples
12:47:32  [KGX][owl_source.py][               parse] INFO: Done parsing /tmp/fypogsytx2f3
12:47:50  INFO:kg-obo:fypo.tar.gz 807945 bytes
12:47:50  INFO:kg-obo:Successfully completed transform of fypo
12:47:50  
processing ontologies:  32%|███▏      | 62/193 [2:28:37<5:14:02, 143.84s/it]
12:47:50  list indices must be integers or slices, not str
12:47:50  Removing lock due to error...
12:47:50  deleting lock file s3_bucket:kg-hub-public-data, s3_path:kg-obo/lock
12:47:50  Lock removed.

This seems to be the result of an error in the track_obo_version function in transform.py, around line 199:

    #If we already have a version, move it to archive
    if tracking["ontologies"][name]["current_version"] != "NA":
        if "archive" not in tracking["ontologies"][name]:
            tracking["ontologies"][name] = []
        tracking["ontologies"][name]["archive"].append({"iri": iri, "version": version})
    else:
        tracking["ontologies"][name]["current_iri"] = iri
        tracking["ontologies"][name]["current_version"] = version

The version in the most recent tracking.yaml is older than that being retrieved from OBO Foundry.

To Reproduce

python3 run.py --bucket kg-obo --get_only fypo

Expected behavior

The tracking.yaml should be updated to resemble the following:

  fypo:
    current_iri: http://purl.obolibrary.org/obo/fypo/releases/2021-09-21/fypo-base.owl
    current_version: '2021-09-21'
    archive:
    - iri: http://purl.obolibrary.org/obo/fypo/releases/2021-09-16/fypo-base.owl
      version: '2021-09-16'
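
A minimal sketch of an update step that would produce the structure above. It assumes `tracking` is the dict loaded from tracking.yaml; `update_tracking` is a hypothetical name, not the project's `track_obo_version`. The two differences from the snippet quoted earlier are that the archive list is created on the ontology's entry rather than replacing it, and that the previous version (not the new one) is what gets archived.

```python
def update_tracking(tracking, name, iri, version):
    """Record a new version for `name`, archiving the previous one.

    `tracking` is assumed to be the dict loaded from tracking.yaml.
    """
    entry = tracking["ontologies"].setdefault(
        name, {"current_iri": "NA", "current_version": "NA"}
    )
    previous = entry.get("current_version", "NA")
    if previous != "NA" and previous != version:
        # Create the archive list on the entry itself, not in place of it.
        entry.setdefault("archive", []).append(
            {"iri": entry["current_iri"], "version": previous}
        )
    entry["current_iri"] = iri
    entry["current_version"] = version
    return tracking
```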

Version

7dab134

Too many remote files being added to index

Describe the bug

When writing index.html for transformed OBO subdirectories, the code adds the names of files already on the bucket, but it adds all of them rather than only those in the relevant subdirectory.

See https://kg-hub.berkeleybop.io/kg-obo/go/index.html

To Reproduce

python3.8 run.py --bucket kg-hub-public-data --no_dl_progress on Jenkins

Expected behavior

The only filenames to be added to an index are those in the current directory.

Version

3b1f70c

Additional context

Offending line is here:

for key in client.list_objects(Bucket=bucket)['Contents']:

Expected behavior was to get kg-obo data and subdirectories only, but this yields all keys across the entire bucket.
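
One possible fix is to restrict the listing with `Prefix` and `Delimiter`, which the boto3 S3 client's `list_objects_v2` accepts. The sketch below assumes `client` behaves like a boto3 S3 client; the function name is hypothetical, and only the first page of results (up to 1000 keys) is read.

```python
def list_directory_keys(client, bucket, directory):
    """Return keys directly inside `directory`, ignoring the rest of the bucket.

    `client` is expected to behave like a boto3 S3 client. Pagination is
    not handled in this sketch.
    """
    prefix = directory.rstrip("/") + "/"
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,   # restrict the listing to this directory
        Delimiter="/",   # do not descend into subdirectories
    )
    # 'Contents' is absent when nothing matches, so default to an empty list.
    return [obj["Key"] for obj in response.get("Contents", [])]
```

With a delimiter set, S3 reports subdirectories separately under `CommonPrefixes`, so they would need to be read from there if the index should also link to them.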

GO IRI and version not parsed

Describe the bug

The IRI for GO is not being parsed, though it does not appear to have changed content or format within the OWL file itself.
Here's the GO-base header:

<?xml version="1.0"?>
<rdf:RDF xmlns="http://purl.obolibrary.org/obo/go/go-base.owl#"
     xml:base="http://purl.obolibrary.org/obo/go/go-base.owl"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:go="http://purl.obolibrary.org/obo/go#"
     xmlns:obo="http://www.geneontology.org/formats/oboInOwl#http://purl.obolibrary.org/obo/"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:xml="http://www.w3.org/XML/1998/namespace"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:obo1="http://purl.obolibrary.org/obo/"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:terms="http://purl.org/dc/terms/"
     xmlns:terms2="http://www.geneontology.org/formats/oboInOwl#http://purl.org/dc/terms/"
     xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
    <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/go/go-base.owl">
        <owl:versionIRI rdf:resource="http://purl.obolibrary.org/obo/go/releases/2021-09-01/go-base.owl"/>
        <dc:description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The Gene Ontology (GO) provides a framework and set of concepts for describing the functions of gene products from all organisms.</dc:description>
        <dc:title rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Gene Ontology</dc:title>
        <terms:license rdf:resource="http://creativecommons.org/licenses/by/4.0/"/>
        <oboInOwl:default-namespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">gene_ontology</oboInOwl:default-namespace>
        <oboInOwl:hasOBOFormatVersion rdf:datatype="http://www.w3.org/2001/XMLSchema#string">1.2</oboInOwl:hasOBOFormatVersion>
        <owl:versionInfo rdf:datatype="http://www.w3.org/2001/XMLSchema#string">2021-09-01</owl:versionInfo>
    </owl:Ontology>

Right now, this attribute is not parsed, so the IRI becomes the default of "release".

To Reproduce

Run the following:
python3 run.py --get_only go --s3_test --bucket test-bucket --save_local

Note that output will include:

Version IRI not found.
Release date not found.
Current VersionIRI for chmo: release

Expected behavior

GO should have an IRI of 'http://purl.obolibrary.org/obo/go/releases/2021-09-01/go-base.owl' and a version of '2021-09-01', or similar.
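
For reference, a header like the one above can be read with the standard library alone. This is a sketch, not the parser kg-obo uses: `get_owl_version` is a hypothetical name, the `/releases/<version>/` pattern is an assumption based on the IRI shown here, and `ET.fromstring` needs a complete, well-formed document (a large OWL file would be better handled incrementally).

```python
import re
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def get_owl_version(owl_xml):
    """Return (version_iri, version) from RDF/XML text; None where missing."""
    root = ET.fromstring(owl_xml)
    ontology = root.find(f"{{{OWL}}}Ontology")
    if ontology is None:
        return None, None
    iri_node = ontology.find(f"{{{OWL}}}versionIRI")
    iri = iri_node.get(f"{{{RDF}}}resource") if iri_node is not None else None
    version = None
    if iri:
        # Assumed pattern: .../releases/<version>/<file>.owl
        match = re.search(r"/releases/([^/]+)/", iri)
        if match:
            version = match.group(1)
    if version is None:
        # Fall back to owl:versionInfo when the IRI has no release segment.
        version = ontology.findtext(f"{{{OWL}}}versionInfo")
    return iri, version
```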

Version

a365645

Filter the obsolete OBOs

We would like to avoid downloading and transforming any obsolete OBO, as defined by the OBO Foundry YAML.
Inactive OBOs can be kept for now.
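
A possible filter, assuming each ontology is a dict parsed from the OBO Foundry YAML and that obsolete entries carry an `is_obsolete: true` field. Both the field name and the function name should be checked against the current registry and codebase before use.

```python
def filter_obsolete(ontologies):
    """Drop entries flagged obsolete; keep everything else, inactive included.

    Assumes obsolete ontologies are marked with `is_obsolete: true`.
    """
    return [onto for onto in ontologies if not onto.get("is_obsolete", False)]
```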

Problem transforming chebi

Describe the bug

Runtime error when pushing chebi transform:

11:29:49  INFO:kg-obo:chebi_nodes.tsv 102144331 bytes
11:29:49  INFO:kg-obo:chebi_edges.tsv 64121999 bytes
11:29:49  INFO:kg-obo:Successfully completed transform of chebi
11:29:49  
processing ontologies:   1%|          | 1/193 [50:43<162:20:28, 3043.90s/it]
11:29:49  Traceback (most recent call last):
11:29:49    File "run.py", line 39, in <module>
11:29:49      run()
11:29:49    File "/var/lib/jenkins/workspace/knowledge-graph-hub_kg-obo_main/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
11:29:49      return self.main(*args, **kwargs)
11:29:49    File "/var/lib/jenkins/workspace/knowledge-graph-hub_kg-obo_main/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1062, in main
11:29:49      rv = self.invoke(ctx)
11:29:49    File "/var/lib/jenkins/workspace/knowledge-graph-hub_kg-obo_main/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
11:29:49      return ctx.invoke(self.callback, **ctx.params)
11:29:49    File "/var/lib/jenkins/workspace/knowledge-graph-hub_kg-obo_main/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 763, in invoke
11:29:49      return __callback(*args, **kwargs)
11:29:49    File "run.py", line 36, in run
11:29:49      run_transform(skip, get_only, bucket, save_local, s3_test)
11:29:49    File "/var/lib/jenkins/workspace/knowledge-graph-hub_kg-obo_main/gitrepo/kg_obo/transform.py", line 349, in run_transform
11:29:49      track_obo_version(ontology_name, owl_iri, owl_version, bucket)
11:29:49    File "/var/lib/jenkins/workspace/knowledge-graph-hub_kg-obo_main/gitrepo/kg_obo/transform.py", line 160, in track_obo_version
11:29:49      client.upload_file(track_file_local_path, bucket)
11:29:49  TypeError: upload_file() missing 1 required positional argument: 'Key'

To Reproduce

See here

Expected behavior

Should transform
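
The traceback points at `upload_file` being called without a destination key. boto3's client-level `upload_file` takes `Filename`, `Bucket`, and `Key`, so a corrected call might look like the sketch below. The helper name and the `kg-obo` remote directory are assumptions, since the intended key is not stated here.

```python
import os


def upload_tracking_file(client, local_path, bucket, remote_dir="kg-obo"):
    """Upload `local_path` to `bucket` under `remote_dir`, returning the key.

    `client` is expected to behave like a boto3 S3 client; `remote_dir`
    is an assumed destination, not confirmed by the issue.
    """
    key = f"{remote_dir}/{os.path.basename(local_path)}"
    # Passing Key explicitly avoids the TypeError shown in the traceback.
    client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return key
```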

Version

8554da2

Prevent multiple concurrent runs of pipeline

We want to prevent multiple concurrent runs of the pipeline. Concurrent runs would all write to the S3 bucket at the same time and could leave mutually inconsistent data.

This problem may arise, for example, when multiple PRs are merged in quick succession, since Jenkins starts a new run when it sees a new commit to main.

Describe the desired behavior

Any run of the pipeline should check whether there is a currently running pipeline, and if so, it should fail.

One way of accomplishing this might be for run.py to look for a file at, say:

s3_bucket/kg-obo/locked

and if it exists, die. If this file doesn't exist, write it to the S3 bucket, continue with the pipeline, then remove the file when the run finishes. (Other, better ideas welcome.)
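
The lock file idea could be sketched as three small helpers around an S3 client. The function names are hypothetical, `client` is assumed to behave like a boto3 S3 client, and the default key follows the path suggested above. Note that check-then-write is not atomic, so two runs starting at the same instant could both pass the check.

```python
def lock_is_set(client, bucket, lock_key="kg-obo/locked"):
    """Return True if the lock object exists in the bucket."""
    response = client.list_objects_v2(Bucket=bucket, Prefix=lock_key)
    # Prefix matching may return longer keys, so compare exactly.
    return any(obj["Key"] == lock_key for obj in response.get("Contents", []))


def set_lock(client, bucket, lock_key="kg-obo/locked"):
    """Write the lock object; return False if a lock is already present."""
    if lock_is_set(client, bucket, lock_key):
        return False
    client.put_object(Bucket=bucket, Key=lock_key, Body=b"")
    return True


def remove_lock(client, bucket, lock_key="kg-obo/locked"):
    """Delete the lock object so that later runs can start."""
    client.delete_object(Bucket=bucket, Key=lock_key)
```

A run would call `set_lock` first and exit if it returns False, then call `remove_lock` when it finishes.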

Provide the JSON format output

Describe the desired behavior

We currently provide compressed node and edge lists as output, but there are use cases involving JSON input as well. We can provide this (or any other KGX transform output format!) on the kg-hub site with fairly little effort.
