| Entities | ETL workflows on DSRI |
|---|---|
| Genes | |
| Drugs | |
| Proteins | |
| Publications | |
| Associations | |
Install the pip package:
pip install d2s
Install yarrrml-parser:
npm i -g @rmlio/yarrrml-parser
A GitHub Actions workflow is defined to run each ETL workflow on DSRI. You can also easily run them locally (you might face scalability issues with some large datasets, so use a sample for testing). Checking a dataset's workflow definition is a good way to see exactly how that dataset is converted to RDF using the SemanticScience ontology.
Start the workspace with docker-compose
docker-compose up
Access the workspace at http://localhost:8888
The source code of the d2s-cli will be cloned and installed locally in the Docker container, so you can make changes to the d2s-cli code and they will be automatically picked up by the d2s command in the container.
Go to the folder of the dataset you want to process, and use d2s to run the dataset processing based on its metadata.ttl file, e.g. to run HGNC:
cd datasets/HGNC
d2s run --sample 100
All temporary files are put in the data/ folder.
If you face conflicts with already installed packages, you might want to use a virtual environment to isolate the installation in the current folder before installing d2s:
# Create the virtual environment folder in your workspace
python3 -m venv .venv
# Activate it using a script in the created folder
source .venv/bin/activate
For Visual Studio Code users: you can easily add autocomplete and validation for YARRRML mapping files and the d2s.yml config file with the YAML extension from Red Hat. In Visual Studio Code, open the settings (File > Preferences > Settings, or Ctrl + ,), then add the following lines to settings.json:
"yaml.schemas": {
"https://raw.githubusercontent.com/bio2kg/bio2kg-etl/main/resources/yarrrml.schema.json": ["*.yarrr.yml"],
"https://raw.githubusercontent.com/MaastrichtU-IDS/d2s-cli/master/resources/d2s-config.schema.json": ["d2s.yml"],
}
To process a new dataset, create a new folder in the datasets folder, and add the following files to map the dataset to RDF:
- Define a dataset-yourdataset.ttl or dataset-yourdataset.jsonld file to describe your dataset, the files to download, and the potential preprocessing scripts to run.
- Optionally define a prepare.sh script to perform specific preparation steps on the dataset.
- Define a mapping-yourdataset.yarrr.yml file to map the dataset to the SemanticScience ontology following the Bio2KG model. You can use the YARRRML Matey editor to write and test your YARRRML mappings.
Use datasets/HGNC as a starter for tabular files, or datasets/InterPro for XML files.
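To get started, a minimal YARRRML mapping can be sketched as below. The prefix, source file, column names (`symbol`, `name`), and properties are all hypothetical placeholders to adapt to your dataset and the Bio2KG model:

```shell
# Write a minimal YARRRML mapping sketch (hypothetical columns: symbol, name)
cat > mapping-yourdataset.yarrr.yml << 'EOF'
prefixes:
  ex: "https://w3id.org/example/"
mappings:
  genes:
    sources:
      - ['data/yourdataset.csv~csv']
    s: ex:gene/$(symbol)
    po:
      - [a, ex:Gene]
      - [ex:label, $(name)]
EOF
```

You can paste the same YAML into the YARRRML Matey editor to test it against a sample of your data.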
Multiple solutions are available to generate RDF from RML mappings:
- rmlmapper-java: the reference implementation; works well with CSV, XML, and functions, but quickly runs out of memory for large files (e.g. DrugBank, iProClass).
- RMLStreamer (Scala): works well with large CSV and XML files, but does not support functions (e.g. DrugBank, iProClass).
- RocketRML (JS): works well with medium-size CSV (2G max) and XML files, and makes it easy to define new functions in JavaScript (no need to rebuild the jar and add turtle files). But it faces issues when running in the workflow runner (missing make for pugixml). See datasets/DrugBank or datasets/iProClass.
See also:
- SDM-RDFizer: work started in datasets/CTD.
- CARML: apparently cannot be used as a standalone executable; it requires writing a Java program.
Create the image in your project (to do only once):
oc new-build --name bio2kg-workspace --binary
Build the Dockerfile on the DSRI (re-run this every time you change the script or content of the Docker image on your machine):
oc start-build bio2kg-workspace --from-dir=. --follow --wait
Start the JupyterLab/VSCode workspace with Helm (cf. the docs to add the repository), replacing bio2kg in the image path with your project name:
helm install workspace dsri/jupyterlab \
--set serviceAccount.name=anyuid \
--set service.openshiftRoute.enabled=true \
--set image.repository=image-registry.openshift-image-registry.svc:5000/bio2kg/bio2kg-workspace \
--set image.tag=latest \
--set image.pullPolicy=Always \
--set storage.mountPath=/home/jovyan/work \
--set gitUrl=https://github.com/bio2kg/bio2kg-etl \
--set password=changeme
Optionally, add your DrugBank user email and password to the helm install command:
--set extraEnvs[0].name=DRUGBANK_USER \
--set extraEnvs[0].value=your_drugbank_user \
--set extraEnvs[1].name=DRUGBANK_PASSWORD \
--set extraEnvs[1].value=your_drugbank_password \
Uninstall the chart:
helm uninstall workspace
Delete the build config:
oc delete bc/bio2kg-workspace
oc delete imagestreamtag/bio2kg-workspace
On the DSRI you can easily create Virtuoso triplestores using the dedicated template in the Catalog (cf. the docs for Virtuoso LDP).
First define the password as environment variable:
export DBA_PASSWORD=yourpassword
Start the production triplestore:
oc new-app virtuoso-triplestore -p PASSWORD=$DBA_PASSWORD \
-p APPLICATION_NAME=triplestore \
-p STORAGE_SIZE=300Gi \
-p DEFAULT_GRAPH=https://data.bio2kg.org/graph \
-p TRIPLESTORE_URL=triplestore-bio2kg.apps.dsri2.unimaas.nl
Start the staging triplestore:
oc new-app virtuoso-triplestore -p PASSWORD=$DBA_PASSWORD \
-p APPLICATION_NAME=staging \
-p STORAGE_SIZE=300Gi \
-p DEFAULT_GRAPH=https://data.bio2kg.org/graph \
-p TRIPLESTORE_URL=triplestore-bio2kg.apps.dsri2.unimaas.nl
After starting the Virtuoso triplestores you will need to install additional VAD packages and create the right folder to enable the Linked Data Platform features:
./prepare_virtuoso_dsri.sh triplestore
./prepare_virtuoso_dsri.sh staging
Configure Virtuoso:
- Enable CORS for the SPARQL endpoint via the admin UI: http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksCORsEnableSPARQLURLs
- Enable the faceted browser and full-text search via the admin UI: http://vos.openlinksw.com/owiki/wiki/VOS/VirtFacetBrowserInstallConfig
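Once configured, you can check the endpoint with a simple SPARQL query over HTTP (the URL comes from the TRIPLESTORE_URL parameter above; the query itself is just an illustrative sanity check):

```shell
curl -s 'https://triplestore-bio2kg.apps.dsri2.unimaas.nl/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 5'
```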
Delete a triplestore on DSRI:
oc delete all,secret,configmaps,serviceaccount,rolebinding --selector app=staging
oc delete all,secret,configmaps,serviceaccount,rolebinding --selector app=triplestore
# Delete the persistent volume:
oc delete pvc --selector app=staging
oc delete pvc --selector app=triplestore
Add or update the template in the bio2kg project on the DSRI:
oc apply -f https://raw.githubusercontent.com/vemonet/flink-on-openshift/master/template-flink-dsri.yml
Create the Flink cluster in your project on DSRI:
oc new-app apache-flink -p APPLICATION_NAME=flink \
-p PASSWORD=$DBA_PASSWORD \
-p STORAGE_SIZE=1Ti \
-p WORKER_COUNT="4" \
-p TASKS_SLOTS="64" \
-p CPU_LIMIT="32" \
-p MEMORY_LIMIT=100Gi \
-p LOG_LEVEL=DEBUG \
-p FLINK_IMAGE="ghcr.io/maastrichtu-ids/rml-streamer:latest"
# -p FLINK_IMAGE="flink:1.12.3-scala_2.11"
Check this repo to build the image of Flink with the RMLStreamer.
Clone the repository with all the dataset mappings into /mnt in the RMLStreamer pod:
oc exec $(oc get pod --selector app=flink --selector component=jobmanager --no-headers -o=custom-columns=NAME:.metadata.name) -- bash -c "cd /mnt && git clone https://github.com/bio2kg/bio2kg-etl.git && mv {bio2kg-etl/*,bio2kg-etl/.*} . && rmdir bio2kg-etl"
oc exec $(oc get pod --selector app=flink --selector component=jobmanager --no-headers -o=custom-columns=NAME:.metadata.name) -- git clone https://github.com/bio2kg/bio2kg-etl.git
Download the latest release of the RMLStreamer.jar file into the /mnt folder of the Flink cluster (only needed if it is not already present):
oc exec $(oc get pod --selector app=flink --selector component=jobmanager --no-headers -o=custom-columns=NAME:.metadata.name) -- bash -c "curl -s https://api.github.com/repos/RMLio/RMLStreamer/releases/latest | grep browser_download_url | grep .jar | cut -d '\"' -f 4 | wget -O /mnt/RMLStreamer.jar -qi -"
Submit a job with the CLI:
export FLINK_POD=$(oc get pod --selector app=flink --selector component=jobmanager --no-headers -o=custom-columns=NAME:.metadata.name)
oc exec $FLINK_POD -- /opt/flink/bin/flink run -p 64 -c io.rml.framework.Main /opt/RMLStreamer.jar toFile -m /mnt/mapping.rml.ttl -o /mnt/bio2kg-output.nt --job-name "RMLStreamer Bio2KG - dataset"
See the Flink docs for more details on running jobs using the CLI or Kubernetes native execution, and the Flink docs for Kubernetes deployment.
Submit a job with the UI (not working), providing this entry class and these program arguments:
io.rml.framework.Main
toFile -m /mnt/datasets/iProClass/data/iproclass-mapping.rml.ttl -o /mnt/datasets/iProClass/output/output.nt
Upload a jar file with the API:
curl -X POST -F [email protected] flink-jobmanager-rest:8081/jars/upload
curl flink-jobmanager-rest:8081/jars
Run the jar previously uploaded (not working):
curl -X POST "flink-jobmanager-rest:8081/jars/b61e9ca3-ca5b-437f-bfe5-59661ed5826c_RMLStreamer.jar/run?parallelism=4&entry-class=io.rml.framework.Main&program-args=toFile%20-m%20/mnt/datasets/iProClass/data/iproclass-mapping.rml.ttl%20-o%20/mnt/datasets/iProClass/output/output.nt"
Returns error: {"errors":["org.apache.flink.runtime.rest.handler.RestHandlerException: No jobs included in application.
Docs for Flink API:
- https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/rest_api.html#submitting-programs
- https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/rest_api/
Uninstall the Flink cluster from your project on DSRI:
oc delete all,secret,configmaps,serviceaccount,rolebinding --selector app=flink
# Delete also the persistent volume:
oc delete pvc,all,secret,configmaps,serviceaccount,rolebinding --selector app=flink
You can define GitHub Actions workflow YAML files in the .github/workflows folder to run on the DSRI:
jobs:
your-job:
runs-on: ["self-hosted", "dsri", "bio2kg" ]
steps: ...
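A complete workflow file sketch; the workflow name, trigger, and dataset folder are placeholders to adapt:

```yaml
name: Process YourDataset
on: workflow_dispatch
jobs:
  process-yourdataset:
    runs-on: ["self-hosted", "dsri", "bio2kg"]
    steps:
      - uses: actions/checkout@v2
      - name: Run the ETL with a sample
        run: |
          cd datasets/YourDataset
          d2s run --sample 100
```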
You can install anything you want with conda, pip, yarn, npm, maven.
The workflow-runner image is built and published on every change to workflows/Dockerfile by a GitHub Actions workflow.
Build with the latest version of miniforge conda automatically downloaded:
docker build -t ghcr.io/bio2kg/workflow-runner:latest -f workflows/Dockerfile .
Quick try:
docker run -it --entrypoint=bash ghcr.io/bio2kg/workflow-runner:latest
Push:
docker push ghcr.io/bio2kg/workflow-runner:latest
You can easily start a GitHub Actions workflow runner in your project on the DSRI using the Actions runner Helm chart:
- Get an access token for a user in the bio2kg organization, and export it in your environment:
export GITHUB_PAT="TOKEN"
- Go to your project on the DSRI:
oc project bio2kg
- Start the runners in your project:
helm install actions-runner openshift-actions-runner/actions-runner \
--set-string githubPat=$GITHUB_PAT \
--set-string githubOwner=bio2kg \
--set runnerLabels="{ dsri, bio2kg }" \
--set replicas=5 \
--set serviceAccountName=anyuid \
--set memoryRequest="512Mi" \
--set memoryLimit="200Gi" \
--set cpuRequest="100m" \
--set cpuLimit="128" \
--set runnerImage=ghcr.io/bio2kg/workflow-runner \
--set runnerTag=latest
- Check that the runners are available from GitHub: go to your organization's Settings page on GitHub, open the Actions tab, then the Runners tab, and scroll to the bottom. In the list of active runners you should see the runners you just deployed.
Uninstall:
helm uninstall actions-runner
Some relevant resources:
Install the official Airflow Helm chart:
helm repo add apache-airflow https://airflow.apache.org
helm repo update
Deploy Airflow with DAGs synchronized with the workflows/dags folder of this GitHub repository (cf. workflows/airflow-helm-values.yaml for all deployment settings):
helm install airflow apache-airflow/airflow \
-f workflows/airflow-helm-values.yaml \
--set webserver.defaultUser.password=$DBA_PASSWORD
Fix the postgresql deployment (setting the serviceAccount.name of the postgresql sub chart does not work, although it should according to the official Helm docs):
oc patch statefulset/airflow-postgresql --patch '{"spec":{"template":{"spec": {"serviceAccountName": "anyuid"}}}}'
To access it, you can forward the webserver to http://localhost:8080 on your machine:
oc port-forward svc/airflow-webserver 8080
Or expose the service on a URL (accessible when on the UM VPN) with HTTPS enabled:
oc expose svc/airflow-webserver
oc patch route/airflow-webserver --patch '{"spec":{"tls": {"termination": "edge", "insecureEdgeTerminationPolicy": "Redirect"}}}'
If you want to use a different database for production, you can use this on the DSRI:
oc new-app postgresql-persistent \
  -p DATABASE_SERVICE_NAME=airflow-postgresql \
  -p POSTGRESQL_DATABASE=airflow \
  -p POSTGRESQL_USER=postgres \
  -p POSTGRESQL_PASSWORD=$DBA_PASSWORD \
  -p VOLUME_CAPACITY=20Gi \
  -p MEMORY_LIMIT=1Gi \
  -p POSTGRESQL_VERSION="10-el8"
But this fails because the service is not directly part of the Helm release (which suggests the production deployment docs are wrong, and the chart is poorly written):
Error: rendered manifests contain a resource that already exists. Unable to continue with install: Service "airflow-postgresql" in namespace "bio2kg" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "airflow"
Delete:
helm uninstall airflow
Experimental. See the docs to install Prefect Server on Kubernetes:
helm repo add prefecthq https://prefecthq.github.io/server/
helm repo update
Install Prefect in your project (oc project my-project) on OpenShift:
helm install prefect prefecthq/prefect-server \
--set agent.enabled=true \
--set jobs.createTenant.enabled=true \
--set serviceAccount.create=false --set serviceAccount.name=anyuid \
--set postgresql.useSubChart=true \
--set postgresql.serviceAccount.name=anyuid \
--set ui.apolloApiUrl=http://prefect-graphql-bio2kg.apps.dsri2.unimaas.nl/graphql
Run this to fix the postgres error:
oc patch statefulsets/prefect-postgresql --patch '{"spec":{"template": {"spec": {"serviceAccountName": "anyuid"}}}}'
Change a setting in a running Prefect server:
helm upgrade prefect prefecthq/prefect-server \
--set ui.apolloApiUrl=http://prefect-graphql-bio2kg.apps.dsri2.unimaas.nl/graphql
Uninstall:
helm uninstall prefect
On your laptop, change the host in the configuration file ~/.prefect/config.toml (cf. the docs):
[server]
host = "http://prefect-graphql-bio2kg.apps.dsri2.unimaas.nl"
port = 80
Create the project (if it does not already exist):
prefect create project 'bio2kg'
Register a workflow:
python3 workflows/prefect-workflow.py
Install Argo workflows using Helm charts:
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
Start Argo workflows in the current project on DSRI:
helm install argo-workflows argo/argo-workflows \
--set workflow.serviceAccount.create=true \
--set workflow.serviceAccount.rbac.create=true \
--set workflow.serviceAccount.name="anyuid" \
--set controller.serviceAccount.create=true \
--set controller.serviceAccount.rbac.create=true \
--set controller.serviceAccount.name="anyuid" \
--set server.serviceAccount.create=true \
--set server.serviceAccount.rbac.create=true \
--set server.serviceAccount.name="anyuid"
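To check the installation, you can submit a minimal hello-world workflow. This example is an illustrative assumption, not part of the Bio2KG pipelines:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: busybox
        command: [echo, "hello from Argo on DSRI"]
```

Submit it with the argo CLI (argo submit hello.yml) or with oc create -f hello.yml.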
Delete Argo workflows:
helm uninstall argo-workflows
This could perhaps be fixed by adding an SCC for hostpath.
We reused some RML mappings from this publication: https://doi.org/10.5281/zenodo.3552369