Comments (17)

hoelzer commented on August 15, 2024

Here is the error when I restart the workflow. Three deepvirfinder runs finished successfully, but then it breaks:

Error executing process > 'deepvirfinder_wf:deepvirfinder (8)'

Caused by:
  Process `deepvirfinder_wf:deepvirfinder (8)` terminated with an error exit status (130)

Command executed:

  rnd=0.13958455425038307
  dvf.py -c 8 -i ERR579308_host_filtered_filt500bp.fa -o ERR579308_host_filtered_filt500bp
  cp ERR579308_host_filtered_filt500bp/*.txt ERR579308_host_filtered_filt500bp_${rnd//0.}.list

Command exit status:
  130

Command output:
  1. Loading Models.
     model directory /DeepVirFinder/models
  2. Encoding and Predicting Sequences.
     processing line 1
     processing line 156114

Command error:
  Using Theano backend.

replikation commented on August 15, 2024

A file lock; I think that has nothing to do with deepvirfinder itself. Theano can't access its compile directory because another process (the one whose ID is recorded in the lock) is still active. At least this is how I understood that part; I could be wrong.

INFO (theano.gof.compilelock): To manually release the lock, delete /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir

Can you delete this and restart the process? I had such an issue once, mainly because I was starting and stopping too fast and not waiting for runs to complete (due to bug fixing).
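
For reference, releasing the lock manually comes down to deleting the lock directory Theano names in the INFO line above and then restarting the run; a sketch (the path is taken verbatim from the log and will differ on other machines):

  # remove the stale Theano compile lock reported in the log,
  # then restart the workflow (e.g. with -resume)
  rm -rf /homes/mhoelzer/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir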

hoelzer commented on August 15, 2024

I deleted

/homes/mhoelzer/.theano/*

but the same error occurs, so this looks to me like a configuration problem on the HPC.

I ran the workflow with the same input files but without deepvirfinder, and it finishes. I didn't know what this .theano folder is or why deepvirfinder writes files there, so I had to do some googling: it seems Theano is a Python package used by deepvirfinder.

And there are already reported cluster issues:
https://groups.google.com/forum/#!topic/theano-users/eJ2vl2PUTk4

This is what I put in my ~/.theanorc files for this:

[global]
base_compiledir=/tmp/%(user)s/theano.NOBACKUP

I will test this and report back.
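
If editing ~/.theanorc is awkward (for example inside a container), the same option can also be passed through the THEANO_FLAGS environment variable; a minimal sketch, not tested in this setup:

  # equivalent to the base_compiledir entry above, but set per shell/job
  export THEANO_FLAGS="base_compiledir=/tmp/$USER/theano.NOBACKUP"
  dvf.py -c 8 -i ERR579308_host_filtered_filt500bp.fa -o ERR579308_host_filtered_filt500bp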

hoelzer commented on August 15, 2024

(Update: it does not really finish even if I skip deepvirfinder, because of the marvel error, see #20.)

hoelzer commented on August 15, 2024

Ok, adding this on the HPC

~/.theanorc:

[global]
base_compiledir=/scratch/%(user)s/theano.NOBACKUP

solved the issue with deepvirfinder in some cases, but not completely. (I use /scratch instead of /tmp because that is what is recommended on this cluster.)

I had to restart the workflow multiple times with the -resume flag to get additional deepvirfinder processes done. Maybe a real fix could involve adding some delay between the deepvirfinder processes, or simply not executing them in parallel when running on a cluster.
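
Another option might be to give every deepvirfinder task its own compile cache, so parallel tasks never compete for the same lock directory; a hypothetical sketch of the task script (the input/output names are placeholders):

  # point Theano at a compile cache inside the task's own work directory,
  # so concurrent deepvirfinder tasks cannot share (and lock) one compiledir
  export THEANO_FLAGS="base_compiledir=$PWD/theano_compiledir"
  dvf.py -c 8 -i input.fa -o output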

hoelzer commented on August 15, 2024

Ok, I also don't think that this will be solved by executing only single deepvirfinder processes. I tried it now with a single file each time and had the same problems.

I also tried

[global]
config.compile.timeout = 1000

according to: pymc-devs/pymc#1463

and also

[global]
base_compiledir=/scratch/%(user)s/theano.NOBACKUP
config.compile.timeout = 10000

Neither helped.
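
One possible reason these entries had no effect: as far as I know, .theanorc maps nested option names to their own sections, so compile.timeout would be set under a [compile] header rather than under [global]. A sketch of that layout (not verified on this cluster):

  [global]
  base_compiledir = /scratch/%(user)s/theano.NOBACKUP

  [compile]
  timeout = 10000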

But maybe this file needs to be added to the Docker container? I am not sure.

Smaller files always seem to work, presumably because Theano writes fewer temporary files. And outside of a cluster environment it does not seem to be a problem anyway.

replikation commented on August 15, 2024

Quoting @hoelzer's comment above:

  Ok, adding this on the HPC

  ~/.theanorc:

  [global]
  base_compiledir=/scratch/%(user)s/theano.NOBACKUP

  solved the issue with deepvirfinder in some cases, but not completely. (I use /scratch instead of /tmp because that is what is recommended on this cluster.)

  I had to restart the workflow multiple times with the -resume flag to get additional deepvirfinder processes done. Maybe a real fix could involve adding some delay between the deepvirfinder processes, or simply not executing them in parallel when running on a cluster.

  • @hoelzer so what do you think might be the best solution here? If I understand correctly it's a "deepvirfinder issue", or?
  • I could do a FASTA split into chunks to avoid overloading deepvirfinder (see the sketch after this list)
  • or a maxForks 1 could do the trick, so we don't have any parallelisation in deepvirfinder?
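
The chunking idea could look roughly like this in the Nextflow code (a sketch only; params.fasta, the chunk size, and the channel name are placeholders, not the actual WtP identifiers):

  // split the input FASTA into fixed-size chunks so each deepvirfinder
  // task only has to process a fraction of the sequences
  Channel
      .fromPath(params.fasta)
      .splitFasta(by: 10000, file: true)
      .set { deepvirfinder_chunks_ch }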

hoelzer commented on August 15, 2024

@replikation I will look into this again. We also had a Singularity update here on the cluster, so I will simply test the current status of WtP again. I also cleaned up the LSF config file and will push this to master directly.

replikation commented on August 15, 2024
  • Alright, if you have the same issue, please add maxForks 1 to the deepvirfinder process (config sketch below)
  • if it's still causing the same issues, we might need to try something else
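
A minimal way to do that without touching the process code would be a selector in nextflow.config; a sketch, with the selector name assumed from the process name in the error log:

  // limit the deepvirfinder process to one concurrent task
  process {
      withName: deepvirfinder {
          maxForks = 1
      }
  }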

replikation commented on August 15, 2024
  • unassigned @Stormrider935 and myself, as we cannot replicate the issue and can only help you fix it

hoelzer commented on August 15, 2024

I tried maxForks 1 and also scratch '/scratch' to use the local node's disk space, but deepvirfinder still crashes:

Error executing process > 'deepvirfinder_wf:deepvirfinder (1)'

Caused by:
  Process `deepvirfinder_wf:deepvirfinder (1)` terminated with an error exit status (130)

Command executed:

  rnd=0.9163528844360542
  dvf.py -c 8 -i hybrid.fa -o hybrid
  cp hybrid/*.txt hybrid_${rnd//0.}.list

Command exit status:
  130

Command output:
  1. Loading Models.
     model directory /DeepVirFinder/models
  2. Encoding and Predicting Sequences.
     processing line 1
     NODE_2_length_212949_cov_9_292657 has >30% Ns, skipping it
     NODE_8_length_151413_cov_7_485881 has >30% Ns, skipping it
     NODE_41_length_97136_cov_13_197103 has >30% Ns, skipping it
     NODE_43_length_96553_cov_14_051794 has >30% Ns, skipping it
     NODE_65_length_85493_cov_13_142442 has >30% Ns, skipping it
     NODE_70_length_83362_cov_10_682812 has >30% Ns, skipping it
     NODE_81_length_81007_cov_8_679489 has >30% Ns, skipping it
     NODE_88_length_78495_cov_12_659319 has >30% Ns, skipping it
     processing line 184860

Command error:
  WARNING: Non existent 'bind path' source: '/nfs/acedb/vol1'
  Using Theano backend.
  INFO (theano.gof.compilelock): Waiting for existing lock by process '55229' (I am process '55254')
  INFO (theano.gof.compilelock): To manually release the lock, delete /scratch/mhoelzer/theano.NOBACKUP/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-10.0--3.6.9-64/lock_dir

Work dir:
  /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/fd/6de9291fbfbe413f30f30a0daee345

I am currently trying to get it to run on another cluster with an updated Singularity.

replikation commented on August 15, 2024

@hoelzer

  • I'll push an update today where you can deactivate tools via an option flag
  • maybe it's easier for you to just deactivate deepvirfinder
  • I'm currently running a few test files to validate the commit before pushing it

hoelzer commented on August 15, 2024

@replikation yeah, that would be great, then I can simply skip this HPC-unfriendly tool.

replikation commented on August 15, 2024
  • workaround in 6f8266a
  • try the --dv flag to deactivate deepvirfinder (usage sketch below)
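
Usage would then look roughly like this (a sketch; only the --dv flag comes from the workaround above, while the repository path and the --fasta option are assumptions about the usual WtP invocation):

  # skip the deepvirfinder step entirely on this run
  nextflow run replikation/What_the_Phage --fasta assembly.fa --dv -resume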

hoelzer commented on August 15, 2024

UPDATE:

deepvirfinder now finished, even on a large input FASTA on the LSF cluster:

Completed at: 13-Jan-2020 04:48:06
Duration    : 1d 19h 9m 34s
CPU hours   : 1'398.0 (26% cached)
Succeeded   : 5
Cached      : 12

What maybe helped is that I cleared the work directory beforehand... I will close this for now because it seems to be a very specific problem with this cluster.

replikation commented on August 15, 2024
  • thanks for the info
