
fredhutch / diy-cromwell-server

A repo containing instructions for running a Cromwell server on Gizmo at the Fred Hutch. (Contact Amy Paguirigan for questions)


diy-cromwell-server's Issues

S3 file access could potentially stall jobs without failing

@vortexing

If a user doesn't have read permission for all of the S3 files a workflow needs, the workflow/job can stall without failing.

At that point the Cromwell Shiny app still reports the job as running, but no corresponding job appears to be running or being submitted on Slurm.

[Shiny app screenshot, 2024-01-12: the workflow shows as running]

[Job list screenshot, 2024-01-12: no corresponding Slurm job]
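One way to catch this before submitting is to check read access to each S3 input with the AWS CLI up front; a minimal sketch (the bucket and key below are placeholders):

# check that the current credentials can read one of the workflow's inputs (placeholder bucket/key)
aws s3api head-object --bucket my-input-bucket --key path/to/sample.bam

A 403/Forbidden response here is the case that would otherwise show up as a silently stalled job.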

Scatter using Singularity fails creating SIF

When tasks executed within a scatter use singularity exec and the Docker/Singularity image isn't already in the cache, there is a race condition: multiple shards try to build the SIF at the same time. On NFS-mounted home directories this apparently results in a "stale NFS file handle" error.

To reproduce it, remove the images from the Singularity cache (~/.singularity). It is also difficult to reproduce with simple images (e.g. ubuntu); the Broad's GATK image seems to trigger the error fairly reliably.

==> shard-0/execution/stderr <==
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=53
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda/pkgs root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=53
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda/pkgs/urls root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=48
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda/pkgs/urls.txt root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=48
2020/04/10 07:06:02 debug unpacking entry           path=root/.gradle root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=53
2020/04/10 07:06:02 debug unpacking entry           path=root/gatk.jar root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=50
2020/04/10 07:06:02 debug unpacking entry           path=root/run_unit_tests.sh root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=48
DEBUG   [U=34152,P=14608]  Full()                        Inserting Metadata
DEBUG   [U=34152,P=14608]  Full()                        Calling assembler
INFO    [U=34152,P=14608]  Assemble()                    Creating SIF file...
DEBUG   [U=34152,P=14608]  cleanUp()                     Cleaning up "/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e" and "/loc/scratch/46802618/bundle-temp-539962740"
FATAL   [U=34152,P=14608]  replaceURIWithImage()         Unable to handle docker://broadinstitute/gatk@sha256:0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa uri: unable to build: while creating SIF: while creating container: writing data object for SIF file: copying data object file to SIF file: write /home/mrg/.singularity/cache/oci-tmp/0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa/gatk@sha256_0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa.sif: stale NFS file handle
==> shard-1/execution/stderr <==
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=53
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda/pkgs root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=53
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda/pkgs/urls root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=48
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda/pkgs/urls.txt root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=48
2020/04/10 07:06:13 debug unpacking entry           path=root/.gradle root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=53
2020/04/10 07:06:13 debug unpacking entry           path=root/gatk.jar root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=50
2020/04/10 07:06:13 debug unpacking entry           path=root/run_unit_tests.sh root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=48
DEBUG   [U=34152,P=3126]   Full()                        Inserting Metadata
DEBUG   [U=34152,P=3126]   Full()                        Calling assembler
INFO    [U=34152,P=3126]   Assemble()                    Creating SIF file...
VERBOSE [U=34152,P=3126]   Full()                        Build complete: /home/mrg/.singularity/cache/oci-tmp/0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa/gatk@sha256_0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa.sif
DEBUG   [U=34152,P=3126]   cleanUp()                     Cleaning up "/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824" and "/loc/scratch/46802619/bundle-temp-701530895"
VERBOSE [U=34152,P=3126]   handleOCI()                   Image cached as SIF at /home/mrg/.singularity/cache/oci-tmp/0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa/gatk@sha256_0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa.sif
DEBUG   [U=34152,P=3126]   execStarter()                 Checking for encrypted system partition

.... output trimmed; the log indicates this shard ran the container ....
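A possible workaround (a suggestion, not something this repo currently does) is to warm the Singularity cache once before launching the scattered workflow, so the shards find the SIF already built instead of racing to create it:

# pull the image once up front so ~/.singularity/cache already holds the built SIF
singularity pull gatk.sif docker://broadinstitute/gatk@sha256:0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa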

localBatchFileScatter parse.inputs.json pointing to the wrong sample.batchfile.tsv file

hey Amy,

I'm starting to look more closely at some actual wdl files to understand that end of things.

In testWorkflows/localBatchFileScatter, there is a local sample.batchfile.tsv file, but it's not being used.

The parse.inputs.json instead points to a copy of sample.batchfile.tsv at /fh/fast/paguirigan_a/pub/ReferenceDataSets/workflow_testing_data/WDL/batchFileScatter/sample.batchfile.tsv, which uses files hosted on S3 rather than on /fh/fast.

I think that's not what you intend - I'm guessing this workflow won't work for people who haven't set up their AWS credentials: does that make sense?

Janet

3 Ls

silly question: do you mean for the dirname testWorkflows/hellloSingularityHostname to have 3 Ls in helllo?

Separate cromwellParams from X.conf

Let's shift to specifying the config file path when we kick off the server, rather than passing it as a Param. That will make testing/adapting easier later.
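For reference, Cromwell picks up its config via a Java system property, so kicking off the server could look something like this (jar name and paths are placeholders):

# pass the config explicitly at launch instead of via the Params file
java -Dconfig.file=/path/to/fh-cromwell.conf -jar cromwell-49.jar server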

confusing comment line in variantCalling-workflow.wdl

hey,

again, I'm using the examples as a starting point for understanding WDL syntax.

tiny picky thing: in the variantCalling-workflow.wdl script, the comment on line 68 ("# Get the basename, i.e. strip the filepath and the extension") - I don't think the script actually does that on line 69; it looks like it's just concatenating a couple of the specified inputs.

I'm sure I will want to learn how to do the filepath stripping thing sometime - I'll look elsewhere to find it.

Janet
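For what it's worth, the shell basename command inside a task's command block does exactly that stripping (WDL also has an equivalent basename() standard-library function); a tiny illustrative example:

# strip the directory and the .bam extension from a path
basename /path/to/tumor_sample.bam .bam    # prints: tumor_sample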

Output and Error information pruned during Singularity exec

It looks like the stderr (and likely stdout) is getting pruned during execution when Singularity is used to run the script:

VERBOSE [U=0,P=3126]       print()                       Set messagelevel to: 5
VERBOSE [U=0,P=3126]       init()                        Starter initialization
DEBUG   [U=0,P=3126]       load_overlay_module()         Trying to load overlay kernel module
DEBUG   [U=0,P=3126]       load_overlay_module()         Overlay seems not supported by the kernel
DEBUG   [U=0,P=3126]       get_pipe_exec_fd()            PIPE_EXEC_FD value: 9
VERBOSE [U=0,P=3126]       is_suid()                     Check if we are running as setuid
VERBOSE [U=0,P=3126]       priv_drop()                   Drop root privileges
DEBUG   [U=34152,P=3126]   init()                        Read engine configuration
tail: shard-1/execution/stderr: file truncated
DEBUG   [U=34152,P=3126]   Master()                      Child exited with exit status 0

Note the "file truncated" message. The lines prior to this is output from singularity -d exec. The resultant stderr file in the execution directory doesn't contain any of the lines above the "file truncated" message.

I believe this happens when the execution script redirects stdout and stderr. The submit script uses Slurm options to save stdout/stderr to files in the execution directory. From a representative output script (script.submit):

-o /fh/scratch/delete10/_HDC/user/mrg/cromwell-root/cromwell-executions/parseBatchFile/b9e0a739-23e3-485d-8bab-857349697458/call-test/shard-1/execution/stdout \
-e /fh/scratch/delete10/_HDC/user/mrg/cromwell-root/cromwell-executions/parseBatchFile/b9e0a739-23e3-485d-8bab-857349697458/call-test/shard-1/execution/stderr \

When the task script (script) runs, the following lines truncate any output produced between the startup of the job on the node and the execution of the script in the container (particularly the Singularity startup messages):

tee '/cromwell-executions/parseBatchFile/b9e0a739-23e3-485d-8bab-857349697458/call-test/shard-1/execution/stdout' < "$outb9e0a739" &
tee '/cromwell-executions/parseBatchFile/b9e0a739-23e3-485d-8bab-857349697458/call-test/shard-1/execution/stderr' < "$errb9e0a739" >&2 &

Note that inside the container (where script is run) the path /fh/scratch/delete10/_HDC/user/mrg/cromwell-root/cromwell-executions is mounted at /cromwell-executions. tee (without other options) truncates the file before writing to it, which removes any output produced between job start and script execution.

I suspect that adding -a to the tee command will fix this problem.
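For illustration, the same two lines with -a added so tee appends instead of truncating (untested sketch):

tee -a '/cromwell-executions/parseBatchFile/b9e0a739-23e3-485d-8bab-857349697458/call-test/shard-1/execution/stdout' < "$outb9e0a739" &
tee -a '/cromwell-executions/parseBatchFile/b9e0a739-23e3-485d-8bab-857349697458/call-test/shard-1/execution/stderr' < "$errb9e0a739" >&2 &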

Cromwell Version?

Commit used: 77c318a

  "workflowName": "hello_hostname",
  "workflowProcessingEvents": [
    {
      "cromwellId": "cromid-53866a4",
      "description": "PickedUp",
      "timestamp": "2020-04-07T21:16:47.895Z",
      "cromwellVersion": "47"
    },
    {
      "cromwellId": "cromid-53866a4",
      "description": "Finished",
      "timestamp": "2020-04-07T21:16:47.915Z",
      "cromwellVersion": "47"
    }
...

  "status": "Failed",
  "failures": [
    {
      "causedBy": [],
      "message": "/fh/scratch"
    }

This is what's in the failure message for my workflow using the baseConfig right now. I'm loading version 49, and when I request the version of the running server via the API I get v49, but in the workflow metadata I'm still getting version 47, which was the previous one we were using.
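One way to double-check which Cromwell the server itself is running is to hit the engine version endpoint directly (host and port are placeholders for wherever the server is listening):

# ask the running server for its engine version; typically returns something like {"cromwell":"49"}
curl http://<server-host>:<port>/engine/v1/version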

memory requests not being used in fh-cromwell.conf

hey!

I put this in Slack but realized that this is a better place for it.

I think memory requests made in the runtime{} block of a wdl are being ignored when running on the local cluster.

I'll attach:
(1) a demo WDL
(2) an options json file that goes with it, simply to prevent caching
(3) the script.submit generated for the task

The CPU request is honored, but the memory request is not part of the sbatch command in script.submit.

Does that make sense?

thanks!

Janet
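For context, I'd guess the fix is to pass the runtime memory value through to the sbatch line in the backend's submit string, alongside the CPU flag that already works; a hedged sketch of what that line might include (values and script path are illustrative, not the repo's actual template):

# illustrative only: --cpus-per-task already appears; --mem (in MB) would need to be added
sbatch --cpus-per-task=4 --mem=8192 --wrap "/bin/bash /path/to/execution/script"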

Use included file for database connection parameters

The current mechanism puts the database connection string (including username and password) in the process list, where it is viewable by anyone on the system. This is somewhat undesirable: though I don't know precisely what harm could come of it, general "best practices" would suggest we improve this.
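To see the exposure concretely (assuming the connection settings are currently passed on the java command line), anyone on the node can do something like:

# the full java command line, including any database URL/user/password arguments, is world-readable
ps -ef | grep '[c]romwell'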

I've done a little work and it looks like we may be able to specify database connection parameters in a file separate from the general Cromwell config (e.g. fh-slurm-sing-cromwell.conf) and use HOCON's include directive:

include required(classpath("application"))
include required(file("database.conf"))
###### FH Slurm Backend, with call caching, without docker/singularity
   ....

Where database.conf contains the database section as specified by Cromwell:

database {
  profile = "slick.jdbc.MySQLProfile$"
  db {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://mydb:32222/cromdb?rewriteBatchedStatements=true&serverTimezone=UTC"
    user = "username"
    password = "password"
    connectionTimeout = 5000
  }
}

The path above doesn't have to be in the current directory; my understanding is that we could specify any path there, though I'm not sure whether things like "~" are expanded.
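If we go this route, the included file should probably also be locked down so only the owner can read it, e.g.:

# keep the credentials file readable only by its owner
chmod 600 database.conf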

Server Time Zone Unrecognized

Caused by: com.mysql.cj.exceptions.InvalidConnectionAttributeException: The server time zone value 'PDT' is unrecognized or represents more than one time zone. You must configure either the server or JDBC driver (via the serverTimezone configuration property) to use a more specifc time zone value if you want to utilize time zone support.

This seems to be an interaction between JDBC, Java 8, and MariaDB. The easy fix (i.e. not mucking about server-side) is to specify the time zone in the connection URL with serverTimezone=UTC (note the extreme case sensitivity).

Glob + env modules= OK; Glob + containers= FAIL.

Add to the README and to the docs in the Wiki that with the Cromwell we're using right now (v49), our config has a line that creates soft links instead of hard links for globs.

Currently that means you can use globs in outputs when you are using env modules, BUT that will break when you use a docker/singularity container. In that case, in the task where you need to glob the output, change it to this:

# in the task's command block: copy the files you would have globbed
# into the working directory and write their names to a manifest
cp files-to-glob .
ls <glob pattern> > outputFiles.txt

# in the task's output block: read the manifest instead of using glob()
output {
  Array[File] outputGlobFiles = read_lines("outputFiles.txt")
}
