pepkit / geofetch Goto Github PK

View Code? Open in Web Editor NEW

41.0 41.0 5.0 939 KB

Builds a PEP from SRA or GEO accessions

Home Page: https://pep.databio.org/geofetch/

License: BSD 2-Clause "Simplified" License

Python 100.00%

download-manager geo metadata sra sra-data

geofetch's Introduction

Install

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/master.zip

Install dev version

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/dev.zip

geofetch's People

Contributors

Stargazers

Watchers

Forkers

vreuter j-lawson wook2014 rimanb pedro-w

geofetch's Issues

Error giving sample list file

Given example_sample_list_2.tsv like this:

GSE27432	GSM678211

I get this error:

geofetch -i example_sample_list_2.tsv 
Metadata folder: /home/nsheff/garage/geofetch/example_sample_list_2
Accession list file found: example_sample_list_2.tsv
Traceback (most recent call last):
  File "/home/nsheff/.local/bin/geofetch", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/geofetch/geofetch.py", line 1591, in main
    Geofetcher(args).fetch_all()
  File "/home/nsheff/.local/lib/python3.8/site-packages/geofetch/geofetch.py", line 147, in fetch_all
    acc_GSE_list = parse_accessions(
  File "/home/nsheff/.local/lib/python3.8/site-packages/geofetch/utils.py", line 120, in parse_accessions
    if acc_GSE_list.has_key(gse):  # GSE already has a GSM; add the next one
AttributeError: 'dict' object has no attribute 'has_key'

The argument --acc-anno is not working for processed data

For now, there is no possibility to produce annotation sheets for each accession. We should add it.

Decouple Geofetcher and argparser

The Geofetcher class is dependent on argparse, that provides limitations and issuses in future development.

Decoupling of Geofetcher and Argparser will solve this problem.

Docs build failing

Docs are not building correctly at RTD.

@khoroshevskyi please send me your readthedocs username and I will add you as a maintainer, so you can get familiar with documentation building.

Header in auto config

The automatically produced project config file GSE<xxxxx>_config.yaml has a constant header that I think we could omit.

# This is an example project for how to use the sra_convert pipeline to convert
# .sra files to .bam files using looper.

update to new pipeline interface

geofetch has a --pipeline-interface command that adds it in for looper. but with the new piface spec there's a distinction between sample and project pifaces. what does geofetch do, then?

Ensure python3 compatibility

At least 3.5, 3.6, and 3.7
Motivated by, e.g.: #16

add pipeline interface argument

geofetch could add a new command-line argument that would allow the user to automatically add a pipeline_interface attribute to the project, so it doesn't have to be done afterwards manually.

Error when downloading GSE26320

When trying to download processed data from accession GSE26320, geofetch reads some of the samples but then produces the following error:

Traceback (most recent call last):
File "/home/jev4xy/.local/bin/geofetch", line 11, in
sys.exit(main())
File "/home/jev4xy/.local/lib/python3.6/site-packages/geofetch/geofetch.py", line 760, in main
run_geofetch(sys.argv[1:])
File "/home/jev4xy/.local/lib/python3.6/site-packages/geofetch/geofetch.py", line 405, in run_geofetch
if line[0] is "^":
IndexError: string index out of range

Keyerror:experiment

SRP: SRP087641
Parsing SRA file to download SRR records
Traceback (most recent call last):
File "./geofetch.py", line 750, in
main(sys.argv[1:])
File "./geofetch.py", line 543, in main
experiment = line["Experiment"]
KeyError: 'Experiment'

NameError: name 'Sys' is not defined

You must provide a geo_folder to download processed data.
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/bin/geofetch", line 10, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geofetch/geofetch.py", line 896, in main
    run_geofetch(sys.argv[1:])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geofetch/geofetch.py", line 591, in run_geofetch
    Sys.exit
NameError: name 'Sys' is not defined

Incorrect path in PEP to the downloaded files

After downloading processed data with metadata, path to each file is incorrect. Project uploaded in peppy and pepr is have incorrect paths to the files.
Initial yaml file has to be changed.

input sample lists are not working for processed data

Input sample lists works for raw data. But if we specify second column with GSM accessions it will download all samples from speciefied experiement, by not taking into account samples.

Filtering an accession's samples

We don't currently have the capability to subset an individual accession and perform a fetch/download for just a subset of the GSM's, do we? Could we implement that? Not sure about frequency of use cases in general, but I'd definitely find it useful for certain fields, especially organism, cell/tissue type, and library strategy / data type.

Fix PyPI readme formatting

on pypi the readme is not formatting markdown correctly.

https://pypi.org/project/geofetch/

spaces in sample names

geofetch currently allows spaces in sample names in the accession file format. these should be disallowed as they cause downstream challenges.

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

My command: python geofetch.py -i GSE86043

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/guol03/.wget-hsts'. HSTS will be disabled.

SRP: SRP086702
Parsing SRA file to download SRR records
Traceback (most recent call last):
  File "geofetch.py", line 760, in <module>
    main(sys.argv[1:])
  File "geofetch.py", line 536, in main
    for line in input_file:
  File "/home/guol03/anaconda3/lib/python3.6/csv.py", line 111, in __next__
    self.fieldnames
  File "/home/guol03/anaconda3/lib/python3.6/csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

update to new pep spec

When the pep spec is released, geofetch will need to write the new spec.

BTW, @stolarczyk -- here's another example use case of writing a pep object to disk.

add usage to docs

README needs usage docs, possibly using something like:

https://github.com/pepkit/looper/blob/1f21936b885abbed0f779426f780febe2624753c/update-usage-docs.sh

geofetch only grabs the first of processed data files

if a geo record has multiple processed data files, geofetch will only provide the first one in the annotation sheet.

input sample lists are not working

using geofetch with -i INPUT is supposed to work for either a GSE directly, or a file with a list of GSEs, like this:

file: example_sample_list.tsv

GSE185701	GSM5621756	Huh7 siNC H3K27ac
GSE185701	GSM5621756	Huh7_siNC_H3K27ac
GSE185701	GSM5621758	Huh7_siDHX37_H3K27ac
GSE185701	GSM5621758	Huh7_siDHX37_H3K27ac
GSE185701	GSM5621760	Huh7_DHX37
GSE185701	GSM5621760	Huh7_DHX37
GSE185701	GSM5621761	Huh7_PLRG1
GSE185701	GSM5621761	Huh7_PLRG1

But this is now failing:

geofetch -i example_sample_list.tsv --processed -n bright_test --just-metadata
Metadata folder: /home/nsheff/garage/geofetch/bright_test
Accession list file found: example_sample_list.tsv
Traceback (most recent call last):
  File "/home/nsheff/.local/bin/geofetch", line 8, in <module>
    sys.exit(main())
  File "/home/nsheff/.local/lib/python3.8/site-packages/geofetch/geofetch.py", line 1591, in main
    Geofetcher(args).fetch_all()
  File "/home/nsheff/.local/lib/python3.8/site-packages/geofetch/geofetch.py", line 147, in fetch_all
    acc_GSE_list = parse_accessions(
  File "/home/nsheff/.local/lib/python3.8/site-packages/geofetch/utils.py", line 120, in parse_accessions
    if acc_GSE_list.has_key(gse):  # GSE already has a GSM; add the next one
AttributeError: 'dict' object has no attribute 'has_key'

Mkdocs build failing

You can build docs with mkdocs serve

The code_blocking updates are causing documentation build to fail.

Don't use absolute hard paths in output PEPs

Right now, when using --processed the PEP that is output uses a hard-coded absolute path to the sample table it creates. This makes the resulting PEP not portable.

Will the PEP allow a relative path? if so, can we instead use relative paths there?

e.g.

Command-line help for geofetch.py

Super minor, but geofetch.py --help terminal text has some strange wrapping. I think it'd be OK to not print out the environment variables and instead the user to manually echo whichever ones he/she is curious about.

usage: geofetch.py [-h] -i INPUT [-p] [-m METADATA_FOLDER] [-b BAM_FOLDER]
                  [-s SRA_FOLDER] [--picard PICARD_PATH] [-d] [-r] [-x]

Automatic GEO SRA data downloader

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        required: a GEO (GSE) accession, or a file with a list
                        of GSE numbers
  -p, --processed       By default, download raw data. Turn this flag to
                        download processed data instead.
  -m METADATA_FOLDER, --metadata METADATA_FOLDER
                        Specify a location to store metadata [Default: $SRAMET
                        A:/sfs/lustre/allocations/shefflab/data/sra_meta/]
  -b BAM_FOLDER, --bamfolder BAM_FOLDER
                        Optional: Specify a location to store bam files
                        [Default: $SRABAM:/sfs/lustre/allocations/shefflab/dat
                        a/sra_bam/]
  -s SRA_FOLDER, --srafolder SRA_FOLDER
                        Optional: Specify a location to store sra files
                        [Default:
                        $SRARAW:/sfs/lustre/allocations/shefflab/data/sra/]
  --picard PICARD_PATH  Specify a path to the picard jar, if you want to
                        convert fastq to bam [Default: $PICARD:/apps/software/
                        standard/core/picard/2.1.1/picard.jar]
  -d, --dry-run         If set, don't actually run downloads, just create
                        metadata
  -r, --refresh-metadata
                        If set, re-download metadata even if it exists.
  -x, --split-experiments
                        By default, SRX experiments with multiple SRR Runs
                        will be merged in the metadata sheets. You can treat
                        each run as a separate sample with this argument.

In series annotation file files are stored in the list

Geofetch is creating series annotation file where some of the columns show up as a python list. This list should be: 1) separated, or 2) stored as plain text

ignore case in filter regex

I'm writing a universal --filter to match processed files with genomic regions. This is what I have:

(narrowPeak|bed|bedGraph|bw|bigBed){1}(\.gz)?$

However, I realized that there are *.bedgraph and *.bedGraph files out there, so I'd need to either include bedgraph to my "OR" collection or ignore case in the regex. The re.compile method does not seem to accept /<regex>/i, but re.compile("<regex>", re.IGNORECASE). Therefore there's no way to do this on the command line. It would be nice to make it possible.

Clean up soft files

After downloading all metadata and creating PEPs all soft fils are still in the repository.
Should we add feature to the geofetch that will clean up all unneeded soft files?

@nsheff

Unit tests

Geofetch needs some unit tests.

Messaging UI update

if there are no limits we say:

Limit to: OrderedDict()

We should simply say "no limit", or leave this message off.

Add file size filter for processed files

Possibility of filtering processed files by size before downloading can be very useful option in geofetch.

when downloading procsesed data, filter by file type

some experiments provide lots of data of different types. we may b interested in only a subset of filetypes. would be nice to say --filter .bed, .bed.gz to get only those file types.

change csv to tsv?

because it's common that GEO metadata submitters include commas in the 'description' fields, I think we should switch our sample annotation csv output to tsv. this would just make it easier for very simple, delimiter software (like cat data.csv | perl -pe 's/((?<=,)|(?<=^)),/ ,/g;' | column -t -s, | less -S) to do something meaningful and not get messed up by within-field commas. I think within-field tabs are less frequent.

Always getting SLURM memory errors

This could actually be something with pypiper or specific to SLURM, but I'm getting a memory error for every (seemingly anyway, I haven't checked every one) sra_convert operation.
Logfiles are here: /sfs/lustre/allocations/shefflab/processed/GSE63137/submission, and here's the message:

$ tail convert_MethylC-seq_*.log
==> convert_MethylC-seq_excitatory_neurons_rep1.log <==
##### [Epilogue:]
*   Total elapsed time:  2:36:33
*     Peak memory used:  0.01 GB
* Pipeline completed at:  (11-11 07:18:37) elapsed: 9393.0 _TIME_

Pypiper terminating spawned child process 107099...(tee)
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: udc-ba26-19: task 0: Out Of Memory
srun: Terminating job step 2750598.0
slurmstepd: error: Exceeded job memory limit at some point.

==> convert_MethylC-seq_excitatory_neurons_rep2.log <==
##### [Epilogue:]
*   Total elapsed time:  2:30:40
*     Peak memory used:  0.01 GB
* Pipeline completed at:  (11-11 07:12:44) elapsed: 9040.0 _TIME_

Pypiper terminating spawned child process 107100...(tee)
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: udc-ba26-19: task 0: Out Of Memory
srun: Terminating job step 2750599.0
slurmstepd: error: Exceeded job memory limit at some point.

==> convert_MethylC-seq_PV_neurons_rep1.log <==
##### [Epilogue:]
*   Total elapsed time:  3:26:09
*     Peak memory used:  0.01 GB
* Pipeline completed at:  (11-11 08:08:14) elapsed: 12369.0 _TIME_

Pypiper terminating spawned child process 169430...(tee)
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: udc-ba26-12: task 0: Out Of Memory
srun: Terminating job step 2750600.0
slurmstepd: error: Exceeded job memory limit at some point.

==> convert_MethylC-seq_PV_neurons_rep2.log <==
##### [Epilogue:]
*   Total elapsed time:  3:05:17
*     Peak memory used:  0.01 GB
* Pipeline completed at:  (11-11 07:47:21) elapsed: 11117.0 _TIME_

Pypiper terminating spawned child process 169434...(tee)
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: udc-ba26-12: task 0: Out Of Memory
srun: Terminating job step 2750601.0
slurmstepd: error: Exceeded job memory limit at some point.

==> convert_MethylC-seq_VIP_neurons_rep1.log <==
##### [Epilogue:]
*   Total elapsed time:  2:55:00
*     Peak memory used:  0.01 GB
* Pipeline completed at:  (11-11 07:37:05) elapsed: 10500.0 _TIME_

Pypiper terminating spawned child process 169432...(tee)
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: udc-ba26-12: task 0: Out Of Memory
srun: Terminating job step 2750602.0
slurmstepd: error: Exceeded job memory limit at some point.

==> convert_MethylC-seq_VIP_neurons_rep2.log <==
##### [Epilogue:]
*   Total elapsed time:  3:44:25
*     Peak memory used:  0.01 GB
* Pipeline completed at:  (11-11 08:26:30) elapsed: 13465.0 _TIME_

Pypiper terminating spawned child process 169433...(tee)
slurmstepd: error: Exceeded job memory limit at some point.
srun: error: udc-ba26-12: task 0: Out Of Memory
srun: Terminating job step 2750603.0
slurmstepd: error: Exceeded job memory limit at some point.

KeyError: 'Experiment' with missing SRP#

  Found sample GSM678211  Unable to get SRA accession (SRP#) from GEO GSE SOFT file. No raw data?
But the GSM has an SRX number; 
Instead of an SRP, using SRX identifier for this sample:  GSM678211
--2018-11-02 12:24:34--  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gsm&acc=GSM678211&form=text&view=full
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gsm&acc=GSM678211&form=text&view=full [following]
--2018-11-02 12:24:34--  https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=gsm&acc=GSM678211&form=text&view=full
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [geo/text]
Saving to: ‘/sfs/lustre/allocations/shefflab/data/sra_meta/GEO_rrbs_diverse/GSE27432_SRA.csv’

    [ <=>                                                                                      ] 3,319       --.-K/s   in 0s      

2018-11-02 12:24:35 (294 MB/s) - ‘/sfs/lustre/allocations/shefflab/data/sra_meta/GEO_rrbs_diverse/GSE27432_SRA.csv’ saved [3319]

SRP: GSM678211
Parsing SRA file to download SRR records
Traceback (most recent call last):
  File "/home/ns5bc/code//geofetch/geofetch/geofetch.py", line 750, in <module>
    main(sys.argv[1:])
  File "/home/ns5bc/code//geofetch/geofetch/geofetch.py", line 543, in main
    experiment = line["Experiment"]
KeyError: 'Experiment'

produce config file

geofetch could really easily produce a basic config file instead of making the user do it.

Metadata standardization

At the moment, geofetch can download, filter, save metadata for the specific accessions in GEO. But metadata in GEO is stored in different, messy ways. Some of the information can be redundant and some can be stored in different places.

e.g. sample genome information may be stored in 3 (or more) different keys (dictionary keys):

'Sample_description': ['assembly: 'hg19', ...]
"Sample_characteristics_ch1": ['genome build': 'hg19', ...]
"Sample_data_processing": ['Genome_build': 'hg19', ...]

To create good, standardized PEP .csv metadata file, all information has to be be carefuly preprocessed. Especially this can be useful to create new endpoint in pephub.

In my opinion we have to create new class, or set of function, that will be separated from geofetch and will standardize all GEO metadata.

Integration of geofetch and pephub

pephub: https://github.com/pepkit/pephub

Ideas of integration are in this issue:

-pepkit/pephub#21 (comment)

Additionaly:

Download metadata (with created pep) with 100K excessions with bedfiles
Think about reorganization of geofetch code, so it can be easily integrated with pephub

Incorrect help --data-source description

default argument for data-source is samples, but in help is all. it should be corrected

Adding filter that separates experiment and sample data

In GEO some files are stored as experiment data and other in sample data that are linked to the experiment. Metadata experiment supplementary file contains less information and can't be merged with sample supplementary file metadata.

Tasks:

Separate metadata of experiment supplementary file and metadata of sample supplementary file.
Add filter that will specify what which metadata and processed data we want to download.

No annotations table in metadata folder?

What causes the annotations table that's ready for pepkit to be written vs. not? I think we should log a warning if this is not written.

geofetch sample annotation column order is unintuitive

Column order in sample annotation sheets output by geofetch is not intuitive. Could it be similar to this issue in peppy:

pepkit/peppy#213

maybe solvable using an OrderedDict at some point?

specificity plot labels are unreadable

we should rotate the labels 90 degress like in the chrom plots

geofetch should not created folders named '${SRAMETA}'

under some circumstances (like, maybe if SRAMETA variable is not populated?) geofetch will create a folder called ${SRAMETA}.

it should not do that.

Line parsing flexibility

We should be able to overcome some SOFT file problems; for now we can skip over such lines but should consider populating, e.g., missing sample_name column in such cases, if that's an/the underlying issue.

From a collaborator:

I followed a tip and run geofetch under 2.7 - it returns all files, great :)

Unfortunately, when I tried run it exactly with GSE12417 I received:

  Found sample GSM311598Traceback (most recent call last):

  File "./Library/Frameworks/Python.framework/Versions/2.7/bin/geofetch", line 11, in <module>

    sys.exit(main())

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/geofetch/geofetch.py", line 776, in main

    run_geofetch(sys.argv[1:])

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/geofetch/geofetch.py", line 471, in run_geofetch

    pl = parse_SOFT_line(line)

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/geofetch/geofetch.py", line 169, in parse_SOFT_line

    return {elems[0].rstrip(): elems[1].lstrip()}

IndexError: list index out of range

And script was not able to produce annotation file.

I will be grateful for any idea!

error on missing processed data location

provide a clean error message when processed data is requested, but no location provided.

missing slash

config files produced by geofetch use this for derived attributes:

{SRABAM}{SRR}

But this only works if {SRABAM} has a trailing slash, which it may not. It should check for that and use {SRABAM}/{SRR} if necessary.

Also for {SRARAW}.

show populated folder target.

geofetch prints out:

Given metadata folder: ${SRAMETA}
Initial raw metadata folder: ${SRAMETA}
Final raw metadata folder: ${SRAMETA}/

it would be nice if it displayed the populated version in addition.

Status check

At least when running sra_convert via looper and using the --lump option, looper check doesn't correctly provide a readout of the status of each conversion in progress, instead reporting nothing. It seems like this may be due to the location and/or non-granular path for, e.g. the _running.flag files.

killing child processes

If you ctrl+c signal a running process that has already started a prefetch, geofetch will quit but the subprocess appears to complete.

perhaps we need it to monitor the child process and kill it, too.

Automatic retry

Sometimes an initial download attempt does not succeed, but often it does when tries again:

2019-04-15T21:02:26 prefetch.2.9.1: 1) Downloading 'SRR6363360'...
2019-04-15T21:02:26 prefetch.2.9.1:  Downloading via https...
2019-04-15T21:03:40 prefetch.2.9.1 int: transfer incomplete while reading file within network system module - Cannot KStreamRead: https://sra-download.ncbi.nlm.nih.gov/traces/sra55/SRR/006214/SRR6363360
2019-04-15T21:03:40 prefetch.2.9.1: 1) failed to download SRR6363360
Get SRR: SRR6363361 (SRX3459388)

Could we have a geofetch run track some subset of error types and automatically retry? This could be configurable in terms of error kinds, number of samples, and number of attempts per sample.