file_suffix flag (gunc issue, closed)

grp-bork commented on August 23, 2024
file_suffix flag

from gunc.

Comments (24)

Biofarmer commented on August 23, 2024

In addition, may I ask whether compressed FASTA files (.fna.gz) can be used directly by GUNC?
Thanks

fullama commented on August 23, 2024

> When using --input_dir with --file_suffix .fna, I am wondering whether GUNC correctly handles genomes whose names contain .fna in the middle?

gunc will take everything before the first occurrence of .fna in the file name and use that as the sample name in the output.

> may I ask whether --file_suffix .fna is still needed when --input_file is provided?

No, but the sample names in your output will then contain the suffix. Providing the suffix is only there to allow gunc to remove it from the input filename.

> may I ask whether compressed fasta (.fna.gz) could be directly used by GUNC?

Yes

any other questions let me know!
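
The naming rule described above can be sketched in a few lines of Python. This is only an illustration of "everything before the first occurrence of the suffix"; the function name `sample_name` is hypothetical and this is not GUNC's actual code.

```python
from pathlib import Path


def sample_name(path: str, file_suffix: str = "") -> str:
    """Return the part of the file name before the first occurrence of file_suffix.

    Hypothetical helper for illustration; not taken from GUNC.
    """
    name = Path(path).name
    if not file_suffix:
        # No suffix given: the full file name (suffix included) is kept.
        return name
    # str.partition splits at the FIRST occurrence of the separator.
    return name.partition(file_suffix)[0]


print(sample_name("genomes/sampleA.fna", ".fna"))        # sampleA
print(sample_name("genomes/sampleA.fna.gz", ".fna"))     # sampleA
print(sample_name("genomes/my.fna.sample.fna", ".fna"))  # my
print(sample_name("genomes/sampleA.fna"))                # sampleA.fna
```

Under this rule, two files such as my.fna.a.fna and my.fna.b.fna would end up with the same sample name, so it may be worth avoiding the suffix string inside file names.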

Biofarmer commented on August 23, 2024

Hi, many thanks for your reply. So the function of --file_suffix is just to set the sample name in the output, and it has no effect on the detection of chimerism in genomes. Do I understand correctly?

fullama commented on August 23, 2024

Correct! :)

Biofarmer commented on August 23, 2024

Okay, many thanks.

Biofarmer commented on August 23, 2024

Sorry, a further question about the databases: between proGenomes and GTDB, which one is generally recommended for detecting chimerism in genomes?
Thanks

defleury commented on August 23, 2024

Hi @Biofarmer !

Both databases work fine, we found little difference in accuracy. However, since the GUNC db based on proGenomes is smaller, it is faster to run so we use it by default.

Biofarmer commented on August 23, 2024

Okay, it is good to know. Thanks

Biofarmer commented on August 23, 2024

Hi, I am running GUNC on 10000 genomes with 5 threads. It has been running for 8 days and is now at the "Running Diamond" stage. There is no "diamond_output" folder in the output directory; is that normal?
In addition, is it possible to check the progress of diamond? I want to know how much longer I need to wait. If it will take much longer, I may kill the job and rerun it with more threads or split the genomes into parts.
PS: I previously ran another 10000 genomes with 10 threads, which finished in 3 days.
Thanks

fullama commented on August 23, 2024

There is currently no way of seeing the progress of diamond, and the run time can vary depending on the input. But maybe you just want to run them in smaller batches? If you are running so many genomes at once I would increase both CPUs and memory.
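
One way to prepare smaller batches is to split the genome paths into fixed-size chunks and write one list file per chunk. The sketch below assumes that --input_file accepts a text file with one FASTA path per line (worth checking against `gunc run --help`); the helper names, batch size and file naming are made up for illustration.

```python
from pathlib import Path
from typing import Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_batch_lists(genome_dir: str, out_dir: str,
                      suffix: str = ".fna", size: int = 1000) -> List[Path]:
    """Write batch_001.txt, batch_002.txt, ... each listing up to `size` genome paths.

    Assumption: each list file is later passed to one GUNC run via --input_file.
    """
    paths = sorted(str(p.resolve()) for p in Path(genome_dir).glob(f"*{suffix}"))
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, chunk in enumerate(batches(paths, size), start=1):
        list_file = target / f"batch_{index:03d}.txt"
        list_file.write_text("\n".join(chunk) + "\n")
        written.append(list_file)
    return written


# Ten hypothetical genomes in batches of four.
demo = [f"genome_{i}.fna" for i in range(10)]
print([len(chunk) for chunk in batches(demo, 4)])  # [4, 4, 2]
```

Each batch can then be submitted as its own job, which also makes it easier to see how far the whole set has progressed.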

Biofarmer commented on August 23, 2024

Hi, thanks for your reply. In addition, if I understand correctly, the gene-call files from all input genomes are merged. If so, does the label of each contig (the text after ">" but before the first space, which prodigal uses as the gene ID) need to be unique across all input genomes, or does it not matter? Thanks

fullama commented on August 23, 2024

It shouldn't matter. They are merged, but each is tagged with the name of its genome file so they can be separated after diamond has run.
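
The idea of tagging merged sequences with their genome name can be sketched as follows. This is an illustration of the general technique, not GUNC's code: the `_-_` delimiter is the one named further down this thread, while the exact ID layout (sample name first, then the contig ID) and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

DELIMITER = "_-_"  # delimiter reported later in this thread; ID layout is assumed


def tag(sample: str, contig_id: str) -> str:
    """Prefix a contig ID with its sample name so merged IDs stay unique."""
    return f"{sample}{DELIMITER}{contig_id}"


def split_by_sample(tagged_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Group tagged IDs back into per-sample lists of the original contig IDs."""
    per_sample: Dict[str, List[str]] = defaultdict(list)
    for tagged in tagged_ids:
        sample, _, contig_id = tagged.partition(DELIMITER)
        per_sample[sample].append(contig_id)
    return dict(per_sample)


# Two genomes that both contain a contig called "contig_1".
merged = [
    tag("genomeA", "contig_1"),
    tag("genomeB", "contig_1"),
    tag("genomeA", "contig_2"),
]
print(split_by_sample(merged))
# {'genomeA': ['contig_1', 'contig_2'], 'genomeB': ['contig_1']}
```

Because the sample name is part of every merged ID, identical contig labels in different genomes do not collide.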

Biofarmer commented on August 23, 2024

Okay, that's great. Thanks. The merged gene-call file is intermediate and is deleted once the run finishes, so it cannot be seen afterwards, right?

Biofarmer commented on August 23, 2024

In addition, --temp_dir defaults to the current working directory. If I submit several jobs at once from the same working directory, each with a different output directory, will GUNC pick the right temporary files for each job? Do the temporary files of each job have different names? Thanks

Biofarmer commented on August 23, 2024

Hi, may I ask for answers to the questions above?

fullama commented on August 23, 2024

I think it would be fine but try it out and see to be sure..
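
Since the answer above is not a firm guarantee, one cautious option is to give every job its own temporary directory so the question does not arise. The sketch below only builds the command lines; --input_dir, --file_suffix and --temp_dir appear in this thread, while the `gunc run` subcommand and the --out_dir flag name are assumptions to verify with `gunc run --help`. The function name and directory prefix are hypothetical.

```python
import shlex
import tempfile
from typing import List


def gunc_command(input_dir: str, out_dir: str, temp_root: str,
                 file_suffix: str = ".fna") -> List[str]:
    """Build one command line with a freshly created, uniquely named temp dir."""
    # mkdtemp creates a new directory with a unique name under temp_root.
    temp_dir = tempfile.mkdtemp(prefix="gunc_tmp_", dir=temp_root)
    return [
        "gunc", "run",                 # subcommand assumed
        "--input_dir", input_dir,
        "--file_suffix", file_suffix,
        "--out_dir", out_dir,          # flag name assumed
        "--temp_dir", temp_dir,
    ]


# Build commands for two hypothetical jobs sharing one working directory.
with tempfile.TemporaryDirectory() as root:
    for job in ("part1", "part2"):
        command = gunc_command(f"genomes_{job}", f"out_{job}", root)
        print(shlex.join(command))
```

The printed commands can be pasted into separate job scripts; each one points at a different temporary directory.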

Biofarmer commented on August 23, 2024

Hi, many thanks for your confirmation.

Biofarmer commented on August 23, 2024

Hi, I just used GUNC to check a genome from NCBI, GCF_902703415.1_Combinated_assembly_ONT_-_Illumina_genomic.fna, and it failed. However, when I renamed the genome so that it no longer contained _-_, every one of the following names worked with GUNC:

GCF_902703415.1_Combinated_assembly_ONT_---_Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT_--_Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT__Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT-Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT_Illumina_genomic.fna

May I ask why only the genome name containing _-_ did not work with GUNC?

GUNC also ran on genome GCF_902109435.1_40087_F01_genomic.fna, but the output contained no values. Is that due to the small size of the genome?

Thanks

Biofarmer commented on August 23, 2024

Another question: there is a slight difference in the number of n_genes_mapped in the output when running one genome individually (--input_fasta genome.fna) or running a few genomes together (--input_dir genome_folder/ --file_suffix .fna). May I ask whether it is normal and why?
Thanks

fullama commented on August 23, 2024

Samples with _-_ don't work because gunc uses it internally as a delimiter to label sequences with the sample's name when merging them together. I didn't think anyone would ever use _-_ in a sample name. I'll see if I can change it for the next version.
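
Until that changes, file names can be checked for the delimiter before a run. The following is a minimal pre-flight sketch with hypothetical helper names; it only inspects and rewrites name strings and does not rename anything on disk.

```python
from pathlib import Path
from typing import List

DELIMITER = "_-_"  # the internal delimiter described above


def names_with_delimiter(paths: List[str]) -> List[str]:
    """Return the paths whose file name contains the delimiter."""
    return [p for p in paths if DELIMITER in Path(p).name]


def safe_name(name: str) -> str:
    """Replace the delimiter with a single underscore until none is left."""
    # Looping handles names such as "a_-_-_b", where one pass leaves "a_-_b".
    while DELIMITER in name:
        name = name.replace(DELIMITER, "_")
    return name


files = [
    "GCF_902703415.1_Combinated_assembly_ONT_-_Illumina_genomic.fna",
    "GCF_902109435.1_40087_F01_genomic.fna",
]
for path in names_with_delimiter(files):
    print(path, "->", safe_name(path))
# GCF_902703415.1_Combinated_assembly_ONT_-_Illumina_genomic.fna -> GCF_902703415.1_Combinated_assembly_ONT_Illumina_genomic.fna
```

The suggested replacement matches one of the renamed files reported as working earlier in this thread.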

fullama commented on August 23, 2024

> Another question: there is a slight difference in the number of n_genes_mapped in the output when running one genome individually (--input_fasta genome.fna) or running a few genomes together (--input_dir genome_folder/ --file_suffix .fna). May I ask whether it is normal and why? Thanks

Can you give an example of where the output differs, so I can look more closely?

Biofarmer commented on August 23, 2024

> samples with _-_ dont work because internally gunc uses it as a delimiter to label sequences with the samples name when merging them together.. I didnt think anyone would ever use _-_ in a sample name.. ill see if i can change it for the next version..

Thanks for the reply.

Biofarmer commented on August 23, 2024

>> Another question: there is a slight difference in the number of n_genes_mapped in the output when running one genome individually (--input_fasta genome.fna) or running a few genomes together (--input_dir genome_folder/ --file_suffix .fna). May I ask whether it is normal and why? Thanks

> Can you give an example of where the output differs, so i can look more closely?

Hi, take the NCBI genomes GCF_902703415.1 and GCF_900232175.1 as an example: n_genes_mapped is 5145 and 6693 when they are tested together, and 5157 and 6740 respectively when tested individually. The overall conclusion is almost the same.
By the way, is the pass.GUNC result based on the third decimal place of clade_separation_score? I see some genomes reported with clade_separation_score 0.45 that are given False for pass.GUNC and others that are given True. So I wonder whether clade_separation_score is only reported to two decimal places while the pass.GUNC decision uses at least three.

fullama commented on August 23, 2024

> Hi, just take the genomes from NCBI (GCF_902703415.1 and GCF_900232175.1) for example, the n_genes_mapped is 5145 and 6693 when tested together, and the value will be 5157 and 6740 respectively when tested individually. The overall conclusion is almost same.

This will be fixed in the next version of gunc.

> may I ask whether the result of pass.GUNC is based on the third digit of clade_separation_score after decimal?

Yes. The output will also be amended to include more decimal places in the next version.
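
The effect can be reproduced with two made-up scores that both print as 0.45 but fall on different sides of a cutoff. The cutoff value of 0.45 and the inclusive comparison are assumptions inferred from the scores discussed in this thread, not taken from GUNC's code.

```python
CSS_THRESHOLD = 0.45  # assumed cutoff; inferred from the thread, not from GUNC's source


def passes(css: float, threshold: float = CSS_THRESHOLD) -> bool:
    """Decide on the full-precision score (inclusive boundary is an assumption)."""
    return css <= threshold


# Two hypothetical scores that look identical when shown with two decimals.
for css in (0.446, 0.454):
    print(f"reported={css:.2f}  full={css}  pass={passes(css)}")
# reported=0.45  full=0.446  pass=True
# reported=0.45  full=0.454  pass=False
```

Both rows show 0.45 in a two-decimal report, yet only the first one passes, which matches the behaviour described in the question.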
