
Comments (7)

mathog commented on August 29, 2024

Hmm, the numbers in the run above do not match those in README.md. Is that significant, or is the documentation just slightly out of sync with the software?

from redundans.

mathog commented on August 29, 2024

Inserted this before the original line 136 in fasta2homozygous.py

print "DEBUG 136 identities", identities, " sizes ", sizes, " i ", i #DEBUG

and reran the test. The log file now has:

 Reduction...
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size [%]    homozygous contigs       [%]
DEBUG 136 identities [0.9531478770131772, 0.949937106918239, 0.9471886495007882, 0.9524475524475524, 0.9423763386027537, 0.9504310344827587, 0.9414634146341463, 0.9590163934426229, 0.9799777530589544, 0.9449152542372882, 0.9450222882615156, 0.9526813880126183, 0.9495268138801262, 0.9509493670886076, 0.9456869009584664, 0.92914653784219, 0.9511400651465798, 0.947107438016529, 0.9576271186440678, 0.9449378330373002, 0.9396709323583181, 0.9481481481481482, 0.9398496240601504, 0.9580152671755725, 0.9496124031007752, 0.95703125, 0.9636363636363636, 0.9559748427672956, 0.9308176100628931, 0.9621848739495799, 0.952914798206278, 0.9565217391304348, 0.9564220183486238, 0.9597156398104265, 0.9501187648456056, 0.9569377990430622, 0.9553349875930521, 0.9428571428571428, 0.8882978723404256, 0.8457446808510638, 0.9547872340425532, 0.9491978609625669, 0.9438502673796791, 0.9592391304347826, 0.9536784741144414, 0.967032967032967, 0.9497206703910615, 0.9606741573033708, 0.9602272727272727, 0.9498525073746312, 0.878698224852071, 0.8372781065088757, 0.9497041420118343, 0.8169491525423729, 0.960960960960961, 0.963963963963964, 0.9637462235649547, 0.9636363636363636, 0.9465408805031447, 0.9577922077922078, 0.9802631578947368, 0.9503311258278145, 0.9662162162162162, 0.7941176470588235, 0.8577405857740585, 0.9646643109540636, 0.9672727272727273, 0.9550561797752809, 0.96484375, 0.8731707317073171, 0.9484978540772532, 0.9655172413793104, 0.9567099567099567, 0.9696969696969697, 0.9641255605381166, 0.7352941176470589, 0.6683417085427136, 0.732620320855615, 0.8108108108108109, 0.7812971342383107, 0.7035175879396985, 0.8206521739130435, 0.9585253456221198, 0.9626168224299065, 0.9626168224299065]  sizes  [6830, 3975, 3806, 3575, 1961, 1856, 1230, 976, 899, 708, 673, 634, 634, 632, 626, 621, 614, 605, 590, 563, 547, 540, 532, 524, 516, 512, 495, 477, 477, 476, 446, 437, 436, 422, 421, 418, 403, 385, 376, 376, 376, 374, 374, 368, 367, 364, 358, 356, 352, 339, 338, 338, 338, 295, 333, 333, 331, 
330, 318, 308, 304, 302, 296, 272, 239, 283, 275, 267, 256, 205, 233, 232, 231, 231, 223, 221, 199, 187, 185, 221, 199, 184, 217, 214, 214]  i  85
test/run1/contigs.fa    163897  245     66377   40.50   221     90.20   94.854  0       97520   59.50   24      9.80
<snip>
DEBUG 136 identities []  sizes  []  i 
Traceback (most recent call last):
<snip>
  File "/home/mathog/src/redundans/bin/fasta2homozygous.py", line 138, in fasta2skip
    print "DEBUG 136 identities", identities, " sizes ", sizes, " i ", i  #DEBUG   
UnboundLocalError: local variable 'i' referenced before assignment

In other words, "hits" in fasta2skip is empty the second time the function is called, whereas the first call found some. Is an empty "hits" on the second call a reasonable result for this version of the code? Even if that is not itself a problem, the code should handle that state, which it does not. So that much is certainly a bug.

Other than these debug-related changes the output is the same.
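
The state described above (a loop that never runs, so its index variable is never bound) can be guarded against along the lines of the following minimal sketch. It is written for Python 3, and it is not the actual fasta2skip code: the function name `summarise_hits`, the `(identity, size)` tuple input, and the size-weighted mean are all assumptions made for illustration.

```python
def summarise_hits(hits):
    """Return (number of hits, size-weighted mean identity) for (identity, size) pairs."""
    identities, sizes = [], []
    count = 0  # bound before the loop, so it exists even when hits is empty
    for count, (identity, size) in enumerate(hits, 1):
        identities.append(identity)
        sizes.append(size)
    if not identities:
        # Nothing matched: report zeros instead of dividing by zero below.
        return 0, 0.0
    weighted = sum(i * s for i, s in zip(identities, sizes)) / float(sum(sizes))
    return count, weighted
```

With an empty list this returns `(0, 0.0)` rather than raising, which is the behaviour the traceback above shows to be missing from the debug line.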


mathog commented on August 29, 2024

Did a clean install on a CentOS 7.4 system. Compared to previous system:

Python 2.7 in /usr/bin only (previous had it in /usr/local/bin, with 2.6 in /usr/bin)
lastal not in path (previous had a version of lastal in the path)
bwa not in path (previous had a version of bwa in the path)
parallel not in path (previous had a version of parallel in the path)
perl 5.16 in /bin/perl (previous had 5.20 in /home/mathog/perl5/perlbrew/perls/perl-5.20.0t/bin/perl)
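
A comparison like the list above can be collected the same way on each machine with a small script. The sketch below assumes Python 3 (for `shutil.which`); the helper name `resolve_tools` is hypothetical, and the tool list is simply the one mentioned in this comment.

```python
import shutil
import sys

def resolve_tools(names):
    """Map each tool name to the full path found on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    # Report the interpreter first, then each external tool.
    print("python", sys.version.split()[0], sys.executable)
    tools = resolve_tools(["lastal", "bwa", "parallel", "perl"])
    for name, path in sorted(tools.items()):
        print(name, path or "not in PATH")
```

Saving the output on both systems and diffing it shows which tools resolve to different locations.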

Test run:

./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1
Options: Namespace(fasta='test/contigs.fa', fastq=['test/5000_1.fq.gz', 'test/5000_2.fq.gz', 'test/600_1.fq.gz', 'test/600_2.fq.gz', 'test/pacbio.fq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x7f2e8b7281e0>, longreads=[], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='test/run1', overlap=0.8, reference='', resume=False, threads=4, verbose=True)

##################################################
[Thu Jan 11 10:17:28 2018] Reduction...
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size [%]     homozygous contigs       [%]
[WARNING] numpy or matplotlib missing! Cannot plot histogram
test/run1/contigs.fa    163897  245     66377   40.50   221     90.20   94.854  0       97520   59.50   24      9.80

##################################################
[Thu Jan 11 10:17:28 2018] Estimating parameters of libraries...
 Aligning 19504 mates per library...
Insert size statistics                          Mates orientation stats
FastQ files     read length     median  mean    stdev   FF      FR      RF      RR
test/5000_1.fq.gz test/5000_2.fq.gz     50      4986    4981.70 692.22  0       4067    14      0
test/600_1.fq.gz test/600_2.fq.gz       100     599     598.56  47.48   0       10000   0       0

##################################################
[Thu Jan 11 10:17:29 2018] Scaffolding...
 iteration 1.1: test/run1/contigs.reduced.fa    24      97520   39.355  17      94157   7321    2195    0       29603
   19505 pairs. 17325 passed filtering [88.82%]. 1641 in different contigs [8.41%].
    1526 pairs. 556 in different contigs [36.44%].
 iteration 1.2: test/run1/_sspace.1.1.fa        3       97829   39.344  3       97829   87528   6274    1024    87528
   19505 pairs. 17607 passed filtering [90.27%]. 185 in different contigs [0.95%].
    1188 pairs. 113 in different contigs [9.51%].
 iteration 2.1: test/run1/_sspace.1.2.fa        2       98197   39.344  2       98197   94170   94170   1392    94170
   19505 pairs. 15104 passed filtering [77.44%]. 720 in different contigs [3.69%].
    3420 pairs. 264 in different contigs [7.72%].
 iteration 2.2: test/run1/_sspace.2.1.fa        1       99484   39.344  1       99484   99484   99484   2679    99484
   19505 pairs. 15145 passed filtering [77.65%]. 0 in different contigs [0.00%].
    3396 pairs. 0 in different contigs [0.00%].

##################################################
[Thu Jan 11 10:17:37 2018] Gap closing...
 iteration 1.1: test/run1/scaffolds.fa  1       99484   39.344  1       99484   99484   99484   2679    99484
 iteration 1.2: test/run1/_gapcloser.1.1.fa     1       99503   39.483  1       99503   99503   99503   985     99503

[Thu Jan 11 10:17:39 2018] Final reduction...
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size [%]     homozygous contigs       [%]
[WARNING] numpy or matplotlib missing! Cannot plot histogram
test/run1/scaffolds.filled.fa   99504   1       0       0.00    0       0.00    0.000   0       99504   100.00  1       100.00

##################################################
[Thu Jan 11 10:17:39 2018] Reporting statistics...
#fname  contigs bases   GC [%]  contigs >1kb    bases in contigs >1kb   N50     N90     Ns      longest
test/contigs.fa 245     163897  40.298  24      117391  3975    233     0       29603
test/run1/contigs.fa    245     163897  40.298  24      117391  3975    233     0       29603
test/run1/contigs.reduced.fa    24      97520   39.355  17      94157   7321    2195    0       29603
test/run1/_sspace.1.1.fa        3       97829   39.344  3       97829   87528   6274    1024    87528
test/run1/_sspace.1.2.fa        2       98197   39.344  2       98197   94170   94170   1392    94170
test/run1/_sspace.2.1.fa        1       99484   39.344  1       99484   99484   99484   2679    99484
test/run1/_sspace.2.2.fa        1       99484   39.344  1       99484   99484   99484   2679    99484
test/run1/scaffolds.fa  1       99484   39.344  1       99484   99484   99484   2679    99484
test/run1/_gapcloser.1.1.fa     1       99503   39.483  1       99503   99503   99503   985     99503
test/run1/_gapcloser.1.2.fa     1       99504   39.483  1       99504   99504   99504   985     99504
test/run1/scaffolds.filled.fa   1       99504   39.483  1       99504   99504   99504   985     99504
test/run1/scaffolds.reduced.fa  1       99504   39.483  1       99504   99504   99504   985     99504

##################################################
[Thu Jan 11 10:17:39 2018] Cleaning-up...
#Time elapsed: 0:00:11.161135

That looks like it might be correct. It diverges from the CentOS 6.9 run at iteration 1.2 in scaffolding. Tried to make the 6.9 environment more like that on CentOS 7.4 with:

cd ~/src/redundans
export PATH=.:/bin:/usr/bin:/usr/sbin:/sbin
ln -s /usr/local/bin/python2.7 python
# lastal, bwa, parallel no longer in path, python 2.7 is, perl 5.10 is
rm -rf test/run1
./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1

Same results on this system as before.

Examined the contents of ~/src/redundans/test/run1 and found that the directory structure was different. On both there are directories named "_sspace.1.1" but the contents were not the same. The one which worked had:


ls -alR _sspace.1.1
_sspace.1.1:
total 124
drwxr-xr-x. 6 mathog biostaff  4096 Jan 11 10:17 .
drwxr-xr-x. 6 mathog biostaff  4096 Jan 11 10:17 ..
drwxr-xr-x. 2 mathog biostaff  4096 Jan 11 10:17 alignoutput
drwxr-xr-x. 2 mathog biostaff  4096 Jan 11 10:17 intermediate_results
drwxr-xr-x. 2 mathog biostaff  4096 Jan 11 10:17 pairinfo
drwxr-xr-x. 2 mathog biostaff  4096 Jan 11 10:17 reads
-rw-r--r--. 1 mathog biostaff 99522 Jan 11 10:17 _sspace.1.1.final.scaffolds.fasta

_sspace.1.1/alignoutput:
total 8
drwxr-xr-x. 2 mathog biostaff 4096 Jan 11 10:17 .
drwxr-xr-x. 6 mathog biostaff 4096 Jan 11 10:17 ..

_sspace.1.1/intermediate_results:
total 204
drwxr-xr-x. 2 mathog biostaff  4096 Jan 11 10:17 .
drwxr-xr-x. 6 mathog biostaff  4096 Jan 11 10:17 ..
-rw-r--r--. 1 mathog biostaff 99000 Jan 11 10:17 _sspace.1.1.formattedcontigs_min0.fasta
-rw-r--r--. 1 mathog biostaff 97893 Jan 11 10:17 _sspace.1.1.lib1.scaffolds.fasta

_sspace.1.1/pairinfo:
total 8
drwxr-xr-x. 2 mathog biostaff 4096 Jan 11 10:17 .
drwxr-xr-x. 6 mathog biostaff 4096 Jan 11 10:17 ..

_sspace.1.1/reads:
total 8
drwxr-xr-x. 2 mathog biostaff 4096 Jan 11 10:17 .
drwxr-xr-x. 6 mathog biostaff 4096 Jan 11 10:17 ..

the one which failed had:

 ls -alR _sspace.1.1
_sspace.1.1:
total 136
drwxrwxr-x  6 mathog mathog  4096 Jan 11 10:56 .
drwxrwxr-x 10 mathog mathog  4096 Jan 11 10:56 ..
drwxrwxr-x  2 mathog mathog  4096 Jan 11 10:56 alignoutput
drwxrwxr-x  2 mathog mathog  4096 Jan 11 10:56 intermediate_results
drwxrwxr-x  2 mathog mathog  4096 Jan 11 10:56 pairinfo
drwxrwxr-x  2 mathog mathog  4096 Jan 11 10:56 reads
-rw-rw-r--  1 mathog mathog   904 Jan 11 10:56 _sspace.1.1.final.evidence
-rw-rw-r--  1 mathog mathog 99507 Jan 11 10:56 _sspace.1.1.final.scaffolds.fasta
-rw-rw-r--  1 mathog mathog  1124 Jan 11 10:56 _sspace.1.1.logfile.txt
-rw-rw-r--  1 mathog mathog  1802 Jan 11 10:56 _sspace.1.1.summaryfile.txt

_sspace.1.1/alignoutput:
total 8
drwxrwxr-x 2 mathog mathog 4096 Jan 11 10:56 .
drwxrwxr-x 6 mathog mathog 4096 Jan 11 10:56 ..

_sspace.1.1/intermediate_results:
total 220
drwxrwxr-x 2 mathog mathog  4096 Jan 11 10:56 .
drwxrwxr-x 6 mathog mathog  4096 Jan 11 10:56 ..
-rw-rw-r-- 1 mathog mathog 99000 Jan 11 10:56 _sspace.1.1.formattedcontigs_min0.fasta
-rw-rw-r-- 1 mathog mathog  2328 Jan 11 10:56 _sspace.1.1_lib1.foundlinks.txt
-rw-rw-r-- 1 mathog mathog     0 Jan 11 10:56 _sspace.1.1_lib1.repeats.txt
-rw-rw-r-- 1 mathog mathog   420 Jan 11 10:56 _sspace.1.1.lib1.scaffolds
-rw-rw-r-- 1 mathog mathog   904 Jan 11 10:56 _sspace.1.1.lib1.scaffolds.evidence
-rw-rw-r-- 1 mathog mathog 97878 Jan 11 10:56 _sspace.1.1.lib1.scaffolds.fasta
-rw-rw-r-- 1 mathog mathog    34 Jan 11 10:56 _sspace.1.1.libraries.txt

_sspace.1.1/pairinfo:
total 208
drwxrwxr-x 2 mathog mathog   4096 Jan 11 10:56 .
drwxrwxr-x 6 mathog mathog   4096 Jan 11 10:56 ..
-rw-rw-r-- 1 mathog mathog      0 Jan 11 10:56 _sspace.1.1.lib1.pairing_distribution.csv
-rw-rw-r-- 1 mathog mathog 202638 Jan 11 10:56 _sspace.1.1.lib1.pairing_issues

_sspace.1.1/reads:
total 8
drwxrwxr-x 2 mathog mathog 4096 Jan 11 10:56 .
drwxrwxr-x 6 mathog mathog 4096 Jan 11 10:56 ..

There are also many more _sspace.1.1* in the run1 directory on the one that failed than there are on the one which completed. Bizarre.
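
Comparing two `ls -alR` listings by eye is error-prone, so a tree comparison along these lines may help. This is only a sketch: `relative_files` and `compare_trees` are hypothetical helper names, and it compares file names and sizes only, not contents.

```python
import os

def relative_files(root):
    """Return {relative path: size in bytes} for every file under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found[os.path.relpath(full, root)] = os.path.getsize(full)
    return found

def compare_trees(root_a, root_b):
    """Return (files only in a, files only in b, common files whose sizes differ)."""
    a, b = relative_files(root_a), relative_files(root_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    size_differs = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, size_differs
```

Pointing `compare_trees` at copies of the two `_sspace.1.1` directories would list the extra files directly.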

The 7.4 system has bash 4.2.46 and the 6.9 system has bash 4.1.2. Hard to believe that matters.

I don't see any error messages in the log files in run1.

Suggestions???

Thanks.


mathog commented on August 29, 2024

On a second CentOS 6.9 system, which mounts my ~/src from the first 6.9 machine and has pretty much identical software, the test was run again. This uses the exact same redundans install that was unable to complete the test on the first machine. On this machine it completed, but the results differ from those on the CentOS 7.4 machine! Here is this third set of "test" results:

./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1
Options: Namespace(fasta='test/contigs.fa', fastq=['test/5000_1.fq.gz', 'test/5000_2.fq.gz', 'test/600_1.fq.gz', 'test/600_2.fq.gz', 'test/pacbio.fq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x7f97e63341e0>, longreads=[], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='test/run1', overlap=0.8, reference='', resume=False, threads=4, verbose=True)

##################################################
[Thu Jan 11 12:47:47 2018] Reduction...
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size [%]     homozygous contigs     [%]
test/run1/contigs.fa    163897  245     66377   40.50   221     90.20   94.854  0       97520   59.50   24      9.80

##################################################
[Thu Jan 11 12:48:18 2018] Estimating parameters of libraries...
 Aligning 19504 mates per library...
Insert size statistics                          Mates orientation stats
FastQ files     read length     median  mean    stdev   FF      FR      RF      RR
test/5000_1.fq.gz test/5000_2.fq.gz     50      4986    4981.70 692.22  0       4067    14      0
test/600_1.fq.gz test/600_2.fq.gz       100     599     598.80  47.26   0       10000   0       0

##################################################
[Thu Jan 11 12:48:18 2018] Scaffolding...
 iteration 1.1: test/run1/contigs.reduced.fa    24      97520   39.355  17      94157   7321    2195    0       29603
   19505 pairs. 17330 passed filtering [88.85%]. 1658 in different contigs [8.50%].
    1506 pairs. 559 in different contigs [37.12%].
 iteration 1.2: test/run1/_sspace.1.1.fa        4       97554   39.344  3       97311   87541   5743    749     87541
   19505 pairs. 17640 passed filtering [90.44%]. 212 in different contigs [1.09%].
    1053 pairs. 109 in different contigs [10.35%].
 iteration 2.1: test/run1/_sspace.1.2.fa        3       97869   39.344  3       97869   87541   6301    1064    87541
   19505 pairs. 15114 passed filtering [77.49%]. 1294 in different contigs [6.63%].
    3412 pairs. 392 in different contigs [11.49%].
 iteration 2.2: test/run1/_sspace.2.1.fa        1       100051  39.344  1       100051  100051  100051  3246    100051
   19505 pairs. 15152 passed filtering [77.68%]. 0 in different contigs [0.00%].
    3392 pairs. 0 in different contigs [0.00%].

##################################################
[Thu Jan 11 12:48:33 2018] Gap closing...
 iteration 1.1: test/run1/scaffolds.fa  1       100051  39.344  1       100051  100051  100051  3246    100051
 iteration 1.2: test/run1/_gapcloser.1.1.fa     1       100546  39.563  1       100546  100546  100546  1412    100546

##################################################
[Thu Jan 11 12:48:34 2018] Final reduction...
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size [%]     homozygous contigs     [%]
test/run1/scaffolds.filled.fa   100547  1       0       0.00    0       0.00    0.000   0       100547  100.00  1       100.00

##################################################
[Thu Jan 11 12:48:35 2018] Reporting statistics...
#fname  contigs bases   GC [%]  contigs >1kb    bases in contigs >1kb   N50     N90     Ns      longest
test/contigs.fa 245     163897  40.298  24      117391  3975    233     0       29603
test/run1/contigs.fa    245     163897  40.298  24      117391  3975    233     0       29603
test/run1/contigs.reduced.fa    24      97520   39.355  17      94157   7321    2195    0       29603
test/run1/_sspace.1.1.fa        4       97554   39.344  3       97311   87541   5743    749     87541
test/run1/_sspace.1.2.fa        3       97869   39.344  3       97869   87541   6301    1064    87541
test/run1/_sspace.2.1.fa        1       100051  39.344  1       100051  100051  100051  3246    100051
test/run1/_sspace.2.2.fa        1       100051  39.344  1       100051  100051  100051  3246    100051
test/run1/scaffolds.fa  1       100051  39.344  1       100051  100051  100051  3246    100051
test/run1/_gapcloser.1.1.fa     1       100546  39.563  1       100546  100546  100546  1412    100546
test/run1/_gapcloser.1.2.fa     1       100547  39.562  1       100547  100547  100547  1407    100547
test/run1/scaffolds.filled.fa   1       100547  39.562  1       100547  100547  100547  1407    100547
test/run1/scaffolds.reduced.fa  1       100547  39.562  1       100547  100547  100547  1407    100547

##################################################
[Thu Jan 11 12:48:35 2018] Cleaning-up...
#Time elapsed: 0:00:48.213627


Comparing the two CentOS 6.9 machines...

PATH: identical
alias: identical
bash: same version 4.1.2
python --version: 2.7 (failed), 2.7.14 (passed) [CentOS 7.4: python 2.7.5]
site-packages: probably differences
perl: same version (same binary)

The directory structure in test/run1 when it ran well on the second CentOS 6.9 machine matched the CentOS 7.4 machine, not the first CentOS 6.9 machine (where the test failed).


mathog commented on August 29, 2024

Figured it out! Version numbers on failing machine were NumPy(1.9.2) and matplotlib(1.5.0). The similar 6.9 machine which worked had 1.13.3 and 2.1.0. So upgraded to NumPy(1.14.0) and matplotlib(2.1.1) on the problem machine and afterwards the redundans test would run to completion.
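
Since the culprit turned out to be package versions, a quick way to record them on each machine is sketched below. The helper names `module_version` and `version_tuple` are hypothetical, and the sketch does not encode any minimum version that redundans requires, because none is stated here. Note that versions have to be compared numerically: as plain strings, "1.9.2" sorts after "1.13.3".

```python
import importlib
import re

def module_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")

def version_tuple(version):
    """Turn '1.13.3' into (1, 13, 3) so that versions compare numerically."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)  # leading digits only, e.g. '0rc1' -> 0
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

if __name__ == "__main__":
    for name in ("numpy", "matplotlib"):
        print(name, module_version(name) or "not installed")
```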

The test results are not stable from run to run though, i.e.:

rm -rf test/run1
./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1 >/tmp/run1A 2>&1
rm -rf test/run1
./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1 >/tmp/run1B 2>&1
diff /tmp/run1A /tmp/run1B

and they differ in many lines.
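
A plain `diff` of the two logs also flags lines that change on every run for trivial reasons (timestamps, the memory address in the Options line, the elapsed time). The following sketch masks those before diffing so that only differences in results remain. The patterns are assumptions based on the log format shown earlier in this thread, and the names `normalise` and `changed_lines` are hypothetical.

```python
import difflib
import re

# Patterns for fields that change on every run regardless of the results.
TIMESTAMP = re.compile(r"\[\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}\]")
ADDRESS = re.compile(r"0x[0-9a-f]+")
ELAPSED = re.compile(r"^#Time elapsed: .*")

def normalise(line):
    """Mask timestamps, memory addresses and elapsed time in one log line."""
    line = TIMESTAMP.sub("[TIME]", line.rstrip("\n"))
    line = ADDRESS.sub("0xADDR", line)
    return ELAPSED.sub("#Time elapsed: MASKED", line)

def changed_lines(lines_a, lines_b):
    """Return the added and removed lines of a unified diff, after masking."""
    a = [normalise(line) for line in lines_a]
    b = [normalise(line) for line in lines_b]
    diff = list(difflib.unified_diff(a, b, "run1A", "run1B", lineterm="", n=0))
    # The first two entries are the '---' and '+++' file headers; skip them,
    # then keep only added and removed lines (hunk headers start with '@@').
    return [d for d in diff[2:] if d.startswith(("+", "-"))]
```

For the files above this would be called as `changed_lines(open("/tmp/run1A"), open("/tmp/run1B"))`.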


lpryszcz commented on August 29, 2024

Hi, Thanks a lot for solving that!
Yes, I've noticed that individual runs may produce slightly different results; this is due to snap-aligner. It's super fast and runs in multiple threads, and I guess the differences come from the fact that snap-aligner outputs reads in an ambiguous order (depending on which thread finishes first), while redundans processes only a certain number of reads to speed things up, so some reads may be included in one run and not in another. Using BWA MEM gave stable results, yet it's much slower, especially for larger genomes.
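
The effect described above can be illustrated with a toy simulation. This is a sketch only, not snap-aligner or redundans code: the read names, the limit of 100, and both helper functions are hypothetical. It shows that taking the first N records from a stream whose order varies between runs selects different reads, while imposing a fixed order first selects the same ones.

```python
import random

def first_n(alignments, limit):
    """Keep the first `limit` records in whatever order they arrive."""
    return alignments[:limit]

def first_n_sorted(alignments, limit):
    """Keep the first `limit` records after sorting by read name."""
    return sorted(alignments)[:limit]

# The same 1000 hypothetical read names, emitted in two different orders
# to mimic two runs in which the threads finished in a different sequence.
reads = ["read%04d" % i for i in range(1000)]
run_a = random.Random(1).sample(reads, len(reads))
run_b = random.Random(2).sample(reads, len(reads))

# Do the two runs select the same reads?
unstable = set(first_n(run_a, 100)) == set(first_n(run_b, 100))
stable = first_n_sorted(run_a, 100) == first_n_sorted(run_b, 100)
print("arrival order selects same reads:", unstable)
print("sorted order selects same reads:", stable)
```

Sorting the whole stream would give up much of the speed advantage, so this only demonstrates the mechanism rather than proposing a fix.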


lpryszcz commented on August 29, 2024

numpy/numpy#4219 - added a safety check in Redundans 37f832f

