Comments (25)

johnlees commented on August 26, 2024

This was fixed in #290

johnlees commented on August 26, 2024

Our error message isn't working properly here: it should be giving you the list of missing samples, and we need to fix that.

But ultimately this is being caused by a mismatch in samples between the clusters, network and database. Can you check the number of samples in each of these inputs? You can also use poppunk_info to help with this, see https://poppunk.readthedocs.io/en/latest/sketching.html#viewing-information-about-a-database
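
A minimal sketch of these checks, assuming the database prefix is GPS_v8, the files sit in a GPS_v8/ directory, and poppunk_info takes the prefix via --db as in the documentation linked above (adjust paths to your own layout):

    # Sample count in the sketch database, as reported by poppunk_info
    poppunk_info --db GPS_v8

    # Number of cluster assignments (lines minus the header)
    tail -n +2 GPS_v8/GPS_v8_clusters.csv | wc -l
    tail -n +2 GPS_v8/GPS_v8_external_clusters.csv | wc -l

    # Number of reference names backing the network; the network's own size is
    # printed as "Network loaded: N samples" when poppunk_assign runs
    wc -l < GPS_v8/GPS_v8.refs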

HarryHung commented on August 26, 2024

Thanks for getting back to me John.

The output of poppunk_info is

Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK database:		GPS_v8
Sketch version:			c42cd0e22f6ef6d5c9a2900fa16367e096519170
Number of samples:		42482
K-mer sizes:			14,17,20,23,26,29
Sketch size:			9984
Contains random matches:	True
Codon phased seeds:		False

There are 42483 lines in GPS_v8_clusters.csv (i.e. 42482 samples, excluding the header line)

There are 42483 lines in GPS_v8_external_clusters.csv (i.e. 42482 samples, excluding the header line)

And according to the above error output of PopPUNK, it seems the network has 42482 samples as well:

Network loaded: 42482 samples

As I did not create this version of the database, I am wondering whether there is a way to determine if this is a database issue or a general PopPUNK bug?

johnlees commented on August 26, 2024

Thanks for checking Harry.

> As I did not create this version of the database, I am wondering whether there is a way to determine if this is a database issue or a general PopPUNK bug?

It's hard to say, I think. Of course, assignment in general works in this version (it's covered by the CI). I suspect it's a database/files issue, but that's not to say that PopPUNK isn't at fault for picking up a bad default or being unhelpful with the error.

Any idea where the 42209 in the original error is coming from? How many lines does the .refs file in the model directory have? Have you/Steph tried using --model-dir to point at the original fit?

HarryHung commented on August 26, 2024

It seems you have a good hunch on the root cause.

The line count of GPS_v8.refs is indeed 42209.

I have checked with Steph and she ran the command (no --model-dir)

poppunk_assign --db GPS_v6 --distances GPS_v6/GPS_v6.dists --query query.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8

It seems that in this case, --model-dir is necessary to make the updated DB work?

Should I ask her to rerun the command as

poppunk_assign --db GPS_v6 --distances GPS_v6/GPS_v6.dists --model-dir GPS_v6 --query query.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8

Or is there a simpler fix?

If I understand the issue correctly, I am wondering why --model-dir is not a mandatory option when --update-db is used. Or is there a use case that requires the absence of --model-dir?
There seems to be an explanation at https://poppunk.readthedocs.io/en/latest/query_assignment.html#updating-the-database, but to be honest I don't think I really understand it.

johnlees commented on August 26, 2024

I think the command to try should be:

python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --model-dir ../../sl28/PopPUNK2/GPS_v6 --output output

Let's see if that works.

I think our expected use is that everything for the DB is in one folder and not versioned like this, so that when you point at the DB directory it contains everything that's needed.

The easier thing to do is just copy the fit from v6 over to the v8 directory, but I appreciate which files are needed isn't clear.
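
For reference, a minimal sketch of what copying the fit over might look like, assuming (as worked out later in the thread) that the fit is the _fit.pkl and _fit.npz pair and that the two databases sit in sibling directories; later comments show this alone may not be sufficient if the .refs file and network also disagree:

    # Hypothetical paths; the fit files carry the database prefix, so rename v6 -> v8
    cp GPS_v6/GPS_v6_fit.pkl GPS_v8/GPS_v8_fit.pkl
    cp GPS_v6/GPS_v6_fit.npz GPS_v8/GPS_v8_fit.npz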

If this is indeed the issue I will try and update the documentation to be clearer.

HarryHung commented on August 26, 2024

Thank you John!

I misunderstood the usage of --model-dir: I mistakenly thought it was associated with the --update-db option, but now I can see it asks poppunk_assign to use an alternative list of references.

I can confirm that the following command completes successfully:

python3 PopPUNK/poppunk_assign-runner.py --db ../../sl28/PopPUNK2/GPS_v8 --query qfile.txt --external-clustering ../../sl28/PopPUNK2/GPS_v8_external_clusters.csv --model-dir ../../sl28/PopPUNK2/GPS_v6 --output output --threads 32

So if I want to have everything in a single database directory and forgo the need to use --model-dir when running poppunk_assign, which files exactly should I copy (and change v6 in the file names to v8, I suppose) from the v6 to the v8 directory? And will such a process result in the loss of information gained by updating the database via --update-db?

If we want to maintain a versioned database, the best way forward seems to be to copy the database directory, and then run the command below in the newly created directory?

poppunk_assign --db GPS --query query.txt --update-db --output GPS --overwrite --external-clustering GPS_external_clusters.csv 

On another note, I notice that apart from the files listed in https://poppunk.readthedocs.io/en/latest/query_assignment.html#downloading-a-database, there are a few extra files in the database directory that are not mentioned in the docs:

  • GPS_v8.refs.dists.npy
  • GPS_v8.refs.dists.pkl
  • GPS_v8_refs_graph.gt
  • GPS_v8.refs.h5

Are they necessary? They double the size of the database directory (each of them has a non-.refs counterpart in the directory):

.
├── [583K]  GPS_v8_clusters.csv
├── [6.7G]  GPS_v8.dists.npy
├── [542K]  GPS_v8.dists.pkl
├── [589K]  GPS_v8_external_clusters.csv
├── [1.1K]  GPS_v8_fit.npz
├── [  26]  GPS_v8_fit.pkl
├── [ 20M]  GPS_v8_graph.gt
├── [4.3G]  GPS_v8.h5
├── [456K]  GPS_v8.refs
├── [6.6G]  GPS_v8.refs.dists.npy
├── [539K]  GPS_v8.refs.dists.pkl
├── [ 20M]  GPS_v8_refs_graph.gt
├── [4.2G]  GPS_v8.refs.h5
└── [ 15K]  GPS_v8_unword_clusters.csv

HarryHung commented on August 26, 2024

@johnlees
On a side note, I am wondering: when running poppunk_assign, does loading a huge database like GPS take up the majority of the run time?
I am considering whether I should run poppunk_assign separately on each assembly to increase the stability of my pipeline, instead of the current all-samples-in-one-run approach, but I will not opt for it if it carries a hefty time penalty.

johnlees commented on August 26, 2024

> So if I want to have everything in a single database directory and forgo the need to use --model-dir when running poppunk_assign, which files exactly should I copy (and change v6 in the file names to v8, I suppose) from the v6 to the v8 directory? And will such a process result in the loss of information gained by updating the database via --update-db?

It would be the _fit.pkl and _fit.npy files.

> If we want to maintain a versioned database, the best way forward seems to be to copy the database directory, and then run the command below in the newly created directory?
>
> poppunk_assign --db GPS --query query.txt --update-db --output GPS --overwrite --external-clustering GPS_external_clusters.csv

That's one option. You could also output to a new directory without --overwrite.

> On another note, I notice that apart from the files listed in https://poppunk.readthedocs.io/en/latest/query_assignment.html#downloading-a-database, there are a few extra files in the database directory that are not mentioned in the docs:
>
> * `GPS_v8.refs.dists.npy`
> * `GPS_v8.refs.dists.pkl`
> * `GPS_v8_refs_graph.gt`
> * `GPS_v8.refs.h5`
>
> Are they necessary? They double the size of the database directory (each of them has a non-.refs counterpart in the directory):
>
> .
> ├── [583K]  GPS_v8_clusters.csv
> ├── [6.7G]  GPS_v8.dists.npy
> ├── [542K]  GPS_v8.dists.pkl
> ├── [589K]  GPS_v8_external_clusters.csv
> ├── [1.1K]  GPS_v8_fit.npz
> ├── [  26]  GPS_v8_fit.pkl
> ├── [ 20M]  GPS_v8_graph.gt
> ├── [4.3G]  GPS_v8.h5
> ├── [456K]  GPS_v8.refs
> ├── [6.6G]  GPS_v8.refs.dists.npy
> ├── [539K]  GPS_v8.refs.dists.pkl
> ├── [ 20M]  GPS_v8_refs_graph.gt
> ├── [4.2G]  GPS_v8.refs.h5
> └── [ 15K]  GPS_v8_unword_clusters.csv

I think as long as you keep the GPS_v8.refs file, the rest can be removed.
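
A minimal sketch of that clean-up, assuming the file names and layout shown in the listing above (GPS_v8.refs itself is kept):

    # Remove the reference-only duplicates; keep GPS_v8.refs
    rm GPS_v8/GPS_v8.refs.dists.npy \
       GPS_v8/GPS_v8.refs.dists.pkl \
       GPS_v8/GPS_v8_refs_graph.gt \
       GPS_v8/GPS_v8.refs.h5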

johnlees commented on August 26, 2024

> @johnlees On a side note, I am wondering: when running poppunk_assign, does loading a huge database like GPS take up the majority of the run time? I am considering whether I should run poppunk_assign separately on each assembly to increase the stability of my pipeline, instead of the current all-samples-in-one-run approach, but I will not opt for it if it carries a hefty time penalty.

Yes, I think this will be most of the cost. Have you looked at the --serial option? That might be useful, as it does assignment one-by-one but only loads the database once.
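
For reference, a sketch of what a --serial run might look like, reusing the query file name from earlier in the thread; the exact combination of flags accepted alongside --serial is an assumption and worth checking against the documentation:

    poppunk_assign --db GPS_v8 --query qfile.txt --serial --output serial_output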

HarryHung commented on August 26, 2024

Thanks for the info!

I will try to consolidate the files into a single directory and/or remove unnecessary files in the published database for assignment, then report back to you.

HarryHung commented on August 26, 2024

It seems the full description of --serial is not available in the PopPUNK documentation. But if I want to run PopPUNK for each sample as an individual job in different containers (which is the preferred way in Nextflow), it seems --serial would not help.

I think it is fine for now; I will just run PopPUNK for all samples at once and split the results. I will only need to reconsider this approach if someone complains about the stability of PopPUNK (for now, poppunk_assign seems quite stable, as long as the database itself is okay).
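
A minimal sketch of the splitting step, assuming the combined assignments land in a two-column Taxon,Cluster CSV like the *_clusters.csv files above (file and directory names are illustrative):

    # Write one single-sample CSV per taxon from the combined cluster assignments
    mkdir -p per_sample
    header=$(head -n 1 output/output_clusters.csv)
    tail -n +2 output/output_clusters.csv | while IFS=, read -r taxon cluster; do
        printf '%s\n%s,%s\n' "$header" "$taxon" "$cluster" > "per_sample/${taxon}.csv"
    done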

HarryHung commented on August 26, 2024

> It would be the _fit.pkl and _fit.npy files.

@johnlees
I guess it should be _fit.npz, as there is no _fit.npy.

Based on md5sum, the following have the same content:

  • GPS_v6_fit.pkl vs GPS_v8_fit.pkl
  • GPS_v6_fit.npz vs GPS_v8_fit.npz

I tried replacing the original GPS_v8_fit.pkl and GPS_v8_fit.npz with GPS_v6_fit.pkl and GPS_v6_fit.npz (also renaming v6 to v8) anyway, but the same error occurs.

I notice that when I point to GPS_v6 with --model-dir (while --db points to GPS_v8), the network file being loaded is different:

  • --db pointing to GPS_v8 with GPS_v8_fit.pkl and GPS_v8_fit.npz replaced, but without --model-dir:

        Loading network from ../poppunk_db/GPS_v8/GPS_v8_graph.gt
        Network loaded: 42482 samples

  • --db pointing to GPS_v8 and --model-dir pointing to GPS_v6:

        Loading network from ../poppunk_db/GPS_v6/GPS_v6_graph.gt
        Network loaded: 42163 samples


This might be related to the fact that when --model-dir is used, model_prefix is replaced:

    if model_dir is not None:
        model_prefix = model_dir

Subsequently, the replaced model_prefix is used to read more files:

        ref_file_name = os.path.join(model_prefix,
                        os.path.basename(model_prefix) + file_extension_string + ".refs")

Therefore, simply replacing the two _fit.pkl and _fit.npz files does not seem to be enough.
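
A quick way to see which reference list each choice ends up reading, using the paths from the two runs above:

    # The .refs file is resolved from model_prefix, so the two runs read different lists
    wc -l ../poppunk_db/GPS_v6/GPS_v6.refs ../poppunk_db/GPS_v8/GPS_v8.refs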

I would be grateful if you could suggest what I should do next. If I keep replacing files one by one, I am afraid I might end up reverting everything back to GPS_v6. Or, at this point, should we just re-generate v8 using the previously mentioned command with the --overwrite option?

johnlees commented on August 26, 2024

I think I need to do two things here (in #279):

  1. Make sure the overwrite option works properly
  2. Check that the .refs file in the output directory after --update-db is correct

HarryHung commented on August 26, 2024

Sounds good to me. Thanks John!

HarryHung commented on August 26, 2024

Sorry for bothering you again @johnlees,
I am wondering whether branch i273 of #279 is ready for testing, or whether I should wait for it to be merged into master and test on the master branch instead?

Thanks!

johnlees commented on August 26, 2024

Sorry, there was a difficult issue with one of the dependencies which took a while to resolve. I think i273 should be ok, but I wanted to run a final test before merging it. I won't have time to do that for a while, so I would go ahead and use that branch; it should fix your issue (and if it doesn't, then I'll know what to change further!)

HarryHung commented on August 26, 2024

Sure. I will test it out and let you know how it goes. Thanks!

HarryHung commented on August 26, 2024

Sadly, it is still not working properly.

I have noticed there is another line blocking the --overwrite option from functioning properly, and I have attempted to fix that in PR #282.

However, even with that fix in place, running

python3 /path/to/poppunk_assign-runner.py --db GPS_v6 --query query_for_v8_update.txt --external-clustering GPS_v6_external_clusters.csv --update-db --overwrite --output GPS_v6 --threads 32

still leads to the following error:

PopPUNK: assign
	(with backend: sketchlib v2.1.1
	 sketchlib: /path/to/miniconda3/envs/pp_env/lib/python3.10/site-packages/pp_sketchlib.cpython-310-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences


Graph-tools OpenMP parallelisation enabled: with 32 threads
Overwriting db: GPS_v6/GPS_v6.h5
Sketching 319 genomes using 32 thread(s)

Progress (CPU): 0 / 319
Progress (CPU): 3 / 319
Progress (CPU): 42 / 319
Progress (CPU): 69 / 319
Progress (CPU): 103 / 319
Progress (CPU): 135 / 319
Progress (CPU): 167 / 319
Progress (CPU): 201 / 319
Progress (CPU): 236 / 319
Progress (CPU): 272 / 319
Progress (CPU): 298 / 319
Progress (CPU): 319 / 319
Writing sketches to file
Loading previously refined model
Completed model loading
Missing sketch: Unable to open the group "/sketches/10050_2#1": (Symbol table) Object not found
Could not find random match chances in database, calculating assuming equal base frequencies
Traceback (most recent call last):
  File "/path/to/poppunk_assign-runner.py", line 13, in <module>
    main()
  File "/path/to/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/path/to/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/path/to/PopPUNK/assign.py", line 474, in assign_query_hdf5
    qrDistMat = queryDatabase(rNames = rNames,
  File "/path/to/PopPUNK/sketchlib.py", line 575, in queryDatabase
    distMat = pp_sketchlib.queryDatabase(ref_db_name=ref_db,
RuntimeError: Query with empty ref or query list!

It also seems to render the partially overwritten database unusable.

If I only remove --overwrite and output to another directory:

python3 /path/to/poppunk_assign-runner.py --db GPS_v6 --query query_for_v8_update.txt --external-clustering GPS_v6_external_clusters.csv --update-db --output GPS_v8 --threads 32

the same command completes without error.

But then the original issue remains: without --overwrite and outputting to another directory, when I use the newly created GPS_v8 as it is, without --model-dir pointing back to GPS_v6, it results in the following error when assigning lineages to samples:
ERROR: There are 42482 vertices in the network but 42209 reference names supplied; please check the '--model-dir' variable is pointing to the correct directory

Please let me know if you have any idea on this @johnlees. Thanks!

johnlees commented on August 26, 2024

I think the solution should be the following:

  • When running with --update-db, we should automatically copy the required files to the new output directory, including the model.
  • We remove overwrite as an option from assign; it adds too much complexity.
  • We properly document how to update databases, adding a new page.

@nickjcroucher does this seem reasonable to you?

HarryHung commented on August 26, 2024

That sounds great, John. If you and Nick agree on that and push out an update, I am more than happy to test again.

nickjcroucher commented on August 26, 2024

@johnlees - yes, I agree, that behaviour makes sense. Should we accompany the database updating with some kind of versioning as well? Or have some other way of identifying/archiving the "parent" database and replacing it with the latest one? I'm just thinking about continuous updating as a backend to online systems, but I'm sure better-thought-through approaches already exist.

HarryHung commented on August 26, 2024

@johnlees I am just wondering if you have an estimate of when the new update process will be available? Thanks for all the amazing work so far!

johnlees commented on August 26, 2024

Realistically, the end of this year at best. We could talk about this (in person) if we need it urgently.

HarryHung commented on August 26, 2024

Thanks for the update! I will check with my team regarding its urgency.
