GithubHelp home page GithubHelp logo

Comments (9)

sea-shunned avatar sea-shunned commented on August 17, 2024

Hi Beatrice, thanks for raising this. Just to clarify, are you saying that the subtype labelled in the PVD, e.g. "Subtype 2", may represent individuals that are not labelled as 2 in ml_subtype? (Ignore 0 vs 1 indexing for the purposes of this)

from pysustain.

Bea-Taylor avatar Bea-Taylor commented on August 17, 2024

Hi Cameron, yes. For example, according to the PVD I have (n=55) in "Subtype 3", and (n=59) in "Subtype 4", however when I look at the output of ml_subtype I have (n=59) in "Subtype 3" and (n=55) in "Subtype 4".

from pysustain.

noxtoby avatar noxtoby commented on August 17, 2024

Probably the fault of the plotting routines

from pysustain.

sea-shunned avatar sea-shunned commented on August 17, 2024

The PVD order is (usually) defined as:

self._plot_subtype_order = np.argsort(ml_f_EM)[::-1]

whereas from a quick glance the ml_subtype order is defined as:

temp_mean_f = np.mean(samples_f, axis=1)
ix = np.argsort(temp_mean_f)[::-1]

Both refer to the subtype size, but the former is from the size in the final EM step for that N-subtype model (whatever pickle file you have loaded), whereas the latter comes from the proportion over the MCMC samples (if I am not mistaken).

These should usually be the same, but I guess for slight differences like this they diverge. A quick fix would be to calculate both the same way. Before doing this, it's good to check that there wasn't a deliberate reason that each method was used where it was. @LeonAksman do you know? If there's no strong reason, we can replace one with the other. Possibly best to use the MCMC version, but that's just my opinion, happy to hear others.

from pysustain.

ayoung11 avatar ayoung11 commented on August 17, 2024

I think the problem is to do with the plotting routines. One possible issue is the one Cameron mentioned. But the other issue that keeps cropping up is that the number of subjects in the plotting routines is calculated using the model fraction multiplied by the number of subjects (see e.g. line 629 in ZscoreSustain.py). This gives a different output to assigning individuals to their maximum likelihood subtype because the fraction indicates the expected proportion of individuals rather than the actual proportion of individuals. So it's also possible that what appears as switching (n=59) to (n=55) may actually be a coincidence in this case due to the discrepancy in the way of calculating the number between the PVDs and ml_subtype. I would suggest the following:

(a) The PVDs should be altered to calculate the number based on ml_subtype (or for now just display the fraction)

(b) In the PVDs we use samples_f to order the subtypes, consistent with ml_subtype

I found another bug in the plotting too for the colours of the z-score model when the number of z-scores is greater than three so planning to work on both issues as soon as I can manage to get to them. If you were planning on making changes yourself Cameron we could discuss and coordinate - let me know via email.

from pysustain.

biondof avatar biondof commented on August 17, 2024

Hi! I also encountered this problem in multiple ways. For example, the PVD indicates subtype1 n=494 and subtype2 n=296. However, the output table suggests completely different numbers, subtype1=726 and subtype2=64. In addition, the cross-validation occasionally flips the subtypes, such that the cross-validated PVD for subtype1 from looks like the PVD for subtype 2 from the non-cross-validated result (rather than being consistent with the labelling). Finally, as @ayoung11 mentions, the colour for z-score 4 is pink which makes it confusing to read as z-score 2 is also pink.

from pysustain.

sea-shunned avatar sea-shunned commented on August 17, 2024

Grappled with this a little bit recently so adding thoughts here. When attributing names to the subtypes, and plotting with e.g. stage zero individuals removed, one may want to change the order of the PVDs. Trying to change this can lead to unexpected situations where the link between ml_subtype == 1 and "Subtype 1" is broken, particularly as the order is probably already different to begin with. This then gets unintuitive, as using the subtype_order argument will not lead to expected results. For example, you may want to e.g. reverse the order entirely. For a 3 subtype model, you could then add subtype_order=[2,1,0] when calling the plot_positional_var method, but if the default subtype order (calculated as np.argsort(ml_f_EM)[::-1]) is not [0,1,2] then the result will not be the subtypes in reverse order. Thus the correspondance between ml_subtype and what is shown in the PVDs becomes unclear and confusing, and can no doubt lead to issues.

To address this, in addition to making the numbers more reflective as discussed above, there probably should be no reordering of subtypes by default, ensuring that the order shown in the PVD is always reflective of the numbers in ml_subtype. Custom reordering is then an option of course, but in that scenario the user will be fully aware that it is being done and what it therefore means.

Happy to hear thoughts on this so we can try and fix this issue.

from pysustain.

noxtoby avatar noxtoby commented on August 17, 2024

Definitely agree that the default should be no reordering.

This doesn't solve the CV situation, where one subsample (/fold) could produce a different set of subtypes, making combination across folds...challenging at best. How to best handle this will depend on the goal of the CV experiment.

from pysustain.

Zeena-Shawa avatar Zeena-Shawa commented on August 17, 2024

Just wanted to confirm that @sea-shunned's comment on the unclear correspondence between the default PVD order and subtype_order method is definetely true. For my analysis I have 11 subtypes and had to order as follows to get the same order as provided by ml_sequence_EM: subtype_order = [0,7,8,4,6,1,10,5,9,3,2], and this order does not correspond to just ordering what is seen on the PVD, nor to the indices of the re-ordering (not sure what it does correspond to).

Grappled with this a little bit recently so adding thoughts here. When attributing names to the subtypes, and plotting with e.g. stage zero individuals removed, one may want to change the order of the PVDs. Trying to change this can lead to unexpected situations where the link between ml_subtype == 1 and "Subtype 1" is broken, particularly as the order is probably already different to begin with. This then gets unintuitive, as using the subtype_order argument will not lead to expected results. For example, you may want to e.g. reverse the order entirely. For a 3 subtype model, you could then add subtype_order=[2,1,0] when calling the plot_positional_var method, but if the default subtype order (calculated as np.argsort(ml_f_EM)[::-1]) is not [0,1,2] then the result will not be the subtypes in reverse order. Thus the correspondance between ml_subtype and what is shown in the PVDs becomes unclear and confusing, and can no doubt lead to issues.

To address this, in addition to making the numbers more reflective as discussed above, there probably should be no reordering of subtypes by default, ensuring that the order shown in the PVD is always reflective of the numbers in ml_subtype. Custom reordering is then an option of course, but in that scenario the user will be fully aware that it is being done and what it therefore means.

Happy to hear thoughts on this so we can try and fix this issue.

from pysustain.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.