
Comments (10)

noxtoby commented on August 17, 2024

Advice: only look at positive z-scores, and don't shift your data as you've described. Negative z-scores are not of interest for disease progression because they are on the "normal" end of biomarker measurements.

That error is a new one for me. I'll take a guess. It looks like some data cannot be assigned to an ml_subtype (hence NaN). I suspect it's related to the weird things you've done with your data (shifting z-scores to the right). By including lots of not-abnormal measurements centred around the normal average (z=0), you might be trying to force an unidentifiable multiple-cluster solution. Perhaps a cluster that is essentially noise could result in NaN values for ml_subtype. For sure, your resulting subtypes and stages would make little if any sense.

katrinaCode commented on August 17, 2024

Hi Neil,

Thank you — we have a significant proportion of our data with negative z-scores (between 20% and 80% depending on the biomarker). My understanding of z-scores, in a purely mathematical sense, is that only a z-score of 0 is "normal" and any other score indicates abnormality. Could you elaborate to help me understand how negative z-scores are considered "normal" for this application, and do you have any suggestions to minimize data loss, given that we have so much negative data?

Thank you for your suggestion. I've tried running the notebook again without the z-score shift and with all negative z-scores set to 0. I had done this in a previous notebook that ran without issue. The error unfortunately persists.

Thanks again!

noxtoby commented on August 17, 2024

It's explained in the methods section of the original SuStaIn paper, and in the tutorial notebooks in this repo.

In brief:

  • z-scores in SuStaIn are calculated relative to a sample of "controls", and must be constructed such that z>0 is the abnormal direction (multiply by -1 if necessary); z<0 is the "normal" direction, by definition in the code (see the sketch just after this list).
  • You shouldn't be modifying your data artificially, such as by setting negative z-scores to zero!
  • You should be choosing SuStaIn hyperparameters carefully: in this case, choosing only positive z-score events, e.g., via the Z_vals parameter in the tutorial notebook.
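
For illustration, here is a minimal sketch of that z-score construction on synthetic data (the array shapes, the number of controls, and the "decreasing" biomarker are all hypothetical; this is not the tutorial code itself):

import numpy as np

rng = np.random.default_rng(0)

# hypothetical example: 100 subjects x 3 biomarkers, first 40 are controls
data = rng.normal(size=(100, 3))
controls = data[:40]

# z-score every subject relative to the control mean and standard deviation
zdata = (data - controls.mean(axis=0)) / controls.std(axis=0)

# flip any biomarker that decreases with disease so that z > 0 is always
# the abnormal direction (here, pretend the third biomarker decreases)
zdata[:, 2] *= -1

# negative z-scores remain in zdata and should be left alone; only the
# event thresholds (Z_vals) passed to SuStaIn need to be positive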

katrinaCode commented on August 17, 2024

Hi Neil, thank you!

Yes, I've read the paper(s) and the tutorials extensively but still needed the clarification, so thank you for your comment. I understand that z>0 is defined as the abnormal direction; however, we still mathematically end up with some negative z-scores (which is unavoidable, e.g. if the control mean and standard deviation are 0 and 1 respectively, as in "Preparing Data for SuStaIn"), so I am trying to understand how best to handle those. For context, we are following a previous paper that included their cognitively normal/control population in the SuStaIn input, which is why we are not excluding patients within the control distribution.

If the distribution shouldn't be shifted, nor the negative values zeroed out or removed, I am not sure what other options there are. My understanding from the literature was that SuStaIn could not accept negative inputs, but does your last point imply that the data can be negative as long as Z_vals is strictly positive? I had been looking into this independently and saw that line 69 of ZscoreSustain.py specifically states that the z-scores need to be positive. Apologies if I am misinterpreting, and thank you for taking the time to explain.

The same ValueError still occurs after removing the z-score shift as suggested, and when running with only positive values (negative scores zeroed, which led to complete runs in previous versions of the notebook, though I agree with your point about artificially modifying the data). I also ran without zeroing out negative z-scores, as per my interpretation of your comment (so there were negative scores in the data), with the same error. The issue is not resolved. Do you have any other suggestions?

Thank you for your help!

noxtoby commented on August 17, 2024

Correct — it's the direction of abnormality that needs to be positive, not the values. Of course the data (both from patients and controls) will contain negative values after z-scoring. Apologies on behalf of the developers, but — unless I'm mistaken — the comment in line 69 of ZscoreSustain.py is poorly worded and should mean that the direction of abnormality needs to be positive.

I'm almost certain that the hyperparameter event thresholds in Z_vals all need to be positive. Even if not (I've never tested the code with negative z-score events), negative event thresholds make no sense, as they would be in the opposite direction to disease progression. Why would anyone care if a disease subtype has an early event that amounts to biomarker abnormality that is more normal than the average non-diseased control?
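
For example, the event thresholds might be set up along these lines (a sketch with arbitrary values; the constructor call is shown only as a comment because the exact signature may differ between pySuStaIn versions, so check it against the tutorial notebook):

import numpy as np

# three positive z-score events per biomarker (abnormal direction only)
Z_vals = np.array([[1, 2, 3],
                   [1, 2, 3],
                   [1, 2, 3]])   # shape: (n_biomarkers, n_events)

# upper z-score reached by each biomarker at the end of progression
Z_max = np.array([5, 5, 5])

# the model is then built roughly as in the tutorial notebook, e.g.
# pySuStaIn.ZscoreSustain(zdata, Z_vals, Z_max, biomarker_labels,
#                         N_startpoints, N_S_max, N_iterations_MCMC,
#                         output_folder, dataset_name, use_parallel_startpoints)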

In that spirit, it feels pertinent to state that clustering is not a magical weapon. The user needs to carefully consider the input data and the model. For example, sensible feature selection would exclude features that amount to noise for a given hyperparameter configuration. And hyperparameter configuration needs to respect the available disease signal in your input features. I have a paper in preparation about this.

"For context, we are following a previous paper that included their cognitively normal/control population in the SuStaIn input, which is why we are not excluding patients within the control distribution."

I presume "patients within the control distribution" means patients with normal-looking measurements. Of course some patients will have measurements well within normal limits, e.g., z<1. Such a biomarker might end up at the end of a data-driven sequence, but it certainly can be included in the model if desired.

And it's fine for the user to include whichever samples they like when training a data-driven model (including measurements from controls), but the model will have to be interpreted with that in mind. If controls will never develop the disease (as is usually, but not always the case), then there's a strong argument for excluding them from the training set.

In my opinion pySuStaIn is not the primary source of your issue. A little forethought is needed in terms of data science / machine learning good practice, e.g., feature selection and hyperparameter tuning as mentioned above. If you don't have a data scientist or statistician on your team, I suggest finding one or two.

katrinaCode commented on August 17, 2024

Hi Neil,

Thank you for taking the time to respond, and for clarifying the positive z-scores comment from the code. We had been struggling to find a reason why the data should not contain negative scores, and confirming that it can is very helpful. I apologize if there was any ambiguity in my previous comment: I have never used negative Z_vals, and I understand their interpretation and why they must be positive.

I appreciate your comments and feedback and will be discussing the issues you raised with my group. We would appreciate it if a second look could be taken at the error, or if the issue could be re-opened, as we have ensured the input follows the required constraints as outlined in the literature (and as clarified by you, thank you), and we are still unable to run the 0-subtype/1-cluster problem to completion due to this ValueError (despite having previous success). It routinely fails with the ValueError on the fourth set of MCMC iterations, no matter how many start points and MCMC iterations are used (tested 10 start points & 1e3 MCMC iterations; 15 & 1e4; and 25 & 1e6).

During this testing, I noticed a strange behaviour. Despite using a new console, clearing all variables, and restarting Spyder between tests, the sets of MCMC iterations did not all have the same length as expected: with N_iterations_MCMC = int(1e6), the first three sets of MCMC iterations had 1e4 iterations and the fourth had 1e6, i.e.:
MCMC Iteration: 0%| | 0/10000 [00:00<?, ?it/s]
MCMC Iteration: 0%| | 0/10000 [00:00<?, ?it/s]
MCMC Iteration: 0%| | 0/10000 [00:00<?, ?it/s]
MCMC Iteration: 0%| | 0/1000000 [00:00<?, ?it/s]
[... ValueError: cannot convert float NaN to integer]
(The Spyder console does not show the tqdm updates to the MCMC iterations, so it only shows 0%; that is not the issue.) So there are two issues/behaviours occurring at the fourth set. I'd be happy to open a new issue for this MCMC-iterations behaviour if that would be helpful. The model is not loading in from any previous solutions.

Thank you!

noxtoby commented on August 17, 2024

Give a debugging tool a go. If that doesn't help you to isolate the source of the NaN, then I'm happy to reopen the issue.

But we probably can't help much on this end (having not seen this error before), unless you provide a minimal working example that reproduces the error. In this case, an MWE probably requires the data, or synthetic data that closely resembles the real data (and reproduces the error, of course).
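
For instance, a minimal sketch of what such synthetic data might look like (all sizes and distribution parameters here are hypothetical, chosen only to mimic the shape of real z-scored inputs):

import numpy as np

rng = np.random.default_rng(42)

n_controls, n_patients, n_biomarkers = 100, 200, 4

# controls are standard normal by construction after z-scoring
controls = rng.normal(0.0, 1.0, size=(n_controls, n_biomarkers))

# patients get a modest positive shift, so plenty of negative values remain,
# as in real z-scored data
patients = rng.normal(1.0, 1.5, size=(n_patients, n_biomarkers))

zdata = np.vstack([controls, patients])

# quick sanity check before handing the array to SuStaIn
assert not np.isnan(zdata).any(), "input already contains NaN"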

Tonnar commented on August 17, 2024

@katrinaCode Hello - I am currently having a similar issue; did you ever find a workaround?

katrinaCode commented on August 17, 2024

Hi @Tonnar, yes! @KangMSPeter from my lab created this fix, which has worked for both of us:

  1. Add 1e-250 to the denominator of the total_prob_subtype_norm calculation on line 556 of the subtype_and_stage_individuals function in AbstractSustain.py, to prevent division by zero, like so:
total_prob_subtype_norm = total_prob_subtype / (np.tile(np.sum(total_prob_subtype, 1).reshape(len(total_prob_subtype), 1), (1, N_S)) + 1e-250)

  2. Then, replace the try/if-else statements on lines 578 to 588 of the subtype_and_stage_individuals function in AbstractSustain.py with this:
try:
    # this_subtype is sometimes a scalar and sometimes a nested array,
    # so fall back to its first element if direct assignment fails
    ml_subtype[i] = this_subtype
except:
    ml_subtype[i] = this_subtype[0][0]
if this_prob_subtype.size == 1:
    if this_prob_subtype == 1:
        prob_ml_subtype[i] = 1
    else:
        prob_ml_subtype[i] = this_prob_subtype
else:
    try:
        # index the subtype probability with the same fallback as above
        prob_ml_subtype[i] = this_prob_subtype[this_subtype]
    except:
        prob_ml_subtype[i] = this_prob_subtype[this_subtype[0][0]]

I hope this helps!

Tonnar commented on August 17, 2024

@katrinaCode Thank you so much for your response! This looks like it will help a ton!
