
Comments (7)

yzhao062 commented on May 18, 2024

This is indeed a bug. Thanks for spotting this. I will investigate in the afternoon and try to provide a fix.


Maddosaurus commented on May 18, 2024

Sorry, I should have provided additional information in the first place.

I am working with Python 3.6.8 (Linux x64) with the following modules:

Keras==2.2.4
numba==0.43.1
numpy==1.16.2
pandas==0.23.4
pyod==0.6.9
scikit-learn==0.20.3
tensorflow==1.13.1

Please note that I have replaced my training data with your data generator (the outcome is still the same).

from pyod.models.auto_encoder import AutoEncoder
from pyod.utils.data import generate_data

X_train, y_train, X_test, y_test = \
        generate_data(n_train=1000,
                      n_test=100,
                      n_features=20,
                      contamination=0.2,
                      random_state=42)

clf_name = 'AutoEncoder'
clf = AutoEncoder(
    epochs=1,
    batch_size=1024,
    hidden_activation='relu', 
    output_activation='sigmoid',
    optimizer='adam',
    loss='binary_crossentropy',
    dropout_rate=0.3,
    l2_regularizer=0.1,
    validation_size=0.1,
    preprocessing=True,
    verbose=1,
    random_state=42,
    contamination=0.2,
    hidden_neurons=[10,5,5,10]
)
clf.fit(X_train)

y_test_scores = clf.predict_proba(X_test)

print("predict_proba array shape: ", y_test_scores.shape)
print(y_test_scores[0:10])

This yields

predict_proba array shape:  (100, 2)
[[0.92597692 0.07402308]
 [0.94946049 0.05053951]
 [0.9580671  0.0419329 ]
 [0.95641818 0.04358182]
 [0.95392434 0.04607566]
 [0.97283389 0.02716611]
 [0.97295465 0.02704535]
 [0.9476153  0.0523847 ]
 [0.95920739 0.04079261]
 [0.96109227 0.03890773]]


yzhao062 commented on May 18, 2024

Sorry, I just misread the thread; it is not a bug. For predict_proba, the first column is the probability of being an inlier, and the second column is the probability of being an outlier. That is why the first column + second column = 1.

I understand why the result looks suspicious: you need to print all 100 test samples (the last 20 are the outliers). If you only show the first 10, you will not see any outliers.

predict_proba array shape: (100, 2)
[[ 0.9252057 0.0747943 ]
[ 0.94688598 0.05311402]
[ 0.9587255 0.0412745 ]
[ 0.95563006 0.04436994]
[ 0.95454926 0.04545074]
[ 0.97624743 0.02375257]
[ 0.97046174 0.02953826]
[ 0.94189742 0.05810258]
[ 0.95970606 0.04029394]
[ 0.95868077 0.04131923]
[ 0.98129712 0.01870288]
[ 0.96368413 0.03631587]
[ 0.94522506 0.05477494]
[ 0.96296358 0.03703642]
[ 0.9689389 0.0310611 ]
[ 0.98497563 0.01502437]
[ 0.95699417 0.04300583]
[ 0.95098874 0.04901126]
[ 0.95127329 0.04872671]
[ 0.95532323 0.04467677]
[ 0.94519313 0.05480687]
[ 0.93410618 0.06589382]
[ 0.93570345 0.06429655]
[ 0.95371992 0.04628008]
[ 0.93784384 0.06215616]
[ 0.98224056 0.01775944]
[ 0.96278406 0.03721594]
[ 0.941882 0.058118 ]
[ 0.95087458 0.04912542]
[ 0.95902701 0.04097299]
[ 0.96082519 0.03917481]
[ 0.95358535 0.04641465]
[ 0.95453017 0.04546983]
[ 0.97133914 0.02866086]
[ 0.95180271 0.04819729]
[ 0.96054154 0.03945846]
[ 0.96168497 0.03831503]
[ 0.95372931 0.04627069]
[ 0.9592321 0.0407679 ]
[ 0.96286817 0.03713183]
[ 0.95867289 0.04132711]
[ 0.94864358 0.05135642]
[ 0.95596227 0.04403773]
[ 0.95909123 0.04090877]
[ 0.96076251 0.03923749]
[ 0.96107981 0.03892019]
[ 0.92131804 0.07868196]
[ 0.93841341 0.06158659]
[ 0.9620909 0.0379091 ]
[ 0.9532265 0.0467735 ]
[ 0.95202562 0.04797438]
[ 0.96460813 0.03539187]
[ 0.97630049 0.02369951]
[ 0.94309809 0.05690191]
[ 0.9447385 0.0552615 ]
[ 0.94100407 0.05899593]
[ 0.9794299 0.0205701 ]
[ 0.94863407 0.05136593]
[ 0.94806244 0.05193756]
[ 0.9383938 0.0616062 ]
[ 0.95637531 0.04362469]
[ 0.97859733 0.02140267]
[ 0.93561244 0.06438756]
[ 0.96468223 0.03531777]
[ 0.97670565 0.02329435]
[ 0.94415471 0.05584529]
[ 0.93732504 0.06267496]
[ 0.9692657 0.0307343 ]
[ 0.96672586 0.03327414]
[ 0.94646419 0.05353581]
[ 0.92293877 0.07706123]
[ 0.97864842 0.02135158]
[ 0.96765653 0.03234347]
[ 0.95593187 0.04406813]
[ 0.97026734 0.02973266]
[ 0.95871461 0.04128539]
[ 0.9541099 0.0458901 ]
[ 0.93110127 0.06889873]
[ 0.94285685 0.05714315]
[ 0.9504546 0.0495454 ]
[ 0.25184598 0.74815402]
[ 0.40557102 0.59442898]
[ 0.18434087 0.81565913]
[ 0.18808454 0.81191546]
[ 0.23148065 0.76851935]
[ 0.26837299 0.73162701]
[ 0.21580742 0.78419258]
[ 0.14347461 0.85652539]
[ 0.29459315 0.70540685]
[ 0.30439276 0.69560724]
[ 0.28564684 0.71435316]
[ 0.31831457 0.68168543]
[ 0.2653372 0.7346628 ]
[ 0.2669916 0.7330084 ]
[ 0.2340579 0.7659421 ]
[ 0.18989429 0.81010571]
[ 0.34010795 0.65989205]
[ 0.1589018 0.8410982 ]
[ 0.23193837 0.76806163]
[ 0.46504918 0.53495082]]
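The column layout is easy to sanity-check: the two columns of predict_proba are complementary, so every row sums to 1. A minimal sketch using three rows copied from the listing above:

```python
import numpy as np

# Three rows from the predict_proba output above; column 0 is
# P(inlier), column 1 is P(outlier).
proba = np.array([
    [0.9252057, 0.0747943],    # inlier: low outlier probability
    [0.94688598, 0.05311402],  # inlier
    [0.25184598, 0.74815402],  # one of the last 20 points: an outlier
])

# The columns are complementary, so every row sums to 1.
assert np.allclose(proba.sum(axis=1), 1.0)

# Thresholding column 1 at 0.5 recovers the obvious labels here.
print((proba[:, 1] > 0.5).astype(int))  # -> [0 0 1]
```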


Maddosaurus commented on May 18, 2024

I retried this with the generated demo data and can reproduce this behavior - thank you for the clarification!
If I print the result of clf.predict and clf.predict_proba side by side, this works out perfectly for the generated test data:

(...)
0 -> [0.97238848 0.02761152]
0 -> [0.9618547 0.0381453]
0 -> [0.95219714 0.04780286]
0 -> [0.93436772 0.06563228]
0 -> [0.945225 0.054775]
0 -> [0.95295617 0.04704383]
1 -> [0.25545642 0.74454358]
1 -> [0.4047036 0.5952964]
1 -> [0.18523465 0.81476535]
1 -> [0.1886568 0.8113432]
1 -> [0.2289029 0.7710971]
1 -> [0.27263008 0.72736992]
(...)

However, if I try the same with my production dataset, the following happens:

(...)
0 -> [0.99155502 0.00844498]
0 -> [0.99879371 0.00120629]
0 -> [0.9939181 0.0060819]
1 -> [0.97140045 0.02859955]
0 -> [0.99708464 0.00291536]
0 -> [0.99464627 0.00535373]
1 -> [0.99000773 0.00999227]
0 -> [0.99464627 0.00535373]
1 -> [0.9434721 0.0565279]
0 -> [9.99430013e-01 5.69987401e-04]
1 -> [0.99069809 0.00930191]
0 -> [0.99143665 0.00856335]
0 -> [0.99104535 0.00895465]
0 -> [0.99338351 0.00661649]
0 -> [0.99550684 0.00449316]
0 -> [9.99467018e-01 5.32982027e-04]
1 -> [0.98927478 0.01072522]
(...)

This is why I opened the issue in the first place: how can predict give me a label that contradicts predict_proba? On the generated demo data the two match, but not on my production data. Did I overlook or misread something?

If you're interested in the full code, you can find a gist with all IPython Notebooks here:
https://gist.github.com/Maddosaurus/f75bc577a403de53e0e594e8d2b56ad7


yzhao062 commented on May 18, 2024

predict() is actually a hard-cut conversion: given n test points, it forces roughly n * contamination of them to be labeled outliers, no matter how low their outlier probabilities are.

So you will notice below that the points labeled 1 still have a higher outlier probability than the remaining points labeled 0. It is a relative conversion: they are labeled 1 not because they are clearly outliers, but because they look more suspicious than the rest.

However, if you notice this happening, it means the model is somehow failing on your dataset. You should consider changing the parameters (e.g., using a smaller contamination) or switching to other models, e.g., so_gaal and mo_gaal.
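The mechanics behind this can be sketched with plain NumPy. This is a simplified illustration, not PyOD's exact implementation: the decision scores here are made up, and the real library fits its threshold and probability scaling on the detector's training scores. It shows why a quantile-based hard label can disagree with a rescaled probability:

```python
import numpy as np

# Hypothetical decision scores; in PyOD these come from the fitted detector.
rng = np.random.default_rng(42)
train_scores = rng.normal(size=1000)   # scores on the training data
test_scores = rng.normal(size=100)     # scores on the test data

contamination = 0.2

# Hard labels: a cut at the (1 - contamination) quantile of the training
# scores, so roughly n * contamination test points get labeled 1 regardless
# of how small their scores are in absolute terms.
threshold = np.percentile(train_scores, 100 * (1 - contamination))
labels = (test_scores > threshold).astype(int)

# Probabilities (linear scaling): min-max scale the scores into [0, 1] and
# return [P(inlier), P(outlier)], so the two columns always sum to 1.
scaled = (test_scores - train_scores.min()) / (train_scores.max() - train_scores.min())
outlier_prob = np.clip(scaled, 0, 1)
proba = np.column_stack([1 - outlier_prob, outlier_prob])

# A quantile threshold and a rescaled score are different yardsticks, so a
# point can be labeled 1 while its outlier probability stays well below 0.5.
print(labels.sum(), "points labeled as outliers out of", len(labels))
```

This is exactly the pattern in the production output above: the labeled-1 points score slightly higher than their neighbors, but their scaled outlier probabilities remain tiny.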

(...)
0 -> [0.99155502 0.00844498]
0 -> [0.99879371 0.00120629]
0 -> [0.9939181 0.0060819]
1 -> [0.97140045 0.02859955]
0 -> [0.99708464 0.00291536]
0 -> [0.99464627 0.00535373]
1 -> [0.99000773 0.00999227]
0 -> [0.99464627 0.00535373]
1 -> [0.9434721 0.0565279]
0 -> [9.99430013e-01 5.69987401e-04]
1 -> [0.99069809 0.00930191]
0 -> [0.99143665 0.00856335]
0 -> [0.99104535 0.00895465]
0 -> [0.99338351 0.00661649]
0 -> [0.99550684 0.00449316]
0 -> [9.99467018e-01 5.32982027e-04]
1 -> [0.98927478 0.01072522]
(...)


yzhao062 commented on May 18, 2024

I also checked the training process, and you can see the loss is not steadily decreasing but fluctuating. I believe the model is not well trained, let alone converged... that also explains the discrepancy.


Maddosaurus commented on May 18, 2024

Thank you for your help and clarifications!

