
Comments (7)

yzhao062 commented on May 18, 2024

This is indeed a bug. Thanks for spotting this. I will investigate in the afternoon and try to provide a fix.


Maddosaurus commented on May 18, 2024

Sorry, I should have provided additional information in the first place.

I am working with Python 3.6.8 (Linux x64) with the following modules:

Keras==2.2.4
numba==0.43.1
numpy==1.16.2
pandas==0.23.4
pyod==0.6.9
scikit-learn==0.20.3
tensorflow==1.13.1

Please note that I have replaced my training data with your data generator (the outcome is still the same).

from pyod.models.auto_encoder import AutoEncoder
from pyod.utils.data import generate_data

X_train, y_train, X_test, y_test = \
        generate_data(n_train=1000,
                      n_test=100,
                      n_features=20,
                      contamination=0.2,
                      random_state=42)

clf_name = 'AutoEncoder'
clf = AutoEncoder(
    epochs=1,
    batch_size=1024,
    hidden_activation='relu', 
    output_activation='sigmoid',
    optimizer='adam',
    loss='binary_crossentropy',
    dropout_rate=0.3,
    l2_regularizer=0.1,
    validation_size=0.1,
    preprocessing=True,
    verbose=1,
    random_state=42,
    contamination=0.2,
    hidden_neurons=[10,5,5,10]
)
clf.fit(X_train)

y_test_scores = clf.predict_proba(X_test)

print("predict_proba array shape: ", y_test_scores.shape)
print(y_test_scores[0:10])

This yields

predict_proba array shape:  (100, 2)
[[0.92597692 0.07402308]
 [0.94946049 0.05053951]
 [0.9580671  0.0419329 ]
 [0.95641818 0.04358182]
 [0.95392434 0.04607566]
 [0.97283389 0.02716611]
 [0.97295465 0.02704535]
 [0.9476153  0.0523847 ]
 [0.95920739 0.04079261]
 [0.96109227 0.03890773]]


yzhao062 commented on May 18, 2024

Sorry, I just misread the thread; it is not a bug. For predict_proba, the first column is the probability of being an inlier, and the second column is the probability of being an outlier. That is why the first column + second column = 1.

I understand why the result looks suspicious: you need to print all 100 test samples (the last 20 are the outliers). If you only show the first 10, you will not see any outliers.

predict_proba array shape: (100, 2)
[[ 0.9252057 0.0747943 ]
[ 0.94688598 0.05311402]
[ 0.9587255 0.0412745 ]
[ 0.95563006 0.04436994]
[ 0.95454926 0.04545074]
[ 0.97624743 0.02375257]
[ 0.97046174 0.02953826]
[ 0.94189742 0.05810258]
[ 0.95970606 0.04029394]
[ 0.95868077 0.04131923]
[ 0.98129712 0.01870288]
[ 0.96368413 0.03631587]
[ 0.94522506 0.05477494]
[ 0.96296358 0.03703642]
[ 0.9689389 0.0310611 ]
[ 0.98497563 0.01502437]
[ 0.95699417 0.04300583]
[ 0.95098874 0.04901126]
[ 0.95127329 0.04872671]
[ 0.95532323 0.04467677]
[ 0.94519313 0.05480687]
[ 0.93410618 0.06589382]
[ 0.93570345 0.06429655]
[ 0.95371992 0.04628008]
[ 0.93784384 0.06215616]
[ 0.98224056 0.01775944]
[ 0.96278406 0.03721594]
[ 0.941882 0.058118 ]
[ 0.95087458 0.04912542]
[ 0.95902701 0.04097299]
[ 0.96082519 0.03917481]
[ 0.95358535 0.04641465]
[ 0.95453017 0.04546983]
[ 0.97133914 0.02866086]
[ 0.95180271 0.04819729]
[ 0.96054154 0.03945846]
[ 0.96168497 0.03831503]
[ 0.95372931 0.04627069]
[ 0.9592321 0.0407679 ]
[ 0.96286817 0.03713183]
[ 0.95867289 0.04132711]
[ 0.94864358 0.05135642]
[ 0.95596227 0.04403773]
[ 0.95909123 0.04090877]
[ 0.96076251 0.03923749]
[ 0.96107981 0.03892019]
[ 0.92131804 0.07868196]
[ 0.93841341 0.06158659]
[ 0.9620909 0.0379091 ]
[ 0.9532265 0.0467735 ]
[ 0.95202562 0.04797438]
[ 0.96460813 0.03539187]
[ 0.97630049 0.02369951]
[ 0.94309809 0.05690191]
[ 0.9447385 0.0552615 ]
[ 0.94100407 0.05899593]
[ 0.9794299 0.0205701 ]
[ 0.94863407 0.05136593]
[ 0.94806244 0.05193756]
[ 0.9383938 0.0616062 ]
[ 0.95637531 0.04362469]
[ 0.97859733 0.02140267]
[ 0.93561244 0.06438756]
[ 0.96468223 0.03531777]
[ 0.97670565 0.02329435]
[ 0.94415471 0.05584529]
[ 0.93732504 0.06267496]
[ 0.9692657 0.0307343 ]
[ 0.96672586 0.03327414]
[ 0.94646419 0.05353581]
[ 0.92293877 0.07706123]
[ 0.97864842 0.02135158]
[ 0.96765653 0.03234347]
[ 0.95593187 0.04406813]
[ 0.97026734 0.02973266]
[ 0.95871461 0.04128539]
[ 0.9541099 0.0458901 ]
[ 0.93110127 0.06889873]
[ 0.94285685 0.05714315]
[ 0.9504546 0.0495454 ]
[ 0.25184598 0.74815402]
[ 0.40557102 0.59442898]
[ 0.18434087 0.81565913]
[ 0.18808454 0.81191546]
[ 0.23148065 0.76851935]
[ 0.26837299 0.73162701]
[ 0.21580742 0.78419258]
[ 0.14347461 0.85652539]
[ 0.29459315 0.70540685]
[ 0.30439276 0.69560724]
[ 0.28564684 0.71435316]
[ 0.31831457 0.68168543]
[ 0.2653372 0.7346628 ]
[ 0.2669916 0.7330084 ]
[ 0.2340579 0.7659421 ]
[ 0.18989429 0.81010571]
[ 0.34010795 0.65989205]
[ 0.1589018 0.8410982 ]
[ 0.23193837 0.76806163]
[ 0.46504918 0.53495082]]
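The column layout is easy to sanity-check: the two columns of predict_proba are complementary, so every row sums to 1. A minimal sketch using three rows copied from the listing above:

```python
import numpy as np

# Three rows from the predict_proba output above; column 0 is
# P(inlier), column 1 is P(outlier).
proba = np.array([
    [0.9252057, 0.0747943],    # inlier: low outlier probability
    [0.94688598, 0.05311402],  # inlier
    [0.25184598, 0.74815402],  # one of the last 20 points: an outlier
])

# The columns are complementary, so every row sums to 1.
assert np.allclose(proba.sum(axis=1), 1.0)

# Thresholding column 1 at 0.5 recovers the obvious labels here.
print((proba[:, 1] > 0.5).astype(int))  # -> [0 0 1]
```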


Maddosaurus commented on May 18, 2024

I retried this with the generated demo data and can reproduce this behavior - thank you for the clarification!
If I print the result of clf.predict and clf.predict_proba side by side, this works out perfectly for the generated test data:

(...)
0 -> [0.97238848 0.02761152]
0 -> [0.9618547 0.0381453]
0 -> [0.95219714 0.04780286]
0 -> [0.93436772 0.06563228]
0 -> [0.945225 0.054775]
0 -> [0.95295617 0.04704383]
1 -> [0.25545642 0.74454358]
1 -> [0.4047036 0.5952964]
1 -> [0.18523465 0.81476535]
1 -> [0.1886568 0.8113432]
1 -> [0.2289029 0.7710971]
1 -> [0.27263008 0.72736992]
(...)

However, if I try the same with my production dataset, the following happens:

(...)
0 -> [0.99155502 0.00844498]
0 -> [0.99879371 0.00120629]
0 -> [0.9939181 0.0060819]
1 -> [0.97140045 0.02859955]
0 -> [0.99708464 0.00291536]
0 -> [0.99464627 0.00535373]
1 -> [0.99000773 0.00999227]
0 -> [0.99464627 0.00535373]
1 -> [0.9434721 0.0565279]
0 -> [9.99430013e-01 5.69987401e-04]
1 -> [0.99069809 0.00930191]
0 -> [0.99143665 0.00856335]
0 -> [0.99104535 0.00895465]
0 -> [0.99338351 0.00661649]
0 -> [0.99550684 0.00449316]
0 -> [9.99467018e-01 5.32982027e-04]
1 -> [0.98927478 0.01072522]
(...)

This is why I opened the issue in the first place: how can predict give me a label that contradicts predict_proba? On the generated demo data the two match, but not on my production data. Did I overlook or misread something?

If you're interested in the full code, you can find a gist with all IPython Notebooks here:
https://gist.github.com/Maddosaurus/f75bc577a403de53e0e594e8d2b56ad7


yzhao062 commented on May 18, 2024

predict() is actually a hard-cut conversion: given n test points, it forces roughly n * contamination of them to be labeled outliers, no matter how low their outlier probabilities are.

So you will notice below that the points labeled 1 still have a higher outlier probability than the remaining points labeled 0. It is a relative conversion: they are labeled 1 not because they are clearly outliers, but because they look more suspicious than the rest.

However, if you notice this happening, it means the model is somehow failing on your dataset. You should consider changing the parameters (e.g., using a smaller contamination) or switching to other models, e.g., so_gaal and mo_gaal.
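The mechanics behind this can be sketched with plain NumPy. This is a simplified illustration, not PyOD's exact implementation: the decision scores here are made up, and the real library fits its threshold and probability scaling on the detector's training scores. It shows why a quantile-based hard label can disagree with a rescaled probability:

```python
import numpy as np

# Hypothetical decision scores; in PyOD these come from the fitted detector.
rng = np.random.default_rng(42)
train_scores = rng.normal(size=1000)   # scores on the training data
test_scores = rng.normal(size=100)     # scores on the test data

contamination = 0.2

# Hard labels: a cut at the (1 - contamination) quantile of the training
# scores, so roughly n * contamination test points get labeled 1 regardless
# of how small their scores are in absolute terms.
threshold = np.percentile(train_scores, 100 * (1 - contamination))
labels = (test_scores > threshold).astype(int)

# Probabilities (linear scaling): min-max scale the scores into [0, 1] and
# return [P(inlier), P(outlier)], so the two columns always sum to 1.
scaled = (test_scores - train_scores.min()) / (train_scores.max() - train_scores.min())
outlier_prob = np.clip(scaled, 0, 1)
proba = np.column_stack([1 - outlier_prob, outlier_prob])

# A quantile threshold and a rescaled score are different yardsticks, so a
# point can be labeled 1 while its outlier probability stays well below 0.5.
print(labels.sum(), "points labeled as outliers out of", len(labels))
```

This is exactly the pattern in the production output above: the labeled-1 points score slightly higher than their neighbors, but their scaled outlier probabilities remain tiny.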

(...)
0 -> [0.99155502 0.00844498]
0 -> [0.99879371 0.00120629]
0 -> [0.9939181 0.0060819]
1 -> [0.97140045 0.02859955]
0 -> [0.99708464 0.00291536]
0 -> [0.99464627 0.00535373]
1 -> [0.99000773 0.00999227]
0 -> [0.99464627 0.00535373]
1 -> [0.9434721 0.0565279]
0 -> [9.99430013e-01 5.69987401e-04]
1 -> [0.99069809 0.00930191]
0 -> [0.99143665 0.00856335]
0 -> [0.99104535 0.00895465]
0 -> [0.99338351 0.00661649]
0 -> [0.99550684 0.00449316]
0 -> [9.99467018e-01 5.32982027e-04]
1 -> [0.98927478 0.01072522]
(...)


yzhao062 commented on May 18, 2024

I also checked the training process, and you can see the loss is not steadily decreasing but fluctuating. I believe the model is not well trained, let alone converged... that also explains the discrepancy.


Maddosaurus commented on May 18, 2024

Thank you for your help and clarifications!

