Comments (7)
This is indeed a bug. Thanks for spotting this. I will investigate in the afternoon and try to provide a fix.
from pyod.
Sorry, I should have provided additional information in the first place.
I am working with Python 3.6.8 (Linux x64) with the following modules:
Keras==2.2.4
numba==0.43.1
numpy==1.16.2
pandas==0.23.4
pyod==0.6.9
scikit-learn==0.20.3
tensorflow==1.13.1
Please note that I have replaced my training data with your data generator (the outcome is still the same).
from pyod.models.auto_encoder import AutoEncoder
from pyod.utils.data import generate_data
X_train, y_train, X_test, y_test = \
    generate_data(n_train=1000,
                  n_test=100,
                  n_features=20,
                  contamination=0.2,
                  random_state=42)
clf_name = 'AutoEncoder'
clf = AutoEncoder(
    epochs=1,
    batch_size=1024,
    hidden_activation='relu',
    output_activation='sigmoid',
    optimizer='adam',
    loss='binary_crossentropy',
    dropout_rate=0.3,
    l2_regularizer=0.1,
    validation_size=0.1,
    preprocessing=True,
    verbose=1,
    random_state=42,
    contamination=0.2,
    hidden_neurons=[10, 5, 5, 10]
)
clf.fit(X_train)
y_test_scores = clf.predict_proba(X_test)
print("predict_proba array shape: ", y_test_scores.shape)
print(y_test_scores[0:10])
This yields:
predict_proba array shape: (100, 2)
[[0.92597692 0.07402308]
[0.94946049 0.05053951]
[0.9580671 0.0419329 ]
[0.95641818 0.04358182]
[0.95392434 0.04607566]
[0.97283389 0.02716611]
[0.97295465 0.02704535]
[0.9476153 0.0523847 ]
[0.95920739 0.04079261]
[0.96109227 0.03890773]]
Sorry, I just misread the thread; it is not a bug. For predict_proba, the first column is the probability of being an inlier and the second column is the probability of being an outlier. That is why the first column + second column = 1.
I understand why the result looks suspicious. It is simply because you should show the scores of all the test samples (all 100 points; the last 20 are outliers). If you only show the first 10, you will not see any outliers.
predict_proba array shape: (100, 2)
[[ 0.9252057 0.0747943 ]
[ 0.94688598 0.05311402]
[ 0.9587255 0.0412745 ]
[ 0.95563006 0.04436994]
[ 0.95454926 0.04545074]
[ 0.97624743 0.02375257]
[ 0.97046174 0.02953826]
[ 0.94189742 0.05810258]
[ 0.95970606 0.04029394]
[ 0.95868077 0.04131923]
[ 0.98129712 0.01870288]
[ 0.96368413 0.03631587]
[ 0.94522506 0.05477494]
[ 0.96296358 0.03703642]
[ 0.9689389 0.0310611 ]
[ 0.98497563 0.01502437]
[ 0.95699417 0.04300583]
[ 0.95098874 0.04901126]
[ 0.95127329 0.04872671]
[ 0.95532323 0.04467677]
[ 0.94519313 0.05480687]
[ 0.93410618 0.06589382]
[ 0.93570345 0.06429655]
[ 0.95371992 0.04628008]
[ 0.93784384 0.06215616]
[ 0.98224056 0.01775944]
[ 0.96278406 0.03721594]
[ 0.941882 0.058118 ]
[ 0.95087458 0.04912542]
[ 0.95902701 0.04097299]
[ 0.96082519 0.03917481]
[ 0.95358535 0.04641465]
[ 0.95453017 0.04546983]
[ 0.97133914 0.02866086]
[ 0.95180271 0.04819729]
[ 0.96054154 0.03945846]
[ 0.96168497 0.03831503]
[ 0.95372931 0.04627069]
[ 0.9592321 0.0407679 ]
[ 0.96286817 0.03713183]
[ 0.95867289 0.04132711]
[ 0.94864358 0.05135642]
[ 0.95596227 0.04403773]
[ 0.95909123 0.04090877]
[ 0.96076251 0.03923749]
[ 0.96107981 0.03892019]
[ 0.92131804 0.07868196]
[ 0.93841341 0.06158659]
[ 0.9620909 0.0379091 ]
[ 0.9532265 0.0467735 ]
[ 0.95202562 0.04797438]
[ 0.96460813 0.03539187]
[ 0.97630049 0.02369951]
[ 0.94309809 0.05690191]
[ 0.9447385 0.0552615 ]
[ 0.94100407 0.05899593]
[ 0.9794299 0.0205701 ]
[ 0.94863407 0.05136593]
[ 0.94806244 0.05193756]
[ 0.9383938 0.0616062 ]
[ 0.95637531 0.04362469]
[ 0.97859733 0.02140267]
[ 0.93561244 0.06438756]
[ 0.96468223 0.03531777]
[ 0.97670565 0.02329435]
[ 0.94415471 0.05584529]
[ 0.93732504 0.06267496]
[ 0.9692657 0.0307343 ]
[ 0.96672586 0.03327414]
[ 0.94646419 0.05353581]
[ 0.92293877 0.07706123]
[ 0.97864842 0.02135158]
[ 0.96765653 0.03234347]
[ 0.95593187 0.04406813]
[ 0.97026734 0.02973266]
[ 0.95871461 0.04128539]
[ 0.9541099 0.0458901 ]
[ 0.93110127 0.06889873]
[ 0.94285685 0.05714315]
[ 0.9504546 0.0495454 ]
[ 0.25184598 0.74815402]
[ 0.40557102 0.59442898]
[ 0.18434087 0.81565913]
[ 0.18808454 0.81191546]
[ 0.23148065 0.76851935]
[ 0.26837299 0.73162701]
[ 0.21580742 0.78419258]
[ 0.14347461 0.85652539]
[ 0.29459315 0.70540685]
[ 0.30439276 0.69560724]
[ 0.28564684 0.71435316]
[ 0.31831457 0.68168543]
[ 0.2653372 0.7346628 ]
[ 0.2669916 0.7330084 ]
[ 0.2340579 0.7659421 ]
[ 0.18989429 0.81010571]
[ 0.34010795 0.65989205]
[ 0.1589018 0.8410982 ]
[ 0.23193837 0.76806163]
[ 0.46504918 0.53495082]]
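The structure of the output above can be checked with a short numpy snippet. Here `proba` is a small stand-in for the `clf.predict_proba(X_test)` array (two rows picked to mirror the values above), since the two columns are complements and the generated outliers sit at the end of the test set:

```python
import numpy as np

# Stand-in for clf.predict_proba(X_test):
# column 0 = P(inlier), column 1 = P(outlier)
proba = np.array([[0.9252, 0.0748],   # an inlier-looking row
                  [0.1843, 0.8157]])  # an outlier-looking row

# The two columns are complements, so every row sums to 1
assert np.allclose(proba.sum(axis=1), 1.0)

# Inspect the *last* rows, not the first ones: with
# generate_data(contamination=0.2) the outliers are appended
# at the end of the test set
print(proba[-1])
```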
I retried this with the generated demo data and can reproduce this behavior - thank you for the clarification!
If I print the results of clf.predict and clf.predict_proba side by side, this works out perfectly for the generated test data:
(...)
0 -> [0.97238848 0.02761152]
0 -> [0.9618547 0.0381453]
0 -> [0.95219714 0.04780286]
0 -> [0.93436772 0.06563228]
0 -> [0.945225 0.054775]
0 -> [0.95295617 0.04704383]
1 -> [0.25545642 0.74454358]
1 -> [0.4047036 0.5952964]
1 -> [0.18523465 0.81476535]
1 -> [0.1886568 0.8113432]
1 -> [0.2289029 0.7710971]
1 -> [0.27263008 0.72736992]
(...)
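The side-by-side comparison above can be reproduced with a few lines of numpy; `labels` and `proba` are stand-ins for the outputs of `clf.predict(X_test)` and `clf.predict_proba(X_test)`:

```python
import numpy as np

# Stand-ins for clf.predict(X_test) and clf.predict_proba(X_test)
labels = np.array([0, 0, 1, 1])
proba = np.array([[0.97, 0.03],
                  [0.95, 0.05],
                  [0.26, 0.74],
                  [0.19, 0.81]])

# Print label and probability row side by side, as in the listing above
for label, p in zip(labels, proba):
    print(label, '->', p)

# On the demo data, each label agrees with the argmax of its row
assert np.array_equal(labels, proba.argmax(axis=1))
```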
However, if I try the same with my production dataset, the following happens:
(...)
0 -> [0.99155502 0.00844498]
0 -> [0.99879371 0.00120629]
0 -> [0.9939181 0.0060819]
1 -> [0.97140045 0.02859955]
0 -> [0.99708464 0.00291536]
0 -> [0.99464627 0.00535373]
1 -> [0.99000773 0.00999227]
0 -> [0.99464627 0.00535373]
1 -> [0.9434721 0.0565279]
0 -> [9.99430013e-01 5.69987401e-04]
1 -> [0.99069809 0.00930191]
0 -> [0.99143665 0.00856335]
0 -> [0.99104535 0.00895465]
0 -> [0.99338351 0.00661649]
0 -> [0.99550684 0.00449316]
0 -> [9.99467018e-01 5.32982027e-04]
1 -> [0.98927478 0.01072522]
(...)
This is why I opened the issue in the first place. How is it that predict gives me a different label than predict_proba? On the generated demo data these two match, but not on my production data. Did I overlook or misread something?
If you're interested in the full code, you can find a gist with all IPython Notebooks here:
https://gist.github.com/Maddosaurus/f75bc577a403de53e0e594e8d2b56ad7
predict() is actually a hard-cut conversion of the scores. For instance, if you have n test points, it will force the top n*contamination points to be outliers, no matter how low their outlying probabilities are.
So you will notice below that the points labeled as 1 still have higher outlying probabilities than the remaining points labeled as 0. It is a relative conversion: they are labeled as 1 not because they are actually outliers, but because, compared with the rest, they look more suspicious.
However, if you notice this happening, it means the model somehow fails on your dataset. You should think about changing parameters (such as using a smaller contamination) or switching to other models, e.g., so_gaal and mo_gaal.
(...)
0 -> [0.99155502 0.00844498]
0 -> [0.99879371 0.00120629]
0 -> [0.9939181 0.0060819]
1 -> [0.97140045 0.02859955]
0 -> [0.99708464 0.00291536]
0 -> [0.99464627 0.00535373]
1 -> [0.99000773 0.00999227]
0 -> [0.99464627 0.00535373]
1 -> [0.9434721 0.0565279]
0 -> [9.99430013e-01 5.69987401e-04]
1 -> [0.99069809 0.00930191]
0 -> [0.99143665 0.00856335]
0 -> [0.99104535 0.00895465]
0 -> [0.99338351 0.00661649]
0 -> [0.99550684 0.00449316]
0 -> [9.99467018e-01 5.32982027e-04]
1 -> [0.98927478 0.01072522]
(...)
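The hard cut described above can be sketched with plain numpy. This is an illustration of contamination-quantile thresholding, not pyod's exact implementation: the top n*contamination scores are labeled 1 regardless of their absolute magnitude.

```python
import numpy as np

contamination = 0.2

# Stand-in for the model's raw outlier scores (clf.decision_scores_)
scores = np.array([0.1, 0.3, 0.2, 0.9, 0.15, 0.25, 0.8, 0.22, 0.85, 0.05])

# Cut at the (1 - contamination) quantile: the top n*contamination
# points become outliers no matter how low their scores are
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)

print(labels.sum())  # 2 of the 10 points (20%) are flagged as outliers
```

With a contamination of 0.2 and 10 points, exactly 2 points end up labeled 1, which is why lowering contamination reduces the number of forced outlier labels.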
And if you check the training process, you can see the loss is not steadily decreasing but fluctuating. I believe the model is not well trained, let alone converged... that also explains the discrepancy.
Thank you for your help and clarifications!