
Comments (10)

razdoburdin commented on August 16, 2024

Hi @davechallis ,
thanks for the reproducer. The problem you found is caused by incorrect handling of missing values for scipy CSR matrices.
As a quick fix, you can add the following to your code:

import numpy as np

# Densify X, treating entries that are not stored in the sparse matrix
# as missing values (NaN) rather than as zeros.
X_np = np.empty(X.shape)
X_np.fill(np.nan)
X_np[X.nonzero()] = X[X.nonzero()]
y_d4p = model_d4p.predict_proba(X_np)


Let me know if you find any other problems!


davechallis commented on August 16, 2024

@razdoburdin Fantastic, that works perfectly! Thanks for your help; I've tested it with my full example and it all works great.


razdoburdin commented on August 16, 2024

@davechallis ,
We don't have native support for sparse matrices in GBT inference yet.
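
Until then, a possible way to keep the densification workaround from blowing up memory is to densify and score a block of rows at a time. A rough sketch only (it assumes X is a scipy CSR matrix and that the model exposes predict_proba(), like the converted model above):

import numpy as np

# Hypothetical helper: densify and score a block of rows at a time so peak
# memory stays bounded; unstored entries are treated as missing (NaN),
# mirroring the densification workaround above.
def predict_proba_chunked(model, X, chunk_size=4096):
    parts = []
    for start in range(0, X.shape[0], chunk_size):
        block = X[start:start + chunk_size].tocoo()  # row slice stays sparse
        dense = np.full(block.shape, np.nan)         # missing by default
        dense[block.row, block.col] = block.data     # keep stored values
        parts.append(model.predict_proba(dense))
    return np.vstack(parts)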


razdoburdin commented on August 16, 2024

Hi,
thanks for using daal4py!

I didn't catch from your example why you used predict for xgboost and predict_proba for d4p. These two methods are not equivalent to each other.


davechallis commented on August 16, 2024

@razdoburdin Hi, thanks for taking a look! I'm using predict from the xgboost Python API rather than the scikit-learn wrapper, so it returns probabilities rather than labels (unless I'm misunderstanding the API).

E.g. in the example above, if I print out the first 5 results in each y:

y_xgb = model_xgb.predict(xgb.DMatrix(X))
y_d4p = model_d4p.predict_proba(X)[:, 0]

print("xgb", y_xgb[:5])
print("d4p", y_d4p[:5])

this outputs:

xgb [0.00546392 0.00123668 0.00123668 0.00136322 0.00360578]
d4p [0.3883016  0.38513577 0.38620013 0.38620013 0.32219833]
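
One note for a like-for-like comparison (a rough sketch, reusing X, model_xgb and model_d4p from the example above): with a binary:logistic objective, Booster.predict returns the probability of class 1, so the matching daal4py column is predict_proba(X)[:, 1] rather than [:, 0]:

# class-1 probabilities from both APIs (the sparse-input issue above is
# still the real cause of the mismatch)
y_xgb = model_xgb.predict(xgb.DMatrix(X))   # P(class 1) for binary:logistic
y_d4p = model_d4p.predict_proba(X)[:, 1]    # matching column

print("xgb", y_xgb[:5])
print("d4p", y_d4p[:5])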


davechallis commented on August 16, 2024

@razdoburdin Just to make the comparison clearer, I also did a quick check using the sklearn API and printing the full results of predict_proba in both cases:

import pickle
import xgboost as xgb
import daal4py as d4p

model_xgb_sklearn = xgb.XGBClassifier()
model_xgb_sklearn.load_model("model.bin")

# Booster.load_model() loads in place and returns None, so load via the
# constructor instead and pass the booster to convert_model
model_d4p = d4p.convert_model(xgb.Booster(model_file="model.bin"))

with open("data.pkl", "rb") as fh:
    X = pickle.load(fh)

y_xgb = model_xgb_sklearn.predict_proba(X)
y_d4p = model_d4p.predict_proba(X)

print("xgb")
print(y_xgb[:5])
print()
print("d4p")
print(y_d4p[:5])

which outputs:

xgb
[[0.9945361  0.00546392]
 [0.9987633  0.00123668]
 [0.9987633  0.00123668]
 [0.9986368  0.00136322]
 [0.9963942  0.00360578]]

d4p
[[0.3883016  0.6116984 ]
 [0.38513577 0.61486423]
 [0.38620013 0.61379987]
 [0.38620013 0.61379987]
 [0.32219833 0.67780167]]


razdoburdin commented on August 16, 2024

Hi @davechallis ,
I wasn't able to reproduce the problem.
Could you please provide a runnable reproducer?


davechallis commented on August 16, 2024

@razdoburdin No problem, give me a while to sample some data, and I'll try to get something that works end to end uploaded here.


davechallis commented on August 16, 2024

@razdoburdin I've attached a zip file containing a python script (similar to the one posted above) named demo.py, some data to classify in data.pkl, and a trained XGBoost classifier in model.bin.

If I run the script locally, I get:

X <class 'scipy.sparse._csr.csr_matrix'> float32 (50, 5300)

xgb
[[0.9945361  0.00546392]
 [0.9987633  0.00123668]
 [0.9987633  0.00123668]
 [0.9986368  0.00136322]
 [0.9963942  0.00360578]]

d4p
[[0.3883016  0.6116984 ]
 [0.38513577 0.61486423]
 [0.38620013 0.61379987]
 [0.38620013 0.61379987]
 [0.32219833 0.67780167]]

This is in a fresh python 3.12 environment, with the following packages installed:

  • xgboost==2.0.3
  • scipy==1.12.0
  • daal4py==2024.3.0

Hopefully this helps a bit, but let me know if there's anything else I can provide to help.


davechallis commented on August 16, 2024

@razdoburdin One last thing I thought I'd ask/check: is there another approach for this that doesn't involve converting to a dense matrix first? I've tested on a few hundred classifiers I've got, and some of their input matrices are extremely sparse, so I end up hitting memory issues when converting to dense.

Or if there's any documentation/source I can read to find out more, that'd also be great.

If not, then no problem; I can maybe check matrix density and then swap between daal4py and native xgboost models depending on that.
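
A rough sketch of that density check (predict_proba_auto and the threshold are made-up names/values, and it assumes a binary:logistic model as in this thread): very sparse inputs go to the native booster, which accepts CSR directly, while denser ones are densified with NaNs for daal4py.

import numpy as np
import xgboost as xgb

# Hypothetical helper: route prediction based on how dense the input is.
def predict_proba_auto(model_d4p, booster, X, density_threshold=0.1):
    density = X.nnz / (X.shape[0] * X.shape[1])
    if density >= density_threshold:
        # dense enough: densify with NaN for unstored entries, use daal4py
        coo = X.tocoo()
        dense = np.full(X.shape, np.nan)
        dense[coo.row, coo.col] = coo.data
        return model_d4p.predict_proba(dense)
    # too sparse: fall back to the native booster (CSR input is fine);
    # binary:logistic predict returns the positive-class probability
    pos = booster.predict(xgb.DMatrix(X))
    return np.column_stack([1.0 - pos, pos])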

