Comments (4)
Sadly, this is indeed a bug. Sparkit trains sklearn's linear models in parallel, one per block, then averages them in a reduce step. At least one of your blocks contains only one of the labels. To check, try the following; a nonzero count means some blocks hold a single label:
train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()
As a workaround, you could randomize the training data to avoid single-label blocks, but this is still waiting for a clever solution.
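To illustrate the check and the shuffling workaround without a Spark cluster, here is a minimal NumPy-only sketch. It is not sparkit-learn's code: the helper `count_single_label_blocks`, the block size of 25, and the 90/10 label array are all assumptions made up for the example. Shuffling lowers the chance of a single-label block, but with imbalanced labels it does not guarantee that none remain.

```python
import numpy as np

def count_single_label_blocks(y, block_size):
    """Count contiguous blocks of y that contain fewer than two distinct labels."""
    blocks = [y[i:i + block_size] for i in range(0, len(y), block_size)]
    return sum(1 for block in blocks if np.unique(block).size < 2)

# Hypothetical, sorted, imbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)

# Sorted data: the first three blocks of 25 contain only the label 0.
before = count_single_label_blocks(y, block_size=25)

# Shuffle with a fixed seed so the run is reproducible, then check again.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)
after = count_single_label_blocks(y_shuffled, block_size=25)

print("single-label blocks before shuffling:", before)
print("single-label blocks after shuffling:", after)
```

In a real pipeline the features would have to be permuted with the same index order as the labels before the data is split into blocks.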
from sparkit-learn.
Thanks
I believe I found a workaround for this. Since this problem tends to occur with highly imbalanced datasets, I would suggest using StratifiedShuffleSplit and varying the train_size or test_size ratio, as shown below:
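import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Try a range of train ratios; stratification approximately preserves class proportions in each split.
for trainRatio in np.arange(0.05, 1, 0.05):
    split = StratifiedShuffleSplit(n_splits=2, train_size=trainRatio)
    for trainIdx, testIdx in split.split(X, y):
        Xtrain, Xtest = X[trainIdx], X[testIdx]
        ytrain, ytest = y[trainIdx], y[testIdx]
        model = someModel()  # placeholder for any sklearn estimator
        model.fit(Xtrain, ytrain)
        pred = model.predict(Xtest)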
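The same stratification idea could in principle be applied when assigning samples to blocks, so that every block sees both labels. The following is only a sketch under stated assumptions, not something sparkit-learn provides: `stratified_blocks` is a hypothetical helper, it works on a plain NumPy label array, and it only guarantees both labels per block when each class has at least `n_blocks` samples.

```python
import numpy as np

def stratified_blocks(y, n_blocks, seed=0):
    """Assign sample indices to n_blocks so each class is spread evenly across blocks."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    block_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        # Shuffle the indices of this class, then deal them out round-robin.
        idx = rng.permutation(np.flatnonzero(y == label))
        block_of[idx] = np.arange(len(idx)) % n_blocks
    return [np.flatnonzero(block_of == b) for b in range(n_blocks)]

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
blocks = stratified_blocks(y, n_blocks=4)

# Number of distinct labels in each block; 2 means both classes are present.
labels_per_block = [np.unique(y[b]).size for b in blocks]
print(labels_per_block)
```

Each returned index array would then select the rows of X and y that go into one block.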
Can't believe that this bug is still not fixed! Sad!