mjuez / approx-smote
Approx-SMOTE: fast SMOTE for Big Data on Apache Spark
License: Apache License 2.0
Hi Mario,
When including the ASMOTE in a pipeline, I'm getting an exception here:
It looks like transform() expects that the only columns present in the data are the labelCol and the featuresCol. However, I think this constraint is not realistic (and not difficult to remove). As you know, in order to construct a vectorized featuresCol, we usually create a VectorAssembler as part of a bigger pipeline, where ASMOTE will be included as well, just as another stage.
Unfortunately, this does not work as is, and it would be odd to require an extra stage that removes all the other columns. Notice that the VectorAssembler normally comes after earlier stages, and pipelines do not usually include a "remove-columns" step.
Thanks!
Hi again,
It would be easier for a PySpark wrapper if all params had names identical to the corresponding private member variables of the class. E.g.
Looks like you are still deciding on the proper name :-) It would be desirable for the param name to match the variable name exactly. The mismatch happens with maxDistance/maxNeighbors and with bufferSizeSampleSizes/bufferSizeSampleSize. I would appreciate it if you could update them so my PySpark wrapper code can properly mirror yours. In the meantime I have a temporary fix.
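A temporary fix of the kind mentioned above could be a small alias table on the wrapper side. The sketch below is only an illustration: `PARAM_ALIASES` and `resolve_param_names` are hypothetical names, and which name in each pair is the param and which is the member variable is an assumption, since the issue does not say.

```python
# Hypothetical alias table: name used by the wrapper -> name expected on the
# Scala side. The direction of each pair is an assumption for illustration.
PARAM_ALIASES = {
    "maxDistance": "maxNeighbors",
    "bufferSizeSampleSize": "bufferSizeSampleSizes",
}


def resolve_param_names(kwargs, aliases=PARAM_ALIASES):
    """Return a copy of kwargs with aliased names replaced by their targets."""
    resolved = {}
    for name, value in kwargs.items():
        target = aliases.get(name, name)  # unknown names pass through unchanged
        if target in resolved:
            # Both the alias and its target were supplied; refuse to guess.
            raise ValueError(f"parameter {target!r} given twice")
        resolved[target] = value
    return resolved
```

Once the names match on both sides, the table can simply be emptied and the wrapper keeps working unchanged.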
Thanks!
Congrats and thanks for making this available in Spark!
I am right now finishing a Pyspark wrapper around your code and I've found that the transform() method only selects the features and the label column:
However, the DF returned should contain all the columns of the input DF in order to be consistent with the expected behaviour of all existing Spark transformers. Otherwise, it becomes much more difficult to use in real use cases where you might have a lot of columns in the dataset you need to oversample.
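To make the requested behaviour concrete, here is a minimal, dependency-free sketch in which rows are plain dicts rather than a Spark DataFrame. It is not the ASMOTE implementation. It pairs minority rows at random instead of using nearest neighbours, and it assumes that a synthetic row copies its non-feature columns from the base row it was generated from. The function name `oversample_keep_columns` is hypothetical.

```python
import random


def oversample_keep_columns(rows, features_col, label_col, minority_label,
                            n_new, seed=0):
    """Append n_new synthetic minority rows, keeping every input column."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_col] == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    synthetic = []
    for _ in range(n_new):
        # Simplification: pick a random partner, not a k-nearest neighbour.
        base, other = rng.sample(minority, 2)
        gap = rng.random()
        new_row = dict(base)  # assumption: extra columns are copied from base
        new_row[features_col] = [
            b + gap * (o - b)
            for b, o in zip(base[features_col], other[features_col])
        ]
        synthetic.append(new_row)
    return rows + synthetic
```

The point is only that the output has the same set of columns as the input, so later stages of a pipeline still find what they expect.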
Hello! I just used your sample code. When it runs asmote.transform(ds), it gives me an exception:
java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1000.0) must be on interval [0, 1]
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:150)
at org.apache.spark.rdd.RDD.$anonfun$sample$2(RDD.scala:536)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.sample(RDD.scala:531)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:408)
at org.apache.spark.ml.instance.ASMOTE.transform(ASMOTE.scala:151)
... 47 elided
I get the error even though I didn't set any sampling fraction.
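The message in the trace comes from a range check on the fraction passed to RDD.sample. One way such a value can arise without the user setting it is when a fraction is derived as a target sample size divided by the row count, and the dataset has fewer rows than the target. Whether that is what happens at KNN.scala:408 is an assumption, not something the trace confirms. The sketch below reproduces the check in plain Python and shows a clamp that would avoid it. The function names and the numbers 1000 and 1 are illustrative only.

```python
def sampling_fraction(target_size, total_rows):
    """Fraction of rows to sample so that about target_size rows are drawn."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return target_size / total_rows


def check_fraction(fraction):
    """Mirror of the range check reported in the stack trace."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(
            f"Sampling fraction ({fraction}) must be on interval [0, 1]"
        )
    return fraction


def clamped_fraction(target_size, total_rows):
    """Same fraction, capped at 1.0 so small datasets are sampled whole."""
    return min(1.0, sampling_fraction(target_size, total_rows))
```

If this reading is right, the error would depend on the size of the input rather than on any setting, which matches the report that no sampling fraction was configured.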