mjuez / approx-smote

Approx-SMOTE: fast SMOTE for Big Data on Apache Spark

License: Apache License 2.0

Scala 96.52% Shell 3.48%

big-data imbalanced-data imbalanced-learning scala smote spark

approx-smote's Issues

Input should not assume the only columns are the label and features

Hi Mario,

When including ASMOTE in a pipeline, I'm getting an exception here:

approx-smote/src/main/scala/org/apache/spark/ml/instance/ASMOTE.scala

Line 154 in 4672945

case Row(label: Double, currentF: Vector, neighborsIter: Iterable[_]) =>

It looks like transform() expects that the only columns present in the data are the labelCol and the featuresCol. However, I think this constraint is not realistic (and not difficult to remove). As you know, in order to construct a vectorized featuresCol, we usually create a VectorAssembler as a part of a bigger pipeline, where ASMOTE will be included as well, just as another stage.

Unfortunately, this does not work as things stand, because the input that reaches ASMOTE still contains all the other columns, and it would be odd to require an extra stage that drops them. Note that the VectorAssembler normally comes after earlier steps, and pipelines do not normally include a "remove-columns" step.

Thanks!
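
Until transform() tolerates extra columns, one stopgap is to project the input down to the two expected columns before calling it. The following is a minimal sketch, assuming ASMOTE can be used as a standard Spark ML Transformer and that the columns are named "label" and "features"; the helper name oversampleNarrow is hypothetical and not part of approx-smote.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.DataFrame

// Hypothetical helper: pass only the label and features columns to the sampler,
// label first, which is the order the quoted pattern match appears to expect.
def oversampleNarrow(
    df: DataFrame,
    sampler: Transformer, // e.g. an ASMOTE instance (assumed to extend Transformer)
    labelCol: String = "label",
    featuresCol: String = "features"): DataFrame =
  sampler.transform(df.select(labelCol, featuresCol))
```

This drops every other column, so it only helps when ASMOTE is applied outside the pipeline or as its last step; it does not address the request itself.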

Rename params with incoherent names

Hi again,

It would be easier for a PySpark wrapper if all params had names identical to the corresponding member variables of the class. E.g.

approx-smote/src/main/scala/org/apache/spark/ml/instance/ASMOTE.scala

Line 265 in 4672945

 final val maxDistance = new DoubleParam(this, "maxNeighbors", "maximum distance to find neighbors", // todo: maxDistance or maxNeighbors? 

approx-smote/src/main/scala/org/apache/spark/ml/instance/ASMOTE.scala

Line 291 in 4672945

 final val bufferSizeSampleSizes = new IntArrayParam(this, "bufferSizeSampleSize", // todo: should this have an 's' at the end? 

Looks like you are still deciding on the proper name :-) It would be desirable for the param name to match the variable name exactly. The mismatch happens with maxDistance/maxNeighbors and bufferSizeSampleSizes/bufferSizeSampleSize. I would appreciate it if you could update them so my PySpark wrapper code can properly mirror yours. In the meantime I have a temporary fix.

Thanks!
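
To see which names a stage actually registers, the standard Spark ML Params API can be queried directly. This is a rough sketch, not code from approx-smote: the helper names registeredParamNames and unregistered are hypothetical, and the usage comment assumes ASMOTE has a no-argument constructor.

```scala
import org.apache.spark.ml.param.Params

// Names under which a stage registers its params, i.e. the string passed as the
// second argument to each Param constructor (not the Scala member name).
def registeredParamNames(stage: Params): Seq[String] =
  stage.params.map(_.name).toSeq

// Member names that a wrapper expects but that are not registered under that name.
def unregistered(stage: Params, memberNames: Seq[String]): Seq[String] =
  memberNames.filterNot(name => stage.hasParam(name))

// Usage sketch (assumes ASMOTE has a no-argument constructor):
//   unregistered(new ASMOTE(), Seq("maxDistance", "bufferSizeSampleSizes"))
// Given the two lines quoted above, both names would be expected to be reported
// at commit 4672945, unless other params happen to be registered under them.
```

A check like this could run in the wrapper's tests so that any future rename on the Scala side is noticed early.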

Method transform should keep all input columns in the result

Congrats and thanks for making this available in Spark!

I am currently finishing a PySpark wrapper around your code, and I've found that the transform() method only selects the features and the label column:

approx-smote/src/main/scala/org/apache/spark/ml/instance/ASMOTE.scala

Line 170 in 4672945

 ds.select($(labelCol), $(featuresCol)).toDF.union(denormalize(synthSamplesDF, scalerModel)) 

However, the DF returned should contain all the columns of the input DF in order to be consistent with the expected behaviour of all existing Spark transformers. Otherwise, it becomes much more difficult to use in real use cases where you might have a lot of columns in the dataset you need to oversample.
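
One possible shape for this, sketched under assumptions: since the synthetic samples carry only the label and features, the remaining columns need some fill value, and null is assumed here. The helper name unionKeepingColumns is hypothetical; unionByName with allowMissingColumns is a standard Dataset method available from Spark 3.1.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: append synthetic rows to the full input DataFrame.
// Columns absent from `synthetic` are filled with null in the result.
// Requires Spark 3.1 or later for the allowMissingColumns argument.
def unionKeepingColumns(original: DataFrame, synthetic: DataFrame): DataFrame =
  original.unionByName(synthetic, allowMissingColumns = true)
```

Whether null is an acceptable value for the extra columns of synthetic rows depends on the downstream stages, so this is a design choice rather than a drop-in fix.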

A strange error

Hello! I just used your sample code. When it runs asmote.transform(ds), it gives me an exception:
java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1000.0) must be on interval [0, 1]
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:150)
at org.apache.spark.rdd.RDD.$anonfun$sample$2(RDD.scala:536)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.sample(RDD.scala:531)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:408)
at org.apache.spark.ml.instance.ASMOTE.transform(ASMOTE.scala:151)
... 47 elided
Even though I didn't set any sampling fraction, it gives this error.
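
The message comes from Spark's requirement that the fraction for sampling without replacement lie in [0, 1]; the trace shows the value 1000.0 reaching RDD.sample from KNN.fit. How that fraction is computed is not shown in this report. Assuming it is the ratio of a target sample size to the number of rows, which would exceed 1 on a small dataset, a guard could look like the sketch below. The function safeFraction and its parameters are hypothetical.

```scala
// Hypothetical helper: turn "about targetSize rows out of rowCount" into a
// fraction that Bernoulli sampling accepts, capping it at 1.0.
def safeFraction(targetSize: Long, rowCount: Long): Double = {
  require(targetSize >= 0, "targetSize must be non-negative")
  require(rowCount > 0, "rowCount must be positive")
  math.min(1.0, targetSize.toDouble / rowCount)
}
```

With this guard, a request for more rows than exist simply samples the whole dataset instead of failing.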
