mjuez / approx-smote

Approx-SMOTE: fast SMOTE for Big Data on Apache Spark

License: Apache License 2.0

Scala 96.52% Shell 3.48%
big-data imbalanced-data imbalanced-learning scala smote spark

approx-smote's Issues

Input should not assume the only columns are the label and features

Hi Mario,

When I include ASMOTE in a pipeline, I get an exception at this line:

case Row(label: Double, currentF: Vector, neighborsIter: Iterable[_]) =>

It looks like transform() expects that the only columns present in the data are the labelCol and the featuresCol. However, I think this constraint is not realistic (and not difficult to remove). As you know, in order to construct a vectorized featuresCol, we usually create a VectorAssembler as a part of a bigger pipeline, where ASMOTE will be included as well, just as another stage.

Unfortunately, this does not work as things stand, and requiring an extra stage that drops all the other columns would be unusual. The VectorAssembler normally comes after earlier stages, and pipelines normally have no "remove-columns" step.

Thanks!
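To illustrate the report above, here is a minimal sketch, assuming Spark ML is on the classpath. It shows why a positional Row pattern fails as soon as the DataFrame carries extra columns, and one possible way to narrow the data to the two needed columns for internal computation only. The object ColumnProjection and both of its methods are hypothetical names for illustration, not part of approx-smote.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of approx-smote.
object ColumnProjection {

  // A positional pattern matches only rows with exactly these two fields,
  // in this order. Any extra column makes the match fail.
  def positional(row: Row): Option[(Double, Vector)] = row match {
    case Row(label: Double, features: Vector) => Some((label, features))
    case _                                    => None
  }

  // Select the label and features columns by name, so the internal
  // computation does not depend on what else the input DataFrame holds.
  def labelAndFeatures(df: DataFrame, labelCol: String, featuresCol: String): DataFrame =
    df.select(col(labelCol), col(featuresCol))
}
```

With this approach the caller's DataFrame is left untouched, and only the internal view used for matching is narrowed.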

Rename params with incoherent names

Hi again,

It would be easier for a PySpark wrapper if every param had the same name as the corresponding member variable of the class. E.g.

final val maxDistance = new DoubleParam(this, "maxNeighbors", "maximum distance to find neighbors", // todo: maxDistance or maxNeighbors?

final val bufferSizeSampleSizes = new IntArrayParam(this, "bufferSizeSampleSize", // todo: should this have an 's' at the end?

It looks like you are still deciding on the proper names :-) Ideally the param name would be exactly the same as the variable name. The mismatch occurs with maxDistance/maxNeighbors and with bufferSizeSampleSizes/bufferSizeSampleSize. I would appreciate it if you could update them so that my PySpark wrapper code can mirror yours. In the meantime I have a temporary fix.

Thanks!
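A minimal sketch of the naming convention requested above, assuming Spark ML is on the classpath. The class NamedParams, the helper mismatched, and the doc string given for bufferSizeSampleSizes are hypothetical. Only the two param names and the maxDistance doc string come from the snippets quoted in the issue.

```scala
import org.apache.spark.ml.param.{DoubleParam, IntArrayParam, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable
import scala.util.Try

// Hypothetical holder class, used only to illustrate the convention.
class NamedParams(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("namedParams"))

  // The string name passed to the Param equals the name of the val.
  final val maxDistance =
    new DoubleParam(this, "maxDistance", "maximum distance to find neighbors")

  // Placeholder doc string (assumption; the original text is not shown in the issue).
  final val bufferSizeSampleSizes =
    new IntArrayParam(this, "bufferSizeSampleSizes", "sample sizes (placeholder doc)")

  override def copy(extra: ParamMap): NamedParams = defaultCopy(extra)
}

object NamedParams {
  // Returns the names of params that have no member with the same name.
  def mismatched(p: Params): Seq[String] =
    p.params.toSeq.map(_.name).filter(n => Try(p.getClass.getMethod(n)).isFailure)
}
```

A check like mismatched could run in a unit test, so that a param whose string name drifts from its variable name is caught before a wrapper depends on it.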

Method transform should keep all input columns in the result

Congrats and thanks for making this available in Spark!

I am currently finishing a PySpark wrapper around your code, and I've found that the transform() method selects only the features and the label columns:

ds.select($(labelCol), $(featuresCol)).toDF.union(denormalize(synthSamplesDF, scalerModel))

However, the DF returned should contain all the columns of the input DF in order to be consistent with the expected behaviour of all existing Spark transformers. Otherwise, it becomes much more difficult to use in real use cases where you might have a lot of columns in the dataset you need to oversample.
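One possible way to keep every input column, sketched under assumptions: Spark 3.1 or later (for the allowMissingColumns flag of unionByName), and synthetic samples that carry only a subset of the original columns. The object KeepColumns and the method appendSynthetic are hypothetical names, and this is not the library's implementation.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper object, not part of approx-smote.
object KeepColumns {

  // Appends the synthetic rows to the original DataFrame by column name.
  // Columns that the synthetic rows lack are filled with nulls, so the
  // result keeps all columns of the input.
  def appendSynthetic(original: DataFrame, synthetic: DataFrame): DataFrame =
    original.unionByName(synthetic, allowMissingColumns = true)
}
```

The synthetic rows then hold nulls in every column other than the label and the features, which downstream stages need to tolerate.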

A strange error

Hello! I am just using your sample code. When it runs asmote.transform(ds), it throws an exception:
java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1000.0) must be on interval [0, 1]
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:150)
at org.apache.spark.rdd.RDD.$anonfun$sample$2(RDD.scala:536)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.sample(RDD.scala:531)
at org.apache.spark.ml.knn.KNN.fit(KNN.scala:408)
at org.apache.spark.ml.instance.ASMOTE.transform(ASMOTE.scala:151)
... 47 elided
I get this error even though I didn't set any sampling fraction.
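The trace shows the fraction being rejected by RDD.sample, called from KNN.fit. One plausible cause, which the trace alone does not confirm, is a target sample size being divided by a smaller row count, which yields a fraction above 1 on small datasets. The sketch below shows how such a fraction could be clamped to the valid interval. The object SampleFraction and its parameters targetSize and totalCount are hypothetical names, not taken from the KNN code.

```scala
// Hypothetical helper object, not part of approx-smote or the KNN library.
object SampleFraction {

  // Computes a sampling fraction that always lies in [0, 1].
  // targetSize: desired number of sampled rows (assumed parameter).
  // totalCount: number of rows in the dataset (assumed parameter).
  def clamped(targetSize: Long, totalCount: Long): Double = {
    require(targetSize >= 0, "targetSize must be non-negative")
    if (totalCount <= 0) 0.0
    else math.min(1.0, targetSize.toDouble / totalCount)
  }
}
```

For example, a target of 1000 rows on a one-row dataset gives 1.0 after clamping, where the unclamped ratio would be 1000.0.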
