bitnot / featran

This project forked from spotify/featran

A Scala feature transformation library for data science and machine learning

Home Page: https://spotify.github.io/featran

License: Apache License 2.0

Scala 94.74% Python 0.18% Java 4.99% Shell 0.02% HTML 0.07%

featran's Introduction

featran

Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and various output formats for feature representation.

Introduction

Most feature transformation logic requires two steps: a global aggregation to summarize the data, followed by an element-wise mapping to transform each element. For example:

  • Min-Max Scaler
    • Aggregation: global min & max
    • Mapping: rescale each value relative to the global [min, max] range
  • One-Hot Encoder
    • Aggregation: distinct labels
    • Mapping: convert each label to a binary vector

We can implement this in a naive way using reduce and map.

case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

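// Aggregation: a single pass that collects (global min, global max, distinct labels)
val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

// Mapping: min-max scaled score followed by a one-hot encoding of the label
val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}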

But this becomes unmanageable for complex feature sets. The same logic can be expressed concisely in Featran.

import com.spotify.featran._
import com.spotify.featran.transformers._

val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))

val fe = fs.extract(data)
val names = fe.featureNames
val features = fe.featureValues[Seq[Double]]
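To make the aggregate-then-map pattern behind a declarative specification more concrete, here is a minimal, dependency-free sketch in plain Scala. It is not Featran's implementation or API: `Transformer`, `Feature`, `feature`, `extract`, `MinMax`, `OneHot` and the `name_label` column naming are hypothetical names chosen for this illustration only.

```scala
// Toy sketch of the two-step pattern; all names here are hypothetical, not Featran's API.
object TwoStepSketch {
  case class Point(score: Double, label: String)

  // A = input value type, S = summary produced by the global aggregation.
  trait Transformer[A, S] {
    def prepare(a: A): S                // lift one value into a summary
    def combine(x: S, y: S): S          // merge two summaries
    def names(s: S): Seq[String]        // output column names
    def values(s: S, a: A): Seq[Double] // element-wise mapping
  }

  case class MinMax(name: String) extends Transformer[Double, (Double, Double)] {
    def prepare(a: Double): (Double, Double) = (a, a)
    def combine(x: (Double, Double), y: (Double, Double)): (Double, Double) =
      (math.min(x._1, y._1), math.max(x._2, y._2))
    def names(s: (Double, Double)): Seq[String] = Seq(name)
    def values(s: (Double, Double), a: Double): Seq[Double] = {
      val range = s._2 - s._1
      // Sketch-only choice: map everything to 0.0 when min == max.
      Seq(if (range == 0.0) 0.0 else (a - s._1) / range)
    }
  }

  case class OneHot(name: String) extends Transformer[String, Set[String]] {
    def prepare(a: String): Set[String] = Set(a)
    def combine(x: Set[String], y: Set[String]): Set[String] = x ++ y
    def names(s: Set[String]): Seq[String] = s.toSeq.sorted.map(l => s"${name}_$l")
    def values(s: Set[String], a: String): Seq[Double] =
      s.toSeq.sorted.map(l => if (l == a) 1.0 else 0.0)
  }

  // A transformer bound to a field getter; hides A and S from the caller.
  trait Feature[T] {
    def run(data: Seq[T]): (Seq[String], Seq[Seq[Double]])
  }

  def feature[T, A, S](get: T => A)(t: Transformer[A, S]): Feature[T] =
    new Feature[T] {
      def run(data: Seq[T]): (Seq[String], Seq[Seq[Double]]) = {
        // Step 1: global aggregation (assumes non-empty data).
        val summary = data.map(d => t.prepare(get(d))).reduce((x, y) => t.combine(x, y))
        // Step 2: element-wise mapping using the summary.
        (t.names(summary), data.map(d => t.values(summary, get(d))))
      }
    }

  // Run every feature and concatenate its columns row by row.
  def extract[T](data: Seq[T], features: Seq[Feature[T]]): (Seq[String], Seq[Seq[Double]]) = {
    val results = features.map(_.run(data))
    val names = results.flatMap(_._1)
    val rows = data.indices.map(i => results.flatMap(r => r._2(i)))
    (names, rows)
  }

  val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))
  val spec = Seq(
    feature((p: Point) => p.score)(MinMax("min-max")),
    feature((p: Point) => p.label)(OneHot("one-hot"))
  )
  val (names, rows) = extract(data, spec)
  // names: min-max, one-hot_a, one-hot_b, one-hot_c
  // rows:  (0.0, 1, 0, 0), (0.5, 0, 1, 0), (1.0, 0, 0, 1)
}
```

Each transformer carries its own aggregation and mapping, so adding a feature means adding one line to `spec` rather than widening a hand-written tuple. This toy version aggregates once per feature; a real library would presumably fold all summaries in a single pass.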

Featran also supports these additional features.

  • Extract from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
  • Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf, XGBoost LabeledPoint and NumPy .npy file
  • Import aggregation from a previous extraction for training, validation and test sets
  • Compose feature specifications and separate outputs

See Examples (source) for detailed examples. See transformers package for a complete list of available feature transformers.

See ScalaDocs for current API documentation.

Presentations

Artifacts

Featran includes the following artifacts:

  • featran-core - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
  • featran-java - Java interface, see JavaExample.java
  • featran-flink - support for extraction from Flink DataSet
  • featran-scalding - support for extraction from Scalding TypedPipe
  • featran-scio - support for extraction from Scio SCollection
  • featran-spark - support for extraction from Spark RDD
  • featran-tensorflow - support for output as TensorFlow Example Protobuf
  • featran-xgboost - support for output as XGBoost LabeledPoint
  • featran-numpy - support for output as NumPy .npy file
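As a rough illustration of how these artifacts might be pulled into an sbt build, the fragment below adds the core library and one backend. It assumes the artifacts are published under the `com.spotify` group ID (inferred from the `com.spotify.featran` package name) and are cross-built per Scala version; `featranVersion` is a placeholder, so check Maven Central for an actual release.

```scala
// build.sbt (sketch): the group ID is an assumption, the version is a placeholder
val featranVersion = "x.y.z" // replace with a released version

libraryDependencies ++= Seq(
  "com.spotify" %% "featran-core" % featranVersion, // Scala collections, Breeze output
  "com.spotify" %% "featran-spark" % featranVersion // extraction from Spark RDDs
)
```

Only `featran-core` is needed for the Scala collection examples; the other artifacts can be added one by one as each backend or output format is required.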

License

Copyright 2016-2017 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

featran's People

Contributors

alaiacano, andrewsmartin, clairemcginty, derenrich, fallonchen, jbx, martinbomio, nevillelyh, ravwojdyla, regadas, richwhitjr, scala-steward, slhansen, yonromai
