bulgogi's Introduction

bulgogi

Helper for ML feature preprocessing. Very tasty.

What's in a bulgogi?

Machine learning is cool, but those models are picky beasts. Most statistical models handle nothing but numbers, hence the development of techniques such as one-hot encoding for categorical variables (e.g. a country variable).

To create these numerical representations, and to create other interesting variables (aka features), we must pass data through a preprocessing step (aka feature engineering). If you work in an environment with several data science teams, it is likely that several of them have reimplemented the same feature engineering code in different projects. This is especially true in polyglot teams (e.g. some people prefer R, others Python, and a growing number Clojure).

bulgogi is a system to simplify this process, via centralization of concerns and coupling with a feature store [1] [2]. The final goal is to increase sharing of code, and ensure point-in-time correctness of training data.

[Diagram: bulgogi receives requests from a model in production and stores the results in a database; a new model is then trained with data from that database without redoing feature engineering.]

Presently, bulgogi is an idea and this repository is a playground for experimentation. Examples of applications and use-cases will be added as development evolves. The original code started as a gist.

Although this introduction and examples in /example focus on using bulgogi within the context of production ML models, this library is generic enough for other use-cases where mapping arbitrary functions over data is useful.

I gave a talk at re:Clojure 2021 about it.

Ideas and discussion are welcome!

What bulgogi is not

This is definitely not a replacement for ML pipelines (scikit-learn pipelines, tidymodels workflows) in all situations. If the cost of higher latency (no benchmarks yet on how much) outweighs the benefit of quicker collaboration, then by all means use a pipeline.

Pros and cons

Pros

In environments where multiple teams, or multiple people, need to ship ML models to live production environments, Bulgogi can:

  • reduce the time it takes to engineer features -- maybe a teammate has already built what you need
  • decouple model training from deployment, so you ship smaller files that load faster and have fewer things to track
  • allow ML practitioners to use whatever language they need to create models
  • reduce the amount of data cleaning and wrangling needed before training models

Cons

  • adds latency in comparison to inlined code (benchmarks soon!)
  • feature names need to be explicit, and potentially long, to avoid conflicts between namespaces, e.g. if two areas of your company/team call different things by the same name
  • all features must be written in Clojure (only a con if nobody on your team or in your company knows Clojure)

Installation

bulgogi is not in Clojars yet, but you can try it with deps.edn:

{:deps {io.github.jcpsantiago/bulgogi {:git/url "https://github.com/jcpsantiago/bulgogi/"
                                       :git/sha "278ce2738f26d4100b3470f133f682ad450662c4"}}}

Usage

You can see an example implementation in /example.

The main meat in bulgogi is the preprocessed function. It takes in a request map with keys :input-data (another map) and :features (a vector of strings) e.g.

{:input-data {:current-amount 700
              :email "[email protected]"}
 :features ["n-digits-in-email-name"
            "contains-risky-item"]}

and a namespace, e.g. 'example.main', in which to look for functions with the same names as the values in :features. It then pmaps those functions over the :input-data and, finally, returns a map with the preprocessed data:

{:n-digits-in-email-name 2
 :contains-risky-item 1}
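
The lookup-and-map step described above can be sketched roughly as follows. This is a minimal illustration and not bulgogi's actual implementation: the namespace example.sketch, the function preprocessed-sketch, both feature functions, the :items key and the sample email address are all assumptions made up for this example.

```clojure
(ns example.sketch
  (:require [clojure.string :as str]))

;; Hypothetical feature functions. Each one takes the whole :input-data map.
(defn n-digits-in-email-name
  "Counts the digits in the part of :email before the @."
  [{:keys [email]}]
  (count (re-seq #"\d" (first (str/split email #"@")))))

(defn contains-risky-item
  "Returns 1 if any item is flagged as risky, else 0. The :items key is assumed."
  [{:keys [items]}]
  (if (some :risky? items) 1 0))

(defn preprocessed-sketch
  "Looks up each feature name as a public var in feature-ns, pmaps the
  resulting functions over input-data, and returns a map of keyword -> value."
  [{:keys [input-data features]} feature-ns]
  (let [publics (ns-publics feature-ns)
        fns     (mapv (fn [feature]
                        (or (get publics (symbol feature))
                            (throw (ex-info "Unknown feature" {:feature feature}))))
                      features)]
    (zipmap (map keyword features)
            (pmap #(% input-data) fns))))

(preprocessed-sketch
 {:input-data {:current-amount 700
               :email "jane42@example.com"
               :items []}
  :features ["n-digits-in-email-name" "contains-risky-item"]}
 'example.sketch)
;; => {:n-digits-in-email-name 2, :contains-risky-item 0}
```

Looking names up with ns-publics rather than a plain resolve keeps a request from accidentally reaching functions referred from clojure.core, and the ex-info makes a mistyped feature name fail loudly.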

Contributing

Issues, PRs, ideas, criticism are all welcome :)

TODO

  • Create a reference implementation, potentially dockerized for fast deployment and testing
  • Benchmark Bulgogi vs inlined code (gold standard) and other libraries
  • Experiment with wrapping Pathom 3 to get more generalised dependency resolution
  • Declarative interface for calling external APIs as co-effects
  • How to declaratively add input and output (streams, sinks, etc)

License

Bulgogi is shared under the Eclipse Public License 1.0.

bulgogi's People

Contributors

cyrik, jcpsantiago

Forkers

cyrik

bulgogi's Issues

Feature resolving

Currently, features are strings that get looked up as function names in a single namespace.
This is easy to work with as long as the feature set is small, but it is brittle when a name is mistyped. The feature namespace could also grow quite large.

I'd like to brainstorm about solutions a little:

namespaced keywords

  • a little more complicated than the strings
  • user has to know the namespace
  • IDEs can help navigate, check for misspellings
  • trivial to use multiple namespaces

keep strings or non-namespaced keywords, but lookup in multiple namespaces

  • simple to use
  • hard to find the implementation for the user
  • how do we know which namespaces to check?
  • naming conflicts

keep strings or non-namespaced keywords, but force registration of the feature functions

  • simple to use
  • system can be asked about all available features
  • more state in the system
  • feature creators have to know how to register a feature
  • naming conflicts
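
To make the registration option more concrete, here is a rough sketch of what it could look like. Nothing here exists in bulgogi: the namespace example.registry and the names registry, register-feature!, available-features and resolve-feature are all hypothetical.

```clojure
(ns example.registry)

;; Hypothetical registry: a map of feature name -> function, held in an atom.
;; This is the "more state in the system" trade-off from the list above.
(defonce registry (atom {}))

(defn register-feature!
  "Registers f under feature-name. Throws on a naming conflict instead of
  silently overwriting an existing feature."
  [feature-name f]
  (swap! registry
         (fn [m]
           (if (contains? m feature-name)
             (throw (ex-info "Feature already registered" {:feature feature-name}))
             (assoc m feature-name f)))))

(defn available-features
  "Lets callers ask the system which features exist."
  []
  (sort (keys @registry)))

(defn resolve-feature
  "Returns the function for feature-name, or throws if it was never registered."
  [feature-name]
  (or (get @registry feature-name)
      (throw (ex-info "Unknown feature" {:feature feature-name}))))

(register-feature! "current-amount-doubled"
                   (fn [{:keys [current-amount]}] (* 2 current-amount)))

((resolve-feature "current-amount-doubled") {:current-amount 700})
;; => 1400

(available-features)
;; => ("current-amount-doubled")
```

Naming conflicts do not go away with this approach, but they surface at registration time rather than when a request arrives.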

any thoughts?
