
picnicml / doddle-model

:cake: doddle-model: machine learning in Scala.

Home Page: https://picnicml.github.io

License: Apache License 2.0

Scala 100.00%
breeze data-science doddle-model machine-learning scala

doddle-model's People

Contributors

ashwinbhaskar, dandxy89, evanhaldane, inejc, matejklemen, nikdon, novoselrok, scala-steward

doddle-model's Issues

Return all scores from cross validation

Currently, crossVal.score(...) returns only the average of the per-fold scores. Since we often want both the mean and the standard deviation, it should return all per-fold scores (or the precomputed mean and standard deviation) instead.
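
A minimal sketch of what such a return value could look like, using only the standard library. The names CvScores, mean and std are hypothetical and not part of doddle-model's API; the standard deviation here is the population one (dividing by the number of folds), which is an assumption.

```scala
object CrossValScoresSketch {
  // Hypothetical container for per-fold scores; not part of doddle-model.
  final case class CvScores(scores: Vector[Double]) {
    require(scores.nonEmpty, "at least one fold score is required")

    // Average of the per-fold scores (what is returned today).
    val mean: Double = scores.sum / scores.size

    // Population standard deviation over the folds.
    val std: Double =
      math.sqrt(scores.map(s => math.pow(s - mean, 2)).sum / scores.size)
  }
}
```

A caller could then report both numbers, e.g. CvScores(Vector(0.8, 0.9, 1.0)).mean and .std, while still having access to the raw per-fold values.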

Implement a hyperparameter search functionality

Something similar to scikit-learn's GridSearchCV and RandomizedSearchCV. This could be a class called HyperparameterSearch that takes as input a function generating Predictor instances with different hyperparameters. That way the user controls whether the space is searched on a grid or at random.
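
The idea of passing in a generator function can be sketched as follows. This is only an illustration under stated assumptions: the search function, its parameter names and the generic candidate type P are hypothetical, and a real version would generate Predictor instances and score them with cross-validation.

```scala
object HyperparameterSearchSketch {
  // Hypothetical helper: `generate` yields the next candidate and
  // `score` evaluates it (higher is better).
  def search[P](numCandidates: Int)(generate: () => P)(score: P => Double): (P, Double) = {
    require(numCandidates > 0, "numCandidates must be positive")
    Iterator
      .continually(generate())
      .take(numCandidates)
      .map(p => (p, score(p)))
      // Keep the best-scoring candidate; the first one wins on ties.
      .reduce((best, next) => if (next._2 > best._2) next else best)
  }
}
```

The caller decides how candidates are produced. For a grid, generate can step through a fixed iterator (val grid = Iterator(0.01, 0.1, 1.0) with () => grid.next()); for a random search it can draw from a scala.util.Random instance.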

Add Documentation for LabelEncoder

It's difficult to understand what LabelEncoder does. I have tried looking at the method inferLabelEncoder in CsvLoader.scala, specifically the line

encoder(name)(featureValue) = encoder(name).size

But it's not yet clear what LabelEncoder is being used for.

A documentation comment would really help here.
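
One plausible reading of the quoted line, offered as an assumption rather than a description of the actual implementation: if encoder maps each column name to a mutable map from categorical values to integer codes, the assignment gives a value seen for the first time the next unused code, i.e. the current size of that column's map. The sketch below illustrates this; the Encoder type alias and the infer function are hypothetical.

```scala
import scala.collection.mutable

object LabelEncoderSketch {
  // Assumed shape: column name -> (categorical value -> integer code).
  type Encoder = mutable.Map[String, mutable.Map[String, Int]]

  def infer(columns: Map[String, Seq[String]]): Encoder = {
    val encoder: Encoder = mutable.Map.empty
    for ((name, values) <- columns) {
      encoder(name) = mutable.LinkedHashMap.empty[String, Int]
      for (featureValue <- values if !encoder(name).contains(featureValue))
        // A value seen for the first time gets the next free code: 0, 1, 2, ...
        encoder(name)(featureValue) = encoder(name).size
    }
    encoder
  }
}
```

Under this reading, a column containing red, blue, red, green would be encoded as red -> 0, blue -> 1, green -> 2.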

Behaviour of OneHotEncoder in a Pipeline

Describe the bug
Consider the code below, which creates a pipeline, where one hot encoder is applied first to the categorical feature and then the numerical feature is normalized to have 0 mean and variance of 1 (+ softmax, which is not important here).

To check that everything was working correctly, I put a println inside the fit method of Pipeline and noticed that the code below does not work as intended (see Expected behaviour and Actual behaviour).

To Reproduce

import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.{FeatureIndex, CategoricalFeature, NumericalFeature}
import io.picnicml.doddlemodel.pipeline.Pipeline.pipe
import io.picnicml.doddlemodel.pipeline.{Pipeline, PipelineTransformers}
import io.picnicml.doddlemodel.preprocessing.{StandardScaler, OneHotEncoder}
import io.picnicml.doddlemodel.linear.SoftmaxClassifier
import io.picnicml.doddlemodel.syntax.PredictorSyntax._

// 3 examples with 1 categorical and 1 numerical feature
val xTr = DenseMatrix(
	List(0.0, 3.7),
	List(5.0, 2.4),
	List(2.0, -0.3)
)
val yTr = DenseVector(0.0, 1.0, 0.0)
val featureIndex = FeatureIndex(List(CategoricalFeature, NumericalFeature))
val transformers: PipelineTransformers = List(
	pipe(OneHotEncoder(featureIndex)),
	pipe(StandardScaler(featureIndex))
)
val pipeline = Pipeline(transformers)(pipe(SoftmaxClassifier()))
val trainedPipeline = pipeline.fit(xTr, yTr)

Expected behaviour
The first feature is removed and encoded by 6 new columns. Then, standard scaler is applied on the (now) FIRST (index 0) feature.

Actual behaviour
The first feature is removed and encoded by 6 new columns. Then, standard scaler is applied on the SECOND (index 1) feature.

Versions
Scala: 2.13.0
doddle-model: 0.0.1

Additional context
Prints of partial results (i.e. transformed data):

  • after first transformer is applied:
3.7   1.0  0.0  0.0  0.0  0.0  0.0  
2.4   0.0  0.0  0.0  0.0  0.0  1.0  
-0.3  0.0  0.0  1.0  0.0  0.0  0.0 
  • after second transformer is applied:
3.7   1.1547005383792515   0.0  0.0  0.0  0.0  0.0  
2.4   -0.5773502691896258  0.0  0.0  0.0  0.0  1.0  
-0.3  -0.5773502691896258  0.0  1.0  0.0  0.0  0.0
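
The printed results suggest that the column layout changes after one hot encoding (the numerical column moves to index 0), while the scaler still uses the original feature index. A minimal sketch of keeping the index in sync is shown below. All names are hypothetical, and marking the indicator columns as categorical is an assumption; the column order mirrors the output above (untouched columns first, indicator columns appended).

```scala
object FeatureIndexSketch {
  sealed trait FeatureType
  case object Numerical extends FeatureType
  case object Categorical extends FeatureType

  // Feature types after one hot encoding: non-categorical columns keep
  // their relative order, followed by one indicator column per category.
  def afterOneHot(types: Vector[FeatureType], numCategories: Vector[Int]): Vector[FeatureType] = {
    val kept = types.filter(_ != Categorical)
    kept ++ Vector.fill(numCategories.sum)(Categorical)
  }

  // Column indices a scaler should operate on.
  def numericalColumns(types: Vector[FeatureType]): Vector[Int] =
    types.zipWithIndex.collect { case (Numerical, i) => i }
}
```

With the feature index from the report, numericalColumns gives Vector(1) before encoding and Vector(0) after afterOneHot(..., Vector(6)), which is the column the scaler was expected to use.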

Implement a KNeighborsClassifier

KNeighborsClassifier seems to be a very popular classification algorithm.
Do you have any plans/timeline for implementing it?

Cheers.

Implement some basic feature transformers

Transforming features into a format that is more suitable for specific algorithms is an integral part of machine learning. See this for existing implementations and see this for scikit-learn implementations.

Write a roadmap for the project

Identify which algorithms would be the most useful to implement first. For a list of existing implementations, take a look at the examples repository; for a list of scikit-learn implementations, take a look at this. Add a link to the roadmap (wiki) to the readme and to the website.

Don't copy data in cross-validation

Currently, input data (x and y) is copied for every split of cross-validation. Take a look at the code, where toDenseMatrix and toDenseVector are called on every SliceMatrix and SliceVector. This should be fixed in order to be more memory efficient.

Optimize performance of CSVLoader

The current implementation is very slow. I think a better approach would be to implement a custom solution rather than use a third-party library.

Disable codacy coverage upload on forked PRs

All PRs from forked projects are now failing because secrets are not copied to forks (CircleCI). Uploading coverage to Codacy either needs to be disabled in CI for forked projects, or some other solution needs to be implemented.

Add the unfit method to the estimator API

When the best-performing model is returned from grid/random search and evaluated on the test set, a user might want to retrain it on the whole dataset with the same hyperparameters. Currently, one would have to inspect which hyperparameters were selected, which is problematic in some cases, especially in a Pipeline, where the types of the original transformers and the predictor are lost. For that reason, we want to expose an .unfit() method that creates an unfitted estimator with the same hyperparameters.

Example usage:

val split = splitData(x, y)
val selectedModel = gridSearch(split.xTr, split.yTr)
val score = f1Score(split.yTe, selectedModel.predict(split.xTe))
val finalModel = selectedModel.unfit().fit(x, y)
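
A minimal sketch of the idea behind unfit(), assuming an estimator represented as an immutable case class that separates hyperparameters from learned state. RidgeLike and its fields are hypothetical, and fit is only a stand-in that records placeholder weights.

```scala
object UnfitSketch {
  // Hypothetical estimator: `lambda` is a hyperparameter,
  // `weights` is state learned during training.
  final case class RidgeLike(lambda: Double, weights: Option[Vector[Double]] = None) {
    def isFitted: Boolean = weights.isDefined

    // Stand-in for real training: stores one zero weight per feature.
    def fit(x: Vector[Vector[Double]], y: Vector[Double]): RidgeLike =
      copy(weights = Some(Vector.fill(x.headOption.map(_.size).getOrElse(0))(0.0)))

    // Keep the hyperparameters, drop everything that was learned.
    def unfit(): RidgeLike = copy(weights = None)
  }
}
```

Because unfit() returns a value of the same type, the caller never needs to know which hyperparameters were selected in order to retrain.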

Make estimator API typesafe

Currently, we throw an error at runtime if .predict is called on an unfitted estimator and, conversely, if .fit is called on a trained estimator. The idea of this issue is that such mistakes should be caught at compile time rather than at runtime.
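
One common way to get this kind of compile-time check in Scala is a phantom type parameter that records the training state. The sketch below is an illustration of that technique, not the design doddle-model adopted; MeanRegressor and the state traits are hypothetical.

```scala
object TypesafeEstimatorSketch {
  sealed trait State
  sealed trait Unfitted extends State
  sealed trait Fitted extends State

  // The phantom type parameter S records whether the model has been trained.
  final class MeanRegressor[S <: State] private (mean: Option[Double]) {
    // Compiles only when S is Unfitted.
    def fit(y: Seq[Double])(implicit ev: S =:= Unfitted): MeanRegressor[Fitted] =
      new MeanRegressor[Fitted](Some(y.sum / y.size))

    // Compiles only when S is Fitted, so `mean` is always defined here.
    def predict(n: Int)(implicit ev: S =:= Fitted): Seq[Double] =
      Seq.fill(n)(mean.get)
  }

  object MeanRegressor {
    def apply(): MeanRegressor[Unfitted] = new MeanRegressor[Unfitted](None)
  }
}
```

With this encoding, MeanRegressor().fit(Seq(1.0, 2.0, 3.0)).predict(2) compiles, whereas MeanRegressor().predict(2) and calling fit twice in a row are rejected by the compiler.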

Implement a better CSV reading functionality

Currently, data can be read from CSV files with this piece of code, but the implementation is very limited: only numerical data (doubles) can be read, and non-numerical fields result in Double.NaN, which is used to encode missing data in doddle-model. The new implementation should be able to encode numerical and categorical variables with missing values, i.e. numerical features should be encoded as doubles directly, while categorical features should first be encoded to a numerical representation (take a look at the label encoder) and only then converted to doubles. The function will probably look something like loadCsvDataset(filePath: String, naString: String, headerLine: Boolean = true): DenseMatrix[Double].

A 3rd party library should be used to parse CSV files (take a look at this discussion for some starting points).
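
The encoding step described above can be sketched independently of the parsing. The example assumes the rows have already been split into string fields by whichever CSV library is chosen; the encode function and the isCategorical predicate are hypothetical, and a nested Vector stands in for DenseMatrix[Double] to keep the sketch free of dependencies.

```scala
import scala.collection.mutable
import scala.util.Try

object CsvEncodingSketch {
  // Encodes already-parsed rows; `isCategorical(j)` tells whether
  // column j holds categorical values (a hypothetical input).
  def encode(
      rows: Seq[Seq[String]],
      isCategorical: Int => Boolean,
      naString: String
  ): Vector[Vector[Double]] = {
    // Per-column label encoders: value -> integer code.
    val codes = mutable.Map.empty[Int, mutable.Map[String, Int]]
    rows.toVector.map { row =>
      row.toVector.zipWithIndex.map {
        // Missing values become NaN in both kinds of columns.
        case (field, _) if field == naString => Double.NaN
        // Categorical: label-encode first, then convert to double.
        case (field, j) if isCategorical(j) =>
          val columnCodes = codes.getOrElseUpdate(j, mutable.Map.empty[String, Int])
          columnCodes.getOrElseUpdate(field, columnCodes.size).toDouble
        // Numerical: parse directly; unparsable fields become NaN.
        case (field, _) => Try(field.toDouble).getOrElse(Double.NaN)
      }
    }
  }
}
```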

NotSerializableException when trying to persist a pipeline

Describe the bug
java.io.NotSerializableException is thrown when trying to persist a logistic regression model within a pipeline.

To Reproduce
Steps to reproduce the behavior:
Run this.

Versions
Scala 2.13, doddle-model 0.0.1-beta4

Stacktrace

Exception in thread "main" java.io.NotSerializableException: io.picnicml.doddlemodel.linear.LogisticRegression$$anon$1
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at io.picnicml.doddlemodel.typeclasses.Estimator.save(Estimator.scala:11)
	at io.picnicml.doddlemodel.typeclasses.Estimator.save$(Estimator.scala:9)
	at io.picnicml.doddlemodel.pipeline.Pipeline$$anon$1.save(Pipeline.scala:19)
	at io.picnicml.doddlemodel.syntax.PredictorSyntax$PredictorOps$.save$extension(PredictorSyntax.scala:16)
	at io.picnicml.doddlemodel.spamham.TrainClassifier$.delayedEndpoint$io$picnicml$doddlemodel$spamham$TrainClassifier$1(TrainClassifier.scala:25)
	at io.picnicml.doddlemodel.spamham.TrainClassifier$delayedInit$body.apply(TrainClassifier.scala:18)
	at scala.Function0.apply$mcV$sp(Function0.scala:39)
	at scala.Function0.apply$mcV$sp$(Function0.scala:39)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
	at scala.App.$anonfun$main$1(App.scala:73)
	at scala.App.$anonfun$main$1$adapted(App.scala:73)
	at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
	at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:921)
	at scala.App.main(App.scala:73)
	at scala.App.main$(App.scala:71)
	at io.picnicml.doddlemodel.spamham.TrainClassifier$.main(TrainClassifier.scala:18)
	at io.picnicml.doddlemodel.spamham.TrainClassifier.main(TrainClassifier.scala)
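
The class named in the exception, LogisticRegression$$anon$1, is an anonymous class. With Java serialization, every object reachable from the one being written must implement java.io.Serializable, so an anonymous instance that does not mix it in causes this kind of failure. The sketch below reproduces the mechanism in isolation; Loss and Model are hypothetical, and whether this is the exact cause in doddle-model is an assumption.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

object SerializationSketch {
  trait Loss { def apply(residual: Double): Double }

  // Case classes are Serializable, but every field they hold must be as well.
  final case class Model(loss: Loss)

  // Writes `obj` to an in-memory stream and returns the number of bytes written.
  def trySerialize(obj: AnyRef): Try[Int] = Try {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  // Anonymous instance without Serializable: writeObject fails on it.
  val failing: Model = Model(new Loss { def apply(r: Double): Double = r * r })

  // Mixing in Serializable lets the same kind of instance be written.
  val working: Model = Model(new Loss with Serializable { def apply(r: Double): Double = r * r })
}
```

Here trySerialize(failing) yields a Failure wrapping java.io.NotSerializableException, while trySerialize(working) succeeds.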
