picnicml / doddle-model
:cake: doddle-model: machine learning in Scala.
Home Page: https://picnicml.github.io
License: Apache License 2.0
We want to switch from Double to Float, primarily because it halves the memory footprint of the data.
Add k-means[1] clustering to doddle-model.
Currently, calling crossVal.score(...) computes and returns the average score across folds. However, we are often interested in both the mean and the standard deviation, so all fold scores (or a precomputed mean and standard deviation) should be returned instead.
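One way to expose this (a sketch with hypothetical names, not the actual crossVal API) is to return all fold scores in a small result type from which the mean and standard deviation can be derived:

```scala
object CrossValScores {
  // Hypothetical result type: keep every fold score so callers can
  // derive any statistic they need, not just the mean.
  final case class CvResult(foldScores: Vector[Double]) {
    def mean: Double = foldScores.sum / foldScores.length
    def std: Double = {
      val m = mean
      math.sqrt(foldScores.map(s => (s - m) * (s - m)).sum / foldScores.length)
    }
  }
}
```

Returning the raw scores keeps the API backwards compatible: the current average is still one method call away.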
Fix predictProba(...).
Something similar to scikit-learn's GridSearch and RandomSearch. Could be a class called HyperparameterSearch that takes as input a function generating Predictor instances with different hyperparameters. That way users could control whether they want to search the space on a grid or at random.
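A minimal sketch of this idea, with hypothetical names rather than the real doddle-model API: the search only needs an iterator of candidate models and a scoring function, so grid and random search differ only in how the caller builds the iterator:

```scala
object HyperparameterSearchSketch {
  // Hypothetical generic search: `candidates` is supplied by the user,
  // so the same function covers grid search (exhaustive iterator) and
  // random search (iterator of randomly sampled configurations).
  def search[P](candidates: Iterator[P], score: P => Double): P =
    candidates.maxBy(score)
}
```

A user would call it with e.g. `search(gridOfModels.iterator, m => crossValScore(m))`, where the candidate generator encodes the search strategy.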
It's difficult to understand what LabelEncoder does. I have tried looking at the inferLabelEncoder method in CsvLoader.scala, to be specific the line
encoder(name)(featureValue) = encoder(name).size
but it's still not clear what LabelEncoder is being used for.
Having documentation comments would really help here.
Describe the bug
Consider the code below, which creates a pipeline where a one-hot encoder is first applied to the categorical feature and then the numerical feature is normalized to zero mean and unit variance (plus a softmax classifier, which is not important here).
To verify that everything was working correctly, I put a println inside the fit method of Pipeline. I noticed that the code below does not work as intended (see Expected behavior and Actual behavior).
To Reproduce
import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.{FeatureIndex, CategoricalFeature, NumericalFeature}
import io.picnicml.doddlemodel.pipeline.Pipeline.pipe
import io.picnicml.doddlemodel.pipeline.{Pipeline, PipelineTransformers}
import io.picnicml.doddlemodel.preprocessing.{StandardScaler, OneHotEncoder}
import io.picnicml.doddlemodel.linear.SoftmaxClassifier
import io.picnicml.doddlemodel.syntax.PredictorSyntax._
// 3 examples with 1 categorical and 1 numerical feature
val xTr = DenseMatrix(
  List(0.0, 3.7),
  List(5.0, 2.4),
  List(2.0, -0.3)
)
val yTr = DenseVector(0.0, 1.0, 0.0)
val featureIndex = FeatureIndex(List(CategoricalFeature, NumericalFeature))
val transformers: PipelineTransformers = List(
  pipe(OneHotEncoder(featureIndex)),
  pipe(StandardScaler(featureIndex))
)
val pipeline = Pipeline(transformers)(pipe(SoftmaxClassifier()))
val trainedPipeline = pipeline.fit(xTr, yTr)
Expected behavior
The first feature is removed and encoded by 6 new columns. Then, standard scaler is applied on the (now) FIRST (index 0) feature.
Actual behavior
The first feature is removed and encoded by 6 new columns. Then, standard scaler is applied on the SECOND (index 1) feature.
Versions
Scala: 2.13.0
doddle-model: 0.0.1
Additional context
Prints of partial results (i.e. transformed data):
3.7 1.0 0.0 0.0 0.0 0.0 0.0
2.4 0.0 0.0 0.0 0.0 0.0 1.0
-0.3 0.0 0.0 1.0 0.0 0.0 0.0
3.7 1.1547005383792515 0.0 0.0 0.0 0.0 0.0
2.4 -0.5773502691896258 0.0 0.0 0.0 0.0 1.0
-0.3 -0.5773502691896258 0.0 1.0 0.0 0.0 0.0
Create a single page website: picnicml.github.io.
Currently, one can feed any kind of y to fit(...), predict(...), predictProba(...) and predictMean(...), e.g. a continuous variable to a LogisticRegression model.
KNeighborsClassifier seems to be a very popular classification algorithm.
Do you have any plans/timeline for implementing it?
Cheers.
Currently, Pipeline only supports the predict function, but it should also expose predictProba if the predictor is a classifier.
One particular manifestation of #7 is to implement a preprocessor, similar to https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
Integration tests sometimes fail because data is shuffled without a seed.
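A sketch of the fix, assuming a plain scala.util.Random is acceptable: shuffling with a fixed seed makes the resulting split deterministic across test runs:

```scala
import scala.util.Random

object SeededShuffle {
  // Shuffle row indices with a fixed seed so integration tests
  // see the same train/test split on every run.
  def shuffledIndices(n: Int, seed: Long): Vector[Int] =
    new Random(seed).shuffle((0 until n).toVector)
}
```

The seed would be a parameter of the split function, with a fixed default used only in tests.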
Currently, standard scaler preprocesses all of the columns. Ideally, the user would be able to specify which columns get preprocessed.
Set up https://scalameta.org/scalafmt/. From their website:
Spend more time discussing important issues in code review and less time on code style. Scalafmt formats code so that it looks consistent between people on your team.
Similar to scikit-learn's Pipeline.
Categorical features are automatically encoded as numerical data during the loading of CSV files. The label encoder constructed during that process should be modified and exposed for reuse during test time (e.g. serving of the model).
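A minimal sketch of such a reusable encoder (hypothetical names, not the actual CsvLoader implementation): fit it once on the training data, then reuse the stored mapping at serving time, with unseen categories surfaced explicitly:

```scala
object LabelEncoderSketch {
  // Hypothetical encoder: fitted on training data, then reused at test
  // time. Unseen categories map to None so the caller decides how to
  // handle them (e.g. treat as missing).
  final case class LabelEncoder(mapping: Map[String, Double]) {
    def encode(category: String): Option[Double] = mapping.get(category)
  }

  def fit(values: Seq[String]): LabelEncoder =
    LabelEncoder(values.distinct.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap)
}
```

Because the encoder is a plain immutable value, it can be serialized alongside the trained model.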
Identify which algorithms would be the most useful to implement first. For a list of existing implementations, take a look at the examples repository; for a list of scikit-learn implementations, take a look at this. Add a link to the roadmap (wiki) to the readme and to the website.
Currently, input data (x and y) is copied for every split of cross-validation. Take a look at the code, where toDenseMatrix and toDenseVector are called on every SliceMatrix and SliceVector. This should be fixed to make the implementation more memory efficient.
This needs a discussion on what functionality should be implemented first. If anyone has a suggestion or wants to tackle any of this, feel free to propose a roadmap.
Some resources:
For an example see this.
The current implementation is very slow. I think a better approach would be to implement a custom solution rather than using a third-party library.
Currently, the processing is sequential.
If a subset of columns is selected for preprocessing, the transformer should make sure that the types of those columns are compatible with the transformation.
List of transformers to fix:
All PRs from forked projects are now failing since secrets are not copied to forks (circleci). Uploading of coverage to codacy either needs to be disabled in CI for forked projects or some other solution needs to be implemented.
Better benchmarking of the library. Use sbt-jmh to benchmark and visualise the performance. This was a recommendation from plokhotnyuk.
When the best performing model is returned from grid/random search and evaluated on the test set, a user might want to retrain it on the whole dataset with the same hyperparameters. Currently, one would have to inspect which hyperparameters were selected, and this is problematic in some cases, especially in a Pipeline where the types of the original transformers and the predictor are lost. For that reason, we want to expose an .unfit() method, which would create an unfitted estimator with the same hyperparameters.
Example usage:
val split = splitData(x, y)
val selectedModel = gridSearch(split.xTr, split.yTr)
val score = f1Score(split.yTe, selectedModel.predict(split.xTe))
val finalModel = selectedModel.unfit().fit(x, y)
Currently, we throw errors at runtime if .predict is called on an unfitted estimator; conversely, we throw an error if .fit is called on a trained estimator. The idea of this issue is that such mistakes should be caught at compile time rather than at runtime.
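One possible encoding, sketched with phantom types and simplified hypothetical model types (not the actual doddle-model design): the compiler rejects predict on an unfitted model and fit on an already fitted one:

```scala
object TypestateSketch {
  // Phantom types track the fitted state at compile time; the State
  // parameter never appears at runtime.
  sealed trait Fitted
  sealed trait Unfitted

  final case class Model[State](weights: Option[Vector[Double]]) {
    // Only callable when State is Unfitted; calling fit twice fails to compile.
    def fit(x: Vector[Vector[Double]], y: Vector[Double])(
        implicit ev: State =:= Unfitted): Model[Fitted] =
      Model[Fitted](Some(Vector.fill(x.head.length)(0.0))) // placeholder training

    // Only callable when State is Fitted; predicting before fit fails to compile.
    def predict(x: Vector[Vector[Double]])(
        implicit ev: State =:= Fitted): Vector[Double] =
      x.map(_ => 0.0) // placeholder prediction
  }

  def untrained: Model[Unfitted] = Model[Unfitted](None)
}
```

With this encoding, `untrained.predict(...)` is a compile error rather than a runtime exception.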
Currently, only DenseMatrix is a valid input.
Use groupedWithin for vectorised predictions. See https://doc.akka.io/docs/akka/current/stream/operators/Source-or-Flow/groupedWithin.html#groupedwithin.
Describe the solution you'd like
CrossValidation should receive and calculate an arbitrary number of metrics specified by the user.
Metrics are means of ML algorithms evaluation. See this for existing implementations.
Currently, data can be read from CSV files with this piece of code, but the implementation is very limited: only numerical data (doubles) can be read, and non-numerical fields result in Double.NaN, which is used to encode missing data in doddle-model. The new implementation should be able to encode numerical and categorical variables with missing values, i.e. numerical features should be encoded as doubles directly and categorical features should first be encoded to a numerical representation (take a look at the label encoder) and only then converted to doubles. The function will probably look something like loadCsvDataset(filePath: String, naString: String, headerLine: Boolean = true): DenseMatrix[Double].
A 3rd party library should be used to parse CSV files (take a look at this discussion for some starting points).
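A sketch of the per-field rule described above (a hypothetical helper, independent of whichever CSV library is chosen): the configured NA string maps to Double.NaN, while numerical fields parse directly:

```scala
object CsvFieldSketch {
  // Sketch of per-field parsing under the proposed scheme: the NA string
  // becomes Double.NaN (doddle-model's missing-value marker) and numerical
  // fields become doubles directly. Categorical fields would be routed
  // through the label encoder before reaching this step.
  def parseNumerical(field: String, naString: String): Double =
    if (field == naString) Double.NaN
    else field.toDouble
}
```

The chosen CSV library would only be responsible for tokenizing rows; this rule stays the same regardless of the parser.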
We need to provide a mechanism for saving and loading the state of models.
Implement column selection, similar to how it's implemented in the standard scaler class with the featureIndex argument. The idea is to allow the user to impute a subset of columns with this preprocessor.
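A minimal sketch of mean imputation restricted to selected columns, using plain Scala collections instead of Breeze and hypothetical names:

```scala
object MeanImputerSketch {
  // Hypothetical mean imputation over a chosen subset of columns: NaN cells
  // in a selected column are replaced by that column's mean over non-NaN rows;
  // unselected columns pass through untouched.
  def impute(data: Vector[Vector[Double]], columns: Set[Int]): Vector[Vector[Double]] = {
    val means = columns.map { c =>
      val vals = data.map(_(c)).filterNot(_.isNaN)
      c -> (if (vals.isEmpty) 0.0 else vals.sum / vals.length)
    }.toMap
    data.map(_.zipWithIndex.map { case (v, c) =>
      if (v.isNaN && columns.contains(c)) means(c) else v
    })
  }
}
```

In the real implementation the column subset would presumably come from a featureIndex argument, mirroring the standard scaler.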
Generate documentation based on Scaladoc and publish it on http://picnicml.github.io.
See this: https://tpolecat.github.io/2015/04/29/f-bounds.html. E.g.
val model = LinearRegression()
val trainedModel = model.fit(x, y)
// trainedModel's type is Regressor, should be LinearRegression
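One of the options discussed in the linked post is a typeclass encoding, sketched below with simplified hypothetical types: fit is parameterised on the concrete model type A, so the trained model keeps its precise type instead of being widened to Regressor:

```scala
object ReturnTypeSketch {
  // Typeclass instead of an inheritance hierarchy: fit returns A, not a
  // widened supertype. These are stand-in types, not the doddle-model API.
  trait Regressor[A] {
    def fit(model: A, x: Vector[Double], y: Vector[Double]): A
  }

  final case class LinearRegression(weights: Option[Double])

  implicit val lrRegressor: Regressor[LinearRegression] =
    new Regressor[LinearRegression] {
      def fit(model: LinearRegression, x: Vector[Double], y: Vector[Double]): LinearRegression =
        LinearRegression(Some(0.0)) // placeholder training
    }

  def fit[A](model: A, x: Vector[Double], y: Vector[Double])(
      implicit ev: Regressor[A]): A = ev.fit(model, x, y)
}
```

Here `fit(LinearRegression(None), x, y)` has static type LinearRegression, which is exactly what the example above asks for.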
Implement the DummyClassifier and DummyRegressor.
Describe the bug
java.io.NotSerializableException is thrown when trying to persist a logistic regression model within a pipeline.
To Reproduce
Steps to reproduce the behavior:
Run this.
Versions
Scala 2.13, doddle-model 0.0.1-beta4
Stacktrace
Exception in thread "main" java.io.NotSerializableException: io.picnicml.doddlemodel.linear.LogisticRegression$$anon$1
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at io.picnicml.doddlemodel.typeclasses.Estimator.save(Estimator.scala:11)
at io.picnicml.doddlemodel.typeclasses.Estimator.save$(Estimator.scala:9)
at io.picnicml.doddlemodel.pipeline.Pipeline$$anon$1.save(Pipeline.scala:19)
at io.picnicml.doddlemodel.syntax.PredictorSyntax$PredictorOps$.save$extension(PredictorSyntax.scala:16)
at io.picnicml.doddlemodel.spamham.TrainClassifier$.delayedEndpoint$io$picnicml$doddlemodel$spamham$TrainClassifier$1(TrainClassifier.scala:25)
at io.picnicml.doddlemodel.spamham.TrainClassifier$delayedInit$body.apply(TrainClassifier.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:73)
at scala.App.$anonfun$main$1$adapted(App.scala:73)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
at scala.collection.AbstractIterable.foreach(Iterable.scala:921)
at scala.App.main(App.scala:73)
at scala.App.main$(App.scala:71)
at io.picnicml.doddlemodel.spamham.TrainClassifier$.main(TrainClassifier.scala:18)
at io.picnicml.doddlemodel.spamham.TrainClassifier.main(TrainClassifier.scala)
Currently, the loadCsvDataset function returns a DenseMatrix which stores the loaded data. It should also return a list of feature names constructed from the header line. This would allow the user to translate feature names to column indices in the data matrix.