picnicml / doddle-model
:cake: doddle-model: machine learning in Scala.
Home Page: https://picnicml.github.io
License: Apache License 2.0
We want to switch from Double to Float, primarily because it halves the memory footprint of the data.
Add k-means[1] clustering to doddle-model.
Currently, calling crossVal.score(...) computes and returns the average score across folds. However, we are often interested in both the mean and the standard deviation, so all fold scores (or a precomputed mean and standard deviation) should be returned instead.
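One way to expose this (a sketch with hypothetical names, not the actual crossVal API) is to return all fold scores in a small result type from which the mean and standard deviation can be derived:

```scala
object CrossValScores {
  // Hypothetical result type: keep every fold score so callers can
  // derive any statistic they need, not just the mean.
  final case class CvResult(foldScores: Vector[Double]) {
    def mean: Double = foldScores.sum / foldScores.length
    def std: Double = {
      val m = mean
      math.sqrt(foldScores.map(s => (s - m) * (s - m)).sum / foldScores.length)
    }
  }
}
```

Returning the raw scores keeps the API backwards compatible: the current average is still one method call away.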
Fix predictProba(...).
Something similar to scikit-learn's GridSearch and RandomSearch. Could be a class called HyperparameterSearch that takes as input a function generating Predictor instances with different hyperparameters. That way users could control whether they want to search the space on a grid or at random.
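A minimal sketch of this idea, with hypothetical names rather than the real doddle-model API: the search only needs an iterator of candidate models and a scoring function, so grid and random search differ only in how the caller builds the iterator:

```scala
object HyperparameterSearchSketch {
  // Hypothetical generic search: `candidates` is supplied by the user,
  // so the same function covers grid search (exhaustive iterator) and
  // random search (iterator of randomly sampled configurations).
  def search[P](candidates: Iterator[P], score: P => Double): P =
    candidates.maxBy(score)
}
```

A user would call it with e.g. `search(gridOfModels.iterator, m => crossValScore(m))`, where the candidate generator encodes the search strategy.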
It's difficult to understand what LabelEncoder does. I have tried looking at the inferLabelEncoder method in CsvLoader.scala, to be specific the line
encoder(name)(featureValue) = encoder(name).size
but it's still not clear what LabelEncoder is being used for.
Having documentation comments would really help here.
Describe the bug
Consider the code below, which creates a pipeline where a one-hot encoder is first applied to the categorical feature and then the numerical feature is normalized to zero mean and unit variance (plus a softmax classifier, which is not important here).
To verify that everything was working correctly, I put a println inside the fit method of Pipeline. I noticed that the code below does not work as intended (see Expected behavior and Actual behavior).
To Reproduce
import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.{FeatureIndex, CategoricalFeature, NumericalFeature}
import io.picnicml.doddlemodel.pipeline.Pipeline.pipe
import io.picnicml.doddlemodel.pipeline.{Pipeline, PipelineTransformers}
import io.picnicml.doddlemodel.preprocessing.{StandardScaler, OneHotEncoder}
import io.picnicml.doddlemodel.linear.SoftmaxClassifier
import io.picnicml.doddlemodel.syntax.PredictorSyntax._
// 3 examples with 1 categorical and 1 numerical feature
val xTr = DenseMatrix(
  List(0.0, 3.7),
  List(5.0, 2.4),
  List(2.0, -0.3)
)
val yTr = DenseVector(0.0, 1.0, 0.0)
val featureIndex = FeatureIndex(List(CategoricalFeature, NumericalFeature))
val transformers: PipelineTransformers = List(
  pipe(OneHotEncoder(featureIndex)),
  pipe(StandardScaler(featureIndex))
)
val pipeline = Pipeline(transformers)(pipe(SoftmaxClassifier()))
val trainedPipeline = pipeline.fit(xTr, yTr)
Expected behavior
The first feature is removed and encoded by 6 new columns. Then, standard scaler is applied on the (now) FIRST (index 0) feature.
Actual behavior
The first feature is removed and encoded by 6 new columns. Then, standard scaler is applied on the SECOND (index 1) feature.
Versions
Scala: 2.13.0
doddle-model: 0.0.1
Additional context
Prints of partial results (i.e. transformed data):
3.7 1.0 0.0 0.0 0.0 0.0 0.0
2.4 0.0 0.0 0.0 0.0 0.0 1.0
-0.3 0.0 0.0 1.0 0.0 0.0 0.0
3.7 1.1547005383792515 0.0 0.0 0.0 0.0 0.0
2.4 -0.5773502691896258 0.0 0.0 0.0 0.0 1.0
-0.3 -0.5773502691896258 0.0 1.0 0.0 0.0 0.0
Create a single page website: picnicml.github.io.
Currently, one can feed any kind of y to fit(...), predict(...), predictProba(...) and predictMean(...), e.g. a continuous variable to a LogisticRegression model.
KNeighborsClassifier seems to be a very popular classification algorithm.
Do you have any plans/timeline for implementing it?
Cheers.
Currently, Pipeline only supports the predict function, but it should also expose predictProba if the predictor is a classifier.
One particular manifestation of #7 is to implement a preprocessor, similar to https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
Integration tests sometimes fail because data is shuffled without a seed.
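A sketch of the fix, assuming a plain scala.util.Random is acceptable: shuffling with a fixed seed makes the resulting split deterministic across test runs:

```scala
import scala.util.Random

object SeededShuffle {
  // Shuffle row indices with a fixed seed so integration tests
  // see the same train/test split on every run.
  def shuffledIndices(n: Int, seed: Long): Vector[Int] =
    new Random(seed).shuffle((0 until n).toVector)
}
```

The seed would be a parameter of the split function, with a fixed default used only in tests.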
Currently, standard scaler preprocesses all of the columns. Ideally, the user would be able to specify which columns get preprocessed.
Set up https://scalameta.org/scalafmt/. From their website:
Spend more time discussing important issues in code review and less time on code style. Scalafmt formats code so that it looks consistent between people on your team.
Similar to scikit-learn's Pipeline.
Categorical features are automatically encoded as numerical data during the loading of CSV files. The label encoder constructed during that process should be modified and exposed for reuse during test time (e.g. serving of the model).
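A minimal sketch of such a reusable encoder (hypothetical names, not the actual CsvLoader implementation): fit it once on the training data, then reuse the stored mapping at serving time, with unseen categories surfaced explicitly:

```scala
object LabelEncoderSketch {
  // Hypothetical encoder: fitted on training data, then reused at test
  // time. Unseen categories map to None so the caller decides how to
  // handle them (e.g. treat as missing).
  final case class LabelEncoder(mapping: Map[String, Double]) {
    def encode(category: String): Option[Double] = mapping.get(category)
  }

  def fit(values: Seq[String]): LabelEncoder =
    LabelEncoder(values.distinct.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap)
}
```

Because the encoder is a plain immutable value, it can be serialized alongside the trained model.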
Identify which algorithms would be the most useful to implement first. For a list of existing implementations, take a look at the examples repository; for a list of scikit-learn implementations, take a look at this. Add a link to the roadmap (wiki) to the readme and to the website.
Currently, input data (x and y) is copied for every split of cross-validation. Take a look at the code, where toDenseMatrix and toDenseVector are called on every SliceMatrix and SliceVector. This should be fixed to make the implementation more memory efficient.
This needs a discussion on what functionality should be implemented first. If anyone has a suggestion or wants to tackle any of this, feel free to propose a roadmap.
Some resources:
For an example see this.
The current implementation is very slow. I think a better approach would be to implement a custom solution rather than using a third-party library.
Currently, the processing is sequential.
If a subset of columns is selected for preprocessing, the transformer should make sure that the types of those columns are compatible with the transformation.
List of transformers to fix:
All PRs from forked projects are now failing since secrets are not copied to forks (circleci). Uploading of coverage to codacy either needs to be disabled in CI for forked projects or some other solution needs to be implemented.
Better benchmarking of the library. Use sbt-jmh to benchmark and visualise the performance. This was a recommendation from plokhotnyuk.
When the best performing model is returned from grid/random search and evaluated on the test set, a user might want to retrain it on the whole dataset with the same hyperparameters. Currently, one would have to inspect which hyperparameters were selected, and this is problematic in some cases, especially in a Pipeline where the types of the original transformers and the predictor are lost. For that reason, we want to expose an .unfit() method, which would create an unfitted estimator with the same hyperparameters.
Example usage:
val split = splitData(x, y)
val selectedModel = gridSearch(split.xTr, split.yTr)
val score = f1Score(split.yTe, selectedModel.predict(split.xTe))
val finalModel = selectedModel.unfit().fit(x, y)
Currently, we throw errors at runtime if .predict is called on an unfitted estimator; conversely, we throw an error if .fit is called on a trained estimator. The idea of this issue is that such mistakes should be caught at compile time rather than at runtime.
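One possible encoding, sketched with phantom types and simplified hypothetical model types (not the actual doddle-model design): the compiler rejects predict on an unfitted model and fit on an already fitted one:

```scala
object TypestateSketch {
  // Phantom types track the fitted state at compile time; the State
  // parameter never appears at runtime.
  sealed trait Fitted
  sealed trait Unfitted

  final case class Model[State](weights: Option[Vector[Double]]) {
    // Only callable when State is Unfitted; calling fit twice fails to compile.
    def fit(x: Vector[Vector[Double]], y: Vector[Double])(
        implicit ev: State =:= Unfitted): Model[Fitted] =
      Model[Fitted](Some(Vector.fill(x.head.length)(0.0))) // placeholder training

    // Only callable when State is Fitted; predicting before fit fails to compile.
    def predict(x: Vector[Vector[Double]])(
        implicit ev: State =:= Fitted): Vector[Double] =
      x.map(_ => 0.0) // placeholder prediction
  }

  def untrained: Model[Unfitted] = Model[Unfitted](None)
}
```

With this encoding, `untrained.predict(...)` is a compile error rather than a runtime exception.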
Currently, only DenseMatrix is a valid input.
Use groupedWithin for vectorised predictions. See https://doc.akka.io/docs/akka/current/stream/operators/Source-or-Flow/groupedWithin.html#groupedwithin.
Describe the solution you'd like
CrossValidation should receive and calculate an arbitrary number of metrics specified by the user.
Metrics are means of ML algorithms evaluation. See this for existing implementations.
Currently, data can be read from CSV files with this piece of code, but the implementation is very limited: only numerical data (doubles) can be read, and non-numerical fields result in Double.NaN, which is used to encode missing data in doddle-model. The new implementation should be able to encode numerical and categorical variables with missing values, i.e. numerical features should be encoded as doubles directly and categorical features should first be encoded to a numerical representation (take a look at the label encoder) and only then converted to doubles. The function will probably look something like loadCsvDataset(filePath: String, naString: String, headerLine: Boolean = true): DenseMatrix[Double].
A 3rd party library should be used to parse CSV files (take a look at this discussion for some starting points).
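A sketch of the per-field rule described above (a hypothetical helper, independent of whichever CSV library is chosen): the configured NA string maps to Double.NaN, while numerical fields parse directly:

```scala
object CsvFieldSketch {
  // Sketch of per-field parsing under the proposed scheme: the NA string
  // becomes Double.NaN (doddle-model's missing-value marker) and numerical
  // fields become doubles directly. Categorical fields would be routed
  // through the label encoder before reaching this step.
  def parseNumerical(field: String, naString: String): Double =
    if (field == naString) Double.NaN
    else field.toDouble
}
```

The chosen CSV library would only be responsible for tokenizing rows; this rule stays the same regardless of the parser.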
We need to provide a mechanism for saving and loading the state of models.
Implement column selection, similar to how it's implemented in the standard scaler class with the featureIndex argument. The idea is to allow the user to impute a subset of columns with this preprocessor.
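A minimal sketch of mean imputation restricted to selected columns, using plain Scala collections instead of Breeze and hypothetical names:

```scala
object MeanImputerSketch {
  // Hypothetical mean imputation over a chosen subset of columns: NaN cells
  // in a selected column are replaced by that column's mean over non-NaN rows;
  // unselected columns pass through untouched.
  def impute(data: Vector[Vector[Double]], columns: Set[Int]): Vector[Vector[Double]] = {
    val means = columns.map { c =>
      val vals = data.map(_(c)).filterNot(_.isNaN)
      c -> (if (vals.isEmpty) 0.0 else vals.sum / vals.length)
    }.toMap
    data.map(_.zipWithIndex.map { case (v, c) =>
      if (v.isNaN && columns.contains(c)) means(c) else v
    })
  }
}
```

In the real implementation the column subset would presumably come from a featureIndex argument, mirroring the standard scaler.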
Generate documentation based on Scaladoc and publish it on http://picnicml.github.io.
See this: https://tpolecat.github.io/2015/04/29/f-bounds.html. E.g.
val model = LinearRegression()
val trainedModel = model.fit(x, y)
// trainedModel's type is Regressor, should be LinearRegression
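One of the options discussed in the linked post is a typeclass encoding, sketched below with simplified hypothetical types: fit is parameterised on the concrete model type A, so the trained model keeps its precise type instead of being widened to Regressor:

```scala
object ReturnTypeSketch {
  // Typeclass instead of an inheritance hierarchy: fit returns A, not a
  // widened supertype. These are stand-in types, not the doddle-model API.
  trait Regressor[A] {
    def fit(model: A, x: Vector[Double], y: Vector[Double]): A
  }

  final case class LinearRegression(weights: Option[Double])

  implicit val lrRegressor: Regressor[LinearRegression] =
    new Regressor[LinearRegression] {
      def fit(model: LinearRegression, x: Vector[Double], y: Vector[Double]): LinearRegression =
        LinearRegression(Some(0.0)) // placeholder training
    }

  def fit[A](model: A, x: Vector[Double], y: Vector[Double])(
      implicit ev: Regressor[A]): A = ev.fit(model, x, y)
}
```

Here `fit(LinearRegression(None), x, y)` has static type LinearRegression, which is exactly what the example above asks for.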
Implement the DummyClassifier and DummyRegressor.
Describe the bug
java.io.NotSerializableException is thrown when trying to persist a logistic regression model within a pipeline.
To Reproduce
Steps to reproduce the behavior:
Run this.
Versions
Scala 2.13, doddle-model 0.0.1-beta4
Stacktrace
Exception in thread "main" java.io.NotSerializableException: io.picnicml.doddlemodel.linear.LogisticRegression$$anon$1
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at io.picnicml.doddlemodel.typeclasses.Estimator.save(Estimator.scala:11)
at io.picnicml.doddlemodel.typeclasses.Estimator.save$(Estimator.scala:9)
at io.picnicml.doddlemodel.pipeline.Pipeline$$anon$1.save(Pipeline.scala:19)
at io.picnicml.doddlemodel.syntax.PredictorSyntax$PredictorOps$.save$extension(PredictorSyntax.scala:16)
at io.picnicml.doddlemodel.spamham.TrainClassifier$.delayedEndpoint$io$picnicml$doddlemodel$spamham$TrainClassifier$1(TrainClassifier.scala:25)
at io.picnicml.doddlemodel.spamham.TrainClassifier$delayedInit$body.apply(TrainClassifier.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1(App.scala:73)
at scala.App.$anonfun$main$1$adapted(App.scala:73)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
at scala.collection.AbstractIterable.foreach(Iterable.scala:921)
at scala.App.main(App.scala:73)
at scala.App.main$(App.scala:71)
at io.picnicml.doddlemodel.spamham.TrainClassifier$.main(TrainClassifier.scala:18)
at io.picnicml.doddlemodel.spamham.TrainClassifier.main(TrainClassifier.scala)
Currently, the loadCsvDataset function returns a DenseMatrix which stores the loaded data. It should also return a list of feature names constructed from the header line. This would allow the user to translate feature names to column indices in the data matrix.