databricks / automl
License: Apache License 2.0
We are using Databricks AutoML for a regression problem. The job runs for around 5 minutes and then fails with the error:
ERROR databricks.automl.base_learner: AutoML run with experiment id: 1264215502939848 failed with non-AutoML error Exception('Unable to generate notebook at /mlworkspace/mlflow_experiments/23-01-24-07:55-16. Model_Train_Automl-8af8fe13/23-01-24-07:55-DataExploration-6daa65a552c058ab075213cdd68e2ece using format JUPYTER: {"error_code":"MAX_NOTEBOOK_SIZE_EXCEEDED","message":"File size imported is (61906255 bytes), exceeded max size (50000000 bytes)"}\n')
The dimensions of the dataset: (1160, 22)
Since the update to the mlflow integration with hyperopt, where names are automatically assigned to experiments (such as smiling-worm-674), I have consistently been getting the following error when running a previously working mlflow experiment with SparkTrials().
ERROR:hyperopt-spark:trial task 0 failed, exception is
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 405.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 405.0 (TID 1472) (10.143.252.81 executor 0):
org.apache.spark.api.python.PythonException: '_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range'
However, my experiment is not doing any pickling and my code is not referenced in the full traceback, so I am not exactly sure what the issue is. I can confirm that the experiment works when using hyperopt.Trials() rather than hyperopt.SparkTrials(). Apologies for such a lengthy issue, and sorry if the issue is some simple mistake on my end!
Here is the full traceback:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
return Pickler.dump(self, obj)
File "/databricks/python/lib/python3.9/site-packages/patsy/origin.py", line 117, in __getstate__
raise NotImplementedError
NotImplementedError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 527, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 604, in dump
if "recursion" in e.args[0]:
IndexError: tuple index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 876, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 868, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 329, in dump_stream
bytes = self.serializer.dumps(vs)
File "/databricks/spark/python/pyspark/serializers.py", line 537, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:692)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:902)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:884)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:645)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1029)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:889)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1692)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:892)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:747)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3257)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3189)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3180)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3180)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1414)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1414)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1414)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3466)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3407)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3395)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1166)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2702)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1027)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:411)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1025)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:282)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:748)
The following is the code that is being run in the experiments:
spark_trials = SparkTrials(parallelism=16)

with mlflow.start_run(run_name='test_experiment'):
    best_result = fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=1024,
        trials=spark_trials)
def objective(args):
    # Initialize model pipeline
    pipe = Pipeline(steps=[
        ('selection', args['selection'])
    ])
    pipe.set_params(**args['params'])  # Model parameters will be set here
    pipe.fit(X, y)
    penalty = pipe['selection'].penalty_
    try:
        residual = np.sum(pipe['selection']._resid) / len(pipe['selection']._resid)
    except AttributeError:
        residual = -10000
    r2 = r2_score(y, pipe.predict(X))
    score = 1 - r2
    mean_square = mean_squared_error(y, pipe.predict(X))
    mlflow.log_metric('avg_residual', residual)
    mlflow.log_metric('mean_squared_error', mean_square)
    mlflow.log_metric('penalty', penalty)
    mlflow.log_metric('r2', r2)
    print(f"Model Name: {args['selection']}: ", score)
    # Since hyperopt minimizes the loss, we return 1 - r2 as the score.
    return {'loss': score, 'status': STATUS_OK}
Here are the parameters and parameter space:
params = {
    'selection__fixed': hp.choice('selection.fixed', fixed_arrs),
    'selection__random': hp.choice('selection.random', random_arrs),
    'selection__intercept': hp.choice('selection.intercept', (0, 1)),
    'selection__cov': hp.choice('selection.cov', (0, 1))
}

space = hp.choice('regressors', [
    {
        'selection': LMEBaseRegressor(group=['panel'],
                                      dependent=dependent,
                                      media=media_cols),
        'params': params
    }
])
And finally, here is the regressor I am using (included because it's a custom class built on top of sklearn):
class LMEBaseRegressor(BaseEstimator, RegressorMixin):
    """Implementation of an LME Regression for scikit."""

    def __init__(self, random=None, fixed=None,
                 group=['panel'], dependent=None,
                 intercept=0, cov=0, media=None):
        self.random = random
        self.fixed = fixed
        self.group = group
        self.dependent = dependent
        self.intercept = intercept
        self.cov = cov
        self.media = media

    def fit(self, X, y):
        """Fit the model with LME."""
        str_dep = self.dependent[0]
        str_fixed = ' + '.join(self.fixed)
        str_random = ' + '.join(self.random)
        data = pd.concat([X, y], axis=1)
        self.penalty_ = 0
        print(f"{str_dep} ~ {self.intercept} + {str_fixed}")
        print(f"{self.cov} + {str_random}")
        try:
            mixed = smf.mixedlm(f"{str_dep} ~ {self.intercept} + {str_fixed}",
                                data,
                                re_formula=f"~ {self.cov} + {str_random}",
                                groups=data['panel'],
                                use_sqrt=True)\
                .fit(method=['lbfgs'])
            self._model = mixed
            self._resid = mixed.resid
            self.coef_ = mixed.params[0:len(self.fixed)]
        except ValueError:
            print("Cannot predict random effects from singular covariance structure.")
            self.penalty_ = 100
        except np.linalg.LinAlgError:
            print("Linear Algebra Error: recheck base model fit or try using fewer variables.")
            self.penalty_ = 100
        return self

    def predict(self, X):
        """Take the coefficients provided from fit and multiply them by X."""
        if self.penalty_ != 0:
            return np.ones(len(X)) * -100 * self.penalty_
        return self._model.predict(X)
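Not a fix, but a sketch of how the failure might be reproduced without Spark. The assumption (suggested by the patsy frame in the traceback, not confirmed) is that something captured by the objective, or stored on a fitted estimator, is what cloudpickle trips over. check_picklable is just a helper defined here, and fixed_arrs[0] / random_arrs[0] are assumed to be valid candidate column lists from the search space above.

# Hypothetical diagnostic sketch: run the same pickler that SparkTrials/PySpark
# use over the objects hyperopt would ship to executors, so any PicklingError
# shows up locally with a readable traceback.
from pyspark import cloudpickle  # the vendored pickler that appears in the traceback

def check_picklable(name, obj):
    try:
        cloudpickle.dumps(obj)
        print(f"{name}: picklable")
    except Exception as exc:  # e.g. the NotImplementedError raised by patsy's Origin
        print(f"{name}: FAILED -> {exc!r}")

check_picklable("objective", objective)   # closure capturing X and y
check_picklable("search space", space)    # contains the LMEBaseRegressor instance

# A fitted estimator holds the statsmodels MixedLM results (with patsy design
# info), which is a plausible non-picklable piece; candidate lists are assumed.
reg = LMEBaseRegressor(group=['panel'], dependent=dependent, media=media_cols,
                       fixed=fixed_arrs[0], random=random_arrs[0])
check_picklable("fitted regressor", reg.fit(X, y))

Whichever object fails here is the one to keep out of the SparkTrials closure (or to strip of its statsmodels/patsy internals before returning it), assuming the hypothesis above is right.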
Inside:
class MultiSeriesArimaModel(AbstractArimaModel):
There is:
def predict_timeseries(
        self,
        horizon: int = None,
        include_history: bool = True,
        df: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    """
    Predict target column for given horizon_timedelta and history data.

    :param horizon: Int number of periods to forecast forward.
    :param include_history: Boolean to include the historical dates in the data
        frame for predictions.
    :param df: A pd.Dataframe containing regressors (exogenous variables), if they were used to train the model.
    :return: A pd.DataFrame with the forecast components.
    """
    horizon = horizon or self._horizon
    ids = self._pickled_models.keys()
    preds_dfs = list(map(lambda id_: self._predict_timeseries_single_id(id_, horizon, include_history, df), ids))
    return pd.concat(preds_dfs).reset_index(drop=True)
Which calls:
self._predict_timeseries_single_id()
which in turn calls the original class:
ArimaModel()
which has multiple function calls and eventually reaches:
def _forecast(
        self,
        horizon: int = None,
        X: pd.DataFrame = None) -> pd.DataFrame:
    horizon = horizon or self._horizon
    preds, conf = self.model().predict(
        horizon,
        X=X,
        return_conf_int=True)
    ds_indices = self._get_ds_indices(start_ds=self._end_ds, periods=horizon + 1, frequency=self._frequency)[1:]
    preds_pd = pd.DataFrame({'ds': ds_indices, 'yhat': preds})
    preds_pd[["yhat_lower", "yhat_upper"]] = conf
    return preds_pd
How can I supply my own custom confidence interval, i.e. the conf used for the forecast? Why does MultiSeriesArimaModel.predict_timeseries() not take a confidence-level input? Why can't I request a 90% or 70% interval?
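For context on where the width comes from: assuming model() returns a pmdarima-style estimator (the return_conf_int keyword in _forecast points that way), the interval is fixed by that call's default alpha=0.05, i.e. a 95% band. A small standalone illustration of the alpha behaviour, under that assumption:

# Standalone illustration (assumption: the wrapped model is a pmdarima estimator).
import numpy as np
import pmdarima as pm

y = np.random.RandomState(0).standard_normal(100).cumsum()
model = pm.ARIMA(order=(1, 1, 0)).fit(y)

# alpha controls the interval width: alpha=0.10 -> 90% interval, alpha=0.30 -> 70%.
preds, conf90 = model.predict(10, return_conf_int=True, alpha=0.10)
preds, conf70 = model.predict(10, return_conf_int=True, alpha=0.30)
print(conf90[0], conf70[0])  # the 70% band should be narrower than the 90% band

If that assumption holds, a fix would look like threading a hypothetical conf_level argument through predict_timeseries down to _forecast and passing alpha=1 - conf_level to that predict call; today the 95% level appears to be baked in.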
Hi, I would like to prevent the Databricks AutoML preprocessing step from dropping my features, even if they may not contain relevant information.

summary = automl.classify(train_df, target_col="label", timeout_minutes=5)
help(summary)
model_uri = summary.best_trial.model_path
model = mlflow.sklearn.load_model(model_uri)  # sklearn

This is how I start the AutoML training. How can I integrate that into the code above? I have not found any documentation on it.
I'm getting an import error: "cannot import name 'FMIN_CANCELLED_REASON_EARLY_STOPPING' from 'hyperopt.spark'". This occurred through both the Databricks UI and the AutoML API, using runtimes 10.3 ML and 10.4 ML.
Any ideas on how to get past this? It seems to happen on import: from databricks.automl.supervised_learner import SupervisedLearner.
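One hedged guess at the cause (not confirmed): FMIN_CANCELLED_REASON_EARLY_STOPPING appears to exist only in the Databricks-patched hyperopt that ships with the ML runtimes, so a stock hyperopt installed from PyPI as a cluster or notebook library could shadow it. A quick check of which hyperopt the cluster is actually importing:

# Diagnostic sketch: confirm which hyperopt build is on the path and whether it
# carries the constant the AutoML import expects.
import hyperopt
import hyperopt.spark

print(hyperopt.__version__)
print(hyperopt.spark.__file__)  # a /databricks/... path vs. a pip-installed site-packages path
print(hasattr(hyperopt.spark, "FMIN_CANCELLED_REASON_EARLY_STOPPING"))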
Two model additions that would be great to have in AutoML:
CatBoost - in addition to the existing LightGBM and XGBoost, it could provide some benefits in handling categorical features directly.
Keras - even a constrained handful of architectures would add deep learning to AutoML, at least to identify when it performs well.
There is a bug in the use of the ProphetHyperParams enum in the AutoML implementation for forecasting with Prophet.
The metrics report the potentially very good best result from Hyperopt, but the stored model itself, the evaluation plots, and the forecast results are completely unrelated to it and probably much worse.
This situation occurs every time someone uses Databricks AutoML for forecasting.
Potential reason / bug:
automl/runtime/databricks/automl_runtime/forecast/prophet/forecast.py
Lines 150 to 153 in 242bf1a
Databricks Runtime 10.2 ML
databricks-automl-runtime==0.2.4
Hello,
While running an experiment with AutoML on Databricks Runtime 11.3 ML, I get the error:
Unable to generate notebook at [workspace location] using format JUPYTER: {"error_code": "MAX_NOTEBOOK_SIZE_EXCEEDED", "message": "File size imported is (34974148 bytes), exceeded max size (10485760 bytes)"}
The exact same code runs smoothly in other Databricks environments, even for datasets with more variables and more training instances. In one particular environment, however, this error always comes up.
The learning task is a regression. I have tried reducing the number of training instances from 20M (which I know are automatically sampled during the initial AutoML steps) down to 2K, but it still generates a Jupyter notebook of 12 MB (apparently bigger than the allowed maximum).
My first guess was that the pandas-profiling step causes the error while rendering the output of a "big" dataset, but I did manage to manually run the exact same pandas-profiling notebook using the same training DataFrame passed to the AutoML task.
Any help is appreciated, because I'm not sure what else to do; the error comes in a phase of the process that I haven't accessed or modified.
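For what it's worth, a sketch of how the profiling-size hypothesis could be measured directly, assuming the DataExploration notebook is pandas-profiling based on this runtime; profile and train_pdf are hypothetical names, with train_pdf standing in for the same pandas DataFrame handed to AutoML:

# Hypothetical check: render a profiling report for the training data and see how
# large the HTML output is, since output of roughly that kind appears to be what
# inflates the generated DataExploration notebook.
from pandas_profiling import ProfileReport

profile = ProfileReport(train_pdf, title="manual data exploration", minimal=True)
html = profile.to_html()
print(f"rendered report: {len(html.encode('utf-8')) / 1e6:.1f} MB")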
Hello,
When running an AutoML experiment on a cluster with Databricks Runtime 9.1 LTS ML, I get the following error during the setup stage:
Numba needs NumPy 1.20 or less
These are the other libraries installed on the cluster: sqlalchemy, gensim, nltk, xgboost, awswrangler.
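A hedged first step, assuming one of those extra libraries pulled in a NumPy newer than the Numba bundled with 9.1 LTS ML supports, would be to check the versions that actually ended up on the cluster and, if needed, pin NumPy back:

# Diagnostic sketch: inspect the NumPy/Numba combination on the cluster after the
# extra libraries were installed.
import numpy
import numba
print("numpy", numpy.__version__, "| numba", numba.__version__)

# If NumPy is newer than what Numba accepts, pinning it in a notebook cell is a
# possible workaround (the exact pin depends on the runtime's Numba version):
# %pip install "numpy<1.21"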
I noticed that there are no githooks available to facilitate contributions to the project. I think it would be helpful to have githooks that run tests and check code formatting whenever someone pushes changes to the repository.
I also noticed that linting is checked via GitHub Actions, but it currently surfaces a number of existing violations. I'd be happy to reformat the code with Black and set up checks on both sides to help ensure consistency and avoid future issues.
What do you think? Would you be open to adding githooks to the project? If so, I'd be happy to work on it and submit a pull request.
Also, I wanted to ask if there is a Slack channel where I could discuss this feature request with the community.
Thank you
Some components here (like MultiSeriesArimaModel or MultiSeriesProphetModel) end up as customer assets - it would be very nice to have some documentation for these published somewhere.