databricks / automl
License: Apache License 2.0
We are using Databricks AutoML for a regression problem. The job runs for around 5 minutes and then fails with the error:
ERROR databricks.automl.base_learner: AutoML run with experiment id: 1264215502939848 failed with non-AutoML error Exception('Unable to generate notebook at /mlworkspace/mlflow_experiments/23-01-24-07:55-16. Model_Train_Automl-8af8fe13/23-01-24-07:55-DataExploration-6daa65a552c058ab075213cdd68e2ece using format JUPYTER: {"error_code":"MAX_NOTEBOOK_SIZE_EXCEEDED","message":"File size imported is (61906255 bytes), exceeded max size (50000000 bytes)"}\n')
The dimensions of the dataset: (1160, 22)
Since the update to the mlflow integration with hyperopt, where names are automatically assigned to experiments (such as smiling-worm-674), I have consistently been getting the following error when running a previously working mlflow experiment with SparkTrials().
ERROR:hyperopt-spark:trial task 0 failed, exception is
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 405.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 405.0 (TID 1472) (10.143.252.81 executor 0):
org.apache.spark.api.python.PythonException: '_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range'
However, my experiment is not doing any pickling and my code is not referenced in the full traceback, so I am not exactly sure what the issue is. I can confirm that the experiment works when using hyperopt.Trials() rather than hyperopt.SparkTrials(). Apologies for such a lengthy issue, and sorry if the issue is some simple mistake on my end!
Here is the full traceback:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
return Pickler.dump(self, obj)
File "/databricks/python/lib/python3.9/site-packages/patsy/origin.py", line 117, in __getstate__
raise NotImplementedError
NotImplementedError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 527, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 604, in dump
if "recursion" in e.args[0]:
IndexError: tuple index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 876, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 868, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 329, in dump_stream
bytes = self.serializer.dumps(vs)
File "/databricks/spark/python/pyspark/serializers.py", line 537, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:692)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:902)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:884)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:645)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1029)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:889)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1692)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:892)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:747)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3257)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3189)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3180)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3180)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1414)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1414)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1414)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3466)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3407)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3395)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1166)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2702)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1027)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:411)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1025)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:282)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:748)
The following is the code that is being run in the experiments:
spark_trials = SparkTrials(parallelism=16)

with mlflow.start_run(run_name='test_experiment'):
    best_result = fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=1024,
        trials=spark_trials)
def objective(args):
    # Initialize model pipeline
    pipe = Pipeline(steps=[
        ('selection', args['selection'])
    ])
    pipe.set_params(**args['params'])  # Model parameters will be set here
    pipe.fit(X, y)
    penalty = pipe['selection'].penalty_
    try:
        residual = np.sum(pipe['selection']._resid) / len(pipe['selection']._resid)
    except AttributeError:
        residual = -10000
    r2 = r2_score(y, pipe.predict(X))
    score = 1 - r2
    mean_square = mean_squared_error(y, pipe.predict(X))
    mlflow.log_metric('avg_residual', residual)
    mlflow.log_metric('mean_squared_error', mean_square)
    mlflow.log_metric('penalty', penalty)
    mlflow.log_metric('r2', r2)
    print(f"Model Name: {args['selection']}: ", score)
    # Since hyperopt minimizes the loss, we return 1 - r2 as the score.
    return {'loss': score, 'status': STATUS_OK}
Here are the parameters and parameter space:
params = {
    'selection__fixed': hp.choice('selection.fixed', fixed_arrs),
    'selection__random': hp.choice('selection.random', random_arrs),
    'selection__intercept': hp.choice('selection.intercept', (0, 1)),
    'selection__cov': hp.choice('selection.cov', (0, 1))
}

space = hp.choice('regressors', [
    {
        'selection': LMEBaseRegressor(group=['panel'],
                                      dependent=dependent,
                                      media=media_cols),
        'params': params
    }
])
And finally, here is the regressor I am using (included because it's a custom class built on top of sklearn):
class LMEBaseRegressor(BaseEstimator, RegressorMixin):
    """Implementation of an LME Regression for scikit."""

    def __init__(self, random=None, fixed=None,
                 group=['panel'], dependent=None,
                 intercept=0, cov=0, media=None):
        self.random = random
        self.fixed = fixed
        self.group = group
        self.dependent = dependent
        self.intercept = intercept
        self.cov = cov
        self.media = media

    def fit(self, X, y):
        """Fit the model with LME."""
        str_dep = self.dependent[0]
        str_fixed = ' + '.join(self.fixed)
        str_random = ' + '.join(self.random)
        data = pd.concat([X, y], axis=1)
        self.penalty_ = 0
        print(f"{str_dep} ~ {self.intercept} + {str_fixed}")
        print(f"{self.cov} + {str_random}")
        try:
            mixed = smf.mixedlm(f"{str_dep} ~ {self.intercept} + {str_fixed}",
                                data,
                                re_formula=f"~ {self.cov} + {str_random}",
                                groups=data['panel'],
                                use_sqrt=True)\
                .fit(method=['lbfgs'])
            self._model = mixed
            self._resid = mixed.resid
            self.coef_ = mixed.params[0:len(self.fixed)]
        except ValueError:
            print("Cannot predict random effects from singular covariance structure.")
            self.penalty_ = 100
        except np.linalg.LinAlgError:
            print("Linear Algebra Error: recheck base model fit or try using fewer variables.")
            self.penalty_ = 100
        return self

    def predict(self, X):
        """Take the coefficients provided from fit and multiply them by X."""
        if self.penalty_ != 0:
            return np.ones(len(X)) * -100 * self.penalty_
        return self._model.predict(X)
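Not a fix, but a sketch of how the failure might be reproduced without Spark. The assumption (suggested by the patsy frame in the traceback, not confirmed) is that something captured by the objective, or stored on a fitted estimator, is what cloudpickle trips over. check_picklable is just a helper defined here, and fixed_arrs[0] / random_arrs[0] are assumed to be valid candidate column lists from the search space above.

# Hypothetical diagnostic sketch: run the same pickler that SparkTrials/PySpark
# use over the objects hyperopt would ship to executors, so any PicklingError
# shows up locally with a readable traceback.
from pyspark import cloudpickle  # the vendored pickler that appears in the traceback

def check_picklable(name, obj):
    try:
        cloudpickle.dumps(obj)
        print(f"{name}: picklable")
    except Exception as exc:  # e.g. the NotImplementedError raised by patsy's Origin
        print(f"{name}: FAILED -> {exc!r}")

check_picklable("objective", objective)   # closure capturing X and y
check_picklable("search space", space)    # contains the LMEBaseRegressor instance

# A fitted estimator holds the statsmodels MixedLM results (with patsy design
# info), which is a plausible non-picklable piece; candidate lists are assumed.
reg = LMEBaseRegressor(group=['panel'], dependent=dependent, media=media_cols,
                       fixed=fixed_arrs[0], random=random_arrs[0])
check_picklable("fitted regressor", reg.fit(X, y))

Whichever object fails here is the one to keep out of the SparkTrials closure (or to strip of its statsmodels/patsy internals before returning it), assuming the hypothesis above is right.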
Inside:
class MultiSeriesArimaModel(AbstractArimaModel):
There is:
def predict_timeseries(
        self,
        horizon: int = None,
        include_history: bool = True,
        df: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    """
    Predict target column for given horizon_timedelta and history data.

    :param horizon: Int number of periods to forecast forward.
    :param include_history: Boolean to include the historical dates in the data
        frame for predictions.
    :param df: A pd.Dataframe containing regressors (exogenous variables), if they were used to train the model.
    :return: A pd.DataFrame with the forecast components.
    """
    horizon = horizon or self._horizon
    ids = self._pickled_models.keys()
    preds_dfs = list(map(lambda id_: self._predict_timeseries_single_id(id_, horizon, include_history, df), ids))
    return pd.concat(preds_dfs).reset_index(drop=True)
Which calls:
self._predict_timeseries_single_id()
which in turn calls the original class:
ArimaModel()
which has multiple function calls and eventually reaches:
def _forecast(
        self,
        horizon: int = None,
        X: pd.DataFrame = None) -> pd.DataFrame:
    horizon = horizon or self._horizon
    preds, conf = self.model().predict(
        horizon,
        X=X,
        return_conf_int=True)
    ds_indices = self._get_ds_indices(start_ds=self._end_ds, periods=horizon + 1, frequency=self._frequency)[1:]
    preds_pd = pd.DataFrame({'ds': ds_indices, 'yhat': preds})
    preds_pd[["yhat_lower", "yhat_upper"]] = conf
    return preds_pd
How can I supply my own custom confidence interval, i.e. the conf used for the forecast? Why does MultiSeriesArimaModel.predict_timeseries() not take a confidence-level input? Why can't I request a 90% or 70% interval?
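For context on where the width comes from: assuming model() returns a pmdarima-style estimator (the return_conf_int keyword in _forecast points that way), the interval is fixed by that call's default alpha=0.05, i.e. a 95% band. A small standalone illustration of the alpha behaviour, under that assumption:

# Standalone illustration (assumption: the wrapped model is a pmdarima estimator).
import numpy as np
import pmdarima as pm

y = np.random.RandomState(0).standard_normal(100).cumsum()
model = pm.ARIMA(order=(1, 1, 0)).fit(y)

# alpha controls the interval width: alpha=0.10 -> 90% interval, alpha=0.30 -> 70%.
preds, conf90 = model.predict(10, return_conf_int=True, alpha=0.10)
preds, conf70 = model.predict(10, return_conf_int=True, alpha=0.30)
print(conf90[0], conf70[0])  # the 70% band should be narrower than the 90% band

If that assumption holds, a fix would look like threading a hypothetical conf_level argument through predict_timeseries down to _forecast and passing alpha=1 - conf_level to that predict call; today the 95% level appears to be baked in.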
Hi, I would like to prevent the Databricks AutoML preprocessing step from dropping my features, even if they may not contain relevant information.

summary = automl.classify(train_df, target_col="label", timeout_minutes=5)
help(summary)
model_uri = summary.best_trial.model_path
model = mlflow.sklearn.load_model(model_uri)  # sklearn

This is how I start the AutoML training. How can I integrate that into the code above? I have not found any documentation on it.
I'm getting an import error: "cannot import name 'FMIN_CANCELLED_REASON_EARLY_STOPPING' from 'hyperopt.spark'". This occurred through both the Databricks UI and the AutoML API, using runtimes 10.3 ML and 10.4 ML.
Any ideas on how to get past this? It seems to happen on import: from databricks.automl.supervised_learner import SupervisedLearner.
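One hedged guess at the cause (not confirmed): FMIN_CANCELLED_REASON_EARLY_STOPPING appears to exist only in the Databricks-patched hyperopt that ships with the ML runtimes, so a stock hyperopt installed from PyPI as a cluster or notebook library could shadow it. A quick check of which hyperopt the cluster is actually importing:

# Diagnostic sketch: confirm which hyperopt build is on the path and whether it
# carries the constant the AutoML import expects.
import hyperopt
import hyperopt.spark

print(hyperopt.__version__)
print(hyperopt.spark.__file__)  # a /databricks/... path vs. a pip-installed site-packages path
print(hasattr(hyperopt.spark, "FMIN_CANCELLED_REASON_EARLY_STOPPING"))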
Two model additions that would be great to have in AutoML:
CatBoost - in addition to the existing LightGBM and XGBoost, it could provide some benefits in handling categorical features directly.
Keras - even a constrained handful of architectures would add deep learning to AutoML, at least to identify when it performs well.
There is a bug in the use of the ProphetHyperParams enum in the AutoML implementation for forecasting with Prophet.
The metrics report the potentially very good best result from Hyperopt, but the stored model itself, the evaluation plots, and the forecast results are completely unrelated to it and probably much worse.
This situation occurs every time someone uses Databricks AutoML for forecasting.
Potential reason / bug:
automl/runtime/databricks/automl_runtime/forecast/prophet/forecast.py
Lines 150 to 153 in 242bf1a
Databricks Runtime 10.2 ML
databricks-automl-runtime==0.2.4
Hello,
While running an experiment with AutoML on Databricks Runtime 11.3 ML, I get the error:
Unable to generate notebook at [workspace location] using format JUPYTER: {"error_code": "MAX_NOTEBOOK_SIZE_EXCEEDED", "message": "File size imported is (34974148 bytes), exceeded max size (10485760 bytes)"}
The exact same code runs smoothly in other Databricks environments, even for datasets with more variables and more training instances. In one particular environment, however, this error always comes up.
The learning task is a regression. I have tried reducing the number of training instances from 20M (which I know are automatically sampled during the initial AutoML steps) down to 2K, but it still generates a Jupyter notebook of 12 MB (apparently bigger than the allowed maximum).
My first guess was that the pandas-profiling step causes the error while rendering the output of a "big" dataset, but I did manage to manually run the exact same pandas-profiling notebook using the same training DataFrame passed to the AutoML task.
Any help is appreciated, because I'm not sure what else to do; the error comes in a phase of the process that I haven't accessed or modified.
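For what it's worth, a sketch of how the profiling-size hypothesis could be measured directly, assuming the DataExploration notebook is pandas-profiling based on this runtime; profile and train_pdf are hypothetical names, with train_pdf standing in for the same pandas DataFrame handed to AutoML:

# Hypothetical check: render a profiling report for the training data and see how
# large the HTML output is, since output of roughly that kind appears to be what
# inflates the generated DataExploration notebook.
from pandas_profiling import ProfileReport

profile = ProfileReport(train_pdf, title="manual data exploration", minimal=True)
html = profile.to_html()
print(f"rendered report: {len(html.encode('utf-8')) / 1e6:.1f} MB")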
Hello,
When running an AutoML experiment on a cluster with Databricks Runtime 9.1 LTS ML, I get the following error during the setup stage:
Numba needs NumPy 1.20 or less
These are the other libraries installed on the cluster: sqlalchemy, gensim, nltk, xgboost, awswrangler.
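A hedged first step, assuming one of those extra libraries pulled in a NumPy newer than the Numba bundled with 9.1 LTS ML supports, would be to check the versions that actually ended up on the cluster and, if needed, pin NumPy back:

# Diagnostic sketch: inspect the NumPy/Numba combination on the cluster after the
# extra libraries were installed.
import numpy
import numba
print("numpy", numpy.__version__, "| numba", numba.__version__)

# If NumPy is newer than what Numba accepts, pinning it in a notebook cell is a
# possible workaround (the exact pin depends on the runtime's Numba version):
# %pip install "numpy<1.21"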
I noticed that there are no githooks available to facilitate contributions to the project. I think it would be helpful to have githooks that run tests and check code formatting whenever someone pushes changes to the repository.
I also noticed that linting is checked via GitHub Actions, but it currently surfaces a number of existing violations. I'd be happy to reformat the code with Black and set up checks on both sides to help ensure consistency and avoid future issues.
What do you think? Would you be open to adding githooks to the project? If so, I'd be happy to work on it and submit a pull request.
Also, I wanted to ask if there is a Slack channel where I could discuss this feature request with the community.
Thank you
Some components here (like MultiSeriesArimaModel or MultiSeriesProphetModel) end up as customer assets - it would be very nice to have some documentation for these published somewhere.