microsoft / flaml

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.

Home Page: https://microsoft.github.io/FLAML/

License: MIT License

Python 21.69% Jupyter Notebook 77.56% Dockerfile 0.03% JavaScript 0.12% CSS 0.06% MDX 0.54%
automl hyperparam automated-machine-learning machine-learning data-science python jupyter-notebook hyperparameter-optimization random-forest scikit-learn

flaml's Introduction


A Fast Library for Automated Machine Learning & Tuning


🔥 Heads-up: We have migrated AutoGen into a dedicated GitHub repository. Alongside this move, we have also launched a dedicated Discord server and a website for comprehensive documentation.

🔥 The automated multi-agent chat framework in AutoGen is in preview from v2.0.0.

🔥 FLAML is highlighted in OpenAI's cookbook.

🔥 autogen is released with support for ChatGPT and GPT-4, based on Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference.

🔥 FLAML supports Code-First AutoML & Tuning – Private Preview in Microsoft Fabric Data Science.

What is FLAML

FLAML is a lightweight Python library for efficient automation of machine learning and AI operations. It automates workflows based on large language models, machine learning models, etc., and optimizes their performance.

  • FLAML enables building next-gen GPT-X applications based on multi-agent conversations with minimal effort. It simplifies the orchestration, automation, and optimization of complex GPT-X workflows, maximizing the performance of GPT-X models and compensating for their weaknesses.
  • For common machine learning tasks like classification and regression, it quickly finds quality models for user-provided data with low computational resources. It is easy to customize or extend, with a smooth range from minimal configuration to full control.
  • It supports fast and economical automatic tuning (e.g., inference hyperparameters for foundation models, configurations in MLOps/LMOps workflows, pipelines, mathematical/statistical models, algorithms, computing experiments, software configurations), and can handle large search spaces with heterogeneous evaluation cost and complex constraints/guidance/early stopping.

FLAML is powered by a series of research studies from Microsoft Research and collaborators such as Penn State University, Stevens Institute of Technology, University of Washington, and University of Waterloo.

FLAML has a .NET implementation in ML.NET, an open-source, cross-platform machine learning framework for .NET.

Installation

FLAML requires Python version >= 3.8. It can be installed from pip:

pip install flaml

Minimal dependencies are installed without extra options. You can install optional extras based on the features you need. For example, use the following to install the dependencies needed by the autogen package:

pip install "flaml[autogen]"

Find more options in Installation. Each of the notebook examples may require a specific option to be installed.

Quickstart

  • (New) The autogen package enables next-gen GPT-X applications with a generic multi-agent conversation framework. It offers customizable and conversable agents that integrate LLMs, tools, and humans. By automating chat among multiple capable agents, one can easily have them collectively perform tasks autonomously or with human feedback, including tasks that require using tools via code. For example:
from flaml import autogen

assistant = autogen.AssistantAgent("assistant")
user_proxy = autogen.UserProxyAgent("user_proxy")
user_proxy.initiate_chat(
    assistant,
    message="Show me the YTD gain of 10 largest technology companies as of today.",
)
# This initiates an automated chat between the two agents to solve the task

Autogen also helps maximize the utility of expensive LLMs such as ChatGPT and GPT-4. It offers a drop-in replacement of openai.Completion or openai.ChatCompletion with powerful functionalities like tuning, caching, templating, and filtering. For example, you can optimize generations by LLM with your own tuning data, success metrics, and budgets.

# perform tuning
config, analysis = autogen.Completion.tune(
    data=tune_data,
    metric="success",
    mode="max",
    eval_func=eval_func,
    inference_budget=0.05,
    optimization_budget=3,
    num_samples=-1,
)
# perform inference for a test instance
response = autogen.Completion.create(context=test_instance, **config)
  • With three lines of code, you can start using this economical and fast AutoML engine as a scikit-learn style estimator:

from flaml import AutoML

automl = AutoML()
automl.fit(X_train, y_train, task="classification")
  • You can restrict the learners and use FLAML as a fast hyperparameter tuning tool for XGBoost, LightGBM, Random Forest etc. or a customized learner.
automl.fit(X_train, y_train, task="classification", estimator_list=["lgbm"])
  • You can also run generic hyperparameter tuning for a custom function:

from flaml import tune
tune.run(evaluation_function, config={…}, low_cost_partial_config={…}, time_budget_s=3600)
  • Zero-shot AutoML allows using the existing training API from lightgbm, xgboost etc. while getting the benefit of AutoML in choosing high-performance hyperparameter configurations per task.
from flaml.default import LGBMRegressor

# Use LGBMRegressor in the same way as you use lightgbm.LGBMRegressor.
estimator = LGBMRegressor()
# The hyperparameters are automatically set according to the training data.
estimator.fit(X_train, y_train)

Documentation

You can find detailed documentation about FLAML here.

In addition, the documentation site links to research publications, blog posts, and community resources around FLAML.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

If you are new to GitHub, here is a detailed help source on getting involved with development on GitHub.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

flaml's People

Contributors

andreaw-ag, animaholic, anonymous-submission-repo, borda, coffepowered, dependabot[bot], gianpd, int-chaos, jingdong00, jmrichardson, levscaut, liususan091219, luisquintanilla, markharley, michaelmarien, michalchromcak, prajwalborkar, qingyun-wu, royninja, ruizhuanguw, shreyas36, skzhang1, sonichi, thinkall, vijaya-lakshmi-venkatraman, vvijayalakshmi21, wuchihsu, yard1, yiranwu0, zvibaratz


flaml's Issues

Guide for contributors

Some helpful documentation for future contributors, may include:

  1. Some general description about the classes and how they work
  2. How to run unit tests.
  3. etc.

Error when ensemble=True

After upgrading to the newest version of FLAML, I am running into the following error when I set ensemble=True:

Traceback (most recent call last):
  File "search.py", line 229, in <module>
    main()
  File "search.py", line 225, in main
    data_sheet = run_data_sheet(data_sheet, target_col, id_col, data_dir, out_dir, eval_metric)
  File "search.py", line 180, in run_data_sheet
    pipe.fit(X_train, y_train, **automl_settings)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 962, in fit
    self._search()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 1232, in _search
    **self._state.fit_kwargs)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
    return super().fit(X, self._le.transform(y), sample_weight)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 149, in fit
    for est in all_estimators if est != 'drop'
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
TypeError: __init__() got an unexpected keyword argument '_estimator_type'

My call to FLAML:

        automl_settings = {
            "time_budget": search_time,
            "task": 'classification',
            "log_file_name": "{}/flaml-{}.log".format(out_dir, runname),
            "n_jobs": 10,
            "estimator_list": ['lgbm', 'xgboost', 'rf', 'extra_tree', 'catboost'],
            "model_history": True,
            "eval_method": "cv",
            "n_splits": 3,
            "metric": eval_metric,
            "log_training_metric": True,
            "verbose": 1,
            "ensemble": True,
        }

        pipe = AutoML()
        pipe.fit(X_train, y_train, **automl_settings)

This issue goes away if I change ensemble to False.

Here are my environment details:

$ pip list
Package            Version
------------------ --------
catboost           0.26
ConfigSpace        0.4.19
cycler             0.10.0
Cython             0.29.23
FLAML              0.5.6
graphviz           0.16
importlib-metadata 4.6.1
joblib             1.0.1
jsonpickle         2.0.0
kiwisolver         1.3.1
lightgbm           3.2.1
matplotlib         3.3.4
numpy              1.19.5
pandas             1.1.5
Pillow             8.3.1
pip                21.1.3
plotly             5.1.0
pyparsing          2.4.7
python-dateutil    2.8.1
pytz               2021.1
scikit-learn       0.24.2
scipy              1.5.4
setuptools         40.6.2
six                1.16.0
tenacity           8.0.0
threadpoolctl      2.1.0
typing-extensions  3.10.0.0
wheel              0.36.2
xgboost            1.4.2
zipp               3.5.0
$ python --version
Python 3.6.8 :: Anaconda custom (64-bit)

Bug in Cross-Validation estimation

Dear all,

I have been trying FLAML for a few days now and I believe I stumbled across a bug in the evaluation of the model when using cross-validation (eval_method="cv").

I believe that only the last fold is taken into account in the function evaluate_model_CV (ml.py). The list of validation scores (val_loss_list) is only updated with the current fold's validation score for the last fold, or when the budget is no longer sufficient. In either case, val_loss_list contains only one item. Moreover, what is appended to the list is not the validation score of the current fold, but the mean of the validation scores of the first "valid_fold_num" folds.

I would suggest the following to replace lines 220--226 in ml.py:

        val_loss_list.append(val_loss_i)
        if valid_fold_num == n:
            total_val_loss = valid_fold_num = 0
        elif time.time() - start_time >= budget:
            break
    val_loss = np.max(val_loss_list)

One might also consider changing (or making an option of) the last line in the above snippet. Here the maximum of the per-fold validation scores is taken; another commonly used approach is to take their average. This could be an option for the user, but it is not a bug per se, and I am also fine keeping the max of all validation scores as it is now. (Note that the current behavior effectively uses the mean over folds, since it takes total_val_loss divided by the number of folds.)

Best

David

Crash with ValueError when ensemble=True

When I set ensemble=True, and my data has categorical features, I get the following error at the end of the FLAML run:

[flaml.automl: 07-08 09:40:44] {1141} INFO -  at 9373.5s,       best extra_tree's error=0.2056, best rf's error=0.1950
[flaml.automl: 07-08 09:40:44] {993} INFO - iteration 52, current learner rf
[flaml.automl: 07-08 09:41:42] {1141} INFO -  at 9431.7s,       best rf's error=0.1950, best rf's error=0.1950
[flaml.automl: 07-08 09:41:42] {993} INFO - iteration 53, current learner rf
[flaml.automl: 07-08 09:42:11] {1141} INFO -  at 9460.7s,       best rf's error=0.1950, best rf's error=0.1950
[flaml.automl: 07-08 09:42:11] {993} INFO - iteration 54, current learner rf
[flaml.automl: 07-08 09:50:15] {1141} INFO -  at 9944.4s,       best rf's error=0.1949, best rf's error=0.1949
[flaml.automl: 07-08 09:50:15] {1187} INFO - selected model: RandomForestClassifier(criterion='entropy', max_features=0.7294599478674504, n_estimators=347, n_jobs=10)
[flaml.automl: 07-08 09:50:15] {1197} INFO - [('rf', <flaml.model.RandomForestEstimator object at 0x7fca69effaf0>), ('extra_tree', <flaml.model.ExtraTreeEstimator object at 0x7fca8cc1f8e0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7fc799985190>), ('catboost', <flaml.model.CatBoostEstimator object at 0x7fca8cc884f0>), ('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7fca8cd0e610>)]
/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
Traceback (most recent call last):
  File "search.py", line 212, in <module>
    dump_json(data_sheet_file, data_sheet)
  File "search.py", line 208, in main
    with open(data_sheet_file) as f:
  File "search.py", line 163, in run_data_sheet
    run['flaml_settings'] = jsonpickle.encode(automl_settings, unpicklable=False, keys=True)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/automl.py", line 943, in fit
    self._search()
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/automl.py", line 1212, in _search
    stacker.fit(self._X_train_all, self._y_train_all,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
    return super().fit(X, self._le.transform(y), sample_weight)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py", line 196, in fit
    _fit_single_estimator(self.final_estimator_, X_meta, y,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_base.py", line 39, in _fit_single_estimator
    estimator.fit(X, y)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/model.py", line 296, in fit
    self._fit(X_train, y_train, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/flaml/model.py", line 78, in _fit
    model.fit(X_train, y_train, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 304, in fit
    X, y = self._validate_data(X, y, multi_output=True,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/base.py", line 433, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 871, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 673, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/global/home/hpc3552/.conda/envs/myenv/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '__OTHER__'

This error does not occur if ensemble=False or if I remove (or encode) the categorical features from my dataset.

My guess is that FLAML properly encodes categorical features when training the base estimators (LGBM, RF, etc), but not when training the stacking classifier.
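A hedged workaround sketch until this is fixed: encode the categorical columns yourself before calling fit, so the stacking estimator only ever sees numeric input (column detection below is illustrative, and assumes X_train/X_test are pandas DataFrames):

from sklearn.preprocessing import OrdinalEncoder

# illustrative: map categorical columns to integers before AutoML,
# so the final stacking estimator never receives raw strings
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train[cat_cols] = encoder.fit_transform(X_train[cat_cols])
X_test[cat_cols] = encoder.transform(X_test[cat_cols])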

How to enforce monotonicity in FLAML?

Hi Chi:

Thank you for the cool work! Could I enforce monotonicity in the main automl.fit() function? If so, what algorithms can be chosen in the estimator list?

Best,

Questions on the output in the log

Hi Dr. Wang:

Got a few questions from my team on the content in the log of FLAML.

This is part of the log from one of our tests on FLAML (all the numbers on loss are redacted for compliance reasons):

{"record_id": 0, "iter_per_learner": 1, "logged_metric": false, "trial_time": 1756.8860552310944, "total_search_time": 2590.4430527687073, "validation_loss": XXX, "config": {"max_depth": 6, "n_estimators": 100, "min_child_weight": 10, "subsample": 0.67, "colsample_bylevel": 0.9, "gamma": 0, "learning_rate": 0.07435893300587489}, "best_validation_loss": XXX, "best_config": {"max_depth": 6, "n_estimators": 100, "min_child_weight": 10, "subsample": 0.67, "colsample_bylevel": 0.9, "gamma": 0, "learning_rate": 0.07435893300587489}, "learner": "MonotonicXgboostGBTree", "sample_size": 784536}

{"record_id": 1, "iter_per_learner": 5, "logged_metric": false, "trial_time": 1537.3765320777893, "total_search_time": 13424.922722578049, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 110, "min_child_weight": 1, "subsample": 0.5954399576961257, "colsample_bylevel": 1.0, "gamma": 1e-14, "learning_rate": 0.10828032871243709}, "best_validation_loss": XXX "best_config": {"max_depth": 4, "n_estimators": 110, "min_child_weight": 1, "subsample": 0.5954399576961257, "colsample_bylevel": 1.0, "gamma": 1e-14, "learning_rate": 0.10828032871243709}, "learner": "MonotonicXgboostGBTree", "sample_size": 784536}

{"record_id": 2, "iter_per_learner": 13, "logged_metric": false, "trial_time": 340.0606036186218, "total_search_time": 34851.914006233215, "validation_loss": XXX, "config": {"max_depth": 5, "num_leaves": 23, "n_estimators": 157, "min_child_weight": 1, "subsample": 0.5112583180636173, "colsample_bylevel": 0.9863382485941592, "min_split_gain": 1e-14, "learning_rate": 0.05875161500234584}, "best_validation_loss": XXX, "best_config": {"max_depth": 5, "num_leaves": 23, "n_estimators": 157, "min_child_weight": 1, "subsample": 0.5112583180636173, "colsample_bylevel": 0.9863382485941592, "min_split_gain": 1e-14, "learning_rate": 0.05875161500234584}, "learner": "MonotonicLightGBMGBDT", "sample_size": 784536}

{"record_id": 3, "iter_per_learner": 18, "logged_metric": false, "trial_time": 270.2408003807068, "total_search_time": 41368.91024374962, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 338, "min_data_in_leaf": 56, "subsample": 0.6614322871324126, "colsample_bylevel": 0.9458919560564311, "learning_rate": 0.23062756268773424}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 338, "min_data_in_leaf": 56, "subsample": 0.6614322871324126, "colsample_bylevel": 0.9458919560564311, "learning_rate": 0.23062756268773424}, "learner": "MonotonicCatboost", "sample_size": 784536}

{"record_id": 4, "iter_per_learner": 22, "logged_metric": false, "trial_time": 366.1694631576538, "total_search_time": 43080.46155285835, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 448, "min_data_in_leaf": 46, "subsample": 0.6950654501710251, "colsample_bylevel": 0.956150967914549, "learning_rate": 0.4527543463119874}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 448, "min_data_in_leaf": 46, "subsample": 0.6950654501710251, "colsample_bylevel": 0.956150967914549, "learning_rate": 0.4527543463119874}, "learner": "MonotonicCatboost", "sample_size": 784536}

{"record_id": 5, "iter_per_learner": 23, "logged_metric": false, "trial_time": 343.4558777809143, "total_search_time": 45475.49441862106, "validation_loss": XXX, "config": {"max_depth": 4, "n_estimators": 405, "min_data_in_leaf": 58, "subsample": 0.734450014003538, "colsample_bylevel": 0.9644762947991873, "learning_rate": 0.3151376812002405}, "best_validation_loss": XXX, "best_config": {"max_depth": 4, "n_estimators": 405, "min_data_in_leaf": 58, "subsample": 0.734450014003538, "colsample_bylevel": 0.9644762947991873, "learning_rate": 0.3151376812002405}, "learner": "MonotonicCatboost", "sample_size": 784536}

I am wondering:

  1. What does 'iter_per_learner' mean? My understanding is that the output in the log was generated in batch. For example, for record_id 2, does it include 13 or 8 (13-5 from record_id 1) MonotonicLightGBMGBDT models with different sets of hyperparameters?

  2. What does 'trial_time' mean? How is it different from 'total_search_time'?

  3. What is the difference between 'config' and 'best_config' in each record? They all look the same.

  4. If the process reaches the time budget in the middle of an iteration, will it stop immediately or finish the current iteration first before stopping?

Appreciate your help! As you can see from the log, our dataset is quite large (780,000+ records and thousands of predictors). Although the fitting is far from over, the current optimal result is already as good as what we got using BayesOpt.

Best,

Output logs in JSON format

Currently logs are produced in Tab-separated row format. Consider JSON log format because:

  1. Easier integration with log analysis tools such as Elasticsearch/Logstash.
  2. JSON is typed, which is better for parsing in Python.
  3. Easier to combine log output across future versions despite log schema changes -- a JSON schema is not position-sensitive like the current format.
  4. Model configuration is already dumped in JSON.

import package error

Hi,

I used pip install, and

from flaml import AutoML

gave me the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\windwine8700\anaconda3\lib\site-packages\settings.json'

I am using Python 3.8.3. Is there a way to solve this issue? Thanks.

Best,

Jiaqi

let users specify the final_estimator and passthrough for the ensemble

Is it possible to let users specify the final_estimator and passthrough for the ensemble, please? In practice, sometimes the only meta-learner the business can accept is a GLM. Single boosting models are OK, but a boosting model of boosting models is just too complicated for the legal team and regulators. Regarding passthrough, there is no guarantee that one way will be better than the other, so perhaps it is better to let the users decide.

Appreciate your help!

Originally posted by @flippercy in #47 (comment)
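For reference, at the sklearn level the request amounts to exposing these two StackingClassifier arguments; a sketch (the base estimators below are illustrative stand-ins for FLAML's tuned learners):

from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# illustrative base learners standing in for FLAML's tuned estimators
base_estimators = [
    ("rf", RandomForestClassifier()),
    ("gbm", GradientBoostingClassifier()),
]
stacker = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(),  # a GLM meta-learner, as requested
    passthrough=False,  # the other knob the issue asks FLAML to expose
)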

Errors when ensemble = True

Hi:

There is an error message when fitting models using customized monotonic learners with ensemble = True:

RuntimeError: Cannot clone object <__main__.MyMonotonicLightGBMGBDTClassifier object at 0x7f9ef2999310>, as the constructor either does not set or modifies parameter monotone_constraints

I assume it is due to the monotone_constraints added to self.params. Any suggestion on how to fix it?

Usually we won't implement an ensemble of boosting models but would be great if we can figure out a solution!

Thank you.

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your FLAML repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/FLAML/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @ekzhu, @markusweimer, @qingyun-wu, @sonichi

datetime64[ns] dtype in dataframe

When DataTransformer's fit_transform method is called, if some columns have a datetime format, an error is raised by sklearn's utils/validation.py.

I fixed it by converting any datetime columns to their datetime.toordinal values.
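A minimal sketch of that workaround, assuming X is a pandas DataFrame (the name X and the loop are illustrative):

import pandas as pd

# illustrative: replace datetime64[ns] columns with their ordinal day numbers
for col in X.select_dtypes(include=["datetime64[ns]"]).columns:
    X[col] = X[col].map(pd.Timestamp.toordinal)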

ONNX/ONNXML export

Thanks for this wonderful, promising AutoML stack.
May I suggest adding an "export to ONNX/ONNXML" method?
How would I export the best model pipeline to ONNXML now? Using sklearn-onnx?
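In the meantime, a sketch of what exporting the best model might look like with sklearn-onnx, assuming the winning estimator is sklearn-compatible and that automl.model.estimator exposes it (both are assumptions to verify):

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# assumption: automl.model.estimator is the underlying sklearn-compatible model
best_model = automl.model.estimator
onnx_model = convert_sklearn(
    best_model,
    initial_types=[("input", FloatTensorType([None, X_train.shape[1]]))],
)
with open("flaml_best_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())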

How to pull the number of iterations completed for each learner?

Hi:

Is there a way to pull the number of iterations completed by automl() for each learner, please? I know it can be found in the log if I set log_type to 'all' but can I pull it directly?

Assuming all the default learners are used, it would be great if we could get the information for a table like the one below:

Learner     Iterations completed
XGBoost     100
LightGBM    200
CatBoost    150
RF          50

Thank you!
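Until there is a direct accessor, a hedged sketch that pulls this from the log file itself, assuming a JSON-lines log as shown elsewhere on this page (the format varies across FLAML versions, and since records capture improvements, the counts are a lower bound):

import json
from collections import defaultdict

iters = defaultdict(int)
with open("flaml.log") as f:  # illustrative log file name
    for line in f:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip any non-JSON lines
        if isinstance(record, dict) and "learner" in record:
            # iter_per_learner is cumulative, so keep the max seen per learner
            iters[record["learner"]] = max(iters[record["learner"]],
                                           record.get("iter_per_learner", 0))

for learner, n in iters.items():
    print(learner, n)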

More details on low_cost_partial_config?

fit() is stopping early with the following message:

[flaml.automl: 06-17 12:55:08] {1013} INFO - iteration 41, current learner lrl1
No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'.
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_sag.py:329: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)

I cannot find any details in the documentation as to what exactly low_cost_partial_config is, or what I should set it to. Any pointers or guidance would be appreciated.
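For later readers, a sketch of the idea with a toy objective (the objective and numbers are illustrative): low_cost_partial_config supplies cheap initial values for the hyperparameters that dominate evaluation cost, so the search starts cheap and only grows them when it pays off.

from flaml import tune

def evaluate(config):
    # toy objective: loss shrinks with more trees; real cost grows with them
    loss = 1.0 / config["n_estimators"] + 0.1 * config["learning_rate"]
    return {"loss": loss}

analysis = tune.run(
    evaluate,
    config={
        "n_estimators": tune.lograndint(lower=4, upper=1000),
        "learning_rate": tune.loguniform(lower=1e-3, upper=1.0),
    },
    # cheap starting value for the cost-related hyperparameter
    low_cost_partial_config={"n_estimators": 4},
    metric="loss",
    mode="min",
    num_samples=50,
)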

Verbose argument in model.fit()

Hi,

While training the learner, console output is generated, which can take up huge space in the notebook if the time_budget is large. If I wish to suppress the console output while training my learner, how do I do that? In keras, sklearn, etc., setting verbose = 0 suppresses the console output.

Thanks!
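A one-line sketch, on the assumption that your FLAML version supports the same convention in fit (worth verifying):

from flaml import AutoML

automl = AutoML()
# assumption: verbose=0 suppresses the per-iteration INFO logging
automl.fit(X_train, y_train, task="classification", time_budget=3600, verbose=0)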

Possible error/inconsistency with the paper in FLOW2 step size reduction

In code, step size is reduced with the following:

        if self._num_proposedby_incumbent == self.dir and (
            not self._resource or self._resource == self.max_resource):
                # check stuck condition if using max resource
                if self.step >= self.step_lower_bound:
                    # decrease step size
                    self._oldK = self._K if self._K else self._iter_best_config
                    self._K = self.trial_count_proposed + 1
                    self.step *= np.sqrt(self._oldK / self._K)
                self._num_proposedby_incumbent -= 2

However, the algorithm description in the FLOW2 paper shows that:
[screenshot of the step-size update rule from the FLOW2 paper]

From this, we can see that k' (_oldK in code) is only changed whenever a new best score is obtained. However, in the current implementation, k' always becomes the previous k instead. This seems counter-intuitive to me, as the step size multiplier will reduce much slower than in the paper implementation, thus making FLOW2 spend more time evaluating a configuration that has most likely already converged.

I believe that the implementation consistent with the paper would be:

        if self._num_proposedby_incumbent == self.dir and (
            not self._resource or self._resource == self.max_resource):
                # check stuck condition if using max resource
                if self.step >= self.step_lower_bound:
                    # decrease step size
                    self._oldK = self._iter_best_config  # change here
                    self._K = self.trial_count_proposed + 1
                    self.step *= np.sqrt(self._oldK / self._K)
                self._num_proposedby_incumbent -= 2

I have run some trials with this change and it seems to be working as intended, at least for my purposes: converged combinations are eliminated more aggressively.

Am I understanding all of this correctly? Is this an oversight in the code, or has this been changed after the paper was published?

n_estimators of best model is really large (32768)

After running FLAML for several hours, I noticed in the log that the best model was xgboost with n_estimators set to 32768:

{"record_id": 24, "iter_per_learner": 55, "logged_metric": false, "trial_time": 1509.8762745857239, "total_search_time": 41334.96948957443, "validation_loss": 0.25501297475819773, "config": {"n_estimators": 32768.0, "max_leaves": 186.0, "min_child_weight": 0.22536063808245474, "learning_rate": 0.05398963108662436,       "subsample": 0.9173715591862044, "colsample_bylevel": 0.9005345477364418, "colsample_bytree": 0.6104797018735161, "reg_alpha": 0.0009765625, "reg_lambda": 1.    92166667176985, "FLAML_sample_size": 55408}, "best_validation_loss": 0.25501297475819773, "best_config": {"n_estimators": 32768.0, "max_leaves": 186.0,          "min_child_weight": 0.22536063808245474, "learning_rate": 0.05398963108662436, "subsample": 0.9173715591862044, "colsample_bylevel": 0.9005345477364418,         "colsample_bytree": 0.6104797018735161, "reg_alpha": 0.0009765625, "reg_lambda": 1.92166667176985, "FLAML_sample_size": 55408}, "learner": "xgboost",            "sample_size": 55408}
{"curr_best_record_id": 24}

But that seems excessively large to me and will surely result in an overfit model. (Indeed, the model achieves 99.9% F1 score on the training data set, but only about 75% on a held-out test data set.)

I see in the code that 32768 is set as the upper bound for n_estimators:

FLAML/flaml/model.py, lines 307 to 313 at commit 0604570:

upper = min(32768, int(data_size))
return {
    'n_estimators': {
        'domain': tune.qloguniform(lower=4, upper=upper, q=1),
        'init_value': 4,
        'low_cost_init_value': 4,
    },

I'm just wondering if this upper bound is intentionally set so high, or if this is an oversight.
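If the bound is intentional, one workaround sketch is to subclass the built-in learner and cap the search space yourself; the cap of 1000 below is illustrative, and the search_space signature mirrors the custom-learner examples elsewhere on this page (verify against your FLAML version):

from flaml import AutoML, tune
from flaml.model import XGBoostSklearnEstimator

class CappedXGBoost(XGBoostSklearnEstimator):
    @classmethod
    def search_space(cls, data_size, task):
        space = super().search_space(data_size=data_size, task=task)
        # illustrative cap, far below the default upper bound of 32768
        space["n_estimators"]["domain"] = tune.qloguniform(lower=4, upper=1000, q=1)
        return space

automl = AutoML()
automl.add_learner("xgb_capped", CappedXGBoost)
automl.fit(X_train, y_train, task="classification", estimator_list=["xgb_capped"])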

fit stops after Iteration 0 when metric is r2

I am using FLAML in Django views:

X_train, X_test, y_train, y_test = train_test_split(df.copy(), train_size=selectedTrainingPercentage)

automl = AutoML()

settings = {
    "time_budget": 60,      # total running time in seconds
    "metric": 'r2',         # primary metrics for regression can be chosen from: ['mae', 'mse', 'r2']
    "task": 'regression',   # task type
}
print('fitting')

automl.fit(X_train=X_train, y_train=y_train, **settings)
print('fit complete')

And the fitting stops at iteration 0:

[screenshot: console output showing the run stopping after iteration 0]

However, it works completely fine if I change the metric to mae or mse rather than r2.

AttributeError message during fit

Hi everyone!!

I've received the attribute error message below when using FLAML with XGBoost (this error occurs with other algorithms too):

[flaml.automl: 07-01 10:45:34] {908} INFO - Evaluation method: cv
[flaml.automl: 07-01 10:45:34] {607} INFO - Using StratifiedKFold
[flaml.automl: 07-01 10:45:34] {929} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl: 07-01 10:45:34] {949} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl: 07-01 10:45:34] {1013} INFO - iteration 0, current learner xgboost
Traceback (most recent call last):
  File "ft2.py", line 33, in <module>
    automl.fit(X_train=X, y_train=y, **settings)
  File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/automl.py", line 962, in fit
    self._search()
  File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/automl.py", line 1081, in _search
    use_ray=False)
  File "/scratch/luizhemelo/anaconda3/lib/python3.7/site-packages/flaml/tune/tune.py", line 270, in run
    search_alg.set_search_properties(metric, mode, config={
AttributeError: 'ConcurrencyLimiter' object has no attribute 'set_search_properties'

Parameters used:

settings = {
    "time_budget": 108000,
    "metric": 'roc_auc',
    "task": 'classification',
    "n_jobs": -1,
    "estimator_list": ['xgboost'],
    "n_splits": 5,
    "log_file_name": 'ft.log',
}

Specifications:
Python 3.7.10
FLAML 0.5.4 (installed via PiP)
XGBoost 1.4.0 (installed via conda)

Any ideas?

Thanks! :D

Question: how does FLAML handle categorical features?

Hi,

I am trying to learn how FLAML handles categorical features - i.e., which encoding methods (e.g., OneHotEncoding, OrdinalEncoding) are used.

I looked through the following code:

class DataTransformer:

but I can't see where the categorical features are actually encoded.

Also, I was wondering if different estimators will use different encodings? E.g., OrdinalEncoder for lgbm and OHE for RandomForest?

sklearn f1_score method has 'binary' as average default

sklearn's f1_score uses 'binary' as the default for its average parameter. This means that for a multiclass problem, the average parameter must be changed to one of ['micro', 'macro', 'weighted', 'samples']. In the ml module, sklearn_metric_loss_score is called without specifying that parameter. Consequently, at the moment, if a multiclass problem and the f1 metric are chosen, a problem arises.

One solution could be to set average='samples' when the task is multiclass:softmax.
However, choosing among the options above depends on the nature of the labels (balanced/unbalanced). It may be interesting to automate the choice of averaging by looking at the nature of the labels.

Any idea?
[screenshot: err_flaml — the resulting error]

settings = {
    "time_budget": TIME_BUDGET,
    "metric": 'f1',
    "estimator_list": ['lgbm'],
    "task": 'classification',
    "log_file_name": 'flaml_lgb.log',
}
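As a possible workaround, fit can also take a custom metric callable instead of a metric name; below is a sketch computing macro F1. The callable signature follows FLAML's documented custom-metric convention as I understand it, but treat the exact argument list as an assumption to verify against your version:

from sklearn.metrics import f1_score
from flaml import AutoML

def macro_f1(X_val, y_val, estimator, labels, X_train, y_train,
             weight_val=None, weight_train=None, *args):
    # FLAML minimizes the first return value, so return 1 - macro F1;
    # the dict holds extra metrics to log
    y_pred = estimator.predict(X_val)
    score = f1_score(y_val, y_pred, average="macro")
    return 1 - score, {"macro_f1": score}

automl = AutoML()
automl.fit(X_train, y_train, task="classification", metric=macro_f1, time_budget=60)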

python 3.9 support?

[screenshot of the supported Python versions]

It seems only versions up to 3.8 are supported. Is 3.9 supported, or will it be?

What about 3.10, which is releasing soon?

struct error during ensemble

When FLAML is building the ensemble at the end of a run, I am seeing the following error message:

[flaml.automl: 07-10 06:16:01] {1153} INFO -  at 9912.2s,       best xgboost's error=0.1893,    best xgboost's error=0.1893
[flaml.automl: 07-10 06:16:01] {1001} INFO - iteration 183, current learner lgbm
[flaml.automl: 07-10 06:17:00] {1153} INFO -  at 9970.5s,       best lgbm's error=0.1928,       best xgboost's error=0.1893
[flaml.automl: 07-10 06:17:00] {1193} INFO - selected model: XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=0.6553649023281938, colsample_bynode=1,
              colsample_bytree=0.5733906723952086, gamma=0, gpu_id=-1,
              grow_policy='lossguide', importance_type='gain',
              interaction_constraints='', learning_rate=0.03981439313350194,
              max_delta_step=0, max_depth=0, max_leaves=1130,
              min_child_weight=5.542464309441731, missing=nan,
              monotone_constraints='()', n_estimators=123, n_jobs=10,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0.0059793400625186045, reg_lambda=7.330769622156848,
              scale_pos_weight=None, subsample=1.0, tree_method='hist',
              use_label_encoder=False, validate_parameters=1, verbosity=0)
[flaml.automl: 07-10 06:17:00] {1203} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f3e68c55048>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f3ea4da8cc0>), ('rf', <flaml.model.RandomForestEstimator object at 0x7f3ef11bc278>), ('extra_tree', <flaml.model.ExtraTreeEstimator object at 0x7f3ea4fb9240>), ('catboost', <flaml.model.CatBoostEstimator object at 0x7f3ef24f5c50>)]
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 357, in _sendback_result
    exception=exception))
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 247, in put
    self._writer.send_bytes(obj)
  File "/opt/python/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/python/anaconda3/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "search.py", line 230, in <module>
    main()
  File "search.py", line 226, in main
    data_sheet = run_data_sheet(data_sheet, target_col, id_col, data_dir, out_dir, eval_metric)
  File "search.py", line 181, in run_data_sheet
    pipe.fit(X_train, y_train, **automl_settings)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 950, in fit
    self._search()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/flaml/automl.py", line 1222, in _search
    **self._state.fit_kwargs)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 441, in fit
    return super().fit(X, self._le.transform(y), sample_weight)
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/sklearn/ensemble/_stacking.py", line 149, in fit
    for est in all_estimators if est != 'drop'
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/global/home/hpc3552/autotext/flaml_env/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/opt/python/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Interestingly, this error does not occur every time, but only sometimes.

get_output_from_log returns empty objects

Hi:

get_output_from_log returns empty objects for me with no error message:

from flaml.data import get_output_from_log
get_output_from_log(filename = 'test.log', time_budget = 600)

([], [], [], [], [])

However, when I ran your sample codes in the notebook, this function worked well. Moreover, my test log file exists and can be accessed in Jupyter using other methods.

I am using 0.2.5 and my settings are:

settings = {
    "time_budget": 3600,
    'eval_method': 'cv',
    'max_iter': 100,
    'n_splits': 5,
    'log_type': 'all',
    "metric": 'roc_auc',
    "task": 'classification',
    "log_file_name": 'test.log',
    "log_training_metric": False
}

Any ideas?

Appreciate your help! So far the feedback of FLAML from our users is overwhelmingly positive. Great work!

A couple of errors when building the ensemble

Hi:

Our team has explored the ensemble option in the fit function of automl and got a few errors:

  1. There is an error when using both the GLM (LRL1/LRL2) and ML learners for the ensemble. For example:

from flaml.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id = 1169, data_dir = './')

settings = {
    "time_budget": 40,
    "metric": 'roc_auc',
    "task": 'classification',
    "estimator_list": [
        'lrl1',
        'lrl2',
        'lgbm',
        'xgboost',
    ],
    "log_file_name": 'airlines_experiment.log',
}

automl.fit(X_train = X_train, y_train = y_train, ensemble=True, **settings)

[flaml.automl: 03-18 17:34:40] {1157} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f61f8659ed0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f61f8687350>), ('lrl2', <flaml.model.LRL2Classifier object at 0x7f61f8687090>), ('lrl1', <flaml.model.LRL1Classifier object at 0x7f61f8654150>)]

RuntimeError: Cannot clone object <flaml.model.LRL2Classifier object at 0x7f84877a1c10>, as the constructor either does not set or modifies parameter penalty.

This is similar to the error we've discussed before.

  2. The other error is more confusing. We've created a few customized ML learners with monotone constraints and used them for the automl. For example, below is the code for a monotonic XGBoost and a monotonic LightGBM, both using GBDT as the booster:

class MyMonotonicXGBGBTreeClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = XGBClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'booster': params['booster'] if 'booster' in params else 'gbtree',
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel': params['colsample_bylevel'],
            'n_estimators': int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_child_weight': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'subsample': {'domain': tune.uniform(lower=0.7, upper=1), 'init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower=0.6, upper=1), 'init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'gamma': {'domain': tune.loguniform(lower=0.000000000001, upper=0.001), 'init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 0.000000000001},
        }
        return space

class MyMonotonicLightGBMGBDTClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params['n_jobs'] if 'n_jobs' in params else num_cores,
            'boosting_type': params['boosting_type'] if 'boosting_type' in params else 'gbdt',
            'learning_rate': params['learning_rate'],
            'min_split_gain': params['min_split_gain'],
            'max_depth': int(params['max_depth']),
            'min_data_in_leaf': int(params['min_data_in_leaf']),
            'min_sum_hessian_in_leaf': params['min_sum_hessian_in_leaf'],
            'subsample': params['subsample'],
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'subsample_freq': int(params['subsample_freq']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params['random_state'] if 'random_state' in params else randomseed,
            'monotone_constraints': params['monotone_constraints'] if 'monotone_constraints' in params else monotone,
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'subsample_freq': {'domain': tune.uniform(lower=1, upper=10), 'init_value': 5},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_data_in_leaf': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'min_sum_hessian_in_leaf': {'domain': tune.loguniform(lower=0.000001, upper=0.1), 'init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.67},
            'colsample_bytree': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.9},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'min_split_gain': {'domain': tune.loguniform(lower=0.000000000001, upper=0.001), 'init_value': 0.00001},
            'reg_lambda': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=0.000000000001, upper=1), 'init_value': 0.000000000001},
        }
        return space

Without the ensemble, both worked well as individual learners. However, when we set ensemble=True, the monotonic XGBoost learner still worked well, but the process always crashed if the monotonic LightGBM learner was included in the list of estimators. The Jupyter kernel just died without any error message. In the .out file generated at the backend, there is an error message:

[LightGBM] [Fatal] Check failed: static_cast<size_t>(num_total_features_) == io_config.monotone_constraints.size() at /__w/1/s/python-package/compile/src/io/dataset.cpp, line 314

What does it mean? It seems that something is wrong with the monotone_constraints but the size of the constraints matches the number of variables.

This error can be replicated using the airlines data; to make it easier, just set monotone=(0, 0, 0, 0, 0, 0, 0).

Appreciate your help.

Does it work with weighted datasets?

Hi Dr. Wang:

Does this algorithm work with weighted datasets? I haven't seen any parameter like 'sample_weight'. Or shall I create customized learners for weighted datasets as you suggested for monotonicity?

Thank you.
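For what it's worth, the fit signature shown elsewhere on this page includes sample_weight_val and forwards extra keyword arguments to each learner's fit, so passing training weights like this should work (a sketch to verify on your version):

import numpy as np
from flaml import AutoML

automl = AutoML()
weights = np.ones(len(y_train))  # illustrative per-row training weights
automl.fit(X_train, y_train, task="classification", time_budget=60,
           sample_weight=weights)  # forwarded to each learner's fit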

Redirect the catboost_info subfolder created by CatboostEstimator

Feedback from sebhrusen (from the automlbenchmark):
CatBoostEstimator creates and fills a catboost_info subfolder in the running directory. We should be able to pass a 'train_dir' param to CatBoost to avoid that.

For example, at the AutoML level, accept a tmpdir and pass it to each algo supporting an equivalent property (or pass a dedicated subfolder, for example tmpdir/catboost for CatBoost, and so on).

Reference:
openml/automlbenchmark#270
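For context, CatBoost itself already exposes the knob; the request is for FLAML to forward it. A sketch of the underlying parameter (directory path is illustrative):

from catboost import CatBoostClassifier

# train_dir controls where CatBoost writes its catboost_info artifacts;
# the proposal is for FLAML's CatBoostEstimator to accept and forward it
model = CatBoostClassifier(iterations=100, train_dir="/tmp/catboost_info")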

Using pandas validation data gives an error

If I leave out X_val and y_val, automl works fine. But if I specify these values, it crashes with the following error:

----> 7 automl.fit(X_train= xtrain,y_train=ytrain,X_val=xvalid,y_val=yvalid,**automl_settings)

~\anaconda3\lib\site-packages\flaml\automl.py in fit(self, X_train, y_train, dataframe, label, metric, task, n_jobs, log_file_name, estimator_list, time_budget, max_iter, sample, ensemble, eval_method, log_type, model_history, split_ratio, n_splits, log_training_metric, mem_thres, X_val, y_val, sample_weight_val, retrain_full, split_type, learner_selector, hpo_method, **fit_kwargs)
    832         self._state.fit_kwargs = fit_kwargs
    833         self._state.weight_val = sample_weight_val
--> 834         self._validate_data(X_train, y_train, dataframe, label, X_val, y_val)
    835         self._search_states = {}  #key: estimator name; value: SearchState
    836         self._random = np.random.RandomState(RANDOM_SEED)

~\anaconda3\lib\site-packages\flaml\automl.py in _validate_data(self, X_train_all, y_train_all, dataframe, label, X_val, y_val)
    434             "# rows in X_val must match length of y_val.")
    435             if self._transformer:
--> 436                 self._state.X_val = self._transformer.transform(X_val)
    437             else:
    438                 self._state.X_val = X_val

~\anaconda3\lib\site-packages\flaml\data.py in transform(self, X)
    251                 X[cat_columns] = X[cat_columns].astype('category')
    252             if num_columns:
--> 253                 X[num_columns].fillna(np.nan, inplace=True)
    254                 X[num_columns] = self.transformer.transform(X)
    255         return X

~\anaconda3\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4315         downcast=None,
   4316     ) -> Optional["DataFrame"]:
-> 4317         return super().fillna(
   4318             value=value,
   4319             method=method,

~\anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6086         result = self._constructor(new_data)
   6087         if inplace:
-> 6088             return self._update_inplace(result)
   6089         else:
   6090             return result.__finalize__(self, method="fillna")

~\anaconda3\lib\site-packages\pandas\core\generic.py in _update_inplace(self, result, verify_is_copy)
   3962         self._clear_item_cache()
   3963         self._mgr = result._mgr
-> 3964         self._maybe_update_cacher(verify_is_copy=verify_is_copy)
   3965 
   3966     def add_prefix(self: FrameOrSeries, prefix: str) -> FrameOrSeries:

~\anaconda3\lib\site-packages\pandas\core\generic.py in _maybe_update_cacher(self, clear, verify_is_copy)
   3243 
   3244         if verify_is_copy:
-> 3245             self._check_setitem_copy(stacklevel=5, t="referant")
   3246 
   3247         if clear:

~\anaconda3\lib\site-packages\pandas\core\generic.py in _check_setitem_copy(self, stacklevel, t, force)
   3679 
   3680         if value == "raise":
-> 3681             raise com.SettingWithCopyError(t)
   3682         elif value == "warn":
   3683             warnings.warn(t, com.SettingWithCopyWarning, stacklevel=stacklevel)

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
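Until the transform avoids fillna(..., inplace=True) on a slice, a workaround sketch is to hand FLAML defensive copies of the validation frames (using the names from the call above):

# workaround sketch: pass copies so the transformer's in-place fillna
# operates on frames FLAML owns, not on slices of your originals
automl.fit(X_train=xtrain, y_train=ytrain,
           X_val=xvalid.copy(), y_val=yvalid.copy(),
           **automl_settings)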

FLAML: New feature

My team is working on a multiclass classification model for predicting the workload type (POC, Prod, Dev, Test) of Azure services like SQLDW, Synapse, and SQLDB. We replaced Gridsearch/XGBoost with FLAML XGBoost for better performance. Since it is multiclass classification, we implemented more metrics, like a normalized confusion matrix, precision-recall curve, and ROC curve using OneVsRestClassifier for binarizing the labels, for our final model so that we can measure the prediction performance for each individual workload type, in addition to the accuracy, precision, and recall of the overall model. This seems like a common requirement that other FLAML users might have, and it would be valuable to add these features for multiclass classification models.

The link to access the jupyter notebook for multiclass classification is
https://microsoft.sharepoint.com/:u:/t/AzureDataUXBA-DataEngineeringandAnalysis/ETY_DWyvPXBEl2S-R5C6rVUBFa0fvbnE9V7KSzAC3H8uMQ?e=hgxKmb

It has the implementation of the above metrics in the last section of the file (5. Metrics).
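A minimal sketch of the per-class metrics described, computed from a fitted FLAML model with plain sklearn (the automl, X_test, and y_test names follow the quickstart above):

from sklearn.metrics import classification_report, confusion_matrix

y_pred = automl.predict(X_test)
# normalized confusion matrix: each true-label row sums to 1
print(confusion_matrix(y_test, y_pred, normalize="true"))
# per-class precision/recall/F1
print(classification_report(y_test, y_pred))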

Option for groupKFold for regression problems

Hi,

I'm trying to tune lightgbm for a regression problem and need to use GroupKFold for cross-validation.
By default, automl.fit() uses repeated k-fold as the split_type. I looked at the documentation but couldn't find details regarding that, or how to pass the groups argument to it.

Thanks in advance.
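For later readers: newer FLAML releases accept a group split directly; a sketch, treating the exact arguments as an assumption to verify against your version (groups is your own per-row group-label array):

from flaml import AutoML

automl = AutoML()
# assumption: split_type="group" plus a groups array is supported in your version
automl.fit(X_train, y_train, task="regression", estimator_list=["lgbm"],
           split_type="group", groups=groups, time_budget=60)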

The ensemble option in the main fit function does not work with customized learners

I fit a model with both the RGF in the sample codes and a few other default learners:

settings = {
    "time_budget": 120,  # total running time in seconds
    "metric": 'roc_auc',
    "estimator_list": ['lgbm', 'rf', 'RGF'],  # list of ML learners
    "task": 'classification',  # task type
    "sample": True,  # whether to subsample training data
    "log_file_name": 'airlines_experiment_with_ensemble.log',  # cache directory of flaml log files
    "log_training_metric": True,  # whether to log training metric
}
automl.fit(X_train=X_train, y_train=y_train, ensemble=True, **settings)

I received an error message: TypeError: __init__() got an unexpected keyword argument '_estimator_type'

I got similar results when using other customized learners with unique hyperparameters.

Moreover, how can I pull the details of the ensemble? I did not see it in the log file.

Thank you.
