Comments (16)

eddiebergman commented on June 5, 2024

Seems correct to me :)

eddiebergman commented on June 5, 2024

I would advise not touching the config files; honestly, they're quite outdated given the version of sklearn they were run on.

You can read more in the autosklearn paper, but essentially the 25 signifies that this metadata should be used to decide on 25 initial candidates to evaluate, where these are the 25 configurations that give the best "coverage" across the metadatasets, i.e. on average, one of these 25 would have been the best choice for each and every dataset in the metadataset collection. There are some potential issues, notably: does your dataset "look like" one of these metadatasets? If so, then great, you'll have a good estimator within the first 25 evaluations. If not, damn, you'll have to wait 25 evaluations before the BO loop kicks in and starts searching. If there are no initial metalearning configurations, i.e. you set it to 0, the BO algorithm in autosklearn will by default just use 25 random samples instead.

Therefore the choice comes down to: do you think your data is sufficiently unique that the metalearning configurations are all going to perform worse than a random set of configurations?
Sometimes the answer is yes, but without proof of such, it's usually no.
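
To make the knob concrete, here's a minimal sketch of the two extremes (constructor arguments other than the metalearning one are illustrative only):

import autosklearn.classification

# Use the 25 meta-learned warm-start configurations (the default):
automl_warm = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    initial_configurations_via_metalearning=25,
)

# Turn metalearning off entirely; the BO loop then starts from random samples:
automl_cold = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    initial_configurations_via_metalearning=0,
)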

Longer story: I'm still in the process of slowly building a revamped AutoSklearn, and there we hope to support user-provided metadata. Part of this will also be providing an updated metadataset that solves some issues in the current set of configurations from metalearning.
Feel free to check out the AutoML Toolkit (amltk), which it is based on ;)

eddiebergman commented on June 5, 2024

Nope, AutoSklearn doesn't cache between calls. In fact, almost no caching happens at all, other than dumping models and predictions to disk for later use in predict().

eddiebergman commented on June 5, 2024

Intermediate results of models while it's running... not easily at all. Intermediate results in terms of post-analysis, yes, although models which are not in the top 50 (the default) are pruned to save disk space.

Whether you can improve these models further: yup, absolutely. We are revisiting the pipelines in the newer version.
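
If you want intermediate models to survive for post-analysis, here's a minimal sketch (assuming the max_models_on_disc argument, which controls the pruning mentioned above):

import autosklearn.classification

# Keep every evaluated model on disk instead of pruning to the top 50,
# so all intermediate results stay available for post-hoc analysis
# (at the cost of disk space).
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    max_models_on_disc=None,  # None keeps all models; the default is 50
)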

eddiebergman commented on June 5, 2024

We use SMAC as the Bayesian optimization library. You can find it below, although it's quite convoluted, considering auto-sklearn inherits and overrides some of SMAC's functionality:

from smac.facade.smac_ac_facade import SMAC4AC

sayannath commented on June 5, 2024

Hey @eddiebergman! Thanks for the reply. I actually want to log all the models and hyperparameters used by the autosklearn model.

PS: Not just the ensemble models. I want the models which were trained along the way, before the best model was found.

eddiebergman commented on June 5, 2024

Hiyo, unfortunately the easy ways are not the most informative:

  • You can use leaderboard(detailed=True, ensemble_only=False)
    • This has the downside that you won't really see the configurations as a whole.
  • You can use show_models(), which will give you the models that are actually used in the final ensemble.
    • However, it's not the best for visual output, as you can't directly see the hyperparameters; you would have to interrogate the actual objects returned.
  • You can directly access askl.automl_.runhistory_.data.items(), which is generated by the underlying Bayesian optimization tool SMAC (see the sketch after this list).
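
A rough sketch combining the three (assuming a fitted AutoSklearnClassifier named automl; the exact return type of show_models() varies between versions):

# 1. Leaderboard of all runs, not just the final ensemble members:
print(automl.leaderboard(detailed=True, ensemble_only=False))

# 2. The model objects actually used in the final ensemble:
print(automl.show_models())

# 3. The raw SMAC run history, mapping each configuration to its cost:
run_history = automl.automl_.runhistory_
for run_key, run_value in run_history.data.items():
    config = run_history.ids_config[run_key.config_id]
    print(config.get_dictionary(), run_value.cost)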

sayannath commented on June 5, 2024

I am getting the cost and the configuration like this:

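# Assumption: run_history is the SMAC run history, e.g. automl.automl_.runhistory_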
data_for_json = []
for run_key, run_value in run_history.data.items():
    config_id = run_key.config_id
    config = run_history.ids_config[config_id]

    # Convert configuration to a serializable format (dictionary with primitives)
    config_dict = config.get_dictionary()

    # Append configuration and cost to the list
    data_for_json.append({
        "configuration": config_dict,
        "cost": run_value.cost,
        # "run_value": run_value,
        # If you need to convert cost to a score, adjust accordingly
        # Example for accuracy: "score": 1 - run_value.cost
    })

Also, to let you know, I am using a bi-objective function, and in it I return a combined score.

So is that the correct way to do it? I am also dumping all the info into a JSON file.

sayannath commented on June 5, 2024

As I said, for training I am using a bi-objective function in autosklearn, like this:

def bi_objective_fn(solution, prediction):
    """
    Calculate a combined score of accuracy and fairness.

    :param solution: True labels.
    :param prediction: Predicted labels.
    :return: Combined score.
    """
    protected_attr = "Sex"
    metric_id = 2

    split = generate_train_subset("test_split.txt")
    subset_data_orig_train = data_orig_train.subset(split)

    if os.stat("beta.txt").st_size == 0:
        default = RandomForestClassifier(
            n_estimators=1750,
            criterion="gini",
            max_features=0.5,
            min_samples_split=6,
            min_samples_leaf=6,
            min_weight_fraction_leaf=0.0,
            max_leaf_nodes=None,
            min_impurity_decrease=0.0,
            bootstrap=True,
            max_depth=None,
        )
        degrees = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
        mutation_strategies = {"0": [1, 0], "1": [0, 1]}
        dataset_orig = subset_data_orig_train
        res = create_baseline(
            default,
            dataset_orig,
            privileged_groups,
            unprivileged_groups,
            data_splits=10,
            repetitions=10,
            odds=mutation_strategies,
            options=[0, 1],
            degrees=degrees,
        )
        acc0 = np.array(
            [np.mean([row[0] for row in res["0"][degree]]) for degree in degrees]
        )
        acc1 = np.array(
            [np.mean([row[0] for row in res["1"][degree]]) for degree in degrees]
        )
        fair0 = np.array(
            [
                np.mean([row[metric_id] for row in res["0"][degree]])
                for degree in degrees
            ]
        )
        fair1 = np.array(
            [
                np.mean([row[metric_id] for row in res["1"][degree]])
                for degree in degrees
            ]
        )

        if min(acc0) > min(acc1):
            beta = (max(acc0) - min(acc0)) / (max(acc0) - min(acc0) + max(fair0))
        else:
            beta = (max(acc1) - min(acc1)) / (max(acc1) - min(acc1) + max(fair1))

        f = open("beta.txt", "w")
        f.write(str(beta))
        f.close()
    else:
        f = open("beta.txt", "r")
        beta = float(f.read())
        f.close()
    beta += 0.2
    if beta > 1.0:
        beta = 1.0
    try:
        num_keys = sum(1 for line in open("num_keys.txt"))
        print(num_keys)
        beta -= 0.050 * int(int(num_keys) / 10)
        if int(num_keys) % 10 == 0:
            os.remove(temp_path + "/.auto-sklearn/ensemble_read_losses.pkl")
        f.close()
    except FileNotFoundError:
        pass
    fairness_metrics = [
        1 - np.mean(solution == prediction),
        disparate_impact(subset_data_orig_train, prediction, protected_attr),
        statistical_parity_difference(
            subset_data_orig_train, prediction, protected_attr
        ),
        equal_opportunity_difference(
            subset_data_orig_train, prediction, solution, protected_attr
        ),
        average_odds_difference(
            subset_data_orig_train, prediction, solution, protected_attr
        ),
    ]

    print(
        fairness_metrics[metric_id],
        1 - np.mean(solution == prediction),
        fairness_metrics[metric_id] * beta
        + (1 - np.mean(solution == prediction)) * (1 - beta),
        beta,
    )

    combined_score = fairness_metrics[metric_id] * beta + (
            1 - np.mean(solution == prediction)
    ) * (1 - beta)

    print(
        f"Beta: {beta}, Combined Score: {combined_score}, Fairness Metric: {fairness_metrics}, Accuracy: {np.mean(solution == prediction)}"
    )
    write_file(
        "./titanic_rf_spd_results/titanic_rf_score.txt",
        str(
            f"Combined Score: {combined_score}, Fairness Metric: {fairness_metrics}, Accuracy: {np.mean(solution == prediction)}\n"
        ),
        mode="a",
    )
    return combined_score


# Create a custom metric object (bi-objective function)
accuracy_scorer = autosklearn.metrics.make_scorer(
    name="accu",
    score_func=bi_objective_fn,
    optimum=1,
    greater_is_better=False,
    needs_proba=False,
    needs_threshold=False,
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60 * 60,
    memory_limit=10000000,
    include_estimators=["CustomRandomForest"],
    ensemble_size=1,
    initial_configurations_via_metalearning=25,
    include_preprocessors=[
        "kernel_pca",
        "select_percentile_classification",
        "select_rates_classification",
    ],
    tmp_folder=temp_path,
    delete_tmp_folder_after_terminate=False,
    metric=accuracy_scorer,
)

So I am unable to figure out what run_value.cost actually signifies.

Most of the cost values are 0.0. Can you help me with this?

eddiebergman commented on June 5, 2024

I can't really tell you why it's 0.0 all the time, but one thing that might help to know about is the worst_possible_result argument of make_scorer, which it seems to be returning.

Are you sure your metric is able to return a result? It seems like it's just constantly falling back to the worst_possible_result.
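
For reference, a rough sketch of where that knob lives (the values here are illustrative only; if I recall correctly the default worst_possible_result is 0.0, which would line up with the costs you're seeing whenever the metric fails):

import autosklearn.metrics

# Illustrative only: set worst_possible_result explicitly so that failed
# evaluations are recorded with a value you can distinguish from a real score.
scorer = autosklearn.metrics.make_scorer(
    name="accu",
    score_func=bi_objective_fn,
    optimum=1,
    worst_possible_result=1.0,  # recorded as the cost when the metric call fails
    greater_is_better=False,
    needs_proba=False,
    needs_threshold=False,
)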

sayannath commented on June 5, 2024

Thanks for the help! Can you help me out with initial_configurations_via_metalearning? What does 25 actually signify?

Also, what does the file autosklearn/metalearning/optimizers/metalearn_optimizer/metalearn_optimizerDefault.cfg actually do? Is it something we need to change to improve the performance of the model?

sayannath commented on June 5, 2024

Thanks for the detailed info. Is there any kind of caching that happens when we run the same model on the same dataset a couple of times?

sayannath commented on June 5, 2024

@eddiebergman

Can I get intermediate results of the models which are being ensembled, and apply some technique to make the models better, keeping a mutation-based or out-of-AutoML meta-learning-based idea in mind?

sayannath commented on June 5, 2024

Hey! Can you tell me what the cost means when we get the run history?

eddiebergman commented on June 5, 2024

It's just the metric value, converted in some manner such that it becomes something to be minimized, which is what SMAC needs. For bounded metrics, this also means it's min-max normalized to (0, 1), where 0 means optimal and 1 means worst. For unbounded metrics, it really just means flipping the sign of the value.
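
Roughly, the idea is (this is a sketch of the concept, not the library's exact code):

# Bounded metric to be maximized, e.g. accuracy in [0, 1] with optimum 1:
accuracy = 0.92
cost = 1.0 - accuracy        # -> 0.08; cost 0 is optimal, 1 is the worst case

# Unbounded metric to be maximized, e.g. a log-likelihood:
log_likelihood = -1234.5
cost = -log_likelihood       # just flip the sign so SMAC can minimize it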

sayannath commented on June 5, 2024

Hey @eddiebergman

Can we add custom meta-features to autosklearn's metalearning?
