deephyper / nasbigdata

Neural architecture search for big data problems

License: BSD 2-Clause "Simplified" License

Python 27.61% Jupyter Notebook 71.42% Shell 0.97%

nasbigdata's Introduction

AgEBO-Tabular

The code is available in the NASBigData GitHub repo.

Aging Evolution with Bayesian Optimization (AgEBO) is a nested, distributed algorithm for generating better neural architectures. Its advantages are:

  • the parallel evaluation of neural networks across computing resources (e.g., cores, GPUs, nodes).
  • the parallel training of each evaluated neural network through data parallelism (Horovod).
  • the joint optimization of hyperparameters and neural architectures, which enables the automatic adaptation of the data-parallelism settings to avoid a loss of accuracy (a schematic sketch of this loop follows the list).
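
For intuition only, here is a highly simplified, hypothetical sketch of the AgEBO loop (not the DeepHyper implementation): aging evolution proposes architectures by mutating tournament winners, while the hyperparameters attached to each child (learning rate, batch size, data-parallelism setting, etc.) would come from a Bayesian optimizer, replaced here by random sampling.

import random
from collections import deque

POPULATION_SIZE = 100
SAMPLE_SIZE = 10

def evaluate(arch, hp):
    # Placeholder: train the child network with data-parallelism and return a validation score.
    return random.random()

def propose_hyperparameters(history):
    # Placeholder for the Bayesian optimizer's suggestion (learning rate, batch size, etc.).
    return {"lr": 10 ** random.uniform(-4, -2), "batch_size": random.choice([64, 128, 256])}

def mutate(arch):
    # Placeholder: apply a single random mutation to the architecture encoding.
    child = list(arch)
    child[random.randrange(len(child))] = random.randrange(10)
    return child

def agebo(max_evals=1000):
    population = deque(maxlen=POPULATION_SIZE)  # aging: the oldest member is evicted first
    history = []
    for _ in range(max_evals):
        if len(population) < POPULATION_SIZE:
            arch = [random.randrange(10) for _ in range(5)]  # random architecture encoding
        else:
            sample = random.sample(list(population), SAMPLE_SIZE)
            parent = max(sample, key=lambda m: m["score"])  # tournament selection
            arch = mutate(parent["arch"])
        hp = propose_hyperparameters(history)  # the "BO" half of AgEBO
        score = evaluate(arch, hp)  # in AgEBO, evaluations run asynchronously in parallel
        member = {"arch": arch, "hp": hp, "score": score}
        population.append(member)
        history.append(member)
    return max(history, key=lambda m: m["score"])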

This repo contains the experimental materials linked to the implementation of the AgEBO algorithm in DeepHyper's repo. The version of DeepHyper used is commit e8e07e2db54dceed83b626104b66a07509a95a8c.

Environment information

The experiments were executed on the ThetaGPU supercomputer.

  • OS Login Node: Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-112-generic x86_64)
  • OS Compute Node: NVIDIA DGX Server Version 4.99.9 (GNU/Linux 5.3.0-62-generic x86_64)
  • Python: Miniconda Python 3.8

For more information about the environment, refer to infos-sc21.txt, which was generated with the provided SC Author-Kit.

Installation

Install Miniconda (conda.io), then create a Python environment:

conda create -n dh-env python=3.8

Then install DeepHyper. For the detailed installation process, follow the instructions at deephyper.readthedocs.io. We propose the following commands:

conda activate dh-env
conda install gxx_linux-64 gcc_linux-64 -y
git clone https://github.com/deephyper/deephyper.git
cd deephyper/
git checkout e8e07e2db54dceed83b626104b66a07509a95a8c
pip install -e .
pip install ray[default]

Finally, install the NASBigData package:

cd ..
git clone https://github.com/deephyper/NASBigData.git
cd NASBigData/
pip install -e .
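
As an optional sanity check (not part of the original instructions), both packages should now be importable from the activated environment:

import deephyper
import nas_big_data  # the package installed from the NASBigData repo

print(deephyper.__version__)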

Download and Generate datasets from ECP-Candle

Install the following dependencies:

pip install numba
pip install astropy
pip install patsy
pip install statsmodels

For the Combo dataset run:

cd NASBigData/nas_big_data/combo/
sh download_data.sh

For the Attn dataset run:

cd NASBigData/nas_big_data/attn/
sh download_data.sh
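
To verify a download, one can try loading the data from Python. The module path and return structure below are assumptions based on the repository layout (nas_big_data/<dataset>/); adjust them if the actual load_data signature differs.

from nas_big_data.combo.load_data import load_data  # assumed module path

(X_train, y_train), (X_valid, y_valid) = load_data()
print(len(X_train), len(y_train))  # len() works whether the inputs are arrays or lists of arrays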

How it works

The AgEBO algorithm (Aging Evolution with Bayesian Optimization) was added directly to the DeepHyper project and can be found here.

To submit and run an experiment on the ThetaGPU system, the following command is used:

deephyper ray-submit nas agebo -w combo_2gpu_8_agebo_sync -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh --n-jobs 16

where

  • -w denotes the name of the experiment.
  • -n denotes the number of nodes requested.
  • -t denotes the allocation time (in minutes) requested.
  • -A denotes the project's name at the ALCF.
  • -q denotes the queue's name.
  • --problem is the Python import path of the Problem definition (which defines the hyperparameter and neural architecture search space, the loss to optimize, etc.); a minimal sketch of such a module is given after this list.
  • --run is the Python import path of the run function (which evaluates each configuration sampled by the search).
  • --max-evals denotes the maximum number of evaluations to perform (often set to a high value so that the search uses the whole allocation time).
  • --num-cpus-per-task is the number of cores used by each evaluation.
  • --num-gpus-per-task is the number of GPUs used by each evaluation.
  • -as is the absolute path to the activation script SetUpEnv.sh (used to initialize the proper environment on compute nodes when the allocation starts).
  • --n-jobs is the number of processes that the surrogate model of the Bayesian optimizer can use.
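
For orientation, a Problem module such as nas_big_data.combo.problem_agebo typically defines the data loader, the architecture search space, and the hyperparameters exposed to the Bayesian optimizer. The sketch below is illustrative only: the exact NaProblem API of the DeepHyper commit used here may differ, and the search-space factory name is an assumption.

from deephyper.problem import NaProblem  # API of the DeepHyper version used here may differ
from nas_big_data.combo.load_data import load_data  # assumed module path
from nas_big_data.combo.search_space import create_search_space  # hypothetical factory name

Problem = NaProblem()
Problem.load_data(load_data)
Problem.search_space(create_search_space)
# Hyperparameters exposed to the Bayesian optimizer (the "BO" part of AgEBO),
# including those that control training; values here are placeholders.
Problem.hyperparameters(
    batch_size=Problem.add_hyperparameter((32, 1024, "log-uniform"), "batch_size"),
    learning_rate=Problem.add_hyperparameter((1e-4, 1e-1, "log-uniform"), "learning_rate"),
    num_epochs=1,
)
Problem.loss("mse")
Problem.metrics(["r2"])
Problem.objective("val_r2")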

The deephyper ray-submit ... command will create a directory named after -w and automatically generate a submission script for Cobalt (the scheduler at the ALCF). This submission script is composed of the following parts.

The initialization of the environment:

#!/bin/bash -x
#COBALT -A datascience
#COBALT -n 8
#COBALT -q full-node
#COBALT -t 180

mkdir infos && cd infos

ACTIVATE_PYTHON_ENV="/lus/grand/projects/datascience/regele/thetagpu/agebo/SetUpEnv.sh"
echo "Script to activate Python env: $ACTIVATE_PYTHON_ENV"
source $ACTIVATE_PYTHON_ENV
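
As an optional check (not part of the generated script), one can confirm from the activated environment on a compute node that TensorFlow sees the expected GPUs before launching the Ray cluster:

import tensorflow as tf

# Expect one entry per GPU of the node (8 on a ThetaGPU DGX node).
print(tf.config.list_physical_devices("GPU"))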

The initialization of the Ray cluster:

# USER CONFIGURATION
CPUS_PER_NODE=8
GPUS_PER_NODE=8


# Script to launch Ray cluster
# Getting the node names
mapfile -t nodes_array -d '\n' < $COBALT_NODEFILE

head_node=${nodes_array[0]}
head_node_ip=$(dig $head_node a +short | awk 'FNR==2')

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
  head_node_ip=${ADDR[1]}
else
  head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi

# Starting the Ray Head Node
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
ssh -tt $head_node_ip "source $ACTIVATE_PYTHON_ENV; \
    ray start --head --node-ip-address=$head_node_ip --port=$port \
    --num-cpus $CPUS_PER_NODE --num-gpus $GPUS_PER_NODE --block" &

# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10

# number of nodes other than the head node
worker_num=$((${#nodes_array[*]} - 1))
echo "$worker_num workers"

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    node_i_ip=$(dig $node_i a +short | awk 'FNR==1')
    echo "Starting WORKER $i at $node_i with ip=$node_i_ip"
    ssh -tt $node_i_ip "source $ACTIVATE_PYTHON_ENV; \
        ray start --address $ip_head \
        --num-cpus $CPUS_PER_NODE --num-gpus $GPUS_PER_NODE --block" &
    sleep 5
done
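
Once the head and worker nodes are up, a quick check (run from the head node with the same environment activated) confirms that every node joined the cluster with the expected resources:

import ray

# Connect to the running cluster started by the submission script.
ray.init(address="auto")
# Expect roughly n_nodes * CPUS_PER_NODE CPUs and n_nodes * GPUS_PER_NODE GPUs.
print(ray.cluster_resources())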

The DeepHyper command to start the search:

deephyper nas agebo --evaluator ray --ray-address auto \
    --problem nas_big_data.combo.problem_agebo.Problem \
    --run deephyper.nas.run.tf_distributed.run \
    --max-evals 10000 \
    --num-cpus-per-task 2 \
    --num-gpus-per-task 2 \
    --n-jobs=16
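
As the search runs, DeepHyper records the evaluated configurations; assuming it writes its usual results.csv file in the experiment directory, the results can be inspected with pandas (the path below follows the -w argument used above):

import pandas as pd

df = pd.read_csv("combo_2gpu_8_agebo_sync/results.csv")  # path derived from the -w argument
print(df.sort_values("objective", ascending=False).head())  # best configurations found so far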

Commands to reproduce

All the commands can be found in the NASBigData repo.

The experiments are named as {dataset}_{x}gpu_{y}_{z}_{other} where

  • dataset is the name of the corresponding dataset (e.g., combo or attn).
  • x is the number of GPUs used for each trained neural network (e.g., 1, 2, 4, 8).
  • y is the number of nodes used for the allocation (e.g., 1, 2, 4, 8, 16).
  • z is the name of the algorithm (e.g., age, agebo).
  • other denotes additional keywords used to differentiate some experiments (e.g., the kappa value); a small helper to decode these names is sketched below.
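
The following illustrative helper (not part of the repository) decodes an experiment name into its components:

import re

def parse_experiment_name(name):
    # {dataset}_{x}gpu_{y}_{z}_{other}, where the trailing "_{other}" part is optional.
    m = re.match(r"(?P<dataset>\w+?)_(?P<x>\d+)gpu_(?P<y>\d+)_(?P<z>[a-z]+)(?:_(?P<other>.+))?$", name)
    return m.groupdict() if m else None

print(parse_experiment_name("combo_4gpu_8_agebo_1_96"))
# {'dataset': 'combo', 'x': '4', 'y': '8', 'z': 'agebo', 'other': '1_96'}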

We give the full set of commands used to run our experiments.

Combo dataset

  • combo_1gpu_8_age
deephyper ray-submit nas regevo -w combo_1gpu_8_age -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.alpha.run --max-evals 10000 --num-cpus-per-task 1 --num-gpus-per-task 1 -as ../SetUpEnv.sh
  • combo_2gpu_8_age
deephyper ray-submit nas regevo -w combo_2gpu_8_age -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh
  • combo_8gpu_8_age
deephyper ray-submit nas regevo -w combo_8gpu_8_age -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 8 --num-gpus-per-task 8 -as ../SetUpEnv.sh
  • combo_8gpu_8_agebo
deephyper ray-submit nas agebo -w combo_8gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 8 --num-gpus-per-task 8 -as ../SetUpEnv.sh --n-jobs 16
  • combo_2gpu_8_agebo
deephyper ray-submit nas agebo -w combo_2gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh --n-jobs 16
  • combo_1gpu_2_age
deephyper ray-submit nas regevo -w combo_1gpu_2_age -n 2 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.alpha.run --max-evals 10000 --num-cpus-per-task 1 --num-gpus-per-task 1 -as ../SetUpEnv.sh
  • combo_2gpu_4_age
deephyper ray-submit nas regevo -w combo_2gpu_4_age -n 4 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh
  • combo_4gpu_8_age
deephyper ray-submit nas regevo -w combo_4gpu_8_age -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 4 --num-gpus-per-task 4 -as ../SetUpEnv.sh
  • combo_8gpu_16_age
deephyper ray-submit nas regevo -w combo_8gpu_16_age -n 16 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 8 --num-gpus-per-task 8 -as ../SetUpEnv.sh
  • combo_1gpu_2_agebo
deephyper ray-submit nas agebo -w combo_1gpu_2_agebo -n 2 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.alpha.run --max-evals 10000 --num-cpus-per-task 1 --num-gpus-per-task 1 -as ../SetUpEnv.sh --n-jobs 16
  • combo_2gpu_4_agebo
deephyper ray-submit nas agebo -w combo_2gpu_4_agebo -n 4 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh --n-jobs 16
  • combo_4gpu_8_agebo
deephyper ray-submit nas agebo -w combo_4gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 4 --num-gpus-per-task 4 -as ../SetUpEnv.sh --n-jobs 16
  • combo_8gpu_16_agebo
deephyper ray-submit nas agebo -w combo_8gpu_16_agebo -n 16 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 8 --num-gpus-per-task 8 -as ../SetUpEnv.sh --n-jobs 16
  • combo_4gpu_8_agebo_1_96
deephyper ray-submit nas agebo -w combo_4gpu_8_agebo_1_96 -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 4 --num-gpus-per-task 4 -as ../SetUpEnv.sh --n-jobs 16 --kappa 1.96
  • combo_4gpu_8_agebo_19_6
deephyper ray-submit nas agebo -w combo_4gpu_8_agebo_19_6 -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 4 --num-gpus-per-task 4 -as ../SetUpEnv.sh --n-jobs 16 --kappa 19.6
  • combo_1gpu_8_agebo
deephyper ray-submit nas agebo -w combo_1gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.alpha.run --max-evals 10000 --num-cpus-per-task 1 --num-gpus-per-task 1 -as ../SetUpEnv.sh --n-jobs 16
  • combo_4gpu_8_ambsmixed
deephyper ray-submit nas ambsmixed -w combo_4gpu_8_ambsmixed -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 4 --num-gpus-per-task 4 -as ../SetUpEnv.sh --n-jobs 16
  • combo_4gpu_8_regevomixed
deephyper ray-submit nas regevomixed -w combo_4gpu_8_regevomixed -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 4 --num-gpus-per-task 4 -as ../SetUpEnv.sh
  • combo_2gpu_1_age
deephyper ray-submit nas regevo -w combo_2gpu_1_age -n 1 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh
  • combo_2gpu_2_age
deephyper ray-submit nas regevo -w combo_2gpu_2_age -n 2 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh
  • combo_2gpu_16_age
deephyper ray-submit nas regevo -w combo_2gpu_16_age -n 16 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_ae.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh
  • combo_2gpu_1_agebo
deephyper ray-submit nas agebo -w combo_2gpu_1_agebo -n 1 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh --n-jobs 16
  • combo_2gpu_2_agebo
deephyper ray-submit nas agebo -w combo_2gpu_2_agebo -n 2 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh --n-jobs 16
  • combo_2gpu_16_agebo
deephyper ray-submit nas agebo -w combo_2gpu_16_agebo -n 16 -t 180 -A datascience -q full-node --problem nas_big_data.combo.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh --n-jobs 16

Attn dataset

  • attn_1gpu_8_age
deephyper ray-submit nas regevo -w attn_1gpu_8_age -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.attn.problem_ae.Problem --run deephyper.nas.run.alpha.run --max-evals 10000 --num-cpus-per-task 1 --num-gpus-per-task 1 -as ../SetUpEnv.sh
  • attn_1gpu_8_agebo
deephyper ray-submit nas agebo -w attn_1gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.attn.problem_agebo.Problem --run deephyper.nas.run.alpha.run --max-evals 10000 --num-cpus-per-task 1 --num-gpus-per-task 1 -as ../SetUpEnv.sh --n-jobs 16
  • attn_2gpu_8_agebo
deephyper ray-submit nas agebo -w attn_2gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.attn.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 2 --num-gpus-per-task 2 -as ../SetUpEnv.sh --n-jobs 16
  • attn_4gpu_8_agebo
deephyper ray-submit nas agebo -w attn_4gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.attn.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 4 --num-gpus-per-task 4 -as ../SetUpEnv.sh --n-jobs 16
  • attn_8gpu_8_agebo
deephyper ray-submit nas agebo -w attn_8gpu_8_agebo -n 8 -t 180 -A datascience -q full-node --problem nas_big_data.attn.problem_agebo.Problem --run deephyper.nas.run.tf_distributed.run --max-evals 10000 --num-cpus-per-task 8 --num-gpus-per-task 8 -as ../SetUpEnv.sh --n-jobs 16

nasbigdata's People

Contributors

deathn0t


nasbigdata's Issues

Question about airlines dataset

Hi,

I am trying to test the airlines application in your repository. However, I got an error in load_data.py.

load_data.py gets the dataset from deephyper.benchmark.datasets.airlines, this part works okay. Some columns in the dataset are strings, for example, the airlines/airport names.

After loading the dataset from deephyper.benchmark.datasets.airlines, the error appeared in prepro_input.fit_transform(X_train), which reported ValueError: could not convert string to float: 'OO'. (The detailed error is listed at the bottom.)

Do you have any suggestions about it? Or where can I get the correct dataset of it?

Thank you so much...

!!! USING TEST DATA !!!
Uncaught exception <class 'ValueError'>: could not convert string to float: 'OO'
Traceback (most recent call last):
  File "load_data.py", line 91, in <module>
    load_data(use_test=True)
  File "load_data.py", line 48, in load_data
    return load_data_cache(use_test=use_test)
  File "/lus/theta-fs0/projects/VeloC/hyliu/work_deephyper/deephyper/deephyper/benchmark/datasets/util.py", line 30, in wrapper
    (X_train, y_train), (X_valid, y_valid) = data_loader(*args, **kwargs)
  File "load_data.py", line 37, in load_data_cache
    X_train = prepro_input.fit_transform(X_train)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/pipeline.py", line 378, in fit_transform
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/pipeline.py", line 307, in _fit
    **fit_params_steps[name])
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/base.py", line 699, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 363, in fit
    return self.partial_fit(X, y)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 398, in partial_fit
    force_all_finite="allow-nan")
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/base.py", line 421, in _validate_data
    X = check_array(X, **check_params)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/sklearn/utils/validation.py", line 616, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/home/hyliu/work/softwares/conda/envs/testdh/lib/python3.7/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'OO'
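
As a general note on this class of error (a hedged sketch, not a fix from the repository): the scaler in the preprocessing pipeline only accepts numeric input, so string-valued columns such as carrier or airport codes need to be encoded first. Assuming X_train is a pandas DataFrame, one possible approach is:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

def make_preprocessor(X: pd.DataFrame) -> ColumnTransformer:
    # Encode string columns as integers and scale the numeric ones.
    cat_cols = list(X.select_dtypes(include="object").columns)
    num_cols = [c for c in X.columns if c not in cat_cols]
    return ColumnTransformer([
        ("categorical", OrdinalEncoder(), cat_cols),
        ("numerical", MinMaxScaler(), num_cols),
    ])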
