
NASBench: A Neural Architecture Search Dataset and Benchmark

License: Apache License 2.0

Python 64.03% Jupyter Notebook 35.97%

nasbench's Introduction

NASBench: A Neural Architecture Search Dataset and Benchmark

This repository contains the code used for generating and interacting with the NASBench dataset. The dataset contains 423,624 unique neural networks exhaustively generated and evaluated from a fixed graph-based search space.

Each network is trained and evaluated multiple times on CIFAR-10 at various training budgets, and we present the metrics in a queryable API. The current release contains over 5 million trained and evaluated models.

Our paper can be found at:

NAS-Bench-101: Towards Reproducible Neural Architecture Search

If you use this dataset, please cite:

@InProceedings{pmlr-v97-ying19a,
    title =     {{NAS}-Bench-101: Towards Reproducible Neural Architecture Search},
    author =    {Ying, Chris and Klein, Aaron and Christiansen, Eric and Real, Esteban and Murphy, Kevin and Hutter, Frank},
    booktitle = {Proceedings of the 36th International Conference on Machine Learning},
    pages =     {7105--7114},
    year =      {2019},
    editor =    {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
    volume =    {97},
    series =    {Proceedings of Machine Learning Research},
    address =   {Long Beach, California, USA},
    month =     {09--15 Jun},
    publisher = {PMLR},
    url =       {http://proceedings.mlr.press/v97/ying19a.html},
}

Dataset overview

NASBench is a tabular dataset which maps convolutional neural network architectures to their trained and evaluated performance on CIFAR-10. Specifically, all networks share the same network "skeleton", which can be seen in Figure (a) below. What changes between different models is the "module", which is a collection of neural network operations linked in an arbitrary graph-like structure.

Modules are represented by directed acyclic graphs with up to 7 vertices and 9 edges. The valid operations at each vertex are "3x3 convolution", "1x1 convolution", and "3x3 max-pooling". Figure (b) below shows an Inception-like cell within the dataset. Figure (c) shows a high-level overview of how the interior filter counts of each module are computed.

There are exactly 423,624 computationally unique modules within this search space, and each one has been trained three times at each of 4, 12, 36, and 108 epochs (423K * 3 * 4 ≈ 5M total trained models). We report the following metrics:

  • training accuracy
  • validation accuracy
  • testing accuracy
  • number of parameters
  • training time

The scatterplot below shows a comparison of number of parameters, training time, and mean validation accuracy of models trained for 108 epochs in the dataset.

See our paper for more detailed information about the design of this search space, further implementation details, and more in-depth analysis.

Colab

You can directly use this dataset from Google Colaboratory without needing to install anything on your local machine. Click "Open in Colab" below:

Open In Colab

Setup

  1. Clone this repo.

     git clone https://github.com/google-research/nasbench
     cd nasbench

  2. (optional) Create a virtualenv for this library.

     virtualenv venv
     source venv/bin/activate

  3. Install the project along with dependencies.

     pip install -e .

Note: the only required dependency is TensorFlow (1.x; see the TensorFlow 2.x incompatibility issues further down this page). The above instructions will install the CPU version of TensorFlow into the virtualenv. For other install options, see https://www.tensorflow.org/install/.
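A quick way to confirm the install worked is to import the API from Python; this also pulls in TensorFlow, so it surfaces the TF 2.x incompatibility reported in the issues below:

# Minimal post-install smoke test: importing nasbench.api also imports
# TensorFlow; under TF 2.x this raises the AttributeError shown in the
# issues further down this page.
from nasbench import api
print('nasbench api loaded from', api.__file__)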

Download the dataset

The full dataset (which includes all 5M data points at all 4 epoch lengths):

https://storage.googleapis.com/nasbench/nasbench_full.tfrecord

Size: ~1.95 GB, SHA256: 3d64db8180fb1b0207212f9032205064312b6907a3bbc81eabea10db2f5c7e9c


Subset of the dataset with only models trained at 108 epochs:

https://storage.googleapis.com/nasbench/nasbench_only108.tfrecord

Size: ~499 MB, SHA256: 4c39c3936e36a85269881d659e44e61a245babcb72cb374eacacf75d0e5f4fd1
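After downloading, it is worth verifying the checksum against the values above. A minimal sketch using Python's hashlib (the file name is assumed to match the download):

import hashlib

EXPECTED = '4c39c3936e36a85269881d659e44e61a245babcb72cb374eacacf75d0e5f4fd1'

sha256 = hashlib.sha256()
with open('nasbench_only108.tfrecord', 'rb') as f:
    # Hash in 1 MB chunks so the ~499 MB file is never fully loaded into memory.
    for chunk in iter(lambda: f.read(1 << 20), b''):
        sha256.update(chunk)

assert sha256.hexdigest() == EXPECTED, 'checksum mismatch; re-download the file'
print('checksum OK')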

Using the dataset

Example usage (see example.py for a full runnable example):

from nasbench import api

# Operation labels used by the dataset (the same constants are defined in
# example.py; the strings match the ops stored in the dataset).
INPUT = 'input'
OUTPUT = 'output'
CONV3X3 = 'conv3x3-bn-relu'
CONV1X1 = 'conv1x1-bn-relu'
MAXPOOL3X3 = 'maxpool3x3'

# Load the data from file (this will take some time)
nasbench = api.NASBench('/path/to/nasbench.tfrecord')

# Create an Inception-like module (5x5 convolution replaced with two 3x3
# convolutions).
model_spec = api.ModelSpec(
    # Adjacency matrix of the module
    matrix=[[0, 1, 1, 1, 0, 1, 0],    # input layer
            [0, 0, 0, 0, 0, 0, 1],    # 1x1 conv
            [0, 0, 0, 0, 0, 0, 1],    # 3x3 conv
            [0, 0, 0, 0, 1, 0, 0],    # 5x5 conv (replaced by two 3x3's)
            [0, 0, 0, 0, 0, 0, 1],    # 5x5 conv (replaced by two 3x3's)
            [0, 0, 0, 0, 0, 0, 1],    # 3x3 max-pool
            [0, 0, 0, 0, 0, 0, 0]],   # output layer
    # Operations at the vertices of the module, matches order of matrix
    ops=[INPUT, CONV1X1, CONV3X3, CONV3X3, CONV3X3, MAXPOOL3X3, OUTPUT])

# Query this model from dataset, returns a dictionary containing the metrics
# associated with this model.
data = nasbench.query(model_spec)
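Since every model was also trained at the shorter budgets, query() accepts an epochs argument (its signature appears in the nasbench/api.py excerpts quoted in the issues below). A small sketch, assuming the full nasbench_full.tfrecord is loaded, since the only108 subset contains a single budget:

# Query the same module at every available training budget. Note that each
# call samples one of the three training repeats at random.
for budget in (4, 12, 36, 108):
    data = nasbench.query(model_spec, epochs=budget)
    print(budget, data['validation_accuracy'], data['test_accuracy'])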

See nasbench/api.py for more information, including the constraints on valid module matrices and operations.

Note: it is not required to use nasbench/api.py to work with this dataset; you can see how to parse the dataset files from the initializer inside nasbench/api.py and then interact with the data however you'd like.
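For example, here is a sketch that walks over every stored module using methods that appear elsewhere on this page (hash_iterator, get_metrics_from_hash); the per-run field name final_test_accuracy is an assumption, so inspect one run dict to confirm it:

# Collect the adjacency matrix, ops, and mean 108-epoch test accuracy of
# every one of the 423,624 unique modules.
results = {}
for unique_hash in nasbench.hash_iterator():
    fixed_stat, computed_stat = nasbench.get_metrics_from_hash(unique_hash)
    matrix = fixed_stat['module_adjacency']
    ops = fixed_stat['module_operations']
    runs = computed_stat[108]  # the three independent training repeats
    mean_test_acc = sum(r['final_test_accuracy'] for r in runs) / len(runs)
    results[unique_hash] = (matrix, ops, mean_test_acc)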

How the dataset was generated

The dataset generation code is provided for reference, but the dataset has already been fully generated.

The list of unique computation graphs evaluated in this dataset was generated via nasbench/scripts/generate_graphs.py. Each of these graphs was evaluated multiple times via nasbench/scripts/run_evaluation.py.

How to run the unit tests

Unit tests are included for some of the algorithmically complex parts of the code. The tests can be run directly via Python. Example:

python nasbench/tests/model_builder_test.py

Disclaimer

This is not an official Google product.

nasbench's People

Contributors

chrisying

nasbench's Issues

Lower than expected final test accuracies for some models?

Hi,

Thank you for your great resource.
I have been starting to use your benchmark but had some queries about the final test accuracies that are returned for some models.

I'm using the nasbench_only108.tfrecord and checked the sha256sum is correct.

A few models that I find reported to have final test accuracies of ~10% are as follows (I give the hashes returned by nasbench.hash_iterator()):

01bcceabc42489b3af4b4496e333a86e
003d4f8e9c5a066d7b248230d8a4fcb5

However, when I train them for a few epochs with a constant learning rate, just to see whether the test accuracy really should be essentially random, I normally get > 40% validation accuracy, so a reported near-random test accuracy doesn't seem right to me. (Obviously the test accuracy isn't the validation accuracy and I'm not using the same training procedure, but I wouldn't expect such different results between the two.)

I get my test accuracies from the nasbench api as follows

for unique_hash in nasbench.hash_iterator():
    matrix = nasbench.fixed_statistics[unique_hash]['module_adjacency']
    operations = nasbench.fixed_statistics[unique_hash]['module_operations']
    spec = ModelSpec(matrix, operations)
    data = nasbench.query(spec)
    acc = 100. * data['test_accuracy']

Is this correct?

If they are incorrect, is there some systematic cause which would let me know which models I can trust the test accuracies for and which ones I can't?

Apologies if I've misunderstood something.

Thanks again

RMSProp epsilon=1.0, why?

Thank you so much for this codebase. It helps a lot to make NAS more reproducible.

I have a question regarding RMSProp. I do not see RMSProp often in computer vision, but I guess that is fine; the differences between optimizers are usually not that large. However, I see that you used epsilon=1.0, which I find odd, since this is the constant that usually prevents division-by-zero errors and you set it to a very high value. That high value introduces a systematic bias into the variance estimate. Do you have any references for other public results using this in conjunction with that high learning rate, or is there any particular reason why epsilon=1.0?

source code for calculating the graph edit distance

Hi, I'm currently working on some research problems related to the results in section "3.3. Locality" of your original paper. However, I cannot find the source code for calculating the graph edit distance between architectures. It would be very helpful if you could provide the code that generates "Figure 6" in the original paper. Many thanks!

Have you compared the execution time of various algorithms?

Figure 7 in the original paper compares the performance of various algorithms, but the x-axis is the estimated training time, which can be queried from the nasbench-101 dataset.

I am really interested in how much execution time each algorithm takes. By execution time I mean how long each algorithm takes to produce Figure 7, not the estimated training time.

Have you compared that part?

How much mean test accuracy can the best architecture achieve?

As the original paper says,

The best architecture in our dataset (Figure 1) achieved a mean test accuracy of 94.32%.

However, based on my calculation, the highest mean test accuracy on the nasbench_only108.tfrecord dataset is 0.9448784788449606.

How much mean test accuracy can the best architecture achieve?

Not compatible with tensorflow 2.0

The code works well with tensorflow 1.14 but throws the following error when using tf 2.0:

File ".../nasbench/lib/training_time.py", line 130, in
class _TimingRunHook(tf.train.SessionRunHook):
AttributeError: module 'tensorflow_core._api.v2.train' has no attribute 'SessionRunHook'

y-axis in Fig 7(left)

The left plot in Fig 7 in the paper shows test regret -- can you explain how that's computed exactly?

I know it's log10(y - y_best) -- but what is y_best exactly? Is that the best validation/test accuracy for a single model run / averaged across the 3 model runs?

I think the four possibilities would be:

test acc           mean across 3 runs           0.943175752957662
test acc           maximum across 3 runs        0.9466145634651184
validation acc     mean across 3 runs           0.9505542318026224
validation acc     maximum across 3 runs        0.9518229365348816

Thanks!

is_valid is True, but the query function reports an error

m = np.array([[0., 1., 0., 1., 0., 1., 1.],
              [0., 0., 0., 0., 1., 1., 1.],
              [0., 0., 0., 1., 0., 1., 1.],
              [0., 0., 0., 0., 0., 1., 1.],
              [0., 0., 0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0., 0., 1.],
              [0., 0., 0., 0., 0., 0., 0.]])
ops = ['input', 'maxpool3x3', 'conv1x1-bn-relu', 'conv3x3-bn-relu',
       'conv3x3-bn-relu', 'conv3x3-bn-relu', 'output']
cell = api.ModelSpec(matrix=m, ops=ops)
print(nasbench.is_valid(cell))

m is a matrix I found with an autoshrink search, and is_valid returns True.

And then,
data = nasbench.query(cell)
for k, v in data.items():
    print('%s: %s' % (k, str(v)))

It will report

KeyError Traceback (most recent call last)
in ()
----> 1 data = nasbench.query(cell)
2 for k, v in data.items():
3 print('%s: %s' % (k, str(v)))

~\Downloads\ECE590 NAS\nasbench\api.py in query(self, model_spec, epochs, stop_halfway)
235 % self.valid_epochs)
236
--> 237 fixed_stat, computed_stat = self.get_metrics_from_spec(model_spec)
238 sampled_index = random.randint(0, self.config['num_repeats'] - 1)
239 computed_stat = computed_stat[epochs][sampled_index]

~\Downloads\ECE590 NAS\nasbench\api.py in get_metrics_from_spec(self, model_spec)
364 self._check_spec(model_spec)
365 module_hash = self._hash_spec(model_spec)
--> 366 return self.get_metrics_from_hash(module_hash)
367
368 def _check_spec(self, model_spec):

~\Downloads\ECE590 NAS\nasbench\api.py in get_metrics_from_hash(self, module_hash)
346 fixed stats and computed stats of the model spec provided.
347 """
--> 348 fixed_stat = copy.deepcopy(self.fixed_statistics[module_hash])
349 computed_stat = copy.deepcopy(self.computed_statistics[module_hash])
350 return fixed_stat, computed_stat

KeyError: '926a133285cb34c1a38eefe60e7586d4'

How can I solve this problem?

Generating a dataset (tfrecord files)

Hi,
Thanks for open-sourcing the code behind the benchmark.
I am trying to reproduce some results, and it seems that the script 'run_evaluation.py' does generate some checkpoints and a results.txt file for each evaluated model, but not the "nasbench.tfrecord" file itself. Could you please point me to the appropriate script? So far it appears to be missing...

Thanks & Regards
K. Rene Traore

Data Query is non-deterministic

Hi,

I noticed that querying a specific architecture doesn't always result in getting the same parameters back.
Below is an example cell, where the query result shows the same operations and adjacency matrix, but different values for the floating point parameters.

Below are the results from querying the same architecture 3 times, getting different results each time.

{'module_adjacency': array([[0, 1, 1, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]], dtype=int8), 'module_operations': ['input', 'conv1x1-bn-relu', 'conv3x3-bn-relu', 'maxpool3x3', 'conv3x3-bn-relu', 'conv3x3-bn-relu', 'output'], 'trainable_parameters': 32426634, 'training_time': 4321.9140625, 'train_accuracy': 1.0, 'validation_accuracy': 0.9431089758872986, 'test_accuracy': 0.9406049847602844}
{'module_adjacency': array([[0, 1, 1, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]], dtype=int8), 'module_operations': ['input', 'conv1x1-bn-relu', 'conv3x3-bn-relu', 'maxpool3x3', 'conv3x3-bn-relu', 'conv3x3-bn-relu', 'output'], 'trainable_parameters': 32426634, 'training_time': 4326.7412109375, 'train_accuracy': 1.0, 'validation_accuracy': 0.9487179517745972, 'test_accuracy': 0.944411039352417}
{'module_adjacency': array([[0, 1, 1, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]], dtype=int8), 'module_operations': ['input', 'conv1x1-bn-relu', 'conv3x3-bn-relu', 'maxpool3x3', 'conv3x3-bn-relu', 'conv3x3-bn-relu', 'output'], 'trainable_parameters': 32426634, 'training_time': 4309.798828125, 'train_accuracy': 1.0, 'validation_accuracy': 0.9432091116905212, 'test_accuracy': 0.9445112347602844}

There is a difference in validation accuracy of 0.5%, and a difference in test accuracy of 0.4%.
There is also a difference in training time of about 17 (I'm assuming this is in seconds?).

Is there any word on where these inaccuracies come from, and what level of accuracy can be expected?

A brief round of testing revealed the following numbers w.r.t. accuracy (numbers were taken for the cell shown in the examples above; the code is from some of my unit tests):

query_result: Dict = self.nb101.query(cell)

self.assertAlmostEqual(1.0, query_result["train_accuracy"], delta=0.005)
self.assertAlmostEqual(0.9431, query_result["validation_accuracy"], delta=0.006)
self.assertAlmostEqual(0.9406, query_result["test_accuracy"], delta=0.006)
self.assertAlmostEqual(4321.91, query_result["training_time"], delta=20)
self.assertEqual(32426634, query_result["trainable_parameters"])
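This behaviour is consistent with the api.py excerpt quoted in the KeyError issue above, where query() draws sampled_index = random.randint(0, num_repeats - 1): each query returns one of the three training repeats at random. A sketch for reading all three repeats deterministically instead, using get_metrics_from_spec from that same excerpt (the per-run field names are assumptions):

# Read all three recorded runs for the spec from the README example above,
# rather than letting query() sample one of them at random.
fixed_stat, computed_stat = nasbench.get_metrics_from_spec(model_spec)
runs = computed_stat[108]  # the three independent training repeats
val_accs = [r['final_validation_accuracy'] for r in runs]
test_accs = [r['final_test_accuracy'] for r in runs]
print('mean validation accuracy:', sum(val_accs) / len(val_accs))
print('mean test accuracy:', sum(test_accs) / len(test_accs))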

Handling isomorphic architectures in benchmarked algorithms ?

I was wondering if the benchmarked algorithms in the paper (Figure 7) handle isomorphic architectures. To be more specific, when an algorithm creates a new architecture, do you check against all evaluated architectures for isomorphism or not?

The random search and the regularized evolution methods implemented in your NASBench.ipynb seem to only check if an architecture is valid. I did a quick experiment that collects the MD5 hash for each evaluated architecture. The results show that, on average over 11 repeats, more than 98% of the architectures created by regularized evolution are isomorphic.
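One way to approximate such a check in user code is sketched below; it assumes that the private helper nasbench._hash_spec(), visible in the api.py excerpt above, maps isomorphic (pruned) specs to the same hash:

# Skip candidate architectures whose module hash was already evaluated.
seen_hashes = set()

def is_new_architecture(spec):
    module_hash = nasbench._hash_spec(spec)  # private helper; see api.py
    if module_hash in seen_hashes:
        return False
    seen_hashes.add(module_hash)
    return True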

hdf5 file format

Can you provide an HDF5 file with fixed_stats and computed_stats precomputed in it, so that constructing the API class becomes faster? Right now it takes almost 5 minutes to construct the API class, and the culprit is looping through the tfrecord in Python, which seems very inefficient.

what is "total_time" in the tfrecord file?

Hi~
I find that each row in the tfrecord file has a field named "total_time", which is slightly larger than the training time for the full number of epochs. I am interested in what "total_time" represents, and I wonder whether "total_time" and "training_time" can be used to obtain the inference latency of each model?
Thanks~

How to create a model instance using a matrix?

Hi, I am not very familiar with TensorFlow and I am wondering whether I can create the model instance using a simple API like this:

nasbench/example.py

Lines 45 to 53 in b942470

matrix=[[0, 1, 1, 1, 0, 1, 0],    # input layer
        [0, 0, 0, 0, 0, 0, 1],    # 1x1 conv
        [0, 0, 0, 0, 0, 0, 1],    # 3x3 conv
        [0, 0, 0, 0, 1, 0, 0],    # 5x5 conv (replaced by two 3x3's)
        [0, 0, 0, 0, 0, 0, 1],    # 5x5 conv (replaced by two 3x3's)
        [0, 0, 0, 0, 0, 0, 1],    # 3x3 max-pool
        [0, 0, 0, 0, 0, 0, 0]],   # output layer
# Operations at the vertices of the module, matches order of matrix
ops=[INPUT, CONV1X1, CONV3X3, CONV3X3, CONV3X3, MAXPOOL3X3, OUTPUT])

Query neighbors of an architecture

Is there code somewhere here that takes an architecture and returns all of its neighbors? E.g., the architectures that can be generated by adding/removing one edge or changing the op at one node?

Something like

neighbors = get_neighbors(arch)
assert all([edit_distance(arch, n) == 1 for n in neighbors])

Thanks!
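There does not appear to be a public helper for this in the repo; below is a sketch of one. get_neighbors and ALLOWED_OPS are hypothetical names, and the op strings are the ones used elsewhere on this page:

import copy
import itertools

from nasbench import api

# Labels for the three interior operations (same strings used elsewhere in
# this README / the dataset).
ALLOWED_OPS = ['conv3x3-bn-relu', 'conv1x1-bn-relu', 'maxpool3x3']

def get_neighbors(matrix, ops, nasbench):
    """Hypothetical helper: yield every valid spec at edit distance 1.

    Edit distance 1 means either one edge flipped in the (upper-triangular)
    adjacency matrix or one interior op changed to a different allowed op.
    """
    n = len(ops)
    # Flip each possible edge above the diagonal.
    for i, j in itertools.combinations(range(n), 2):
        new_matrix = copy.deepcopy(matrix)
        new_matrix[i][j] = 1 - new_matrix[i][j]
        spec = api.ModelSpec(matrix=new_matrix, ops=ops)
        if nasbench.is_valid(spec):
            yield spec
    # Change the op at each interior vertex (input and output are fixed).
    for v in range(1, n - 1):
        for op in ALLOWED_OPS:
            if op == ops[v]:
                continue
            new_ops = list(ops)
            new_ops[v] = op
            spec = api.ModelSpec(matrix=matrix, ops=new_ops)
            if nasbench.is_valid(spec):
                yield spec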

Invalid model spec passes NASBench.is_valid check

It seems like networks that only have a single edge from the input to the output can pass the validation check. For example, the following network connectivity matrix:

[[0., 1., 1., 1., 1., 1., 1.],
 [0., 0., 1., 1., 1., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0.]]

How to extract all the architectures and metrics from the records?

I could not find such an example.

The only thing I found was how to query some random architectures and their metrics, using "nasbench.query(spec)"

How could I extract all the architectures and related metrics without having to query for a specific model configuration?

Reproduce Figure 8 in the NASBench paper

Hi, I am trying to duplicate the curves in Figure 8. I have some questions:

  1. What is the mutation rate p? How should I map it to the mutation probability of edges and node operations? I checked the code here; only one edge or one node is mutated at a time. Does that correspond to the case p=1? How could I get p=0.5 and p=2?
  2. I assume the tournament size t is the number of samples drawn from the population, which are compared to pick the best model as a parent. Is that correct?
  3. Were the curves obtained by searching on validation accuracy or test accuracy? I tried searching on validation accuracy and plotting test accuracy, and I observed small up-and-down fluctuations along the test curves even though I averaged over 100 runs. This is normal and expected, because a higher validation accuracy can map to a lower test accuracy. How did you avoid it in Figure 8?
     My results (plots of the median of 100 runs and of the mean of 100 runs, omitted here) don't match the results in Figure 8.

  4. Is the x-axis the median of the 100 runs, their mean, or something else? Is the x-axis on a linear scale?

Thank you so much.

tensorflow 2.3.1 incompatibility

Trying out the example I get the following error.

In [7]: from nasbench import api
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-5ba3f4b880ca> in <module>
----> 1 from nasbench import api

~/dump/nasbench/nasbench/api.py in <module>
     96 
     97 from nasbench.lib import config
---> 98 from nasbench.lib import evaluate
     99 from nasbench.lib import model_metrics_pb2
    100 from nasbench.lib import model_spec as _model_spec

~/dump/nasbench/nasbench/lib/evaluate.py in <module>
     22 
     23 from nasbench.lib import cifar
---> 24 from nasbench.lib import model_builder
     25 from nasbench.lib import training_time
     26 import numpy as np

~/dump/nasbench/nasbench/lib/model_builder.py in <module>
     29 
     30 from nasbench.lib import base_ops
---> 31 from nasbench.lib import training_time
     32 import numpy as np
     33 import tensorflow as tf

~/dump/nasbench/nasbench/lib/training_time.py in <module>
    128 
    129 
--> 130 class _TimingRunHook(tf.train.SessionRunHook):
    131   """Hook to stop the training after a certain amount of time."""
    132 

AttributeError: module 'tensorflow._api.v2.train' has no attribute 'SessionRunHook'
