distillpub / post--bayesian-optimization

Exploring Bayesian Optimization

Home Page: https://distill.pub/2020/bayesian-optimization/

HTML 11.13% TeX 2.63% CSS 0.61% Jupyter Notebook 84.74% JavaScript 0.89%

post--bayesian-optimization's Introduction

Post -- Exploring Bayesian Optimization

Breaking Bayesian Optimization into small, digestible chunks.

To view the rendered version of the post, visit: https://distill.pub/2020/bayesian-optimization/

Authors

Apoorv Agnihotri and Nipun Batra (both IIT Gandhinagar)

Offline viewing

Open public/index.html in your browser.

NB - the citations may not appear correctly in the offline render

post--bayesian-optimization's People

Contributors

apoorvagnihotri, chrisyeh96, ludwigschubert, nipunbatra


post--bayesian-optimization's Issues

Revision -- Feedback 1

  • Add a reference to the Matérn kernel
  • For prior model figure, also add the uncertainty in the legend
  • make the alpha higher for the uncertainty in all the plots.
  • the first plot showing only the GT has thin lines, others have thicker lines. I would suggest having line width something in between the two.
  • Make all the plots uniform
  • Verify the header levels in HTML
  • Change the trivial acquisition function's name in plots
  • Slow down all the animations
  • Show the CDF in a darker (lesser alpha) so that we can clearly see what is happening.
  • Put hyperparameters on a new line in the plots (after the iteration number). Write them with mathematical symbols (LaTeX in matplotlib). Also, no need to spell out the word HParams in the plots.
  • Do the same in the HTML: do not write eps; write \epsilon instead. I have made these corrections up to the EI section. Please do the same for everything that follows.
  • For Expected improvement, first write the formula in terms of expectation in improvement See slide labelled: An expected utility criterion in Nando De Freitas’ lecture: https://www.cs.ubc.ca/~nando/540-2013/lectures/l7.pdf
  • In the plots replace EI with \alpha_{EI} etc. for all the approaches
  • Write the formulae for GP-UCB and mention in text that it is similar to the first acquisition function that we had introduced. The text for GP-UCB seems broken (v to 3 …)
  • Don’t bring up the Bernoulli discussion for Thompson sampling. Instead, add the missing intuition: why drawing random samples from the posterior and maximizing them will likely trade off exploration and exploitation.
  • For hyperparams v/s params - just mention equation of Ridge regression and mention the params (\theta) and the hyperparams (\delta^2, learning rate,..)
  • The fonts on the SVM plots and those below are very small. Try some other cmap options too, and note that the maximum point isn’t clearly visible in many of the plots.
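For reference, the expectation form of EI requested above reduces to a well-known closed form under a Gaussian posterior. A minimal sketch (the function name and arguments are illustrative; mu and sigma would come from the surrogate posterior):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, eps=0.01):
    """alpha_EI(x) = E[max(f(x) - f(x+) - eps, 0)] under a Gaussian posterior.

    Closed form: (mu - f_best - eps) * Phi(Z) + sigma * phi(Z),
    where Z = (mu - f_best - eps) / sigma.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improve = mu - f_best - eps
    z = np.divide(improve, sigma, out=np.zeros_like(improve), where=sigma > 0)
    ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    # With zero variance there is nothing to integrate over:
    return np.where(sigma > 0, ei, np.maximum(improve, 0.0))
```

Increasing eps shrinks EI everywhere, which is the knob the feedback asks to expose in the plots.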

  • I think we may want to ensure that the convention for Bayesian optimisation is the same across the methods. We could stick to the convention you used in the EI section just now. I had written slightly differently for PI. It may be best to just use Nando de Freitas' convention as used in the paper linked in #2
  • https://arxiv.org/pdf/1012.2599.pdf is a nice paper to refer. I think we can just rename our first function as: UCB instead of ACQ1. You can also view Nando's lecture at 1:06 mins to understand why UCB would work.
  • From paper mentioned in #2, we can update some text about GP-UCB also.
  • Don't write Plotting Posterior etc. as the title. Posterior should be enough.
  • For the PI visualisation wrt epsilon, I don't think the effect of epsilon is visible. We should be able to somehow see the larger variance points being chosen. Maybe plot the PI on a log-scale for the same. Also, make the epsilon = 0 as 0.0. This will ensure the plot doesn't jump.
  • For NNs, you might want to store all the results, so that if we only have to tweak the visualisations, we do not have to re-run the entire code.
  • The EI plots have PI as the ylabel currently.
  • For the random plot in the comparison section for 1d, run the random acquisition with multiple random initialisations and show the mean and the variance.
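The GP-UCB formula requested above is the posterior mean plus a time-scaled uncertainty term. A sketch, using the beta_t schedule from Srinivas et al. (2010) as one common choice; the argument names here are illustrative, not the article's notation:

```python
import numpy as np

def gp_ucb(mu, sigma, t, beta=None, delta=0.1, d=1):
    """alpha_UCB(x) = mu(x) + sqrt(beta_t) * sigma(x).

    Unlike a fixed-coefficient UCB, beta_t grows with the iteration t,
    so later iterations weight the uncertainty term more heavily.
    One common schedule (Srinivas et al., 2010) for a d-dimensional domain:
        beta_t = 2 * log(t**(d/2 + 2) * pi**2 / (3 * delta))
    """
    if beta is None:
        beta = 2.0 * np.log(t ** (d / 2 + 2) * np.pi ** 2 / (3 * delta))
    return np.asarray(mu, dtype=float) + np.sqrt(beta) * np.asarray(sigma, dtype=float)
```

This makes explicit how GP-UCB resembles the first acquisition function introduced in the post while differing in the iteration-dependent coefficient.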

Need inputs
Add Thompson sampling to SVM and others & NN for diff methods

Moons dataset looks weird

Nice article! The image of the moons dataset at the end of the article looks a bit strange since the points are heavily overlapping and not in a moon shape.

[Screenshot of the moons dataset figure, 2020-05-06]

Is it the correct image?

Some `<d-math>` tags overflow on mobile

Hi Apoorv and Nipun! Congrats on the publication! ^_^
Before putting it on production, I fixed some things I perceived as styling issues. (Take a look at commits between b3de979 and f25b837 — let me know if you disagree with any of the changes; none of this is dogmatic.)

One issue I only fixed quickly but that may benefit from your attention is that some of the formulas don't break into multiple lines, and so are too large on mobile. I worked around that in cc60b27 by making those scrollable, but that's not ideal as someone may miss that they can scroll a formula. If you have the time, I'd appreciate it if you went through your article on a smartphone screen and considered breaking wide formulas into multiple lines.

Again, congrats on your publication!
—Ludwig, for the Distill editorial team.

Revision -- Feedback 3

  • The learning rate description says "all real numbers". That doesn't make sense. Isn't this on a log scale also?
  • for hyperparameter for Ridge, mention the equation showing W, c, X, and y, and also add that a GD based approach would also have a learning rate/step size hyperparameter
  • Comparison of different acquisition functions on gold mining figure — legend should be EI-PI and GP-UCB. Also, for all plots showing random stdev in comparison, make the random alpha lower (more transparent)
  • Last plot: the y-label has "achieved" misspelled. Also remove the title from this plot.
  • To put the neural network computation into perspective, add some numbers: how much time each evaluation (each hyperparameter combination) took on your laptop, what final accuracy was achieved, and what the final hyperparameters were. Then mention how much time you could save, given the heavy computational requirements, compared to grid search.
  • modify the readme with the correct title and the author information, Also link the IPYNB that was used for the experiments.
  • for the initial submission, remove the smileys.
  • Run the entire text through Grammarly once.
  • See if we can add additional citations.
  • In further reading mention Parallel BO.
  • Link some talks from GP summer schools on Bayesian Optimization also.
  • Keep on reading the post slowly multiple times to see if there is any improvement we can make.
  • Citations don't appear correctly - e.g., citations #17, #8 and #3.

Updates to Hero Plot

  • could we choose a different starting point (first sampled point), which ensures that we can see different next 'x' chosen as we increase the \epsilon
  • instead of having a slider outside the plot for epsilon, could we have something like: https://youtu.be/_8V5o2UHG0E?t=46627
  • Let's render maths also and see
  • we don't need xlabel = x in the first subplot (since the X-axis is common)
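For reference, the quantity the hero plot visualises, the probability of improvement as a function of epsilon, has a one-line closed form. A sketch (mu and sigma are assumed to come from a Gaussian surrogate posterior; the names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, eps=0.0):
    """alpha_PI(x) = P(f(x) >= f(x+) + eps) = Phi((mu - f_best - eps) / sigma).

    Raising eps demands a larger improvement before a point counts,
    which shifts preference toward high-variance (exploratory) points.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = np.divide(mu - f_best - eps, sigma,
                  out=np.zeros_like(mu), where=sigma > 0)
    # A zero-variance point improves either surely or not at all:
    return np.where(sigma > 0, norm.cdf(z), (mu - f_best - eps > 0).astype(float))
```

This is why a different starting point can change which next x is chosen as epsilon increases: the argmax of this function moves.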

Issues

  • Next and previous are not very clearly visible in the hero slides. use https://www.jssor.com/demos/image-slider.slider

  • In the slider images, have f(x) on the y-axis and remove the gold content references. They make sense for the examples but not for the summary. Also no need to then mention Ground Truth for ... Also, the x-axis should be "x" and not "X"

  • In the next slide, mention that this could be trivial if f(x) were cheap to evaluate; but often it is expensive to evaluate, like the amount of gold at a particular location, or the accuracy for a set of hyper-parameters of a machine learning model.

  • In slide you mention f(x*) in fewest evaluations, instead write: Objective: find the maxima f(x*) in few evaluations as sampling is expensive.

  • Remove the word constraints in the next slide. And just have the 2 enumerated points (second one without the mention of ground truth)

  • In next slide and other slides, remove GT references from legend also

  • Use a surrogate function ... remove the comma after prior

  • Next slide, use let us instead of let's and write GP in full

  • Next slide - write functional observation (an observation from f(x))

  • The Big Question - where to sample next to quickly find the maxima

  • No comma after One,

  • The next chosen point to observe is the one that maximises the probability of improvement over the current maximum (write this for the Choose point that maxim..)

  • In the main article, when you introduce Gaussian Processes, write them as Gaussian Processes (GPs)

  • In active learning procedure, "automate" replace with "simulate"

  • Old text
    Given the fact that we are only interested in knowing the location where the maximum occurs. It might be a good idea to evaluate at locations where our surrogate model's prediction mean is the highest, i.e. to exploit. But unfortunately, our mean is not always accurate, so we need to correct our mean which can be done by reducing variance or exploration. BO looks at both exploitation and exploration, whereas in the case of Active Learning Problem, we only cared about exploration.

New text
Given the fact that we are only interested in knowing the location where the maximum occurs, it might be a good idea to evaluate at locations where our surrogate model's prediction mean is the highest, i.e. to exploit. But unfortunately, our model mean is not always accurate (since we have limited observations), so we need to correct our model, which can be done by reducing variance or exploration. BO looks at both exploitation and exploration, whereas in the case of active learning, we only cared about exploration.

  • Acquisition Functions
    Text should be:

We just discussed that our original optimisation problem (equation) is hard given the expensive nature of evaluating f. The key idea of BO is to transform this single difficult optimisation into a sequence of easier, inexpensive optimisations of an acquisition function (alpha(x)). Each of these easier optimisations involves finding the next point to sample. Thus, we can interpret the acquisition function as commensurate with how desirable evaluating f at x is expected to be for the maximisation problem [CITE: https://www.cse.wustl.edu/~garnett/cse515t/spring_2015/files/lecture_notes/12.pdf]

While we have just now discussed that our goal is to transform the original optimisation into a sequence of easier optimisations, where is the "Bayesian" in this optimisation, and how is the acquisition function related? Let us rewind, go back to our surrogate model, and build the link between everything we have discussed thus far by noting the steps of BO [CITE: https://www.youtube.com/watch?list=PLZ_xn3EIbxZHoq8A3-2F4_rLyy61vkEpU&v=EnXxO3BAgYk]:

  1. Choose a surrogate model and its prior over space of objectives f
  2. Given the set of observations (function sampling), use Bayes rule to obtain the posterior
  3. Use an acquisition function (alpha(x)), which is a function of the posterior to decide where to sample next (x_t = argmax()..)
  4. Add the newly sampled data to the set of observations and go to Step 2 until convergence or the budget elapses

We now have three core ideas associated with acquisition functions: i) they are a function of the surrogate posterior; ii) they combine exploration and exploitation; and iii) they are inexpensive to evaluate. Let us now look into a few examples of commonly used acquisition functions to understand the concept better.
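The four steps above can be sketched as a minimal loop. This is an illustrative sketch, not the article's implementation: the toy objective, the grid-based maximisation of the acquisition, and the choice of PI as the acquisition are all simplifying assumptions (scikit-learn's GP serves as the surrogate):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=3, n_iter=10, eps=0.01, seed=0):
    """Minimal BO loop (maximisation) following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: surrogate model (GP with a Matern kernel) and initial samples.
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    grid = np.linspace(bounds[0], bounds[1], 501).reshape(-1, 1)
    for _ in range(n_iter):
        gp.fit(X, y)                                   # Step 2: posterior via Bayes rule
        mu, sigma = gp.predict(grid, return_std=True)
        z = (mu - y.max() - eps) / (sigma + 1e-9)
        alpha = norm.cdf(z)                            # Step 3: acquisition (PI here)
        x_next = grid[np.argmax(alpha)]
        X = np.vstack([X, x_next])                     # Step 4: augment and repeat
        y = np.append(y, f(x_next[0]))
    return X[np.argmax(y), 0], y.max()

# Toy run: maximise sin(x) on [0, 2*pi]
x_best, y_best = bayes_opt(np.sin, (0.0, 2 * np.pi))
```

Note how the three core ideas appear directly in the loop: the acquisition is a function of the posterior (mu, sigma), it balances exploration and exploitation, and maximising it over a grid is cheap compared to evaluating f.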

  • Remove the following text
    Let us understand this concept in two cases:

We have two points of similar means (of function values (gold in our case)). We now want to choose one of these to obtain the labels or values. We will choose the one with higher variance. This basically says that given the same exploitability, we choose the one with higher exploration value.

We have two points having the same variance. We would now choose the point with the higher mean. This basically says that given the same explorability, we will choose the one with higher exploitation value.

  • Remove the text "hero plot" and instead write below plot

  • In Intuition behind E -> change spread out sigma to symbol sigma

  • Everywhere except the title - change Active Learning to active learning

  • SVM example remove the GIFs for random and GP-UCB. Also, mention the optimal <C, gamma> found via grid search and via EI and PI.

  • dimentions --> dimensions

TODO Master issue

  • Write about Hyperparameter v/s parameter: https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)
  • Write about Thompson sampling
  • Motivate grid search expensive computation cost
  • Formally introduce the properties/problem of Bayesian optimization (aka Frazier video) and relate each point to the gold mining problem. This should happen after we have done the gold mining problem and more generally decide to introduce the problem.
  • Explain role of epsilon in PI
  • epsilon in EI..
  • Write about the wide range of packages and services, including SigOpt, Hyperopt, Spearmint, scikit-optimize, etc.
  • Look into examples introduced in sigopt/hyperopt/spearmint/skoptimize to seek inspiration
  • BO v/s GD: https://stats.stackexchange.com/questions/161923/bayesian-optimization-or-gradient-descent This also mentions some things that can be problematic for GPs (like inference being cubic in the number of observations)
  • Use prettier fonts and experiment with five thirty eight style for figures
  • Refer and read this excellent article: https://thuijskens.github.io/2016/12/29/bayesian-optimisation/ and the citations in it.
  • Introduce the problems (kernel choice, other hyperparameters) in Bayesian optimization. -- will refer to slide 30 in Peter Frazier's Bayesian Optimization tutorial, link
  • Why is the above problem much easier to deal with?
  • Different Epsilon and the result on PI
  • BO when there are categorical variables also: scikit-optimize/scikit-optimize#580 and hyperopt has an implementation, so does sigopt..
  • Neural network hyper-parameter search for simple CNN for MNIST: i) learning rate (0.001, 0.01, ...) - this would require us to probably use a log-scale to make the search linear; ii) batch size: 1, 2, 4, .. use a log scale base 2; iii) number of hidden units: maybe again log scale
  • Continuing from above - we can also make use of gradient information when available. This should be only a discussion, but should be good for us to know (after submission!)
  • Explain intuition of GP-UCB
  • Like SVM, create an optimisation on Random Forests with the two params being # estimators and max depth
  • Graph showing time saved.

Revision

  • Remove BO slides, they don't add value. e0eee6a
  • Correct hyperlinks in Acknowledgements e88978c
  • Acknowledgement -> Acknowledgements e88978c
  • Content gets clipped in collapsible f28df30
  • Writing Improvements e11591f 723b2dc
    • Include suggestions from Chris. 4b7055a
    • Abstract + Mining Gold! 1024eb5 91e2e1a
    • Active Learning 5539934
    • Bayesian Optimization + Formalizing Bayesian Optimization 6e40a36
    • Acquisition Functions 4b6d52c
    • Probability of Improvement (PI) 3df731b
    • Expected Improvement (EI) 010cd00
    • PI vs. EI 87ac6d5
    • Thompson Sampling 5abe99b
    • Random + Summary of Acquisition Functions
    • Other Acq. Functions 0be836f
    • Comparison + Why is it beneficial to optimize the acquisition function?
    • Hyperparameter Tuning cb5b009
    • Example 1, 2, 3 629fa36
    • Conclusion and Summary + Embrace Bayesian Optimization
    • Acknowledgements
    • Further Reading
  • make grids fainter 4cb44be
  • In Thompson acq. image, 6c8fd46
    • Title -> "samples from surrogate posterior"
    • Use black for Training Point.
    • Translucent bounds
  • Interactive Plot 13f00c9
    • In Interactive plot, we do not need ground truth.
    • New Interactive plot. 77a9c0c 1e6e1f4
  • Summary BO Plot
  • Sliders don't span the whole image. 61eb725
  • In examples plots 3edb4f0 7ba7bf8
    • lighter alpha across every plot 7ba7bf8
    • Iteration number should be common for the two subplots. (common title subplots fig.text used)
    • replace pink dots with hollow circles with thick black borders.
    • Have Text label with variable name.
      • for alpha
      • for C
      • for gamma

Duplicate reference

References 4 and 8 are both for the same paper: "Taking the Human Out of the Loop: A Review of Bayesian Optimization". Suggest removing one.

Revision -- Feedback 2

  • Add citation for Active learning. Burr Settles has a good review. There is another highly cited survey paper. Cite both. The citation should be added when we mention Problem 1.
  • When you introduce Gaussian Processes, the citations seem to be broken.
  • Footnote on Matern kernel comes after the end of the line. It should come before “.”
  • When you talk about ground truth for the first time, denote it with a function “f(x)”
  • In the figure showing the prior - label the Predicted as Predicted (mu), GT as GT (f). Do the same for all the plots. This will ensure that everyone is clear about f, mu, sigma, GT and predicted.
  • f(x^+) refers to the maximum functional value, i.e., max(train_y) — you have already introduced some convention, no need to use train_y, just use the convention introduced earlier.
  • Correct the figure showing PI and CDF: make initial epsilon 0.0 instead of 0;
    • the font size are not uniform compared to earlier plots, these seem to be smaller in this plot,
    • try to put the PI on the log-scale so that we can hopefully show that increasing epsilon leads to a different point being chosen by PI.
  • Consistency for hyper-parameter v/s hyperparameter. Either is okay; just use the same one everywhere.
  • Reduce the usage of inline code blocks. For example, for "how likely" in EI, either make it strong or italic, but no need to put it in a code block. Correct others too.
  • Convention: EI has a different meaning for x* as compared to x. Maybe correct it in PI also and make it x_{t+1}
  • Mockus citation does not show properly from the main text.
  • In the equation after talking about Mockus, you seem to have omitted _{t+1}. Let’s ensure the correctness and consistency across the entire post.
  • mean(x) - f(x^+) : Use the introduced symbols for mean(x)? And similarly for the next one also write sigma
  • In the PI v.s. EI graph — write epsilon_{PI} = 0.01 and similarly for epsilon_{EI}.
  • In the GP-UCB formulation, mention what is small t.
  • In GP-UCB mention how it looks very similar to UCB introduced earlier, but differs in the fact that the second term has a time-dependent/iteration-dependent coefficient.
  • Remove the discussion about Bernoulli
  • For EI-PI mention all the hyperparams for both the PI and the EI part of it. Also, mention that such combinations can help overcome the limitations of individual methods.
  • In the comparison plot, x-axis should be number of drilled sites
  • In the comparison plot, mention that the random experiment was done with multiple random seeds and we show the mean and the variance. Mention that the variance is very high.
  • Mention that for all the other methods, we took the “best” hyper parameters for each approach.
  • One small example that we can think of can be of linear regression, we don't really have hyperparameters, but the parameters are the ... Replace this discussion with Ridge regression (write the objective, mention \theta being a param, mention \lambda > 0 and the learning rate being hyperparams).
  • Be consistent with let’s or let us. Just prefer the latter as it more formal.
  • For the sklearn moon dataset, add legend for Class A and Class B.
  • In SVM section, mention briefly what C and gamma do.
  • The font size in the SVM plot is not consistent with the remaining figures and is very small.
  • In SVM plot, the title should be accuracy and accuracies.
  • In SVM plot, the accuracy should be on a scale of 40 to 100 and not on 0.4 to 1. This is just for avoiding any confusion.
  • In SVM plot, for both the subplots, take the legend outside the main plot. Else, it is very hard to see what is happening.
  • In SVM plot, the global maxima is hard to see. Make it much bigger.
  • Try with different cmaps: https://matplotlib.org/users/colormaps.html. Viridis might be okay.
  • Comparison between methods should be renamed to comparison of different acquisition functions on SVM classification task.
    • font size should be made consistent with all plots
    • make the lines a little thicker,
    • GP_UCB in legend should be GP-UCB
    • the X-label should be # queries made
  • Random Forest: isn't our task classification? You seem to mention using an RF regressor. As before with the SVM one, take the legend out. Add a bit of explanation for depth v/s number of trees.
  • The plot looks a little odd. Maybe it is due to the legend intermingling.
  • On RFs, the comparison plot for random doesn't seem to make sense. We can never decrease the max accuracy reached so far.
  • For the example on CNN, mention what hyperparameter space we are looking at: like batch size from 2^3 to .., similarly for others, and then mention how we can easily do this in modern packages, showing how it translates to simple Python code in scikit-optimize. As before, mention how changing the hyperparams might change the prediction.
  • The CNN example seems to end very abruptly. Mention the key learning if any. The plot also needs to be interpreted.
  • In conclusions, mention that optimizing or tuning hyperparams is an important facet of modern machine learning algorithms and BO is an efficient way of tuning the same. I would suggest that the last line should be — we hope you had a good time reading the article and hope you are ready to exploit the power of BO. In case you wish to explore more, please read the Further reading section below.
  • Wherever possible, don't write "I have ..."; instead write "we"
  • I would suggest we write “Further reading” and break it down into various aspects - like a) making use of gradient information when available; b) very quickly mention or link the difference between BO and GD; c) mention the caution about kernel; d) BO applications other than the ones discussed in this paper. This should also include the sensor placement paper we saw a few days back.
  • some practical tips! Scaled Accuracy
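The Ridge regression replacement suggested in the list above can be made concrete. A sketch (an illustrative gradient-descent implementation, not from the article): theta is the parameter learned from data, while lam (regularisation strength) and lr (learning rate) are hyperparameters fixed before training and tuned from outside:

```python
import numpy as np

def ridge_gd(X, y, lam=1.0, lr=0.01, n_steps=1000):
    """Ridge regression: min_theta ||X theta - y||^2 + lam * ||theta||^2.

    theta      -- parameter, learned from data by the optimiser
    lam, lr    -- hyperparameters, fixed before training
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T @ (X @ theta - y) + 2 * lam * theta
        theta -= lr * grad
    return theta
```

Bayesian optimisation would operate over (lam, lr), treating each full run of this training loop as one expensive evaluation.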

Review #2

The following peer review was solicited as part of the Distill review process.

The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: The article provides an exposition of Bayesian optimization methods. It motivates the use of Bayesian Optimization (BO), and gives several examples of applying BO for different objectives.

Advancing the Dialogue Score
How significant are these contributions? 2/5

Comments

I think the main contribution of the current article is in the simulations, which illustrate BO in practice. However, I believe the article does not do a great job of explaining the setup and foundations of BO, and of unifying the various examples under a common framework. In this sense, I don't believe its exposition is a significant contribution.

For example, I think the following short note (which the authors cite) does an excellent job of briefly introducing the BO formalism, and presenting different instantiations of BO (for different objective functions) under the same underlying framework: https://www.cse.wustl.edu/~garnett/cse515t/spring_2015/files/lecture_notes/12.pdf

Outstanding Communication Score
Article Structure 3/5
Writing Style 3/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 2/5

Comments

  • The article is fairly long for the core content, and it's easy to get lost in the details of the various examples, and lose track of the main points.
  • It does not use jargon, but the writing is a bit verbose and could be condensed.
  • I would omit the interactive figure -- there are too many moving parts, and it confuses more than clarifies. The non-interactive simulations are good, although there are perhaps too many of them -- just a few simulations would convey the point (that different objective functions result in different optimization procedures).
  • The diagram format is standard (simulations of a GP), although there is value in doing and showing these simulations in the context of Bayesian optimization.
  • Assuming knowledge of Gaussian Processes, this topic (BO with GP prior) is not very difficult. In particular, it can be described simply as:
    -- Assume a Gaussian Process prior on the ground-truth function F.
    -- Formalize your objective (eg. sampling a point 'x' with maximum expected value of F(x), or maximizing the probability that F(x) > F(x_j) for all previously-sampled points x_j)
    -- Use the existing samples {(x, F(x))} to compute the posterior of F given the samples (under the GP prior), and maximize your objective function under the posterior. This yields a choice of new point to sample.
    -- (Different "acquisition functions" simply correspond to different objectives in step (2)).
  • The current article is fairly long for conveying the above point, and it includes many details which can be distracting (eg, equations for the exact form of the maximization in (3), which does not add much conceptually).
  • Concretely, I suggest cutting a lot of the discussion about details of various acquisition functions, and just presenting a few examples to convey the point that different objectives (Step 2) yield different optimization procedures.
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 5/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 5/5

Comments

Minor points:

  • For comparing vs. a random strategy, I would just compare against a truly random strategy, instead of using a "random acquisition function" within the BO framework as a proxy for this.
  • In the "comparison" plots, the random strategy appears to do about as well as the Bayesian Optimization -- which means this is not a setting that convinces me that BO is powerful.
  • In the "comparison" plot, different acquisition functions correspond to different objectives. However, we are evaluating them all under the same objective, which is somewhat unfair. In particular, if the objective is well-specified and the ground-truth is actually drawn from a GP prior, then BO should exactly maximize the expected objective value (ie, it should be the optimal thing to do, if the assumptions hold).

Major points:

With respect to the scientific content, my main issue is that there is no clear distinction made between:

  • (A) Bayesian optimization as a formal framework, with provable optimality guarantees.
  • (B) Bayesian optimization as it's used in practice (e.g. even if the true ground truth is not drawn from a Gaussian process, we can still apply BO methods and hope to get something reasonable, though not provably so).

These two viewpoints are conflated throughout the article. For example, in the section "Formalizing Bayesian Optimization", the points described are actually heuristics about setting (B), not formalisms in the sense of (A).

This confusion also makes it difficult to see how different acquisition functions relate to each other, and what our actual objective is in choosing between different acquisition functions.

Review #1

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Austin Huang for taking the time to review this article.


General Comments

Missing Tools for Reasoning

Acquisition functions are introduced from a definitional standpoint and their behavior is illustrated for a relatively artificial example. Sometimes the methods are shown to work, sometimes they don't. How does one think about implementation alternatives when working on a new problem? The article provides few conceptual tools for the reader to apply these methods successfully.

There are also serious issues with model misspecification underneath the surface of these implementations (see, for example, the Thompson Sampling discussion). However, the article doesn't even raise the topic; the discussion starts from a fixed model specification and anecdotally shows methods either working or not under a narrow example.

Relatedly, there's a section entitled "Why is it easier to optimize the acquisition function?" This framing may be misleading since "easiness" isn't the goal. The real question seems to be "Why is it beneficial to optimize the acquisition function?" or perhaps "Is it even beneficial to optimize with respect to an acquisition function?"

Does the Hero Plot Illustrate a Central Aspect of the Discussion?

An interactive visualization communicates a response function to the variables that can be affected by input. In the hero plot, this corresponds to the response of the acquisition function as a function of the epsilon hyperparameter in a PI acquisition function for fixed data and ground truth. It also shows the CDF for two slices of X (1.0 and 5.0), which are intermediate computations used by the acquisition function.

Is that particular relationship sufficiently central to the article to be front and center? There are other relationships that seem more central to the topic that could have been highlighted (how choices of acquisition function compare, how the acquisition function changes with data). The plot is nice to interact with for thinking about exploration/exploitation in PI, but it doesn't seem to be an obvious choice as the hero plot.

Minor visual issue - the vertical labels look buggy, with 0.00e+0 cutting through the axis line.

Grey backgrounds don't fit Distill's Template

The patch of grey rectangle background for each figure doesn't fit the aesthetic of the distill template. The convention in other articles seems to be white-on-white with no boundary or occasionally a horizontal ribbon that runs the width of the page for visualizations with lots of margin content.

Animations are Overused

Note that in other Distill articles, animations are used sparingly, usually only in the opening or concluding figure.

Looping animations are overused here, and ultimately they are not a good way to illustrate a dependency relationship compared to a visual with a control.

Even if the content of those figures is kept as is, replacing the loop with a slider (http://worrydream.com/LadderOfAbstraction/) would be an improvement: it is less distracting and lets the reader examine relationships between iterations more carefully.

Introduction to EI is Confusing

Perhaps the framing in terms of the unknown ground truth was the original motivation, but here it just makes the reasoning convoluted without adding much insight. I don't see any reason not to jump straight to the definition the name describes, expected improvement (i.e. the second equation).

Thompson Sampling

""It has a low overhead of setting up."" - not sure why this is specifically pointed out in the case of TS, is overhead any lower to set up than the other acquisition functions?

The statement that "This will ensure an exploratory behaviour." is contradicted by the animation that follows. From that demo's figures, it actually seems nearly impossible to reach the global minimum without refining the underlying GP model: there isn't enough variability in the sampled functions to explore adequately. Yet the example is simply left without further comment.
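To make the mechanism under discussion concrete, one Thompson Sampling iteration draws a single function from the GP posterior and queries its maximizer; the amount of exploration is therefore bounded by the spread of that posterior. A minimal numpy sketch, using an assumed RBF kernel rather than the article's Matern choice and with illustrative names:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between row vectors of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale**2)

def thompson_sampling_step(X_obs, y_obs, X_grid, rng, noise=1e-6):
    """One TS iteration: sample one function from the GP posterior, query its argmax."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf_kernel(X_obs, X_grid)
    Kss = rbf_kernel(X_grid, X_grid)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y_obs                      # posterior mean on the grid
    cov = Kss - Ks.T @ Kinv @ Ks                  # posterior covariance on the grid
    sample = rng.multivariate_normal(mu, cov + noise * np.eye(len(X_grid)))
    return X_grid[np.argmax(sample)]              # next point to evaluate
```

If the posterior covariance is tight (as in the demo once a few points are observed), every sampled function peaks in the same region, and the behavior the review describes, a failure to escape a local optimum, follows directly.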

Hyperparameter Tuning - Axis Labels

The horizontal axis label "# of Hyper-Parameters Tested" is confusing, since it doesn't refer to the number of hyper-parameters tested but rather to the number of values that have been evaluated.

Hyperparameter Tuning - Changing colormap scale makes it impossible to track the function evolution

The colormaps should probably not rescale with each iteration; the rescaling makes it very difficult to track the evolution of the acquisition function between frames.

As mentioned above, replacing all or most animations with a slider control would also improve the legibility of the figure.

Legend tweaks

  • The legend positioning for the top "hero" plot looks buggy. "GT", "GP" and "\epsilon" are glued to the point without any spacing; the alignment looks very off.
  • Not sure why "GT" is abbreviated when longer captions like "Acquisition function" are not.
  • "Train points" -> "Training points"
  • Given that the legends are already really busy, "(Tie randomly broken)" would be better as a linked footnote.

"# Minor Writing Improvements

  • ""Older problem - Earlier in the active learning problem ... "" can remove the preface and start with ""In the active learning problem ...""
  • ""We can write a general form of an acquisition function ..."" this sentence could be more weight and made more explicit about stating that mu(x) models exploitation and sigma(x) represents the value of exploration. It's implied by the phrasing, but could be clearer.
  • Don't nest parenthesis in parenthesis ""(of function values (gold in our case))""
  • ""We can obtain a closed form solution as below"" - expression in terms of CDF is not usually considered ""closed form"" https://en.wikipedia.org/wiki/Closed-form_expression would just avoid using the phrase
  • "" h_{t+1} is our GP posterior of the ground truth"" - guessing intends to refer to the ""posterior mean"" since it needs to be a function
  • ""first vanilla acquisition function"" - reference UCB directly instead of referring to it as ""first vanilla acquisition function""
  • (try to find the global maxima that might be near this “best” location)"" - this parenthetical remark is confusing and doesn't add to the statement.
  • ""easily"" is used a lot throughout the article and in almost all cases the sentence improves by the omission of this unnecessary subjective qualifier. ""equation can be easily converted..."", ""One can easily change ..."", ""We can easily apply the BO for more dimensions"", ""... can easily be incorporated into BO."" (2 times in the same sentence in the last example)

Concluding Comments

Bayesian optimization and active learning aren't particularly popular topics to write about at the moment, yet I suspect there's quite a bit of interest in them, particularly in industry and applied machine learning contexts.

Given that, this article does fill a notable gap in the research distillation space. However, I think more work needs to be put into this manuscript to raise the quality of communication to the level of other Distill articles.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 2/5
Writing Style 3/5
Diagram & Interface Style 3/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 3/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 3/5
How easy would it be to replicate (or falsify) the results? 4/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 3/5

Review #3

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Jasper Snoek for taking the time to review this article.



Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: This article gives an intuitive explanation of the basics of Bayesian optimization and it is an enjoyable and interesting read. It does a nice job of visually demonstrating the impact and the behavior of a variety of choices made within Bayesian optimization. It does a great job visualizing the various acquisition functions in Bayesian optimization with the help of nice interactive plots and animations.

Advancing the Dialogue Score
How significant are these contributions? 4/5

Comments

I think this is a really clean and concise introduction to Bayesian optimization and some of the nuances of the underlying strategy followed. Bayesian optimization is certainly dear to me and I appreciate having someone take the time to produce nice visualizations so I feel inclined to accept. There is a lot that could be added to this post, but I suppose in the spirit of the journal (i.e. short and crisp) this might be just right? In any case there are some underlying modeling issues that I would like corrected before this is accepted.

Outstanding Communication Score
Article Structure 4/5
Writing Style 5/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 4/5

Comments

  • I think the article is structured well and flows nicely.
  • Writing style is very easy to follow and accessible. A couple of minor typos.
  • The diagrams are intuitive and I love having interactive plots of acquisition functions in Bayesian optimization.
  • The diagrams are really helpful, elegant and really drive the story. However, they aren’t terribly novel since this is exactly how we’ve been visualizing Bayesian optimization in papers and texts for years. The authors do turn them into animations, however, which is a neat upgrade from the static plots previously used.
  • Very readable. The authors side-step some of the more technical aspects of Bayesian modeling (e.g. how does a Gaussian process work and how do you fit it). However, for the purposes of this article that might be better.
Scientific Correctness & Integrity Score
Are claims in the article well supported? 4/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 4/5

Comments

  • I think there’s a lot of specifics, challenges and follow up work that isn’t referenced here. However, the authors do a good job of citing relevant work.
  • This is just an introductory exposition that would be really easy to replicate empirically.
  • Yes. There is a lot of relevant work missing, e.g. citations for Thompson sampling (http://proceedings.mlr.press/v84/kandasamy18a/kandasamy18a.pdf) but most of the relevant citations are there.
  • There is what I see as a critical modeling mistake in the diagrams and resulting analysis. The authors don’t adjust for the mean of the data, which results in weird and pathological behavior. I would like the authors to correct this and then resubmit (and then I think it should be ok). Detailed comments are below.

Detailed comments

Intro:
Note the word “hyperparameter” can actually be contentious. The traditional definition of a hyperparameter is as a higher level model parameter influencing the parameters of the model. Under that definition things like learning rate, optimization parameters, etc. don’t really apply. So I always say Bayesian optimization is used to tune hyperparameters, optimization parameters and other model parameters.

Mining Gold:
There’s a neat historical precedent to this that’s worth mentioning. The first use of Gaussian processes was actually to model ore density in South Africa. Applying Gaussian processes was initially called “Kriging” after Danie Krige (https://en.wikipedia.org/wiki/Danie_G._Krige) who used GPs to model the spatial density of ore deposits (and figure out where to drill).

“Active Learning”:
“We cannot estimate the gold estimate” sounds awkward. Maybe rephrase.

“Gaussian processes”:
“Gaussian processes regression” -> Gaussian process regression
“(Smoothess) Such” -> “(Smoothness). Such”

“Prior model”:
The Matern kernel is a specific choice of prior that is worth spending some time rationalizing. It’s probably a good idea to introduce the concept of a kernel and how the choice of kernel corresponds to a prior over functions. Then describe how the Matern lets you determine the smoothness of the prior (i.e. a Matern 5/2 means twice differentiable). How does that correspond to your assumptions about gold smoothness?
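As a reference point for that suggested discussion, the Matern 5/2 covariance (written here as a function of distance r, with an illustrative name; not code from the article) is:

```python
import numpy as np

def matern52(r, length_scale=1.0):
    """Matern 5/2 covariance as a function of distance r.

    Sample paths of a GP with this kernel are twice differentiable,
    which is the smoothness assumption being placed on the gold distribution.
    """
    s = np.sqrt(5.0) * np.abs(r) / length_scale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

Spelling the kernel out like this makes it easier to connect the nu = 5/2 choice to a concrete prior belief: correlations decay smoothly with distance, at a rate set by the length scale.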

“UCB”:
“This is because while the variance or uncertainty is high for such points, the posterior mean is low.”
The posterior mean is low because the prior mean is set to 0. You could subtract the mean from the observed data to make it 0 mean (then in your case the mean would go to ~5 instead of 0 and the acquisition function would be higher further from the data). It would be better to make the mean a hyperparameter of the GP and optimize it or integrate it out.
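The simpler of the two suggested fixes, subtracting the empirical mean before fitting and adding it back at prediction time, can be sketched as a thin wrapper. This assumes an sklearn-style estimator with fit/predict and is an illustration of the suggestion, not code from the article:

```python
import numpy as np

class MeanCenteredGP:
    """Wrap a zero-mean GP regressor so its posterior reverts to the data mean.

    Without this, a stationary zero-mean GP returns to 0 away from the data,
    which distorts acquisition functions when the observations average ~5.
    """
    def __init__(self, gp):
        self.gp = gp
        self.y_mean = 0.0

    def fit(self, X, y):
        self.y_mean = float(np.mean(y))           # subtract the empirical mean...
        self.gp.fit(X, np.asarray(y) - self.y_mean)  # ...fit the zero-mean GP to residuals
        return self

    def predict(self, X, return_std=False):
        if return_std:
            mu, std = self.gp.predict(X, return_std=True)
            return mu + self.y_mean, std          # the shift leaves the std unchanged
        return self.gp.predict(X) + self.y_mean
```

With this wrapper, the posterior mean far from the data reverts to the observed average rather than to 0, so uncertain regions are no longer penalized by an artificially low mean.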

“Probability of Improvement”:
“given same exploitability” -> “given the same...”

The “hero plot” is neat. Though I’m not sure I understand why it’s called a hero plot.

“(can be identified by the grey translucent area” is missing closing parens.

“Expected Improvement”:
These plots are neat and convey the intuition really nicely. However, the EI values are tiny and the behavior does not follow what I have seen for EI. I suspect the optimization routine is under exploring because of the zero-mean issue I brought up above. Specifically, for a stationary kernel, the GP posterior will return to the mean when moving away from the data. In this case, it’s returning to 0, which is silly since 0 is not the mean of the observations you have seen. I suspect the routine will be much better behaved if you subtract out the mean of the data, fit the model, and then add the mean back in.

“Gaussian Process Upper Confidence Bound (GP-UCB)”:
I like the discussion of regret. However, I don’t think Srinivas et al. introduced GP-UCB and other acquisition functions also minimize regret. Instead they derived some elegant bounds on regret under the GP-UCB acquisition function. I think I would rephrase this to say something like: “Srinivas et al. developed a schedule for \beta that they theoretically demonstrated minimizes cumulative regret. The schedule is T_t … “
