distillpub / post--bayesian-optimization

Exploring Bayesian Optimization

Home Page: https://distill.pub/2020/bayesian-optimization/

HTML 11.13% TeX 2.63% CSS 0.61% Jupyter Notebook 84.74% JavaScript 0.89%

post--bayesian-optimization's Introduction

Post -- Exploring Bayesian Optimization

Breaking Bayesian Optimization into small, digestible chunks.

To view the rendered version of the post, visit: https://distill.pub/2020/bayesian-optimization/

Authors

Apoorv Agnihotri and Nipun Batra (both IIT Gandhinagar)

Offline viewing

Open public/index.html in your browser.

NB - the citations may not appear correctly in the offline render

post--bayesian-optimization's People

Contributors

apoorvagnihotri, chrisyeh96, ludwigschubert, nipunbatra


post--bayesian-optimization's Issues

Revision -- Feedback 1

  • Add a reference to the Matérn kernel
  • For prior model figure, also add the uncertainty in the legend
  • make the alpha higher for the uncertainty in all the plots.
  • the first plot showing only the GT has thin lines, others have thicker lines. I would suggest having line width something in between the two.
  • Make all the plots uniform
  • Verify the header levels in HTML
  • Change the trivial acquisition function's name in plots
  • Slow down all the animations
  • Show the CDF in a darker (lesser alpha) so that we can clearly see what is happening.
  • Put hyperparameters on a new line in the plots (after the iteration number). Write them with mathematical symbols (LaTeX in matplotlib). Also, no need to spell out the word HParams in the plots.
  • Do the same in the HTML: do not write eps; write \epsilon instead. I have made these corrections up to the EI section. Please do the same for everything that follows.
  • For Expected improvement, first write the formula in terms of expectation in improvement See slide labelled: An expected utility criterion in Nando De Freitas’ lecture: https://www.cs.ubc.ca/~nando/540-2013/lectures/l7.pdf
  • In the plots replace EI with \alpha_{EI} etc. for all the approaches
  • Write the formulae for GP-UCB and mention in text that it is similar to the first acquisition function that we had introduced. The text for GP-UCB seems broken (v to 3 …)
  • Don’t bring up the Bernoulli discussion for Thompson sampling. Instead, add the missing intuition: why drawing random samples from the posterior and maximizing them will likely trade off exploration and exploitation.
  • For hyperparams v/s params - just mention equation of Ridge regression and mention the params (\theta) and the hyperparams (\delta^2, learning rate,..)
  • The fonts on the SVM plots and those below are very small. Try some other cmap options too, and note that the maximum point isn’t clearly visible in many of the plots.
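For reference, the expectation form of EI requested above reduces to a well-known closed form under a Gaussian posterior. A minimal sketch (the function name and arguments are illustrative; mu and sigma would come from the surrogate posterior):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, eps=0.01):
    """alpha_EI(x) = E[max(f(x) - f(x+) - eps, 0)] under a Gaussian posterior.

    Closed form: (mu - f_best - eps) * Phi(Z) + sigma * phi(Z),
    where Z = (mu - f_best - eps) / sigma.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improve = mu - f_best - eps
    z = np.divide(improve, sigma, out=np.zeros_like(improve), where=sigma > 0)
    ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    # With zero variance there is nothing to integrate over:
    return np.where(sigma > 0, ei, np.maximum(improve, 0.0))
```

Increasing eps shrinks EI everywhere, which is the knob the feedback asks to expose in the plots.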

  • I think we may want to ensure that the convention for Bayesian optimisation is the same across the methods. We could stick to the convention you used in the EI section just now. I had written slightly differently for PI. It may be best to just use Nando de Freitas' convention as used in the paper linked in #2
  • https://arxiv.org/pdf/1012.2599.pdf is a nice paper to refer. I think we can just rename our first function as: UCB instead of ACQ1. You can also view Nando's lecture at 1:06 mins to understand why UCB would work.
  • From paper mentioned in #2, we can update some text about GP-UCB also.
  • Don't write Plotting Posterior etc. as the title. Posterior should be enough.
  • For the PI visualisation wrt epsilon, I don't think the effect of epsilon is visible. We should be able to somehow see the larger variance points being chosen. Maybe plot the PI on a log-scale for the same. Also, make the epsilon = 0 as 0.0. This will ensure the plot doesn't jump.
  • For NNs, you might want to store all the results, so that if we only have to tweak the visualisations, we do not have to re-run the entire code.
  • The EI plots have PI as the ylabel currently.
  • For the random plot in the comparison section for 1d, run the random acquisition with multiple random initialisations and show the mean and the variance.
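The GP-UCB formula requested above is the posterior mean plus a time-scaled uncertainty term. A sketch, using the beta_t schedule from Srinivas et al. (2010) as one common choice; the argument names here are illustrative, not the article's notation:

```python
import numpy as np

def gp_ucb(mu, sigma, t, beta=None, delta=0.1, d=1):
    """alpha_UCB(x) = mu(x) + sqrt(beta_t) * sigma(x).

    Unlike a fixed-coefficient UCB, beta_t grows with the iteration t,
    so later iterations weight the uncertainty term more heavily.
    One common schedule (Srinivas et al., 2010) for a d-dimensional domain:
        beta_t = 2 * log(t**(d/2 + 2) * pi**2 / (3 * delta))
    """
    if beta is None:
        beta = 2.0 * np.log(t ** (d / 2 + 2) * np.pi ** 2 / (3 * delta))
    return np.asarray(mu, dtype=float) + np.sqrt(beta) * np.asarray(sigma, dtype=float)
```

This makes explicit how GP-UCB resembles the first acquisition function introduced in the post while differing in the iteration-dependent coefficient.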

Need inputs
Add Thompson sampling to SVM and others & NN for diff methods

Moons dataset looks weird

Nice article! The image of the moons dataset at the end of the article looks a bit strange since the points are heavily overlapping and not in a moon shape.

[Screenshot of the moons dataset figure, 2020-05-06]

Is it the correct image?

Some `<d-math>` tags overflow on mobile

Hi Apoorv and Nipun! Congrats on the publication! ^_^
Before putting it on production, I fixed some things I perceived as styling issues. (Take a look at commits between b3de979 and f25b837 — let me know if you disagree with any of the changes; none of this is dogmatic.)

One issue I only fixed quickly but that may benefit from your attention is that some of the formulas don't break into multiple lines, and so are too large on mobile. I worked around that in cc60b27 by making those scrollable, but that's not ideal as someone may miss that they can scroll a formula. If you have the time, I'd appreciate it if you went through your article on a smartphone screen and considered breaking wide formulas into multiple lines.

Again, congrats on your publication!
—Ludwig, for the Distill editorial team.

Revision -- Feedback 3

  • The learning rate description says "all real numbers". That doesn't make sense. Isn't this on a log scale also?
  • for hyperparameter for Ridge, mention the equation showing W, c, X, and y, and also add that a GD based approach would also have a learning rate/step size hyperparameter
  • Comparison of different acquisition functions on gold mining figure — legend should be EI-PI and GP-UCB. Also, for all plots showing random stdev in comparison, make the random alpha lower (more transparent)
  • Last plot: the y-label has "achieved" misspelled. Also remove the title from this plot.
  • To put the neural network computation into perspective, add some numbers: how much time each evaluation (each hyperparameter combination) took on your laptop, what final accuracy was achieved, and what the final hyperparameters were. Then mention how much time you could save, given the heavy computational requirements, compared to grid search.
  • modify the readme with the correct title and the author information, Also link the IPYNB that was used for the experiments.
  • for the initial submission, remove the smileys.
  • Run the entire text through Grammarly once.
  • See if we can add additional citations.
  • In further reading mention Parallel BO.
  • Link some talks from GP summer schools on Bayesian Optimization also.
  • Keep on reading the post slowly multiple times to see if there is any improvement we can make.
  • Citations don't appear correctly - e.g., citations #17, #8 and #3.

Updates to Hero Plot

  • could we choose a different starting point (first sampled point), which ensures that we can see different next 'x' chosen as we increase the \epsilon
  • instead of having a slider outside the plot for epsilon, could we have something like: https://youtu.be/_8V5o2UHG0E?t=46627
  • Let's render maths also and see
  • we don't need xlabel = x in the first subplot (since the X-axis is common)
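For reference, the quantity the hero plot visualises, the probability of improvement as a function of epsilon, has a one-line closed form. A sketch (mu and sigma are assumed to come from a Gaussian surrogate posterior; the names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, eps=0.0):
    """alpha_PI(x) = P(f(x) >= f(x+) + eps) = Phi((mu - f_best - eps) / sigma).

    Raising eps demands a larger improvement before a point counts,
    which shifts preference toward high-variance (exploratory) points.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = np.divide(mu - f_best - eps, sigma,
                  out=np.zeros_like(mu), where=sigma > 0)
    # A zero-variance point improves either surely or not at all:
    return np.where(sigma > 0, norm.cdf(z), (mu - f_best - eps > 0).astype(float))
```

This is why a different starting point can change which next x is chosen as epsilon increases: the argmax of this function moves.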

Issues

  • Next and previous are not very clearly visible in the hero slides. use https://www.jssor.com/demos/image-slider.slider

  • In the slider images, have f(x) on the y-axis and remove the gold content references. They make sense for the examples but not for the summary. Also no need to then mention Ground Truth for ... Also, the x-axis should be "x" and not "X"

  • In the next slide, mention that this could be trivial if f(x) were cheap to evaluate; but often it is expensive to evaluate, like the amount of gold at a particular location, or the accuracy for a set of hyper-parameters of a machine learning model.

  • In slide you mention f(x*) in fewest evaluations, instead write: Objective: find the maxima f(x*) in few evaluations as sampling is expensive.

  • Remove the word constraints in the next slide. And just have the 2 enumerated points (second one without the mention of ground truth)

  • In next slide and other slides, remove GT references from legend also

  • Use a surrogate function ... remove the comma after prior

  • Next slide, use let us instead of let's and write GP in full

  • Next slide - write functional observation (an observation from f(x))

  • The Big Question - where to sample next to quickly find the maxima

  • No comma after One,

  • The next chosen point to observe is the one that maximises the probability of improvement over the current maximum (write this for the Choose point that maxim..)

  • In the main article, when you introduce Gaussian Processes, write them as Gaussian Processes (GPs)

  • In active learning procedure, "automate" replace with "simulate"

  • Old text
    Given the fact that we are only interested in knowing the location where the maximum occurs. It might be a good idea to evaluate at locations where our surrogate model's prediction mean is the highest, i.e. to exploit. But unfortunately, our mean is not always accurate, so we need to correct our mean which can be done by reducing variance or exploration. BO looks at both exploitation and exploration, whereas in the case of Active Learning Problem, we only cared about exploration.

New text
Given the fact that we are only interested in knowing the location where the maximum occurs, it might be a good idea to evaluate at locations where our surrogate model's prediction mean is the highest, i.e. to exploit. But unfortunately, our model mean is not always accurate (since we have limited observations), so we need to correct our model, which can be done by reducing variance or exploration. BO looks at both exploitation and exploration, whereas in the case of active learning, we only cared about exploration.

  • Acquisition Functions
    Text should be:

We just discussed that our original optimisation problem (equation) is hard given the expensive nature of evaluating f. The key idea of BO is to transform this single difficult optimisation into a sequence of easier, inexpensive optimisations of an acquisition function (alpha(x)). Each of these easier optimisations involves finding the next point to sample. Thus, we can interpret the acquisition function as commensurate with how desirable evaluating f at x is expected to be for the maximisation problem [CITE: https://www.cse.wustl.edu/~garnett/cse515t/spring_2015/files/lecture_notes/12.pdf]

While we have just now discussed that our goal is to transform the original optimisation into a sequence of easier optimisations, where is the "Bayesian" in this optimisation, and how is the acquisition function related? Let us rewind, go back to our surrogate model, and build the link between everything we have discussed thus far by noting the steps of BO [CITE: https://www.youtube.com/watch?list=PLZ_xn3EIbxZHoq8A3-2F4_rLyy61vkEpU&v=EnXxO3BAgYk]:

  1. Choose a surrogate model and its prior over space of objectives f
  2. Given the set of observations (function sampling), use Bayes rule to obtain the posterior
  3. Use an acquisition function (alpha(x)), which is a function of the posterior to decide where to sample next (x_t = argmax()..)
  4. Add the newly sampled data to the set of observations and go to Step 2 until convergence or the budget elapses

We now have three core ideas associated with acquisition functions: i) they are a function of the surrogate posterior; ii) they combine exploration and exploitation; and iii) they are inexpensive to evaluate. Let us now look into a few examples of commonly used acquisition functions to understand the concept better.
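The four steps above can be sketched as a minimal loop. This is an illustrative sketch, not the article's implementation: the toy objective, the grid-based maximisation of the acquisition, and the choice of PI as the acquisition are all simplifying assumptions (scikit-learn's GP serves as the surrogate):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=3, n_iter=10, eps=0.01, seed=0):
    """Minimal BO loop (maximisation) following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: surrogate model (GP with a Matern kernel) and initial samples.
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    grid = np.linspace(bounds[0], bounds[1], 501).reshape(-1, 1)
    for _ in range(n_iter):
        gp.fit(X, y)                                   # Step 2: posterior via Bayes rule
        mu, sigma = gp.predict(grid, return_std=True)
        z = (mu - y.max() - eps) / (sigma + 1e-9)
        alpha = norm.cdf(z)                            # Step 3: acquisition (PI here)
        x_next = grid[np.argmax(alpha)]
        X = np.vstack([X, x_next])                     # Step 4: augment and repeat
        y = np.append(y, f(x_next[0]))
    return X[np.argmax(y), 0], y.max()

# Toy run: maximise sin(x) on [0, 2*pi]
x_best, y_best = bayes_opt(np.sin, (0.0, 2 * np.pi))
```

Note how the three core ideas appear directly in the loop: the acquisition is a function of the posterior (mu, sigma), it balances exploration and exploitation, and maximising it over a grid is cheap compared to evaluating f.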

  • Remove the following text
    Let us understand this concept in two cases:

We have two points of similar means (of function values (gold in our case)). We now want to choose one of these to obtain the labels or values. We will choose the one with higher variance. This basically says that given the same exploitability, we choose the one with higher exploration value.

We have two points having the same variance. We would now choose the point with the higher mean. This basically says that given the same explorability, we will choose the one with higher exploitation value.

  • Remove the text "hero plot" and instead write below plot

  • In Intuition behind E -> change spread out sigma to symbol sigma

  • Everywhere except the title - change Active Learning to active learning

  • SVM example remove the GIFs for random and GP-UCB. Also, mention the optimal <C, gamma> found via grid search and via EI and PI.

  • dimentions --> dimensions

TODO Master issue

  • Write about Hyperparameter v/s parameter: https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)
  • Write about Thompson sampling
  • Motivate grid search expensive computation cost
  • Formally introduce the properties/problem of Bayesian optimization (aka Frazier video) and relate each point to the gold mining problem. This should happen after we have done the gold mining problem and more generally decide to introduce the problem.
  • Explain role of epsilon in PI
  • epsilon in EI..
  • Write about the wide range of packages and services, including SigOpt, Hyperopt, Spearmint, scikit-optimize, etc.
  • Look into examples introduced in sigopt/hyperopt/spearmint/skoptimize to seek inspiration
  • BO v/s GD: https://stats.stackexchange.com/questions/161923/bayesian-optimization-or-gradient-descent This also mentions some things that can be problematic for GPs (like inference being cubic in the number of observations)
  • Use prettier fonts and experiment with five thirty eight style for figures
  • Refer and read this excellent article: https://thuijskens.github.io/2016/12/29/bayesian-optimisation/ and the citations in it.
  • Introduce the problems (kernel choice, other hyperparameters) in Bayesian optimization. -- will refer to slide 30 in Peter Frazier's Bayesian Optimization tutorial, link
  • Why is the above problem much easier to deal with?
  • Different Epsilon and the result on PI
  • BO when there are categorical variables also: scikit-optimize/scikit-optimize#580 and hyperopt has an implementation, so does sigopt..
  • Neural network hyper-parameter search for simple CNN for MNIST: i) learning rate (0.001, 0.01, ...) - this would require us to probably use a log-scale to make the search linear; ii) batch size: 1, 2, 4, .. use a log scale base 2; iii) number of hidden units: maybe again log scale
  • Continuing from above - we can also make use of gradient information when available. This should be only a discussion, but should be good for us to know (after submission!)
  • Explain intuition of GP-UCB
  • Like SVM, create an optimisation on Random Forests with the two params being # estimators and max depth
  • Graph showing time saved.

Revision

  • Remove BO slides, they don't add value. e0eee6a
  • Correct hyperlinks in Acknowledgements e88978c
  • Acknowledgement -> Acknowledgements e88978c
  • Content gets clipped in collapsible f28df30
  • Writing Improvements e11591f 723b2dc
    • Include suggestions from Chris. 4b7055a
    • Abstract + Mining Gold! 1024eb5 91e2e1a
    • Active Learning 5539934
    • Bayesian Optimization + Formalizing Bayesian Optimization 6e40a36
    • Acquisition Functions 4b6d52c
    • Probability of Improvement (PI) 3df731b
    • Expected Improvement (EI) 010cd00
    • PI vs. EI 87ac6d5
    • Thompson Sampling 5abe99b
    • Random + Summary of Acquisition Functions
    • Other Acq. Functions 0be836f
    • Comparison + Why is it beneficial to optimize the acquisition function?
    • Hyperparameter Tuning cb5b009
    • Example 1, 2, 3 629fa36
    • Conclusion and Summary + Embrace Bayesian Optimization
    • Acknowledgements
    • Further Reading
  • make grids fainter 4cb44be
  • In Thompson acq. image, 6c8fd46
    • Title -> "samples from surrogate posterior"
    • Use black for Training Point.
    • Translucent bounds
  • Interactive Plot 13f00c9
    • In Interactive plot, we do not need ground truth.
    • New Interactive plot. 77a9c0c 1e6e1f4
  • Summary BO Plot
  • Sliders don't span the whole image. 61eb725
  • In examples plots 3edb4f0 7ba7bf8
    • lighter alpha across every plot 7ba7bf8
    • Iteration number should be common for the two subplots. (common title subplots fig.text used)
    • replace pink dots with hollow circles with thick black borders.
    • Have Text label with variable name.
      • for alpha
      • for C
      • for gamma

Duplicate reference

References 4 and 8 are both for the same paper: "Taking the Human Out of the Loop: A Review of Bayesian Optimization". Suggest removing one.

Revision -- Feedback 2

  • Add citation for Active learning. Burr Settles has a good review. There is another highly cited survey paper. Cite both. The citation should be added when we mention Problem 1.
  • When you introduce Gaussian Processes, the citations seem to be broken.
  • Footnote on Matern kernel comes after the end of the line. It should come before “.”
  • When you talk about ground truth for the first time, denote it with a function “f(x)”
  • In the figure showing the prior - label the Predicted as Predicted (mu), GT as GT (f). Do the same for all the plots. This will ensure that everyone is clear about f, mu, sigma, GT and predicted.
  • f(x^+) refers to the maximum functional value, i.e., max(train_y) — you have already introduced some convention, no need to use train_y, just use the convention introduced earlier.
  • Correct the figure showing PI and CDF: make initial epsilon 0.0 instead of 0;
    • the font size are not uniform compared to earlier plots, these seem to be smaller in this plot,
    • try to put the PI on the log-scale so that we can hopefully show that increasing epsilon leads to a different point being chosen by PI.
  • Consistency for hyper-parameter v/s hyperparameter. Either is okay; just use the same one everywhere.
  • Reduce the usage of inline code blocks. For example, for "how likely" in EI, either make it strong or italic, but no need to put it in a code block. Correct others too.
  • Convention: EI has a different meaning for x* as compared to x. Maybe correct it in PI also and make it x_{t+1}
  • Mockus citation does not show properly from the main text.
  • In the equation after talking about Mockus, you seem to have omitted _{t+1}. Let’s ensure the correctness and consistency across the entire post.
  • mean(x) - f(x^+) : Use the introduced symbols for mean(x)? And similarly for the next one also write sigma
  • In the PI v.s. EI graph — write epsilon_{PI} = 0.01 and similarly for epsilon_{EI}.
  • In the GP-UCB formulation, mention what is small t.
  • In GP-UCB mention how it looks very similar to UCB introduced earlier, but differs in the fact that the second term has a time-dependent/iteration-dependent coefficient.
  • Remove the discussion about Bernoulli
  • For EI-PI mention all the hyperparams for both the PI and the EI part of it. Also, mention that such combinations can help overcome the limitations of individual methods.
  • In the comparison plot, x-axis should be number of drilled sites
  • In the comparison plot, mention that the random experiment was done with multiple random seeds and we show the mean and the variance. Mention that the variance is very high.
  • Mention that for all the other methods, we took the “best” hyper parameters for each approach.
  • One small example that we can think of can be of linear regression, we don't really have hyperparameters, but the parameters are the ... Replace this discussion with Ridge regression (write the objective, mention \theta being a param, mention \lambda > 0 and the learning rate being hyperparams).
  • Be consistent with let’s or let us. Just prefer the latter as it more formal.
  • For the sklearn moon dataset, add legend for Class A and Class B.
  • In SVM section, mention briefly what C and gamma do.
  • The font size in the SVM plot is not consistent with the remaining figures and is very small.
  • In SVM plot, the title should be accuracy and accuracies.
  • In SVM plot, the accuracy should be on a scale of 40 to 100 and not on 0.4 to 1. This is just for avoiding any confusion.
  • In SVM plot, for both the subplots, take the legend outside the main plot. Else, it is very hard to see what is happening.
  • In SVM plot, the global maxima is hard to see. Make it much bigger.
  • Try with different cmaps: https://matplotlib.org/users/colormaps.html. Viridis might be okay.
  • Comparison between methods should be renamed to comparison of different acquisition functions on SVM classification task.
    • font size should be made consistent with all plots
    • make the lines a little thicker,
    • GP_UCB in legend should be GP-UCB
    • the X-label should be # queries made
  • Random Forest: isn't our task classification? You seem to mention using an RF regressor. As before with the SVM one, take the legend out. Add a bit of explanation for depth v/s number of trees.
  • The plot looks a little odd. Maybe it is due to the legend intermingling.
  • On RFs, the comparison plot for random doesn't seem to make sense. We can never decrease the max accuracy reached so far.
  • For the example on CNN, mention what hyperparameter space we are looking at: like batch size from 2^3 to .., similarly for others, and then mention how we can easily do this in modern packages, showing how it translates to simple Python code in scikit-optimize. As before, mention how changing the hyperparams might change the prediction.
  • The CNN example seems to end very abruptly. Mention the key learning if any. The plot also needs to be interpreted.
  • In conclusions, mention that optimizing or tuning hyperparams is an important facet of modern machine learning algorithms and BO is an efficient way of tuning the same. I would suggest that the last line should be — we hope you had a good time reading the article and hope you are ready to exploit the power of BO. In case you wish to explore more, please read the Further reading section below.
  • Wherever possible, don't write "I have ..."; instead write "we"
  • I would suggest we write “Further reading” and break it down into various aspects - like a) making use of gradient information when available; b) very quickly mention or link the difference between BO and GD; c) mention the caution about kernel; d) BO applications other than the ones discussed in this paper. This should also include the sensor placement paper we saw a few days back.
  • some practical tips! Scaled Accuracy
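The Ridge regression replacement suggested in the list above can be made concrete. A sketch (an illustrative gradient-descent implementation, not from the article): theta is the parameter learned from data, while lam (regularisation strength) and lr (learning rate) are hyperparameters fixed before training and tuned from outside:

```python
import numpy as np

def ridge_gd(X, y, lam=1.0, lr=0.01, n_steps=1000):
    """Ridge regression: min_theta ||X theta - y||^2 + lam * ||theta||^2.

    theta      -- parameter, learned from data by the optimiser
    lam, lr    -- hyperparameters, fixed before training
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T @ (X @ theta - y) + 2 * lam * theta
        theta -= lr * grad
    return theta
```

Bayesian optimisation would operate over (lam, lr), treating each full run of this training loop as one expensive evaluation.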

Review #2

The following peer review was solicited as part of the Distill review process.

The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: The article provides an exposition of Bayesian optimization methods. It motivates the use of Bayesian Optimization (BO), and gives several examples of applying BO for different objectives.

Advancing the Dialogue Score
How significant are these contributions? 2/5

Comments

I think the main contribution of the current article is in the simulations, which illustrate BO in practice. However, I believe the article does not do a great job of explaining the setup and foundations of BO, and of unifying the various examples under a common framework. In this sense, I don't believe its exposition is a significant contribution.

For example, I think the following short note (which the authors cite) does an excellent job of briefly introducing the BO formalism, and presenting different instantiations of BO (for different objective functions) under the same underlying framework: https://www.cse.wustl.edu/~garnett/cse515t/spring_2015/files/lecture_notes/12.pdf

Outstanding Communication Score
Article Structure 3/5
Writing Style 3/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 2/5

Comments

  • The article is fairly long for the core content, and it's easy to get lost in the details of the various examples, and lose track of the main points.
  • It does not use jargon, but the writing is a bit verbose and could be condensed.
  • I would omit the interactive figure -- there are too many moving parts, and it confuses more than clarifies. The non-interactive simulations are good, although there are perhaps too many of them -- just a few simulations would convey the point (that different objective functions result in different optimization procedures).
  • The diagram format is standard (simulations of a GP), although there is value in doing and showing these simulations in the context of Bayesian optimization.
  • Assuming knowledge of Gaussian Processes, this topic (BO with GP prior) is not very difficult. In particular, it can be described simply as:
    -- Assume a Gaussian Process prior on the ground-truth function F.
    -- Formalize your objective (eg. sampling a point 'x' with maximum expected value of F(x), or maximizing the probability that F(x) > F(x_j) for all previously-sampled points x_j)
    -- Use the existing samples {(x, F(x))} to compute the posterior of F given the samples (under the GP prior), and maximize your objective function under the posterior. This yields a choice of new point to sample.
    -- (Different "acquisition functions" simply correspond to different objectives in step (2)).
  • The current article is fairly long for conveying the above point, and it includes many details which can be distracting (eg, equations for the exact form of the maximization in (3), which does not add much conceptually).
  • Concretely, I suggest cutting a lot of the discussion about details of various acquisition functions, and just presenting a few examples to convey the point that different objectives (Step 2) yield different optimization procedures.
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 5/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 5/5

Comments

Minor points:

  • For comparing vs. a random strategy, I would just compare against a truly random strategy, instead of using a "random acquisition function" within the BO framework as a proxy for this.
  • In the "comparison" plots, the random strategy appears to do about as well as the Bayesian Optimization -- which means this is not a setting that convinces me that BO is powerful.
  • In the "comparison" plot, different acquisition functions correspond to different objectives. However, we are evaluating them all under the same objective, which is somewhat unfair. In particular, if the objective is well-specified and the ground-truth is actually drawn from a GP prior, then BO should exactly maximize the expected objective value (ie, it should be the optimal thing to do, if the assumptions hold).

Major points:

With respect to the scientific content, my main issue is that there is no clear distinction made between:

  • (A) Bayesian optimization as a formal framework, with provable optimality guarantees.
  • (B) Bayesian optimization as it's used in practice (e.g. even if the true ground truth is not drawn from a Gaussian process, we can still apply BO methods and hope to get something reasonable, though not provably so).

These two viewpoints are conflated throughout the article. For example, in the section "Formalizing Bayesian Optimization", the points described are actually heuristics about setting (B), not formalisms in the sense of (A).

This confusion also makes it difficult to see how different acquisition functions relate to each other, and what our actual objective is in choosing between different acquisition functions.

Review #1

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Austin Huang for taking the time to review this article.


General Comments

Missing Tools for Reasoning

Acquisition functions are introduced from a definitional standpoint and their behavior is illustrated for a relatively artificial example. Sometimes the methods are shown to work, sometimes they don't. How does one think about implementation alternatives when working on a new problem? The article provides few conceptual tools for the reader to apply these methods successfully.

There are also serious issues with model misspecification underneath the surface of these implementations (see, for example, the Thompson Sampling discussion). However, the article doesn't even raise the topic; the discussion starts from a fixed model specification and anecdotally shows methods either working or not under a narrow example.

Relatedly, there's a section entitled "Why is it easier to optimize the acquisition function?" This framing may be misleading since "easiness" isn't the goal. The real question seems to be "Why is it beneficial to optimize the acquisition function?" or perhaps "Is it even beneficial to optimize with respect to an acquisition function?"

Does the Hero Plot Illustrate a Central Aspect of the Discussion?

An interactive visualization communicates a response function to the variables that can be affected by input. In the hero plot, this corresponds to the response of the acquisition function as a function of the epsilon hyperparameter in a PI acquisition function for fixed data and ground truth. It also shows the CDF for two slices of X (1.0 and 5.0), which are intermediate computations used by the acquisition function.

Is that particular relationship sufficiently central to the article to be front and center? There are other relationships that seem more central to the topic that could have been highlighted (how choices of acquisition function compare, how the acquisition function changes with data). The plot is nice to interact with for thinking about exploration/exploitation in PI, but it doesn't seem to be an obvious choice as the hero plot.

Minor visual issue - the vertical labels look buggy, with 0.00e+0 cutting through the axis line.

Grey backgrounds don't fit Distill's Template

The patch of grey rectangle background for each figure doesn't fit the aesthetic of the distill template. The convention in other articles seems to be white-on-white with no boundary or occasionally a horizontal ribbon that runs the width of the page for visualizations with lots of margin content.

Animations are Overused

Note that in other Distill articles, animations are used sparingly, usually only in the opening or concluding figure.

Looping animations are overused here, and ultimately they are not a good way to illustrate a dependency relationship compared to a visual with a control.

Even if the content of those figures is kept as is, replacing the loop with a slider (http://worrydream.com/LadderOfAbstraction/) would be an improvement: it is less distracting and lets the reader examine relationships between iterations more carefully.

Introduction to EI is Confusing

Perhaps the framing in terms of the unknown ground truth was the original motivation, but here it just makes the reasoning convoluted without adding much insight. I don't see any reason not to jump straight to the definition the name describes, expected improvement (i.e. the second equation).

Thompson Sampling

""It has a low overhead of setting up."" - not sure why this is specifically pointed out in the case of TS, is overhead any lower to set up than the other acquisition functions?

The statement that "This will ensure an exploratory behaviour." is contradicted by the animation that follows. From that demo's figures, it actually seems nearly impossible to reach the global minimum without refining the underlying GP model: there isn't enough variability in the sampled functions to explore adequately. Yet the example is simply left without further comment.
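To make the mechanism under discussion concrete, one Thompson Sampling iteration draws a single function from the GP posterior and queries its maximizer; the amount of exploration is therefore bounded by the spread of that posterior. A minimal numpy sketch, using an assumed RBF kernel rather than the article's Matern choice and with illustrative names:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between row vectors of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale**2)

def thompson_sampling_step(X_obs, y_obs, X_grid, rng, noise=1e-6):
    """One TS iteration: sample one function from the GP posterior, query its argmax."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf_kernel(X_obs, X_grid)
    Kss = rbf_kernel(X_grid, X_grid)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y_obs                      # posterior mean on the grid
    cov = Kss - Ks.T @ Kinv @ Ks                  # posterior covariance on the grid
    sample = rng.multivariate_normal(mu, cov + noise * np.eye(len(X_grid)))
    return X_grid[np.argmax(sample)]              # next point to evaluate
```

If the posterior covariance is tight (as in the demo once a few points are observed), every sampled function peaks in the same region, and the behavior the review describes, a failure to escape a local optimum, follows directly.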

Hyperparameter Tuning - Axis Labels

The horizontal axis label "# of Hyper-Parameters Tested" is confusing, since it doesn't refer to the number of hyper-parameters tested but rather to the number of values that have been evaluated.

Hyperparameter Tuning - Changing colormap scale makes it impossible to track the function evolution

The colormaps should probably not rescale with each iteration; the rescaling makes it very difficult to track the evolution of the acquisition function between frames.

As mentioned above, replacing all or most animations with a slider control would also improve the legibility of the figure.

Legend tweaks

  • The legend positioning for the top "hero" plot looks buggy. "GT", "GP" and "\epsilon" are glued to the point without any spacing; the alignment looks very off.
  • Not sure why "GT" is abbreviated when longer captions like "Acquisition function" are not.
  • "Train points" -> "Training points"
  • Given that the legends are already really busy, "(Tie randomly broken)" would be better as a linked footnote.

"# Minor Writing Improvements

  • ""Older problem - Earlier in the active learning problem ... "" can remove the preface and start with ""In the active learning problem ...""
  • ""We can write a general form of an acquisition function ..."" this sentence could be more weight and made more explicit about stating that mu(x) models exploitation and sigma(x) represents the value of exploration. It's implied by the phrasing, but could be clearer.
  • Don't nest parenthesis in parenthesis ""(of function values (gold in our case))""
  • ""We can obtain a closed form solution as below"" - expression in terms of CDF is not usually considered ""closed form"" https://en.wikipedia.org/wiki/Closed-form_expression would just avoid using the phrase
  • "" h_{t+1} is our GP posterior of the ground truth"" - guessing intends to refer to the ""posterior mean"" since it needs to be a function
  • ""first vanilla acquisition function"" - reference UCB directly instead of referring to it as ""first vanilla acquisition function""
  • (try to find the global maxima that might be near this “best” location)"" - this parenthetical remark is confusing and doesn't add to the statement.
  • ""easily"" is used a lot throughout the article and in almost all cases the sentence improves by the omission of this unnecessary subjective qualifier. ""equation can be easily converted..."", ""One can easily change ..."", ""We can easily apply the BO for more dimensions"", ""... can easily be incorporated into BO."" (2 times in the same sentence in the last example)

Concluding Comments

Bayesian optimization and active learning aren't particularly popular topics to write about at the moment, yet I suspect there's quite a bit of interest in them, particularly in industry and applied machine learning contexts.

Given that, this article does fill a notable gap in the research distillation space. However, I think more work needs to be put into this manuscript to raise the quality of communication to the level of other Distill articles.


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 2/5
Writing Style 3/5
Diagram & Interface Style 3/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 3/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 3/5
How easy would it be to replicate (or falsify) the results? 4/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 3/5

Review #3

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Jasper Snoek for taking the time to review this article.



Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: This article gives an intuitive explanation of the basics of Bayesian optimization and it is an enjoyable and interesting read. It does a nice job of visually demonstrating the impact and the behavior of a variety of choices made within Bayesian optimization. It does a great job visualizing the various acquisition functions in Bayesian optimization with the help of nice interactive plots and animations.

Advancing the Dialogue Score
How significant are these contributions? 4/5

Comments

I think this is a really clean and concise introduction to Bayesian optimization and some of the nuances of the underlying strategy followed. Bayesian optimization is certainly dear to me and I appreciate having someone take the time to produce nice visualizations so I feel inclined to accept. There is a lot that could be added to this post, but I suppose in the spirit of the journal (i.e. short and crisp) this might be just right? In any case there are some underlying modeling issues that I would like corrected before this is accepted.

Outstanding Communication Score
Article Structure 4/5
Writing Style 5/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 4/5

Comments

  • I think the article is structured well and flows nicely.
  • Writing style is very easy to follow and accessible. A couple of minor typos.
  • The diagrams are intuitive and I love having interactive plots of acquisition functions in Bayesian optimization.
  • The diagrams are really helpful, elegant and really drive the story. However, they aren’t terribly novel since this is exactly how we’ve been visualizing Bayesian optimization in papers and texts for years. The authors do turn them into animations, however, which is a neat upgrade from the static plots previously used.
  • Very readable. The authors side-step some of the more technical aspects of Bayesian modeling (e.g. how does a Gaussian process work and how do you fit it). However, for the purposes of this article that might be better.
Scientific Correctness & Integrity Score
Are claims in the article well supported? 4/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 4/5

Comments

  • I think there’s a lot of specifics, challenges and follow up work that isn’t referenced here. However, the authors do a good job of citing relevant work.
  • This is just an introductory exposition that would be really easy to replicate empirically.
  • Yes. There is a lot of relevant work missing, e.g. citations for Thompson sampling (http://proceedings.mlr.press/v84/kandasamy18a/kandasamy18a.pdf) but most of the relevant citations are there.
  • There is what I see as a critical modeling mistake in the diagrams and resulting analysis. The authors don’t adjust for the mean of the data, which results in weird and pathological behavior. I would like the authors to correct this and then resubmit (and then I think it should be ok). Detailed comments are below.

Detailed comments

Intro:
Note the word “hyperparameter” can actually be contentious. The traditional definition of a hyperparameter is as a higher level model parameter influencing the parameters of the model. Under that definition things like learning rate, optimization parameters, etc. don’t really apply. So I always say Bayesian optimization is used to tune hyperparameters, optimization parameters and other model parameters.

Mining Gold:
There’s a neat historical precedent to this that’s worth mentioning. The first use of Gaussian processes was actually to model ore density in South Africa. Applying Gaussian processes was initially called “Kriging” after Danie Krige (https://en.wikipedia.org/wiki/Danie_G._Krige) who used GPs to model the spatial density of ore deposits (and figure out where to drill).

“Active Learning”:
“We cannot estimate the gold estimate” sounds awkward. Maybe rephrase.

“Gaussian processes”:
“Gaussian processes regression” -> Gaussian process regression
“(Smoothess) Such” -> “(Smoothness). Such”

“Prior model”:
The Matern kernel is a specific choice of prior that is worth spending some time rationalizing. It’s probably a good idea to introduce the concept of a kernel and how the choice of kernel corresponds to a prior over functions. Then describe how the Matern lets you determine the smoothness of the prior (i.e. a Matern 5/2 means twice differentiable). How does that correspond to your assumptions about gold smoothness?
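As a reference point for that suggested discussion, the Matern 5/2 covariance (written here as a function of distance r, with an illustrative name; not code from the article) is:

```python
import numpy as np

def matern52(r, length_scale=1.0):
    """Matern 5/2 covariance as a function of distance r.

    Sample paths of a GP with this kernel are twice differentiable,
    which is the smoothness assumption being placed on the gold distribution.
    """
    s = np.sqrt(5.0) * np.abs(r) / length_scale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

Spelling the kernel out like this makes it easier to connect the nu = 5/2 choice to a concrete prior belief: correlations decay smoothly with distance, at a rate set by the length scale.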

“UCB”:
“This is because while the variance or uncertainty is high for such points, the posterior mean is low.”
The posterior mean is low because the prior mean is set to 0. You could subtract the mean from the observed data to make it 0 mean (then in your case the mean would go to ~5 instead of 0 and the acquisition function would be higher further from the data). It would be better to make the mean a hyperparameter of the GP and optimize it or integrate it out.
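The simpler of the two suggested fixes, subtracting the empirical mean before fitting and adding it back at prediction time, can be sketched as a thin wrapper. This assumes an sklearn-style estimator with fit/predict and is an illustration of the suggestion, not code from the article:

```python
import numpy as np

class MeanCenteredGP:
    """Wrap a zero-mean GP regressor so its posterior reverts to the data mean.

    Without this, a stationary zero-mean GP returns to 0 away from the data,
    which distorts acquisition functions when the observations average ~5.
    """
    def __init__(self, gp):
        self.gp = gp
        self.y_mean = 0.0

    def fit(self, X, y):
        self.y_mean = float(np.mean(y))           # subtract the empirical mean...
        self.gp.fit(X, np.asarray(y) - self.y_mean)  # ...fit the zero-mean GP to residuals
        return self

    def predict(self, X, return_std=False):
        if return_std:
            mu, std = self.gp.predict(X, return_std=True)
            return mu + self.y_mean, std          # the shift leaves the std unchanged
        return self.gp.predict(X) + self.y_mean
```

With this wrapper, the posterior mean far from the data reverts to the observed average rather than to 0, so uncertain regions are no longer penalized by an artificially low mean.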

“Probability of Improvement”:
“given same exploitability” -> “given the same...”

The “hero plot” is neat. Though I’m not sure I understand why it’s called a hero plot.

“(can be identified by the grey translucent area” is missing closing parens.

“Expected Improvement”:
These plots are neat and convey the intuition really nicely. However, the EI values are tiny and the behavior does not follow what I have seen for EI. I suspect the optimization routine is under exploring because of the zero-mean issue I brought up above. Specifically, for a stationary kernel, the GP posterior will return to the mean when moving away from the data. In this case, it’s returning to 0, which is silly since 0 is not the mean of the observations you have seen. I suspect the routine will be much better behaved if you subtract out the mean of the data, fit the model, and then add the mean back in.

“Gaussian Process Upper Confidence Bound (GP-UCB)”:
I like the discussion of regret. However, I don’t think Srinivas et al. introduced GP-UCB and other acquisition functions also minimize regret. Instead they derived some elegant bounds on regret under the GP-UCB acquisition function. I think I would rephrase this to say something like: “Srinivas et al. developed a schedule for \beta that they theoretically demonstrated minimizes cumulative regret. The schedule is T_t … “
