\gdef{\logpinv}{\log p^{-1}} \gdef{\klqp}{\mathrm{KL}[q,p]} \gdef{\paccz}{p(\mathrm{accept \ }z)} \gdef{\pacczn}{p(\mathrm{accept }z^n)} \gdef{\logfracn}{\log \frac{q^n(z^n)}{p^n(z^n)}} \gdef{\qtil}{\tilde{q}} \gdef{\minv}{\frac{1}{M}} \gdef{\Pr}{\operatorname{Pr}}

I’ll describe a fun little problem in information theory, and a solution–a compression algorithm–based on rejection sampling. This problem is motivated by the coding interpretation of the variational bound, which seems to be a valuable source of intuition. The compression algorithm I describe is distinct from the well-known idea of bits-back coding and gives a more direct interpretation of the variational bound objective. It’s terribly computationally inefficient, but I think it’s interesting as a proof of principle.

**The Problem**. Alice and Bob initially agree on a “prior” distribution p(z), and they have a shared random number generator (RNG). Later, Alice is given a different distribution q(z). *How long of a message does Alice need to send to Bob so that by combining the message with the RNG, he can produce a sample z \sim q(z)?*

More precisely, Alice and Bob agree on a deterministic function f(\omega, m) of the RNG state \omega and the message m. Alice computes the message m as a function of q, and the distribution of f(\omega, m) must equal q(z).

Let’s analyze the problem for a couple of simple cases:

- If q(z)=p(z), then the message length is zero: Alice can just tell Bob to take his first sample from p.
- If q(z)=I[z=z_0], i.e. an indicator on one value z_0, then the best Alice can do is to send z_0 with message length \logpinv(z_0).

Note that this problem is subtly different from the problem where Alice samples z \sim q (using a non-shared RNG) and then must send it to Bob. Sending arbitrary z requires expected code-length E_{q}[\logpinv(z)]. If Alice samples z\sim q and sends it to Bob using the code from p(z), it would require more bits than necessary. In particular, it would take S[p] bits (the entropy of p) in the case that q(z)=p(z) rather than zero.

You might guess that the general answer is \klqp, which gives the correct answer in examples (1) and (2) above. That guess is correct! I’ll prove it below (after adding some pesky details). But first, I’ll explain the motivation for this communication problem.

Many concepts of probability have a corresponding interpretation in terms of codes and compression. A key idea, needed for models with latent variables (like the variational autoencoder (VAE)), is the variational upper bound (VUB). It’s usually called the variational *lower* bound, but we’ll flip the sign so it’ll correspond to code-length.

The VUB is the objective used for fitting probabilistic models with latent variables. Given a model p(x,z)=p(z)p(x|z), we typically want to maximize \log p(x), but there’s an intractable sum over z. The VUB introduces a sample distribution q(z), giving an upper bound on the log-loss.

\logpinv(x) \le \underbrace{\klqp}_{(*)} + \underbrace{E_{z \sim q}[\logpinv(x|z)]}_{(**)}

Equality occurs at q(z)=p(z|x), and training (e.g., for the VAE) involves jointly minimizing the RHS with respect to p and q.
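To make the bound concrete, here is a minimal numeric sketch. The two-state model below (`p_z`, `p_x_given_z`) and the helper names are made up for illustration and aren't part of the argument; code-lengths are in bits.

```python
import math

# Hypothetical toy model: z in {0, 1}, x in {0, 1}.
p_z = [0.7, 0.3]                  # prior p(z)
p_x_given_z = [[0.9, 0.1],        # p(x | z=0)
               [0.2, 0.8]]        # p(x | z=1)

def neg_log_px(x):
    """Exact log-loss log 1/p(x), summing over z."""
    return -math.log2(sum(p_z[z] * p_x_given_z[z][x] for z in range(2)))

def vub(q, x):
    """KL[q, p] + E_{z~q}[log 1/p(x|z)]."""
    kl = sum(q[z] * math.log2(q[z] / p_z[z]) for z in range(2) if q[z] > 0)
    recon = sum(q[z] * -math.log2(p_x_given_z[z][x]) for z in range(2))
    return kl + recon

def posterior(x):
    """p(z | x) by Bayes' rule."""
    joint = [p_z[z] * p_x_given_z[z][x] for z in range(2)]
    total = sum(joint)
    return [j / total for j in joint]

x = 1
exact = neg_log_px(x)
loose = vub([0.5, 0.5], x)        # an arbitrary q gives a valid but loose bound
tight = vub(posterior(x), x)      # q = p(z|x) makes the bound tight
print(exact, loose, tight)
```

For any other choice of q, the second number should sit above the first, and the third should match the first up to rounding.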

The LHS of the inequality reads “the number of bits Alice needs to send to Bob to transmit x, given that they previously agreed on distribution p(x)”. Can the RHS be interpreted as a concrete compression scheme for x, involving a code z that partially encodes x?

We’d like to say something like this:

- (*) is the number of bits Alice must send Bob, so Bob gets a sample z \sim q.
- (**) is the expected cost for Bob to fully reconstruct x, given the z he received.

The second point, interpreting E_{z \sim q}[\logpinv(x|z)] as the code-length of x given z, is clearly true. The first point is non-obvious, and it’s precisely the problem stated above.

The variational upper bound is indeed the code-length under a well-known compression scheme, called *bits-back coding*. However, bits-back coding doesn’t quite match the simple two-part-code interpretation given above. In bits-back coding, Alice samples z using some auxiliary data as the entropy source, then sends the whole sample to Bob at an expected cost of E_q[\logpinv(z)]. Then, she sends x at a cost of \logpinv(x | z). Finally, using x to infer the distribution q, Bob recovers the E_q[\log q^{-1}(z)] bits of auxiliary data, giving a net code-length of E_q[\logpinv(z)]+E_q[\logpinv(x | z)]-E_q[\log q^{-1}(z)]=\klqp + E_q[\logpinv(x | z)].

Bits-back is a pretty slick idea, but I’ve always wondered if the interpretation as a two-part code can be directly implemented. In particular, can the code-word z be communicated using cost \klqp, without using x at all?

Let’s return to the problem, where Alice needs to send Bob a message that lets him sample z \sim q. One natural idea is to use rejection sampling. Rejection sampling allows you to sample from q(z) by (stochastically) filtering samples from a different distribution p(z). Alice uses her RNG to generate a sequence of IID samples from p(z), but applies the rejection criterion so that the first accepted sample is a sample from q(z). Then she sends to Bob the index n of the first accepted sample. Bob runs the same process on his end and takes the nth sample, which is the same as Alice’s nth sample due to the shared RNG.
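Here's a small sketch of this protocol, under some assumptions of mine: a three-symbol alphabet, a seed standing in for the shared RNG, and hypothetical function names `alice_encode` and `bob_decode`. Alice's accept/reject coin flips come from a private stream, so Bob only has to replay the proposals.

```python
import random

# Hypothetical three-symbol example; p and q are illustrative only.
p = {"a": 0.5, "b": 0.3, "c": 0.2}    # prior that Alice and Bob agreed on
q = {"a": 0.2, "b": 0.3, "c": 0.5}    # target distribution, known only to Alice
M = max(q[z] / p[z] for z in p)       # rejection-sampling constant

def draw_from_p(rng):
    """One proposal z ~ p from the given random stream."""
    return rng.choices(list(p), weights=list(p.values()))[0]

def alice_encode(shared_seed, private_seed=None):
    """Return (n, z): index of the first accepted proposal, and the proposal."""
    shared = random.Random(shared_seed)     # proposals come from the shared RNG
    private = random.Random(private_seed)   # accept/reject coins stay with Alice
    n = 0
    while True:
        n += 1
        z = draw_from_p(shared)
        if private.random() < q[z] / (M * p[z]):
            return n, z

def bob_decode(shared_seed, n):
    """Replay the shared stream and keep the n-th proposal."""
    shared = random.Random(shared_seed)
    z = None
    for _ in range(n):
        z = draw_from_p(shared)
    return z

n, z = alice_encode(shared_seed=0, private_seed=1)
print(n, z, bob_decode(0, n))
```

The only thing that crosses the channel is the integer `n`; Bob's output matches Alice's accepted sample because both replay the same proposal stream.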

Now let’s look at the expected code-length of this protocol. In rejection sampling, we sample z \sim p(z) and accept with probability \paccz = \minv\frac{q(z)}{p(z)}, where M=\max_z \frac{q(z)}{p(z)}.

The probability of accepting a sample is E_z[\paccz]=\minv. Given that an event has probability \epsilon, the expected number of samples until it occurs is 1/\epsilon. Hence, the expected number of trials of the rejection sampling process is just M. The code-length of this integer is \log M=\log \max_z \frac{q(z)}{p(z)} = \max_z \log \frac{q(z)}{p(z)}.

Hence, this rejection sampling procedure attains a code-length \max_z \log \frac{q(z)}{p(z)}. But we’ve claimed above that the optimal code-length is \klqp=E_{z\sim q}[\log \frac{q(z)}{p(z)}]. So rejection sampling is suboptimal in general, replacing the expectation E_{z \sim q} by a maximum \max_z.
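To see the gap between the max and the expectation, here's a quick sketch comparing the two quantities on a made-up three-symbol prior. The helper names `kl_bits` and `log_m_bits` are mine, and everything is in bits.

```python
import math

def kl_bits(q, p):
    """KL[q, p] in bits; terms with q(z) = 0 contribute nothing."""
    return sum(q[z] * math.log2(q[z] / p[z]) for z in q if q[z] > 0)

def log_m_bits(q, p):
    """log M = max_z log q(z)/p(z), taken over z with q(z) > 0."""
    return max(math.log2(q[z] / p[z]) for z in q if q[z] > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}     # illustrative prior
examples = {
    "q = p": dict(p),
    "point mass on c": {"a": 0.0, "b": 0.0, "c": 1.0},
    "generic q": {"a": 0.2, "b": 0.3, "c": 0.5},
}
for name, q in examples.items():
    print(f"{name}: KL = {kl_bits(q, p):.3f}, log M = {log_m_bits(q, p):.3f}")
```

The two quantities agree in the two simple cases from earlier (q = p and an indicator on one value), and log M is strictly larger for the generic q.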

The rejection sampling approach almost works, but it gives suboptimal code-length. Like many ideas in coding theory, we can fix the problem by grouping together a bunch of messages and using the law of large numbers. We’ll group together n samples and do rejection sampling with \log M \approx n\klqp.

Let’s modify the communication problem to have Alice send Bob n samples at a time. In the modified problem, Alice and Bob agree on (p_1, p_2, \dots, p_n); Alice needs to send Bob a sample (z_1, z_2, \dots, z_n) from (q_1, q_2, \dots, q_n). To simplify the argument, let all of the distributions be equal: p_1 = p_2 = \dots = p_n and q_1 = q_2 = \dots = q_n. The argument can easily be modified for the case where these distributions are not equal.

As for notation, let z^n denote an n-tuple of samples, and let p^n(z^n) and q^n(z^n) denote the joint distributions over n-tuples of samples.

We’ll define the communication protocol as follows. Alice repeatedly samples z^n \sim p^n, accepting with probability \min\left(1, \minv\frac{q^n(z^n)}{p^n(z^n)}\right), where \log M=n(\klqp+\epsilon), and \epsilon is a small number that \rightarrow 0 as n \rightarrow \infty. Then she sends Bob an integer k–the number of trials until acceptance, and he takes the kth sample from p^n.

To show that this protocol works, we will prove the following two statements:

- The expected message length per sample z_i approaches \klqp as n \rightarrow \infty.
- The total variation divergence between each decoded z_i and q(z) approaches zero as n \rightarrow \infty.

The proof will be based on the idea of typical sets introduced by Shannon. We’ll also explain how to slightly modify the protocol to send exactly q instead of an approximation (at the cost of some extra bits).

Consider the log ratio

\log \frac{q^n(z^n)}{p^n(z^n)} = \sum_{i=1}^n \log \frac{q(z_i)}{p(z_i)}

For z sampled from q, the expectation of each of these terms is E_q[\log \frac{q(z)}{p(z)}]=\klqp. Informally speaking, this sum will probably be around n\klqp \pm O(\sqrt{n}). Let’s state this more formally.

Let \epsilon>0 be a small number. As n \rightarrow \infty, the sample average of \log \frac{q(z)}{p(z)} approaches its mean value, \klqp, so for sufficiently large n we get \Pr \left(\frac{1}{n}\logfracn \le \klqp+\epsilon\right) \ge 1-\epsilon for z^n \sim q^n. Let S denote the set of z^n satisfying \frac{1}{n}\logfracn \le \klqp + \epsilon. S satisfies \Pr(z^n\sim q^n \in S) \ge 1-\epsilon.

Alice does rejection sampling by sampling z^n \sim p^n and then accepting with probability \Pr(\text{accept } z^n) = \min\left(1, \minv\frac{q^n(z^n)}{p^n(z^n)}\right), where \log M=n(\klqp+\epsilon). For z^n \in S, the ratio q^n(z^n)/p^n(z^n) is at most M, so the min has no effect; outside S, the acceptance probability is clipped to 1. Hence

\Pr(\text{sample } z^n \sim p^n \text{ and accept}) = \begin{cases} \frac{q^n(z^n)}{M} & z^n \in S \\ p^n(z^n) < \frac{q^n(z^n)}{M} & z^n \notin S \end{cases}

Now let’s compute the probability of acceptance:

\begin{aligned} \Pr(\text{accept})&=\sum_{z^n}\Pr(\text{sample } z^n \sim p^n \text{ and accept})\\ &=\sum_{z^n \in S} \frac{q^n(z^n)}{M} + \sum_{z^n \notin S} \text{[positive value]}\\ &\ge \sum_{z^n \in S} \frac{q^n(z^n)}{M}\\ &\ge (1 - \epsilon) / M \end{aligned}

Since every term is also at most q^n(z^n)/M, we have \Pr(\text{accept}) \le 1/M. The message length is therefore \log \frac{1}{\Pr(\text{accept})}=\log M + O(\epsilon) = n(\klqp + \epsilon) + O(\epsilon), i.e. \klqp + O(\epsilon) per sample, proving the first part of the proposition.
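As a sanity check on this calculation, here's a sketch that computes the acceptance probability exactly for a binary alphabet, where the log ratio depends only on the number of ones in z^n. The Bernoulli parameters, the value of \epsilon, and the function names are my own choices for illustration; logs are in nats.

```python
import math

def log_binom_pmf(n, k, theta):
    """log Pr(k ones among n Bernoulli(theta) draws)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1 - theta))

def block_stats(a, b, n, eps):
    """p = Bernoulli(a), q = Bernoulli(b).

    Returns (KL[q, p], message length per sample, Pr_{z^n ~ q^n}(z^n in S)).
    """
    kl = b * math.log(b / a) + (1 - b) * math.log((1 - b) / (1 - a))
    log_m = n * (kl + eps)
    p_accept = 0.0
    q_mass_in_s = 0.0
    for k in range(n + 1):
        # Every z^n with k ones shares this value of log q^n(z^n)/p^n(z^n).
        log_ratio = k * math.log(b / a) + (n - k) * math.log((1 - b) / (1 - a))
        # Pr(k ones under p^n) * min(1, ratio / M), accumulated over k.
        p_accept += math.exp(log_binom_pmf(n, k, a) + min(0.0, log_ratio - log_m))
        if log_ratio <= log_m:
            q_mass_in_s += math.exp(log_binom_pmf(n, k, b))
    return kl, -math.log(p_accept) / n, q_mass_in_s

for n in (10, 100, 1000):
    kl, length, mass = block_stats(a=0.5, b=0.7, n=n, eps=0.05)
    print(n, round(kl, 4), round(length, 4), round(mass, 4))
```

With \epsilon held fixed, the per-sample length should stay pinned just above \klqp + \epsilon while the mass of S under q^n climbs towards 1 as n grows.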

For the second part of the proposition, let’s define \qtil^n to be the decoded distribution over z^n when following the rejection sampling protocol.

\qtil^n(z^n) \propto p^n(z^n)\Pr(\text{accept } z^n)

Define q_S to be the distribution of samples from q^n, conditioned on membership in S. For z^n \in S, q^n and \qtil^n are proportional, as follows:

q^n(z^n) = q_S(z^n) P_{S|q} \quad\text{where}\quad P_{S|q}=\Pr(z^n \sim q^n \in S)

\qtil^n(z^n) = q_S(z^n) P_{S|\qtil} \quad\text{where}\quad P_{S|\qtil} =\Pr(z^n \sim \qtil^n \in S)

Furthermore, 1 \ge P_{S|\qtil} \ge P_{S|q} \ge 1- \epsilon. A routine calculation shows that the total variation divergence is O(\epsilon). This proves the second part of the proposition.
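The total variation claim can also be checked exactly in a small case. The sketch below assumes a binary alphabet with p = Bernoulli(a) and q = Bernoulli(b), which is my choice of example; both q^n and \qtil^n are constant across tuples with the same number of ones, so the distance over tuples equals the distance over counts. The function name `decoded_tv` is hypothetical.

```python
import math

def log_binom_pmf(n, k, theta):
    """log Pr(k ones among n Bernoulli(theta) draws)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1 - theta))

def decoded_tv(a, b, n, eps):
    """Total variation distance between the decoded law and q^n."""
    log_qp1 = math.log(b / a)                  # log q(1)/p(1)
    log_qp0 = math.log((1 - b) / (1 - a))      # log q(0)/p(0)
    kl = b * log_qp1 + (1 - b) * log_qp0
    log_m = n * (kl + eps)
    # Unnormalized decoded mass on "k ones": Pr_p(k ones) * Pr(accept | k ones).
    weights = []
    for k in range(n + 1):
        log_ratio = k * log_qp1 + (n - k) * log_qp0
        weights.append(math.exp(log_binom_pmf(n, k, a) + min(0.0, log_ratio - log_m)))
    total = sum(weights)
    return 0.5 * sum(abs(w / total - math.exp(log_binom_pmf(n, k, b)))
                     for k, w in enumerate(weights))

for n in (10, 100, 1000):
    print(n, decoded_tv(a=0.5, b=0.7, n=n, eps=0.05))
```

The distance is bounded by the q^n-mass outside S, so it shrinks as that mass does.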

Finally, it’s a bit unsatisfying that Alice doesn’t send exactly q^n; instead, she sends an approximation \qtil^n. We can easily fix this issue and have Alice send exactly q^n at the cost of some extra bits. Here’s a sketch. With probability P_{S|q}, we perform the protocol above. With probability 1-P_{S|q}, Alice directly sends z^n, sampled from the complement of S, at a cost of -\log p^n(z^n). Overall, the extra cost is O(\epsilon).

This procedure is computationally intractable, since it requires Alice to generate a sequence of samples from an exponentially large set of tuples (z_1, z_2, \dots, z_n). This contrasts with bits-back coding, which can be implemented efficiently. In fact, a recent paper showed how to implement bits-back coding with VAEs, cleverly using ANS (a relative of arithmetic coding).

It’s possible that there’s a procedure like arithmetic coding that solves our problem, giving an efficient algorithm in the case that z lives in a small discrete set. If z is high-dimensional, then it seems unlikely that we can solve the transmission problem efficiently without additional assumptions–all we can do is enumerate samples from p and index into them.

Finally, I wouldn’t be surprised if this problem is well-known–it seems like a natural way of formalizing the idea of lossy data transmission. If so, please send me a pointer.

*Thanks to Nik Tezak and Beth Barnes for helpful feedback*.

\gdef\ratio{\tfrac{p(x)}{q(x)}} \gdef\iratio{\tfrac{q(x)}{p(x)}} \gdef\half{\tfrac{1}{2}} \gdef{\klqp}{\mathrm{KL}[q,p]} \gdef{\klpq}{\mathrm{KL}[p,q]}

This post is about Monte-Carlo approximations of KL divergence.

KL[q, p] = \sum_x q(x) \log \iratio = E_{ x \sim q}[\log \iratio ]

It explains a trick I’ve used in various code, where I approximate \klqp as a sample average of \half (\log p(x) - \log q(x))^2, for samples x from q, rather than the more standard \log \frac{q(x)}{p(x)}. This post will explain why this expression is a good (though biased) estimator of KL, and how to make it unbiased while preserving its low variance.

Our options for computing KL depend on what kind of access we have to p and q. Here, we’ll be assuming that we can compute the probabilities (or probability densities) p(x) and q(x) for any x, but we can’t calculate the sum over x analytically. Why wouldn’t we be able to calculate it analytically?

- Computing it exactly requires too much computation or memory.
- There’s no closed form expression.
- We can simplify code by just storing the log-prob, not the whole distribution. This is a reasonable choice if KL is just being used as a diagnostic, as is often the case in reinforcement learning.

The most common strategy for estimating sums or integrals is to use a Monte-Carlo estimate. Given samples x_1, x_2, \dots \sim q, how can we construct a good estimate?

A good estimator is unbiased (it has the right mean) and has low variance. We know that one unbiased estimator (under samples from q) is \log \iratio. However, it has high variance, as it’s negative for half of the samples, whereas KL is always positive. Let’s call this naive estimator k_1 = \log \iratio = - \log r, where we’ve defined the ratio r=\ratio that’ll appear frequently in the subsequent calculations.

An alternative estimator, which has lower variance but is biased, is \frac{1}{2}(\log \ratio)^2 = \half (\log r)^2. Let’s call this estimator k_2. Intuitively, k_2 seems to be better because each sample tells you how far apart p and q are, and it’s always positive. Empirically, k_2 does indeed have much lower variance than k_1, and also has remarkably low bias. (We’ll show this in an experiment below.)

There’s a good reason why estimator k_2 has low bias: its expectation is an f-divergence. An f-divergence is defined as D_f(p,q) = E_{x \sim q}[f(\ratio)] for a convex function f. KL divergence and various other well-known probability distances are f-divergences. Now here’s the key non-obvious fact: all f-divergences with differentiable f look like KL divergence up to second order when q is close to p. Namely, for a parametrized distribution p_{\theta},

D_f(p_0, p_{\theta}) = \tfrac{f''(1)}{2} \theta^T F \theta + O(\theta^3)

where F is the Fisher information matrix for p_{\theta} evaluated at p_{\theta}=p_0.

E_q[k_2]=E_q[\frac{1}{2}(\log r)^2] is the f-divergence where f(x)=\half (\log x)^2, whereas \klqp corresponds to f(x)= - \log x. It’s easy to check that both have f''(1)=1, so both look like the same quadratic distance function for p\approx q.
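If you'd rather not differentiate by hand, here's a tiny finite-difference check of the claim that both functions have f''(1)=1. The helper `second_derivative` and the step size are my own choices; this is a numerical sketch, not a proof.

```python
import math

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def f_kl(x):
    """f for KL[q, p]: f(x) = -log x."""
    return -math.log(x)

def f_k2(x):
    """f whose f-divergence is E_q[k_2]: f(x) = (log x)^2 / 2."""
    return 0.5 * math.log(x) ** 2

print(second_derivative(f_kl, 1.0), second_derivative(f_k2, 1.0))
```

Both printed values should be close to 1, up to finite-difference error.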

Is it possible to write down a KL divergence estimator that is unbiased but also low variance? The general way to lower variance is with a control variate. I.e., take k_1 and add something that has expectation zero but is negatively correlated with k_1. The only interesting quantity that’s guaranteed to have zero expectation is \ratio - 1 = r-1. So for any \lambda, the expression -\log r + \lambda (r - 1) is an unbiased estimator of \klqp. We can do a calculation to minimize the variance of this estimator and solve for \lambda. But unfortunately we get an expression that depends on p and q and is hard to calculate analytically.
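Although the best \lambda is hard to get analytically in general, it's easy to estimate from samples. Here's a sketch for one special case that I'm assuming for illustration: q=N(0,1) and p=N(\mu,1), where \log r = \mu x - \mu^2/2. The function name `control_variate_stats` is hypothetical, and the variance-minimizing coefficient is the usual regression formula -\mathrm{Cov}(a,b)/\mathrm{Var}(b) with a=-\log r and b=r-1.

```python
import math
import random

def control_variate_stats(mu, num_samples, seed=0):
    """q = N(0, 1), p = N(mu, 1); samples x ~ q.

    Returns (empirical variance-minimizing lambda,
             variance at lambda = 0, variance at lambda = 1).
    """
    rng = random.Random(seed)
    a_vals, b_vals = [], []       # a = -log r (this is k1), b = r - 1
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        log_r = mu * x - 0.5 * mu * mu   # log p(x) - log q(x) for unit-variance Gaussians
        a_vals.append(-log_r)
        b_vals.append(math.expm1(log_r))
    mean_a = sum(a_vals) / num_samples
    mean_b = sum(b_vals) / num_samples
    cov_ab = sum((a - mean_a) * (b - mean_b)
                 for a, b in zip(a_vals, b_vals)) / num_samples
    var_a = sum((a - mean_a) ** 2 for a in a_vals) / num_samples
    var_b = sum((b - mean_b) ** 2 for b in b_vals) / num_samples
    lam_star = -cov_ab / var_b           # minimizes Var(a + lambda * b)
    var_at_1 = var_a + 2.0 * cov_ab + var_b
    return lam_star, var_a, var_at_1

print(control_variate_stats(mu=0.1, num_samples=200_000))
```

In this Gaussian case the exact optimum works out to \mu^2/(e^{\mu^2}-1), which is just below 1 when \mu is small, so fixing \lambda=1 gives up very little.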

However, we can choose a good \lambda using a simpler strategy. Note that since log is concave, \log(x) \le x - 1. Therefore, if we let \lambda=1, the expression above is guaranteed to be positive. It measures the vertical distance between \log(x) and its tangent. This leaves us with the estimator k_3 = (r - 1) - \log r.

The idea of measuring distance by looking at the difference between a convex function and its tangent plane appears in many places. It’s called a Bregman divergence and has many beautiful properties.

We can generalize the above idea to get a good, always-positive estimator for any f-divergence, most notably the other KL divergence \klpq (note that p and q are switched here). Since f is convex and E_q[r]=1, the following is an estimator of the f-divergence: f(r) - f'(1)(r-1). This is always positive because it’s the distance between f and its tangent at r=1, and convex functions lie above their tangent lines. Now \klpq corresponds to f(x)=x \log x, which has f'(1)=1, leaving us with the estimator r \log r - (r - 1).

In summary, we have the following estimators (for samples x \sim q, and r = \ratio):

- \klpq: r \log r - (r - 1)
- \klqp: (r - 1) - \log r
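Here's a dependency-free sketch of both estimators as functions of \log r. The function names and the Gaussian test case (q=N(0,1), p=N(0.5,1), where both KL directions equal \mu^2/2) are my own choices for illustration.

```python
import math
import random

def kl_qp_estimate(log_r):
    """Per-sample estimate of KL[q, p]: (r - 1) - log r."""
    return math.expm1(log_r) - log_r

def kl_pq_estimate(log_r):
    """Per-sample estimate of KL[p, q]: r log r - (r - 1)."""
    return math.exp(log_r) * log_r - math.expm1(log_r)

def demo(mu=0.5, num_samples=200_000, seed=0):
    """q = N(0, 1), p = N(mu, 1), x ~ q; both KLs equal mu^2 / 2 here."""
    rng = random.Random(seed)
    total_qp = total_pq = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        log_r = mu * x - 0.5 * mu * mu   # log p(x) - log q(x)
        total_qp += kl_qp_estimate(log_r)
        total_pq += kl_pq_estimate(log_r)
    return total_qp / num_samples, total_pq / num_samples, 0.5 * mu * mu

print(demo())
```

Working with `log_r` rather than `r` mirrors the common situation where only log-probabilities are stored, and `math.expm1` keeps r - 1 accurate when r is close to 1.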

Now let’s compare the bias and variance of the three estimators for \klqp. Suppose q=N(0,1), p=N(0.1,1). Here, the true KL is 0.005.

| | bias/true | stdev/true |
|---|---|---|
| k1 | 0 | 20 |
| k2 | 0.002 | 1.42 |
| k3 | 0 | 1.42 |

Note that the bias of k2 is incredibly low here: it’s 0.2%.

Now let’s try for a larger true KL divergence. p=N(1,1) gives us a true KL divergence of 0.5.

| | bias/true | stdev/true |
|---|---|---|
| k1 | 0 | 2 |
| k2 | 0.25 | 1.73 |
| k3 | 0 | 1.7 |

Here, the bias of k2 is much larger. k3 has even lower standard deviation than k2 while being unbiased, so it appears to be a strictly better estimator.

Here’s the code I used to get these results:

```
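import torch.distributions as dis

# Match the setup in the text: samples come from q, and we estimate KL[q, p].
q = dis.Normal(loc=0, scale=1)
p = dis.Normal(loc=0.1, scale=1)
x = q.sample(sample_shape=(10_000_000,))
truekl = dis.kl_divergence(q, p)  # closed-form KL[q, p] for reference
print("true", truekl)
logr = p.log_prob(x) - q.log_prob(x)  # log r, with r = p(x) / q(x)
k1 = -logr
k2 = logr ** 2 / 2
k3 = (logr.exp() - 1) - logr
for k in (k1, k2, k3):
    # bias and standard deviation, each relative to the true KL
    print((k.mean() - truekl) / truekl, k.std() / truekl)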
```

*Thanks to Jacob Hilton and Nisan Stiennon for helpful feedback.*

*I originally wrote this guide back in December 2017 for the OpenAI Fellows program.*

In this essay, I provide some advice to up-and-coming researchers in machine learning (ML), based on my experience doing research and advising others. The advice covers how to choose problems and organize your time. I also recommend the following prior essays on similar topics:

My essay will cover similar ground, but it’s more tuned to the peculiar features of ML.

The keys to success are working on the right problems, making continual progress on them, and achieving continual personal growth. This essay is comprised of three sections, each covering one of these topics.

**Exercise**. Before continuing, it’s useful to spend a few minutes thinking about which findings and achievements in ML have been most interesting and informative to you. Think about what makes each one stand out: whether it's a groundbreaking result that changed your perspective on some problem, an algorithmic idea that's reusable, or a deep insight about some recurring questions. You should aspire to produce results, algorithms, and insights of this caliber.

Your ability to choose the right problems to work on is even more important than your raw technical skill. This taste in problems is something you’ll develop over time by watching which ideas prosper and which ones are forgotten. You’ll see which ones serve as building blocks for new ideas and results, and which ones are ignored because they are too complicated or too fragile, or because the incremental improvement is too small.

You might be wondering if there’s a way to speed up the process of developing a good taste for what problems to work on. In fact, there are several good ways.

- Read a lot of papers, and assess them critically. If possible, discuss them with others who have a deeper knowledge of the subject.
- Work in a research group with other people working on similar topics. That way you can absorb their experiences as well as your own.
- Seek advice from experienced researchers on what to work on. There’s no shame in working on ideas suggested by other people. Ideas are cheap, and there are lots of them in the air. Your skill comes in when you decide which one to work on, and how well you execute on it.
- Spend time reflecting on what research is useful and fruitful. Think about questions like:
  - When is theory useful?
  - When are empirical results transferable?
  - What causes some ideas to get wide uptake, whereas others are forgotten?
  - What are the trends in your field? Which lines of work will make the other ones obsolete?

Items 1-3 relate to optimizing your environment and getting input from other researchers, whereas item 4 is something you do alone. As empirical evidence for the importance of 1-3, consider how the biggest bursts of impactful work tend to be tightly clustered in a small number of research groups and institutions. That’s not because these people are dramatically smarter than everyone else; it’s because they have a higher density of expertise and perspective, which puts them a little ahead of the rest of the community, and thus they dominate in generating new results. If you’re not fortunate enough to be in an environment with high density of relevant expertise, don’t despair. You’ll just have to work extra-hard to get ahead of the pack, and it’s extra-important to specialize and develop your own unique perspective.

Roughly speaking, there are two different ways that you might go about deciding what to work on next.

Idea-driven. Follow some sector of the literature. As you read a paper showing how to do X, you have an idea of how to do X even better. Then you embark on a project to test your idea.

Goal-driven. Develop a vision of some new AI capabilities you’d like to achieve, and solve problems that bring you closer to that goal. (Below, I give a couple case studies from my own research, including the goal of using reinforcement learning for 3D humanoid locomotion.) In your experimentation, you test a variety of existing methods from the literature, and then you develop your own methods that improve on them.

Of course, these two approaches are not mutually exclusive. Any given subfield of ML is concerned with some goals (e.g., object detection). Any “idea-driven” project will represent progress towards the subfield’s goals, and thus in a sense, it’s an instance of goal-driven research. But here, I’ll take goal-driven research to mean that your goal is more specific than your whole subfield’s goal, and it’s more like *make X work for the first time* than *make X work better*.

I personally recommend goal-driven research for most people, and I’ve mostly followed this strategy myself.

One major downside of idea-driven research is that there’s a high risk of getting scooped or duplicating the work of others. Researchers around the world are reading the same literature, which leads them to similar ideas. To make breakthroughs with idea-driven research, you need to develop an exceptionally deep understanding of your subject, and a perspective that diverges from the rest of the community—some can do it, but it’s difficult.

On the other hand, with goal-driven research, your goal will give you a perspective that’s differentiated from the rest of the community. It will lead you to ask questions that no one else is asking, enabling you to make larger leaps of progress. Goal-driven research can also be much more motivating. You can wake up every morning and imagine achieving your goal: what the result would look like and how you would feel. That makes it easier to stick to a long-running research program with ups and downs. Goals also make it possible for a team of researchers to work together and attack different aspects of the problem, whereas idea-driven research is most effectively carried out by “teams” of 1-2 people.

For the first half of my PhD, my goal was to enable robots to manipulate deformable objects—including surgical robots tying knots, and household robots folding clothes. While this goal was determined by my advisor, Pieter Abbeel, as the main goal for his lab, I developed my own opinion on how to achieve this goal—my approach was based on learning from human demonstrations, and I was going to start with the problem of getting the PR2 to tie knots in rope. Various unexpected subproblems arose, one of which was trajectory optimization, and my work on that subproblem ended up being the most influential product of the knot-tying project.

For the second half of my PhD, I became interested in reinforcement learning. While there are many problem domains in which reinforcement learning can be applied, I decided to focus on robotic locomotion, since the goal was concrete and the end result was exciting to me. Specifically, my goal was to get a 3D robot to learn how to run from scratch using reinforcement learning. After some initial exploration, I decided to focus on policy gradient methods, since they seemed most amenable to understanding and mathematical analysis, and I could leverage my strength in optimization. During this period, I developed TRPO and GAE and eventually achieved the original goal of 3D humanoid locomotion.

While I was working on locomotion and starting to get my first results with policy gradient methods, the DeepMind team presented their results using DQN on Atari. After this result, many people jumped on the bandwagon and tried to develop better versions of Q-learning and apply them to the Atari domain. However, I had already explored Q-learning and concluded that it wasn’t a good approach for the locomotion tasks I was working on, so I continued working on policy gradient methods, which led to TRPO, GAE, and later PPO, which are now my best-known pieces of work. This example illustrates how choosing a different problem from the rest of the community can lead you to explore different ideas.

One pitfall of goal-driven research is taking your goal too literally. If you have a specific capability in mind, there’s probably some way to achieve it in an uninteresting way that doesn’t advance the field of machine learning. You should constrain your search to solutions that seem general and can be applied to other problems.

For example, while working on robotic locomotion, I avoided incorporating domain information into the solution—the goal was to achieve locomotion in simulation, *in a way that was general and could be applied to other problems*. I did a bit of feature engineering and reward shaping in order to see the first signs of life, but I was careful to keep my changes simple and not let them affect the algorithm I was developing. Now that I am using videogames as a testbed, I make sure that my algorithmic ideas are not specific to this setting—that they equally well could be applied to robotics.

Sometimes, people who are both exceptionally smart and hard-working fail to do great research. In my view, the main reason for this failure is that they work on unimportant problems. When you embark on a research project, you should ask yourself: how large is the potential upside? Will this be a 10% improvement or a 10X improvement? I often see researchers take on projects that seem sensible but could only possibly yield a small improvement to some metric.

Incremental work (those 10% improvements) is most useful in the context of a larger goal that you are trying to achieve. For example, the seminal paper on ImageNet classification using convolutional neural networks (Krizhevsky, Sutskever, & Hinton, 2012) does not contain any radically new algorithmic components; rather, it stacks up a large number of small improvements to achieve an unprecedented result that was surprising to almost everyone at the time (though we take it for granted now). During your day-to-day work, you’ll make incremental improvements in performance and in understanding. But these small steps should be moving you towards a larger goal that represents a non-incremental advance.

If you are working on incremental ideas, be aware that their usefulness depends on their complexity. A method that slightly improves on the baseline better be very simple, otherwise no one will bother using it—not even you. If it gives a 10% improvement, it better be 2 lines of code, whereas if it's a 50% improvement, it can add 10 lines of code, etc. (I’m just giving these numbers for illustration, the actual numbers will obviously depend on the domain.)

Go back and look at the list of machine learning achievements you admire the most. Does your long-term research plan have the potential to reach the level of those achievements? If you can’t see a path to something that you’d be proud of, then you should revise your plan so it does have that potential.

To develop new algorithms and insights in machine learning, you need to concentrate your efforts on a problem for a long period of time. This section is about developing effective habits for this long-term problem solving process, enabling you to continually build towards great results.

I strongly advise you to keep a notebook, where you record your daily ideas and experiments. I have done this through 5 years of grad school and 2 years at OpenAI, and I feel that it has been tremendously helpful.

I create an entry for each day. In this entry, I write down what I’m doing, ideas I have, and experimental results (pasting in plots and tables). Every 1 or 2 weeks, I do a review, where I read all of my daily entries and I condense the information into a summary. Usually my review contains sections for *experimental findings*, *insights* (which might come from me, my colleagues, or things I read), *code progress* (what did I implement), and *next steps / future work*. After I do my week in review, I often look at the previous week to see if I followed up on everything I thought of that week. Also, while doing this review, I sometimes transfer information into other sources of notes. (For example, I keep a list of backburner ideas and projects, separate from my notebook.)

What’s the value in keeping this notebook and doing the regular reviews?

First, the notebook is a good place to write down ideas as soon as you have them, so you can revisit them later. Often, when I revisit my journal entries during the week in review, I’ll fill in a missing piece in a puzzle, which didn’t occur to me at the time.

Second, the notebook helps you keep your experimental results in a unified place, so you can easily find the results later. It’s easy to forget about your conclusions, e.g., which hyperparameters made a difference, and you’ll want to revisit your old notebook entries.

Third, the notebook lets you monitor your use of time. You might wonder “where did last week go?”, and the notebook will help you answer that question. You might be disappointed with your throughput and realize you need to work on your time management. You also might look back at several months and realize that you’ve been jumping around between ideas too much—that you have a few half-finished projects but you didn’t follow any of these threads long enough to yield a notable result.

To solve a challenging problem, you need to spend a sufficient amount of time on it. But in empirical machine learning research, it’s hard to know if you’ve tried an idea hard enough. Sometimes the idea has the potential to work, but if you get one detail wrong, you’ll see no signs of life. But other ideas are simply doomed to fail no matter how hard you work on them.

In my experience, switching problems too frequently (and giving up on promising ideas) is a more common failure mode than not switching enough. Often, while you’re engaged in the long slog towards getting your current idea to work, another promising idea will come along, and you’ll want to jump to that idea. If your idea is quick to try and the potential upside is large, then go ahead and do it. But more commonly, your initial results on the new idea will be disappointing, and it’ll take a more sustained effort to yield significant results.

As a rule of thumb, when you look back at which projects you’ve been working on over a period of months, you should find that there have been lots of small dead ends, but the majority of your time has been directed towards projects that yielded a deliverable such as a paper or a blog post. If you look back at your time and see that a substantial fraction was spent on half-finished projects—which were not definite failures, but which you abandoned in favor of some newer idea—then you should make a stronger effort towards consistency and follow-through in the future.

One strategy, which I haven’t tried personally but which makes a lot of sense upon reflection, is to devote some fixed time budget to trying out new ideas that diverge from your main line of work. Say, spend one day per week on something totally different from your main project. This would constitute a kind of epsilon-greedy exploration, and it would also help to broaden your knowledge.

No matter how you allocate your time during your research journey, you are bound to learn a lot. Each project will present new challenges, and you can pick up the background material and skills as you go along. However, you can significantly improve your chances of doing great work in the long term by regularly setting aside time for your personal development. Specifically, you should allocate some fraction of your time towards improving your general knowledge of ML as opposed to working on your current project. If you don’t allocate this time, then your knowledge is likely to plateau after you learn the basics that you need for your day-to-day work. It’s easy to settle into a comfort zone of methods you understand well, and you may need to expend active effort to expand this zone.

The main ways to build your knowledge of ML are to read textbooks, theses, and papers, and to reimplement algorithms from these sources. Early on in your career, I recommend splitting your time about evenly between textbooks and papers. You should choose a small set of relevant textbooks and theses to gradually work through, and you should also reimplement the models and algorithms from your favorite papers.

Most students of machine learning don’t spend time reading textbooks after they finish their school courses. I think this is a mistake, since textbooks are a much denser way to absorb knowledge than papers. Each conference paper typically contains one main new idea, along with a background section that’s too concise to learn anything from. There’s a lot of overhead, since you typically need to spend more time understanding the notation and terminology than the idea itself. On the other hand, good textbooks collect decades of ideas and present them in the proper order with the same notation. Besides reading the introductory machine learning textbooks, read other books in your areas of interest. A couple of my favorites were *Numerical Optimization* by Nocedal & Wright, and *Elements of Information Theory* by Cover & Thomas.

Besides textbooks, I recommend reading PhD theses of researchers whose work interests you. PhD theses in ML are usually organized as follows: (1) introductory and background material, (2) several papers that were previously published at conferences (it’s said that you just have to “staple together” your papers to write your thesis), and (3) a conclusion and outlook. You’re likely to benefit most from parts (1) and (3), since they contain a unifying view of the past and future of the field, written by an expert. Recent theses are often the best place to find a literature review of an active field, but older theses also often contain valuable gems of insight.

Textbooks and theses are good for building up your foundational knowledge, but you’ll also need to read a lot of papers to bring your knowledge up to the frontier. When you are just starting your research career, I recommend spending a lot of time reimplementing ideas from papers, and comparing your results to the published ones. First of all, this gives you a much deeper understanding of the topic than you’d get by passively reading. Second, you’ll gain experience running experiments, and you’ll get much quicker feedback by reimplementing existing work (where the desired level of performance is known) than by doing original research. Once you can easily reproduce the state of the art, you’ll be ready to go beyond it.

Besides reading seminal papers and reimplementing them, you should also keep track of the less exceptional papers being published in your field. Reading and skimming the incoming papers with a critical eye helps you notice the trends in your field (perhaps you notice that a lot of papers are using some new technique and getting good results—maybe you should investigate it). It also helps you build up your taste by observing the dependency graph of ideas—which ideas become widely used and open the door to other ideas.

Go forth and do great research!