Tighter risk certificates for neural networks
Abstract
This paper presents empirical studies regarding training probabilistic neural networks using training objectives derived from PACBayes bounds. In the context of probabilistic neural networks, the output of training is a probability distribution over network weights. We present two training objectives, used here for the first time in connection with training neural networks. These two training objectives are derived from tight PACBayes bounds, one of which is new. We also reimplement a previously used training objective based on a classical PACBayes bound, to compare the properties of the predictors learned using the different training objectives. We compute risk certificates that are valid on any unseen examples for the learnt predictors. We further experiment with different types of priors on the weights (both datafree and datadependent priors) and neural network architectures. Our experiments on MNIST and CIFAR10 show that our training methods produce competitive test set errors and nonvacuous risk bounds with much tighter values than previous results in the literature, showing promise not only to guide the learning algorithm through bounding the risk but also for model selection. These observations suggest that the methods studied here might be good candidates for selfbounding learning.
120001484/0010/00xxx00a M. PérezOrtiz, O. Rivasplata, J. ShaweTaylor, C. Szepesvári \ShortHeadingsTighter risk certificates for neural networks PérezOrtiz, Rivasplata, ShaweTaylor, Szepesvári \firstpageno1
Kevin Murphy and Bernhard Schölkopf
Deep learning, neural work training, weight randomization, generalization, pathwise reparametrized gradients, PACBayes with Backprop, datadependent priors.
1 Introduction
In a probabilistic neural network, the result of the training process is a distribution over network weights, rather than simply fixed weights. The datadependence of the learned distribution over weights implies that this distribution might capture key properties of the learning task and, consequently, its predictions are well aligned with the task. Several prediction schemes can be devised based on a probability distribution over weights. For instance, one may use a randomized predictor, where each prediction is done by randomly sampling the weights from the datadependent distribution obtained as the result of the training process. Another possible scheme consists of predicting with the mean of the learned distribution. Yet another prediction scheme is based on integrating the predictions of all possible parameter settings, weighted according to the learned distribution.
In this paper we study probabilistic neural networks from a PACBayesian approach. We name ‘PACBayes with Backprop’ (PBB) the family of (probabilistic) neural network training methods derived from PACBayes bounds. The work reported here is the result of our empirical studies undertaken to investigate three PBB training objectives. For reference, they are the functions , and , shown respectively in Eq. (8), Eq. (9) and Eq. (10) below. These objectives are based on PACBayes bounds with similar names, which are relaxations of the PACBayes relative entropy bound (Langford and Seeger, 2001), also known as the PACBayeskl bound in the literature. The classic PACBayes bound, from which is derived, is that of McAllester (1999), but we use the improved dependence on the number of training patterns clarified by Maurer (2004). The PACBayeslambda bound is that of Thiemann et al. (2017). The PACBayesquadratic bound, from which is derived, was first introduced by Rivasplata et al. (2019b). We are also interested in certifying the quality of the randomized classifiers generated by these training methods, by providing a certificate on the classification error that is valid on unseen examples.
Our line of research owes credit to previous works that have trained a probabilistic neural network by minimizing a PACBayes bound, or used a PACBayes bound to give risk certificates for trained neural networks. Langford and Caruana (2001) developed a method to train a probabilistic neural network by randomizing the weights with Gaussian noise (adjusted in a datadependent way via a sensitivity analysis), and computed an upper bound on the error using the PACBayeskl bound.^{1}^{1}1Inversion of the PACBayeskl bound (we explain this in Section 6) gives a certificate (upper bound) on the risk of the randomized predictor, in terms of its empirical error and other quantities. The empirical error term is evaluated indirectly by Monte Carlo sampling, and a bound on the tail of the Monte Carlo evaluation (Langford and Caruana, 2001, Theorem 2.5) is combined with the PACBayeskl bound to give a numerical risk certificate that holds with high probability over data and Monte Carlo samples. They also pointed that PACBayes bounds might be fruitful for computing nonvacuous generalization bounds for neural nets. Dziugaite and Roy (2017) used a training objective (essentially equivalent to ) based on a relaxation of the PACBayeskl bound, they optimized this objective using stochastic gradient descent (SGD), and computed a confidence bound on the error of the randomized classifier following the same approach that Langford and Caruana (2001) used to compute their error bound. Dziugaite and Roy (2018) developed a twostage method to train a probabilistic neural net, which in the first stage trains a prior mean by empirical risk minimization via SGLD,^{2}^{2}2Stochastic gradient Langevin dynamics, Welling and Teh (2011). and in the second stage reuses the same training data used for the prior in order to train a posterior Gaussian distribution over weights by minimizing a relaxation of the classic PACBayes bound which accounts for the data reuse.
In this paper we report experiments on MINIST and CIFAR10 with the three training objectives mentioned above. We used by default the randomized predictor scheme (also called the ‘stochastic predictor’ in the PACBayes literature), justified by the fact that PACBayes bounds give highconfidence guarantees on the average loss of the randomized predictor. Since training is based on a surrogate loss function, optimizing a PBB objective gives a highconfidence guarantee on the randomized predictor’s risk under the surrogate loss. Accordingly, to obtain guarantees that are valid for the classification (zeroone) loss, we separately evaluate the test set error and compute a confidence bound on the risk based on this loss (following the same procedure that was used by Langford and Caruana (2001)). For the sake of comparison we also report the test set evaluations of the other two predictor schemes described above, namely, the mean and the ensemble predictors.
Our work took inspiration from Blundell et al. (2015), whose results showed that randomized weights achieve competitive test set errors; and from Dziugaite and Roy (2017, 2018), whose results gave randomized neural network classifiers with reasonable test set errors and, more importantly, nonvacuous risk bound values. Our experiments show that PBB training objectives can (a) achieve competitive test set errors (e.g. comparable to Blundell et al. (2015) and empirical risk minimisation), while also (b) deliver risk certificates with reasonably tight values. Our results show as well a significant improvement over those of Dziugaite and Roy (2017, 2018): we further close the gap between the risk certificate (bound value) and the risk estimate (test set error rate). As we argue below, this improvement comes from the tightness of the PACBayes bounds we used, which is established analytically and corroborated by our experiments on MNIST and CIFAR10 with deep fully connected networks and convolutional neural networks.
Regarding the tightness of the training objectives, Dziugaite and Roy’s training objective (which in our notation takes essentially the form of shown in Eq. (10) below) has the disadvantage of being suboptimal in the regime of small losses. This is because to obtain it they relaxed the PACBayeskl bound via an inequality that is loose in this regime. The looseness was the price paid for having a computable objective. Note that small losses is precisely the regime of interest in neural network training (although the true loss being small is dataset and architecture dependent). By contrast, our proposed training objectives ( and in Eq. (8) and Eq. (9) below) are based on relaxing the PACBayeskl bound by an inequality that is tighter in this same regime of small losses, which is one of the reasons explaining our tighter risk certificates in MNIST (not for CIFAR10, which could be explained by the large empirical loss obtained at the end of the optimisation). Interestingly, our own reimplementation of also gave improved results compared to the results of Dziugaite and Roy, which suggests that besides the training objectives we used, also the training strategies we used are responsible for the improvements.
We would like to highlight the elegant simplicity of the methods presented here: Our results are achieved i) with priors learnt through empirical risk minimisation on a subset of the dataset (which does not overlap with the data used for computing the risk certificate for the probabilistic neural network, thus in line with classical PACBayes priors) and ii) via classical SGD optimization. In contrast, Dziugaite and Roy (2018) trained a special type of datadependent PACBayes prior on the whole dataset using SGLD optimization. They justified this procedure arguing that the limit distribution of SGLD satisfies the differential privacy property (but a finitetime guarantee was missing), and relaxed the PACBayes bound with a correction term based on the concept of maxinformation^{3}^{3}3Dwork et al. (2015a, b) proposed this concept in the context of adaptive data analysis. to account for using the same data to train the prior mean and to evaluate the PACBayes bound. Furthermore, our methods do not involve tampering with the training objective, as opposed to Blundell et al. (2015), who used a “KL killing trick” by inserting a tunable parameter as a factor of the KullbackLeibler (KL) divergence in their objective. Our work highlights the point that it is worthwhile studying simple methods, not just to understand their scope or for the sake of having a more controlled experimental setup, but also to more accurately assess the real value added by the ‘extras’ of the more complex methods.
A clear advantage of PACBayes with Backprop (PBB) methods is being an instance of learning with guarantees: When training probabilistic neural nets by PBB methods the output is not just a predictor but simultaneously a risk certificate that guarantees the quality of predictions on unseen examples. Once again, the datadependent solution found by the optimization process comes with a highconfidence guarantee that certifies its risk under the surrogate training loss, and to obtain a highconfidence guarantee under the classification (zeroone) loss we evaluate post training a confidence bound on the classification error. A more ambitious goal would be to develop selfbounding learning algorithms (Freund, 1998) which use all the available data to achieve both goals simultaneously, thus obviating the use of datasplitting protocols. Naturally, learning with guarantees per se will not impress until the values of the reported risk certificates match or closely follow the classification error rates calculated on a test set, so that the risk certificate is informative of the performance on unseen examples.
Our contributions:

[leftmargin=*,topsep=2pt,itemsep=2pt]

We rigorously study and illustrate ‘PACBayes with Backprop’ (PBB), a generic strategy to derive neural network training methods from PACBayes bounds.

We propose –and experiment with– two new PBB training objectives: one derived from our novel PACBayesquadratic bound, and one derived from the PACBayeslambda bound of Thiemann et al. (2017).

We also reimplement the training objective used by Dziugaite and Roy for the sake of comparing our training objectives and training strategy, both with respect to test set accuracy and risk certificates obtained.

We connect PACBayes with Backprop (PBB) methods to the BayesbyBackprop (BBB) method of Blundell et al. (2015), which achieved competitive test set accuracy. Unlike BBB, our training methods not only achieve competitive test set errors, but require less heuristics and also provide a risk certificate.

We demonstrate via experimental results that PBB methods might be able to achieve selfcertified learning with nontrivial certificates: obtaining competitive test set errors and computing nonvacuous bounds with much tighter values than previous works.
Broader context.
Deep learning is a vibrant research area. The success of deep neural network models in several tasks has motivated many works that study their optimization and generalization properties (some of the collective knowledge is condensed in a few recent sources such as Montavon et al. (2012); Goodfellow et al. (2016); Aggarwal (2018)). Some works focus on experimenting with methods to train neural networks, others aim at generating knowledge and understanding about these fascinating learning systems. In this paper we intend to contribute both ways. We focus on supervised classification problems through probabilistic neural networks, and we experiment with training objectives that are principled and consist of interpretable quantities. Furthermore, our work puts an emphasis on certifying the quality of predictions beyond a specific data set.
Note that known neural network training methods range from those that have been developed based mainly on heuristics to those derived from sound principles. Bayesian learning, for instance, offers principled approaches for learning datadependent distributions over network weights (see e.g. Buntine and Weigend (1991), Neal (1992), MacKay (1992), Barber and Bishop (1998)), hence probabilistic neural nets arise naturally in this approach. Bayesian neural networks continue to be developed, with notable recent contributions e.g. by HernándezLobato and Adams (2015); Martens and Grosse (2015); Blundell et al. (2015); Gal and Ghahramani (2016); Louizos and Welling (2016); Ritter et al. (2018), among others. Our work is complementary of Bayesian learning in the sense that our methods also offer principled training objectives for learning probabilistic neural networks. However, there are differences between the PACBayesian and Bayesian approaches that are important to keep in mind (we discuss the differences in Section 3). It is worth mentioning also that some works have pointed out the resemblance between the Bayesian evidence lower bound (ELBO) and PACBayes bounds, and made connections between the PACBayes approach and variational Bayesian inference (Achille and Soatto, 2018; Thakur et al., 2019).
As we pointed out before, we are not the first to train a probabilistic neural network by minimizing a PACBayes bound, or to use a PACBayes bound to give risk certificates for trained neural networks. We already mentioned Langford and Caruana (2001) and Dziugaite and Roy (2017, 2018), whose works have directly influenced ours. Next, we comment on some other works that connect PACBayes with neural networks. London (2017) approached the generalization of neural networks by a stabilitybased PACBayes analysis, and proposed an adaptive sampling algorithm for SGD that optimizes its distribution over training instances using multiplicative weight updates. Neyshabur et al. (2017, 2018) examined the connection between some specifically defined complexity measures and generalization, the part related to our work is that they specialized a form of the classic PACBayes bound and used Gaussian noise on network weights to give generalization bounds for probabilistic neural networks based on the norms of the weights. Zhou et al. (2019) compressed trained networks by pruning weights to a given target sparsity, and gave generalization guarantees on the compressed networks, which were based on randomizing predictors according to their ‘description length’ and a specialization of Catoni (2007)’s PACBayes bound.
We would like to point out that the present work builds on Rivasplata et al. (2019b).^{4}^{4}4Rivasplata et al. (2019b) is now “arXiv’ed” permanently. In the meantime, more works have appeared that connect neural networks with PACBayes bounds in various settings: Letarte et al. (2019), Viallard et al. (2019), Lan et al. (2020), Dziugaite et al. (2020), Biggs and Guedj (2020). We do not elaborate on these works as they deal with significantly different settings than ours (e.g. they study binary activated networks or ensembles or focus exclusively on the role of the prior).
Paper layout.
The rest of the paper is organized as follows. In Section 2 we briefly recall some notions of supervised learning, mainly to set the notation used later. In Section 3 we outline the PACBayes framework and discuss some PACBayes bounds, while in Section 5 we present the training objectives derived from them. Section 4 discusses the connection between our work and Blundell et al. (2015). The technical Section 6 describes the binary KL inversion strategy and the ways we use it. In Section 7 we present our experimental results. We conclude and discuss future research directions in Section 8.
2 Generalization through risk upper bounds
In the context of supervised learning, an algorithm that trains a neural network receives a finite list of training examples and produces a datadependent weight vector , which is used to predict the label of unseen examples. The ultimate goal is for the algorithm to find a weight vector that generalizes well, meaning that the decisions arrived at by using the learned should give rise to a small loss on unseen examples^{5}^{5}5In statistical learning theory the meaning of generalization of a learning method has a precise definition (see e.g. ShalevShwartz and BenDavid (2014)). We use the word in a slightly broader sense here. . Turning this into precise statements requires a formal description of the learning setting, briefly discussed next. The reader familiar with learning theory can skip the next couple of paragraphs and come back if they need clarifications regarding notation.
The training algorithm receives a size random sample . Each example is randomly drawn from a space according to an underlying (but unknown) probability distribution^{6}^{6}6 denotes the set of all probability measures over . . The example space usually takes the form in supervised learning, where and , each example being a pair consisting of an input and its corresponding label . A space encompasses all possible weight vectors, and it is understood that each possible weight vector maps to a predictor function that will assign a label to each new input . While statistical inference is largely concerned with elucidating properties of the unknown datagenerating distribution, the main focus of machine learning is on the quality of predictors, measured by the expected loss on unseen examples, also called the risk:
(1) 
Here is a fixed loss function. With these components, regression is defined as the problem when and the loss function is the squared loss, namely , while binary classification is the problem where (or ) and the loss is set to be the zeroone loss: .
The goal of learning is to find a weight vector with small risk . Since the datagenerating distribution is unknown, is an unobservable objective. Replacing the expected loss with the average loss on the data gives rise to an observable objective called the empirical risk functional:
(2) 
In practice, the minimization of is often done with some version of gradient descent. Since the zeroone loss gives rise to a piecewise constant loss function, in classification it is common to replace it with a smooth(er) loss, such as the crossentropy loss, while changing the range of to .
Under certain conditions, minimizing the empirical risk leads to a weight that is guaranteed to have a small risk. Examples of such conditions are when the set of functions representable has a small capacity relative to the sample size or the map that produces the weights given the data is stable, or the same map is implicitly constructed by an algorithm such as stochastic gradient descent. However, often minimizing the empirical risk can lead to a situation when the risk of the learned weight is larger than desired – a case of overfitting. To prevent overfitting, various methods are commonly used. These include complexity regularization, early stopping, injecting noise in various places into the learning process, etc (e.g. Srivastava et al. (2014), Wan et al. (2013), Caruana et al. (2001), Hinton and van Camp (1993)).
An alternative to these is to minimize a surrogate objective which is guaranteed to give an upper bound on the risk. As long as the upper bound is tight and the optimization gives rise to a small value for the surrogate objective, the user can be sure that the risk will also be small: In this sense, overfitting is automatically prevented, while we also automatically get a selfbounded method in the sense of Freund (1998) (see also Langford and Blum (2003)). In this paper we follow this last approach, with two specific training objectives derived from corresponding PACBayes bounds, which we introduce in the next section. The approach to learning datadependent distributions over hypotheses by minimizing a PACBayes bound is mentioned already in McAllester (1999), credit for this approach in various contexts is due also to Germain et al. (2009), Seldin and Tishby (2010), Keshet et al. (2011), Noy and Crammer (2014), Keshet et al. (2017), among others. Subsequent use of this approach for training neural nets was done by Dziugaite and Roy (2017, 2018).
As will be demonstrated below, our experiments based on our two training objectives and (Eq. (8) and Eq. (9) below) lead to (a) test set performance comparable to that of Blundell et al. (2015), while (b) computing nonvacuous bounds with tighter values than those obtained by (Eq. (10) below) which is essentially equivalent to the training objective used by Dziugaite and Roy.
3 PACBayes bounds
Probabilistic neural networks are realized as probability distributions over the weight space. While the outcome of a classical (nonprobabilistic) neural network training method is a datadependent weight vector, the outcome of training a probabilistic neural network is a datadependent distribution^{7}^{7}7Formally, a datadependent distribution over is a stochastic kernel from to . This formalization of datadependent distributions over predictors is covered lucidly by Rivasplata et al. (2019a). over weights, say . Then, given a fresh input , the network predicts its label by drawing a weight vector at random according to and applies the predictor to . Each new prediction requires a fresh draw. One way, which we adopt in this paper, to measure the performance of the resulting randomizing predictor, is to use the expected loss over the random draws of weights. Accordingly, the average empirical loss becomes and the average population loss becomes . In general, we denote by the integral whenever is a probability distribution over and an integrable function.
To introduce the promised PACBayes bounds we need to recall some further definitions. Given two probability distributions , the KullbackLeibler (KL) divergence of from , also known as relative entropy of given , is defined as follows:
when , the RadonNikodym derivative of with respect to , is defined; otherwise . For we will write
which is called the binary KL divergence, and is the divergence of the Bernoulli distribution with parameter from the Bernoulli distribution with parameter .
The PACBayeskl theorem (Langford and Seeger (2001), Seeger (2002), Maurer (2004)) concludes that as long as the loss takes values in , for any , with probability at least over random samples , for any distribution over it holds that
(3) 
where is a datafree distribution over , which means that is fixed without any dependence on the data on which the bound is evaluated. One can lowerbound the binary KL divergence, e.g., using the relaxed version of Pinsker’s inequality , and then solve the resulting inequality for (see e.g. Tolstikhin and Seldin (2013)). Alternatively, one may use the refined version of Pinsker’s inequality valid for (see e.g. Boucheron et al. (2013, Lemma 8.4)), which is tighter when (see Section 5.1), and thus get
() 
The difference to the result one gets from the relaxed version of Pinsker’s inequality is the appearance of under the square root. This, in particular, tells us that the inequality is tighter when the population loss, , is smaller (specifically when ). But it is exactly because of the appearance of on the righthand side that this bound is not immediately useful for optimization purposes. However, one can view the above inequality as a quadratic inequality on . Solving this inequality for leads to the following empirical PACBayes bound, which to the best of our knowledge is new:
For any , for any , for any datafree distribution , for any loss function with range , for any , with probability over size i.i.d. samples , simultaneously for all distributions over we have
(4) 
Alternatively, using ( ‣ 3) combined with the inequality , after some derivations one obtains the PACBayes bound of Thiemann et al. (2017): valid for all
For any , for any , for any datafree distribution , for any loss function with range , for any , with probability over size i.i.d. samples , simultaneously for all distributions over and we have
(5) 
For convenience, we quote the classical PACBayes bound of McAllester (1999):
For any , for any , for any datafree distribution , for any loss function with range , for any , with probability over size i.i.d. samples , simultaneously for all distributions over we have
(6) 
The original proof of McAllester (1999) gave a slightly looser bound. The form presented in Section 3 is with the sharp dependence on clarified by Maurer (2004).
Notice that the conclusion of these theorems is an upper bound on that holds simultaneously for all distributions over weights, with high probability (over samples). In particular, the bounds allow to choose a distribution in a datadependent manner, which is why they are usually called ‘posterior’ distributions in the PACBayesian literature. These theorems could be rewritten directly in terms of the datadependent distributions represented as stochastic kernels (as done by Rivasplata et al. (2019a)). Below in Section 5 we discuss training objectives derived from these bounds. Notice that there are many other PACBayes bounds available in the literature (the usual ones are by McAllester (1999), Langford and Seeger (2001), Catoni (2007); but see also McAllester (2003), Keshet et al. (2011), the mini tutorial van Erven (2014), and the primer Guedj (2019)). Each such bound readily leads to a training objective by replicating the procedure described below.
4 The Bayes by Backprop (BBB) objective
The ‘Bayes by backprop’ (BBB) method of Blundell et al. (2015) is inspired by a variational Bayes argument (Fox and Roberts, 2012), where the idea is to learn a distribution over the weights that approximate the Bayesian posterior distribution. Choosing a dimensional Gaussian , parametrized by , the optimum parameters are those that minimize , i.e. the KL divergence from and the Bayesian posterior . By a simple calculation, and using the Bayes rule, one can extract:
where stands here for the Bayesian prior distribution. Thus, minimizing is equivalent to minimizing the righthand side, which presents a sum of a datadependent term (the expected negative loglikelihood) and a priordependent term (), which makes this optimization problem analogous to that of minimizing a PACBayes bound, since the latter also balances a fittodata (empirical loss) term and a fittoprior (KL) term. However, in the PACBayes framework the ‘prior’ is a reference distribution and the ‘posterior’ does not need to be derived from the prior by an update factor, but is rather unrestricted. This is a crucial difference with Bayesian learning, and one that makes the PACBayes framework a lot more flexible in the choice of distributions, even compared to generalized Bayesian approaches (Bissiri et al., 2016).
As we mentioned before, the training objective proposed by Blundell et al. (2015) is inspired by the variational Bayesian argument outlined above, but the training objective they proposed and experimented with, in our notation, is as follows:
(7) 
The scaling factor, , is introduced in a heuristic manner to make the method more flexible, while the variational Bayes argument gives (7) with . When is treated as a tuning parameter, the method can be interpreted as searching in “KL balls” centered at of various radii. Thus, the KL term then plays the role of penalizing the complexity of the model space searched. Blundell et al. (2015) propose to optimize this objective (for a fixed ) using stochastic gradient descent (SGD), which randomizes over both minibatches and the weights, and uses the pathwise gradient estimate (Price, 1958). The resulting gradientcalculation procedure can be seen to be only at most twice as expensive as standard backpropagation – hence the name of their method. The hyperparameter is chosen using a validation set, which is also often used to select the best performing model among those that were produced during the course of running SGD (as opposed to using the model obtained when the optimization procedure finishes).
The results in Blundell et al. (2015) have shown that probabilistic neural networks enable an intuitive and principled implementation of classification reject options (i.e. using uncertainty to allow the model say “I don’t know” when classifying a new example) and model pruning. The use of a prior during training has empirically shown similar results to other implicit regularisation schemes, such as dropout. Finally, their weight uncertainty was also used to drive the explorationexploitation tradeoff in reinforcement learning.
5 Towards practical PACBayes with Backprop (PBB) methods
The essential idea of “PACBayes with Backprop” (PBB) is to train a probabilistic neural network by minimizing an upper bound on the risk, specifically, a PACBayes bound. Here we present two training objectives, derived from Eq. (4) and Eq. (5) respectively, in the context of classification problems when the loss is the zeroone loss or a surrogate loss. These objectives are used here for the first time to train probabilistic neural networks. We also discuss the training objective derived from Eq. (6) for comparison purposes.
To optimize the weights of neural networks the standard idea is to use a form of stochastic gradient descent, which requires the ability to efficiently calculate gradients of the objective to be optimized. When the loss is the zeroone loss, , the training loss viewed as a function of the weights, is piecewise constant, which makes simple gradientbased methods fail (since the gradient, whenever it exists, is zero). As such, it is customary to replace the zeroone loss with a smoother “surrogate loss” that plays well with gradientbased optimization. In particular, the standard loss used on multiclass classification problems is the crossentropy loss, defined by where , and is the softmax function defined by . This choice can be justified on the grounds that gives an upper bound on the probability of mistake when the label is chosen at random from the distribution produced by applying softmax on (e.g., the output of the last linear layer of a neural network).^{8}^{8}8Indeed, owning to the inequality , which is valid for any , given any and , if then . We thus also propose to replace the zeroone loss with the crossentropy loss in either Section 3 or Section 3, leading to the objectives
(8) 
and
(9) 
where denotes the empirical error rate under the ‘bounded’ version of crossentropy loss, namely the loss described below, and denotes the function implemented by the neural network that uses weights .
For comparison, the training objective derived from Section 3 takes the following form:
(10) 
The next issue to address is that the crossentropy loss is unbounded, while the previous theorems required a bounded loss. This is fixed by enforcing an upper bound on the crossentropy loss by lowerbounding the network probabilities by a value (Dziugaite and Roy, 2018). This is achieved by replacing in the definition of by . This adjustment gives a ‘bounded crossentropy’ loss function with range between 0 and . Finally, rescaling by gives a loss function with range [0,1] ready to be used in the PACBayes bounds and training objectives discussed here. The latter () is used as the surrogate loss for training in all our experiments with , , and .
Optimization of (8) entails minimizing over only, while optimization of (9) is done by alternating minimization with respect to and , similar to the procedure that was used by Thiemann et al. (2017) in their experiments with SVMs. By choosing appropriately, either case we use the pathwise gradient estimator (Price, 1958; Jankowiak and Obermeyer, 2018) as done by Blundell et al. (2015). In particular, assuming that with is such that with ) has the same distribution as where is drawn at random from a fixed distribution and is a smooth map, an unbiased estimate of the gradient of the lossmap at some can be obtained by drawing and calculating , thereby reducing the efficient computation of the gradient to the application of the backpropagation algorithm on the map at .^{9}^{9}9Indeed (e.g. Ruiz et al. (2016)),
Following Blundell et al. (2015), the reparametrization we use is with appropriate distribution (Gauss or Laplace) for each coordinate of . The process of optimization is implemented using the transformation and the gradient updates are with respect to and , as can be seen in Algorithm 1. Note that Algorithm 1 shows the procedure for optimising with Gaussian noise. The procedure with Laplace noise is similar. The procedure for is similar. The procedure for would be very similar except that has the additional parameter .
5.1 Pinsker inequality: relaxed versus refined
We explain now the differences between the relaxed and refined versions of the Pinsker inequality, used for defining the above presented PACBayes inspired training objectives and crucial to understand their differences. Recall the definition of binary KL divergence: for any we write for the KL divergence of the Bernoulli distribution with parameter from the Bernoulli distribution with parameter :
The relaxed Pinsker inequality:
(11) 
The refined Pinsker inequality:
(12) 
One can compare these two inequalities, to find regime of in which one is better than the other. The result of the comparison is that Eq. (11) (used in ) is tighter whenever , and Eq. (12) (used in ) is tighter whenever . They match if .
5.2 The choice of the prior distribution
We experiment both with priors centered at randomly initialised weights and priors learnt by empirical risk minimisation on a subset of the dataset which is independent of the subset used to compute the risk certificate. Note that all training data are used by the learning algorithm ( examples used to build the prior, to learn the posterior and to evaluate the risk certificate). Since the posterior is initialised to the prior, the learnt prior translates to the posterior being initialised to a large region centered at the empirical risk minimiser.
5.2.1 KL formulas: Laplace versus Gauss
We describe here the two distributions considered for the network weights: Laplace and Gaussian. The Laplace density with mean parameter and with variance is:
The KL divergence for two Laplace distributions is
(13) 
For comparison, recall that the Gaussian density with mean parameter and variance has the following form:
The KL divergence for two Gaussian distributions is
(14) 
The formulas (13) and (14) above are for the KL divergence between onedimensional Laplace or Gaussian distributions. It is straightforward to extend them to, say, dimensional product distributions, corresponding to random vectors with independent components, as in this case the KL is equal to the sum of the KL divergences of the components.
6 Computing the risk certificates
After optimising the network weights through the previously presented training objectives we need to evaluate the risk certificate using the PACBayeskl theorem, following the procedure of Langford and Caruana (2001). To do so, we describe first how to invert the binary KL from Eq. (3). For and , let :
be the “inverse” of the binary entropy with respect to the second argument. This is easily seen to be welldefined. Furthermore, holds precisely when .
is needed for computing an upper bound on based on the PACBayeskl bound: For any , with probability at least over size random samples we have:
6.1 Estimating the empirical loss via Monte Carlo sampling
is also needed to estimate the empirical term by random weight sampling: If are i.i.d. and is the empirical distribution, then for any , with probability at least we have (see Langford and Caruana (2001, Theorem 2.5)), hence by the inversion formula:
This expression can be applied to upperbound or by setting the underlying loss function to be the 01 (classification) loss or the crossentropy loss, respectively.
The latter expression also can be combined with any of the PACBayes bounds in Section 3 to upperbound the loss under the datadependent posterior by a computable expression, we illustrate this with the classical PACBayes bound: For any , with probability at least over size random samples and size weight samples we have:
In our experiments we evaluate the risk certificates (risk upper bounds) for the crossentropy loss () and the 01 loss (), respectively, computed using the PACBayeskl bound and Monte Carlo weight sampling.
7 Experimental results
We performed a series of experiments on MNIST and CIFAR10 to thoroughly investigate the properties of the training objectives presented before. Specifically, we empirically evaluate the two proposed training objectives and of Eq. (8) and Eq. (9), and compare these to of Eq. (10) and of Eq. (7). When possible, we also compare to empirical risk minimisation with dropout (). In all experiments, training objectives are compared under the same conditions, i.e. weight initialisation, prior, optimiser (vanilla SGD with momentum) and network architecture. The code for our experiments is publicly available^{10}^{10}10Code available at https://github.com/mperezortiz/PBB in PyTorch.
7.1 Prior distribution over weights
We studied Gaussian and Laplace distributions over the model weights (details in appendix). The posterior distribution is the same kind as the prior in each case.
We also tested in our experiments both datafree random priors (with randomness in the initialization of the weights) and datadependent priors. In both cases, the center parameters of the prior were initialised randomly from a truncated centered Gaussian distribution with standard deviation set to , where is the dimension of the inputs to a particular layer, truncating at standard deviations. The main difference between our datafree and datadependent priors is that, after initialisation, the center parameters of datadependent priors are optimised through ERM on a subset of the training data (50% if not indicated otherwise). This means the posterior center will be initialised at the empirical risk minimiser. Similar approaches for building datadependent priors have been considered before in the PACBayesian literature (Lever et al., 2013; ParradoHernández et al., 2012; Dziugaite and Roy, 2018). The data used for building the prior is used as well for learning the posterior, but not to evaluate the final risk bound, so as to avoid needing differentially private arguments to justify learning the prior (Dziugaite and Roy, 2018). As opposed to datadependent priors, in the case of datafree priors we simply use the initial random weights as center parameters. After choosing the center parameter of the prior, the scale parameters is set to the constant scale hyperparameter. The posterior is always initialised at the prior (both center and scale parameters). We find in our experiments that the prior can be overfitted easily. To avoid this, we use dropout during the learning process (exclusive to learning the prior, not the posterior).
7.2 Experimental setup
All risk certificates were computed using the the PACBayeskl theorem with and and Monte Carlo model samples. The same confidence was used in all the PBB training objectives (, , ).
7.2.1 Hyperparameter selection
For all experiments we performed a grid search over all hyperparameters and selected the run with the best risk upper bound on 01 error. We elaborate more on the use of PACBayes bounds for model selection in the next subsection. We did a grid sweep over the prior distribution scale hyperparameter (i.e. standard deviation) with values in . We observed that higher variance values lead to instability during training and lower variance does not explore the weight space. For the SGD with momentum optimizer we performed a grid sweep over learning rate in and momentum in . We found that learning rates higher than caused divergence in training and learning rates lower than converged slowly. We also found that the best optimiser hyperparameters for building the datadependent prior differ from those selected for optimising the posterior. Because of this, we also performed a grid sweep over the learning rate and momentum used for learning the datadependent prior (testing the same values as before). The dropout rate used for learning the prior was selected from . All training objectives derived from PACBayes bounds used the ‘bounded crossentropy’ function as surrogate loss during training, for which we enforced boundedness by restricting the minimum probability (see Section 5). We observed that the value performed well. Values higher than distorts the input to loss function and leads to higher training loss. The lambda value in was initialised to 1.0 and optimized using alternate minimization using SGD with momentum, using the same learning rate and momentum than for the posterior optimisation. Notice that requires an additional sweep over a KL tradeoff coefficient, which was done with values in , see Blundell et al. (2015).
For ERM, we used the same range for optimising the learning rate, momentum and dropout rate. However, given that in this case we do not have a risk certificate we need to set aside some data for validation and hyperparameter tuning. We set 4% of the data as validation in MNIST (2400 examples) and 5% in the case of CIFAR10 (2500 examples). This is the first example of how PACBayes bounds could be a first step in the direction of selfbounding learning, showing in this case that, as opposed to ERM, PACBayes inspired training objectives do not need a validation set.
7.2.2 Predictors and metrics reported
For all methods, we compare three different prediction strategies using the final model weights: i) stochastic predictor, randomly sampling fresh model weights for each test example; ii) deterministic predictor, using exclusively the posterior mean; iii) ensemble predictor, as done by Blundell et al. (2015), in which majority voting is used with the predictions of a number of model weight samples, in our case 100. We report the test cross entropy loss (xe) and 01 error of these predictors. We also report a series of metrics at the end of training (train empirical risk using crossentropy and 01 error and KL divergence between posterior and prior) and two risk certificates ( for crossentropy loss and for 01 loss).
7.2.3 Architectures
For MNIST, we tested both a fully connected neural network (FCN) with 4 layers (including input and output) and 600 units per layer, as well as a convolutional neural network (CNN) with 4 layers (two convolutional, two fully connected). For the latter, we learn a distribution over the convolutional kernels. We trained our models using the standard MNIST dataset split of 60000 training examples and 10000 test examples. For CIFAR10, we tested three convolutional architectures (one with a total of 9 layers with learnable parameters and the other two with 13 and 15 layers) with standard CIFAR10 data splits. ReLU activations were used in each hidden layer for both datasets. Both for learning the posterior and the prior, we ran the training for 100 epochs (however we observed that methods converged around 70). We used a training batch size of for all the experiments.
7.3 Hyperparameter tuning through PACBayes bounds
We show now that PACBayes bounds can be used not only as training objectives to guide the optimisation algorithm but also for model selection. Specifically, Figure 1 compares the PACBayeskl bound for crossentropy and 01 losses (xaxis) to the test 01 error for the stochastic predictor (yaxis) for more than 600 runs from the hyperparameter grid search performed for with a CNN architecture and a datadependent Gaussian prior on MNIST. We do a grid search over 6 hyperparameters: prior scale, dropout rate, and the learning rate and momentum both for learning the prior and the posterior. To appreciate a larger range of performance values (thus avoiding only showing the risk and performance for relatively accurate classifiers) we use here a reduced training set for these experiments (i.e. 10% of training data from MNIST). The test set is maintained. The results show a clear relationship between the risk certificate and test set 01 error, especially for the risk upper bound of the 01 error, as expected. While the plots also show heterokedasticity (there is a noticeable increase of variability towards the right side of the xaxis) the important thing is that for small error values the corresponding values of the risk certificate are reasonable stable. It is worth keeping in mind, however, that bounds generally get weaker with higher error values.
Motivated by the results in the plots, where it is shown that the bound could potentially be used for model selection, we use the risk certificate of the 01 loss (evaluated as explained in Section 6) for hyperparameter tuning in all our subsequent experiments. Note that the advantage in this case is that we do not need a validation set.
7.4 Comparison of different training objectives and priors
We first present a comparison of the four considered training objectives on MNIST using Gaussian weight distributions. Table 1 shows the results for the two architectures previously described for MNIST (FCN and CNN) and both datafree and datadependent priors (referred to as Rand.Init. and Learnt, respectively). We also include the results obtained by standard ERM using the crossentropy loss, for which part of the table can not be completed (e.g. risk certificates). The last column of the table shows the test 01 error of the prior predictor. We also report the test performance for the stochastic predictor (column named Stch. pred.), the posterior mean deterministic predictor (column named Det. pred.) and the ensemble predictor (column named Ens pred.).
An important note is that we used the risk certificates for model selection for all training objectives, including (with the only exception of ). The KL tradeoff coefficient included in (Blundell et al., 2015) relaxes the importance given to the prior in the optimisation, but obviously not in the computation of the risk certificate, which in practice means that larger KL killing coefficients will lead to worse risk certificates. Because of this, in all cases, the model selection strategy chose the lowest value for the KL killing coefficient (0.1) for , meaning there are cases in which obtained better test performance than the ones we report in this table, but much looser risk certificates. We present more experiments on this in the next subsection where we experiment with the KL killing trick.
Setup  Risk cert. & Train metrics  Stch. pred.  Det. pred.  Ens. pred.  Prior  
Arch.  Prior  Obj.  KL/n  xe  01 err.  xe  01 err.  xe  01 err.  01 err.  
FCN  Rand.Init. (Gaussian)  .2033  .3155  .1383  .0277  .0951  .0268  .0921  .0137  .0558  .0007  .0572  .8792  
.2326  .3275  .1856  .0218  .0742  .0211  .0732  .0077  .0429  .0004  .0448  .8792  
.1749  .3304  .0810  .0433  .1531  .0407  .1411  .0204  .0851  .0009  .0868  .8792  
.5163  .5516  .6857  .0066  .0235  .0088  .0293  .0038  .0172  .0003  .0178  .8792  
Learnt (Gaussian)  .0146  .0279  .0010  .0092  .0204  .0084  .0202  .0032  .0186  .0002  .0189  .0202  
.0201  .0354  .0054  .0073  .0178  .0082  .0196  .0071  .0185  .0001  .0185  .0202  
.0141  .0284  .0001  .0115  .0247  .0101  .0230  .0089  .0189  .0002  .0191  .0202  
.0788  .0968  .0704  .0025  .0090  .0063  .0179  .0066  .0153  .0001  .0153  .0202  
        .0004  .0007      .0101  .0152        
CNN  Rand.Init. (Gaussian)  .1453  .2165  .1039  .0157  .0535  .0143  .0513  .0062  .0257  .0003  .0261  .9478  
.1583  .2202  .1256  .0126  .0430  .0109  .0397  .0056  .0207  .0003  .0211  .9478  
.1260  .2277  .0622  .0273  .0932  .0253  .0869  .0111  .0425  .0006  .0421  .9478  
.3400  .3645  .3948  .0034  .0120  .0039  .0154  .0016  .0088  .0001  .0092  .9478  
Learnt (Gaussian)  .0078  .0155  .0001  .0058  .0127  .0045  .0104  .0003  .0105  .0001  .0104  .0104  
.0095  .0186  .0010  .0051  .0123  .0044  .0106  .0047  .0098  .0000  .0100  .0104  
.0083  .0166  .0000  .0064  .0139  .0049  .0123  .0048  .0103  .0001  .0103  .0104  
.0447  .0538  .0398  .0012  .0042  .0040  .0104  .0043  .0082  .0002  .0082  .0104  
        .0003  .0004      .0081  .0092       
The findings from the experiments in Table 1 are as follows: i) achieves consistently better risk certificates for 01 error in all experiments, providing as well better test performance than . ii) is the best PACBayes inspired objective in terms of test performance, although the risk certificates are generally less tight. iii) In most cases, the stochastic predictor does not worsen the performance of the prior predictor, improving it very significantly for random datafree priors (i.e. Rand.Init). iv) The mean of the weight distribution is also improved, as shown by comparing the results of the deterministic predictor (Det. pred.) to the prior predictor. The ensemble predictor also generally improves on the prior. v) The improvements brought by datadependent priors (labelled as “Learnt” in the table) are consistent across the two architectures, showing better test performance and risk certificates (although the use of datafree priors still produced nonvacuous risk certificates). vi) The application of PBB is successful not only for learning fully connected layers but also for learning convolutional ones. The improvements in performance and risk certificates that the use of a CNN brings are also noteworthy. vii) The proposed PACBayes inspired learning strategies show competitive performance when compared to stateoftheart and (specially when using datadependent priors) while providing tight risk certificates.
We now compare our results to those reported before in the literature for MNIST. Note that in this case there are differences regarding optimiser, prior chosen and weight initialisation. The neural network architecture used is the same (FCN). Dziugaite and Roy (2018) implemented a version of and the bound of Lever et al. (2013) for comparison. We compare the results reported by them with the results of training with our two training objectives and , and with (optimized as per our and ). The hyperparameter in both Dziugaite and Roy (2018) and Lever et al. (2013) controls the temperature of a Gibbs distribution. In the table we display only the two values of their parameter which achieve best test error and risk certificate. These results are presented in Table 2. We note that Dziugaite and Roy (2018)’s best values correspond to test accuracy of 94% or 93% while in those cases their risk certificates (0.650 or 0.350, respectively), although nonvacuous, were far from tight. On the other hand the tightest value of their risk bound (0.21) only gives an 88% accuracy. In contrast, our PBB methods achieve close to 98% test accuracy (or 0.0202 test error). At the same time, as noted above, our risk certificate (0.0279) is much tighter than theirs (0.210), meaning that our training scheme (not only training objectives but also prior) are a significant improvement with respect to previously computed nonvacuous bounds. Even more accurate predictors and tighter bounds are achieved by the CNN architecture in Table 1.
Method  Stch. Pred. 01 Err  Risk certificate 

D&R 2018 ()  0.120  0.2100 
D&R 2018 ()  0.060  0.6500 
Lever et al. 2013 ()  0.120  0.2600 
Lever et al. 2013 ()  0.060  1.0000 
pb_quad  0.0202  0.0279 
pb_lambda  0.0196  0.0354 
pb_classic  0.0230  0.284 
7.5 KL killing trick
As many works have pointed out before (and we have observed in our experiments), the problem with all the four presented training objectives is that the KL term tends to dominate and most of the work in training is targeted at reducing it, which effectively means often the posterior cannot move far from the prior. To address this issue, distributiondependent (Lever et al. (2013)) or datadependent (Dziugaite and Roy (2018)) priors have been used in the literature, which we also do in this work. Another approach to solve this is to add a coefficient that controls the influence of the KL in the training objective (Blundell et al., 2015). This means that in the case of we could see marginal decrease in the KL divergence during the course of training (specially given small KL killing coefficients) and the solution it returns is expected to be similar to that returned simply using ERM with crossentropy. However, this also has its effects on the risk certificate. To show these effects, we run all four training objectives with a KL penalty of 0.0001 during training and report the results in Table 3. For simplicity, only a CNN architecture is considered in this experiment. What we can see comparing these results to the ones reported in Table 1 is that while the 01 error for the stochastic classifier decreases, the KL term increases and so does the final risk certificate. Practitioners may want to consider this tradeoff between performance and tight risk certificates.
Setup  Risk cert. & Train metrics  Stch. pred.  Det. pred.  Ens. pred.  Prior  

Arch. & Prior  Obj.  KL/n  xe  01 err.  xe  01 err.  xe  01 err.  01 err.  
CNN
Rand.Init (KL killing) 
.2292  .2824  .2174  .0097  .0330  .0084  .0305  .0042  .0193  .0002  .0201  .9478  
.2840  .3241  .3004  .0066  .0225  .0058  .0222  .0039  .0144  .0002  .0148  .9478  
.2297  .2846  .2167  .0101  .0344  .0096  .0343  .0047  .0208  .0002  .0216  .9478  
.4815  .4974  .6402  .0024  .0082  .0035  .0107  .0024  .0082  .0000  .0079  .9478  
CNN
Learnt (KL killing) 
.0191  .0296  .0104  .0030  .0087  .0033  .0101  .0000  .0095  .0000  .0096  .0104  
.0245  .0354  .0162  .0025  .0076  .0031  .0092  .0040  .0092  .0000  .0095  .0104  
.0187  .0296  .0100  .0031  .0089  .0037  .0106  .0043  .0095  .0001  .0095  .0104  
.0470  .0557  .0421  .0012  .0041  .0034  .0096  .0025  .0085  .0001  .0083  .0104 
7.6 Laplace weight distributions
We experimented with both Laplace and Gaussian priors. The results are presented in Table 4. Comparing these to the results with Gaussian weight distributions from Table 1, we did not observe significant and consistent differences in terms of risk certificates and test set error between the two priors. The distribution to use could be problemdependent, but we found that both Gaussian and Laplace distributions achieve good risk certificates and test set performance.
Setup  Risk cert. & Train metrics  Stch. pred.  Det. pred.  Ens. pred.  Prior  

Arch.  prior  Obj.  KL/n  xe  01 err.  xe  01 err.  xe  01 err.  01 err.  
CNN  Rand.Init. (Laplace)  .1548  .2425  .1024  .0207  .0709  .0190  .0677  .0113  .0429  .0004  .0436  .9478  
.1844  .2540  .1489  .0147  .0496  .0131  .0461  .0096  .0310  .0003  .0312  .9478  
.1334  .2489  .0610  .0322  .1101  .0296  .1014  .0208  .0719  .0007  .0695  .9478  
.4280  .4487  .5385  .0031  .0107  .0038  .0139  .0006  .0096  .0001  .0090  .9478  
Learnt (Laplace)  .0085  .0167  .0004  .0056  .0126  .0043  .0098  .0011  .0103  .0001  .0103  .0104  
.0119  .0216  .0025  .0049  .0118  .0041  .0106  .0052  .0103  .0003  .0100  .0104  
.0076  .0155  .0000  .0060  .0131  .0046  .0107  .0015  .0105  .0001  .0106  .0104  
.0737  .0866  .0673  .0019  .0062  .0031  .0092  .0013  .0093  .0001  .0091  .0104 
7.7 CIFAR10 with larger architectures
We evaluate now our training objectives on CIFAR10 using deep CNN architectures. Note that this is a much larger scale experiment than the ones presented before (15 layers with learnable parameters vs 4). As far as we know, we are the first to evaluate PACBayes inspired training objectives in such deep architectures. The results are presented in Table 5 for three architectures (with 9, 13 and 15 layers, with around 6M, 10M and 13M parameters, respectively). Note, however, that the number of parameters is doubled for our probabilistic neural networks. We also experiment with using different amount of data for learning the prior: 50% and 70%, leaving respectively 25.000 and 15.000 examples to evaluate the bound. The conclusions are as follows: i) In this case, the improvements brought by learning the posterior through PBB with respect to the prior are much better and generally consistent across all experiments (e.g. 2 points in test 01 error for when using 50% of the data for learning the prior). ii) Risk certificates are also nonvacuous and tight (although less than for MNIST). iii) We validate again that shows better test performance but less tight risk certificates. iv) In this case, seems to provide slightly tighter bounds than , but worse test performance. The tighter bounds can be explained by our findings with the Pinsker inequality, which makes tighter when true loss is more than 0.25. v) Obtained results with 15 layers are competitive, achieving similar performance than those reported in the stateoftheart for VGG16 (deep network proposed for CIFAR10 with 16 layers). vi) The results indicate that 50% of the training data is not enough in this dataset to build a competitive prior and this influences the test performance and the risk certificates. The results with 70% of the data are, however, very close to those achieved by ERM across all three architectures. vii) Similarly than with the rest of the experiments, a major difference can be seen when comparing the risk certificate achieved by with the risk certificate achieved by PACBayes inspired training objectives. viii) Finally, it is noteworthy how the KL gets smaller as we move to deeper architectures (specially from 9 to 13 layers), which is an interesting observation, as there are many more parameters used in the computation of the KL. This indicates that the posterior in deeper architectures stays much closer to the prior. We believe this may be because in a higherdimensional weight space, the weight updates have a smaller euclidean norms, hence the smaller KL.
Setup  Risk cert. & Train metrics  Stch. pred.  Det. pred.  Ens. pred.  Prior  
Arch.  Prior  Obj.  KL/n  xe  01 err.  xe  01 err.  xe  01 err.  01 err.  
CNN
(9 layers) 
Learnt (50% data)  .1296  .3034  .0089  .0868  .2428  .0903  .2452  .0726  .2439  .0024  .2413  .2518  
.1742  .3730  .0611  .0571  .2108  .0689  .2307  .0609  .2225  .0018  .2133  .2518  
.1173  .2901  .0035  .0903  .2511  .0931  .2537  .0952  .2437  .0025  .2332  .2518  
.8096  .8633  1.5107  .0239  .0926  .0715  .2198  .0735  .2160  .0017  .2130  .2518  
Learnt (70% data)  .1017  .2502  .0026  .0796  .2179  .0816  .2137  .0928  .2137  .0023  .2100  .2169  
.1414  .3128  .0307  .0630  .2022  .0708  .2081  .0767  .2061  .0021  .2049  .2169  
.0957  .2377  .0004  .0851  .2223  .0862  .2161  .0827  .2167  .0021  .2135  .2169  
.6142  .6965  .8397  .0212  .0822  .0708  .1979  .0562  .1992  .0019  .1944  .2169  
        .0355  .0552      .1400  .1946        
CNN
(13 layers) 
Learnt (50% data)  .0821  .2256  .0042  .0577  .1874  .0585  .1809  .0519  .1788  .0011  .1783  .1914  
.1163  .2737  .0272  .0491  .1741  .0516  .1740  .0466  .1726  .0015  .1690  .1914  
.0757  .2127  .0009  .0635  .1936  .0622  .1880  .0592  .1810  .0017  .1816  .1914  
.6787  .7566  .9999  .0250  .0924  .0505  .1676  .0422  .1646  .0011  .1614  .1914  
Learnt (70% data)  .0659  .1832  .0015  .0519  .1608  .0517  .1568  .0421  .1553  .0010  .1546  .1587  
.0896  .2177  .0145  .0449  .1499  .0479  .1541  .0604  .1522  .0011  .1507  .1587  
.0619  .1758  .0002  .0548  .1644  .0541  .1588  .0605  .1578  .0013  .1557  .1587  
.4961  .5858  .5826  .0213  .0772  .0487  .1508  .0532  .1495  .0016  .1461  .1587  
        .0576  .0810      .0930  .1566        
CNN
(15 layers) 
Learnt (50% data)  .0867  .2174  .0053  .0587  .1753  .0584  .1668  .0538  .1662  .0014  .1653  .1688  
.1217  .2707  .0304  .0494  .1661  .0506  .1618  .0417  .1639  .0015  .1622  .1688  
.0782  .1954  .0007  .0667  .1783  .0652  .1686  .0594  .1692  .0013  .1674  .1688  
.6069  .7066  .7908  .0287  .1073  .0468  .1553  .0412  .1530  .0012  .1517  .1688  
Learnt (70% data)  .0756  .1806  .0028  .0559  .1513  .0559  .1463  .0391  .1469  .0016  .1449  .1490  
.0922  .2121  .0133  .0486  .1477  .0500  .1437  .0507  .1449  .0012  .1438  .1490  
.0703  .1667  .0003  .0622  .1548  .0615  .1475  .0551  .1480  .0010  .1476  .1490  
.4481  .5572  .4795  .0259  .0947  .0455  .1413  .0395  .1405  .0008  .1409  .1490  
        .0208  .0339      .0957  .1413       
8 Conclusion and future work
In this paper we explored ‘PACBayes with Backprop’ (PBB) methods to train probabilistic neural networks with different weight distributions, priors and network architectures. The takehome message is that the training methods presented in this paper are derived from sound theoretical foundations and provide a simple strategy that comes with a performance guarantee at a relatively low extra cost in performance. This is an improvement over methods derived heuristically rather than from theoretically justified arguments, and over methods that do not include a risk certificate valid on unseen examples. Additionally, we empirically demonstrate the usefulness of datadependent priors for achieving competitive test performance and, importantly, for computing risk certificates with tight values.
The results of our experiments on MNIST and CIFAR10 have showed that these PBB objectives give predictors with competitive test set performance and with nonvacuous risk certificates that significantly improve previous results and can be used not only for guiding the learning algorithm and certifying the risk but also for model selection. This shows that PBB is a promising direction towards achieving selfbounding learning, since the values of the risk certificates output by the training methods are tight, i.e. close to the values of the test set error estimates. We also evaluated our training objectives on large convolutional neural networks (up to 15 layers and around 13M parameters). Our results showed risk certificates with values not as close to the test set error estimate as in the MNIST experiments but still nonvacuous and relatively tight (18% of risk certificate for a stochastic predictor that achieves 14.6% of test 01 error).
In future work we plan to test different covariance structures for the weight distribution and validate a more extensive list of choices for the weight distributions across a larger list of datasets. We finally plan to experiment how to approach the wellknown dominance of the KL term in the optimisation of these objectives. Datadependent priors seem like a promising avenue to do so. We will also explore deeper architectures.
References
 Achille and Soatto (2018) Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
 Aggarwal (2018) Charu C. Aggarwal. Neural networks and deep learning. Springer, 2018.
 Barber and Bishop (1998) David Barber and Christopher M Bishop. Ensemble learning for multilayer networks. In Advances in neural information processing systems, pages 395–401, 1998.
 Biggs and Guedj (2020) Felix Biggs and Benjamin Guedj. Differentiable PACBayes Objectives with Partially Aggregated Neural Networks. arXiv:2006.12228, 2020.
 Bissiri et al. (2016) Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In 32nd International Conference on Machine Learning, pages 1613–1622, 2015.
 Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
 Buntine and Weigend (1991) Wray L Buntine and Andreas S Weigend. Bayesian backpropagation. Complex systems, 5(6):603–643, 1991.
 Caruana et al. (2001) Rich Caruana, Steve Lawrence, and C Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408, 2001.
 Catoni (2007) Olivier Catoni. PACBayesian Supervised Classification: The Thermodynamics of Statistical Learning. arXiv:0712.0248, 2007.
 Dwork et al. (2015a) Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015a.
 Dwork et al. (2015b) Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the fortyseventh annual ACM symposium on Theory of Computing, pages 117–126. ACM, 2015b.
 Dziugaite and Roy (2017) Gintare Karolina Dziugaite and Daniel M. Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In UAI, 2017.
 Dziugaite and Roy (2018) Gintare Karolina Dziugaite and Daniel M Roy. Datadependent PACBayes priors via differential privacy. In Advances in Neural Information Processing Systems, pages 8430–8441, 2018.
 Dziugaite et al. (2020) Gintare Karolina Dziugaite, Kyle Hsu, Waseem Gharbieh, and Daniel M Roy. On the role of data in PACBayes bounds. arXiv:2006.10929, 2020.
 Fox and Roberts (2012) Charles W Fox and Stephen J Roberts. A tutorial on variational Bayesian inference. Artificial Intelligence Review, 38(2):85–95, 2012.
 Freund (1998) Yoav Freund. Self bounding learning algorithms. In Proceedings of the eleventh annual conference on Computational Learning Theory, pages 247–258. ACM, 1998.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
 Germain et al. (2009) Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PACBayesian learning of linear classifiers. In Proc. of the 26th International Conference on Machine Learning, pages 353–360. ACM, 2009.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
 Guedj (2019) Benjamin Guedj. A Primer on PACBayesian Learning. arXiv:1901.05353, 2019.
 HernándezLobato and Adams (2015) José Miguel HernándezLobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
 Hinton and van Camp (1993) Geoffrey E Hinton and Drew van Camp. Keeping neural networks simple. In International Conference on Artificial Neural Networks, pages 11–18. Springer, 1993.
 Jankowiak and Obermeyer (2018) Martin Jankowiak and Fritz Obermeyer. Pathwise Derivatives Beyond the Reparameterization Trick. arXiv:1806.01851, 2018.
 Keshet et al. (2011) Joseph Keshet, David McAllester, and Tamir Hazan. PACBayesian approach for minimization of phoneme error rate. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2224–2227. IEEE, 2011.
 Keshet et al. (2017) Joseph Keshet, Subhransu Maji, Tamir Hazan, and Tommi Jaakkola. Perturbation Models and PACBayesian Generalization Bounds. In Perturbations, Optimization, and Statistics, pages 289–309. MIT Press, 2017. URL http://u.cs.biu.ac.il/~jkeshet/papers/KeshetMaHaJa16.pdf.
 Lan et al. (2020) Xinjie Lan, Xin Guo, and Kenneth E Barner. PACBayesian Generalization Bounds for MultiLayer Perceptrons. arXiv:2006.08888, 2020.
 Langford and Blum (2003) John Langford and Avrim Blum. Microchoice bounds and self bounding learning algorithms. Machine Learning, 51(2):165–179, 2003.
 Langford and Caruana (2001) John Langford and Rich Caruana. (Not) bounding the true error. In Advances in Neural Information Processing Systems, pages 809–816, 2001.
 Langford and Seeger (2001) John Langford and Matthias Seeger. Bounds for averaging classifiers. Technical Report CMUCS01102, Carnegie Mellon University, 2001.