Summary: | Bayesian machine learning (ML) models have long been advocated as an important tool for safe artificial intelligence. Yet, little is known about their vulnerability to adversarial attacks. Such attacks aim to cause undesired model behaviour (e.g. misclassification) by crafting small perturbations to regular inputs that appear insignificant to humans (e.g. a slight blurring of image data). This fairly recent phenomenon has undermined the suitability of many ML models for deployment in safety-critical applications. In this thesis, we investigate how robust Bayesian ML models are against adversarial attacks, focussing on Gaussian process (GP) and Bayesian neural network (BNN) classification models. In particular, for GP classification models, we derive guarantees on their robustness against adversarial attacks, which facilitate the evaluation of their suitability for a given safety-critical application. Furthermore, we investigate whether better posterior approximations benefit the adversarial robustness of BNNs, comparing, in a range of experiments, the adversarial robustness resulting from high-quality posterior approximations with that resulting from deterministic approximations. We find that, for most popular priors, well-approximated BNNs tend to be empirically more susceptible to adversarial attacks than deterministic neural network models. This calls for caution when deploying such BNNs in safety-critical applications. Lastly, we show that, by using GPs in a Bayesian optimisation framework, it is possible to craft successful adversarial perturbations in black-box scenarios (where the attacked model is not known a priori and can only be studied by querying it) with fewer model queries than previously needed. This facilitates a better evaluation of the adversarial robustness of a model in black-box scenarios.