Fast and efficient black-box testing for AI cybersecurity


Deep Neural Networks (DNNs) are vulnerable to adversarial examples, which are malicious inputs designed to fool the model’s prediction. This poses a threat for autonomous driving, security systems, etc. For instance, one can put a sticker on a STOP sign to make the computer vision system of an autonomous car classify it as a TURN RIGHT sign. In our paper Sign Bits Are All You Need for Black-Box Attacks, published in ICLR 2020, we propose a new black-box attack that can be used for fast and efficient black-box testing of machine learning models before being deployed into production.

AI Cybersecurity: White-Box versus Black-Box Attacks

The literature of the adversarial attacks primarily divides into white-box and black-box settings. The white-box setting, while not the focus of this work, follows from the works of Biggio et al. (2013) and Goodfellow et al. (2015) who introduced the Fast Gradient Sign Method (FGSM), including several methods to produce adversarial examples for various learning tasks and threat perturbation constraints (Carlini and Wagner, 2017; Moosavi-Dezfooli et al., 2016; Hayes and Danezis, 2017; Al-Dujaili et al., 2018; Kurakin et al., 2017; Shamir et al., 2019). Chen et al. (2017) introduced a principled approach by using gradient based optimization. They employ finite differences, a zeroth-order optimization means, to estimate the gradient and then use it to design a gradient-based attack. While this approach successfully generates adversarial examples, it is expensive in how many times the model is queried. Ilyas et al. (2018) substitute traditional finite differences methods with Natural Evolutionary Strategies (NES) to obtain an estimate of the gradient.

Tu et al. (2018) provide an adaptive random gradient estimation algorithm that balances query counts and distortion, and introduces a trained auto-encoder to achieve attack acceleration. Our work contrasts with the general approach of these works in two ways:

  1. We focus on estimating the sign of the gradient and investigate whether this estimation suffices to efficiently generate adversarial examples.
  2. The above methods employ random sampling in constructing queries to the model while our construction is adaptive.

The Gradient Estimation Problem

The gradient estimation problem is  central to the process of generating adversarial examples. The majority of approach rely on estimating both the magnitude and sign of the gradient. This paper suggests a weaker version of the gradient estimation problem (estimating only its sign) and proposes an algorithm that adaptively constructs queries to recover the sign. The proposed approach outperforms many state-of-the-art black-box attack methods in terms of query complexity.

We exploit two concepts:

  1. Directional derivatives of a function (machine learning model’s accuracy in this context) can be estimated using finite differences using only two function evaluations (model queries in this context).
  2. The directional derivative is separable in the gradient’s coordinates (Property 1 in the paper) allowing us to use a simple sign flip/revert procedure to estimate a number of sign bits simultaneously. The joint estimation of several sign bits simultaneously results in less number of queries to the model (which makes the approach highly query-efficient). This sign bit gradient vector is then used in place of the true gradient in an FGSM-like procedure.

We started with the gradient estimation (both magnitude and sign) problem and asked if we can solve a sub-problem (hoping it will be easier): sign gradient estimation. This however transformed the problem from continuous optimization to discrete optimization perspective and resulting in an exponential search complexity 2^{n}. This challenge was overcome by employing the separability property of directional derivative (Property 1 in the paper).

Experiments were conducted on the three standard datasets in machine learning: MNIST, CIFAR10, and IMAGENET. Both \ell_\infty and \ell_2 perturbation threats were considered on naturally-trained as well as adversarially-trained models.


  • Adversarial examples: malicious inputs designed to fool a machine learning model’s prediction.
  • Black-box attack: the process of generating adversarial attacks without access to the model’s weights nor architecture.
  • Perturbation constraint: the amount of change allowed on an input to craft an adversarial example.

Please cite our work using the BibTeX below.

title={Sign Bits Are All You Need for Black-Box Attacks},
author={Abdullah Al-Dujaili and Una-May O'Reilly},
booktitle={International Conference on Learning Representations},
Close Modal