Notes on attacks against machine learning models - Welcome to my blog

Membership inference¶

Goal: Output 1 if sample belongs to the training data 0 otherwise.

Attempts:

- Train “shadow model” that is hopefully similar to the victim model.
- Train classifier on the confidence scores of shadow model to predict if input is a member
- Input random noise into model (are non-members) check distribution of confidence scores
- Use a suitably high value as the threshold to filter out non-members
In white box setting gradients can be used as input to classifier.
If no confidence scores are available, randomly change input until class changes. In general since a member should be far from the decision boundary, it should take longer. walks along the decision boundary
Multi exit networks are less vulnerable to standard attacks, but timing/exit information correlates highly with membership status

Data Reconstruction¶

Goal: Reconstruct (expensive) training data set

Attempts:

- Start with noise, using gradient information optimize input space by maximizing $1-f_k(x)$ . Hence the goal is to find an input $x$ that best represents the label $k$ .
- Not effective for complicated networks
In a black-box setting, train model that decodes the embeddings to likely inputs (inverse function). This inverse model needs to be trained from data that is similar. This can be done in a unsupervised way.

Model Stealing¶

Goal: Steal trained model. Can be used to make e.g. membership inference and adversarial attacks easier

In spirit this is the same as model distillation with black-box access to the teacher model.

Attempts:

If confidence scores are available.
- Query random noise, directly train on $(z, \mathop{\mathrm{arg\,max}}_i V_i(z)), z \sim \mathcal{N}(0, I)$
- Alternatively train additional generator $G(z)$ that produces input that are similar to the true data, makes it easier to learn and generalize.
- Since only black-box access, estimate gradients of victim model using finite differences.
- KL-Divergence has vanishing gradients, => Use L1 loss on logits
- To obtain logits from softmax scores of victim model uses $\log(V_i) - \frac{1}{k}\sum_{j=1}^k V_j$
- GAN like training, where discriminator forces generator to generate images that match the expected inputs, i.e. acts as a prior to e.g. produce images.
With labels only
- Train model with cross entropy on labels
- Generator is penalized to generate diverse samples from all classes, also those that are sparsely represented.

The generator makes it easier to learn the stolen model since the inputs more closely represent true dataset.

Hyperparameter stealing from plots¶

Train many models with hyperparamters configurations, train CNN on plots.
Shapes of t-SNE are able to distinguish models, density and perplextiy don’t really have a influence on accuracy
Training model on different dataset, compared to victim model is only a little better than guessing. 5% of actual data improves this by a lot
Density transfers over data, same for perplexity
Thresholding embeddings or adding noise to t-SNE coordinates is an effective defense. On loss plots smoothing works okayish
With this as data augmentation in the training process, defenses are no longer effective

Adversarial Attacks¶

White-box attakcs¶

FGSM¶

x = x+ \epsilon \text{sign}(\nabla_x L(f(x), y))

(1)

Takes a single step of size $\epsilon$ to stay in the bounding box.

PGD¶

x_t = \Pi(x + \alpha \text{sign}(\nabla_x L(f(x), y)))

(2)

Takes multiple steps of size $\alpha$ and projects back into the bounding box. Strongest 1. order attack.

DeepFool¶

\min || \delta ||_2 \quad \text{s.t.} f(x + \delta) \neq y

(3)

Directly optimizes that perturbation is small. Can also be used to quantify robustness of networks. If $\delta$ is large, the network is robust.

Harder to solve this optimization problem

while \text{ sign}(f(x_t)) = \text{sign}(f(x_0)) \: do \\ \delta_i = \frac{f(x_i)}{||\nabla f(x_t)||_2^2} \cdot \nabla f(x_t) \\ x_{t+1} = x_t + \delta_t

(4)

We assume that the output class is determined by the sign of the classifier, i.e. on which side of the decision boundary the sample is. This obviously only works for a binary classifier.

In the multi-class setting, the closest hyperplane to a possibly non-linear decision boundary for each of the k-classifiers is fitted, and we perturb the input such that it represents the projection of $x$ to the closest hyperplane (decision boundary) of another class.

Carlini-Wagner¶

\min ||\delta||_\infty \quad \text{ s.t. } [ \max_{y \neq t} f_y(x + \delta) - f_t(x + \delta)]^+

(5)

where $f_j$ are the logits of class $j$ . Hence this chooses a perturbation that maxmimizes the difference in the logits to the target class $t$ .

Black-box attacks¶

Transfer-based¶

Momentum iterative method¶

Train on a model that one has access to, then transfer to black-box target model. Using momentum when optimizing the perturbed input / adversarial input helps with transferability, since the linear assumption of FGSM is not realistic and iterative FGSM methods may get stuck in local minima.

g_{t+1} = \mu g_{t+1} + \frac{\nabla f(x_t)}{||\nabla f(x_t) ||_1} \\ x_{t+1} = x_t + \alpha \cdot \text{sign}(g_{t+1})

(6)

alternatively nesterovs momentum can be used

x' = x_t + \alpha \mu \cdot g_t \\ g_{t+1} = \mu g_t + \frac{\nabla f(x')}{||\nabla f(x') ||_1} \\ x_{t+1} = x + \alpha \cdot \text{sign}(g_{t+1})

(7)

Diverse Input and Translation Invariant Attack¶

DIM: Randomly resize and pad and use a sample from this transformation during the iterative process.
TIM: Average gradient over random translations of the image

x_{t+1} = x_t + \alpha \cdot \text{sign} \left( \mathbb{E}_T [\nabla f(Tx)] \right)

(8)

Specific architecture transfer¶

On e.g. ResNets using the gradients of skip-connections transfers better

Query-based¶

Consider first the hard label case: Assume we are given an adversarial image $x$ from the target class $y(x) = t$ . Clearly this is a weak adversarial sample, since it was sampled from another class and thus is visually very different. To overcome this issue iterative rejection sampling can be used.

In each step randomly perturb the image by e.g. adding noise. If after the perturbation the image is still classified as $t$ , keep the perturbation otherwise sample a new transformation. The size of the perturbation can be seen as the step-size in this iterative process.

The algorithm corresponds to walking along the decision boundary of the class until a certain distance threshold is reached, where distance is measured as the norm of the perturbation over all perturbations.

To make the algorithm efficient, after each step one additionally projects to a hypersphere around the image that is to be attacked, in order to force that the image gets visually closer.

If confidence scores are available, ZOO uses numerical gradients on each coordinate to perform iterative methods. They pick the coordinate randomly in each step. Other approaches such as NATTACK, learn a parameterized distribution using reinforcement learning that is likely to produce adversarial images.

Mitigate unsafe image generation¶

Options:

Remove unsafe images from training data
Block unsafe keywords in prompt
Better safety classifiers to detect and reject unsafe images

Image generation for hateful memes¶

In img-img models, hateful images can be identified by leveraging e.g. clip embeddings and measuring cosine similarity. Identify the “influencer” of an hateful image variant.

\mathop{\mathrm{arg\,max}}_i \cos(e_v - e_o, e_i)

(9)

where by semantic embeddings we assume that $e_v = e_o + e_i$ , i.e. the variant of the original hateful image is generated by the influencing image $i$ .

In text-img models, the same can be done using the clip-text embedding $\mathop{\mathrm{arg\,max}}_i \cos(e_o + t_i, e_i)$

Deepfakes¶

Fakes produced by GANs have more pronounced periodic patterns in Fourier domain.

Hybrid text-to-image classification¶

In text-to-image models, using the prompt in addition to the image as input to a classifier works very well. This can be attributed by the fact, that the generated image mainly adheres to the prompt, while a caption to a real image only contains part of the full details.

Quantitatively this can be measured by the cosine similarity in the clip embedding space, where the fake image is closer to the prompt than the real image.

Describing the environment in the prompt helps to overcome this classification method.

Prompt-based classification¶

Uses vision-language model BLIP to ask in a “zero”-shot fashion if the input image is fake. Not really zero-shot, a soft-prompt tuning approach is used, i.e. a special token is added to the vocabulary and only the embedding of this token is fine-tuned for the classification task. The approach uses two such tokens, one for the Q-Former in BLIP that embeds the question and one token for the LLM that answers the question.

DIT¶

Adds noise to the image and then uses a text-image model, or more specifically the denoising/diffusion process of such a model. For artificial images the output after denoising is almost indistinguishable from the input, while real images look significantly different. Conditioning the denoising process by a prompt changes the output for the real image to follow the prompt, while the artificial image basically ignores the prompt.

Backdoor attacks¶

BadNets¶

Misclassify poisoned inputs. Changing the architecture of the victim is in general not possible, on the other hand one can try and poison the dataset to include triggers that lead to misclassification. Requires no modification of the adversarial inputs during inference time.

Poisoned samples could e.g. have the wrong labels. The goal of the attacker is to minimize the classification error on these poisoned samples, such that the attack works after training.

TrojanNN¶

For a trigger TrojanNN tries to identify a neuron that fires for the specific trigger and then optimizes the trigger data to maximally activate this neuron. A internal neuron works better than directly optimizing a output neuron, since this keeps accuracy on clean samples. Typically the neuron with the heighest sum of absolute weights from the previous layer is chosen, i.e. the most connected neuron.

Invisible Backdoor Attack¶

f^* = \arg \min L(D_{train} \cup D_{train}^p)

(10)

where $D_{train}^p$ is the poisoned training dataset. To include the triggers this work uses steganography to hide it in the least significant bits of e.g. the colors of the image.

Alternatively they propose regularizing the trigger, in order to keep it small. This approach in general is easier, since the trigger is optimized to excite specific neurons and the regularization is simply an additional term in the cost function.

\arg \min ||A(\alpha)[I] - c A(\alpha_0)[I] || _2 + \lambda ||\alpha ||_p

(11)

where $A(\cdot)[I]$ is the activation of neurons specified by the index set $I$ and $\alpha_0$ is the initial trigger and $c$ is the scaling of the initial activation that should be reached by the trigger.

Backdooring distilled datasets¶

Hidding triggers in distilled datasets is easier, since the distilled inputs basically look like noise. The goal is to only influence the distillation process. To this end one can simply add poisoned samples with wrong labels before the distillation and hope they are enough to influence the model after distillation. However this has weak performance, since the triggers are not adapted to the distillation process.

To this end the trigger is updated in each distillation epoch, again by maximizing neuron activation. Triggers from earlier distillation epochs prove also effective in later epochs, even easier to hide trigger, since multiple exist.

BadEncoder¶

Poisoning of self-supervised encoder models. The idea is to force the model to map poisoned samples close to samples of the target class. However since the target class may change for different downstream tasks, the adversary needs to have in mind, what representative samples for different downstream tasks look like. For each attacked downstream task, a new trigger needs to be optimized.

The optimization of the model looks as follows:

L_0 = - \frac{\sum_{i=1}^t \sum_{j=1}^{r_i} \sum_{x \in D^p} sim(f'(x_{ij} + e_{i}), f'(x_{ij}))}{|D^p| \cdot \sum_{i=1}^t r_i}\\ L_1 = - \frac{\sum_{i=1}^t \sum_{j=1}^{r_i} sim(f'(x_{ij}), f(x_{ij}))}{\sum_{i=1}^t r_i} \\ L_2 = - \frac{1}{|D|} \sum_{x \in D} sim(f'(x), f(x))

(12)

The first loss quantifies that samples with the trigger should be close to representative samples of that task. The second loss measures how similar clean representative images of the poisoned encoder $f'$ are to the clean encoder $f$ . The third loss encourages that for any clean images the embeddings are close, not just for the representatives such as in $L_1$ .

BadNL¶

Poisoning language is hard since the space is discrete. Multiple options are proposed on where to place the trigger.

BadChar: Use invisible ASCII/UTF-8 control chars as triggers. Otherwise only alter words with a small edit distance
BadWord: Use a MLM model to get the embedding of the most likely word. Then interpolate between the embedding of the desired trigger and the most likely word and choose one of the top-k closest real words instead. To keep the sentence grammatically correct, only take neighbors with the same part-of-speech tag into account. Alternatively one could use e.g. the least frequent synonym, where synonym is determined by e.g. cosine similarity.
BadSentence: Replace sub-sentence directly with trigger sentence or change tense to e.g. “<verb> -> Will have been <verb>” or change voice e.g. from active to passive.

As an alternative using style transfer has also been explored. However the paper makes a architecture change to the model in order to also predict the style of the sentence as the attack is weak otherwise.

In order to attack pre-trained models, the trigger is optimized to be similar to the target class as is done during the BadEncoder attack.

L_1 = \sum_{x: y(x)=i} \sum_{x': y(x')=j} dist(f(x), f(x')) \\ L_2 = - \sum_{x: y(x)=t} \sum_{x' = x + \delta} dist(f(x), f(x'))

(13)

The first loss maximizes distance between classes to achieve high downstream accurcay, the second term minimizes distance between poisoned samples and the target class.

Defenses¶

Various defenses have been proposed.

Many attacks maximize the activiation of a specific neuron, analyzing neuron behaviour can identify poisoned models.
Training on clean data or distilling the model remove backdoors
Input transformations can be learned that remove potential triggers
Add random perturbations to the input, if output is not random, then input most likely contains a trigger since model is optimized to recognize the trigger.
Analyze eigenvectors of covariance matrix of the dataset. Poisoned samples correlate highly with the top eigenvector. Poisoned samples also can be seen as outliers in the embedding space.

SSLGuard¶

Protect a self-supervised model from model stealing and reprogramming by watermarking. The crucial insight is that random vectors are most likely to be orthogonal to each other. Thus a model decoding model is optimized to map embeddings in a verificationd dataset close to a random secret key. If a public model produces embeddings that are by this decoder mapped close to the secret key, it is likely that this public model is stolen.

Adversarial Reprogramming¶

Use victim model to compute your own function. Required encoding your inputs into the victim space and victim output to the task output.

This is done by finding a perturbation that encodes the new task. This perturbation can then be added to all inputs which need to be solved for the task.

Vision¶

\delta = \tanh(W \odot M)\\ x' = x + \delta

(14)

The mask $M$ is used to constrain the reprogramming trigger in the input. $W$ are the parameters of the trigger that can then be optimized.

W = \mathop{\arg\min_W} - \log P(h_g(y) | x') + \lambda ||W||_p^2

(15)

Here $h_g$ maps from the victim models output to the desired task output.

To better hide the trigger, it can be e.g. shuffled and hidden in another image

\delta = \alpha \tanh (\text{shuffle}(x)+ (W \odot \text{shuffle}(M)))\\ x' = \bar{x} + \delta

(16)

where $\bar{x}$ is any sample from the victim models dataset.

NLP¶

Discrete input space makes the problem harder. A idea is to leverage a context dependent vocabulary remapping. In other words for specific context window lengths, a mapping from the task vocabulary to the victim vocabulary is defined. The context window is then driven over the input sentence and summed up to define the mapping from the task to the victim vocabulary.

This yields a distribution over tokens. The goal is to optimize this distribution to optimally map the task to the victim input. Since argmax is not differentiable, gumbel-softmax is used to approximate which token should be taken. During training the temperature of the softmax is decreased such that the distribution converges to a one-hot encoding for which token should be chosen to optimally encode the task.

Model hijacking¶

By dataset poisoning¶

A camouflager model creates visually similar images by combining the reprogramming trigger with a image from the victim dataset.

min ||x_c - x_o||_1 + || F(x_c) - F(x_{adv}) ||_1 - ||F(x_c) - F(x_o)||_1

(17)

The first term encourages camouflaged images that are close to real images in the dataset. The second term makes sure that the embeddings of the adversarial input and the camouflaged input are similar. The third task makes sure that the embeddings of the camouflaged image and the original are sufficiently different in order to not compromise downstream performance.

NLP¶

Here the goal is to camouflage the output, since the input is most likely similar already.

Generate pseudo sentences. These will be used to poison the victim model for the adversarial input from the hijacking dataset. These can for example be generated by a public model that performs the same task.
Hijacking tokens are generated for each task label. Then the victim should after adversarial reprogramming/hijacking output the corresponding token.
Randomly insert/mask/replace word in each pseudo sentence with mask. Take hijacking token, that BERT proposes, and repeat multiple times.

As poisoned sample take the sentence that maximizes the semantic similarity and has the highest hijacked token count. After training on a dataset that contains these poisoned samples, the adversarial task can be performed by counting the weighted frequency of the hijacking tokens in the output.

Each token is weighted by how often it appears in the hijacking dataset, i.e. the pseudo sentences.

Posts

Regularization & Ill-Conditioned Problems

Posts

Divergence, Regularization and Variational Approaches