Enhancing deep learning security through explainability and robustness

Date
2024
Authors
Kiourti, Panagiota
Abstract
The growing interest in deploying deep learning models in critical applications has raised concerns about their vulnerabilities, particularly to backdoor, or Trojan, attacks. These attacks train a network to respond maliciously to specially crafted trigger patterns in its inputs while otherwise exhibiting state-of-the-art performance. This thesis addresses the identification of such attacks in deep reinforcement learning (DRL), proposes a mitigation strategy that uses feature attribution methods to detect them in deployed classification networks, and introduces a new framework for evaluating the robustness of attribution methods. Firstly, TrojDRL is introduced as a tool for exploring and evaluating backdoor attacks on DRL agents. TrojDRL exploits the sequential nature of DRL and considers various gradations of the threat model. It introduces untargeted attacks on state-of-the-art actor-critic policy networks that can circumvent existing defenses built on the assumption that backdoors are targeted. TrojDRL shows that the attacks require poisoning as little as 0.025% of the training data. In contrast to existing work on backdoor attacks against classification models, this tool is a pioneering effort toward understanding the vulnerability of DRL agents. Secondly, this thesis presents MISA, a new approach for detecting Trojan triggers online, at inference time, after the model has been deployed. MISA uses feature attribution methods to explain the decision of a neural network. It defines misattributions to capture the anomalous manifestation of a Trojan activation in the feature attribution space: it first computes the input's attributions over different features and then statistically analyzes these attributions to ascertain the presence of a Trojan trigger. Across a set of benchmarks, MISA effectively detects Trojan triggers for a wide variety of trigger patterns, achieving 96% AUC for detecting Trojan-triggered images without any assumptions about the trigger pattern. Lastly, the robustness of feature attribution methods for deep neural networks is critically examined. This thesis challenges the current notion of attributional robustness, which largely ignores differences in the model's outputs, and introduces a new evaluation framework. The framework defines similar inputs differently from existing methods and introduces a novel method based on generative adversarial networks to generate these inputs, leading to a different definition of attributional robustness. The new robustness metric is comprehensively evaluated against existing metrics and state-of-the-art attribution methods. The findings highlight the need for a more objective metric that reveals the weaknesses of an attribution method rather than those of the neural network, thus providing a more accurate evaluation of the robustness of attribution methods.