Home Robust Machine Learning - Detection
Post
Cancel

Robust Machine Learning - Detection

Relevant papers: [J3, C4].

The following article examines the problem of Byzantine behavior in distributed learning setups from the lens of detection. The problem has been introduced in another article discussing our previously proposed method ByzShield. In this work, our objective is to initially attempt to detect which nodes behave adversarially or erroneously before resorting to gradient aggregation. We consider attack models ranging from strong ones: $q$ omniscient adversaries with full knowledge of the defense protocol that can change from iteration to iteration to weak ones: $q$ randomly chosen adversaries with limited collusion abilities which only change every few iterations at a time. Our algorithms rely on redundant task assignments coupled with the detection of adversarial behavior. Specifically, we consider clusters of $K$ workers, where each gradient task is assigned to $r$ of them.

We consider two different task assignment schemes, named Aspis and Aspis+. Aspis is a subset-based scheme and enjoys the most benefits in adversarial settings: omniscient, colluding adversaries, up to $q$ adversaries that can change at each iteration. However, Aspis requires large batch sizes (in the mini-batch SGD). It is well-recognized that large batch sizes often cause performance degradation in training. Accordingly, for this class of attacks, we present a different algorithm called Aspis+ that can work with much smaller batch sizes. For Aspis+, we use combinatorial designs to assign the gradient tasks to workers. Aspis+ is a good fit for clusters that suffer from non-adversarial failures that can lead to inaccurate gradients.

For Aspis, we devised a strong attack against the method and proved that it is optimal for distorting as many files as possible. Even in this adverse scenario, our method enjoys a reduction in the fraction of corrupted gradients of more than 90% compared with the prior state-of-the-art DETOX. A weaker variation of this attack is where the adversaries do not collude and act randomly. In this case, we demonstrate that the Aspis protocol allows for detecting all the adversaries. In both scenarios, we provide theoretical guarantees on the fraction of corrupted gradients; Figure 1 shows a simulation of the distortion fraction and compares it with the baseline method (that assigns one gradient to each worker) as well as DETOX, under optimal attacks.

Figure 1 Figure 1: Distortion fraction of optimal attacks for $(K,r)=(50,3)$ and comparison.

The detection mechanism of Aspis is based on clique-finding in appropriate graphs. In our setup, the edges of the utilized graphs encode the agreements among the workers. A clique in an undirected graph is defined as a subset of vertices with an edge between any pair of them. A maximal clique is one that cannot be enlarged by adding additional vertices to it. A maximum clique is one such that there is no clique with more vertices in the given graph. In our papers, we prove that the honest workers form a clique of size $K-q$. The clique containing the honest workers may not be maximal. However, it will have a size of at least $K-q$. Any worker with degree less than $K-q-1$ will not belong to a maximum clique, and we show that it can be eliminated immediately as a “detected” adversary. Let us consider a cluster of $K=7$ workers. An example of successful detection of the adversarial nodes $U_1$, $U_2$, and $U_3$ where the maximum clique is unique and of size $K-q = 4$ is in Figure 2. However, if there are multiple maximum cliques in the graph, the ambiguity as to which one is the honest set makes an accurate detection impossible; in this case, we resort to robust aggregation, i.e., the method of gradient filtering discussed in the previous article. An example of this case is shown in Figure 3.

Figure 2 Figure 2: Unique max-clique in the detection graph, detection succeeds.

Figure 3 Figure 3: Two max-cliques in the detection graph, detection fails.

We note that clique-finding is well-known to be an NP-complete problem (cf. [Karp, 1972]). Nevertheless, there are fast, practical algorithms with excellent performance on graphs even up to hundreds of nodes (we used that of [Etsuji et al., 2006]) that is implemented in the Python NetworkX package. Our extensive experimental evidence suggests that clique-finding is not a computation bottleneck for the size and structure of the graphs that Aspis uses. We have experimented with clique-finding on a graph of $K=100$ workers and $r=5$ for different values of $q$; in all cases, enumerating all maximal cliques took no more than $15$ milliseconds.

For weaker attacks considered for Aspis+, i.e., when the adversaries change only every few iterations, our results indicate that our scheme detects all adversaries within approximately $5$ iterations. The principal intuition of the Aspis+ detection approach is to iteratively keep refining a graph in which the edges encode the agreements of workers until the degree of some nodes falls below a value; we prove that the workers whose degree falls below $K-q-1$ are Byzantine.

We present top-1 classification accuracy experiments on the CIFAR-10 data set for various gradient distortion attacks coupled with choice/behavior patterns of the adversarial nodes. Under the most sophisticated distortion methods such as ALIE, the performance gap between Aspis/Aspis+ and other state-of-the-art methods is substantial, e.g., for Aspis, it is $43\%$ in the strong scenario (cf. Figure 4), and for Aspis+, $19\%$ in the weak scenario (cf. Figure 5).

Figure 4 Figure 4: ALIE distortion under optimal attack scenarios, coordinate-wise median-based defenses, CIFAR-10, K=15.

Figure 5 Figure 5: ALIE distortion under weak attack scenarios (random Byzantines), coordinate-wise median-based defenses, CIFAR-10, K=15.

This post is licensed under CC BY 4.0 by the author.

Trending Tags