お知らせ

Switch Fault Tolerance in a Mirrored K-Ary N-Tree(情報科学部コンピュータ科学科 李 亜民 教授)

  • 2020年10月23日
お知らせ

→Google翻訳はこちら

2019年度に受賞・表彰を受けた教員の研究や受賞内容を紹介します。

李亜民教授は、「Best Paper Award」(CITS 2019)を受賞しました。

  • 論文「Switch Fault Tolerance in a Mirrored K-Ary N-Tree」

The fat-tree is one of the most commonly used topologies of the interconnection networks in current large-scale parallel computers. A k-ary n-tree is a fat-tree with multiple roots where k is the arity or the number of links of a switch that connects to the previous or next level; that is, the switch radix is 2k, and n is the number of levels. The compute nodes are connected only to the leaf switches. The fat-tree is a folded version of a Clos network. The Clos network is a multistage interconnection network (MIN) that uses small-scale crossbar as the building blocks. It was originally designed for non-blocking telecommunication circuit switching.

Fat-tree and Clos network provide high path diversity but meanwhile require a higher number of switches with a non-negligible wiring complexity and have a greater packet latency than other low-cost MINs. The increased switch cost and packet latency both stem from the need to route packets first to an arbitrary middle level or root switch and then to their ultimate destination.

A Mirrored K-Ary N-Tree (MiKANT) network is a variant of fat-tree aimed at reducing hardware cost and packet latency. MiKANT doubles the number of compute nodes of the fat-tree by adding a few switches and making all the switches have a same radix. Fig. 1 shows a MiKANT(k,n) with k=3 and n=3. Compared to the classical fat-tree and Clos network, MiKANT not only reduces the numbers of switches and links so that it can be implemented at lower hardware cost, but also makes the network average distance shorter for achieving higher communication performance.

Fig. 1 A Mirrored 3-ary 3-tree

As the scale of MiKANT becomes large, the probability of switch failure increases. A switch faulty means that the switch and all links connected to it cannot be used for passing the packet. Fault tolerance is one of the important issues of the interconnection networks for large-scale parallel computers. The purpose of fault tolerant routing is to find a routing path from source node to destination node with high probability in a system with switch or link faulty.

The basic idea of the switch fault tolerant routing is that in case using the port determined by the shortest path routing algorithm cannot forward the packet due to the switch faulty, we use other port(s) to try it.

(1) One-port algorithm: In the upward phase (from the source node to a nearest common ancestor switch of both the source and destination nodes), if the switch selected by shortest path algorithm is faulty, the next port to will be used to send the packet. If the switch connected to the next port is also faulty, we terminate the routing.

(2) All-port algorithm: All-port algorithm tries to use k ports to send packet. If all the neighbor switches linked by these ports are faulty, we terminate the routing.

(3) Go-back algorithm: In the upward phase, One-port or All-port algorithm changes port when a neighbor switch is faulty. Because in the downward phase (from the nearest common ancestor switch to the destination node) the path is deterministic, we cannot change the port. If a switch in the path is faulty in this phase, it is possible to go back to the upward phase to change the port.

The three algorithms described above generate routing paths that do not contain circles; we say that these algorithms are deadlock-free.

We have evaluated the performance of the shortest path, One-port, All-port, and Go-back algorithms through simulations. The network is a MiKANT(4,5) with 2,048 switches and 2,048 compute nodes. Some switches are marked as faulty randomly. For each routing, we randomly select two distinct compute nodes as the source and destination, respectively.

Fig. 2 shows the successful routing ratios of the algorithms with the switch faulty rate ranging from 0% to 100%. We can see that when the faulty rate reaches 70%, or 1,434 faulty switches, the successful routing ratio is near zero. Obviously, if the switch connected to the source node or destination node is faulty, for any algorithm it is impossible to perform the routing successfully. The curve labeled with "S/D switch faulty" shows the ratio of such a case.

Fig. 2 Successful routing ratio on switch faulty

Fig. 3 shows the average path length when the routing is successful. The reason why the average path length becomes short when the switch faulty rate is high is that in such high faulty rate, only the routing can be successful between two nodes that have a short distance between each other.

Fig. 3 Average path length on switch faulty

Fat-tree is widely used in recent designs of supercomputers. As a variant of fat-tree, the MiKANT network can achieve high performance at low implementation cost. The proposed three simple fault tolerant routing algorithms are deadlock-free and can find a path between source and destination nodes at high probability. The future research work may include the link fault tolerant routing in MiKANT and implementation of MiKANT on a chip.

法政大学情報科学部コンピュータ科学科

李 亜民 教授(LI Yamin)

LI Yamin Born in China in 1958. Graduated from the Department of Computer Science and Engineering, Tsinghua University, and got a Ph.D degree from the Graduate School of Computer Science and Technology, Tsinghua University. After working as an assistant and associate professor at Tsinghua University, and associate professor at the University of Aizu, he became a professor at the Department of Computer Science, the Faculty of Computer and Information Sciences, Hosei University in April 2000.

  • 所属・役職は、記事掲載時点の情報です。