Scalable, fault-tolerant membership for group communication on HPC systems

by Varma, Jyothish S., 1983-

Abstract (Summary)
VARMA, JYOTHISH S. Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems. (Under the direction of Associate Professor Dr. Frank Mueller.)

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM’s Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance that reconfigures the working nodes so that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer.

This thesis presents a scalable approach to reconfiguring the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows reconfiguration response times on the order of hundreds of microseconds using MPI over Blue Gene/L and single-digit milliseconds using TCP over Gigabit. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems.

Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems

by Jyothish S. Varma

A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Master of Science in Computer Science

Raleigh, 2006

Approved By: Dr. Tao Xie, Dr. Vincent Freeh, Dr. Frank Mueller (Chair of Advisory Committee)

Biography

Jyothish Varma was born on the 3rd of January 1983, in Kerala, India. He received his Bachelor of Technology in Computer Science from Model Engineering College, Cochin, India, in 2004. He opted to continue his higher studies and joined North Carolina State University in Fall 2004. With the defense of this thesis, he is receiving the degree of Master of Science in Computer Science from NCSU in May 2006.
Bibliographical Information:


School: North Carolina State University

School Location: USA - North Carolina

Source Type: Master's Thesis

Keywords: North Carolina State University


Date of Publication:

© 2009 All Rights Reserved.