Scalable, fault-tolerant membership for group communication on HPC systems
Abstract (Summary)
VARMA, JYOTHISH S. Scalable, Fault-Tolerant Membership for Group Communication
on HPC Systems. (Under the direction of Associate Professor Dr. Frank Mueller).
Reliability is increasingly becoming a challenge for high-performance computing
(HPC) systems with thousands of nodes, such as IBM’s Blue Gene/L. A shorter mean-time-to-failure
can be addressed by adding fault tolerance to reconfigure working nodes to ensure
that communication and computation can progress. However, existing approaches fall short
in providing scalability and small reconfiguration overhead within the fault-tolerant layer.
This thesis presents a scalable approach to reconfigure the communication infrastructure
after node failures. We propose a decentralized (peer-to-peer) protocol that
maintains a consistent view of active nodes in the presence of faults. Our protocol shows
reconfiguration response times on the order of hundreds of microseconds using MPI over
Blue Gene/L and single-digit milliseconds using TCP over Gigabit Ethernet. The protocol
can be adapted to match the network topology to further increase performance. We
also verify experimental results against a performance model, which demonstrates the scalability
of the approach. Hence, the membership service is suitable for deployment in the
communication layer of MPI runtime systems.
Scalable, Fault-Tolerant Membership for Group Communication on HPC
Systems
by
Jyothish S. Varma
A thesis submitted to the Graduate Faculty of
North Carolina State University
in partial fulfillment of the
requirements for the Degree of
Master of Science in
Computer Science
Raleigh
2006
Approved By:
Dr. Tao Xie Dr. Vincent Freeh
Dr. Frank Mueller
Chair of Advisory Committee
Biography
Jyothish Varma was born on the 3rd of January 1983, in Kerala, India. He received
his Bachelor of Technology in Computer Science from Model Engineering College, Cochin,
India, in 2004. He opted to continue with his higher studies and joined North Carolina
State University in Fall 2004. With the defense of this thesis, he is receiving the degree
Master of Science in Computer Science from NCSU, in May 2006.
Bibliographical Information:
School: North Carolina State University
School Location: USA - North Carolina
Source Type: Master's Thesis
Keywords: North Carolina State University