As technological progress brings us ever closer to achieving exaFLOP executions in High-Performance Computing (HPC) systems, new challenges emerge for rendering executions sustainable: how to power, cool and control resources on HPC systems so as to keep them running at exascale performance levels for periods long enough to complete demanding computations. It is a certainty that future HPC systems will achieve exascale performance through massive parallelism employing millions of processor cores running billions of threads. At these scales, failures and errors will be frequent, with many instances occurring daily. This fact places resilience squarely as another major roadblock to sustainability. In this talk, I will argue that large computer systems, including exascale HPC systems, will ultimately be operated based on predictive computational models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing “nuts-and-bolts” operations. The breakthrough I am aiming for is the automatic control of future exascale systems based on predictive data-driven models. Prediction will enable anticipating anomalous states, such as those due to failures, as well as forecasting regular behavior for keeping power consumption in check. The predictive models will include not only the computer system as such, but also its geographical, socio-political and physical environment including power and cooling infrastructures.
Ozalp Babaoglu is Professor at the Department of Computer Science and Engineering, University of Bologna. He received a Ph.D. in 1981 from the University of California at Berkeley. Babaoglu’s virtual memory extensions to AT&T Unix as a graduate student at UC Berkeley became the basis for a long line of “BSD Unix” distributions. He is the recipient of the 1982 Sakrison Memorial Award, the 1989 UNIX International Recognition Award and the 1993 USENIX Association Lifetime Achievement Award for his contributions to the Unix system community and to Open Industry Standards. Before moving to Bologna in 1988, Babaoglu was an Associate Professor in the Department of Computer Science at Cornell University where he conducted research on distributed systems and fault tolerance. Since moving to Italy, he has been active in numerous European research projects in distributed computing and complex systems including BROADCAST, CABERNET, ADAPT and DELIS. In 2001 he co-founded the Bertinoro international center for informatics (BiCi). Since its inception, this “Italian Dagstuhl” has organized more than 150 prestigious scientific meetings/schools and has had thousands of young researchers from all over the world pass through its doors. In 2002 Babaoglu was made a Fellow of the ACM for his “contributions to fault-tolerant distributed computing, BSD Unix, and for leadership in the European distributed systems community”. From 2002 to 2005 he was the coordinator of the European Union Framework Five project BISON that resulted in seminal work on biology-inspired techniques applied to dynamic networks and on gossip-based distributed algorithms. In 2007, he co-founded the IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO) conference series; he has been a member of its Steering Committee since its inception and served as co-general chair for the 2007 and 2013 editions.
Since 2013, he has been on the Selection Committee for the ACM Heidelberg Laureate Forum, which brings together young researchers in Computer Science and Mathematics with Abel, Fields and Turing Laureates. He currently serves on the editorial board of ACM Transactions on Autonomous and Adaptive Systems. Previously, he served for two decades on the editorial boards of ACM Transactions on Computer Systems and Springer-Verlag Distributed Computing.