Isaac Scientific Publishing

Frontiers in Signal Processing

Adaptive and Power-Aware Resilience for Extreme-scale Computing

PP. 24–40. Pub. Date: July 10, 2017

DOI: 10.22606/fsp.2017.11004

Author(s)

  • Xiaolong Cui*
    Department of Computer Science, University of Pittsburgh, Pittsburgh, United States
  • Taieb Znati
    Department of Computer Science, University of Pittsburgh, Pittsburgh, United States
  • Rami Melhem
    Department of Computer Science, University of Pittsburgh, Pittsburgh, United States

Abstract

With concerted efforts from researchers in hardware, software, algorithms, and data management, HPC is moving towards extreme-scale, featuring a computing capability of a quintillion (10^18) FLOPS. As we approach this new era of computing, however, several daunting scalability challenges remain to be conquered. Delivering extreme-scale performance will require a computing platform that supports billion-way parallelism, necessitating a dramatic increase in the number of computing, storage, and networking components. At such a large scale, failure becomes the norm rather than the exception, driving the system to significantly lower efficiency with an unprecedented amount of power consumption. To tackle this challenge, we propose an adaptive and power-aware algorithm, referred to as Lazy Shadowing, as an efficient and scalable approach to achieving high levels of resilience, through forward progress, in extreme-scale, failure-prone computing environments. Lazy Shadowing associates with each process a “shadow” process that executes at a reduced rate, and opportunistically rolls each shadow forward to catch up with its leading process during failure recovery. Compared to existing fault tolerance methods, our approach can achieve a 20% energy saving with a potential reduction in solution time at scale.
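As a rough illustration of the mechanism summarized above, the toy Python simulation below pairs one main process with a shadow that executes at a reduced rate and promotes the shadow to full speed when the main fails. The workload size, rates, and per-tick failure probability are illustrative assumptions chosen for this sketch, not values or an implementation taken from the paper.

import random

WORKLOAD = 1000.0    # abstract units of work the main process must complete
MAIN_RATE = 1.0      # execution rate of the main process (work units per tick)
SHADOW_RATE = 0.5    # reduced execution rate of the shadow (assumed value)
FAIL_PROB = 0.001    # per-tick failure probability of the main (assumed value)

def run_pair(seed: int) -> int:
    """Simulate one main/shadow pair; return the number of ticks to finish."""
    rng = random.Random(seed)
    main_progress = 0.0
    shadow_progress = 0.0
    shadow_rate = SHADOW_RATE
    main_alive = True
    ticks = 0

    while max(main_progress, shadow_progress) < WORKLOAD:
        ticks += 1
        if main_alive:
            main_progress += MAIN_RATE
            shadow_progress += shadow_rate
            # If the main fails, the shadow is "rolled forward": it resumes
            # from its lagging state but now executes at full speed.
            if rng.random() < FAIL_PROB:
                main_alive = False
                shadow_rate = MAIN_RATE
        else:
            shadow_progress += shadow_rate

    return ticks

if __name__ == "__main__":
    times = [run_pair(seed) for seed in range(100)]
    print(f"mean completion time over 100 runs: {sum(times) / len(times):.1f} ticks")

Running the sketch shows the intended trade-off: in failure-free runs the shadow consumes less power by lagging behind, while after a failure the application still finishes without restarting from a checkpoint, at the cost of the shadow's remaining catch-up time.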

Keywords

Lazy Shadowing, extreme-scale computing, forward progress, reliability
