Correcting Soft Errors Online in Fast Fourier Transform

Date
2017
Language
English
Found At
ACM
Abstract

Many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors in the fast Fourier transform (FFT) offline, after the computation finishes, but none of the existing ABFT schemes detects soft errors online, before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it has higher numerical stability and better fault coverage.
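
As background for the abstract above, the following is a minimal sketch of the general checksum idea behind offline ABFT for the DFT. It is not the scheme from the paper. It relies only on the fact that the DFT matrix F is symmetric, so for any weight vector r the identity r · (F x) = (F r) · x holds and can be checked after the transform finishes. The function names `dft` and `checksum_ok`, the weights in `r`, and the tolerance `tol` are all hypothetical choices made for illustration; a naive DFT stands in for a real FFT routine.

```python
import cmath
import random


def dft(x):
    """Naive O(n^2) DFT, used here as a stand-in for a real FFT routine."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def checksum_ok(x, X, r, tol=1e-8):
    """Return True if r . X matches (F r) . x within a relative tolerance.

    The identity holds for an error-free transform because the DFT matrix
    F is symmetric. The tolerance value is an assumption of this sketch.
    """
    encoded = dft(r)  # F r; could be precomputed once per transform size
    lhs = sum(rk * Xk for rk, Xk in zip(r, X))        # checksum of the output
    rhs = sum(ck * xj for ck, xj in zip(encoded, x))  # checksum from the input
    scale = max(1.0, abs(lhs), abs(rhs))
    return abs(lhs - rhs) <= tol * scale


random.seed(0)
n = 16
x = [complex(random.random(), random.random()) for _ in range(n)]
r = [complex(k + 1) for k in range(n)]  # hypothetical weight vector

X = dft(x)
clean = checksum_ok(x, X, r)          # check on an error-free output

X_bad = list(X)
X_bad[5] += 1.0                       # simulate a soft error in one output element
corrupted = checksum_ok(x, X_bad, r)  # the check should now fail
```

In this sketch the check runs only once the whole transform is done, which is the offline behaviour the abstract contrasts against; the online scheme described in the paper detects errors before the computation finishes.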

Cite As
Liang, X., Chen, J., Tao, D., Li, S., Wu, P., Li, H., … Chen, Z. (2017). Correcting Soft Errors Online in Fast Fourier Transform. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 30:1–30:12). New York, NY, USA: ACM. https://doi.org/10.1145/3126908.3126915
Journal
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Type
Conference proceedings
Version
Author's manuscript