VOL. 10, ISSUE 2 (2025)
Exploring FP8 floating-point format for computational efficiency in deep learning inference and training
Authors
Himanshu Sharma, Kamre Shriharsh, S Rishwanth Rao, Dr. Awwab Mohammad
Abstract
FP8 (8-bit floating point) is an emerging numerical format that promises a balance between computational efficiency and precision in deep learning. Traditionally, FP32 and FP16 have been used for training because of their accuracy, while INT8 has been used for inference to save resources. FP8 introduces a new tradeoff: it offers INT8-like efficiency with greater flexibility, because it remains a floating-point format despite its lower precision. This paper investigates FP8’s capabilities for inference and training, the structure of its two configurations (E4M3 and E5M2), and how they affect neural network performance. We compare FP8 with FP16 and INT8, demonstrating the practical benefits and challenges of implementing FP8.
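The abstract names the two FP8 configurations without giving their bit layouts. As a minimal sketch, assuming the paper follows the widely published FP8 convention (E4M3: 1 sign, 4 exponent, and 3 mantissa bits with bias 7, no infinities, and a single NaN mantissa pattern; E5M2: 1 sign, 5 exponent, and 2 mantissa bits with bias 15 and IEEE-style infinities and NaNs), the following Python decodes one FP8 byte into a float. The function name `decode_fp8` and its parameters are illustrative and are not taken from the paper.

```python
import math


def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode one FP8 byte (0..255) into a Python float.

    Assumed layouts (not stated in the abstract):
      E4M3: 1 sign | 4 exponent | 3 mantissa, bias 7, no inf, NaN = S.1111.111
      E5M2: 1 sign | 5 exponent | 2 mantissa, bias 15, IEEE-style inf/NaN
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError("fmt must be 'E4M3' or 'E5M2'")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_max
    man = byte & man_max

    # Special values differ between the two formats.
    if fmt == "E5M2" and exp == exp_max:
        return sign * math.inf if man == 0 else math.nan
    if fmt == "E4M3" and exp == exp_max and man == man_max:
        return math.nan

    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading 1 in front of the mantissa fraction.
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    print(decode_fp8(0x7E, "E4M3"))  # largest finite E4M3 value
    print(decode_fp8(0x7B, "E5M2"))  # largest finite E5M2 value
```

Under these assumptions the largest finite values are 448 for E4M3 (byte 0x7E) and 57344 for E5M2 (byte 0x7B). That is the tradeoff between the two configurations: E4M3 spends its extra bit on precision, E5M2 on range.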
Pages: 39-43
How to cite this article:
Himanshu Sharma, Kamre Shriharsh, S Rishwanth Rao, Dr. Awwab Mohammad. "Exploring FP8 floating-point format for computational efficiency in deep learning inference and training". International Journal of Advanced Education and Research, Vol 10, Issue 2, 2025, Pages 39-43.
