International Journal of
Advanced Education and Research

VOL. 10, ISSUE 2 (2025)
Exploring FP8 floating-point format for computational efficiency in deep learning inference and training
Authors
Himanshu Sharma, Kamre Shriharsh, S Rishwanth Rao, Dr. Awwab Mohammad
Abstract
FP8 (8-bit floating point) is an emerging numerical format that aims to balance computational efficiency and precision in deep learning. Traditionally, FP32 and FP16 have been used for training because of their accuracy, while INT8 has been used for inference to save memory and compute. FP8 offers a different tradeoff: it has the same 8-bit storage cost as INT8 but keeps a floating-point representation, at the price of lower precision than FP16 or FP32. This paper investigates FP8's capabilities for inference and training, the structure of its two configurations (E4M3 and E5M2), and how each affects neural network performance. We compare FP8 with FP16 and INT8 and discuss the practical benefits and challenges of implementing it.
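The two configurations named in the abstract differ in how the seven non-sign bits are split between exponent and mantissa: E4M3 uses 4 exponent and 3 mantissa bits, E5M2 uses 5 and 2. The following is a minimal Python sketch of decoding a raw FP8 byte. It assumes conventions that the abstract does not state: an exponent bias of 7 for E4M3 and 15 for E5M2, IEEE-style infinities and NaNs for E5M2, and only the all-ones exponent-and-mantissa pattern reserved for NaN in E4M3. The function names are illustrative and are not taken from the paper.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int,
               ieee_specials: bool) -> float:
    """Decode one FP8 byte laid out as sign | exponent | mantissa.

    The bias values and special-value rules passed in by the wrappers
    below are assumptions, not details given in the abstract.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_max
    man = byte & man_max

    if ieee_specials and exp == exp_max:
        # Assumed E5M2 rule: all-ones exponent encodes inf (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == man_max:
        # Assumed E4M3 rule: only the all-ones pattern is NaN, no infinities.
        return math.nan

    fraction = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * fraction * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + fraction) * 2.0 ** (exp - bias)


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


if __name__ == "__main__":
    print("E4M3 largest finite:", decode_e4m3(0x7E))
    print("E5M2 largest finite:", decode_e5m2(0x7B))
    print("E4M3 smallest subnormal:", decode_e4m3(0x01))
    print("E5M2 smallest subnormal:", decode_e5m2(0x01))
```

Under these assumptions, the largest finite value is 448 for E4M3 and 57344 for E5M2. This reflects the tradeoff between the two layouts: E5M2 trades one bit of mantissa precision for a wider dynamic range.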
Pages: 39-43
How to cite this article:
Himanshu Sharma, Kamre Shriharsh, S Rishwanth Rao, Dr. Awwab Mohammad "Exploring FP8 floating-point format for computational efficiency in deep learning inference and training". International Journal of Advanced Education and Research, Vol 10, Issue 2, 2025, Pages 39-43