When Silicon Fails Silently: Characterizing Hardware-Induced Corruption in LLM Training

#REPP #electronics packaging #photonics packaging #SDC #Silent data corruption #LLM #AI

(19:06 + Q&A) Jeffrey Ma, Harvard University — As the scale of training large language models (LLMs) increases, one emergent failure mode is silent data corruption (SDC), where hardware produces incorrect computations without explicit failure signals. In this work, we summarize the first investigation of the impact of real-world SDCs on LLM training, presented in our ACL work [1]. In our investigation, we compare model training between healthy production nodes and unhealthy nodes exhibiting SDCs. With the help of a cloud computing platform, we access unhealthy nodes that were swept out of production by automated fleet management. Using deterministic execution via the XLA compiler and our proposed synchronization mechanisms, we isolate and analyze the impact of SDC errors on these nodes at three levels: each submodule computation, a single optimizer step, and a full training period. Our results reveal that the impact of SDCs on computation varies across unhealthy nodes. Although in most cases the perturbations from SDCs on submodule computations and gradients are relatively small, SDCs can lead models to converge to different optima with different weights, and can even cause spikes in the training loss.
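The comparison methodology described above — running the same deterministic computation on a healthy reference node and a suspect node, then checking the outputs bitwise — can be illustrated with a minimal NumPy sketch. This is not the authors' code: the `submodule` function, the fault-injection flag, and the comparison harness are hypothetical stand-ins for a deterministic transformer submodule and a node exhibiting SDC.

```python
import numpy as np

def submodule(x, w):
    # Stand-in for a deterministic transformer submodule computation.
    return np.tanh(x @ w)

def matches_reference(x, w, inject_fault=False):
    """Compare a node's submodule output against a healthy reference.

    With deterministic execution, any bitwise mismatch indicates
    silent data corruption. `inject_fault` simulates an SDC by
    perturbing one output element (hypothetical, for illustration).
    """
    reference = submodule(x, w)   # output from the "healthy" node
    observed = submodule(x, w)    # output from the node under test
    if inject_fault:
        observed = observed.copy()
        observed.flat[0] += 1e-3  # simulated silent corruption
    return np.array_equal(reference, observed)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 8))
print(matches_reference(x, w))                     # healthy node: outputs agree
print(matches_reference(x, w, inject_fault=True))  # corrupted node: mismatch flagged
```

The key design point mirrored here is that determinism makes bitwise equality a valid oracle: without it, benign nondeterminism (e.g., floating-point reduction order) would be indistinguishable from genuine corruption.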
Bio: Jeffrey Ma is a third-year CS PhD student at Harvard advised by Professor Vijay Janapa Reddi. His work bridges machine learning, large-scale distributed and software systems, and reliability, with a focus on both large-scale training and code-centric foundation models. Previously, in collaboration with AWS AI Labs, he studied silent data corruption in large-scale LLM training, with recent results accepted to ACL 2025 and IEEE IOLTS 2025.

Edited videos and slides from most of the REPP talks are available at https://attend.ieee.org/repp/?page_id=2110 
 Join our Dlist to hear about future REPPs: https://attend.ieee.org/repp/?page_id=361 
Organized by the IEEE Silicon Valley chapter of the Electronics Packaging Society: https://ieee.org/scveps 


