This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises or AI clouds, failure-driven restarts are inevitable, making them a major barrier to scaling AI’s impact.
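To put those figures in daily terms, the short sketch below converts the two mean-time-to-failure values quoted above into expected interruptions per day. It assumes failures arrive at a steady average rate, which is a simplification of ours and not something the cited research states; the function name is illustrative.

```python
def failures_per_day(mttf_hours: float) -> float:
    """Expected failures in 24 hours, assuming a constant average failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return 24.0 / mttf_hours


# MTTF figures quoted in the text (hours), keyed by cluster size in GPUs.
quoted_mttf = {1024: 7.9, 16384: 1.8}

for gpus, mttf in quoted_mttf.items():
    print(f"{gpus:>6} GPUs: about {failures_per_day(mttf):.1f} failures per day")
```

Under that assumption, the 1,024-GPU cluster sees roughly three failures a day and the 16,384-GPU cluster roughly thirteen.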

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
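A rough model of that cost, offered as a sketch and not as a measurement: if a failure lands at a uniformly random point between checkpoints, it discards on average half a checkpoint interval of work, plus whatever fixed time is spent reprovisioning and restarting. The checkpoint interval and restart overhead used below are hypothetical inputs chosen for illustration, not figures from Clockwork.io.

```python
def lost_hours_per_day(
    mttf_hours: float,
    checkpoint_interval_hours: float,
    restart_overhead_hours: float,
) -> float:
    """Estimate daily training time lost under checkpoint-restart.

    Assumes a constant failure rate and failures that fall uniformly
    between checkpoints (so half an interval is lost on average).
    """
    failures = 24.0 / mttf_hours
    lost_per_failure = checkpoint_interval_hours / 2.0 + restart_overhead_hours
    return failures * lost_per_failure


# Hypothetical inputs: hourly checkpoints, 30 minutes to restart.
# The 7.9-hour MTTF is the 1,024-GPU figure quoted earlier in the text.
estimate = lost_hours_per_day(
    mttf_hours=7.9,
    checkpoint_interval_hours=1.0,
    restart_overhead_hours=0.5,
)
print(f"Estimated lost time: {estimate:.1f} hours per day")
```

Shortening the checkpoint interval reduces the rollback term but adds checkpointing overhead of its own, which this simple model does not capture.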

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
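The three scenarios above can be read as a simple triage rule. The sketch below is purely illustrative: the event labels, the enum, and the function are hypothetical names of ours and do not represent the TorchPass API or its detection logic.

```python
from enum import Enum


class Migration(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Hypothetical event labels, grouped to mirror the scenarios listed above.
HARD_FAILURES = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_ACTIONS = {"maintenance", "patching", "rebalancing"}


def classify(event: str) -> Migration:
    """Map an event label to the migration scenario it would fall under."""
    if event in HARD_FAILURES:
        return Migration.UNPLANNED
    if event in EARLY_WARNINGS:
        return Migration.PREEMPTIVE
    if event in OPERATOR_ACTIONS:
        return Migration.PLANNED
    raise ValueError(f"unrecognized event: {event!r}")


for event in ("gpu_fault", "ecc_memory_error", "patching"):
    print(f"{event} -> {classify(event).value}")
```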

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
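The arithmetic behind that claim can be checked directly. The sketch below uses only the two figures quoted above (about three hours and ten minutes of lost time per day). Because the reduced figure is stated as "under ten minutes", ten minutes is treated as an upper bound, so the computed reduction is a lower bound.

```python
def reduction_fraction(before_minutes: float, after_minutes: float) -> float:
    """Fractional reduction in lost time going from `before` to `after`."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return 1.0 - after_minutes / before_minutes


before = 3 * 60   # approximately three hours per day, in minutes
after = 10        # upper bound: "under ten minutes"

print(f"Reduction of at least {reduction_fraction(before, after):.1%}")
# prints "Reduction of at least 94.4%"
```

A lower bound of 94.4% is consistent with the rounded 95% figure.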

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
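For a sense of scale, the headline figure can be normalized per GPU. This sketch simply divides the quoted $6 million across the 2,048 GPUs and across a full year of hours; it says nothing about how Clockwork.io derived the number, and the 8,760-hour year (round-the-clock operation) is our assumption. Since the release says "over $6 million", the results are lower bounds.

```python
ANNUAL_SAVINGS_USD = 6_000_000   # "over $6 million" quoted in the text
GPU_COUNT = 2_048                # deployment size quoted in the text
HOURS_PER_YEAR = 24 * 365        # assumption: cluster runs all year

savings_per_gpu = ANNUAL_SAVINGS_USD / GPU_COUNT
savings_per_gpu_hour = savings_per_gpu / HOURS_PER_YEAR

print(f"Per GPU per year: ${savings_per_gpu:,.0f}")
print(f"Per GPU-hour of capacity: ${savings_per_gpu_hour:.2f}")
```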

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork


Information contained on this page is provided by an independent third-party content provider. XPRMedia and this Site make no warranties or representations in connection therewith. If you are affiliated with this page and would like it removed please contact pressreleases@xpr.media
