Description
Nearly one third of all data centers had an incident in 2018, up from 25 percent the year before, according to a recent survey by the Uptime Institute. The average cost of downtime was $260,000 per hour. Yet 80 percent of the incidents could have been prevented through robust detection systems. Today, data center incident detection typically relies on threshold-based techniques, which lead either to too many false alarms or to late detection of incidents.
We use machine learning and AI to detect incidents early while reducing false alarm rates. We estimate the statistical distributions of metrics across time and across metric subsets, and compare observed metric behaviour against these statistical baselines. We employ correlation engines, coupled with domain expertise, to identify causal relationships among metrics. We further use neural networks to characterize data center metric behaviour. Our systems are trained and customized for HPC systems with thousands of nodes and tens of thousands of processors.
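To make the idea of comparing a metric against a statistical baseline concrete, here is a minimal sketch. It is not the system described above. It assumes a single metric, a trailing-window baseline, and a simple z-score test. The function name `detect_anomalies`, the `window` and `z_threshold` parameters, and the sample temperature readings are all hypothetical and chosen only for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=20, z_threshold=3.0):
    """Return indices whose value deviates from a trailing-window baseline.

    The baseline for each point is the mean and sample standard deviation
    of the previous `window` readings (the point itself is excluded).
    Illustrative assumption: one metric, fixed window, z-score test.
    """
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        # Only score a point once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A constant baseline (sigma == 0) gives no usable z-score.
            if sigma > 0 and abs(x - mu) / sigma > z_threshold:
                flagged.append(i)
        history.append(x)
    return flagged


# Hypothetical node temperature readings with one injected spike.
readings = [60.0 + 0.5 * (i % 4) for i in range(40)]
readings[30] = 75.0
anomalies = detect_anomalies(readings)
print(anomalies)
```

With these made-up readings, only the injected spike at index 30 is flagged. Unlike a fixed threshold, the baseline here follows the recent behaviour of the metric, which is one simple way to trade off false alarms against detection delay. A production system would likely need per-metric tuning and would have to handle seasonality and correlated metrics.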