Role
Senior Product Designer
Year
2022-2024
Company
Internal Platform
Focus
Impact
Reduction in incident detection time
Achieved through relational visualization and unified dashboard.
Faster response time
Improved operational resilience through actionable dashboards.
The Challenge
Traditional observability tools monitor metrics: CPU load, latency, error rates. But a global fulfillment network doesn't fail in isolation. Thousands of sites, hundreds of technical teams, and thousands of interdependent services form a dependency graph with millions of connections. When something breaks, the question isn't "which metric spiked?" but "which dependency failed?" Before this platform, incident resolvers had no unified view of these relationships. They manually navigated 7 disparate data sources, including alarm consoles, graph databases, monitoring dashboards, network analyzers, system logs, and deployment trackers, while 6 additional metadata systems fed site context in the background. The real challenge wasn't monitoring; it was understanding how services depended on each other at scale.
"Currently if a Power or Dual WAN outage occurs at a site, we can't tell the difference if it's ISP or Power related."
The Solution
Working closely with the engineering team, we designed a single-pane dashboard that turns a complex dependency graph into something an incident resolver can read at a glance. The key design decision was to structure the interface around dependencies rather than individual metrics, so resolvers could trace a failure path instead of checking tools one by one. The system highlights the probable root cause automatically, and we built in an explainability framework for the ML-driven trust scores planned for future releases.
Relational Visualization
A visual language that maps how services depend on each other — making invisible relationships visible at a glance.
Probable Cause Engine
Automated root cause detection that traces the failure path and highlights where the chain breaks.
Unified Information Architecture
Aggregated alerts with consistent navigation patterns, reducing cognitive load across thousands of services.
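To make the idea behind the Probable Cause Engine concrete, here is a minimal sketch of tracing a failure path through a dependency graph. It is an illustration only, not the platform's actual engine: the function name `probable_root_causes`, the service names, and the rule "a probable cause is an unhealthy service whose own dependencies are all healthy" are assumptions made for this example.

```python
from collections import deque


def probable_root_causes(depends_on, unhealthy, start):
    """Follow unhealthy dependencies from `start` and return the services
    where the chain stops: unhealthy services with no unhealthy dependency.

    Hypothetical sketch; the real engine's rules are not described here.
    """
    if start not in unhealthy:
        return []

    roots = []
    seen = {start}
    queue = deque([start])
    while queue:
        service = queue.popleft()
        # Only unhealthy dependencies can explain this service's failure.
        failing_deps = [d for d in depends_on.get(service, []) if d in unhealthy]
        if not failing_deps:
            # Nothing upstream is failing, so the chain breaks here.
            roots.append(service)
            continue
        for dep in failing_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return roots


# Hypothetical site: the WAN depends on both the ISP link and site power.
depends_on = {
    "site-dashboard": ["site-wan"],
    "site-wan": ["isp-link", "site-power"],
    "isp-link": [],
    "site-power": [],
}
unhealthy = {"site-dashboard", "site-wan", "site-power"}

print(probable_root_causes(depends_on, unhealthy, "site-dashboard"))
```

In this made-up scenario the trace ends at `site-power` rather than `isp-link`, which is the kind of ISP-versus-power distinction the resolver quoted earlier could not make from metrics alone.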
Reflection
Visualizing dependencies rather than isolated metrics fundamentally changed how operators reasoned about incidents — they stopped chasing individual alarms and started tracing causal chains. Running UX discovery in parallel with backend development meant we could validate design decisions against real data within the same sprint. The biggest accelerator was establishing a design system where every backend concept had a direct UI counterpart — status cards, metric badges, dependency connectors. Engineers could ship new monitoring views by referencing the design system alone, without waiting for custom specs.