Enterprise SaaS · Observability platform

Metric OS

An intelligent observability platform that helps enterprise teams understand system performance, investigate anomalies, and act before issues become business risks.

Lead Product Designer · 16 weeks · Web application · AI-assisted observability

Resolution velocity · +38% · ▲ 12.4% this quarter

Services · Risk · Identity · Ops

Project overview

One operating layer for understanding what changed, why it matters, and what to do next.

Metric OS unifies telemetry from applications, infrastructure, services, and critical user journeys. Instead of asking teams to interpret disconnected monitors, it prioritizes meaningful anomalies, explains likely causes, and connects every incident to an accountable owner.

My role

Product Design Lead

Led product strategy, research synthesis, information architecture, interaction design, prototyping, validation, and design-system direction.

Team

Cross-functional core

1 product lead, 2 designers, 5 engineers, 1 data scientist, and enterprise subject-matter partners.

Outcome focus

From reporting to action

Reduce time-to-detection, increase telemetry trust, and standardize investigation workflows across teams.

The challenge

Engineering teams had more monitoring data than answers.

11+

tools used to investigate a critical service issue

43%

of incident time spent reconciling conflicting telemetry

3.2 days

average delay between detecting and assigning an incident

The so-what

Metric OS turns alert storms into one explainable incident, reducing MTTR before technical failures become revenue failures.

Modern cloud teams rarely lack telemetry. They lack a coherent way to determine which signals matter, how failures are connected, and what action is safe to take. Metric OS correlates logs, traces, metrics, and service dependencies into an opinionated incident workspace.

The platform compresses hundreds of noisy alerts into a single operational narrative, exposes the blast radius, and keeps investigation and remediation in one governed flow.

42% target reduction in MTTR

500 → 1 alerts grouped into one incident

4 → 1 tools consolidated per investigation

502 · TIMEOUT · CPU · TLS · RETRY · QUEUE · LATENCY · 5XX · MEM · RESET · SLO · POOL

→

CORRELATED INCIDENT · INC-4821 · Checkout API failure · 94% root-cause confidence

Primary user & pain point

Meet Frankin, the SRE responsible when thousands of services become one urgent problem.

Frankin · Senior Site Reliability Engineer · Global commerce platform

“I do not need another alert. I need to know what changed, who is affected, and the safest action to take.”

LOGS · upstream reset... · TLS handshake...

TRACES

METRICS · 18.7% error rate

SLACK · Is checkout down? · @frankin investigating

Environment

2,400 microservices · 18 regions · 9M daily sessions

Success metric

Restore service before the incident breaches revenue SLOs

01 · Fragmented context

Four browser tabs, no shared narrative

Frankin jumps between logs, traces, infrastructure metrics, and Slack while manually rebuilding the timeline of failure.

02 · Unstructured evidence

Text logs hide the causal chain

Production errors arrive as dense, unorganized payloads with weak connections to services, deployments, and user impact.

03 · Alert fatigue

One failure creates a storm

Five hundred downstream alerts compete for attention even though they originate from the same upstream regression.

Critical workflow · From anomaly to resolution

One continuous path from signal detection to governed rollback.

Proactive detection

500 loose alerts become one incident container.

Metric OS correlates temporal proximity, trace ancestry, deployment changes, and service dependencies. Instead of paging Frankin 500 times, it creates a single incident with a confidence score, suspected origin, and affected SLOs.
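
A minimal sketch of how this kind of grouping could work, assuming alerts carry a service name, a timestamp, and a trace ID. The `Alert` and `Incident` types, the `correlate` function, the 120-second window, and the dependency map are all hypothetical and only illustrate the idea; deployment changes and confidence scoring are left out, and this is not the Metric OS implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    service: str
    timestamp: float  # seconds on any shared clock
    trace_id: str


@dataclass
class Incident:
    alerts: list = field(default_factory=list)
    services: set = field(default_factory=set)
    trace_ids: set = field(default_factory=set)
    last_seen: float = 0.0

    def add(self, alert):
        self.alerts.append(alert)
        self.services.add(alert.service)
        self.trace_ids.add(alert.trace_id)
        self.last_seen = alert.timestamp


def is_related(a, b, depends_on):
    """True when the services match or one directly depends on the other."""
    return a == b or b in depends_on.get(a, ()) or a in depends_on.get(b, ())


def correlate(alerts, depends_on, window_s=120.0):
    """Group alerts that are close in time and linked by trace or dependency."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            close_in_time = alert.timestamp - incident.last_seen <= window_s
            shares_trace = alert.trace_id in incident.trace_ids
            neighbor = any(
                is_related(alert.service, s, depends_on) for s in incident.services
            )
            if close_in_time and (shares_trace or neighbor):
                incident.add(alert)
                break
        else:
            # No existing incident matched, so this alert opens a new one.
            fresh = Incident()
            fresh.add(alert)
            incidents.append(fresh)
    return incidents


# Hypothetical dependency edges and alerts, for illustration only.
depends_on = {
    "checkout-service": {"api-gateway"},
    "payment-orchestrator": {"checkout-service"},
}
alerts = [
    Alert("api-gateway", 0.0, "tr_a"),
    Alert("checkout-service", 0.4, "tr_a"),
    Alert("payment-orchestrator", 1.1, "tr_b"),
    Alert("search-service", 2.0, "tr_c"),
]
incidents = correlate(alerts, depends_on)
print(len(incidents), [len(i.alerts) for i in incidents])
```

In this toy run the three dependency-linked alerts collapse into one incident, while the unrelated `search-service` alert stays separate.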

500 alerts → 1 incident

Investigation

The blast radius becomes spatial and explainable.

The topology canvas brings the failing API gateway forward, dims healthy infrastructure, and maps the propagation path across checkout, payments, and identity. Selecting a node pivots the log grid to the exact service and trace context.
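
One way to think about the propagation path is a breadth-first walk over the service dependency graph, starting from the suspected origin. The sketch below assumes a `dependents` map (service to the services that call it); the function name and the edges are hypothetical and chosen only to illustrate the traversal, not to describe the production topology.

```python
from collections import deque


def blast_radius(origin, dependents):
    """Return services reachable from the origin, in breadth-first order."""
    seen = {origin}
    order = [origin]
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        # Sort neighbors so the propagation order is deterministic.
        for nxt in sorted(dependents.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical edges: which services sit downstream of each node.
dependents = {
    "api-gateway": ["checkout-service"],
    "checkout-service": ["payment-orchestrator"],
    "catalog-service": [],
}
path = blast_radius("api-gateway", dependents)
print(path)
```

Services that never appear in the result, such as `catalog-service` here, are the ones a canvas like this could dim.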

Root cause confidence · 94%

Automated action

A governed rollback executes inside the incident.

Frankin reviews the recommended rollback, validates affected dependencies, and executes it without leaving the workspace. Guardrails show permissions, expected impact, rollback progress, and post-action health verification.

v2.18.4 → v2.18.3 ✓

Rollback verified · 4m 12s

Main incident response screen · 60/40 split-pane

Investigate the system and the evidence without losing context.

A single-screen response workspace combines spatial service topology with high-density debugging data and governed remediation.

⌘ K Search services, traces, incidents, or run a command

● Live · FR

SEV-1 · ACTIVE

Checkout API error-rate spike

Started 09:41 UTC · Revenue checkout flow degraded · 18.4K users affected

Spatial topology canvas · Checkout production · us-east-1

Blast radius · 4 services

✓ identity-service · 99.99% · 42ms

✓ catalog-service · 99.98% · 51ms

! api-gateway · 502 Bad Gateway · 18.7% errors · 1.8s

! checkout-service · Degraded · 12.4% errors · 1.2s

! payment-orchestrator · Timeout · 8.6% errors · 2.1s

✓ order-service · 99.96% · 68ms

⌕ service:api-gateway level:error · ⌘ ↵ Run

Error rate · 18.7% · ▲ 16.2%

Avg latency · 1.82s · ▲ 1.4s

Saturation · 91% · ▲ 24%

Timestamp | Service name | Status | Log message payload | Trace ID
09:44:21.847 | api-gateway | ERROR | upstream connect error: TLS handshake timeout; reset reason=connection_termination | tr_8f2c91a
09:44:21.812 | checkout-service | 502 | POST /checkout failed: upstream request timeout after 1800ms; retry_budget_exhausted=true | tr_8f2c91a
09:44:21.790 | payment-orchestrator | WARN | circuit_breaker state=OPEN dependency=tokenization-v2 failure_rate=0.42 | tr_4bd109e
09:44:20.644 | api-gateway | ERROR | upstream reset before response headers; transport_failure_reason=TLS_error | tr_73ac011
09:44:19.318 | order-service | INFO | request completed status=200 duration_ms=68 region=us-east-1 | tr_a013bd2
09:44:18.902 | checkout-service | WARN | connection pool saturation=0.91 active=182 idle=4 pending=63 | tr_1cc81fa

AI root-cause hypothesis · 94% confidence

Deployment v2.18.4 introduced an incompatible TLS cipher configuration in api-gateway.

First error occurred 47 seconds after deployment. Rollback is expected to restore checkout traffic within 4–6 minutes.

Observability cockpit

Every component connects performance to context, confidence, and action.

The extended observability workspace supports system-health scanning and detailed investigation without forcing teams into separate monitoring tools.

Service availability

99.98% +18.4%

Last 12 months

Actual · Plan · Forecast

Telemetry composition

Signal mix

Healthy

62% Traces

Traces 62.1M · Logs 28.4M · Metrics 9.5M

Service intelligence

Latency by service

6 cohorts

W1 · W2 · W3 · W4 · W5 · W6

API gateway · Payments · Identity · Data pipeline

Predictive outlook

Capacity forecast

91% confidence

73% Expected utilization

Conservative 61% · Expected 73% · Upside 86%

Signal catalog

Governed signals

128 certified

API availability · SRE Platform · 4m ago · 118.4% · Certified

P95 API latency · Observability · 8m ago · 64.2% · Watch

Database saturation · Infrastructure · 2m ago · 87.0% · At risk

Error budget remaining · SRE Platform · 6m ago · $8.4M · Certified

Incident actions

Resolution tracker

Q2 priorities

API latency regression · 72% · Owner · SRE Platform · On track

Database saturation program · 46% · Owner · Infrastructure · At risk

Error-budget recovery · 88% · Owner · Platform · On track

Design system

A semantic system built for dense, trustworthy enterprise experiences.

Semantic color

Color reinforces status and confidence but never carries meaning on its own.

Data typography

118.4%

Metric title / 16

Supporting context / 14

Trust states

● Certified · ◐ Monitoring · ! At risk

Accessibility

WCAG 2.1 AA contrast
Keyboard-first workflows
Non-color status indicators
Reduced-motion support
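
For the contrast target, a short check based on the WCAG 2.1 relative-luminance and contrast-ratio definitions can be useful when auditing color tokens. The sketch below assumes six-digit hex colors; the helper names are hypothetical, and the black and white values are placeholders, not tokens from this design system.

```python
def _linearize(channel_8bit):
    """Convert an 8-bit sRGB channel to linear light, per WCAG 2.1."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a six-digit hex color such as '#1a2b3c'."""
    value = hex_color.lstrip("#")
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1.0 up to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground, background, large_text=False):
    """AA requires 4.5:1 for normal text and 3:1 for large text."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


# Placeholder colors, not design-system tokens.
print(round(contrast_ratio("#000000", "#ffffff"), 2))
print(passes_aa("#000000", "#ffffff"))
```

A check like this only covers color contrast; the other items in the list still need keyboard and motion testing.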

UX defended · Design rationale

Enterprise density, depth, and continuity are functional decisions.

Every visual choice is tied to the cognitive demands of high-stakes incident response.

01 · Density over decorative whitespace

Rapid scanning requires evidence to stay visible.

During an incident, an SRE compares timestamps, service names, statuses, payload patterns, and trace IDs across multiple events. A compact table preserves more causal evidence in the viewport, reduces scrolling, and makes repeated patterns visible. Density is controlled through strict column alignment, monospace payloads, restrained borders, and semantic badges.

UX principle · Maximize comparable evidence per glance.

02 · Glass 2.0 as functional depth

The Z-axis communicates urgency without creating color fatigue.

Healthy infrastructure recedes into lower-contrast background layers. Anomalous services rise forward on crisp glass surfaces with elevation, glow, and high-contrast error badges. This reserves saturated crimson and amber for genuine risk while depth and focus communicate hierarchy across the rest of the topology.

UX principle · Use depth before adding more color.

03 · One continuous split-pane workflow

Topology and evidence remain causally connected.

The 40% topology canvas preserves system relationships while the 60% debugging pane exposes the supporting logs, KPIs, and traces. Selecting a service pivots evidence without opening another tab. Keeping detection, investigation, and rollback together eliminates context reconstruction and reduces working-memory load.

UX principle · Preserve context across every investigative pivot.

Impact & outcomes

Creating a faster, more accountable observability ecosystem.

Prototype validation with enterprise users showed that combining signal prioritization, transparent AI explanations, and governed telemetry reduced time spent interpreting dashboards and increased confidence in incident response.

38% faster time from signal to assigned action

52% less time spent reconciling metric definitions

4.7/5 reported confidence in the observability workflow

AA accessibility target across critical workflows
