Enterprise SaaS · Observability platform

Metric OS

An intelligent observability platform that helps enterprise teams understand system performance, investigate anomalies, and act before issues become business risks.

Lead Product Designer · 16 weeks · Web application · AI-assisted observability
Resolution velocity · +38% · ▲ 12.4% this quarter
Services · Risk · Identity · Ops
Project overview

One operating layer for understanding what changed, why it matters, and what to do next.

Metric OS unifies telemetry from applications, infrastructure, services, and critical user journeys. Instead of asking teams to interpret disconnected monitors, it prioritizes meaningful anomalies, explains likely causes, and connects every incident to an accountable owner.

My role

Product Design Lead

Led product strategy, research synthesis, information architecture, interaction design, prototyping, validation, and design-system direction.

Team

Cross-functional core

1 product lead, 2 designers, 5 engineers, 1 data scientist, and enterprise subject-matter partners.

Outcome focus

From reporting to action

Reduce time-to-detection, increase telemetry trust, and standardize investigation workflows across teams.

The challenge

Engineering teams had more monitoring data than answers.

11+

tools used to investigate a critical service issue

43%

of incident time spent reconciling conflicting telemetry

3.2 days

average delay between detecting and assigning an incident

The so-what

Metric OS turns alert storms into one explainable incident, reducing MTTR before technical failures become revenue failures.

Modern cloud teams rarely lack telemetry. They lack a coherent way to determine which signals matter, how failures are connected, and what action is safe to take. Metric OS correlates logs, traces, metrics, and service dependencies into an opinionated incident workspace.

The platform compresses hundreds of noisy alerts into a single operational narrative, exposes the blast radius, and keeps investigation and remediation in one governed flow.

42% target reduction in MTTR
500→1 alerts grouped into one incident
4→1 tools consolidated per investigation
502 · TIMEOUT · CPU · TLS · RETRY · QUEUE · LATENCY · 5XX · MEM · RESET · SLO · POOL
CORRELATED INCIDENT · INC-4821 · Checkout API failure · 94% root-cause confidence
Primary user & pain point

Meet Frankin, the SRE responsible when thousands of services become one urgent problem.

Frankin · Senior Site Reliability Engineer · Global commerce platform
“I do not need another alert. I need to know what changed, who is affected, and the safest action to take.”
LOGS · upstream reset... · TLS handshake...
TRACES
METRICS · 18.7% error rate
SLACK · Is checkout down? · @frankin investigating
Environment · 2,400 microservices · 18 regions · 9M daily sessions
Success metric · Restore service before the incident breaches revenue SLOs
01 · Fragmented context

Four browser tabs, no shared narrative

Frankin jumps between logs, traces, infrastructure metrics, and Slack while manually rebuilding the timeline of failure.

02 · Unstructured evidence

Text logs hide the causal chain

Production errors arrive as dense, unorganized payloads with weak connections to services, deployments, and user impact.

03 · Alert fatigue

One failure creates a storm

Five hundred downstream alerts compete for attention even though they originate from the same upstream regression.

Critical workflow · From anomaly to resolution

One continuous path from signal detection to governed rollback.

01
Proactive detection

500 loose alerts become one incident container.

Metric OS correlates temporal proximity, trace ancestry, deployment changes, and service dependencies. Instead of paging Frankin 500 times, it creates a single incident with a confidence score, suspected origin, and affected SLOs.

1
500 alerts → 1 incident
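
The grouping step described above can be illustrated with a small sketch. It is not the Metric OS implementation: the `Alert` fields, the `depends_on` map, the 120-second window, and the rule that two alerts are related when they share a trace ID, a service, or a direct dependency are all assumptions chosen for illustration. Deployment changes and confidence scoring are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    service: str
    timestamp: float                 # seconds, any consistent clock
    trace_id: Optional[str] = None   # hypothetical field


def related(a, b, depends_on):
    """Assumed relatedness rule: shared trace, same service, or direct dependency."""
    if a.trace_id is not None and a.trace_id == b.trace_id:
        return True
    if a.service == b.service:
        return True
    return (b.service in depends_on.get(a.service, set())
            or a.service in depends_on.get(b.service, set()))


def correlate(alerts, depends_on, window_s=120.0):
    """Group alerts into incidents using union-find over related, time-adjacent pairs."""
    ordered = sorted(alerts, key=lambda a: a.timestamp)
    parent = list(range(len(ordered)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, a in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            b = ordered[j]
            if b.timestamp - a.timestamp > window_s:
                break  # later alerts are even further away in time
            if related(a, b, depends_on):
                parent[find(j)] = find(i)

    groups = {}
    for i, alert in enumerate(ordered):
        groups.setdefault(find(i), []).append(alert)
    return list(groups.values())


# Illustrative data only; service names are borrowed from the mockup.
alerts = [
    Alert("api-gateway", 0, "t1"),
    Alert("checkout-service", 10, "t1"),
    Alert("payment-orchestrator", 20),
    Alert("order-service", 1000),
]
depends_on = {"checkout-service": {"api-gateway", "payment-orchestrator"}}
incidents = correlate(alerts, depends_on)
```

With this sample data, the three alerts that fall inside the window and are linked by trace or dependency collapse into one incident, while the late, unrelated alert stays separate.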
02
Investigation

The blast radius becomes spatial and explainable.

The topology canvas brings the failing API gateway forward, dims healthy infrastructure, and maps the propagation path across checkout, payments, and identity. Selecting a node pivots the log grid to the exact service and trace context.

!
Root cause confidence · 94%
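
One way to think about the blast radius is as a breadth-first walk over the service dependency graph, starting at the suspected origin. The sketch below assumes a hypothetical `propagates_to` map (origin service to the services its failure can reach); the edges shown are invented for illustration and are not the real topology.

```python
from collections import deque


def blast_radius(origin, propagates_to):
    """Return {service: hop distance from origin} for every reachable service."""
    distance = {origin: 0}
    queue = deque([origin])
    while queue:
        service = queue.popleft()
        for neighbour in propagates_to.get(service, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[service] + 1
                queue.append(neighbour)
    return distance


# Hypothetical edges: a failure in the key can surface in the listed services.
propagates_to = {
    "api-gateway": ["checkout-service"],
    "checkout-service": ["payment-orchestrator"],
    "catalog-service": ["order-service"],
}
radius = blast_radius("api-gateway", propagates_to)
```

The hop distance could drive the visual treatment described above, for example bringing distance-0 and distance-1 nodes forward and dimming services that are not in the result at all.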
03
Automated action

A governed rollback executes inside the incident.

Frankin reviews the recommended rollback, validates affected dependencies, and executes it without leaving the workspace. Guardrails show permissions, expected impact, rollback progress, and post-action health verification.

v2.18.4 → v2.18.3
Rollback verified · 4m 12s
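
The guardrails can be pictured as a pre-flight check that returns the reasons a rollback is not yet safe to run. This is a sketch under stated assumptions: the `RollbackRequest` fields, the role name, and the three checks are hypothetical, and progress tracking and post-action health verification are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class RollbackRequest:
    actor_roles: set
    service: str
    current_version: str
    target_version: str
    validated_dependencies: set = field(default_factory=set)


def preflight(request, required_role, affected_dependencies):
    """Return a list of blocking reasons; an empty list means the rollback may proceed."""
    blockers = []
    if required_role not in request.actor_roles:
        blockers.append(f"missing role: {required_role}")
    if request.target_version == request.current_version:
        blockers.append("target version equals current version")
    missing = sorted(set(affected_dependencies) - request.validated_dependencies)
    if missing:
        blockers.append("unvalidated dependencies: " + ", ".join(missing))
    return blockers


# Illustrative values; versions and service names mirror the mockup.
request = RollbackRequest(
    actor_roles={"sre-oncall"},
    service="api-gateway",
    current_version="v2.18.4",
    target_version="v2.18.3",
    validated_dependencies={"checkout-service"},
)
affected = {"checkout-service", "payment-orchestrator"}
blockers = preflight(request, "sre-oncall", affected)
```

Returning reasons instead of a bare yes or no keeps the guardrail explainable: the interface can show exactly which dependency still needs review before the action is enabled.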
Main incident response screen · 60/40 split-pane

Investigate the system and the evidence without losing context.

A single-screen response workspace combines spatial service topology with high-density debugging data and governed remediation.

Metric OS / Incidents / INC-4821
● Live · FR
SEV-1 · ACTIVE

Checkout API error-rate spike

Started 09:41 UTC · Revenue checkout flow degraded · 18.4K users affected

Spatial topology canvas · Checkout production · us-east-1
Blast radius · 4 services
identity-service · 99.99% · 42ms
catalog-service · 99.98% · 51ms
api-gateway · ! 502 Bad Gateway · 18.7% errors · 1.8s
checkout-service · ! Degraded · 12.4% errors · 1.2s
payment-orchestrator · ! Timeout · 8.6% errors · 2.1s
order-service · 99.96% · 68ms
⌕ service:api-gateway level:error · ⌘ ↵ Run
Error rate · 18.7% · ▲ 16.2%
Avg latency · 1.82s · ▲ 1.4s
Saturation · 91% · ▲ 24%
Correlated logs · 1,284 events · 94% confidence
| Timestamp | Service name | Status | Log message payload | Trace ID |
|  | api-gateway | ERROR | upstream connect error: TLS handshake timeout; reset reason=connection_termination | tr_8f2c91a |
|  | checkout-service | 502 | POST /checkout failed: upstream request timeout after 1800ms; retry_budget_exhausted=true | tr_8f2c91a |
|  | payment-orchestrator | WARN | circuit_breaker state=OPEN dependency=tokenization-v2 failure_rate=0.42 | tr_4bd109e |
|  | api-gateway | ERROR | upstream reset before response headers; transport_failure_reason=TLS_error | tr_73ac011 |
|  | order-service | INFO | request completed status=200 duration_ms=68 region=us-east-1 | tr_a013bd2 |
|  | checkout-service | WARN | connection pool saturation=0.91 active=182 idle=4 pending=63 | tr_1cc81fa |
AI root-cause hypothesis · 94% confidence

Deployment v2.18.4 introduced an incompatible TLS cipher configuration in api-gateway.

First error occurred 47 seconds after deployment. Rollback is expected to restore checkout traffic within 4–6 minutes.
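
The search bar in this screen uses a `key:value` filter (`service:api-gateway level:error`). A minimal sketch of how such a query could narrow the log grid is shown below. The parsing rules, the record shape, and the mapping of `level` to the Status column are assumptions, not the product's query language.

```python
def parse_query(query):
    """Turn 'service:api-gateway level:error' into {'service': ..., 'level': ...}."""
    filters = {}
    for token in query.split():
        key, sep, value = token.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {token!r}")
        filters[key] = value
    return filters


def matches(record, filters):
    """Case-insensitive exact match on every filter key."""
    return all(
        str(record.get(key, "")).lower() == value.lower()
        for key, value in filters.items()
    )


# Rows mirror the log grid in the mockup; 'level' is an assumed name for Status.
records = [
    {"service": "api-gateway", "level": "ERROR", "trace_id": "tr_8f2c91a"},
    {"service": "checkout-service", "level": "502", "trace_id": "tr_8f2c91a"},
    {"service": "payment-orchestrator", "level": "WARN", "trace_id": "tr_4bd109e"},
    {"service": "api-gateway", "level": "ERROR", "trace_id": "tr_73ac011"},
    {"service": "order-service", "level": "INFO", "trace_id": "tr_a013bd2"},
]
filters = parse_query("service:api-gateway level:error")
visible = [r for r in records if matches(r, filters)]
```

Selecting a node on the topology canvas could then be modelled as rewriting the `service` filter and re-running the same match, which keeps the pivot in one pane.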

Observability cockpit

Every component connects performance to context, confidence, and action.

The extended observability workspace supports system-health scanning and detailed investigation without forcing teams into separate monitoring tools.

Service availability

99.98% +18.4%

Last 12 months
May 2026 · 99.98% · ▲ $2.3M vs plan
Actual · Plan · Forecast
Telemetry composition

Signal mix

Healthy
62% Traces
Traces 62.1M · Logs 28.4M · Metrics 9.5M
Service intelligence

Latency by service

6 cohorts
W1 · W2 · W3 · W4 · W5 · W6
API gateway
Payments
Identity
Data pipeline
Predictive outlook

Capacity forecast

91% confidence
73% expected utilization
Conservative 61% · Expected 73% · Upside 86%
Signal catalog

Governed signals

128 certified
API availability · SRE Platform · 4m ago · 118.4% · Certified
P95 API latency · Observability · 8m ago · 64.2% · Watch
Database saturation · Infrastructure · 2m ago · 87.0% · At risk
Error budget remaining · SRE Platform · 6m ago · $8.4M · Certified
Incident actions

Resolution tracker

Q2 priorities
API latency regression · 72% · Owner · SRE Platform · On track
Database saturation program · 46% · Owner · Infrastructure · At risk
Error-budget recovery · 88% · Owner · Platform · On track
Design system

A semantic system built for dense, trustworthy enterprise experiences.

Semantic color

Color reinforces status and confidence but never carries meaning on its own.

Data typography · 118.4%

Metric title / 16

Supporting context / 14

Trust states
● Certified · ◐ Monitoring · ! At risk
Accessibility
  • WCAG 2.1 AA contrast
  • Keyboard-first workflows
  • Non-color status indicators
  • Reduced-motion support
UX defended · Design rationale

Enterprise density, depth, and continuity are functional decisions.

Every visual choice is tied to the cognitive demands of high-stakes incident response.

01 · Density over decorative whitespace

Rapid scanning requires evidence to stay visible.

During an incident, an SRE compares timestamps, service names, statuses, payload patterns, and trace IDs across multiple events. A compact table preserves more causal evidence in the viewport, reduces scrolling, and makes repeated patterns visible. Density is controlled through strict column alignment, monospace payloads, restrained borders, and semantic badges.

UX principle · Maximize comparable evidence per glance.
02 · Glass 2.0 as functional depth

The Z-axis communicates urgency without creating color fatigue.

Healthy infrastructure recedes into lower-contrast background layers. Anomalous services rise forward on crisp glass surfaces with elevation, glow, and high-contrast error badges. This reserves saturated crimson and amber for genuine risk while depth and focus communicate hierarchy across the rest of the topology.

UX principle · Use depth before adding more color.
03 · One continuous split-pane workflow

Topology and evidence remain causally connected.

The 40% topology canvas preserves system relationships while the 60% debugging pane exposes the supporting logs, KPIs, and traces. Selecting a service pivots evidence without opening another tab. Keeping detection, investigation, and rollback together eliminates context reconstruction and reduces working-memory load.

UX principle · Preserve context across every investigative pivot.
Impact & outcomes

Creating a faster, more accountable observability ecosystem.

Prototype validation with enterprise users showed that combining signal prioritization, transparent AI explanations, and governed telemetry reduced time spent interpreting dashboards and increased confidence in incident response.

38% faster time from signal to assigned action
52% less time spent reconciling metric definitions
4.7/5 reported confidence in the observability workflow
AA accessibility target across critical workflows