Information Technology Operations Supervisor
Alert Management & Observability Standards Lead
Duration: 6 months
Location: REMOTE
Role Summary
The Alert Management & Observability Standards Lead is responsible for rationalizing and governing all system alerts to ensure they align with department priorities, operational coverage models, and service reliability goals. This role defines alerting standards, reviews and approves alerts before they are routed to the 24x7 Eyes-on-Glass Operations team, and establishes a scalable approach to cataloging alert response instructions (runbooks/playbooks) so responders can take consistent, high-quality actions.
This position operates at the intersection of the IT Operations Command Center (OCC), engineering/application teams, platform/monitoring tool owners, and service owners, ensuring alerts are actionable, prioritized, and paired with clear response guidance.
Key Responsibilities
- Alert Rationalization & Prioritization (Core)
- Standards, Policies, and Guardrails
- Routing Decisions to 24x7 Eyes-on-Glass
- Runbook / Response Instruction Cataloging (Knowledge System)
- Reporting & Operational Outcomes
- Cross-Functional Enablement
Required Qualifications
- 5+ years in IT Operations, SRE, Observability, Monitoring Engineering, or Incident Management
- Demonstrated success reducing noise and improving actionability across enterprise alerting ecosystems
- Experience with common monitoring/observability tools (e.g., Splunk, AppDynamics, Dynatrace, Datadog, Prometheus/Grafana, Azure Monitor, CloudWatch, ServiceNow Event Management, or similar)
- Strong understanding of:
  - Incident response workflows and operational coverage models (24x7 vs. business hours)
  - CMDB/service ownership concepts and dependency mapping
  - Standard operating procedures/runbooks and knowledge management
- Excellent stakeholder management and ability to drive standards across teams
Preferred Qualifications
- Experience designing or operating an Operations Command Center / NOC / SOC-style "eyes-on-glass" model
- Familiarity with ITIL Event Management, SRE principles, and service reliability practices
- Experience with automation for alert enrichment, correlation, and routing (e.g., event correlation, deduplication, noise suppression)
- Background in governance frameworks and operating rhythm design (cadences, controls, compliance traceability)
Competencies / What Great Looks Like
- Opinionated, data-driven governance: decisions anchored in outcomes, not preferences
- Practical standardization: templates and policies that teams can actually follow
- Operational empathy: knows what 24x7 responders need to succeed in real time
- Quality bar: only actionable alerts reach Eyes-on-Glass; every alert has an owner and instructions
- Continuous improvement mindset: routinely prunes, tunes, and simplifies
Deliverables in the First 45 Days
- Alerting standards (severity model, metadata, naming, routing policy) published and adopted
- Intake and approval workflow established for new/changed alerts
- Top 20 noisy services rationalized (dedupe/suppress/threshold tuning) with measurable noise reduction
- Runbook template launched; minimum runbook coverage targets set (e.g., 80% of paged alerts)
- Central alert catalog created (ownership + routing + runbook link + last review date)