Skills

Communication, Leadership, Emotional Intelligence, Incident Response, Splunk, ServiceNow, Monitoring, Stakeholder Management, Decision-making, Prometheus, Grafana, Microservices

Job Specifications

Local candidates only

Role: Major Incident Manager (MIM) / Recovery Manager

Function: IT Service Management

Shift Model: 7 AM to 10 PM EST, weekend on-call

Location: Fort Mill (SC)

Work Model: Initially 100% onsite; may later change to hybrid (TBD)

________________________________________

Key Responsibilities

1. Major Incident Command & Recovery Leadership

• Assume full command ownership for P1/P2 incidents impacting critical business services

• Act as the Recovery Manager, not a passive coordinator

• Establish a clear recovery strategy and sequence of actions within the first minutes of incident declaration

• Direct and challenge resolver groups by asking the right diagnostic questions, such as:

o What changed last?

o What signals indicate blast radius?

o What evidence confirms or disproves the current hypothesis?

• Prevent “follow-the-script” behaviour by actively steering troubleshooting paths

• Make time-bound decisions on rollback, failover, degradation acceptance, or workaround activation

• Escalate with intent and clarity, not mechanically

________________________________________

2. Observability-Driven Triage & Technical Situational Awareness

• Use observability and monitoring platforms to independently:

o Validate incident impact

o Identify affected services, components, and dependencies

o Detect anomalies, error spikes, latency, saturation, or failures

• Perform initial triage before and during bridge calls, including:

o Application health indicators

o Infrastructure metrics

o Logs and traces (where applicable)

• Correlate signals across multiple tools to form an early incident hypothesis

• Guide resolver teams using data-backed insights, not assumptions

• Reduce dependency on individual SMEs for basic diagnostic visibility

________________________________________

3. Cross-Functional Orchestration

• Coordinate across:

o Application teams

o Infrastructure (Cloud, Network, Database)

o SRE / Platform teams

o Vendors and third parties

• Ensure clear role assignment (who investigates, who mitigates, who communicates)

• Maintain single-threaded recovery ownership

• Prevent duplication of effort and conflicting actions

________________________________________

4. Stakeholder & Business Communication

• Serve as the single authoritative voice during major incidents

• Provide:

o Clear impact statements

o Recovery progress updates

o Risk and ETA communication (fact-based, not speculative)

• Tailor communication for:

o Business stakeholders

o Senior leadership

o Technical teams

• Maintain composure and credibility under pressure

________________________________________

5. Post-Incident Review & Continuous Improvement

• Lead or contribute to:

o Post-Incident Reviews (PIRs)

o Root Cause Analysis (RCA)

• Challenge superficial root causes and push for systemic fixes

• Identify:

o Observability gaps

o Recovery bottlenecks

o Process or ownership weaknesses

• Recommend improvements to:

o Monitoring

o Alerting

o Runbooks

o Incident response playbooks

________________________________________

Required Skills & Competencies

Core Competencies (Must-Have)

• Strong command presence and decision-making ability during crisis situations

• Ability to ask incisive, technically meaningful questions

• Confidence to challenge senior engineers and vendors respectfully

• Structured thinking under pressure

• High emotional intelligence and stakeholder management

________________________________________

Technical & Tooling Knowledge (Good-to-Have)

• Working knowledge of observability and monitoring tools, such as:

o Splunk, Dynatrace, AppDynamics, Datadog, New Relic, Prometheus/Grafana, Elastic, etc.

• Ability to interpret:

o Metrics (latency, errors, throughput, saturation)

o Logs and traces (basic level)

• Understanding of:

o Modern application architectures (microservices, APIs)

o Cloud and hybrid environments

o IT infrastructure dependencies

• Familiarity with ITSM tools (ServiceNow or equivalent)

________________________________________

Process & Framework Knowledge

• ITIL Major Incident Management (practical, not theoretical)

• SRE concepts (SLIs, SLOs, error budgets – preferred)

• Incident command models (e.g., war room, bridge command)

________________________________________

Experience Requirements

• 5+ years of experience as a Major Incident Manager leading critical disruptions to resolution within SLA (not as a participant from a support tower)

• Proven experience handling business-critical P1/P2 incidents

• Prior exposure to:

o Command Center operations

o On-call or high-pressure production environments

• Experience leading incidents, not just supporting them

________________________________________

Behavioural Expectations (Critical for This Role)

• Takes ownership without waiting for instructions

• Comfortable making decisions with incomplete information

• Calm, assertive, and respected during high-stress events

• Balances speed with risk awareness

About the Company

Net2Source (N2S) is a global workforce solutions company recognized by SIA as the largest and fastest-growing Total Talent Solutions provider, with a presence in 32 countries and in-house Glo-Cal (global and local) teams to support our clients. We carve out custom talent solutions, keeping People, Process, and Technology as the pillars of making the process simple, robust, and efficient. With 3,500+ contractors working worldwide, we specialize in Contingent Staffing, RPO, Direct Sourcing, Payroll Solutions (EOR/AOR), ...