Incident Engineer

India, Bengaluru · Job Posted May 03, 2026

Job Description

A team within Global Platform Operations under the Monitoring Engineering pillar exhibits an unwavering attention to detail and a deep understanding of the platform-wide monitoring implications for all merchants. In this role, you will be on call monitoring platform performance, communicating with merchants, working on monitoring frameworks, and providing feedback to product engineering teams to improve the reliability of the platform. You will initiate and lead initiatives across our platform offerings, prioritizing merchant impact, to proactively detect issues and inform merchants quickly.

Job Responsibility

  • Participate in 24/7 on-call monitoring
  • Observe platform and merchant performance and detect any issues proactively to mitigate risks in partnership with Engineering teams
  • Be an expert in communicating with merchants in real time during an incident and present the most accurate and up-to-date information to keep them informed
  • Work together with Operations, Product, Engineering, and reliability teams to integrate, grow, and continuously improve our monitoring strategy and increase our reliability
  • Improve operations by leading and project-managing initiatives and/or developing tools and automation for effective monitoring
  • Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
  • Mitigate merchant impact risk by acting on alerts in partnership with Engineering teams, and contribute to the monitoring playbook by documenting your learnings
  • Focus on ruthlessly prioritizing, automating, and scaling every aspect of our detection capabilities

Requirements

  • You have 5 to 10 years of experience with incident client communication and platform monitoring operations
  • You're willing to participate in the on-call rotation and work in a fast-paced, dynamic environment
  • You have experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, etc.
  • You have experience with observability platforms like Datadog, Dynatrace, Splunk
  • You have excellent analytical and problem-solving skills, with the ability to analyze complex systems and spot the root cause of issues
  • You thrive in an environment where collaboration is crucial and where a global approach is key to the successful implementation of processes and projects
  • You have a passion for defining and standardizing processes to drive strategic improvement and are able to translate complex technical concepts with ease for non-technical audiences
  • You have a natural ability for handling complex situations and multiple responsibilities simultaneously
  • You're a strong team player and thrive in a dynamic environment

Similar Jobs for Incident Engineer

8 matching positions

Monitoring Engineer / Incident Manager

A team within Engineering under the Platform Excellence pillar exhibits an unwav...
Location: Netherlands, Amsterdam
Salary: Not provided
Company: Adyen
Expiration Date: Until further notice
Requirements
  • At least 5 years of experience with incident management, problem management, incident client communication, and platform monitoring operations
  • Experience with problem management practices - identifying trends across incidents, conducting root cause investigations and driving preventative action
  • Solid communication skills and the ability to develop strong working relationships throughout the organization, able to translate technical situations clearly and concisely to a diverse audience via data-visualizing dashboards and written documents
  • Willing to participate in the on-call rotation and work in a fast-paced, dynamic environment
  • Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, etc.
  • Experience with observability platforms like Datadog, Dynatrace, Splunk
  • Excellent analytical and problem-solving skills, with the ability to analyze complex systems and spot the root cause of issues
  • Thrive in an environment where collaboration is crucial and where a global approach is key for successful implementation of processes and projects
  • Passion for defining and standardizing processes to drive strategic improvement, and the ability to translate complex technical concepts with ease for non-technical audiences
  • Natural ability for handling complex situations and multiple responsibilities simultaneously
Job Responsibility
  • Participate in 24/7 on-call monitoring and observe platform and merchant performance and detect any issues proactively to mitigate risks in partnership with Engineering teams
  • Coordinate the mitigation, recovery, and resolution of high-impact incidents, ensuring a rapid and effective response across teams
  • Represent the customer perspective during incidents, maintaining a strong customer-centric approach
  • Communicate with merchants in real time during an incident and present the most accurate and up-to-date information to keep them informed
  • Escalate critical incidents when needed and provide structured communication to senior management
  • Go beyond reactive incident response by analyzing incident trends to identify recurring issues and systemic weaknesses and partner with engineering and product teams to advocate for long-term fixes
  • Work together with Operations, Product, and Engineering teams to integrate, grow, and continuously improve monitoring strategy and increase reliability
  • Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
  • Mitigate merchant impact risk by acting on alerts in partnership with Engineering teams and contribute to the monitoring playbook by documenting learnings
  • Improve operations by leading and project-managing initiatives and developing tools and automation for effective monitoring
Employment Type: Full-time

O&M Infrastructure Engineer - Incident Management

Job Description: Promptly identify and categorize incidents based on impact and ...
Location: Saudi Arabia, Riyadh
Salary: Not provided
Company: Giza Systems
Expiration Date: Until further notice
Requirements
  • Ability to prioritize incidents based on impact and urgency
  • Ability to manage multiple incidents simultaneously and meet SLAs
  • Ability to work under pressure and make sound decisions in a fast-paced 24x7 environment
  • Excellent verbal and written communication skills for reporting and stakeholder coordination
  • Proficiency in documenting procedures, findings, and customizing reports
  • Strong understanding of IT Service Management (ITIL) frameworks
  • Experience with incident logging, categorization, prioritization, and resolution processes
  • Strong data analysis skills to review monitoring data and identify trends
  • Expertise in defining and tracking ITSM Key Performance Indicators (KPIs)
  • 1 to 3 years of experience
Job Responsibility
  • Promptly identify and categorize incidents based on impact and urgency
  • Prioritize incidents based on severity and effect on business operations
  • Record detailed incident information in the incident management system
  • Escalate incidents to L2/L3 support teams or management when required
  • Open incidents with technology vendors, upload logs, follow up on cases, and coordinate technical meetings
  • Maintain clear and effective communication with IT teams, business users, management, and external vendors
  • Provide regular updates on incident status, resolution progress, and potential impact
  • Conduct post-incident reviews to evaluate handling effectiveness and identify improvements
  • Document all incident-related activities including actions taken and final resolutions
  • Analyze incident data to identify trends, patterns, and recurring issues
Employment Type: Full-time

Staff Site Reliability Engineer - Incident Management & Reliability

We’re not just building better tech. We’re rewriting how data moves and what the...
Location: Canada
Salary: 225100.00 - 264500.00 CAD / Year
Company: Confluent
Expiration Date: Until further notice
Requirements
  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure
  • Experience navigating reliability/incident programs at 500+ engineer organizations
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Deep experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes
Job Responsibility
  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide
What we offer
  • Remote-First Work
  • Robust Insurance Benefits
  • Flexible Time Away
  • The Best Teammates
  • Experience Ambassadors
  • Open and Honest Culture
  • Well-Being and Growth
  • Offers Equity
Employment Type: Full-time

Senior Security Engineer - Incident Response

Mozilla is looking for an Incident Responder to monitor and mitigate attacks acr...
Location: Germany
Salary: Not provided
Company: Mozilla
Expiration Date: Until further notice
Requirements
  • 5+ years of demonstrated ability managing security incidents at a global scale and/or experience working in Security Operations Centers (SOC), Product Security Incident Response Teams (PSIRT), and Computer Security Incident Response Teams (CSIRT)
  • Expertise with security information and event management (SIEM) systems (e.g. ELK, Google BigQuery, Splunk, etc.). Splunk proficiency is preferred
  • Expertise with endpoint detection and investigation. Hands-on experience with leading EDR tools and demonstrated ability to leverage endpoint telemetry to find root cause
  • Expertise with security orchestration and automation (SOAR) platforms such as Tines or Splunk SOAR
  • Superb communication and leadership capacity; ability to partner effectively with diverse company stakeholders
  • Real-world experience in software development and/or engineering operations for consumer products and services
  • B.S. in a technology-focused field is helpful
  • Practical experience working with cloud technologies (e.g. Google Cloud Platform, Amazon Web Services, Heroku, Microsoft Azure, etc.)
  • Ownership and Accountability
Job Responsibility
  • Identify and respond to security incidents on a global scale
  • Act as an incident commander to drive incidents through the entire response lifecycle
  • Design and maintain a portfolio of security alerts, automated actions, playbooks and escalation workflows in support of a high-performing 24/7 incident response capability
  • Conduct threat hunting activities, anticipate future threats, and maintain forward-thinking strategies for tools/technology/processes that combat sophisticated threat actors
  • Research threat intelligence reports, triage and manage resulting workflows
  • Partner with key stakeholders and communicate effectively to maintain a continuously improving feedback loop of preparation, identification, analysis, containment, and post mortem activities
  • Participate in on-call rotation
What we offer
  • Generous performance-based bonus plans
  • Rich medical, dental, and vision coverage
  • Generous retirement contributions with 100% immediate vesting
  • Quarterly all-company wellness days
  • Country specific holidays plus a day off for your birthday
  • One-time home office stipend
  • Annual professional development budget
  • Quarterly well-being stipend
  • Considerable paid parental leave
  • Employee referral bonus program
Employment Type: Full-time

Incident Management Engineer German Speaker

We are seeking a BTS Incident Management Engineer to support Vodafone Germany En...
Location: Egypt, Cairo
Salary: Not provided
Company: Vodafone
Expiration Date: Until further notice
Requirements
  • 0–2 years of experience in network operations, technical support, or IP-based troubleshooting environments
  • Foundational knowledge of IP and voice technologies, including routing, switching, and IP services such as DNS and DHCP
  • Strong analytical and problem-solving skills in complex technical scenarios
  • Customer-focused mindset with clear, confident communication skills
  • Fluent in German (written and spoken) at a minimum C1 level
  • Familiarity with Vodafone fixed, mobile, and convergence products is desirable
Job Responsibility
  • Perform end-to-end incident troubleshooting, identification, and resolution for Vodafone Germany Enterprise customers
  • Resolve incidents within agreed Service Level Agreements (SLAs) and Key Performance Indicators (KPIs)
  • Collaborate with internal competence teams, external carriers, and third-party partners to achieve timely incident resolution
  • Manage and update incident tickets accurately, ensuring transparency and clear communication throughout the lifecycle
  • Work on a rotational shift basis, covering 24/7 operations including weekends and public holidays
What we offer
  • Exposure to enterprise-scale network environments supporting Vodafone Germany customers
  • Opportunity to work within a globally recognised technology organisation
  • Hands-on experience in a 24/7 operational support model
  • Collaboration with cross-functional and international technical teams
  • Structured environment to build a strong foundation in incident and service management
Employment Type: Full-time

Senior Security Engineer - Security Incident Response

The Cloud & AI organization accelerates Microsoft’s mission and bold ambitions t...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate in Statistics, Mathematics, Computer Science, or related field
  • OR Master's Degree in Statistics, Mathematics, Computer Science, or related field AND 3+ years experience in software development lifecycle, large-scale computing, threat modeling, cyber security, anomaly detection, Security Operations Center (SOC) detection, threat analytics, security incident and event management (SIEM), information technology (IT), or operations incident response
  • OR Bachelor's Degree in Statistics, Mathematics, Computer Science, or related field AND 4+ years experience in software development lifecycle, large-scale computing, threat modeling, cyber security, anomaly detection, Security Operations Center (SOC) detection, threat analytics, security incident and event management (SIEM), information technology (IT), or operations incident response
  • OR equivalent experience
  • Active U.S. Government Secret Security Clearance
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • U.S. citizenship verification
Job Responsibility
  • Coordinate with investigators to prioritize investigation objectives, understand attack paths, and systematically execute mitigation and protection actions to evict threat actors for any security incident impacting any of Microsoft’s products or services
  • Conduct hands-on mitigation where possible; engage service owners when there is a risk of a production outage
  • Maintain hands-on knowledge of mitigation and protection steps for various asset types (e.g. M365, Azure, AI) and publish self-service guidance for impacted engineering teams
  • Brief executive stakeholders on eviction plans and associated status
  • Maintain and evolve an inventory of threat actor Tactics, Techniques, and Procedures (TTPs) and the corresponding eviction capabilities
  • Define and prioritize requirements and use cases for Microsoft’s threat actor eviction platform; operationalize them as they are delivered
  • Drive strategic change to accelerate eviction scenarios (e.g. lean business cases to garner support for broader Microsoft product initiatives or features)
  • Participate in an on-call rotation
Employment Type: Full-time

Incident Management Engineer

Incident Management Engineers (IMEs) are the driving forces of stability across ...
Location: United Kingdom, London
Salary: Not provided
Company: Palantir Technologies
Expiration Date: Until further notice
Requirements
  • Background in Computer Science, Engineering, Information Systems, or other technical field
  • Willingness and interest to travel to other Palantir locations as needed
Job Responsibility
  • Develop a deep understanding of Palantir’s product and delivery ecosystem
  • Collaborate with customer-facing, product, and infrastructure teams on the development and deployment of scalable, reliable software for our customers
  • Diagnose, resolve, and prevent issues encountered in the field
  • Reduce the operational overhead of responding to critical incidents at Palantir through investments in tooling, process, and automation
  • Take part in a 24/7 on-call rotation responsible for coordinating Palantir’s response to mission-critical incidents, ensuring efficient resolution with minimal customer impact
Employment Type: Full-time

Incident Management Engineer

Incident Management Engineers (IMEs) are the driving forces of stability across ...
Location: United States, New York
Salary: 82000.00 - 140000.00 USD / Year
Company: Palantir Technologies
Expiration Date: Until further notice
Requirements
  • Background in Computer Science, Engineering, Information Systems, Incident Management, or other technical field
  • Willingness and interest to travel to other Palantir locations on occasion
Job Responsibility
  • Develop a deep understanding of Palantir’s product and delivery ecosystem
  • Collaborate with customer-facing, product, and infrastructure teams on the development and deployment of scalable, reliable software for our customers
  • Diagnose, resolve, and prevent issues encountered in the field
  • Reduce the operational overhead of responding to critical incidents at Palantir through investments in tooling, process, and automation
  • Take part in a 24/7 on-call rotation responsible for coordinating Palantir’s response to mission-critical incidents, ensuring efficient resolution with minimal customer impact
What we offer
  • Employees (and their eligible dependents) can enroll in medical, dental, and vision insurance as well as voluntary life insurance
  • Employees are automatically covered by Palantir’s basic life, AD&D and disability insurance
  • Commuter benefits
  • Relocation assistance
  • Take what you need paid time off, not accrual based
  • 2 weeks paid time off built into the end of each year (subject to team and business needs)
  • 10 paid holidays throughout the calendar year
  • Supportive leave of absence program including time off for military service and medical events
  • Paid leave for new parents and subsidized back-up care for all parents
  • Fertility and family building benefits including but not limited to adoption, surrogacy, and preservation
Employment Type: Full-time