Incident Engineer Job at Adyen (Bengaluru)

Monitoring Engineer / Incident Manager

A team within Engineering under the Platform Excellence pillar exhibits an unwav...

Location

Netherlands, Amsterdam

Salary:

Not provided

Adyen

Expiration Date

Until further notice

Requirements

At least 5 years of experience with incident management, problem management, incident client communication, and platform monitoring operations
Experience with problem management practices - identifying trends across incidents, conducting root cause investigations and driving preventative action
Solid communication skills and the ability to develop strong working relationships throughout the organization, able to translate technical situations clearly and concisely to a diverse audience via data-visualizing dashboards and written documents
Willing to participate in the on-call rotation and work in a fast-paced, dynamic environment
Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, etc.
Experience with observability platforms like Datadog, Dynatrace, Splunk
Excellent analytical and problem-solving skills, with the ability to analyze complex systems and spot the root cause of issues
Thrive in an environment where collaboration is crucial and where a global approach is key for successful implementation of processes and projects
Passion for defining and standardizing processes to drive strategic improvement, and the ability to explain complex technical concepts clearly to non-technical audiences
Natural ability for handling complex situations and multiple responsibilities simultaneously

Job Responsibility

Participate in 24/7 on-call monitoring: observe platform and merchant performance and proactively detect issues to mitigate risks in partnership with Engineering teams
Coordinate the mitigation, recovery, and resolution of high-impact incidents, ensuring a rapid and effective response across teams
Represent the customer perspective during incidents, maintaining a strong customer-centric approach
Communicate with merchants in real time during an incident, presenting the most accurate and up-to-date information to keep them informed
Escalate critical incidents when needed and provide structured communication to senior management
Go beyond reactive incident response by analyzing incident trends to identify recurring issues and systemic weaknesses and partner with engineering and product teams to advocate for long-term fixes
Work together with Operations, Product, and Engineering teams to integrate, grow, and continuously improve monitoring strategy and increase reliability
Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
Mitigate merchant impact risk by acting on alerts in partnership with Engineering teams and contribute to the monitoring playbook by documenting learnings
Improve operations by leading and project-managing initiatives and by developing automation tools for effective monitoring

Fulltime

O&M Infrastructure Engineer - Incident Management

Job Description: Promptly identify and categorize incidents based on impact and ...

Location

Saudi Arabia, Riyadh

Salary:

Not provided

Giza Systems

Expiration Date

Until further notice

Requirements

Ability to prioritize incidents based on impact and urgency
Ability to manage multiple incidents simultaneously and meet SLAs
Ability to work under pressure and make sound decisions in a fast-paced 24x7 environment
Excellent verbal and written communication skills for reporting and stakeholder coordination
Proficiency in documenting procedures, findings, and customizing reports
Strong understanding of IT Service Management (ITIL) frameworks
Experience with incident logging, categorization, prioritization, and resolution processes
Strong data analysis skills to review monitoring data and identify trends
Expertise in defining and tracking ITSM Key Performance Indicators (KPIs)
Years of experience: minimum 1, maximum 3

Job Responsibility

Promptly identify and categorize incidents based on impact and urgency
Prioritize incidents based on severity and effect on business operations
Record detailed incident information in the incident management system
Escalate incidents to L2/L3 support teams or management when required
Open incidents with technology vendors, upload logs, follow up on cases, and coordinate technical meetings
Maintain clear and effective communication with IT teams, business users, management, and external vendors
Provide regular updates on incident status, resolution progress, and potential impact
Conduct post-incident reviews to evaluate handling effectiveness and identify improvements
Document all incident-related activities including actions taken and final resolutions
Analyze incident data to identify trends, patterns, and recurring issues

Fulltime

Staff Site Reliability Engineer - Incident Management & Reliability

We’re not just building better tech. We’re rewriting how data moves and what the...

Location

Canada

Salary:

225100.00 - 264500.00 CAD / Year

Confluent

Expiration Date

Until further notice

Requirements

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure
Experience navigating reliability/incident programs at 500+ engineer organizations
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes

Job Responsibility

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

What we offer

Remote-First Work
Robust Insurance Benefits
Flexible Time Away
The Best Teammates
Experience Ambassadors
Open and Honest Culture
Well-Being and Growth
Offers Equity

Fulltime

Senior Security Engineer - Incident Response

Mozilla is looking for an Incident Responder to monitor and mitigate attacks acr...

Location

Germany

Salary:

Not provided

Mozilla

Expiration Date

Until further notice

Requirements

5+ years of demonstrated ability managing security incidents at a global scale and/or experience working in Security Operations Centers (SOC), Product Security Incident Response Teams (PSIRT), and Computer Security Incident Response Teams (CSIRT)
Expertise with security information and event management (SIEM) systems (e.g. ELK, Google BigQuery, Splunk). Splunk proficiency is preferred
Expertise with endpoint detection and investigation. Hands-on experience with leading EDR tools and demonstrated ability to leverage endpoint telemetry to find root cause
Expertise with security orchestration and automation (SOAR) platforms such as Tines or Splunk SOAR
Superb communication and leadership capacity; ability to partner effectively with diverse company stakeholders
Real-world experience in software development and/or engineering operations for consumer products and services
B.S. in a technology-focused field is helpful
Practical experience working with cloud technologies (e.g. Google Cloud Platform, Amazon Web Services, Heroku, Microsoft Azure)
Ownership and Accountability

Job Responsibility

Identify and respond to security incidents on a global scale
Act as an incident commander to drive incidents through the entire response lifecycle
Design and maintain a portfolio of security alerts, automated actions, playbooks and escalation workflows in support of a high-performing 24/7 incident response capability
Conduct threat hunting activities, anticipate future threats, and maintain forward-thinking strategies for tools/technology/processes that combat sophisticated threat actors
Research threat intelligence reports, triage and manage resulting workflows
Partner with key stakeholders and communicate effectively to maintain a continuously improving feedback loop of preparation, identification, analysis, containment, and post-mortem activities
Participate in on-call rotation

What we offer

Generous performance-based bonus plans
Rich medical, dental, and vision coverage
Generous retirement contributions with 100% immediate vesting
Quarterly all-company wellness days
Country specific holidays plus a day off for your birthday
One-time home office stipend
Annual professional development budget
Quarterly well-being stipend
Considerable paid parental leave
Employee referral bonus program

Fulltime

Incident Management Engineer German Speaker

We are seeking a BTS Incident Management Engineer to support Vodafone Germany En...

Location

Egypt, Cairo

Salary:

Not provided

Vodafone

Expiration Date

Until further notice

Requirements

0–2 years of experience in network operations, technical support, or IP-based troubleshooting environments
Foundational knowledge of IP and voice technologies, including routing, switching, and IP services such as DNS and DHCP
Strong analytical and problem-solving skills in complex technical scenarios
Customer-focused mindset with clear, confident communication skills
Fluent in German (written and spoken) at a minimum C1 level
Familiarity with Vodafone fixed, mobile, and convergence products is desirable

Job Responsibility

Perform end-to-end incident troubleshooting, identification, and resolution for Vodafone Germany Enterprise customers
Resolve incidents within agreed Service Level Agreements (SLAs) and Key Performance Indicators (KPIs)
Collaborate with internal competence teams, external carriers, and third-party partners to achieve timely incident resolution
Manage and update incident tickets accurately, ensuring transparency and clear communication throughout the lifecycle
Work on a rotational shift basis, covering 24/7 operations including weekends and public holidays

What we offer

Exposure to enterprise-scale network environments supporting Vodafone Germany customers
Opportunity to work within a globally recognised technology organisation
Hands-on experience in a 24/7 operational support model
Collaboration with cross-functional and international technical teams
Structured environment to build a strong foundation in incident and service management

Fulltime

Senior Security Engineer - Security Incident Response

The Cloud & AI organization accelerates Microsoft’s mission and bold ambitions t...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate in Statistics, Mathematics, Computer Science, or related field
OR Master's Degree in Statistics, Mathematics, Computer Science, or related field AND 3+ years experience in software development lifecycle, large-scale computing, threat modeling, cyber security, anomaly detection, Security Operations Center (SOC) detection, threat analytics, security incident and event management (SIEM), information technology (IT), or operations incident response
OR Bachelor's Degree in Statistics, Mathematics, Computer Science, or related field AND 4+ years experience in software development lifecycle, large-scale computing, threat modeling, cyber security, anomaly detection, Security Operations Center (SOC) detection, threat analytics, security incident and event management (SIEM), information technology (IT), or operations incident response
OR equivalent experience
Active U.S. Government Secret Security Clearance
Ability to meet Microsoft, customer and/or government security screening requirements
U.S. citizenship verification

Job Responsibility

Coordinate with investigators to prioritize investigation objectives, understand attack paths, and systematically execute mitigation and protection actions to evict threat actors for any security incident impacting any of Microsoft’s products or services
Conduct hands-on mitigation where possible; engage service owners when there is a risk of a production outage
Maintain hands-on knowledge of mitigation and protection steps for various asset types (e.g. M365, Azure, AI) and publish self-service guidance for impacted engineering teams
Brief executive stakeholders on eviction plans and associated status
Maintain and evolve an inventory of threat actor Tactics, Techniques, and Procedures (TTPs) and the corresponding eviction capabilities
Define and prioritize requirements and use cases for Microsoft’s threat actor eviction platform; operationalize them as they are delivered
Drive strategic change to accelerate eviction scenarios (e.g. lean business cases to garner support for broader Microsoft product initiatives or features)
Participate in an on-call rotation

Fulltime

Incident Management Engineer

Incident Management Engineers (IMEs) are the driving forces of stability across ...

Location

United Kingdom, London

Salary:

Not provided

Palantir Technologies

Expiration Date

Until further notice

Requirements

Background in Computer Science, Engineering, Information Systems, or other technical field
Willingness and interest to travel to other Palantir locations as needed

Job Responsibility

Develop a deep understanding of Palantir’s product and delivery ecosystem
Collaborate with customer-facing, product, and infrastructure teams on the development and deployment of scalable, reliable software for our customers
Diagnose, resolve, and prevent issues encountered in the field
Reduce the operational overhead of responding to critical incidents at Palantir through investments in tooling, process, and automation
Take part in a 24/7 on-call rotation responsible for coordinating Palantir’s response to mission-critical incidents, ensuring efficient resolution with minimal customer impact

Fulltime

Incident Management Engineer

Incident Management Engineers (IMEs) are the driving forces of stability across ...

Location

United States, New York

Salary:

82000.00 - 140000.00 USD / Year

Palantir Technologies

Expiration Date

Until further notice

Requirements

Background in Computer Science, Engineering, Information Systems, Incident Management, or other technical field
Willingness and interest to travel to other Palantir locations on occasion

Job Responsibility

Develop a deep understanding of Palantir’s product and delivery ecosystem
Collaborate with customer-facing, product, and infrastructure teams on the development and deployment of scalable, reliable software for our customers
Diagnose, resolve, and prevent issues encountered in the field
Reduce the operational overhead of responding to critical incidents at Palantir through investments in tooling, process, and automation
Take part in a 24/7 on-call rotation responsible for coordinating Palantir’s response to mission-critical incidents, ensuring efficient resolution with minimal customer impact

What we offer

Employees (and their eligible dependents) can enroll in medical, dental, and vision insurance as well as voluntary life insurance
Employees are automatically covered by Palantir’s basic life, AD&D and disability insurance
Commuter benefits
Relocation assistance
Take-what-you-need paid time off, not accrual-based
2 weeks paid time off built into the end of each year (subject to team and business needs)
10 paid holidays throughout the calendar year
Supportive leave of absence program including time off for military service and medical events
Paid leave for new parents and subsidized back-up care for all parents
Fertility and family building benefits including but not limited to adoption, surrogacy, and preservation

Fulltime
