Senior Staff Engineer - Availability and Incident Management Job at Geico (Chevy Chase)

Senior Staff Engineer, Software (Cloud Platform, Production & Reliability – Machine Identity Security)

The Production Engineering team is responsible for building, scaling, and operat...

Location

United States, Santa Clara

Salary:

126000.00 - 203500.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

5+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions)
Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
Strong problem-solving skills and ability to work across teams

Job Responsibility

Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
Lead improvements across production systems, including performance, availability, and incident response
Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
Partner with development teams to improve system reliability, observability, and cloud-native design patterns
Define and implement monitoring, alerting, and observability strategies across distributed systems
Lead incident response efforts, including root cause analysis and long-term remediation strategies
Identify and eliminate operational toil through automation and system improvements
Mentor engineers and contribute to raising the bar for production engineering practices

What we offer

Restricted stock units
Bonus

Full-time

Senior Staff Engineer, Software

Our Senior Staff Software Engineer works with our Managers, Distinguished and Sr...

Location

United States, Chevy Chase; Palo Alto; Dallas; Seattle

Salary:

120000.00 - 260000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Fluency in at least one modern language (Go/Python preferred)
Understanding of compression algorithms, deduplication, encryption, and error correction
Understanding of SQL and NoSQL databases, including stateful services management and storage
Understanding of networking, caches, key/value stores, load balancing, global load balancing, queues, DNS and CDN
Deep knowledge of SRE practices, methodologies, and principles, along with a solid understanding of on-prem and public cloud-based network, compute, and storage technologies
In-depth knowledge of hybrid cloud architecture, IaaS and PaaS technologies, container orchestration platforms (e.g., Kubernetes), cloud efficiency, observability, etc.
Strong background in incident management
Ability to create incident response playbooks, runbooks, incident triaging strategies, and post-incident analysis to drive continuous improvement in system reliability and availability
Experience with open-source management and monitoring tools
Experience with infrastructure automation, tooling, and configuration management frameworks (e.g., Puppet, Chef, Ansible, Pulumi, Terraform, etc.)

Job Responsibility

Develop and drive the overall strategy for the Business Continuity and Disaster Recovery (BCDR) organization, aligning it with the organization's business goals and objectives
Provide thought leadership in BCDR, staying ahead of industry trends and emerging technologies to enhance our backup/restore posture
Conduct comprehensive risk assessments to identify potential threats and vulnerabilities
Design and implement robust strategies to ensure data safety, integrity and correctness
Lead the design and architecture of resilient and scalable systems, considering both on-premises and cloud-based solutions
Collaborate with cross-functional teams to integrate data safeguard best practices into the development and deployment processes
Develop and maintain comprehensive incident response plans to address various disaster scenarios on our orchestration and backup/restore systems
Conduct regular simulations and drills to ensure the readiness of the organization in the event of a disaster
Hands-on software engineering and SDLC best practices (Technical Review Documents, Architecture, Software Development, Software Reviews, Testing, Production Readiness Reviews, among others)
Evaluate, select, and implement cutting-edge technologies and tools to enhance our data safeguard capabilities including but not limited to processes, compliance, and visibility

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Full-time

Senior Staff DevOps Engineer (Secure Cloud Access)

As part of your role, you will design, implement, and deploy products and infras...

Location

Israel, Southern District

Salary:

Not provided

Palo Alto Networks Italia

Expiration Date

Until further notice

Requirements

4+ years of experience as a DevOps engineer or Site Reliability Engineer
Deep knowledge and experience in cloud infrastructure such as AWS, Azure, or Google Cloud
Strong hands-on experience operating production workloads on AWS, with an emphasis on serverless systems (Lambda, DynamoDB, OpenSearch, S3, API Gateway, EventBridge, SQS/SNS, CloudFront, IAM, and CloudWatch)
Experience with Infrastructure as Code (IaC) tools such as Terraform, AWS CDK, or CloudFormation
Experience with containerization and orchestration technologies like Docker, Kubernetes, or ECS/ECR
Experience with CI/CD tools and configuration management systems like Jenkins, Git, or Ansible
Practical experience with high-availability design, disaster recovery planning, backups, restores, and rollbacks across multiple AWS regions
Proficiency in scripting with Bash and Python
Experience with end-to-end system ownership, including on-call participation, incident response, and root-cause analysis
Fluent in English with strong writing skills

Job Responsibility

Design and manage Continuous Integration/Deployment Services, including build, packaging, and deployment
Design, document, implement, and maintain scripts to enhance current and future build and release processes
Incorporate new development projects into existing build structures
Continually evaluate tools and technologies to improve the overall release process

Full-time

Senior Staff DevOps Engineer (Secure Cloud Access)

As part of your role, you will design, implement, and deploy products and infras...

Location

Israel, Southern District

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

4+ years of experience as a DevOps engineer or Site Reliability Engineer
Deep knowledge and experience in cloud infrastructure such as AWS, Azure, or Google Cloud
Strong hands-on experience operating production workloads on AWS, with an emphasis on serverless systems (Lambda, DynamoDB, OpenSearch, S3, API Gateway, EventBridge, SQS/SNS, CloudFront, IAM, and CloudWatch)
Experience with Infrastructure as Code (IaC) tools such as Terraform, AWS CDK, or CloudFormation
Experience with containerization and orchestration technologies like Docker, Kubernetes, or ECS/ECR
Experience with CI/CD tools and configuration management systems like Jenkins, Git, or Ansible
Practical experience with high-availability design, disaster recovery planning, backups, restores, and rollbacks across multiple AWS regions
Proficiency in scripting with Bash and Python
Experience with end-to-end system ownership, including on-call participation, incident response, and root-cause analysis
Fluent in English with strong writing skills

Job Responsibility

Design and manage Continuous Integration/Deployment Services, including build, packaging, and deployment
Design, document, implement, and maintain scripts to enhance current and future build and release processes
Incorporate new development projects into existing build structures
Continually evaluate tools and technologies to improve the overall release process

Full-time

Senior Staff Software Engineer - GIA Platform

GEICO is seeking an experienced software engineer with a passion for building hi...

Location

United States, Palo Alto

Salary:

130000.00 - 260000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Fluency in at least one modern language (Go is preferred, .NET is a plus)
Proven track record of designing, implementing, and maintaining highly scalable, available, and reliable systems in production
Understanding of security best practices and data encryption technology
Understanding of SQL and NoSQL databases, including stateful services management and storage
Understanding of networking, caches, key/value stores, load balancing, global load balancing, queues, DNS and CDN
Deep knowledge of DevOps practices, methodologies, and principles, along with a solid understanding of on-prem and public cloud-based network, compute, and storage technologies
In-depth knowledge of hybrid cloud architecture, IaaS and PaaS technologies, container orchestration platforms (e.g., Kubernetes), cloud efficiency, observability, etc.
Strong background in incident management
Ability to create incident response playbooks, runbooks, incident triaging strategies, and post-incident analysis to drive continuous improvement in system reliability and availability
Experience with open-source management and monitoring tools

Job Responsibility

Develop and drive the overall technical roadmap for the GIA Platform organization, aligning it with the organization's business goals and objectives
Work closely with executive leadership, tech teams, and other cross-discipline stakeholders to build an optimal strategy for delivering platform services
Leverage technical and domain expertise to influence partners and leadership to create a force multiplier in achieving milestones in the team’s technical roadmap
Provide thought leadership in GIA Platform, staying ahead of industry trends and emerging technologies to create effective strategy that minimizes business disruption while balancing the modernization of legacy platform components
Lead the design and architecture of resilient and scalable platform services, considering both on-premises and cloud-based solutions
Champion software development best practices and safe deployment processes to enable continuous, incremental delivery of business value
Contribute directly to, and lead by example in, day-to-day engineering activities (writing feature code and automated tests, raising PRs and reviewing peers’ PRs, developing and managing CI/CD pipelines, production support, among others)
Develop and maintain comprehensive incident response plans to address various disaster scenarios across multiple partner integration points
Spearhead collaboration with various stakeholders in production readiness assessment and operational excellence
Hands-on software engineering and SDLC best practices (Technical Review Documents, Architecture, Software Development, Code Reviews, Testing, Production Readiness Reviews, among others)

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Full-time

Staff Cybersecurity Engineer - PKI/Secrets Management

The Role: We’re looking for a senior, self-driven Cyber Security Engineer to ow...

Location

United States, Austin; Warren

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Mathematics, Physics, or equivalent senior-level industry experience
7+ years of experience in enterprise security engineering or Site Reliability Engineering (SRE), with direct responsibility for high-availability security or cryptographic services
7+ years of experience with enterprise secrets management platforms (e.g., HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, BeyondTrust), including architecture, operations, and integration at scale
Strong understanding of public-key cryptography, PKI, and modern cryptographic protocols, with the ability to make pragmatic, risk-informed design decisions
Demonstrated experience designing, operating, and evolving production PKI systems (root and issuing CAs, CRL/OCSP, certificate lifecycle, and policy governance)
Proficiency with infrastructure-as-code (e.g., Terraform) and engineering practices that enable repeatable, auditable, and secure deployments
Working knowledge of major cloud platforms (AWS, GCP, Azure) and how to integrate PKI and secrets management with cloud-native services
Experience with containerization, orchestration (e.g., Kubernetes), and CI/CD workflows, including secure delivery patterns and secrets handling
Excellent communication skills, with a track record of presenting complex technical concepts, trade-offs, and recommendations to engineering and executive audiences
Strong threat modeling and security architecture skills, with the ability to anticipate abuse cases and design for resilience

Job Responsibility

Setting the technical vision and architecting, implementing, and operating scalable, highly available PKI and secrets management services for the enterprise
Owning design decisions that shape internal trust models, cryptographic architectures, and access patterns for the most sensitive data and systems
Defining, implementing, and continuously improving policies, processes, and controls for the full lifecycle of keys, certificates, and secrets across diverse platforms
Influencing and aligning engineering, infrastructure, and leadership teams to deliver robust, observable, and compliant cryptographic systems
Mentoring and developing engineers, raising the bar for technical excellence, and driving consistent best practices for cryptographic and secrets management across the organization
Advising senior leadership on long-term security architecture strategy, trade-offs, and investment priorities related to identity, PKI, and secrets management
Providing operational leadership, including participation in on-call rotations for global, mission-critical services and driving post-incident improvements
Leading HSM strategy, including architecture, platform selection, appliance consolidation, and multi-year roadmap planning in alignment with enterprise security and compliance goals

Full-time

Senior Staff Site Reliability Engineer

As a Site Reliability Engineer on the SASE Platform team, you will play a critic...

Location

Israel, Tel Aviv

Salary:

Not provided

Palo Alto Networks Italia

Expiration Date

Until further notice

Requirements

5+ years of experience working with Unix/Linux systems, including shell, tools, networking, and kernel concepts
2+ years of hands-on experience with microservices architectures running on Kubernetes and container platforms
Proven experience operating workloads in public cloud environments (e.g., AWS, GCP, Azure) at scale
Proficiency in building automation and tools in at least one scripting or programming language (e.g., Python, Go, Java)
Strong experience with Infrastructure as Code (IaC) tools such as Terraform or Ansible
Bachelor’s degree in Engineering, Computer Science, or a related technical field, or equivalent practical experience

Job Responsibility

Proactively collaborate with development teams to embed reliability, scalability, and operability into services from the earliest design stages
Design, review, and evolve cloud-native architectures to improve availability, performance, cost efficiency, and fault tolerance
Build and operate automation for provisioning, deploying, and managing global infrastructure using Infrastructure as Code (IaC)
Improve CI/CD pipelines and release processes to enable safe, fast, and repeatable deployments
Drive observability best practices, including metrics, logs, traces, and SLIs/SLOs to enable data-driven incident analysis
Participate in on-call rotations, reducing mean time to resolution (MTTR) through automation and proactive reliability improvements
Challenge existing processes by championing reliability, security, and operational maturity across the organization

Full-time

Unix - Senior Cloud - Digital Engineering Sr. Staff Engineer

Location

India, Noida

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Should have a minimum of 8 to 10 years of experience as a Linux/Unix System Administrator, with expertise in at least two flavors of Unix. Linux is a must.
Should have a deep understanding of the Linux OS and be able to handle day-to-day admin tasks.
Should be well versed in shell scripting.
Expert in Unix-Linux, AWS Cloud Administration, OS/server administration, patching, maintenance, and troubleshooting.
Proficient in operating and troubleshooting AWS services like EC2, networking, RDS, backups, storage (EBS, EFS, S3, Glacier), and security (Well-Architected framework).
Possesses a strong understanding of networking concepts for configuring secure VPCs, subnets, landing zones, ACLs, and security groups.
Experience in end-to-end cloud migrations, including strategy, assessment, design, architecture, and execution on AWS.
Skilled in identifying and migrating suitable applications and workloads, gathering migration requirements, and collaborating with stakeholders.
Good knowledge of various AWS services like Lambda, SNS, SQS, DynamoDB, OpenSearch, Transfer Family, CloudWatch, EC2, EFS, EKS, Step Functions, ELB, ACM, Directory Services, and networking.
Hands-on expertise in designing, architecting, deploying, and supporting hybrid cloud environments.

Job Responsibility

Perform installation, customization, and maintenance of the UNIX/Linux server operating system and system software products in support of business processing requirements for both on-premises and cloud environments.
Evaluate and integrate new operating system versions, drivers, and hardware.
Provide in-depth diagnosis of operating system software/hardware failures and develop solutions.
Monitor and tune systems to achieve optimum performance in standalone and multi-tiered environments.
Conduct system analysis and configuration management, and develop improvements to system software performance, availability, and reliability.
Implement appropriate levels of system security; maintain security patching, remediate vulnerabilities, and propose solutions.
Perform incident resolution, problem determination, and root cause analysis in accordance with service levels. Knowledge of ITIL.
Recommend and implement modifications, innovations, and improvement ideas for the server environment.
Prepare standard documents and review them periodically for needed updates.
Identify opportunities for process and procedure enhancements to drive efficiency and customer service levels.

Full-time
