Senior Software Engineer - Infrastructure Reliability

India, Bangalore · Job Posted April 19, 2026

Job Description

We are seeking a Senior Software Engineer to join our Security Product team, focused on improving the reliability and resilience of our platform across customer environments. You will be embedded within the engineering team, investigating system outages and failures, identifying recurring patterns, and driving fixes - either independently or in collaboration with service owners. You will work closely with production engineering and SRE teams to build playbooks, conduct post-incident reviews, and ensure problems are properly addressed at their root cause.

Job Responsibility

  • Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
  • Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
  • Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
  • Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
  • Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
  • Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
  • Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability

Requirements

  • 7+ years of experience in software engineering, including at least 3 years focused on debugging and solving infrastructure-level problems in distributed systems
  • Strong proficiency in Go; familiarity with Python and Helm is a plus
  • Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
  • Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
  • Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
  • Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
  • Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
  • Excellent analytical and problem-solving skills with a methodical approach to debugging
  • Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams

Nice to have

  • Experience with artifact management or software supply chain tools (e.g., JFrog Artifactory, JFrog Xray)
  • Experience with observability stacks (Prometheus, Grafana, ELK/OpenSearch, Coralogix)
  • Experience with infrastructure-as-code tools (Terraform, Helm, Ansible)
  • Prior experience in a customer-facing technical role (escalation engineering, support engineering, or field engineering)
  • Familiarity with AI-assisted development tools - experience with skills, rules, hooks, and setting up Agents for developer workflows

Similar Jobs for Senior Software Engineer - Infrastructure Reliability

8 matching positions

Software Engineer II & Senior Software Engineer

Security represents the most critical priorities for our customers in a world aw...
Location: United States, Redmond
Salary: 100600.00 - 199000.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including C, C++, C#, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Experience troubleshooting and optimizing automation, reliability, and monitoring for Live Site as part of an on-call rotation owned by the engineering team
  • Experience with distributed systems, messaging systems such as Kafka, and large-scale system design
Job Responsibility
  • Lead the architecture, design and implementation of services for extremely high scale, throughput, durability, and low latency
  • Innovate and make service deployment and maintenance an efficient well-oiled machine that provides excellent reliability with minimal manual engineer intervention
  • Conduct in-depth triage, troubleshooting, and forensics across all facets of the cloud stack while executing processes for corrective action and continual service improvement
  • Drive Infrastructure security improvements for mission critical high scale workloads
  • Lead the definition of requirements, KPIs, priorities and planning of engineering deliverables
  • Mentor and grow an energetic, diverse, and driven team with a good mix of senior and mid-level engineers
  • Fulltime

Senior Software Engineer and Principal Software Engineer - Power Point AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 8+ years of experience in backend service engineering, including work on high-scale infrastructures
  • Proficiency in one or more systems programming languages such as C#, C++
  • 1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
  • 2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
  • 2+ years of experience building software for scale, performance, and reliability
  • Academic or industry experience building, fine-tuning, or deploying models (any category), or building eval-driven systems that use them
Job Responsibility
  • Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
  • Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
  • Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
  • Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
  • Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
  • Design and implement scalable backend services optimized for machine learning workflows and large language model integration
  • Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
  • Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
  • Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
  • Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience
  • Fulltime

Senior Infrastructure Software Engineer, Storage

As a Senior Software Engineer on the Storage team, you will help design, build, ...
Location: Canada
Salary: 190400.00 - 257600.00 CAD / Year
Dropbox
Expiration Date: Until further notice
Requirements
  • 8+ years of experience, with a strong understanding of distributed systems principles, including replication, consistency, and fault tolerance
  • Experience developing and debugging production services in C++, Go, or Rust
  • Familiarity with distributed storage systems, file systems, or data infrastructure at scale
  • Demonstrated ability to write efficient, reliable, and maintainable code in mission-critical environments
  • Experience troubleshooting complex systems and participating in on-call or operational rotations
  • Solid communication and collaboration skills, with the ability to work across infrastructure and product teams
  • Eagerness to learn, grow, and contribute to multi-year infrastructure evolution initiatives
Job Responsibility
  • Design, implement, and maintain large-scale distributed storage systems that ensure data durability, availability, and performance
  • Collaborate with peers to evolve the architecture of Dropbox’s core storage infrastructure for improved scalability and efficiency
  • Contribute to the design of replication, erasure coding, and system lifecycle management systems that balance cost, reliability, and performance
  • Write high-quality, performant, and maintainable code in Go and Rust
  • Participate in the on-call rotation, gaining firsthand experience operating Dropbox’s production storage systems
  • Investigate and resolve complex production issues, performing root cause analysis and driving continuous reliability improvements
  • Partner with cross-functional teams (Networking, Hardware, Capacity Planning) to deliver end-to-end reliable and cost-efficient storage solutions
  • Take ownership of scoped projects and demonstrate growth toward leading larger, cross-team technical initiatives
What we offer
  • Competitive medical, dental and vision coverage
  • Retirement savings through a defined contribution pension or savings plan
  • Flexible PTO/Paid Time Off, paid holidays, Volunteer Time Off, and more
  • Income Protection Plans: Life and disability insurance
  • Business Travel Protection: Travel medical and accident insurance
  • Perks Allowance to be used on what matters most to you
  • Parental benefits including: Parental Leave, Fertility Benefits, Adoptions and Surrogacy support, and Lactation support
  • Mental health and wellness benefits
  • Fulltime

Senior Software Engineer, Infrastructure and Security

At Vanta, our mission is to help businesses earn and prove trust. We believe tha...
Location: United States
Salary: 179000.00 - 211000.00 USD / Year
Vanta
Expiration Date: Until further notice
Requirements
  • You’ve played technical leadership roles on Infrastructure or platform teams
  • You have experience with infrastructure, AWS services, and scaling platforms in fast-growing environments
  • You care deeply about performance and reliability
  • You’re thoughtful about trade-offs and have good product sense when creating new infrastructure
  • You’re open to using AI to amplify your skills and strengthen your work - demonstrating curiosity, a willingness to learn, and sound judgment in applying AI responsibly to improve efficiency and impact
Job Responsibility
  • Design and build scalable infrastructure to support rapid growth in data volume, service usage, and engineering velocity
  • Lead projects across our cloud infrastructure, including container orchestration (e.g., AWS Fargate, ECS), monitoring and alerting systems, networking, and database maintenance
  • Implement and maintain core security infrastructure and controls, including service-to-service authentication, secrets management, application security primitives (e.g., rate-limiting, encryption libraries), and infrastructure hardening
  • Identify and solve complex scalability and performance challenges, particularly related to service reliability and data throughput
  • Partner closely with Security Engineering to implement infrastructure that supports best-in-class security and compliance practices
  • Drive infrastructure design reviews and provide technical guidance on architectural decisions and trade-offs
  • Work with talented and kind engineers to make a significant impact on our customer base, enabling them to improve their security and prove it
  • Contribute to building Vanta’s engineering culture as we grow
What we offer
  • Equity
  • Medical benefits
  • 401(k) plan
  • Other company perk programs
  • Comprehensive medical, dental, and vision coverage, with 100% of employee-only benefit premiums covered for most medical plans
  • 16 weeks fully-paid Parental Leave for all new parents
  • Health & wellness stipend
  • Remote workspace, internet, and cellphone stipend
  • Commuter benefits for team members who report to the SF and NYC offices
  • Family planning benefits
  • Fulltime

Senior Software Engineer, Infrastructure and Security

Vanta’s Infrastructure & Security team provides a platform that powers the scala...
Location: Canada
Salary: Not provided
Vanta
Expiration Date: Until further notice
Requirements
  • You’ve played technical leadership roles on Infrastructure or platform teams
  • You have experience with infrastructure, AWS services, and scaling platforms in fast-growing environments
  • You care deeply about performance and reliability
  • You’re thoughtful about trade-offs and have good product sense when creating new infrastructure
  • You’re open to using AI to amplify your skills and strengthen your work - demonstrating curiosity, a willingness to learn, and sound judgment in applying AI responsibly to improve efficiency and impact
Job Responsibility
  • Design and build scalable infrastructure to support rapid growth in data volume, service usage, and engineering velocity
  • Lead projects across our cloud infrastructure, including container orchestration (e.g., AWS Fargate, ECS), monitoring and alerting systems, networking, and database maintenance
  • Implement and maintain core security infrastructure and controls, including service-to-service authentication, secrets management, application security primitives (e.g., rate-limiting, encryption libraries), and infrastructure hardening
  • Identify and solve complex scalability and performance challenges, particularly related to service reliability and data throughput
  • Partner closely with Security Engineering to implement infrastructure that supports best-in-class security and compliance practices
  • Drive infrastructure design reviews and provide technical guidance on architectural decisions and trade-offs
  • Work with talented and kind engineers to make a significant impact on our customer base, enabling them to improve their security and prove it
  • Contribute to building Vanta’s engineering culture as we grow
What we offer
  • Industry-competitive salary and equity
  • 100% covered medical, dental, and vision benefits with dependents coverage
  • Pension contribution
  • 16 weeks fully paid Parental Leave for all new parents
  • Health & wellness stipend
  • Remote workspace, internet, and cellphone stipend
  • Flexible work hours and location
  • 21 days of Vacation Time and 80 hours of Sick Leave
  • 11 company-paid holidays
  • Virtual team building activities, lunch and learns, and other company-wide events
  • Fulltime

Senior Software Engineer, Infrastructure

Serval is building an AI platform to automate complex IT workflows for modern en...
Location: United States, San Francisco
Salary: 200000.00 - 300000.00 USD / Year
Serval
Expiration Date: Until further notice
Requirements
  • 3+ years building and operating large-scale distributed systems in production environments
  • Strong experience writing and maintaining Terraform for infrastructure provisioning and management
  • Deep knowledge of at least one major cloud provider (AWS, GCP, or Azure), including compute, networking, storage, and managed services
  • Experience building, packaging, and supporting self-hosted or on-premises software deployments for enterprise customers
  • Proficiency in Python, Go, or similar languages for building automation, tooling, and infrastructure services
  • Strong understanding of networking, databases, containerization (Docker, Kubernetes), and orchestration systems
  • Experience with monitoring, logging, alerting, and incident management tools (e.g., Datadog, Prometheus, Grafana, PagerDuty)
  • Ability to communicate technical concepts clearly to customers and provide infrastructure support and guidance
  • Ability to debug complex system issues, analyze performance bottlenecks, and implement effective solutions
Job Responsibility
  • Design, implement, and operate large-scale distributed systems that power Serval's AI agents, workflow orchestration, and data pipelines
  • Write and maintain Terraform modules to provision and manage cloud infrastructure across AWS, GCP, or Azure environments
  • Build and maintain deployment packages, installation scripts, and infrastructure templates that enable customers to self-host Serval in their own environments
  • Provide technical guidance and troubleshooting support to enterprise customers deploying and operating self-hosted instances of Serval
  • Ensure high availability, performance, and reliability of production systems through monitoring, alerting, incident response, and capacity planning
  • Build internal tools and platforms that enable product engineers to deploy, test, and operate services efficiently
  • Collaborate with engineering teams to design resilient, scalable architectures that support both cloud-hosted and self-hosted deployment models
  • Profile and optimize system performance, including compute, storage, networking, and database layers
  • Implement security best practices and ensure infrastructure meets enterprise compliance requirements for both managed and self-hosted deployments
What we offer
  • Equity
  • Comprehensive health coverage
  • Flexible PTO
  • Daily lunches and snacks
  • Onsite gym access
  • Regular team events and offsites
  • Fulltime

Senior Software Engineer, Infrastructure

Sentry.io provides a suite of services to diagnose health problems in their cust...
Location: United States, San Francisco
Salary: 190000.00 - 280000.00 USD / Year
Sentry
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as a Software Engineer or similar role
  • Strong proficiency with Python
  • Expertise in designing and building scalable systems and APIs and distributed systems
  • Experience with cloud platforms (e.g., AWS, Azure, GCP) and their SDKs/APIs
  • Proficiency with containerization and orchestration tools (e.g., Docker, Kubernetes)
  • Understanding of CI/CD pipelines and deployment automation
  • Knowledge of distributed systems design
  • Track record of building reliable systems with strong operational ownership
  • Strong written communication skills and comfort producing documentation that supports adoption
Job Responsibility
  • Design systems that scale with company growth, balancing reliability, performance, and cost
  • Build platform services and interfaces that enable self-service workflows for engineering teams
  • Collaborate with other engineering teams to enhance solutions tailored to their needs
  • Provide comprehensive documentation, training, and support for effective adoption of tools
  • Continuously assess and enhance capabilities based on user feedback and emerging technologies
  • Monitor and troubleshoot issues to maintain solution availability
What we offer
  • Equity
  • Incentive compensation
  • Equity grants
  • Paid time off
  • Group health insurance coverage
  • Fulltime

Senior Software Engineer, Infrastructure & Simulation

Mach Industries is building autonomous systems that must be thoroughly tested an...
Location: United States, Huntington Beach
Salary: 170000.00 - 210000.00 USD / Year
Mach Industries
Expiration Date: Until further notice
Requirements
  • Strong software engineering fundamentals with 5+ years of experience building production systems or developer infrastructure
  • Proficiency in C++ and at least one systems language such as Python, Rust, or C
  • Experience building or working with simulation, testing, or validation frameworks for complex software or embedded systems
  • Familiarity with CI/CD pipelines, containerization (Docker), and modern build systems
  • Experience scaling test infrastructure across distributed or cloud compute environments
  • Knowledge of infrastructure tooling such as Jenkins, GitHub Actions, Terraform, or similar
Job Responsibility
  • Design, build, and maintain simulation and validation infrastructure for aircraft systems, including discrete simulations of avionics, sensors, power distribution, flight controls, and embedded software
  • Develop and operate Software-in-the-Loop (SITL) and Hardware-in-the-Loop (HITL) frameworks to support rapid development, regression testing, and system validation across flight autonomy and embedded stacks
  • Build software infrastructure to emulate bare-metal drivers, avionics peripherals, and aircraft system behaviors, enabling early validation, fault injection, and failure-mode testing
  • Create and scale automated test pipelines that integrate simulation and HITL into CI/CD workflows to continuously validate flight software performance, safety, and reliability
  • Collaborate closely with autonomy, embedded, avionics, and systems engineers to define robust test strategies aligned with real-world flight profiles, environmental conditions, and operational constraints
  • Improve simulation fidelity, determinism, and performance to close the gap between simulated behavior and real aircraft dynamics
  • Optimize build and test infrastructure to leverage parallel execution, cluster computing, and shared compute resources for large-scale simulation and regression workloads
  • Define, track, and evolve metrics for test quality, coverage, and traceability, driving continuous improvement in aircraft software validation and confidence
What we offer
  • Equity
  • Healthcare, dental and vision plans
  • Retirement savings
  • Paid time off
  • Funds for continuing education, training, and career growth
  • Fulltime