Senior Systems Engineer - Infrastructure & Platform Reliability

Lambda

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

206000.00 - 310000.00 USD / Year

Job Description:

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Information Systems at Lambda is responsible for building and scaling the internal systems that power our business. We partner across the company (Finance, GTM, Engineering, and People) to implement tools, automate workflows, and ensure data flows securely and accurately. Our scope includes enterprise applications, integrations, data platform and analytics, compliance automation, and all things IT.

Job Responsibility:

  • Design, write, and deliver software and services to improve the availability, scalability, reliability, and efficiency of Lambda’s internal IT systems and platforms
  • Solve problems relating to mission critical services and build automation to prevent problem recurrence with the goal of automating response to all non-exceptional events
  • Work with Lambda Engineering and internal teams to influence and create new designs, architectures, standards, and methods for large-scale distributed systems
  • Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning
  • Be an excellent communicator, producing documentation and related artifacts for the systems you are responsible for

Requirements:

  • Have a keen interest in system design and in architecting for performance and scalability, plus experience with multiple cloud infrastructure platforms (AWS, GCP, Azure, etc.)
  • Think carefully about systems: edge cases, failure modes, behaviors, and specific implementations
  • Know and prefer configuration management systems and toolchains (Chef, Ansible, Terraform, GitHub Actions, etc.)
  • Have solid programming skills: Python, Go, etc.
  • Have an urge to collaborate and communicate asynchronously, combined with a desire to record and document issues and solutions
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can’t help but fix it
  • Have an urge to deliver quickly and effectively, and to iterate fast

Nice to have:

  • Experience and interest in ML/AI workloads and compute
  • Practical experience implementing and managing paging, alerting, and on-call scheduling flows
  • A positive attitude, combined with a desire to learn and collaborate
What we offer:
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan that we all actually use

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Senior Systems Engineer - Infrastructure & Platform Reliability

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...
Location:
United States, Deerfield
Salary:
96000.00 - 132000.00 USD / Year
Baxter
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
  • Applicants must be authorized to work for any employer in the U.S.
  • Unable to sponsor or take over sponsorship of an employment visa at this time.
Job Responsibility:
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes.
What we offer:
  • Support for Parents
  • Continuing Education/Professional Development
  • Employee Health & Well-Being Benefits
  • Paid Time Off
  • 2 Days a Year to Volunteer
  • Medical and dental coverage starting day one
  • Insurance coverage for basic life, accident, short-term and long-term disability
  • Business travel accident insurance
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Fulltime

Senior Systems Engineer

We are looking for a versatile and driven Senior Systems Engineer to join our En...
Location:
United States, Chicago
Salary:
130000.00 USD / Year
AKUNA CAPITAL
Expiration Date:
Until further notice
Requirements:
  • Degree in Computer Science, Information Systems, or a related field
  • 5-7 years of systems engineering experience
  • Advanced Linux knowledge including kernel bypass, kernel tuning, and customizing kernels
  • Deep understanding of virtualization and containerization technologies
  • Extensive experience with a variety of Linux distributions (RedHat, Ubuntu, etc.)
  • Deep understanding of system monitoring and configuration management tools (Ansible, Foreman, Prometheus and Icinga/Nagios)
  • Proficiency in scripting and using automation and orchestration tools such as Python and Bash
  • Expertise in troubleshooting multicast and TCP related performance issues
  • Experience automating daily software and hardware related tasks
  • Demonstrated ability to lead large technical projects
Job Responsibility:
  • Analyze complex technical problems and collaborate on designing solutions for Akuna’s global Infrastructure platform
  • Drive projects and solutions to completion in a fast-paced environment
  • Design, develop and maintain orchestration and configuration solutions
  • Collaborate with developers and other infrastructure engineers to research new products and techniques that drive innovation and improve efficiency and performance in the environment
  • Architect and maintain multi-vendor, tier-based storage solutions
  • Build out a test automation framework for systems performance testing and tuning
  • Create and institute process enforcement across environments
  • Create tools that assist teams to optimize the available infrastructure
  • Develop and maintain comprehensive technical documentation, including system configurations, procedures, and troubleshooting guides
  • Lead knowledge transfer sessions and mentor team members to ensure continuity and operational excellence
What we offer:
  • Discretionary performance bonus
  • Comprehensive benefits package that may encompass employer-paid medical, dental, vision, retirement contributions, paid time off, and other benefits
  • Fulltime

Senior Platform Engineer, Storage

We’re looking for a Senior Platform Engineer specializing in storage services to...
Location:
Ireland, Dublin
Salary:
102000.00 - 124000.00 EUR / Year
dbt Labs
Expiration Date:
Until further notice
Requirements:
  • Significant experience designing and operating relational data and object storage platforms in production
  • Hands-on experience with one or more cloud providers (AWS, Azure, GCP) and declarative Infrastructure as Code (Terraform preferred)
  • Programming/scripting ability in Python, Go, Rust or Bash
  • Excellent communication skills and experience working asynchronously on a fully remote, distributed team
Job Responsibility:
  • Design, operate, and scale storage-based infrastructure systems across multiple tenancy models (single vs. multi-tenant) and public clouds (AWS, Azure, and GCP)
  • Deepen our team’s expertise in one or more areas, including relational databases, search, caching, queuing, and streaming, helping strengthen platform scalability, security, and developer experience
  • Partner with Architecture, Release Engineering, Network, Compute, and Security teams to provide a seamless platform for application teams
  • Leverage tools and languages such as Terraform, Kubernetes, Helm, Argo CD, Python, SQL, Go, Bash, and Datadog
  • Participate in a balanced on-call rotation in an environment that values continuous improvement, helping to improve reliability and reduce operational toil
What we offer:
  • Equity Stake
  • Unlimited PTO
  • Excellent healthcare coverage
  • Paid parental leave
  • Wellness and home office stipends
  • Fulltime

Senior Platform Engineer

Glide is looking for a Senior Platform Engineer to join our Infrastructure team ...
Location:
Not provided
Salary:
Not provided
Glide
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience as a platform engineer/SRE
  • 3+ years experience building and maintaining highly available and scalable distributed data sources
  • Experience with Google Cloud Platform services like Cloud SQL, Cloud Run, AlloyDB, or equivalent
  • Experience orchestrating complex systems with Kubernetes
  • Proficiency in TypeScript development
  • Strong SQL skills; can speak to covering index optimization strategies
  • Experience designing, building and running data-intensive event-driven architectures
  • You are a clear and effective communicator, be it when you write code, write emails, or explain complex technical issues to non-technical co-workers
  • Passionate and self-motivated, with a demonstrated ability to work in a fast-paced and evolving environment
Job Responsibility:
  • Managing our existing infrastructure in GCP
  • Driving our platform evolution as the complexity and sophistication of our product only increases
  • Managing our GitHub/GH Actions-based build pipeline
  • Provide build, test, and runtime infrastructure to service teams
  • Ensure patterns are established (e.g., for database throttling and request rate limiting) to protect Glide’s uptime
  • Monitor infrastructure costs and coordinate improvements when necessary
  • Drive SRE tooling and best practices around observability and alerting
  • Write, review, and maintain code primarily in TypeScript
  • Write architecture briefs and proposals, carry out code experiments, and build prototypes to learn how we can achieve reliable scale with our systems
  • Provide technical leadership, mentorship, pairing opportunities, and code review to encourage the growth of others
What we offer:
  • Competitive salary and benefits package
  • A supportive and dynamic remote work environment
  • Opportunities for career growth
  • Fulltime

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...
Location:
Not provided
Salary:
175000.00 - 225000.00 USD / Year
Zilliz
Expiration Date:
Until further notice
Requirements:
  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
  • Proficiency in scripting languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
  • Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously
Job Responsibility:
  • Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
  • Develop and implement strategies for monitoring, incident management, and disaster recovery
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
  • Collaborate with software engineers to enhance system reliability, scalability, and performance
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency
  • Fulltime

Senior Engineering Manager - Infra Platform

Lead the team that powers every product at Intercom, building the platform that ...
Location:
Ireland, Dublin
Salary:
Not provided
Intercom
Expiration Date:
Until further notice
Requirements:
  • Experience leading an infrastructure or platform team in a cloud environment, ideally AWS, with accountability for availability and costs
  • A track record of independently leading and delivering on complex initiatives, often spanning multiple projects, teams, and quarters
  • Expert grasp of modern web and distributed systems
  • Clear, timely communication
  • Comfort with data
  • Ability to learn quickly, iterate, unblock yourself and your team, and persist until the right problem is solved
Job Responsibility:
  • Lead and inspire a high-performing, autonomous team of engineers working on scaling and optimizing Intercom’s core infrastructure and supporting systems
  • Navigate complex scaling and infrastructure challenges, guiding the team through strategic decisions on how we build and scale Intercom’s core components
  • Collaborate closely with domain experts and other parts of R&D
  • Drive strategic initiatives and execute high-impact projects, ensuring they are delivered reliably and efficiently
  • Use the best tools in the industry
  • Foster a culture of agility and simplicity, prioritising scalability and uptime while supporting continuous deployment and incremental improvements
  • Mentor and coach engineers and future leaders, playing an active role in their hiring, onboarding, and career development
  • Adapt and thrive across various functions to ensure objectives are met
What we offer:
  • Competitive salary and equity in a fast-growing start-up
  • We serve lunch every weekday, plus a variety of snack foods and a fully stocked kitchen
  • Regular compensation reviews
  • Pension scheme & match up to 4%
  • Life assurance
  • Comprehensive health and dental insurance for you and your dependents
  • Flexible paid time off policy
  • Paid maternity leave
  • 6 weeks paternity leave for fathers
  • Cycle-to-Work Scheme
  • Fulltime

Senior Distributed Systems Engineer - Platform Engineering

For our Platform Engineering team, we are looking for programmers with strong in...
Location:
Poland
Salary:
Not provided
RTB House
Expiration Date:
Until further notice
Requirements:
  • Excellent understanding of how complex IT systems work - from the hardware level, through software, to algorithms
  • Ability to proactively define requirements, ask appropriate questions and draw conclusions that will combine technical constraints and business needs
  • Ability to lead the design and implementation of a solution
  • Experience in leading project teams
  • Willingness to be involved in topics that go beyond programming and design, such as responsibility for technical areas or communication with other teams
  • Proactive attitude, independence in taking action
  • Extensive experience in programming and readiness to implement key system elements as well as involvement in code reviews
  • Good knowledge of methods of creating concurrent programs and distributed systems
  • Ability to critically analyze created solutions in terms of performance (from estimating the theoretical performance of designed systems to detecting and removing actual performance problems in production)
  • C1 level in English and Polish
Job Responsibility:
  • Plan and then lead, hands-on, further development within a given technical area such as deployment, monitoring, databases, or load balancing, in the context of existing infrastructure within RTB House
  • Coordinate the work of a project team of 3-4 people, also making arrangements with other teams and units within RTB House
  • Ensure the reliability and scalability of the solutions built
What we offer:
  • Attractive compensation
  • Work in a team of enthusiasts who are willing to share their knowledge and experience
  • Flexible cooperation conditions - we do not have core hours, we do not have holiday limits
  • Access to the latest technologies and the possibility of real use of them in a large-scale and highly dynamic project

Senior Director of Engineering, Infrastructure

Senior Director of Engineering role leading the Infrastructure group at PagerDut...
Location:
United States, San Francisco
Salary:
233000.00 - 392000.00 USD / Year
PagerDuty
Expiration Date:
Until further notice
Requirements:
  • Proven experience in senior engineering leadership roles, managing multiple layers of managers
  • Significant experience as a hands-on technical contributor earlier in your career
  • Deep knowledge of modern infrastructure and software delivery: high availability, distributed systems, public cloud (AWS), microservices, containers, CI/CD pipelines, observability, and automation
  • Track record of building and scaling high-performing, inclusive engineering organizations
Job Responsibility:
  • Define and drive the multi-year strategy for PagerDuty's infrastructure and platform foundations
  • Strong ownership of PagerDuty's reliability patterns and practices
  • Bar raiser for all engineering functions
  • Lead, mentor, and scale a diverse team of Engineering Managers, Senior Managers, and technical leaders across multiple geographies
  • Ensure the reliability, scalability, and security of PagerDuty's global SaaS platform
  • Partner with peers in Engineering, Product, and Security to deliver large cross-functional initiatives
  • Champion engineering excellence: CI/CD maturity, observability best practices, operational rigor, and incident readiness
  • Manage budgets, headcount, and vendor relationships to optimize infrastructure investments
  • Represent Infrastructure externally with customers and partners, and internally with executives, as a trusted voice on technical and business tradeoffs
  • Foster a culture of inclusion, accountability, collaboration, and growth
What we offer:
  • Competitive salary
  • Comprehensive benefits package
  • Flexible work arrangements
  • Company equity
  • ESPP (Employee Stock Purchase Program)
  • Retirement or pension plan
  • Generous paid vacation time
  • Paid holidays and sick leave
  • Dutonian Wellness Days & HibernationDuty - companywide paid days off in addition to PTO
  • Paid parental leave: 22 weeks for pregnant parent, 12 weeks for non-pregnant parent
  • Fulltime