Senior Site Reliability Engineer - Networking

Lambda

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

227000.00 - 401000.00 USD / Year

Job Description:

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems, as well as internal tooling for system deployment, management, and maintenance.

Job Responsibility:

  • Help scale Lambda’s high-performance, multi-tenant cloud network
  • Contribute to the reproducible automation of network configuration and deployments
  • Contribute to the implementation and operations of Software Defined Networks
  • Help to deploy and manage Spine and Leaf networks
  • Ensure high availability of our network through observability, failover, and redundancy
  • Ensure clients have predictable networking performance through the use of network engineering and other applicable technologies
  • Help with deploying and maintaining network monitoring and management tools
  • Participate in on-call

Requirements:

  • 5+ years of experience as a Site Reliability Engineer or Network Reliability Engineer
  • Have been part of the implementation of production-scale networking projects
  • Experience with on-call duty and incident response management
  • Have experience building and maintaining Software Defined Networks (SDN), and experience with OpenStack, Neutron, and OVN
  • Are comfortable on the Linux command line and understand the Linux networking stack
  • Have experience with multi-data center networks and hybrid cloud networks
  • Have Python programming experience and experience with configuration management tools like Ansible
  • Have experience with Git and CI/CD tools for deployment, and have operated a network environment with GitOps practices in place
  • Experience with application lifecycle and deployments on Kubernetes

Nice to have:

  • Operated production-scale SDNs in a cloud context (e.g. helped implement or operate the infrastructure that powers an AWS VPC-like feature)
  • Have software development experience with C, Go, Python
  • Experience automating network configuration within public clouds, with tools like Kubernetes, Helm, Terraform, and Ansible
  • Deep understanding of the Linux networking stack and its interaction with network virtualization, SR-IOV and DPDK
  • Understanding of the SDN ecosystem (e.g. OVS, Neutron, VMware NSX, Cisco ACI or Nexus Fabric Controller, Arista CVP)
  • Have experience with Spine and Leaf (Clos) network topology
  • Have experience with and understanding of BGP EVPN VXLAN networks
  • Experience with building and maintaining multi-data center networks, SD-WAN, DWDM
  • Experience with Next-Generation Firewalls (NGFW)

What we offer:

  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time

Work Type:
Hybrid work

Similar Jobs for Senior Site Reliability Engineer - Networking

Senior Site Reliability Engineer

Baxter International is seeking a skilled Senior Principal Site Reliability Engi...
Location:
United States, Deerfield
Salary:
96000.00 - 132000.00 USD / Year
Baxter
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
Job Responsibility
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain a high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes
What we offer
  • Healthcare benefits
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Flexible Spending Accounts
  • Educational assistance programs
  • Paid holidays
  • Paid time off
  • Paid parental leave
  • Commuting benefits
  • Employee Discount Program
Employment Type:
Full-time

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...
Location:
United States, Deerfield
Salary:
96000.00 - 132000.00 USD / Year
Baxter
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
  • Applicants must be authorized to work for any employer in the U.S.
  • Unable to sponsor or take over sponsorship of an employment visa at this time.
Job Responsibility
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain a high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes.
What we offer
  • Support for Parents
  • Continuing Education/Professional Development
  • Employee Health & Well-Being Benefits
  • Paid Time Off
  • 2 Days a Year to Volunteer
  • Medical and dental coverage starting day one
  • Insurance coverage for basic life, accident, short-term and long-term disability
  • Business travel accident insurance
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
Employment Type:
Full-time

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...
Location:
United States, San Francisco
Salary:
180960.00 - 230900.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
  • Four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large-scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash
  • Networking technologies such as TCP/IP or security
  • Four years of experience in automation development and infrastructure as code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
  • Knowledge of Linux and Windows systems
  • Cloud technologies within AWS, GCP, Azure
  • Continuous integration and continuous delivery/deployment (CI/CD) practices and monitoring and observability practices
  • Must pass technical interview
Job Responsibility
  • Architect, develop, and troubleshoot large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
  • Provide real-time feedback on production systems
  • Work with product family and platform developers to maintain and improve services and performance with a strong customer focus
  • Utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
  • Responsible for automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
  • Build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud as well as automate repetitive work
  • Help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
  • Responsible for continuous integration and continuous delivery/deployment (CI/CD) practices and monitoring and observability practices
What we offer
  • Health and wellbeing resources
  • Paid volunteer days
Employment Type:
Full-time

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...
Location:
India, Chennai
Salary:
Not provided
Arcadia
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
  • Strong hands-on experience with: Terraform & Infrastructure as Code
  • AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty)
  • Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD
  • Kubernetes troubleshooting and operations
  • Prometheus/Grafana/Datadog observability stacks
  • Proven ability to operate in high-scale, high-uptime, multi-environment production systems
  • Experience building automation via Python/Bash and reducing operational toil
  • Strong understanding of incident management, root cause analysis, and reliability engineering principles
Job Responsibility
  • Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
  • Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch—ensuring actionable alerting, dashboards, and metrics alignment with SLO/SLIs
  • Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
  • Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
  • Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
  • Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
  • Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
What we offer
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
Employment Type:
Full-time

Senior Site Reliability Engineer

What will you be doing at Miniclip? Participate in an on-call rotation with the ...
Location:
Portugal, Lisbon
Salary:
Not provided
Miniclip
Expiration Date
Until further notice
Requirements
  • 5+ years of hands-on experience with AWS in both development and operations contexts
  • Strong Linux system administration skills, including performance tuning and debugging
  • Software development background and strong coding skills in one or more of the following: Go, Python, Ruby
  • Experience with Infrastructure as Code, particularly Terraform
  • Familiarity with CI/CD pipelines and artifact management tools
  • A mindset for resilient systems design, thinking about edge cases, failure modes, and graceful degradation
  • Excellent communication skills in English, both written and spoken
  • Comfortable in a fast-paced environment and adaptable to shifting priorities
Job Responsibility
  • Participate in an on-call rotation with the Cloud Engineering team to respond to production incidents and outages
  • Operate and evolve infrastructure using Infrastructure as Code (Terraform), configuration management tools, and containerized platforms on AWS
  • Build and maintain observability tooling to detect symptoms before they lead to outages
  • Automate repetitive tasks and processes to reduce operational toil
  • Collaborate with Engineering and Product teams to design resilient systems that meet performance and reliability goals
  • Troubleshoot production issues across application, network, and infrastructure layers
  • Document systems, processes, and runbooks to improve team transparency and onboarding

Junior Site Reliability Engineer

As a Jr. Site Reliability Engineer, you will 'make things scale' which includes ...
Location:
United Kingdom
Salary:
Not provided
accesso
Expiration Date
Until further notice
Requirements
  • Some practical exposure to cloud platforms (AWS/Azure/GCP)—coursework, internships, or self-led projects
  • Ability to self-learn with assistance from Senior Engineers
  • Basic scripting ability using Python or Bash
  • Familiarity with basic Linux systems and the general command line
  • Understanding of Git and basic CI/CD concepts
  • Good written and verbal communication
  • Customer-focused approach
  • Ability to work with minimal direction
  • Willingness to learn, take direction and work within a team
Job Responsibility
  • Assisting with provisioning and deploying accesso Horizon components to customer cloud accounts using Infrastructure as Code (Terraform)
  • Help maintain CI/CD pipelines (GitHub Actions) for application and infrastructure deployments
  • Support monitoring, logging and alerting (Prometheus, Grafana & Coralogix) and respond to basic alerts with supervision
  • Implement and improve basic automation and scripting
  • Participate in incident triage, root cause investigation and follow-up tasks
  • Follow security and compliance requirements for customer cloud environments (identity, secrets, network controls)
  • Produce and maintain operational runbooks, deployment guides and change notes
  • Participate in on-call rotation as a L1 responder
  • The role may require work outside the normal working day
  • Learn and apply accesso Horizon product architecture and configuration
What we offer
  • Competitive compensation package including an annual bonus opportunity
  • 8 days of paid bank holiday leave and 26 days of paid annual leave (paid leave increases with tenure)
  • 8 hours of paid Volunteer Time Off
  • Inclusive Family Benefits, including a $7,500 benefit for surrogacy, adoption, and fertility
  • Robust health insurance scheme with the opportunity to participate in private medical scheme after satisfactory performance
  • Matching pension scheme (up to 8%)
  • Unlimited access to Udemy for Business
  • Flexible work schedule
Employment Type:
Full-time

Senior Site Reliability/DevOps Engineer

AutoRABIT is looking for a Senior Site Reliability/DevSecOps Engineer to help de...
Location:
United States
Salary:
175000.00 - 200000.00 USD / Year
AutoRABIT
Expiration Date
Until further notice
Requirements
  • Design, implement, and maintain scalable, resilient, and secure infrastructure using AWS
  • Develop and manage infrastructure as code using Terraform
  • Implement and manage CI/CD pipelines to automate deployments and ensure smooth delivery of applications
  • Monitor system performance, identify bottlenecks, and implement solutions to improve reliability and performance
  • Troubleshoot, resolve, and perform RCAs for incidents, while ensuring minimal disruption to services
  • Collaborate with development teams to ensure applications are designed for reliability and performance
  • Working experience with shell scripting (Bash), Python, or equivalent is required
  • Good knowledge of programming languages such as Python, Go, or Java
  • Working experience with configuration management tools such as Ansible or Chef
  • Implement and maintain monitoring, logging, and alerting systems to ensure the health and performance of our infrastructure
Job Responsibility
  • Contribute to the development and maintenance of frameworks for monitoring, automation and code to increase the scalability and reliability of the service
  • Assist both internal and customer-facing teams with deployment of new software releases, VPN and other related security infrastructure interfacing
  • Assist with resolution of AutoRABIT service or customer issues as required
  • Participate in and practice sustainable incident response and blameless postmortems
  • Contribute to the automation of manual tasks, such as the provisioning of users in production and test environments
  • Help develop peers’ capabilities through knowledge sharing, mentoring, and collaboration
  • Work within a small agile team to develop and improve SRE software, support your peers, plan and self-improve
  • Participate in a regular on-call or rotational schedule needed to support AutoRABIT servers, including weekends and holidays
Employment Type:
Full-time

Senior Engineer (Power Engineer)

Brightspeed is seeking a Senior Engineer (Power Engineer) to design, implement, ...
Location:
United States, Charlotte
Salary:
Not provided
Brightspeed
Expiration Date
Until further notice
Requirements
  • Ideal candidate located in the North region (WI, MI, IL, IN, OH, MO)
  • Associate degree or equivalent education in Electrical Engineering, Power Systems Engineering, or related field, or equivalent work experience
  • 5+ years of experience designing DC power systems in telecommunications, engineering, network planning or equivalent
  • Strong knowledge of -48V DC systems, rectifiers, batteries, grounding, and bonding
  • Proficiency with one-line diagrams, load calculations, and design packages
  • Familiarity with Telcordia, NEC, OSHA, and industry best practices
  • Knowledgeable and skilled in planning/engineering and reading technical construction drawings/documents
  • Understanding of transport and router equipment/technologies and protocols, along with layer 2 and layer 3 ethernet and transport networking
  • Product knowledge of Cisco and Nokia routers, Ciena and Fujitsu transport devices, Calix and Adtran access devices, localized power such as fuse panels and BDFB power bays, fiber management, and fiber optic modules
  • Knowledge of and some experience with systems/tools such as SAP, IQGeo, CO Power Database, WMS, Service Now, Armor, One Control for Ciena 6500, Netsmart 1500 for Fujitsu, TNMS for Infinera, Smartsheet and Microsoft Office
Job Responsibility
  • Designing and reviewing DC power systems for Central Office and Core locations, including rectifiers, batteries, BDFBs, PBDs, distribution panels, inverters, automatic and manual generator transfer switches, grounding, and fused and unfused cabling
  • Preparing and maintaining detailed engineering documentation—drawings, load calculations, and material specifications
  • Evaluating and standardizing power equipment and configurations for reliability and compliance
  • Collaborating with transport, OLT, and implementation teams to align project schedules
  • Supporting installation and commissioning of power systems, including site reviews, troubleshooting, and test verification
  • Ensuring adherence to Brightspeed standards, Telcordia GR-513, NEC, and applicable safety and compliance codes
  • Tracking power capacity, utilization, and redundancy to support network growth
  • Conducting preventive maintenance programs and power audits
  • Providing input to RFPs, vendor evaluations, and technology pilots for DC power modernization efforts
  • Conducting load analysis, redundancy planning, and capacity forecasting to meet future network demands as required to ensure sufficient backup power
What we offer
  • Competitive medical, dental, vision, and life insurance
  • An employee assistance program
  • A 401K plan with company match and a host of voluntary benefits
Employment Type:
Full-time