CrawlJobs Logo

Cloud Platform Engineer (Site Reliability)

amentum.com Logo

Amentum

Location Icon

Location:
United States , Houston

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

We have an exciting opportunity for a Cloud Platform Engineer (Site Reliability) to join our team Nexus, a teammate company. As a Cloud Platform Engineer (Site Reliability), you will be part of a team working with NASA at Johnson Space Center (JSC) on deep space exploration projects like Orion, Lunar Gateway, and Artemis.

Job Responsibility:

  • Developing new cloud-native platform services spanning all three major cloud environments
  • Developing best practices for cloud-native application development and promoting them within the organization
  • Administering NASA cloud networks and managing requests for deployment of COTS and Cloud Native applications into cloud environments
  • Writing quality code, providing quality and engaged code reviews for peers
  • Working with Managed Kubernetes offering across all three major cloud providers
  • Integrating cloud managed AI and data services with other bespoke and open-source Kubernetes applications
  • Developing best practices for cloud-native application development and promoting them within the organization
  • Identifying opportunities to abstract Prospective Project requirements and develop Enterprise-grade, multi-tenant Platform Services
  • Collaborate with NASA security and compliance teams to ensure teams are adhering to industry best practices and regulatory requirements
  • Working directly with NASA human spaceflight missions like Orion, Lunar Gateway, Artemis
  • Perform other duties as required

Requirements:

  • Typically requires a bachelor’s degree or equivalent certification in a related area and normally possess 10 years of experience in the field or in a related area
  • Strong experience with Kubernetes in production
  • Ability to manage and use GitLab (preferably very proficient)
  • Hands-on experience with CI/CD pipeline tools
  • Observability Monitoring tools such as Grafana and SuperSet
  • Proficiency with Infrastructure-as-Code utilizing Terraform for infrastructure automation and/or open source alternatives (OpenTofu)
  • Extensive Linux experience (familiarity with Windows also preferred, but not required)
  • Expert in at least one programming language (Go and Python is preferred)
  • Experience with Python, SQL (and R is preferable)
  • Working understanding of Machine Learning Model Lifecycle management (is preferred)
  • Proof of U.S. Citizenship or US Permanent Residency may be a requirement for this position
  • Must be able to complete a U.S. government background investigation

Nice to have:

  • Solution-oriented mindset, accepts feedback with enthusiasm
  • Effectively communicates with team members: clear status updates to direct report, regularly establishes unanimous decisions with peers
  • Self-motivated and self-managing, with strong organizational skills
  • Experience architecting and building cloud-native applications
  • Demonstrable success and confidence in owning the full project lifecycle
  • Good instincts for identifying opportunities for abstraction
  • Balances individual project development tasks at hand with overall platform and company goals
What we offer:
  • Excellent personal and professional career growth
  • 9/80 work schedule (every other Friday off), when applicable
  • Onsite cafeteria (breakfast & lunch)
  • Health, dental, and vision insurance
  • Paid time off and holidays
  • Retirement benefits (including 401(k) matching)
  • Educational reimbursement
  • Parental leave
  • Employee stock purchase plan
  • Tax-saving options
  • Disability and life insurance
  • Pet insurance

Additional Information:

Job Posted:
March 24, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Cloud Platform Engineer (Site Reliability)

Senior Site Reliability Engineer

Baxter International is seeking a skilled Senior Principal Site Reliability Engi...
Location
Location
United States , Deerfield
Salary
Salary:
96000.00 - 132000.00 USD / Year
https://www.baxter.com/ Logo
Baxter
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
Job Responsibility
Job Responsibility
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes
What we offer
What we offer
  • Healthcare benefits
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Flexible Spending Accounts
  • Educational assistance programs
  • Paid holidays
  • Paid time off
  • Paid parental leave
  • Commuting benefits
  • Employee Discount Program
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...
Location
Location
United States , Deerfield
Salary
Salary:
96000.00 - 132000.00 USD / Year
https://www.baxter.com/ Logo
Baxter
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
  • Applicants must be authorized to work for any employer in the U.S.
  • Unable to sponsor or take over sponsorship of an employment visa at this time.
Job Responsibility
Job Responsibility
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes.
What we offer
What we offer
  • Support for Parents
  • Continuing Education/Professional Development
  • Employee Health & Well-Being Benefits
  • Paid Time Off
  • 2 Days a Year to Volunteer
  • Medical and dental coverage starting day one
  • Insurance coverage for basic life, accident, short-term and long-term disability
  • Business travel accident insurance
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Fulltime
Read More
Arrow Right

Principal Site Reliability Engineer

We are looking for a reliability expert who is passionate about scaling Cloud se...
Location
Location
United States , San Francisco; Mountain View
Salary
Salary:
170800.00 - 274300.00 USD / Year
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Expert-level proficiency with 8+ years experience in at least Java
  • Expert-level proficiency with 5+ years experience in public cloud offerings (AWS components like EC2, CloudFormation, RDS / Aurora, Caches, SQS - or equivalents, e.g. in GCP / Azure)
  • Expert-level proficiency with 5+ years experience in operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring into your code, tweaking dashboards, defining alerts, writing runbooks, etc.
  • Experience in driving large, complex, cross-organizational initiatives from inception to completion
  • Excellent communication skills in written and verbal forms, and an ability to communicate complex technical issues to a range of technical and non-technical audiences (management, peers, clients)
  • Experience in leadership positions, able to influence others and drive impactful outcomes through delegation
  • An ability and desire to mentor and coach engineers
Job Responsibility
Job Responsibility
  • Advocate for reliability methodologies
  • Work with a variety of platform, product and SRE teams to both build reliability into our platform and drive adoption of those practices into our products
  • Analyze and help improve our services and processes to get us to an even higher level of reliability, performance, scalability, and cost efficiency
What we offer
What we offer
  • Health and wellbeing resources
  • Paid volunteer days
  • Equity
  • Bonuses
  • Commissions
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...
Location
Location
United States , San Francisco
Salary
Salary:
180960.00 - 230900.00 USD / Year
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
  • four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash
  • networking technologies such as TCP/IP or security
  • four years of experience in automation development and infrastructure as code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
  • knowledge of Linux and Windows systems
  • cloud technologies within AWS, GCP, Azure
  • continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices
  • must pass technical interview
Job Responsibility
Job Responsibility
  • Architect, develop, and troubleshoot large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
  • provide real-time feedback on production systems
  • work with product family and platform developers to maintain and improve services and performance with a strong customer focus
  • utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
  • responsible for automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
  • build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud as well as automate repetitive work
  • help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
  • responsible for continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices
What we offer
What we offer
  • Health and wellbeing resources
  • paid volunteer days
  • Fulltime
Read More
Arrow Right

Staff Site Reliability Engineer

At Ledger, we are looking for an experienced Reliability Engineer to join our SR...
Location
Location
France , Paris
Salary
Salary:
Not provided
https://www.ledger.com Logo
Ledger
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years on cloud engineering at scale, on organizations operating SaaS solutions
  • Proficiency in working in Unix/Linux environments, Git, Python, Terraform, Kubernetes, AWS cloud solutions and architectures, CI/CD tools, Argocd, Ansible, configuration management, etc.
  • Strong knowledge on observability practices, with experience implementing and managing Logging, Monitoring and Alerting framework with solutions such as Datadog or Prometheus/Grafana/Loki.
  • Experience of cross-functional work and the ability to demonstrate a collaborative approach with regards to building key relationships across the organization and define projects scope, goals, plan and deliverables
  • Customer focused with the ability to identify and understand both internal and external customer's needs
  • Creative problem-solving and analysis skills with an ability to identify, develop, and implement solutions to meet the needs of the business
  • Excellent presentation and written communication
  • Ability to deal with ambiguity, high level of pressure and rapidly changing environments
  • Engineering degree.
Job Responsibility
Job Responsibility
  • Participate in building a DevOps / SRE culture and enable the transition to modern infrastructure management and deployment practices
  • Participate in building the SRE team roadmap (vision and delivery accountability). Anticipate stakeholder needs, game-changing technologies emergence and challenge scope / deadlines
  • Perform integration of platform software components
  • Participate to design and deliver solutions to improve the availability, scalability, latency, and efficiency of systems
  • Influence and create standards & best practices in support of service level objectives
  • Automate key SRE metrics including SLOs/SLAs and error budgets
  • Provide expert support to our level-2/application support team, to troubleshoot priority incidents, and conduct post-mortems
  • Apply analytics on past incidents and usage patterns to predict issues and take proactive actions
  • Ensure control of technical debt and promote quality practices
  • Follow SRE and chaos engineering approaches across all strategic systems to predict in coordination with Service Design and prevent outages and improve solution availability
What we offer
What we offer
  • Equity: Employees are the foundation of our success, and we award stock options so you can share in that success as we grow
  • Flexibility: A hybrid work policy
  • Social: Annual company outing for Ledgerdary Days, plus frequent social events, snacks and drinks
  • Medical: Comprehensive health insurance policy offering extensive medical, dental and vision care coverage
  • Well-being: Personal development, coaching & fitness with our dedicated partners
  • Vacation: Five weeks of paid leave per year, in addition to national holidays and rest & relaxation (RTT) days
  • High tech: Access to high performance office equipment and gadgets, including Apple products
  • Transport: Ledger reimburses part of your preferred means of transportation
  • Discounts: Employee discount on all our products.
  • Fulltime
Read More
Arrow Right

Principal Network Operations Site Reliability Systems Engineer

This role entails incorporating Site Reliability Engineering (SRE) concepts into...
Location
Location
United States
Salary
Salary:
115500.00 - 266000.00 USD / Year
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s or master’s degree in computer science, Computer Engineering, Information Systems, or equivalent
  • Typically, 10+ years’ experience
  • Experience with cloud platforms
  • Experience with software development languages for console and web-based applications
  • Experience in User Interface (UI/UX) design
  • Understanding of and experience with common network infrastructure devices such as switches, routers, access points, authentication, authorization, and accounting systems and protocols, and network management utilities
  • Experience with network monitoring protocols
  • Ability to design and implement relational database solutions, time-series databases, and NoSQL database solutions
  • Excellent analytical and problem-solving skills
  • Experience in the overall architecture of software systems for products and solutions
Job Responsibility
Job Responsibility
  • Develop strategies and implement plans to incorporate SRE concepts into network, tool, and process designs and leads execution of those strategies and plans
  • Evaluates LAN, WLAN, SD-WAN, AAA, Private 5G, and other network designs for fit-for-use criteria, and designs prototype analysis tools to facilitate rapid iteration of network delivery service enhancements
  • Identifies and engineers new ways to leverage data from multiple platforms to identify network performance trends and detect anomalies
  • Prototypes machine learning anomaly detection, event signature identification, and trend identification
  • Automates common incident management and problem management procedures
  • Develops organization-wide architectures, methodologies, and prototypes for software systems design and development across multiple platforms and organizations within the Global Business Unit
  • Identifies and evaluates new technologies and innovations for alignment with technology roadmap and business value
  • creates plans for prototyping and prototype iteration
  • Reviews and evaluates designs and project activities for compliance with development guidelines and standards
  • provides tangible feedback to improve product quality and mitigate failure risk
What we offer
What we offer
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Career development programs
  • Inclusive environment celebrating individual uniqueness
  • Fulltime
Read More
Arrow Right

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility
Job Responsibility
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improve operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer
What we offer
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
  • Fulltime
Read More
Arrow Right

Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team at Citi and focuses on building to...
Location
Location
Singapore , Singapore
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree or equivalent work experience
  • 6+ years of relevant work experience
  • Highly motivated self-starter with excellent interpersonal and communication skills
  • Certification or formal training in site reliability engineering concepts and practices
  • Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
  • 4+ years experience in Python or Java on large scale systems alongside Linux based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience of k8s and container technologies such as Docker, Openshift and EKS
  • Experience with public cloud technologies such as AWS, GCP or Azure
  • Experience with Secrets products such as HashiCorp Vault or CyberArk
Job Responsibility
Job Responsibility
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
  • Actively owning production level incidents till resolution.
What we offer
What we offer
  • Global benefits designed to support well-being, growth, and work-life balance.
  • Fulltime
Read More
Arrow Right