Senior Cloud Engineer – Observability & Performance Engineering

United States, Washington · Job Posted July 03, 2026

Job Description

We are seeking a highly experienced Cloud Engineer (Observability) to lead the engineering, optimization, and operational maturity of enterprise observability platforms across hybrid cloud and containerized environments. This role is ideal for a hands-on engineer with deep expertise in Datadog, distributed tracing, APM, cloud monitoring, performance engineering, and site reliability practices. The successful candidate will partner with infrastructure, cloud, platform, and application teams to improve operational visibility, reduce alert fatigue, accelerate incident resolution, and drive data-informed operational decisions.

Job Responsibility

  • Observability Platform Engineering
  • Cloud & Container Monitoring
  • Performance Engineering & Reliability
  • Capacity Planning & Operational Excellence

Requirements

  • Bachelor's degree in Information Technology, Computer Science, Engineering, or a related field
  • 8+ years of experience in infrastructure, platform, cloud, or operations engineering
  • 5+ years of experience focused on: Observability, Site Reliability Engineering (SRE), Performance Engineering, Application Performance Monitoring (APM)
  • Experience administering and optimizing observability platforms such as: Datadog, Dynatrace, New Relic, Splunk Observability, Grafana/Prometheus
  • Strong experience with: OpenTelemetry, Distributed tracing, Performance tuning, APM engineering, Cloud-native monitoring
  • Experience supporting Azure, AWS, and containerized platforms
  • Proven ability to troubleshoot complex performance and reliability issues
  • Ability to obtain and maintain Public Trust clearance

Nice to have

  • Experience supporting federal or regulated environments
  • Experience with: Kubernetes, OpenShift, Terraform, ARM, Bicep
  • Strong understanding of: SLO/SLI engineering, Incident management, Capacity planning, Operational analytics
  • Experience integrating observability platforms with ServiceNow and CI/CD tooling

What we offer

  • medical
  • vision
  • dental
  • life and disability insurance
  • 401(k) plan

Similar Jobs for Senior Cloud Engineer – Observability & Performance Engineering

8 matching positions

Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

We are looking for a highly skilled engineer with deep expertise in building and...
Location: United States, San Francisco
Salary: 166000.00 - 201000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices
Job Responsibility
  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure
What we offer
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment Type: Full-time

Senior Cloud Engineer – AV Cloud Engineering

We are hiring a Senior Cloud Engineer to join the AV Cloud Engineering team with...
Location: United States, Austin, Texas; Sunnyvale, California
Salary: 170000.00 - 230000.00 USD / Year
Company: General Motors
Expiration Date: Until further notice
Requirements
  • Fully qualified proficiency in Kubernetes (GKE) and the Google Cloud Platform (GCP) ecosystem, including VPC, IAM, etc.
  • Hands-on experience implementing high-availability systems and managing the lifecycle of production-grade clusters
  • Strong proficiency in software engineering and DevOps principles, specifically using Golang/Python and Terraform
  • Ability to operate independently in an ambiguous environment and translate high-level requirements into clear technical tasks
  • A "Growth-based Mindset" with a commitment to continuous upskilling and a belief that team capacities are developed through effort and coaching
  • Professional experience managing the trade-offs between hardware-level performance (GPU passthrough) and clean cloud abstractions
Job Responsibility
  • Architectural Execution: Implement and manage the lifecycle of Kubernetes (GKE) clusters across hybrid and multi-cloud environments, ensuring production safety through automated patching and upgrades
  • Platform Implementation: Develop and maintain self-service PaaS features that abstract infrastructure complexity, providing reliable and performant access to specialized hardware like GPUs
  • Connectivity & Traffic: Implement and optimize high-throughput ingress patterns and service mesh (Istio) configurations to support distributed AV data and ML workloads
  • Project Independence: Take ownership of complex cloud initiatives from design through deployment, identifying technical gaps and proactively implementing robust solutions
  • Operational Excellence: Eliminate "human duct tape" by replacing manual cloud-management tasks with declarative state-enforcement (Terraform/GitOps) and custom automation
  • Reliability & Observability: Define and monitor SLIs/SLOs for cloud services, ensuring the platform meets the availability targets required for Super Cruise validation
  • Peer Mentorship: Proactively share technical lessons learned and participate in rigorous code and architecture reviews to support a healthy, high-trust engineering culture
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
Employment Type: Full-time

Senior Software Engineer - Cloud Infrastructure & Observability

Location: India, Bengaluru
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • 15+ years in software engineering with a track record of architecting distributed systems or platforms at scale
  • Strong hands‑on experience in Golang and one scripting language (e.g., Python or Shell)
  • Experience operating observability at PB-scale ingestion and hundreds of millions of series
  • Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
  • Deep experience building systems of scale and operating cloud infrastructure with Kubernetes
  • Strong proficiency with service mesh technologies (Istio/Envoy), infrastructure‑as‑code (Terraform), and experience in multi‑cloud (AWS, GCP)
  • Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
  • Proven experience integrating security as part of infrastructure and platform development
  • Exceptional cross‑functional communication
  • Effective collaboration with both technical and non‑technical stakeholders
Job Responsibility
  • Architect and lead Roku’s observability platform across metrics, logs, and traces
  • Evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
  • Extend and harden open‑source observability systems
  • Overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
  • Implement features such as pre‑aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
  • Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform
  • Augment and automate CI/CD flows and onboarding
  • Integrate security into infrastructure and platform services
  • Ensure robust multi‑tenant, multi‑cluster, and multi‑cloud designs
  • Contribute improvements back to open source and CNCF‑aligned projects
What we offer
  • Global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • time off in accordance with local leave policies
Employment Type: Full-time

Senior Software Engineer - Cloud Infrastructure & Observability

We are building a next-generation observability and cloud platform that is high-...
Location: United Kingdom, Cambridge
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • Extensive experience in software engineering with a track record of architecting distributed systems or platforms at scale
  • Strong hands-on experience in Golang and one scripting language (e.g., Python or Shell)
  • Experience operating observability at PB-scale ingestion and hundreds of millions of series
  • Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
  • Deep experience building systems of scale and operating cloud infrastructure with Kubernetes
  • Strong proficiency with service mesh technologies (Istio/Envoy), infrastructure-as-code (Terraform), and experience in multi-cloud (AWS, GCP)
  • Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
  • Proven experience integrating security as part of infrastructure and platform development
  • Exceptional cross-functional communication
  • Effective collaboration with both technical and non-technical stakeholders
Job Responsibility
  • Architect and lead Roku’s observability platform across metrics, logs, and traces
  • Evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
  • Extend and harden open-source observability systems
  • Overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
  • Implement features such as pre-aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
  • Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform
  • Augment and automate CI/CD flows and onboarding
  • Integrate security into infrastructure and platform services
  • Ensure robust multi-tenant, multi-cluster, and multi-cloud designs
  • Contribute improvements back to open source and CNCF-aligned projects
What we offer
  • Global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • time off work for vacation and other personal reasons
Employment Type: Full-time

Senior Platform Engineer – AV Cloud Engineering

We are hiring a Senior Platform Engineer to join the Autonomous Vehicle (AV) Clo...
Location: United States, Austin; Sunnyvale
Salary: 170000.00 - 230000.00 USD / Year
Company: General Motors
Expiration Date: Until further notice
Requirements
  • Fully qualified proficiency in Kubernetes (GKE) and the Google Cloud Platform (GCP) ecosystem, including VPC, IAM, etc.
  • Hands-on experience implementing high-availability systems and managing the lifecycle of production-grade clusters
  • Strong proficiency in software engineering and DevOps principles, specifically using Golang/Python and Terraform
  • Ability to operate independently in an ambiguous environment and translate high-level requirements into clear technical tasks
  • A 'Growth-based Mindset' with a commitment to continuous upskilling and a belief that team capacities are developed through effort and coaching
  • Professional experience managing the trade-offs between hardware-level performance (GPU passthrough) and clean cloud abstractions
  • 5+ years of experience or proven record of defining and executing technical strategy that required coordination across multiple teams, senior executives, and front-line engineers
  • Bachelor's degree in Computer Science or related field, or equivalent work experience
  • Hands-on experience with Kubernetes in production and strong familiarity with at least one major cloud ecosystem such as GCP, AWS, or Azure
  • Strong software engineering skills in Go, Python, or similar languages, with the ability to build reusable automation, services, APIs, or controllers
Job Responsibility
  • Build and evolve internal platform capabilities, self-service workflows, APIs, and automation that make the right path the easiest path for AV engineering teams
  • Design clean abstractions that allow product and research teams to consume complex infrastructure, including specialized hardware such as GPUs, without deep infrastructure expertise
  • Improve platform primitives across traffic, service mesh, runtime configuration, and connectivity for distributed AV workloads
  • Reduce toil and improve reliability through reusable platform tooling, declarative automation, strong service ownership, observability, and SLIs/SLOs
  • Take ownership of complex platform initiatives from problem definition through design, implementation, and adoption
  • Partner with adjacent AV infrastructure teams to reduce developer friction and improve the reliability of the platform AV engineers depend on
  • Contribute to a healthy engineering culture through design reviews, code reviews, mentorship, and clear technical communication
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
Employment Type: Full-time

Senior Cloud Engineer

The Senior Cloud Engineer designs, builds, and optimizes cloud‑native applicatio...
Location: United States, Wilton
Salary: Not provided
Company: ASML
Expiration Date: Until further notice
Requirements
  • 5+ years supporting large-scale software or cloud environments
  • Expertise with Google Cloud Platform (Compute, IAM, Networking, GCS, Cloud Build, GKE, AI Platform)
  • Experience with Azure and hybrid cloud architectures
  • Strong background in Linux (RHEL/CentOS)
  • Networking (TCP/IP, UDP)
  • CI/CD pipelines and DevOps tools
  • Version control (Git/SVN)
  • Application lifecycle tools (Jira, Confluence, Bitbucket)
  • VM and storage management (NFS, SMB, ZFS, NAS)
  • VMware and enterprise hardware environments
Job Responsibility
  • Architect, deploy, and maintain GCP-based platforms, integrating AI/ML services, data pipelines, and automated infrastructure
  • Build and support cloud-native applications, including installation, patching, performance tuning, and systems hardening
  • Implement monitoring, observability, and logging using tools such as Splunk and native GCP services
  • Troubleshoot complex distributed systems using diagnostic tools and structured problem analysis
  • Serve as Tier 1 / Tier 2 escalation for cloud and platform issues, ensuring fast resolution and clear communication
  • Drive CI/CD automation, Git-based workflows, and cloud release management
  • Propose and implement improvements to system performance, reliability, cloud cost efficiency, and developer experience
  • Ensure compliance with IT standards, security policies, and service-level requirements
  • Support engineering use cases by provisioning environments, automating workflows, and optimizing cloud resource utilization
What we offer
  • Flexible workplace arrangement (up to two days a week remote)
Employment Type: Full-time

Senior Cloud Engineer

We believe that there is a smarter, more data-driven way to make decisions in he...
Location: United States, Boston
Salary: 71250.00 - 143750.00 USD / Year
Company: SOPHiA GENETICS
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as a Cloud Engineer, DevOps Engineer, SRE or similar role
  • Proven experience designing and implementing secure cloud infrastructure solutions
  • Expert Kubernetes knowledge
  • Strong Linux knowledge
  • Experience with infrastructure as code tools (e.g., Terraform, Ansible)
  • Strong understanding of cloud platform security features (specific to chosen platform)
  • Experience with observability tools (Prometheus, Grafana, ELK, Loki, etc.)
  • Excellent communication, collaboration, and problem-solving skills
  • A passion for cloud and staying current with the latest advancements
  • Strong knowledge of cloud computing platforms, especially Azure
Job Responsibility
  • Design, architect, and implement secure cloud infrastructure solutions on cloud platforms (Azure in particular)
  • Kubernetes design, implementation, and operations at scale
  • Mentor junior cloud engineers
  • Act as a subject matter expert for Microsoft Azure and Kubernetes
  • Help engineering teams troubleshoot and perform root cause analysis
  • Lead infrastructure and cloud related projects
What we offer
  • Outstanding Medical, Dental & Vision with 90% Employer Contribution
  • Company-matched 401(k) at 4%
  • Company-paid short & long-term disability insurance
  • FSA commuter benefits
  • 20 Days PTO, increasing to 25 with tenure
  • 5 Days Sick and 14 Public Holidays
  • Free EAP
Employment Type: Full-time

Senior Cloud Engineer - Crypto

As a Senior Cloud Engineer on Sokin’s Crypto team, you will architect, deploy, a...
Location: Serbia, Belgrade
Salary: Not provided
Company: Sokin
Expiration Date: Until further notice
Requirements
  • 5+ years of professional cloud engineering experience, with at least 2 years in a fintech, payments, or other regulated industry
  • Proven track record of designing and managing production AWS infrastructure at scale
  • Hands-on experience with Terraform (or equivalent IaC) and CI/CD pipelines (GitHub Actions preferred)
  • Experience with containerisation and orchestration (Docker, Kubernetes, ECS/EKS)
  • Demonstrable experience building or operating blockchain node infrastructure and crypto-related cloud services
  • Proficiency in AWS with relevant certifications (AWS Solutions Architect Professional or equivalent strongly preferred)
  • Expertise in scripting and automation (Python, Bash)
  • Strong understanding of networking (VPC, subnets, VPN, load balancers, service mesh)
  • Knowledge of blockchain infrastructure — running nodes, RPC providers, on-chain data indexing, and wallet/key management systems
  • Understanding of stablecoin mechanics — minting/burning, settlement flows, liquidity management, and the major stablecoin protocols and issuers
Job Responsibility
  • Architect and deploy secure, scalable, and cost-optimised AWS infrastructure to support Sokin’s payments platform, stablecoin settlement services, and real-time transaction processing
  • Design high-availability, multi-region architectures that meet the latency and throughput demands of cross-border payment flows
  • Own the end-to-end infrastructure lifecycle — from capacity planning and provisioning through monitoring, optimisation, and decommissioning
  • Build and maintain blockchain node infrastructure (RPC endpoints, indexers, event listeners) required for on-chain settlement and stablecoin operations
  • Design and operate secure key management and custody infrastructure for blockchain wallets and transaction signing
  • Implement infrastructure for fiat on-ramp/off-ramp services, ensuring reliable connectivity between traditional payment rails and blockchain networks
  • Monitor blockchain node health, chain synchronisation, and mempool conditions to ensure transaction reliability
  • Develop and maintain IaC using Terraform to automate all infrastructure provisioning, configuration, and lifecycle management
  • Integrate cloud infrastructure with CI/CD pipelines using GitHub Actions, enabling seamless and repeatable deployments
  • Implement GitOps workflows and automated testing for infrastructure changes, including policy-as-code compliance checks
Employment Type: Full-time