Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

United States, San Francisco · 166000.00 - 201000.00 USD / Year · Job Posted February 21, 2026

Job Description

We are looking for a highly skilled engineer with deep expertise in building and operating observability platforms at scale. You will design, develop, and run Crusoe’s next-generation observability stack, enabling engineers to understand the internal state of distributed systems through metrics, logs, and traces. Your work will ensure reliability, performance, and actionable insights across Crusoe’s global infrastructure and cloud platform.

Job Responsibility

  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure
  • Mentoring engineers and shaping Crusoe’s observability strategy and technical roadmap

Requirements

  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high-cardinality, high-throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices

Nice to have

  • Contributions to open source observability projects (Prometheus, OpenTelemetry, Grafana, Loki, etc.)
  • Experience supporting AI/ML or GPU-heavy environments with high observability demands
  • Knowledge of event-driven or streaming systems (Kafka, NATS, Pulsar) used in telemetry pipelines
  • Experience implementing cost optimization strategies for large-scale observability platforms
  • Background in incident response, chaos engineering, and reliability practices

What we offer

  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Similar Jobs for Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

8 matching positions

Senior Software Engineer

We are looking for an experienced Senior Software Engineer to help build and mai...
Location: United States
Salary: 143000.00 - 192000.00 USD / Year
Company: dbt Labs
Expiration Date: Until further notice
Requirements
  • 6+ years of experience as a software engineer developing SaaS platforms and applications at scale
  • Proven experience designing and scaling services
  • Strong understanding of API design, system architecture, and database management
  • Proficiency with languages and frameworks including Python, Go, Rust, Django, Node.js, Java, Spring
  • Familiarity with cloud infrastructure such as AWS, GCP, Azure, Kubernetes, Terraform
  • Proficiency in designing API-driven applications using REST and/or gRPC
  • Experience building scalable and secure distributed systems
  • A systematic problem-solving approach, strong communication skills, and a sense of ownership
  • Ability to balance technical depth with fast, iterative delivery
Job Responsibility
  • Design, build, and maintain services and features that scale with our growing customer base
  • Tackle ambiguous, open-ended problems with strategic thinking, balancing technical constraints with user needs and product goals
  • Build services, APIs, and experiences that support user delight, quality, high availability and performance
  • Champion a culture of technical excellence and innovation
  • Work with cross-functional teams, including Product, UX, Infrastructure, and Security, to deliver impactful solutions
  • Contribute to engineering best practices, mentor junior engineers, and participate in design and code reviews
  • Debug production issues and optimize system performance using observability tools
  • Work with technologies such as Python, Rust, TypeScript, Postgres, Kubernetes, AWS, Azure, GCP, Terraform, and Datadog
What we offer
  • Equity Stake
  • Unlimited PTO
  • 401k with a 3% guaranteed contribution
  • Excellent healthcare coverage
  • Paid parental leave
  • Wellness and home office stipends
Employment type: Full-time

Senior Distributed Systems Engineer - Ad Display Platform Engineering

The Bidding Platform organization is the core of the RTB business, processing ov...
Location: Poland
Salary: Not provided
Company: RTB House
Expiration Date: Until further notice
Requirements
  • 7+ years of hands-on experience in software engineering
  • Proficiency in programming
  • Excellent understanding of how complex IT systems work (from the hardware level, through software, to algorithmics)
  • Very good knowledge of fundamental Internet protocols and technologies (DNS, HTTP, cookies and others)
  • Good knowledge of basic methods of creating concurrent programs and distributed systems (from thread level to geo-distributed clusters level)
  • Practical ability to observe, monitor and analyse the operation of production systems (and draw valuable conclusions from it)
  • The ability to critically analyze the solutions created in terms of performance (from estimating the theoretical performance of the designed systems to detecting and removing actual performance problems in production)
  • General knowledge of issues (typical problems and methods of solving them) in the areas of 'high scalability' and 'high availability'
  • C1 level in English and Polish
Job Responsibility
  • Implement and maintain high-quality backend services for displaying ads globally, in all aspects (setting up environments, writing configuration code, monitoring production), with a focus on extreme performance and scalability
  • Develop tools (deployment, testing platforms, web performance and reliability monitoring) and critical optimizations to drive measurable improvements in critical user performance metrics for ad rendering and display
  • Write, test, and deploy robust, efficient, and well-documented code in Java/Python, ensuring adherence to the highest coding and performance standards
  • Participate in code reviews, knowledge sharing sessions, and help implement technical standards and best practices within the team
What we offer
  • Projects focused on extreme performance and high code quality – solid code reviews are our standard
  • Collaboration within an interdisciplinary, self-sufficient team (including DevOps, database experts, backend developers, product designers, and QA engineers)
  • Hardware and software tailored to your preferences (e.g., MacBook, AI tool licenses)
  • Flexible working conditions – no core hours, fully remote cooperation possible

Senior Principal Data Platform Software Engineer

We’re looking for a Sr Principal Data Platform Software Engineer (P70) to be a k...
Salary: 239400.00 - 312550.00 USD / Year
Company: Atlassian
Expiration Date: Until further notice
Requirements
  • 15+ years in Data Engineering, Software Engineering, or related roles, with substantial exposure to big data ecosystems
  • Demonstrated experience building and operating data platforms or large‑scale data services in production
  • Proven track record of building services from the ground up (requirements → design → implementation → deployment → ongoing ownership)
  • Hands‑on experience with AWS and GCP (e.g., compute, storage, data, and streaming services) and cloud‑native architectures
  • Practical experience with big data technologies, such as Databricks, Apache Spark, AWS EMR, Apache Flink, or StarRocks
  • Strong programming skills in one or more of: Kotlin, Scala, Java, Python
  • Experience leading cross‑team technical initiatives and influencing senior stakeholders
  • Experience mentoring Staff/Principal engineers and lifting the technical bar for a team or org
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience
Job Responsibility
  • Design, develop and own delivery of high quality big data and analytical platform solutions aiming to solve Atlassian’s needs to support millions of users with optimal cost, minimal latency and maximum reliability
  • Improve and operate large‑scale distributed data systems in the cloud (primarily AWS, with increasing integration with GCP and Kubernetes‑based microservices)
  • Drive the evolution of our high-performance analytical databases and their integrations with products, cloud infrastructures (AWS and GCP), and isolated cloud environments
  • Help define and uplift engineering and operational standards for petabyte scale data platforms, with sub‑second analytic queries and multi‑region availability (coding guidelines, code review practices, observability, incident response, SLIs/SLOs)
  • Partner across multiple product and platform teams (including Analytics, Marketplace/Ecosystem, Core Data Platform, ML Platform, Search, and Oasis/FedRAMP) to deliver company‑wide initiatives that depend on reliable, high‑quality data
  • Act as a technical mentor and multiplier, raising the bar on design quality, code quality, and operational excellence across the broader team
  • Design and implement self‑healing, resilient data platforms with strong observability, fault tolerance, and recovery characteristics
  • Own the long‑term architecture and technical direction of Atlassian’s product data platform with projects that are directly tied to Atlassian’s company-level OKRs
  • Be accountable for the reliability, cost efficiency, and strategic direction of Atlassian’s product analytical data platform
  • Partner with executives and influence senior leaders to align engineering efforts with Atlassian’s long-term business objectives
What we offer
  • health and wellbeing resources
  • paid volunteer days
Employment type: Full-time

Director of Engineering, Platform Engineering

In your role as ‘Director of Engineering, Platform Engineering’ you will guide t...
Location: United States, Oakland, California
Salary: 241000.00 - 305000.00 USD / Year
Company: Everlaw
Expiration Date: Until further notice
Requirements
  • At least 4 years of experience managing and leading senior engineers, including technical workstream management and execution support
  • At least 2 years of experience managing and leading managers, coaching them on talent management, strategic planning, and execution, with a focus on platform engineering teams
  • At least 5 years of experience as a senior engineer building one or more of the following: developer productivity tools or highly available platform services (e.g., storage systems, pub-sub systems, search systems, caching solutions, observability solutions), and/or expertise and experience with infrastructure and/or cloud technologies (such as Ansible, Terraform, Kubernetes, Docker)
  • You have a good dynamic range that you apply to different situations - you can step back and empower, while also diving deep into the code to understand the details
  • You can communicate at the right altitude with both technical and non-technical stakeholders
  • You have experience working with stakeholder teams (internal and/or external) in setting and collaborating on technical roadmaps
  • You have experience explaining to customers how the platform handles reliability, security, and compliance matters
  • You have a BS/MS or PhD in Computer Science (or equivalent)
  • You have a sound foundational understanding of a wide range of computer science topics and concerns relating to system and software design
  • You are authorized to work in the United States
Job Responsibility
  • Inspire and empower your managers to cultivate high-performing teams, fostering a culture of continuous feedback and professional growth to ensure successful project delivery and career development
  • Use your technical knowledge to align stakeholders across Engineering and Product on the ideal path forward on complex technical decisions and roadmap decisions
  • Strategize, prioritize, resource, and execute against our Engineering roadmap
  • Work with Engineering Operations, cross-functional teams, team members and managers to improve various processes that affect infrastructure growth, support, alignment, collaboration, and accountability
  • Critically observe and understand Everlaw’s platform, tooling, and processes
What we offer
  • Equity program
  • 401(k) retirement plan with company matching
  • Health, dental, and vision
  • Flexible Spending Accounts for health and dependent care expenses
  • Paid parental leave and approximately 10 days (80 hours) per year of sick leave
  • Seventeen paid vacation days plus 11 federal holidays
  • Membership to Modern Health to help employees prioritize mental health and wellness
  • Annual allocation for Learning & Development opportunities and applicable professional membership dues
  • Company-sponsored life and disability insurance
  • Work in Downtown Oakland, just steps from the BART line and dozens of restaurants
Employment type: Full-time

Staff Platform Software Engineer

EarnIn is seeking a Staff Platform Engineer to lead the strategic design, automa...
Location: India, Bengaluru
Salary: Not provided
Company: EarnIn
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s Degree in Computer Science or equivalent industry experience
  • 7+ years of experience in cloud infrastructure, managing large-scale, high-availability, customer-facing distributed systems
  • Proven experience mentoring and guiding senior engineers, driving technical decisions, and leading company-wide cloud initiatives
  • Mastery of public cloud providers, specifically AWS (EKS, DynamoDB, Aurora, Kinesis, etc.)
  • Strong expertise in containerized microservices running on Kubernetes
  • Deep knowledge of automation and configuration management tools (Terraform, Ansible)
  • Expertise in CI/CD pipelines and tools, including Jenkins, GHA, Argo CD, Spinnaker, FluxCD, or similar
  • Experience with advanced observability tools (Datadog, CloudWatch)
  • Track record of leading cost optimization / FinOps initiatives, performance tuning, and operational excellence projects
  • Proven ability to drive cross-functional initiatives with engineering, product, and business teams
Job Responsibility
  • Serve as a key architect and thought leader in the cloud infrastructure domain, guiding the team on best practices
  • Mentor and coach senior engineers across the company in advanced cloud operations practices
  • Provide oversight of hosted Linux and Windows systems, networks, databases, and applications, identifying and solving critical performance, scalability, and stability challenges
  • Design and develop reusable components and operational strategies to enhance the scalability, performance, and monitoring of cloud systems
  • Collaborate with other senior engineers to create technical solutions that address company-wide cloud challenges
  • Lead the establishment and continuous evolution of infrastructure-as-code best practices, driving automation, self-healing, and security standards
  • Drive operational cost savings through service optimizations, autoscaling strategies, and distributed processing architectures
  • Collaborate closely with cross-functional teams, including security, engineering, and business teams, to ensure that operational strategies align with company-wide objectives
  • Provide thought leadership in company-wide initiatives such as observability, automation, and disaster recovery
  • Continuously evaluate existing tools and processes, lead efforts to socialize, present, and implement enhancements for optimal operational efficiency
What we offer
  • healthcare
  • internet/cell phone reimbursement
  • a learning and development stipend
  • opportunities to travel to our Mountain View HQ
Employment type: Full-time

Senior Software Engineer II

We are looking for an experienced Senior Software Engineer II to help build and ...
Location: United States
Salary: 170000.00 - 231000.00 USD / Year
Company: dbt Labs
Expiration Date: Until further notice
Requirements
  • 8+ years of experience as a software engineer developing SaaS platforms and applications at scale
  • Minimum requirement of a Bachelor's degree in a related field (computer science, computer engineering, etc.) OR completion of an engineering-related bootcamp
  • Proven experience designing and scaling backend services
  • Strong understanding of API design, system architecture, and database management
  • Proficiency with backend languages and frameworks such as Python, Go, Rust, Django, Node.js, Java, Spring
  • Familiarity with cloud infrastructure such as AWS, GCP, Azure, Kubernetes, Terraform
  • Proficiency in designing API-driven applications using REST and/or gRPC
  • Experience building scalable and secure distributed systems
  • A systematic problem-solving approach, strong communication skills, and a sense of ownership
  • Ability to balance technical depth with fast, iterative delivery
Job Responsibility
  • Design, build, and maintain services that scale with our growing customer base
  • Tackle ambiguous, open-ended problems with strategic thinking, balancing technical constraints with user needs and product goals
  • Build services, APIs, and experiences that support user delight, quality, high availability and performance
  • Champion a culture of technical excellence and innovation, influencing engineering direction within the team
  • Work with cross-functional teams, including Product, UX, and Security, to deliver impactful solutions
  • Contribute to engineering best practices, mentor junior engineers, and participate in design and code reviews
  • Debug production issues and optimize system performance using observability tools
  • Work with technologies such as Python, Rust, TypeScript, Postgres, Kubernetes, AWS, Terraform, and Datadog
What we offer
  • Equity Stake
  • Unlimited PTO
  • 401k with a 3% guaranteed contribution
  • Excellent healthcare coverage
  • Paid parental leave
  • Wellness and home office stipends
Employment type: Full-time

Senior Software Engineer (Full Stack)

Salary: Not provided
Company: Kloud9
Expiration Date: Until further notice
Requirements
  • Academic background in Computer Science (BS or MS) or equivalent work experience
  • 7 - 9 years of additional relevant professional experience
  • In-depth experience in Java 11+
  • 6+ years’ experience developing scalable applications, APIs, and services in Java tech stack
  • 6+ years’ experience developing UIs using React
  • Hands-on experience in JavaScript, HTML, CSS
  • Experience in microservice architecture, domain-driven design, and RESTful services using Spring Boot
  • 3+ years' experience with relational databases (MySQL/PostgreSQL) and non-relational databases (MongoDB, DynamoDB)
  • Be a self-starter with a passion for technology and a burning desire to constantly improve yourself, the product, and the codebase
  • Experience using cloud services to build an integrated application in production (AWS - EC2, ECS, API gateway, Lambda)
Job Responsibility
  • Building new capabilities in the platform, such as localization and workflow-driven copywriting
  • Building platform excellence in all non-functional pillars such as availability, observability
  • Ensuring platform’s capability to handle voluminous data ingress and egress
  • Hypercare of the platform after release of any capability/additional features
  • Contributing ideas for new features and identifying technical areas for improvement proactively
  • Collaborating with other engineering teams to ensure effective integration of capabilities/data
  • Following best engineering practices to continuously deliver working software
What we offer
  • Kloud9 provides a robust compensation package and a forward-looking opportunity for growth in emerging fields

Senior Software Engineer, Prenatal

Ready to redefine what's possible in molecular diagnostics? Join a team of brill...
Location: United States, Menlo Park
Salary: 190081.00 - 211201.00 USD / Year
Company: BillionToOne
Expiration Date: Until further notice
Requirements
  • 5 - 7 years of professional software development experience with a proven track record of delivering complex projects adhering to best practices
  • Deep understanding and experience using web frameworks like Django and FastAPI
  • Strong system design and architecture capabilities, applying domain-driven design to translate complex business domains into clear, scalable service and data boundaries
  • Familiarity with modern AI-driven development practices and tools
  • Strong foundation in cloud services, preferably AWS, including ECS, S3, AWS Batch, EC2, and AWS Lambda, enabling effective architecture and management of complex systems
  • Excellent communication and collaboration skills, with the ability to work effectively in a team-oriented environment
  • Excited about working in-person with our team in Menlo Park
Job Responsibility
  • Design, build, and operate scalable, high-availability backend services, APIs, and data integrations that power customer-facing product experiences
  • Lead technical delivery across the full lifecycle: architecture, design, implementation, testing, deployment, and ongoing operations
  • Collaborate closely with product, design, and cross-functional engineering teams to deliver performant, reliable, and intuitive platform capabilities
  • Develop clean abstractions and platform APIs that simplify complex LIS workflows while balancing developer velocity with long-term system stability
  • Build and maintain secure, reliable cloud infrastructure on AWS, including compute environments, networking, container orchestration, storage, and environment automation
  • Implement CI/CD pipelines, Infrastructure as Code, configuration management, and automated environment provisioning to standardize and accelerate delivery
  • Embed observability and operational excellence into the platform by default through robust monitoring, logging, alerting, and reliability patterns
  • Ensure the scalability, maintainability, and overall health of backend and cloud systems as the platform grows to support large enterprise workloads and future global expansion
  • Foster a culture of ownership, technical excellence, learning, and continuous improvement within the engineering team
  • Apply AI-assisted tooling and workflows to improve developer productivity, automate routine engineering tasks, enhance code quality, and accelerate troubleshooting
What we offer
  • Working alongside brilliant, kind, passionate and dedicated colleagues, in an empowering environment, toward a global vision, striving for a future in which transformative molecular diagnostics can help millions of patients
  • Open, transparent culture that includes weekly Town Hall meetings
  • The ability to indirectly or directly change the lives of hundreds of thousands of patients
  • Multiple medical benefit options; employee premiums paid 100% on select plans, dependents covered up to 80%
  • Extremely generous Family Bonding Leave for new parents (16 weeks, paid at 100%)
  • Supplemental fertility benefits coverage
  • Retirement savings program including a 4% Company match
  • Paid time off that increases with tenure
  • Latest and greatest hardware (laptop, lab equipment, facilities)
Employment type: Full-time