Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

Crusoe

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

166000.00 - 201000.00 USD / Year

Job Description:

We are looking for a highly skilled engineer with deep expertise in building and operating observability platforms at scale. You will design, develop, and run Crusoe’s next-generation observability stack, enabling engineers to understand the internal state of distributed systems through metrics, logs, and traces. Your work will ensure reliability, performance, and actionable insights across Crusoe’s global infrastructure and cloud platform.

Job Responsibility:

  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure
  • Mentoring engineers and shaping Crusoe’s observability strategy and technical roadmap

Requirements:

  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high-cardinality, high-throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices

Nice to have:

  • Contributions to open source observability projects (Prometheus, OpenTelemetry, Grafana, Loki, etc.)
  • Experience supporting AI/ML or GPU-heavy environments with high observability demands
  • Knowledge of event-driven or streaming systems (Kafka, NATS, Pulsar) used in telemetry pipelines
  • Experience implementing cost optimization strategies for large-scale observability platforms
  • Background in incident response, chaos engineering, and reliability practices

What we offer:
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

Senior Software Engineer - Transactional Data Platform

As a Senior Software Engineer, you will play a critical role in designing, build...
Location:
Australia, Sydney
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field
  • 5+ years of experience in backend software development
  • 3+ years of hands-on experience working with AWS cloud services, particularly AWS storage technologies (S3, DynamoDB, EBS, EFS, FSx, or Glacier)
  • 3+ years of experience in designing and developing distributed systems or high-scale backend services
  • Strong programming skills in Kotlin
  • Experience working in agile environments following DevOps and CI/CD best practices
  • Strong backend development skills
  • Proficiency in Kotlin, Java for backend development
  • Experience building high-performance, scalable microservices and APIs
  • Strong understanding of RESTful APIs, gRPC, and event-driven architectures
Job Responsibility:
  • Designing, building, and optimizing high-performance, scalable, and resilient backend storage solutions on AWS cloud infrastructure
  • Developing distributed storage systems, APIs, and backend services that power mission-critical applications, ensuring low-latency, high-throughput, and fault-tolerant data storage
  • Collaborating closely with principal engineers, architects, SREs, and product teams to define technical roadmaps, improve storage efficiency, and optimize access patterns
  • Driving performance tuning, data modeling, caching strategies, and cost optimization across AWS storage services like S3, DynamoDB, EBS, EFS, FSx, and Glacier
  • Contributing to infrastructure automation, security best practices, and monitoring strategies using tools like Terraform, CloudWatch, Prometheus, and OpenTelemetry
  • Troubleshooting and resolving production incidents related to data integrity, latency spikes, and storage failures, ensuring high availability and disaster recovery preparedness
  • Mentoring junior engineers, participating in design reviews and architectural discussions, and advocating for engineering best practices such as CI/CD automation, infrastructure as code, and observability-driven development
What we offer:
  • Atlassians can choose where they work – whether in an office, from home, or a combination of the two
  • Flexibility for eligible candidates to work remotely across the West US
  • Fulltime

Senior Software Engineer

We are looking for an experienced Senior Software Engineer to help build and mai...
Location:
United States
Salary:
143000.00 - 192000.00 USD / Year
dbt Labs
Expiration Date:
Until further notice
Requirements:
  • 6+ years of experience as a software engineer developing SaaS platforms and applications at scale
  • Proven experience designing and scaling services
  • Strong understanding of API design, system architecture, and database management
  • Proficiency with languages and frameworks including Python, Go, Rust, Django, Node.js, Java, Spring
  • Familiarity with cloud infrastructure such as AWS, GCP, Azure, Kubernetes, Terraform
  • Proficiency in designing API-driven applications using REST and/or gRPC
  • Experience building scalable and secure distributed systems
  • A systematic problem-solving approach, strong communication skills, and a sense of ownership
  • Ability to balance technical depth with fast, iterative delivery
Job Responsibility:
  • Design, build, and maintain services and features that scale with our growing customer base
  • Tackle ambiguous, open-ended problems with strategic thinking, balancing technical constraints with user needs and product goals
  • Build services, APIs, and experiences that support user delight, quality, high availability and performance
  • Champion a culture of technical excellence and innovation
  • Work with cross-functional teams, including Product, UX, Infrastructure, and Security, to deliver impactful solutions
  • Contribute to engineering best practices, mentor junior engineers, and participate in design and code reviews
  • Debug production issues and optimize system performance using observability tools
  • Work with technologies such as Python, Rust, TypeScript, Postgres, Kubernetes, AWS, Azure, GCP, Terraform, and Datadog
What we offer:
  • Equity Stake
  • Unlimited PTO
  • 401k with a 3% guaranteed contribution
  • Excellent healthcare coverage
  • Paid parental leave
  • Wellness and home office stipends
  • Fulltime

Senior Distributed Systems Engineer - Ad Display Platform Engineering

The Bidding Platform organization is the core of the RTB business, processing ov...
Location:
Poland
Salary:
Not provided
RTB House
Expiration Date:
Until further notice
Requirements:
  • 7+ years of hands-on experience in software engineering
  • Proficiency in programming
  • Excellent understanding of how complex IT systems work (from the hardware level, through software, to algorithmics)
  • Very good knowledge of fundamental Internet protocols and technologies (DNS, HTTP, cookies and others)
  • Good knowledge of basic methods of creating concurrent programs and distributed systems (from thread level to geo-distributed clusters level)
  • Practical ability to observe, monitor and analyse the operation of production systems (and draw valuable conclusions from it)
  • The ability to critically analyze the solutions created in terms of performance (from estimating the theoretical performance of the designed systems to detecting and removing actual performance problems in production)
  • General knowledge of issues (typical problems and methods of solving them) in the areas of 'high scalability' and 'high availability'
  • C1 level in English and Polish
Job Responsibility:
  • Implement and maintain (in all aspects, including setting up the environment, writing configuration code, and monitoring production) high-quality backend services for displaying ads globally, focusing on extreme performance and scalability
  • Develop tools (deployment, testing platforms, web performance and reliability monitoring), and critical optimizations to drive measurable improvements in critical user performance metrics for ad rendering and display
  • Write, test, and deploy robust, efficient, and well-documented code in Java/Python, ensuring adherence to the highest coding and performance standards
  • Participate in code reviews, knowledge sharing sessions, and help implement technical standards and best practices within the team
What we offer:
  • Projects focused on extreme performance and high code quality – solid code reviews are our standard
  • Collaboration within an interdisciplinary, self-sufficient team (including DevOps, database experts, backend developers, product designers, and QA engineers)
  • Hardware and software tailored to your preferences (e.g., MacBook, AI tool licenses)
  • Flexible working conditions – no core hours, fully remote cooperation possible

Senior Principal Data Platform Software Engineer

We’re looking for a Sr Principal Data Platform Software Engineer (P70) to be a k...
Salary:
239400.00 - 312550.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 15+ years in Data Engineering, Software Engineering, or related roles, with substantial exposure to big data ecosystems
  • Demonstrated experience building and operating data platforms or large‑scale data services in production
  • Proven track record of building services from the ground up (requirements → design → implementation → deployment → ongoing ownership)
  • Hands‑on experience with AWS, GCP (e.g., compute, storage, data, and streaming services) and cloud‑native architectures
  • Practical experience with big data technologies, such as Databricks, Apache Spark, AWS EMR, Apache Flink, or StarRocks
  • Strong programming skills in one or more of: Kotlin, Scala, Java, Python
  • Experience leading cross‑team technical initiatives and influencing senior stakeholders
  • Experience mentoring Staff/Principal engineers and lifting the technical bar for a team or org
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience
Job Responsibility:
  • Design, develop and own delivery of high quality big data and analytical platform solutions aiming to solve Atlassian’s needs to support millions of users with optimal cost, minimal latency and maximum reliability
  • Improve and operate large‑scale distributed data systems in the cloud (primarily AWS, with increasing integration with GCP and Kubernetes‑based microservices)
  • Drive the evolution of our high-performance analytical databases and their integrations with products, cloud infrastructures (AWS and GCP) and isolated cloud environments
  • Help define and uplift engineering and operational standards for petabyte scale data platforms, with sub‑second analytic queries and multi‑region availability (coding guidelines, code review practices, observability, incident response, SLIs/SLOs)
  • Partner across multiple product and platform teams (including Analytics, Marketplace/Ecosystem, Core Data Platform, ML Platform, Search, and Oasis/FedRAMP) to deliver company‑wide initiatives that depend on reliable, high‑quality data
  • Act as a technical mentor and multiplier, raising the bar on design quality, code quality, and operational excellence across the broader team
  • Design and implement self‑healing, resilient data platforms with strong observability, fault tolerance, and recovery characteristics
  • Own the long‑term architecture and technical direction of Atlassian’s product data platform with projects that are directly tied to Atlassian’s company-level OKRs
  • Be accountable for the reliability, cost efficiency, and strategic direction of Atlassian’s product analytical data platform
  • Partner with executives and influence senior leaders to align engineering efforts with Atlassian’s long-term business objectives
What we offer:
  • Health and wellbeing resources
  • Paid volunteer days
  • Fulltime

Senior Software Engineer - Payments Integration

You will help us build and improve new payment capabilities. You’ll be part of a...
Location:
Netherlands, Amsterdam
Salary:
Not provided
IKEA
Expiration Date:
Until further notice
Requirements:
  • Proven record of building microservices using modern frameworks and practices
  • Able to deliver secure, scalable, and resilient services based on DevOps principles
  • Experienced with cloud computing platforms – preferably GCP, but other platforms are great too
  • Experienced using modern observability practices such as distributed tracing, structured logging and metrics
  • Experienced with deployment automation tools such as Terraform and GitHub Actions
  • Have great written and verbal communication skills
Job Responsibility:
  • Build and improve new payment capabilities
  • Be part of a global team taking care of many thousands of payments per minute for more than 30 markets worldwide
  • Promote innovation, learning, curiosity and best practices
  • Embrace DevOps
  • Commit code until it is available for customers
  • Focus on observability, security, scalability and resilience
What we offer:
  • Direct impact on the lives of millions of people worldwide
  • A challenging yet fun work environment
  • Continuous learning with 80/20 rule (80% work on product, 20% work on yourself)
  • A place to be you with diversity and inclusion
  • Fulltime

Director of Engineering, Platform Engineering

In your role as ‘Director of Engineering, Platform Engineering’ you will guide t...
Location:
United States, Oakland, California
Salary:
241000.00 - 305000.00 USD / Year
Everlaw
Expiration Date:
Until further notice
Requirements:
  • At least 4 years of experience managing and leading senior engineers, including technical workstream management and execution support
  • At least 2 years of experience managing and leading managers, coaching them on talent management, strategic planning, and execution, with a focus on platform engineering teams
  • At least 5 years of experience as a senior engineer building one or more of: developer productivity tools or highly available platform services (e.g., storage systems, pub-sub systems, search systems, caching solutions, observability solutions), and/or expertise and experience with infrastructure and/or cloud technologies (such as Ansible, Terraform, Kubernetes, Docker, etc.)
  • You have a good dynamic range that you apply to different situations - you can step back and empower, while also diving deep into the code to understand the details
  • You can communicate at the right altitude with both technical and non-technical stakeholders
  • You have experience working with stakeholder teams (internal and/or external) in setting and collaborating on technical roadmaps
  • You have experience explaining to customers how the platform handles reliability, security, and compliance matters
  • You have a BS/MS or PhD in Computer Science (or equivalent)
  • You have a sound foundational understanding of a wide range of computer science topics and concerns relating to system and software design
  • You are authorized to work in the United States
Job Responsibility:
  • Inspire and empower your managers to cultivate high-performing teams, fostering a culture of continuous feedback and professional growth to ensure successful project delivery and career development
  • Use your technical knowledge to align stakeholders across Engineering and Product on the ideal path forward on complex technical decisions and roadmap decisions
  • Strategize, prioritize, resource, and execute against our Engineering roadmap
  • Work with Engineering Operations, cross-functional teams, team members and managers to improve various processes that affect infrastructure growth, support, alignment, collaboration, and accountability
  • Critically observe and understand Everlaw’s platform, tooling, and processes
What we offer:
  • Equity program
  • 401(k) retirement plan with company matching
  • Health, dental, and vision
  • Flexible Spending Accounts for health and dependent care expenses
  • Paid parental leave and approximately 10 days (80 hours) per year of sick leave
  • Seventeen paid vacation days plus 11 federal holidays
  • Membership to Modern Health to help employees prioritize mental health and wellness
  • Annual allocation for Learning & Development opportunities and applicable professional membership dues
  • Company-sponsored life and disability insurance
  • Work in Downtown Oakland, just steps from the BART line and dozens of restaurants
  • Fulltime

Staff Platform Software Engineer

EarnIn is seeking a Staff Platform Engineer to lead the strategic design, automa...
Location:
India, Bengaluru
Salary:
Not provided
EarnIn
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s Degree in Computer Science or equivalent industry experience
  • 7+ years of experience in cloud infrastructure, managing large-scale, high-availability, customer-facing distributed systems
  • Proven experience mentoring and guiding senior engineers, driving technical decisions, and leading company-wide cloud initiatives
  • Mastery of public cloud providers, specifically AWS (EKS, DynamoDB, Aurora, Kinesis, etc.)
  • Strong expertise in containerized microservices running on Kubernetes
  • Deep knowledge of automation and configuration management tools (Terraform, Ansible)
  • Expertise in CI/CD pipelines and tools, including Jenkins, GHA, Argo CD, Spinnaker, and FluxCD or similar
  • Experience with advanced observability tools (Datadog, CloudWatch)
  • Track record of leading cost optimization / FinOps initiatives, performance tuning, and operational excellence projects
  • Proven ability to drive cross-functional initiatives with engineering, product, and business teams
Job Responsibility:
  • Serve as a key architect and thought leader in the cloud infrastructure domain, guiding the team on best practices
  • Mentor and coach senior engineers across the company in advanced cloud operations practices
  • Provide oversight of hosted Linux and Windows systems, networks, databases, and applications, identifying and solving critical performance, scalability, and stability challenges
  • Design and develop reusable components and operational strategies to enhance the scalability, performance, and monitoring of cloud systems
  • Collaborate with other senior engineers to create technical solutions that address company-wide cloud challenges
  • Lead the establishment and continuous evolution of infrastructure-as-code best practices, driving automation, self-healing, and security standards
  • Drive operational cost savings through service optimizations, autoscaling strategies, and distributed processing architectures
  • Collaborate closely with cross-functional teams, including security, engineering, and business teams, to ensure that operational strategies align with company-wide objectives
  • Provide thought leadership in company-wide initiatives such as observability, automation, and disaster recovery
  • Continuously evaluate existing tools and processes, and lead efforts to socialize, present, and implement enhancements for optimal operational efficiency
What we offer:
  • Healthcare
  • Internet/cell phone reimbursement
  • A learning and development stipend
  • Opportunities to travel to our Mountain View HQ
  • Fulltime

Senior Software Engineer II

We are looking for an experienced Senior Software Engineer II to help build and ...
Location:
United States
Salary:
170000.00 - 231000.00 USD / Year
dbt Labs
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience as a software engineer developing SaaS platforms and applications at scale
  • Minimum requirement of a Bachelor's Degree in a related field (computer science, computer engineering, etc.) OR completed enrollment in an engineering-related bootcamp
  • Proven experience designing and scaling backend services
  • Strong understanding of API design, system architecture, and database management
  • Proficiency with backend languages and frameworks such as Python, Go, Rust, Django, Node.js, Java, Spring
  • Familiarity with cloud infrastructure such as AWS, GCP, Azure, Kubernetes, Terraform
  • Proficiency in designing API-driven applications using REST and/or gRPC
  • Experience building scalable and secure distributed systems
  • A systematic problem-solving approach, strong communication skills, and a sense of ownership
  • Ability to balance technical depth with fast, iterative delivery
Job Responsibility:
  • Design, build, and maintain services that scale with our growing customer base
  • Tackle ambiguous, open-ended problems with strategic thinking, balancing technical constraints with user needs and product goals
  • Build services, APIs, and experiences that support user delight, quality, high availability and performance
  • Champion a culture of technical excellence and innovation, influencing engineering direction within the team
  • Work with cross-functional teams, including Product, UX, and Security, to deliver impactful solutions
  • Contribute to engineering best practices, mentor junior engineers, and participate in design and code reviews
  • Debug production issues and optimize system performance using observability tools
  • Work with technologies such as Python, Rust, TypeScript, Postgres, Kubernetes, AWS, Terraform, and Datadog
What we offer:
  • Equity Stake
  • Unlimited PTO
  • 401k with a 3% guaranteed contribution
  • Excellent healthcare coverage
  • Paid parental leave
  • Wellness and home office stipends
  • Fulltime