Software Engineer - Cloud FinOps & Reliability

Luma AI

Location:
United States, Palo Alto

Contract Type:
Not provided

Salary:

120000.00 - 255000.00 USD / Year

Job Description:

This is a foundational engineering position for a technical, data-driven expert who gets excited about optimization at massive scale. As a member of our SRE team, you will specialize in FinOps and cloud cost management, owning the financial health of one of the world's largest multi-cloud GPU infrastructures. You will apply a deep understanding of cloud architecture and pricing models to find and eliminate inefficiency, and you will use your software engineering skills to build the tools and automation required to govern our cloud spend, providing critical insights that allow us to scale our AI research and products sustainably.

Job Responsibility:

  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders

Requirements:

  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficiency in Python for scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant: you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers

Nice to have:

  • Experience managing a monthly cloud spend in excess of $1 million
  • Relevant certifications, such as the FinOps Certified Practitioner (FOCP)
  • Experience building custom cost allocation, showback, or chargeback systems from scratch
  • A background working with large-scale GPU clusters for AI/ML workloads

Additional Information:

Job Posted:
January 13, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Software Engineer - Cloud FinOps & Reliability

Cloud Engineering Manager - FinOps

This role combines technical expertise, leadership, and operational excellence t...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • Proven expertise in cloud platforms (e.g., AWS, Azure, Google Cloud) and cloud-native technologies
  • Strong knowledge of FinOps principles and cloud financial management, including cost optimization, forecasting, and governance
  • Experience with application development frameworks (e.g., Node.js, Python, Java) and modern software engineering practices
  • Familiarity with cloud monitoring and cost management tools, such as AWS Cost Explorer, Azure Cost Management, or third-party FinOps platforms (e.g., CloudHealth, Apptio)
  • Proficiency in containerization and orchestration technologies such as Docker and Kubernetes
  • Demonstrated success in leading engineering teams, managing priorities, and delivering complex projects on time and within budget
  • Strong collaboration skills, with the ability to work effectively across engineering, finance, and business teams
  • Exceptional ability to communicate technical concepts to non-technical stakeholders and align engineering efforts with business goals
  • Bachelor’s or master’s degree in computer science, engineering, information systems, or related field
  • Typically, 7-10 years’ experience, including 0-2 years of people management experience
Job Responsibility:
  • Lead and inspire a team of cloud engineers focused on FinOps application development, fostering a culture of innovation, collaboration, and continuous improvement
  • Drive the design, development, and implementation of cloud engineering applications that enable visibility, optimization, and governance of cloud costs and usage
  • Architect scalable, secure, and resilient solutions that align with FinOps principles (e.g., cost optimization, forecasting, usage analytics)
  • Collaborate with product managers and business stakeholders to define requirements, prioritize features, and deliver value-driven solutions
  • Ensure seamless integration of FinOps applications with existing HPE cloud platform tools and systems
  • Lead efforts to optimize cloud infrastructure costs and usage patterns across HPE's cloud platforms, leveraging advanced analytics and automation
  • Establish and enforce engineering best practices, including CI/CD pipelines, DevSecOps principles, and automated testing frameworks
  • Monitor and improve application performance, reliability, and scalability through proactive measures and robust incident management
  • Collaborate with finance teams to ensure compliance with cloud spending policies and reporting requirements
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Full-time

Staff Platform Software Engineer

EarnIn is seeking a Staff Platform Engineer to lead the strategic design, automa...
Location:
India, Bengaluru
Salary:
Not provided
EarnIn
Expiration Date
Until further notice
Requirements:
  • Bachelor’s or Master’s Degree in Computer Science or equivalent industry experience
  • 7+ years of experience in cloud infrastructure, managing large-scale, high-availability, customer-facing distributed systems
  • Proven experience mentoring and guiding senior engineers, driving technical decisions, and leading company-wide cloud initiatives
  • Mastery of public cloud providers, specifically AWS (EKS, DynamoDB, Aurora, Kinesis, etc.)
  • Strong expertise in containerized microservices running on Kubernetes
  • Deep knowledge of automation and configuration management tools (Terraform, Ansible)
  • Expertise in CI/CD pipelines and tools, including Jenkins, GHA, Argo CD, Spinnaker, and FluxCD or similar
  • Experience with advanced observability tools (DataDog, CloudWatch)
  • Track record of leading cost optimization / FinOps initiatives, performance tuning, and operational excellence projects
  • Proven ability to drive cross-functional initiatives with engineering, product, and business teams
Job Responsibility:
  • Serve as a key architect and thought leader in the cloud infrastructure domain, guiding the team on best practices
  • Mentor and coach senior engineers across the company in advanced cloud operations practices
  • Provide oversight of hosted Linux and Windows systems, networks, databases, and applications, identifying and solving critical performance, scalability, and stability challenges
  • Design and develop reusable components and operational strategies to enhance the scalability, performance, and monitoring of cloud systems
  • Collaborate with other senior engineers to create technical solutions that address company-wide cloud challenges
  • Lead the establishment and continuous evolution of infrastructure-as-code best practices, driving automation, self-healing, and security standards
  • Drive operational cost savings through service optimizations, autoscaling strategies, and distributed processing architectures
  • Collaborate closely with cross-functional teams, including security, engineering, and business teams, to ensure that operational strategies align with company-wide objectives
  • Provide thought leadership in company-wide initiatives such as observability, automation, and disaster recovery
  • Continuously evaluate existing tools and processes, lead efforts to socialize, present, and implement enhancements for optimal operational efficiency
What we offer:
  • healthcare
  • internet/cell phone reimbursement
  • a learning and development stipend
  • opportunities to travel to our Mountain View HQ
Employment Type:
Full-time

Senior Platform Engineer

Shape Entalpic’s cloud-native platform: design scalable data pipelines, secure i...
Location:
France, Paris
Salary:
Not provided
Breega
Expiration Date
Until further notice
Requirements:
  • M.S. in Computer Science, Software Engineering, or a related technical field
  • 10+ years of experience in software engineering, with a strong focus on data platforms and cloud infrastructure
  • You are a "Swiss army knife" software engineer at heart—happy to write code, build systems, and solve problems wherever the need arises
  • Excellent communication skills in English, with the ability to act as an evangelist for best infrastructure practices across interdisciplinary teams
  • Thrives in a fast-paced, evolving startup environment
  • Programming: Excellent industrial software engineering skills in Python and at least one other programming language (e.g., Go, Rust)
  • Cloud & Infrastructure as Code: Deep expertise managing cloud platforms (GCP preferred) and defining infrastructure via Terraform
  • DevOps & CI/CD: Strong track record of automating deployment pipelines (GitHub Actions) and introducing containerization to legacy or evolving stacks
  • Data Systems: Hands-on experience optimizing and scaling both SQL (PostgreSQL) and NoSQL (MongoDB) databases, alongside workflow orchestration tools like Airflow
  • Containerization & Orchestration Ecosystem: Strong practical experience with Docker and Kubernetes. Familiarity with ML-specific orchestration frameworks (such as Ray or Kubeflow) is highly advantageous
Job Responsibility:
  • Cloud & Platform: Manage and evolve our existing GCP infrastructure using Terraform, proactively challenging the stack for scalability and financial optimization (FinOps)
  • DevOps & Developer Experience: Consolidate and strengthen our DevOps culture across projects and teams, and help design and deploy efficient CI/CD pipelines that enhance the overall developer experience for the ML and scientific teams
  • Data Storage & Pipeline Architecture: Design and manage robust infrastructure and tiered data storage to support pipelines handling high-throughput interactions between LLM agents, PostgreSQL, MongoDB, and Airflow
  • Security & Isolation: Design logical separation for tenant data and compute environments to protect highly sensitive intellectual property, keeping upcoming SOC-2 and ISO 27001 standards in mind
  • Containerization & Orchestration: Evaluate, introduce, and implement containerization technologies (e.g., Docker, Kubernetes) from the ground up to improve deployment reliability
  • AI automations: Contribute to a culture of AI-driven automation by developing automated workflows, staying up to date with the latest advancements in the field, and mentoring across teams within the company, from DevOps to business needs
What we offer:
  • A competitive salary
  • Equity (BSPCE), to reflect the value you bring to Entalpic and to foster a shared journey
  • Comprehensive health insurance (Alan blue)
  • French level paid leave and time-off work
  • Dynamic work setting. Although our preference is for in-person collaboration, we will be flexible with occasional remote work arrangements

Distinguished Engineer

At GEICO, we offer a rewarding career where your ambitions are met with endless ...
Location:
United States, Chevy Chase
Salary:
150000.00 - 300000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements:
  • 10+ years of professional experience in software engineering
  • 8+ years of experience with architecture and design
  • 6+ years of experience in open-source frameworks
  • 4+ years of experience with AWS, GCP, Azure, or another cloud service
  • Bachelor's degree in computer science, Information Systems, or equivalent education or work experience
  • Deep hands-on experience building complex distributed systems that process large-scale telemetry, and architectures that support that scale and performance, with strong knowledge of Docker and Kubernetes
  • Advanced knowledge of at least two OOP languages such as Java, Go, Python, etc.
  • Strong understanding of open-source databases such as MySQL and PostgreSQL, and a strong foundation in NoSQL databases and query engines such as ClickHouse, Cassandra, and Apache Trino. Knowledge of big data formats such as Parquet or Avro
  • Experience architecting, designing, and building observability platform solutions and advanced data analytics using open-source technologies is a big plus
  • Experience building distributed systems
Job Responsibility:
  • Develop and drive the overall tech strategy for the Reliability and observability tools organization, and report to the Senior Director
  • Focus on multiple areas and provide technical and thought leadership as Observability Domain Technical Champion
  • Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
  • Develop and execute technical software development strategy for the Observability Engineering domain
  • Accountable for the quality, usability, and performance of the solutions
  • Be a role model and mentor, helping to coach and strengthen the technical expertise and know-how of our engineering and product community. Influence and educate executives
  • Consistently share best practices and improve processes within and across teams
  • Lead the design and architecture of resilient and scalable systems, considering both on-premises and cloud-based solutions
  • Develop and maintain comprehensive incident response plans to address various disaster scenarios on our backup/restore systems
  • Conduct regular simulations and drills to ensure the readiness of the organization in the event of a disaster
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Support for flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type:
Full-time

Senior Software Engineer, Capacity & Efficiency

Join us in building the future of finance. Our mission is to democratize finance...
Location:
United States, Bellevue
Salary:
196000.00 - 230000.00 USD / Year
Robinhood
Expiration Date
Until further notice
Requirements:
  • Experience building and operating production systems in a cloud-native environment, ideally on AWS
  • Strong proficiency with Kubernetes and a practical understanding of resource efficiency and capacity planning
  • Experience working on infrastructure, platform, or data-heavy systems where cost, scale, and reliability matter
  • Ability to reason about cloud cost models, including tradeoffs between performance, reliability, and spend
  • Clear communication skills and comfort working with partner teams to explain findings and technical recommendations
Job Responsibility:
  • Build and maintain software systems that detect, track, and attribute AWS cloud costs to the correct teams, services, and workloads
  • Develop tooling to identify cost anomalies, regressions, and over-provisioned resources across Kubernetes and managed services
  • Partner with Data Science to support forecasting models, unit economics, and projections that surface future cost risks
  • Analyze infrastructure usage patterns to identify inefficiencies and implement technical solutions that reduce cloud spend
  • Collaborate with partner teams to land efficiency improvements, validate cost reductions, and track remediation outcomes
What we offer:
  • Performance-driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Lifestyle wallet — a highly flexible benefits spending account for wellness, learning, and more
  • Employer-paid life & disability insurance, fertility benefits, and mental health benefits
  • Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
  • Exceptional office experience with catered meals, events, and comfortable workspaces
Employment Type:
Full-time

Principal Azure DevOps Engineer

We are looking to recruit an SC Cleared Principal Azure DevOps Engineer for a le...
Location:
United Kingdom
Salary:
80000.00 - 90000.00 GBP / Year
DataCareers
Expiration Date
Until further notice
Requirements:
  • Extensive experience in Azure services and architecture (VMs, EntraID, Application Gateway, Sentinel, Defender for Cloud, Azure Fabric, Functions, Logic Apps, Front Door, App Service, Dev Box, Azure Migrate)
  • Strong expertise in Azure DevOps, GitHub CI/CD, and build/release automation
  • Proficiency with Infrastructure as Code (Terraform, Pulumi, CloudFormation, PowerShell)
  • Experience deploying solutions in AWS is desirable
  • Familiarity with containerization and orchestration (Docker, Kubernetes) and automation/configuration tools (Ansible)
  • Strong scripting skills (PowerShell, Bash, Python)
  • Experience with monitoring and observability tools (Grafana, Azure Monitor, DataDog, New Relic)
  • Deep understanding of cloud security, governance, and FinOps principles
  • Solid Windows, Linux, and Microsoft 365 design and implementation experience
  • Proven experience migrating databases (e.g., MS SQL) in cloud environments
Job Responsibility:
  • Lead the design and implementation of cloud infrastructure and DevOps processes across client projects
  • Act as a technical advisor for cloud engineers, providing guidance on CI/CD automation, container orchestration, and platform reliability
  • Design, document, and maintain secure technical and security architectures aligned with best practices
  • Collaborate with Architecture, Security, Software Engineering, and Product teams to align cloud platform strategy
  • Drive improvements in automation, infrastructure as code, and overall DevOps maturity across projects
  • Mentor and coach engineering teams to adopt modern engineering practices and automation strategies
  • Deliver large-scale infrastructure transformation projects with low-level design expertise
  • Stay ahead of emerging technologies, applying them to deliver maximum client value
Employment Type:
Full-time

Senior Azure DevOps Engineer

We are looking to recruit an SC Cleared Senior Azure DevOps Engineer for a leadi...
Location:
United Kingdom
Salary:
80000.00 - 90000.00 GBP / Year
DataCareers
Expiration Date
Until further notice
Requirements:
  • Extensive experience in Azure services and architecture (VMs, EntraID, Application Gateway, Sentinel, Defender for Cloud, Azure Fabric, Functions, Logic Apps, Front Door, App Service, Dev Box, Azure Migrate)
  • Strong expertise in Azure DevOps, GitHub CI/CD, and build/release automation
  • Proficiency with Infrastructure as Code (Terraform, Pulumi, CloudFormation, PowerShell)
  • Experience deploying solutions in AWS is desirable
  • Familiarity with containerization and orchestration (Docker, Kubernetes) and automation/configuration tools (Ansible)
  • Strong scripting skills (PowerShell, Bash, Python)
  • Experience with monitoring and observability tools (Grafana, Azure Monitor, DataDog, New Relic)
  • Deep understanding of cloud security, governance, and FinOps principles
  • Solid Windows, Linux, and Microsoft 365 design and implementation experience
  • Proven experience migrating databases (e.g., MS SQL) in cloud environments
Job Responsibility:
  • Lead the design and implementation of cloud infrastructure and DevOps processes across client projects
  • Act as a technical advisor for cloud engineers, providing guidance on CI/CD automation, container orchestration, and platform reliability
  • Design, document, and maintain secure technical and security architectures aligned with best practices
  • Collaborate with Architecture, Security, Software Engineering, and Product teams to align cloud platform strategy
  • Drive improvements in automation, infrastructure as code, and overall DevOps maturity across projects
  • Mentor and coach engineering teams to adopt modern engineering practices and automation strategies
  • Deliver large-scale infrastructure transformation projects with low-level design expertise
  • Stay ahead of emerging technologies, applying them to deliver maximum client value
Employment Type:
Full-time

Engineering Manager - Machine Learning

As an ML Engineering Team Lead at Aignostics, you will lead a high-performing tea...
Location:
Germany, Berlin
Salary:
Not provided
Aignostics
Expiration Date
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field
  • 6+ years of software engineering or ML engineering experience, with at least 2 years in a technical leadership or team lead role
  • Proven track record of building and leading high-performing engineering teams
  • Experience guiding projects across the whole Software Development Life Cycle
  • Deep understanding of fundamental Machine Learning concepts and principles
  • Familiarity with advanced model optimization techniques
  • Significant experience with large-scale distributed training systems and frameworks (especially PyTorch and NCCL)
  • Familiarity with GPUs, distributed systems, parallel computing and scaling laws
  • Advanced programming skills in Python
  • Familiarity with MLOps/DevOps best practices, including CI/CD, Docker, Kubernetes, and observability
Job Responsibility:
  • Build and scale a high-performing team capable of tackling complex distributed ML challenges
  • Own the full employee lifecycle: recruiting, onboarding, performance management, career development, and retention
  • Empower your team members and help them grow in autonomy and technical expertise
  • Mentor engineers at all levels
  • Create an inclusive environment where diverse perspectives drive innovation
  • Define and execute technical roadmaps aligned with company objectives and product needs
  • Lead resource allocation and capacity planning
  • Own FinOps responsibilities: optimize cloud costs, track spending, and ensure efficient resource utilization
  • Ensure operational readiness through monitoring, incident response protocols, and system reliability practices
  • Establish and track KPIs for team performance, system efficiency and health
What we offer:
  • Learning & Development yearly budget of 1,000€ (plus 2 L&D days)
  • Language classes
  • Internal development programs
  • Access to leadership development programs and executive coaching
  • Flexible working hours and teleworking policy
  • 30 paid vacation days per year
  • Family & pet friendly
  • Support for flexible parental leave options
  • Subsidized membership of your choice among public transport, sports, and well-being
  • Social gatherings, lunches, and off-site events