Software Engineer - Cloud FinOps & Reliability

United States, Palo Alto · 120000.00 - 255000.00 USD / Year · Job Posted January 13, 2026

Job Description

This is an engineering position for a technical, data-driven expert who gets excited about optimization at massive scale. As a foundational member of our SRE team, you will specialize in FinOps and cloud cost management, owning the financial health of one of the world's largest multi-cloud GPU infrastructures. As an SRE, you will apply a deep understanding of cloud architecture and pricing models to find and eliminate inefficiency. You will use your software engineering skills to build the tools and automation required to govern our cloud spend, providing critical insights that allow us to scale our AI research and products sustainably.

Job Responsibility

  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million-dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders

Requirements

  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficient in Python for scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant: you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers

Nice to have

  • Experience managing a monthly cloud spend in excess of $1 million
  • Relevant certifications, such as the FinOps Certified Practitioner (FOCP)
  • Experience building custom cost allocation, showback, or chargeback systems from scratch
  • A background working with large-scale GPU clusters for AI/ML workloads

Similar Jobs for Software Engineer - Cloud FinOps & Reliability

8 matching positions

Senior Software Engineer – Cloud Engineering & FinOps

Work Arrangement: Hybrid: This role is categorized as hybrid. This means the suc...
Location
United States, Austin, Texas; Warren, Michigan
Salary
Not provided
General Motors
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science or a related technical field (or equivalent practical experience)
  • 5+ years of hands-on software engineering experience, with a strong focus on building cloud-native applications and platforms
  • Proficiency in modern programming languages: Next.js (or React) for frontend and Go for backend services
  • Strong experience with Docker for containerization and Kubernetes for orchestration
  • Experience designing and building scalable data pipelines, APIs, or backend services in a cloud environment
  • Solid understanding of cloud fundamentals across at least one major provider (Azure, GCP, AWS), including cost structures, billing concepts, and resource optimization
  • Demonstrated ability to write clean, maintainable code, conduct code reviews, and participate in technical decision-making
  • Demonstrated ability to clearly communicate technical and non-technical information verbally and in writing
  • Strong problem-solving skills with the ability to deliver high-quality features quickly in an agile environment
Job Responsibility
  • Building FinOps tooling and cloud onboarding experiences that power GM's enterprise-wide cloud transformation
  • Design, develop, and evolve our in-house Cloud Onboarding and FinOps Portal—a modern platform built with Next.js (frontend) and Go (backend services)
  • Creating frictionless developer and team experiences by embedding cost awareness, usage optimization, and governance directly into the cloud onboarding and operational workflows
  • Own key components, including billing data ingestion pipelines from major cloud providers (Azure, GCP, AWS); utilization metrics, cost analytics, and optimization recommendation engines; and cloud onboarding workflows and frictionless, self-service capabilities
  • Design and build scalable, cloud-native services with speed and quality
  • Lead technical decision-making and architecture discussions
  • Conduct code reviews and uphold high engineering standards across the team
  • Collaborate closely with peer teams to design new features and deliver end-to-end solutions
  • Fulltime

Principal Software Engineer

The Principal Software Engineer is the senior-most hands-on technical leader for...
Location
India, Chennai
Salary
Not provided
RX Global
Expiration Date
Until further notice
Requirements
  • Proven experience as a senior technical leader across multiple teams/services within a bounded domain
  • Strong polyglot background (e.g., C#/.NET, Java, JavaScript/Node) and ability to choose fit-for-purpose technologies
  • Experience modernising systems: migrating from legacy architectures to cloud-native patterns, reducing technical debt, and decommissioning safely
  • Experience in systems analysis, design and a solid understanding of development, quality assurance and integration methodologies
  • Experience developing integrated solutions within a broad technical and business context of significant impact
  • Experience evaluating third-party services and platforms (security, cost, operations, integration complexity)
  • Experience leading cross‑team architectural change, platform adoption, or measurable improvements to reliability/cost/performance (with before/after metrics)
  • Familiarity with responsible AI usage in engineering workflows (policy/guardrails, data privacy, human‑in‑the‑loop review)
  • Bachelor’s/Master’s degree in Computer Science (or related) or equivalent professional experience
  • Expert software design skills: SOLID, DDD, event-driven architecture patterns, modular design, and maintainable codebases
Job Responsibility
  • Engineering Leadership & Culture: Create an environment where teams can do their best work by removing blockers, improving engineering practices, and contributing to a culture of psychological safety and high standards
  • Mentor and coach engineers across teams—especially senior engineers and emerging tech leads—in architecture, systems thinking, and operational excellence
  • Promote strong technical ownership ("you build it, you run it"), including operational readiness and post-incident learning
  • Support scalable knowledge-sharing mechanisms (e.g., tech talks, playbooks, templates, reference implementations)
  • Participate in hiring loops and help onboard new engineers into domain patterns and practices
  • Provide hands-on contributions where needed (prototypes, reference implementations, complex refactors, high-risk changes)
  • Guide teams in decomposition and sequencing to reduce delivery risk; support estimation/sizing and technical discovery
  • Leads through influence; demonstrates integrity, accountability, and constructive challenge
What we offer
  • Comprehensive Health Insurance: Covers you, your immediate family, and parents
  • Enhanced Health Insurance Options: Competitive rates negotiated by the company
  • Group Life Insurance: Ensuring financial security for your loved ones
  • Group Accident Insurance: Extra protection for accidental death and permanent disablement
  • Flexible Working Arrangement: Achieve a harmonious work-life balance
  • Employee Assistance Program: Access support for personal and work-related challenges
  • Medical Screening: Your well-being is a top priority
  • Modern Family Benefits: Maternity, paternity, and adoption support
  • Long-Service Awards: Recognizing dedication and commitment
  • New Baby Gift: Celebrating the joy of parenthood
  • Fulltime

Senior Software Engineer, Capacity & Efficiency

Join us in building the future of finance. Our mission is to democratize finance...
Location
United States, Bellevue
Salary
196000.00 - 230000.00 USD / Year
Robinhood
Expiration Date
Until further notice
Requirements
  • Experience building and operating production systems in a cloud-native environment, ideally on AWS
  • Strong proficiency with Kubernetes and a practical understanding of resource efficiency and capacity planning
  • Experience working on infrastructure, platform, or data-heavy systems where cost, scale, and reliability matter
  • Ability to reason about cloud cost models, including tradeoffs between performance, reliability, and spend
  • Clear communication skills and comfort working with partner teams to explain findings and technical recommendations
Job Responsibility
  • Build and maintain software systems that detect, track, and attribute AWS cloud costs to the correct teams, services, and workloads
  • Develop tooling to identify cost anomalies, regressions, and over-provisioned resources across Kubernetes and managed services
  • Partner with Data Science to support forecasting models, unit economics, and projections that surface future cost risks
  • Analyze infrastructure usage patterns to identify inefficiencies and implement technical solutions that reduce cloud spend
  • Collaborate with partner teams to land efficiency improvements, validate cost reductions, and track remediation outcomes
What we offer
  • Performance-driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Lifestyle wallet — a highly flexible benefits spending account for wellness, learning, and more
  • Employer-paid life & disability insurance, fertility benefits, and mental health benefits
  • Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
  • Exceptional office experience with catered meals, events, and comfortable workspaces
  • Fulltime

Staff Platform Software Engineer

EarnIn is seeking a Staff Platform Engineer to lead the strategic design, automa...
Location
India, Bengaluru
Salary
Not provided
EarnIn
Expiration Date
Until further notice
Requirements
  • Bachelor’s or Master’s Degree in Computer Science or equivalent industry experience
  • 7+ years of experience in cloud infrastructure, managing large-scale, high-availability, customer-facing distributed systems
  • Proven experience mentoring and guiding senior engineers, driving technical decisions, and leading company-wide cloud initiatives
  • Mastery of public cloud providers, specifically AWS (EKS, DynamoDB, Aurora, Kinesis, etc.)
  • Strong expertise in containerized microservices running on Kubernetes
  • Deep knowledge of automation and configuration management tools (Terraform, Ansible)
  • Expertise in CI/CD pipelines and tools, including Jenkins, GHA, Argo CD, Spinnaker, and FluxCD or similar
  • Experience with advanced observability tools (DataDog, CloudWatch)
  • Track record of leading cost optimization / FinOps initiatives, performance tuning, and operational excellence projects
  • Proven ability to drive cross-functional initiatives with engineering, product, and business teams
Job Responsibility
  • Serve as a key architect and thought leader in the cloud infrastructure domain, guiding the team on best practices
  • Mentor and coach senior engineers across the company in advanced cloud operations practices
  • Provide oversight of hosted Linux and Windows systems, networks, databases, and applications, identifying and solving critical performance, scalability, and stability challenges
  • Design and develop reusable components and operational strategies to enhance the scalability, performance, and monitoring of cloud systems
  • Collaborate with other senior engineers to create technical solutions that address company-wide cloud challenges
  • Lead the establishment and continuous evolution of infrastructure-as-code best practices, driving automation, self-healing, and security standards
  • Drive operational cost savings through service optimizations, autoscaling strategies, and distributed processing architectures
  • Collaborate closely with cross-functional teams, including security, engineering, and business teams, to ensure that operational strategies align with company-wide objectives
  • Provide thought leadership in company-wide initiatives such as observability, automation, and disaster recovery
  • Continuously evaluate existing tools and processes, lead efforts to socialize, present, and implement enhancements for optimal operational efficiency
What we offer
  • healthcare
  • internet/cell phone reimbursement
  • a learning and development stipend
  • opportunities to travel to our Mountain View HQ
  • Fulltime

Cloud Engineering Manager - FinOps

This role combines technical expertise, leadership, and operational excellence t...
Location
India, Bangalore
Salary
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Proven expertise in cloud platforms (e.g., AWS, Azure, Google Cloud) and cloud-native technologies
  • Strong knowledge of FinOps principles and cloud financial management, including cost optimization, forecasting, and governance
  • Experience with application development frameworks (e.g., Node.js, Python, Java) and modern software engineering practices
  • Familiarity with cloud monitoring and cost management tools, such as AWS Cost Explorer, Azure Cost Management, or third-party FinOps platforms (e.g., CloudHealth, Apptio)
  • Proficiency in containerization and orchestration technologies such as Docker and Kubernetes
  • Demonstrated success in leading engineering teams, managing priorities, and delivering complex projects on time and within budget
  • Strong collaboration skills, with the ability to work effectively across engineering, finance, and business teams
  • Exceptional ability to communicate technical concepts to non-technical stakeholders and align engineering efforts with business goals
  • Bachelor’s or master’s degree in computer science, engineering, information systems, or related field
  • Typically, 7-10 years’ experience, including 0-2 years of people management experience
Job Responsibility
  • Lead and inspire a team of cloud engineers focused on FinOps application development, fostering a culture of innovation, collaboration, and continuous improvement
  • Drive the design, development, and implementation of cloud engineering applications that enable visibility, optimization, and governance of cloud costs and usage
  • Architect scalable, secure, and resilient solutions that align with FinOps principles (e.g., cost optimization, forecasting, usage analytics)
  • Collaborate with product managers and business stakeholders to define requirements, prioritize features, and deliver value-driven solutions
  • Ensure seamless integration of FinOps applications with existing HPE cloud platform tools and systems
  • Lead efforts to optimize cloud infrastructure costs and usage patterns across HPE's cloud platforms, leveraging advanced analytics and automation
  • Establish and enforce engineering best practices, including CI/CD pipelines, DevSecOps principles, and automated testing frameworks
  • Monitor and improve application performance, reliability, and scalability through proactive measures and robust incident management
  • Collaborate with finance teams to ensure compliance with cloud spending policies and reporting requirements
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Staff Engineer – Full Stack Applications FinOps

GEICO is seeking an experienced Engineer with a passion for building high-perfor...
Location
United States, Chevy Chase, MD; Palo Alto, CA; Dallas, TX; Seattle, WA
Salary
110000.00 - 230000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Extensive experience in leading and building full-stack web applications, with a strong focus on front-end technologies (React, TypeScript, Bootstrap) and Django-based backends
  • Proven expertise in designing and developing micro-services using Golang, Java, Python, Django, gRPC with protocol buffers, Kafka, and Apache Spark, with a deep understanding of both API and event-driven architectures
  • Strong background in leading UI development efforts, particularly with JavaScript-based frameworks, ensuring a seamless user experience
  • Experience leading web application development using micro-frontend architecture with client-side composition methods
  • Experience leading the integration of micro-frontend applications with a large single page application
  • Experience building architecture, design patterns, reliability, security and scaling of new and existing web applications
  • Expertise leading and contributing to event-driven microservices using Kafka and Apache Spark
  • Expertise in data model design on relational databases like PostgreSQL and NoSQL databases like Cassandra and MongoDB
  • Understanding of existing monitoring concepts and tooling
  • Understanding of DevOps Concepts and Cloud Architecture
Job Responsibility
  • Provide technical and thought leadership across multiple layers of the stack, focusing on full-stack web application development and ensuring the integration of UI, micro-services, and backend systems
  • Work closely with product leaders, other engineers and partner teams to understand product requirements, build a technical backlog, and develop solutions that align with product vision
  • Lead the development of UI using React, TypeScript, and Bootstrap on a Django framework while also contributing to the architecture and development of microservices using Golang, Python, Django, and Kafka
  • Design and implement loosely coupled, scalable micro-services
  • Own and drive one to two service areas, being accountable for their successful delivery, from requirement analysis, design through to production, and ensuring they meet performance, scalability and reliability standards
  • Act as a role model and mentor to senior and junior engineers, guiding them in understanding the architecture, design and implementation of systems
  • Maintain excellent communication with partner teams and leads, articulating technical implementations for various stakeholders, ensuring alignment across teams
  • Proactively explore unknown product requirements and design solutions that meet evolving needs, contributing to the continuous improvement of our platform
  • Leverage your experience in deploying web applications in Kubernetes (k8s) environments, ensuring reliable interaction with backend services and seamless integration with cloud and on-premises systems
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Senior Manager IT Storage Engineering

Lead the strategy, architecture, and delivery of enterprise storage and data pro...
Location
United States, San Jose
Salary
180400.00 - 270600.00 USD / Year
AMD
Expiration Date
Until further notice
Requirements
  • Leadership and people management skills
  • Ability to mentor and grow high-performing teams
  • Communication skills, ability to translate complex technical concepts into business outcomes
  • Collaborative mindset with strong stakeholder management across Engineering, IT, Security, and Compliance
  • Strategic thinking with a balance of innovation and pragmatic execution
  • Problem-solving orientation with a focus on continuous improvement and operational excellence
  • Customer-focused approach with an emphasis on user experience and service reliability
  • Experience in enterprise storage and data protection (file, block, object)
  • Engineering team management experience
  • Proven expertise in high-performance storage solutions for EDA, HPC, AI/ML workloads
Job Responsibility
  • Own and evolve the storage platform roadmap, balancing EDA performance requirements with enterprise resilience, compliance, and cost goals
  • Define and govern reference architectures across file, block, object storage, and data protection (on-prem and cloud)
  • Lead storage design reviews for EDA workflows, including metadata-intensive and high IOPS/low latency workloads (build/test/simulation/regression flows)
  • Establish standard patterns for tiering, archiving, retention, immutability (WORM), and disaster recovery (RPO/RTO, replication, failover)
  • Design solutions supporting AI workloads (including >5TB/s training throughput)
  • Ensure storage segmentation and isolation aligned with performance and security requirements
  • Align all architectures with security, encryption, RBAC, audit logging, and data governance standards
  • Own end-to-end storage and backup service delivery (availability, performance, capacity, recoverability, user experience)
  • Lead major incident management, root cause analysis, and corrective/preventive actions
  • Define and track SLAs/SLOs (latency, throughput, backup success, restore times, replication health)
What we offer
  • Benefits offered are described: AMD benefits at a glance
  • Fulltime

Distinguished Engineer

As a Distinguished Engineer at Capital One, you will be a part of a community of...
Location
United States, Richmond, Virginia; McLean, Virginia
Salary
269100.00 - 307200.00 USD / Year
Capital One
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree
  • At least 7 years of experience in Software, Site Reliability (SRE), and Solution Architecture
  • At least 7 years of experience in Enterprise Architecture, highly available system design, and design patterns
  • At least 5 years of experience in Cloud computing (AWS, Microsoft Azure, Google Cloud)
  • At least 5 years of experience in Web technologies (JavaScript, TypeScript, and SPA frameworks)
  • At least 1 year of experience in leading and applying Generative AI/headless AI to enhance automation, code quality or operational workflow
Job Responsibility
  • Define and implement the multi-year reliability roadmap and target architectural state for the entire product portfolio
  • Invent and deliver novel engineering solutions to reduce organizational toil, maximizing engineering velocity across all services
  • Serve as the final escalation point for system crises, codifying best practices to build an organizational culture of outage prevention
  • Govern secure IaC/Platform strategy, driving enterprise-wide adoption and standardization across multi-cloud environments
  • Establish organization-wide reliability standards and Error Budget governance to align business objectives with engineering risk
  • Lead the growth and technical development of SRE and platform engineering talent across multiple teams and disciplines
  • Establish company-wide FinOps governance and capacity management to ensure sustainable, cost-efficient infrastructure growth
What we offer
  • Performance-based incentive compensation, which may include cash bonus(es) and/or long-term incentives (LTI)
  • A comprehensive, competitive, and inclusive set of health, financial, and other benefits
  • Fulltime