Lead Software Engineer

Senior Software Engineer and Principal Software Engineer

We are building a planet-scale multi-modal database and infrastructure for execu...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java
OR equivalent experience
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java
OR equivalent experience
Experience in shipping products and scalable, reliable services
Currently programming/coding in your current or most recent role
Hands-on experience with asynchronous programming and concurrency (threads, tasks, futures, async/await)
Experience with Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and/or Google Kubernetes Engine (GKE)
Experience in building database engines, query engines, indexing solutions (columnar, full-text, vector), at scale
Experience programming CUDA and AI systems at scale

Job Responsibility

Independently execute in the face of ambiguity
Leads identification of dependencies and the development of design documents for a product, application, service, or platform
Writes efficient systems code and debugs distributed systems
Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions

Fulltime

Senior Software Engineer - SRE

Hybrid: This role is categorized as hybrid and is expected to report to Austin ...

Location

United States, Austin; Warren

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science or a related field, or equivalent work experience
7-10 years of software experience with strong proficiency in PostgreSQL and at least one other database technology (Oracle, SQL Server)
Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems
Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures
Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures
Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources
Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures
Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders
Commitment to collaborative problem-solving and shared ownership of services
Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems

Job Responsibility

Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention
Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents
Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution
Work alongside developers to ensure the quality, scalability, and reliability of our database services
Practice shared ownership of services in production, fostering a "You build it, you run it" culture
Maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to manage reliability expectations effectively
Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence
Champion a culture of continuous improvement
Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining service reliability

Fulltime

Location

India, Hyderabad

Salary:

Not provided

Wells Fargo

Expiration Date

Until further notice

Requirements

5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Experience in Software Engineering, SRE, DevOps, or Platform Engineering
Strong proficiency in Python for automation and tooling
Hands‑on experience with Grafana, Prometheus, and Splunk in production environments
Solid understanding of SLIs, SLOs, dashboards, alerting, and observability best practices
Experience applying AI/ML concepts to monitoring, alerting, or operational analytics
Strong knowledge of Linux, networking, and distributed systems
Experience with Cloud platforms and Kubernetes/OpenShift
Proven experience leading incidents, RCAs, and reliability initiatives
Experience building custom Prometheus exporters or advanced Grafana dashboards

Job Responsibility

Lead complex technology initiatives including those that are companywide with broad impact
Act as a key participant in developing standards and companywide best practices for engineering complex and large scale technology solutions for technology engineering disciplines
Design, code, test, debug, and document for projects and programs
Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
Make decisions in developing standard and companywide best practices for engineering and technology solutions requiring understanding of industry best practices and new technologies, influencing and leading technology team to meet deliverables and drive new initiatives
Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
Lead projects, teams, or serve as a peer mentor
Own and improve availability, performance, scalability, and resilience of production systems
Define, monitor, and manage SLIs/SLOs and error budgets to guide reliability investments
Lead capacity planning, performance testing, failover readiness, and disaster‑recovery design

Fulltime

Software Engineer SRE

As a Site Reliability Engineer at OnePay, you will play a critical role in ensur...

Location

United States

Salary:

140000.00 - 180000.00 USD / Year

OnePay

Expiration Date

Until further notice

Requirements

5+ years of experience as a Software Engineer with a focus on building and running reliable, large-scale, distributed systems in production
5+ years of operational experience in observability tooling and libraries (metrics, logging, tracing) with experience using Datadog or similar tools (Prometheus, Grafana)
Proficiency in at least one programming language (Python, Go, Java, or Node.js preferred) for automation and tooling
Proficiency in incident management, going on-call, and writing post-mortem reports
Excellent collaboration skills with the ability to influence and educate product engineering teams on reliability and observability best practices
Hands-on experience with cloud platforms (AWS preferred), container orchestration (Kubernetes), and IaC tools (Terraform, Pulumi)
Drive and proactivity – everyone here is a builder and executor

Job Responsibility

Design, build, and maintain scalable infrastructure and tooling that improves reliability, performance, and availability across OnePay’s platform
Contribute to the evolution of our observability stack, platform libraries, cloud architecture, and CI/CD pipelines
Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers
Partner closely with product and platform engineering teams to embed reliability best practices in design, development, and deployment processes
Lead root cause analysis and postmortems, driving long-term improvements in resiliency and fault tolerance

What we offer

Competitive base salary, stock options, and health benefits from Day 1
401(k) plan with company match
Remote-friendly (US), flexible time off (FTO), and opportunities for growth
A high-growth, mission-driven, inclusive culture where your work has real impact

Fulltime

Intermediate Software Engineer SRE – AI

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more

Fulltime

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...

Location

Mexico, Miguel Hidalgo

Salary:

Not provided

Pepsico

Expiration Date

Until further notice

Requirements

8+ years of work experience, progressing into an SRE role
3-5 years of experience in continuously improving and transforming IT operations ways of working
Bachelor’s degree in Computer Science, Information Technology or a related field
Proven experience as an SRE in designing event diagnostics, performance measures, and alerting solutions to meet SLAs/SLOs/SLIs
Highly quantitative, with great judgment; able to connect the dots across ecosystems and work efficiently across teams
Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
Hands-on experience with Python, SQL/NoSQL databases (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and other SRE Ops toolsets
A firm understanding of cloud architecture for distributed environments
Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
Back-end technologies: server-side languages and frameworks (Java, Spring Boot, and related technologies) used to build server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, or Couchbase

Job Responsibility

Drive new shift-left activities that apply Site Reliability Engineering (SRE) and quality assurance principles within the application design and project roadmap to enable resilient outcomes
Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration that connects all components of the ecosystem, diagnoses anomalies before users notice them, and remediates through automation
Ensure ecosystem availability and performance in production environments, proactively preventing P1s, P2s, and potential P3s
Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
Accountable for ensuring that non-functional requirements (NFRs), including SLA/SLO/SLI targets and error budgets, are embedded early in the product’s offerings as part of the engineering solution
Lead the team in diagnosing anomalies before any user is affected and in driving the necessary remediations across the teams responsible for end-to-end availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
Work closely with customer-facing support teams to empower them with SRE insights and tooling
Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
Continuously optimize the L2/support operations work via SRE workflow automation

What we offer

Opportunities to learn and develop every day through a wide range of programs
Internal digital platforms that promote self-learning
Development programs according to Leadership skills
Specialized training according to the role
Learning experiences with internal and external providers
Recognition programs for seniority, behavior, leadership, moments of life, among others
Financial wellness programs that will help you reach your goals in all stages of life
A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Staff Software Development Engineer-Automation Engineer

We’re building a world of health around every individual — shaping a more connec...

Location

United States

Salary:

106605.00 USD / Year

CVS Health

Expiration Date

June 29, 2026

Requirements

Extensive experience in software development and production support for enterprise systems
Strong expertise in automation/RPA platforms, scripting, and debugging complex workflows
Proven ability to lead incident response and root cause analysis in high-availability environments
Deep understanding of SDLC, CI/CD, release management, and production readiness standards
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

Job Responsibility

Serve as the technical owner for production support of automation and RPA solutions across critical business processes
Lead incident triage, root cause analysis, and permanent remediation for high-severity automation failures
Establish and enforce runbooks, support models, escalation paths, and on-call readiness for automation platforms
Proactively identify systemic issues and implement stability, resiliency, and performance improvements
Provide hands-on technical leadership for automation design, debugging, and optimization in production environments
Review automation code and configurations to ensure adherence to standards, security, and reliability best practices
Partner with development teams to ensure production readiness of new automations before release
Guide architectural decisions that reduce operational complexity and technical debt
Design and maintain monitoring, alerting, and health dashboards for automation platforms
Drive adoption of AIOps, SRE, and automation-first support practices where applicable

What we offer

Medical, dental, and vision coverage
Paid time off
Retirement savings options
Wellness programs

Fulltime

Principal Software Engineer

The Principal Software Engineer is the senior-most hands-on technical leader for...

Location

India, Chennai

Salary:

Not provided

RX Global

Expiration Date

Until further notice

Requirements

Proven experience as a senior technical leader across multiple teams/services within a bounded domain
Strong polyglot background (e.g., C#/.NET, Java, JavaScript/Node) and ability to choose fit-for-purpose technologies
Experience modernising systems: migrating from legacy architectures to cloud-native patterns, reducing technical debt, and decommissioning safely
Experience in systems analysis, design and a solid understanding of development, quality assurance and integration methodologies
Experience developing integrated solutions within a broad technical and business context of significant impact
Experience evaluating third-party services and platforms (security, cost, operations, integration complexity)
Experience leading cross‑team architectural change, platform adoption, or measurable improvements to reliability/cost/performance (with before/after metrics)
Familiarity with responsible AI usage in engineering workflows (policy/guardrails, data privacy, human‑in‑the‑loop review)
Bachelor’s/Master’s degree in Computer Science (or related) or equivalent professional experience
Expert software design skills: SOLID, DDD, event-driven architecture patterns, modular design, and maintainable codebases

Job Responsibility

Engineering Leadership & Culture: Create an environment where teams can do their best work by removing blockers, improving engineering practices, and contributing to a culture of psychological safety and high standards
Mentor and coach engineers across teams—especially senior engineers and emerging tech leads—in architecture, systems thinking, and operational excellence
Promote strong technical ownership ("you build it, you run it"), including operational readiness and post-incident learning
Support scalable knowledge-sharing mechanisms (e.g., tech talks, playbooks, templates, reference implementations)
Participate in hiring loops and help onboard new engineers into domain patterns and practices
Provide hands-on contributions where needed (prototypes, reference implementations, complex refactors, high-risk changes)
Guide teams in decomposition and sequencing to reduce delivery risk; support estimation/sizing and technical discovery
Leads through influence; demonstrates integrity, accountability, and constructive challenge

What we offer

Comprehensive Health Insurance: Covers you, your immediate family, and parents
Enhanced Health Insurance Options: Competitive rates negotiated by the company
Group Life Insurance: Ensuring financial security for your loved ones
Group Accident Insurance: Extra protection for accidental death and permanent disablement
Flexible Working Arrangement: Achieve a harmonious work-life balance
Employee Assistance Program: Access support for personal and work-related challenges
Medical Screening: Your well-being is a top priority
Modern Family Benefits: Maternity, paternity, and adoption support
Long-Service Awards: Recognizing dedication and commitment
New Baby Gift: Celebrating the joy of parenthood

Fulltime
