Lead Software Engineer - SRE

Location: United States, Charlotte · Employment contract · Salary: 119000.00 - 187000.00 USD / Year · Job Posted April 24, 2026

Job Description

Wells Fargo is seeking a Lead Site Reliability Engineer (SRE) to join the WIMT Platform team. This role is responsible for driving the stability, resiliency, performance, and security of mission‑critical platforms that support Wells Fargo Advisors, First Clearing firms, and FINET practices. As a Lead SRE, you will provide hands‑on technical leadership across incident management, automation, observability, and reliability engineering, with a strong focus on proactive risk mitigation and continuous improvement. You will help define and enforce reliability standards while partnering closely with Application Development, Product, Business, and Enterprise teams to ensure operational excellence throughout the full service lifecycle. This role is ideal for a highly motivated engineer with deep experience operating large‑scale production systems who takes ownership, values accountability, and is passionate about building resilient, enterprise‑grade platforms. Learn more about career areas and business divisions at https://www.wellsfargojobs.com.

Job Responsibility

  • Design and implement scalability, reliability, and observability strategies for cloud and on-premise environments
  • Define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and Error Budgets to improve system reliability
  • Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
  • Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
  • Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
  • Drive adoption of non-functional requirements (NFRs), best practices, quality, and compliance across observability and performance engineering
  • Ensure high availability and performance of production systems through proactive monitoring and incident response
  • Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
  • Lead projects, teams, or serve as a peer mentor

Requirements

  • 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience leading observability and monitoring tooling (Splunk, AppDynamics, Splunk Observability, Grafana, OpenTelemetry)
  • 5+ years in infrastructure (Windows and Linux) support
  • 5+ years proven success in toil reduction initiatives
  • 5+ years in cloud application management, especially OpenShift Container Platform

Nice to have

  • 5+ years of experience in SRE, public and private cloud technologies, Java performance tuning, and capacity optimization for mission-critical applications
  • Working knowledge of multiple programming languages (e.g., Java, JavaScript, Ruby, Python, JSON, Angular, NodeJS)
  • Hands-on experience with cloud and platform technologies such as AWS, PCF, PKS, Kubernetes, OpenShift, Linux, Azure, Windows, and VMware
  • Strong verbal, written, and interpersonal communication skills for effective collaboration across teams
  • Ability to engage with and influence stakeholders at various organizational levels
  • Expert experience with monitoring tools (Prometheus, Grafana, AppDynamics, Glassbox, Splunk)
  • Advanced experience in one or more scripting languages (Python, shell scripting, etc.)
  • Strong knowledge of Kubernetes and OCP, and strong troubleshooting skills
  • Strong grasp of Java performance concepts (heap, GC) and critical monitoring metrics for Java apps
  • Ability to identify manual tasks in processes and automate them to reduce toil

What we offer

  • Health benefits
  • 401(k) Plan
  • Paid time off
  • Disability benefits
  • Life insurance, critical illness insurance, and accident insurance
  • Parental leave
  • Critical caregiving leave
  • Discounts and savings
  • Commuter benefits
  • Tuition reimbursement
  • Scholarships for dependent children
  • Adoption reimbursement

Similar Jobs for Lead Software Engineer - SRE

8 matching positions

Senior Software Engineer and Principal Software Engineer

We are building a planet-scale multi-modal database and infrastructure for execu...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java
  • OR Equivalent experience
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java
  • OR equivalent experience
  • Experience in shipping products and scalable, reliable services
  • Currently programming/coding in your current or most recent role
  • Hands on experience with asynchronous programming and concurrency (threads, tasks, futures, async/await)
  • Experience with Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and/or Google Kubernetes Engine (GKE)
  • Experience in building database engines, query engines, indexing solutions (columnar, full-text, vector), at scale
  • Experience with programming CUDA, AI systems at scale
Job Responsibility
  • Independently execute in the face of ambiguity
  • Leads identification of dependencies and the development of design documents for a product, application, service, or platform
  • Writes efficient systems code and debugs distributed systems
  • Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions
Employment type: Full-time

Senior Software Engineer - SRE

Hybrid: This role is categorized as hybrid and is expected to report to Austin ...
Location: United States, Austin; Warren
Salary: Not provided
Company: General Motors (gm.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in computer science or a related field, or equivalent work experience
  • 7-10 years of software experience with strong proficiency in PostgreSQL and at least one other database technology (Oracle, SQL Server)
  • Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems
  • Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures
  • Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures
  • Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources
  • Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures
  • Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders
  • Commitment to collaborative problem-solving and shared ownership of services
  • Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems
Job Responsibility
  • Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention
  • Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents
  • Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution
  • Work alongside developers to ensure the quality, scalability, and reliability of our database services
  • Practice shared ownership of services in production, fostering a "You build it, you run it" culture
  • Manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to manage reliability expectations effectively
  • Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence
  • Champion a culture of continuous improvement
  • Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining service reliability
Employment type: Full-time

Lead Software Engineer

Location: India, Hyderabad
Salary: Not provided
Company: Wells Fargo (https://www.wellsfargo.com/)
Expiration Date: Until further notice
Requirements
  • 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Experience in Software Engineering, SRE, DevOps, or Platform Engineering
  • Strong proficiency in Python for automation and tooling
  • Hands‑on experience with Grafana, Prometheus, and Splunk in production environments
  • Solid understanding of SLIs, SLOs, dashboards, alerting, and observability best practices
  • Experience applying AI/ML concepts to monitoring, alerting, or operational analytics
  • Strong knowledge of Linux, networking, and distributed systems
  • Experience with Cloud platforms and Kubernetes/OpenShift
  • Proven experience leading incidents, RCAs, and reliability initiatives
  • Experience building custom Prometheus exporters or advanced Grafana dashboards
Job Responsibility
  • Lead complex technology initiatives including those that are companywide with broad impact
  • Act as a key participant in developing standards and companywide best practices for engineering complex and large scale technology solutions for technology engineering disciplines
  • Design, code, test, debug, and document for projects and programs
  • Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
  • Make decisions in developing standard and companywide best practices for engineering and technology solutions requiring understanding of industry best practices and new technologies, influencing and leading technology team to meet deliverables and drive new initiatives
  • Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
  • Lead projects, teams, or serve as a peer mentor
  • Own and improve availability, performance, scalability, and resilience of production systems
  • Define, monitor, and manage SLIs/SLOs and error budgets to guide reliability investments
  • Lead capacity planning, performance testing, failover readiness, and disaster‑recovery design
Employment type: Full-time

Software Engineer SRE

As a Site Reliability Engineer at OnePay, you will play a critical role in ensur...
Location: United States
Salary: 140000.00 - 180000.00 USD / Year
Company: OnePay (onepay.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as a Software Engineer with a focus on building and running reliable, large-scale, distributed systems in production
  • 5+ years of operational experience in observability tooling and libraries (metrics, logging, tracing) with experience using Datadog or similar tools (Prometheus, Grafana)
  • Proficiency in at least one programming language (Python, Go, Java, or Node.js preferred) for automation and tooling
  • Proficiency in incident management, going on-call, and writing post-mortem reports
  • Excellent collaboration skills with the ability to influence and educate product engineering teams on reliability and observability best practices
  • Hands-on experience with cloud platforms (AWS preferred), container orchestration (Kubernetes), and IAC tools (Terraform, Pulumi)
  • Drive and proactivity – everyone here is a builder and executor
Job Responsibility
  • Design, build, and maintain scalable infrastructure and tooling that improves reliability, performance, and availability across OnePay’s platform
  • Contribute to the evolution of our observability stack, platform libraries, cloud architecture, and CI/CD pipelines
  • Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers
  • Partner closely with product and platform engineering teams to embed reliability best practices in design, development, and deployment processes
  • Lead root cause analysis and postmortems, driving long-term improvements in resiliency and fault tolerance
What we offer
  • Competitive base salary, stock options, and health benefits from Day 1
  • 401(k) plan with company match
  • Remote-friendly (US), flexible time off (FTO), and opportunities for growth
  • A high-growth, mission-driven, inclusive culture where your work has real impact
Employment type: Full-time

Intermediate Software Engineer SRE – AI

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare (pointclickcare.com)
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more
Employment type: Full-time

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...
Location: Mexico, Miguel Hidalgo
Salary: Not provided
Company: PepsiCo (pepsico.com)
Expiration Date: Until further notice
Requirements
  • 8+ years of work experience, evolving into an SRE engineer role
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • Highly quantitative, with great judgment, able to connect dots across ecosystems and work efficiently and cross-functionally across teams
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)
Job Responsibility
  • Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap, enabling resilient outcomes
  • Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration that connects all components of the ecosystem, diagnosing anomalies before users are affected and remediating through automation
  • Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2, and potential P3 incidents
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing events, logging, monitoring, and observability standards across applications
  • Accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
  • Lead the team in diagnosing anomalies before they reach users and in driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborates with Engineering & support teams, including participation in escalations, and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical "understanding of interactions" of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
What we offer
  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Staff Software Development Engineer-Automation Engineer

We’re building a world of health around every individual — shaping a more connec...
Location: United States
Salary: 106605.00 USD / Year
Company: CVS Health (https://www.cvshealth.com/)
Expiration Date: June 29, 2026
Requirements
  • Extensive experience in software development and production support for enterprise systems
  • Strong expertise in automation/RPA platforms, scripting, and debugging complex workflows
  • Proven ability to lead incident response and root cause analysis in high-availability environments
  • Deep understanding of SDLC, CI/CD, release management, and production readiness standards
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Job Responsibility
  • Serve as the technical owner for production support of automation and RPA solutions across critical business processes
  • Lead incident triage, root cause analysis, and permanent remediation for high-severity automation failures
  • Establish and enforce runbooks, support models, escalation paths, and on-call readiness for automation platforms
  • Proactively identify systemic issues and implement stability, resiliency, and performance improvements
  • Provide hands-on technical leadership for automation design, debugging, and optimization in production environments
  • Review automation code and configurations to ensure adherence to standards, security, and reliability best practices
  • Partner with development teams to ensure production readiness of new automations before release
  • Guide architectural decisions that reduce operational complexity and technical debt
  • Design and maintain monitoring, alerting, and health dashboards for automation platforms
  • Drive adoption of AIOps, SRE, and automation-first support practices where applicable
What we offer
What we offer
  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
Employment type: Full-time

Principal Software Engineer

The Principal Software Engineer is the senior-most hands-on technical leader for...
Location: India, Chennai
Salary: Not provided
Company: RX Global (rxglobal.com)
Expiration Date: Until further notice
Requirements
  • Proven experience as a senior technical leader across multiple teams/services within a bounded domain
  • Strong polyglot background (e.g., C#/.NET, Java, JavaScript/Node) and ability to choose fit-for-purpose technologies
  • Experience modernising systems: migrating from legacy architectures to cloud-native patterns, reducing technical debt, and decommissioning safely
  • Experience in systems analysis, design and a solid understanding of development, quality assurance and integration methodologies
  • Experience developing integrated solutions within a broad technical and business context of significant impact
  • Experience evaluating third-party services and platforms (security, cost, operations, integration complexity)
  • Experience leading cross‑team architectural change, platform adoption, or measurable improvements to reliability/cost/performance (with before/after metrics)
  • Familiarity with responsible AI usage in engineering workflows (policy/guardrails, data privacy, human‑in‑the‑loop review)
  • Bachelor’s/Master’s degree in Computer Science (or related) or equivalent professional experience
  • Expert software design skills: SOLID, DDD, event-driven architecture patterns, modular design, and maintainable codebases
Job Responsibility
  • Engineering Leadership & Culture: Create an environment where teams can do their best work by removing blockers, improving engineering practices, and contributing to a culture of psychological safety and high standards
  • Mentor and coach engineers across teams—especially senior engineers and emerging tech leads—in architecture, systems thinking, and operational excellence
  • Promote strong technical ownership ("you build it, you run it"), including operational readiness and post-incident learning
  • Support scalable knowledge-sharing mechanisms (e.g., tech talks, playbooks, templates, reference implementations)
  • Participate in hiring loops and help onboard new engineers into domain patterns and practices
  • Provide hands-on contributions where needed (prototypes, reference implementations, complex refactors, high-risk changes)
  • Guide teams in decomposition and sequencing to reduce delivery risk; support estimation/sizing and technical discovery
  • Lead through influence; demonstrate integrity, accountability, and constructive challenge
What we offer
What we offer
  • Comprehensive Health Insurance: Covers you, your immediate family, and parents
  • Enhanced Health Insurance Options: Competitive rates negotiated by the company
  • Group Life Insurance: Ensuring financial security for your loved ones
  • Group Accident Insurance: Extra protection for accidental death and permanent disablement
  • Flexible Working Arrangement: Achieve a harmonious work-life balance
  • Employee Assistance Program: Access support for personal and work-related challenges
  • Medical Screening: Your well-being is a top priority
  • Modern Family Benefits: Maternity, paternity, and adoption support
  • Long-Service Awards: Recognizing dedication and commitment
  • New Baby Gift: Celebrating the joy of parenthood
Employment type: Full-time