Lead Site Reliability Engineer

Wells Fargo

Location:
United States, Charlotte

Category:
IT - Software Development

Contract Type:
Not provided

Salary:
Not provided

Job Description:

The Site Reliability Engineering team is fundamental to ensuring our platform delivers consistent, reliable service to our client base. This role works at the intersection of software engineering and operations, applying engineering principles to infrastructure challenges. This individual will design and implement scalable systems, create observability solutions that offer actionable insights, and develop automation to improve our platform's reliability.

Job Responsibilities:

  • Work alongside developers and business stakeholders to automate acceptance criteria
  • Maintain high reliability and availability for software applications
  • Automate mundane tasks to reduce human error
  • Define SLIs (service level indicators) and SLOs (service level objectives) in collaboration with product owners
  • Lead incident response efforts and post-mortem analysis to prevent future occurrences
  • Write incident root cause analyses that identify the underlying cause of each issue and the steps to prevent recurrence
  • Document procedures, best practices and troubleshooting FAQs
  • Debug the system and fix production-related issues
  • Escalate and follow up on permanent fixes for development-related issues
  • Handle complex operational tasks and recommend process and technology changes
  • Provide global support, including troubleshooting production-related issues and performing checkouts

Requirements:

  • 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of Site Reliability Engineering experience or related experience
  • Strong understanding of REST APIs
  • Strong working understanding of troubleshooting tools such as Splunk, AppDynamics, and Elastic APM
  • Strong experience with API management tools such as Apigee
  • Working knowledge of databases such as MongoDB and Oracle
  • Strong foundation in reliability engineering principles and distributed systems behavior
  • Experience defining and implementing SLOs/SLIs and using them to drive system improvements
  • Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
  • Understanding of modern observability practices and experience implementing and maintaining monitoring solutions such as Prometheus/Grafana, Splunk, New Relic, CloudWatch, and ELK in the cloud
  • Strong incident response skills with experience leading incident retrospectives and driving improvements
  • Excellent problem-solving abilities and experience debugging distributed systems
  • Track record of successfully automating operations and reducing toil
  • Strong communication skills with ability to explain complex technical concepts to diverse audiences
  • Ability to work both independently and collaboratively in an energetic and diverse team environment

Additional Information:

Job Posted:
April 25, 2025

Expiration:
May 01, 2025

Employment Type:
Full-time
Work Type:
On-site work