Wells Fargo is seeking a Senior Systems Operations Engineer.
Job Responsibilities:
Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
Contribute to increasing system efficiencies and reducing human intervention time on related tasks
Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
Work with vendors and other technical personnel for problem resolution
Lead the team to meet technical deliverables while leveraging a solid understanding of technical process controls or standards
Collaborate with vendors and other technical personnel to resolve technical issues and achieve highest levels of systems and infrastructure availability
Participate in development of Generative AI Platform Capabilities
Responsible for AI model delivery to on-prem infrastructure and cloud platforms (GCP, Azure ML)
Participate in day-to-day scrum calls for platform capability build
Research industry best practices, evaluate new technologies, develop standards and engineering best practices and recommend innovative solutions that support automation and improve platform resiliency and fault tolerance of critical applications
Execute on roadmaps that align with technology and business strategy.
Perform hardware and capacity planning, analysis and forecasts for your portfolio of applications with focus on highest availability, scalability, performance, and timely delivery
Act as an expert resource for other technical teams within DTI
Deliver day-to-day Application/Platform support services for Digital, AI/ML Platforms
Responsible for support functions and driving the execution of multiple Application/Platform support services including incident triage, root cause analysis, change evaluation-execution-validation, deployment management, and risk & vulnerability management.
Provide on-call production support for mission-critical applications and resolve issues within RTO.
Ensure effective production systems monitoring, alarming and notification response/maintenance.
Leverage diagnostic tools to maintain, troubleshoot and restore service or data to systems
Structure operational data and develop creative data visualization solutions (build dashboards)
Automate Production support routines leveraging AI
Maintain and update support documentation (e.g., game plans, runbooks, procedures, and processes).
Communicate, coordinate, and collaborate with multiple support teams and stakeholders.
Requirements:
4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education.
Experience as Site Reliability Engineer
Knowledge of/experience in developing automated solutions using Python
Hands-on knowledge of LLMs, including leveraging and supporting LLM-based solutions
Knowledge/experience of Puppet/Ansible.
Big data experience needed (BigQuery, Hadoop)
Linux O/S capabilities
Experience in the AI/ML area (MLOps)
PySpark experience
Experience with Tableau/ MicroStrategy or similar BI tools
Strong experience with monitoring systems such as Splunk and AppDynamics.
Working knowledge of AutoML technologies such as H2O Driverless AI, DataRobot, and Vertex AI, as well as Elastic and vector databases
Good understanding of and hands-on experience with GCP
Excellent verbal, written, and interpersonal communication skills. Ability to articulate technical solutions to both technical and business audiences
Recent and demonstrated ability to influence management on technical or business solutions
Working knowledge of designing and building grid computing with CPU and GPU resources supporting AI/ML and NLP
Working knowledge of high-performance storage technologies along with Object Storage
Knowledge and understanding of network infrastructure to support high throughput and low latency grid computing.
Willing to work in shifts
Experience with LLMs and Generative AI (dev/ops).
Experience with Elasticsearch and vector databases would be an added benefit.
Experience with data processing technology (Ab Initio, Informatica, IBM DataStage)
Experience with large data technology (Hadoop, Teradata, Elasticsearch, etc.)
Understanding of Agile practices and ability to work with Agile teams to define and track user stories
Experience with implementing complex F5 or other Load Balancer Technologies
Working knowledge of building high-resiliency grid/cloud computing infrastructure supporting AI/ML and NLP workloads
Knowledge and understanding of cloud computing, PaaS design principles, microservices, and containers
Working knowledge/experience with Azure and/or GCP
Working knowledge of/experience with on-premises and public cloud technologies, such as Cloud Foundry, Kubernetes, Docker
Experience in facilitating analysis of current systems and problem identification and resolution
Ability to facilitate technically complex discussions and working sessions in person or via teleconference