Software Engineer SRE Job at OnePay

Senior Software Engineer and Principal Software Engineer

We are building a planet-scale multi-modal database and infrastructure for execu...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java
OR equivalent experience
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java
OR equivalent experience
Experience in shipping products and scalable, reliable services
Currently programming/coding in your current or most recent role
Hands-on experience with asynchronous programming and concurrency (threads, tasks, futures, async/await)
Experience with Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and/or Google Kubernetes Engine (GKE)
Experience in building database engines, query engines, indexing solutions (columnar, full-text, vector), at scale
Experience programming CUDA and AI systems at scale

Job Responsibility

Independently execute in the face of ambiguity
Leads identification of dependencies and the development of design documents for a product, application, service, or platform
Writes efficient systems code and debugs distributed systems
Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions

Fulltime

Senior Software Engineer - SRE

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform ...

Location

India, Bengaluru

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

Preferably 8+ years of experience in DevOps/SRE roles, with demonstrated expertise in implementing SRE principles, SLO/SLI frameworks, and error budget policies in production environments
Deep experience with observability and monitoring platforms such as Prometheus, Grafana, Datadog, New Relic, or equivalent, including experience building custom dashboards, alerts, and SLO-based monitoring
Strong background in incident management, including experience as an Incident Commander, conducting blameless postmortems, and implementing systematic reliability improvements based on incident learnings
Strong understanding of distributed systems and reliability engineering, including failure modes, fault tolerance patterns, circuit breakers, bulkheads, rate limiting, and graceful degradation strategies
Experience with a number of the following: Kubernetes, Docker, Service Mesh such as Istio, Envoy, Linkerd, Solo & ECS
Experience in cloud-focused software development, preferably in Go, Python, or other object-oriented programming languages
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation
Experience with CI/CD automation, including GitLab pipelines and other related tools
Strong hands-on experience with cloud platforms such as AWS, GCP or Azure
Proven track record of implementing scalable, high-performance infrastructure solutions in fast-paced, dynamic environments

Job Responsibility

Design & Infrastructure
Contribute to postmortem culture by facilitating comprehensive, blameless post-incident reviews that identify root causes, contributing factors, and actionable remediation items. Track incident trends to identify systemic issues and prioritize reliability improvements
Implement chaos engineering practices to proactively identify failure modes, validate system resilience, and build confidence in recovery procedures. Conduct game days and disaster recovery exercises
SRE Process & Principles Implementation
Deploy and evolve SRE practices across the organization by establishing core SRE principles, frameworks, and methodologies. Define and implement service reliability practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, to balance innovation velocity with system reliability
Manage Error Budgets as a mechanism for making data-driven decisions about feature velocity vs. reliability. Track, report, and enforce error budget policies, facilitating conversations between engineering and product teams about risk tolerance and release decisions
Reliability Engineering & Infrastructure
Reduce toil through automation by identifying repetitive operational work and systematically eliminating it through infrastructure-as-code, automation frameworks, and intelligent tooling. Measure and track toil reduction efforts, aiming to keep toil below 50% of team time
Implement capacity planning processes that ensure systems have adequate headroom to meet SLOs during peak traffic, unexpected load spikes, and degraded states. Develop predictive models and automated scaling mechanisms
Observability, Monitoring & Reporting

What we offer

global access to mental health and financial wellness support and resources
healthcare (medical, dental, and vision)
life, accident, disability, commuter, and retirement options (401(k)/pension)
time off in accordance with local leave policies

Fulltime

Senior Software Engineer - SRE

Hybrid: This role is categorized as hybrid and is expected to report to Austin ...

Location

United States, Austin; Warren

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science or a related field, or equivalent work experience
7-10 years of software experience with strong proficiency in PostgreSQL and at least one other database technology (Oracle, SQL Server)
Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems
Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures
Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures
Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources
Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures
Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders
Commitment to collaborative problem-solving and shared ownership of services
Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems

Job Responsibility

Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention
Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents
Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution
Work alongside developers to ensure the quality, scalability, and reliability of our database services
Practice shared ownership of services in production, fostering a "You build it, you run it" culture
Manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to manage reliability expectations effectively
Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence
Champion a culture of continuous improvement
Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining service reliability

Fulltime

Lead Software Engineer - SRE

Wells Fargo is seeking a Lead Site Reliability Engineer (SRE) to join the WIMT P...

Location

United States, Charlotte; Saint Louis

Salary:

119000.00 - 187000.00 USD / Year

Wells Fargo

Expiration Date

Until further notice

Requirements

5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of experience leading observability and monitoring tooling: Splunk, AppDynamics, Splunk Observability, Grafana, OpenTelemetry
5+ years in infrastructure (Windows and Linux) support
5+ years proven success in toil reduction initiatives
5+ years in cloud application management, especially OpenShift Container Platform

Job Responsibility

Design and implement scalability, reliability, and observability strategies for cloud and on-premise environments
Define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and Error Budgets to improve system reliability
Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
Drive adoption of NFRs and best practices for quality and compliance across observability and performance engineering
Ensure high availability and performance of production systems through proactive monitoring and incident response
Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
Lead projects, teams, or serve as a peer mentor

What we offer

Health benefits
401(k) Plan
Paid time off
Disability benefits
Life insurance, critical illness insurance, and accident insurance
Parental leave
Critical caregiving leave
Discounts and savings
Commuter benefits
Tuition reimbursement

Fulltime

Senior Software Engineer, SRE

Abridge’s services and engineering team are in hyperscale mode. We are looking f...

Location

United States, SF Office, NYC Office

Salary:

210800.00 - 248000.00 USD / Year

Abridge

Expiration Date

Until further notice

Requirements

8+ years of software engineering experience focused on distributed systems or tooling, with an interest in engineering enablement and software scaling
At least 2 years experience as a back-end engineer focused on system performance and scalability
Experience reducing latency in software by multiples through leveraging observability and profiling tools
Experience building on Kubernetes and scaling compute services on Kubernetes
Experience with related cloud native technologies including ArgoCD, Argo Rollouts, Istio, etc.
Comfortable implementing and securing services in Google Cloud Platform with Infrastructure as Code, including GCP Projects, VPC Networks, Google Kubernetes Engine, and IAM Roles, Groups and policies
Experience building software with backend languages (e.g. Python, GoLang, Node, and Rust)
Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana (or something similar), including metrics collection, visualization, alerting, and using observability data to drive performance optimizations
Passion for engineering enablement and solving software and distributed systems scaling challenges under pressure
Must be willing to travel up to 10%

Job Responsibility

Leverage load testing, chaos engineering, and other test practices to identify performance and latency bottlenecks across all of our systems, and make changes to application code to resolve them
Drive software changes that can rehome applications at the code level onto new infrastructure (run times, event driven infrastructure, databases, and more) in order to dramatically improve scalability as well as enable multi-tenant deployments
Identify and implement software configuration changes and performance tuning parameters that will dramatically improve performance and scalability
Build developer tools and software modules that help engineers across the entire engineering organization build code faster and more effectively
Work with the Platform team to develop, and application teams to adopt, emerging elements of our internal developer platform, such as service templates and self-serve infrastructure
Work with application teams to establish and adopt SLOs and error budgets, and drive better metrics for application health that can drive automated canary releases, improved health monitoring, and better engineering practices
Uplevel our ability to respond to incidents by improving observability, runbooks, and incident response muscle across the organization
Evangelize, document, and train the engineering team on the solutions being built and uplevel them on cloud native design strategies and tools
Be a public evangelist for Abridge in the global platform engineering community, including conferences, open source, and research as we pioneer new AI-first cloud-native-first security-first implementations at scale

What we offer

Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
Paid Parental Leave: Generous paid parental leave for all full-time employees
Family Forming Benefits: Resources and financial support to help you build your family
401(k) Matching: Contribution matching to help invest in your future
Personal Device Allowance: Tax free funds for personal device usage
Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals

Fulltime

Senior Software Engineer/ SE II (DevOps/ SRE)

We are looking for DevOps/SRE Engineers to join the Optimizely team in Dhaka.

Location

Bangladesh, Dhaka

Salary:

Not provided

Optimizely

Expiration Date

Until further notice

Requirements

AWS & GCP experience (multi-account, multi-region)
Kubernetes & container orchestration (EKS, Helm, Docker)
Terraform / Infrastructure-as-Code at scale
Automation scripting (Python, Bash, Fabric)
Experience managing scalable, fault-tolerant distributed infrastructure
Others: Datadog, Atlantis, Karpenter, Spark/EMR
Should be comfortable contributing code to service repositories if necessary (e.g. Node/Python/Golang)
Minimum of 3 years of experience
Bachelor’s Degree (Computer Science or engineering preferred) or equivalent work experience

Job Responsibility

Multi-cloud infrastructure spanning multiple AWS accounts and GCP projects
50+ microservices running on both EKS and GKE with auto-scaling
36+ Terraform modules, 149+ Ansible roles, and more
Real-time data pipelines with Kinesis, Redshift, OpenSearch, and MongoDB Atlas
Self-managed OpenSearch, RabbitMQ, and other services
GitOps workflows powered by Atlantis with automated plan/apply cycles
CI/CD across 250+ Jenkins pipelines and GitHub Actions

Fulltime

Sr. Software Engineer - QA / Test Automation Engineer

Location

India, Gurgaon

Salary:

Not provided

Randstad

Expiration Date

July 09, 2026

Requirements

8+ years of experience in QA automation, SDET, or software engineering roles focused on test automation for distributed or cloud-based systems
Strong understanding of QA methodologies, test design, and systems validation
Proficiency in .NET 8/C#, Node.js, Python, or TypeScript for automation scripting
Hands-on experience with Selenium, Playwright, Cypress, REST API automation, and integration testing frameworks
Experience running tests in AWS environments with strong understanding of CI/CD pipelines using Azure DevOps
Familiarity with IaC, containerized test execution, and observability tools
Experience testing SQL Server 2022, Snowflake, PostgreSQL data flows
Ability to validate ETL pipelines, schema changes, and data quality through automation
Expertise in automated testing (unit, integration, contract, E2E, regression)
Familiarity with blue/green and canary release testing

Job Responsibility

Contribute to the design of scalable, maintainable QA automation frameworks for API, UI, integration, and performance testing
Implement automated test scenarios across microservices, APIs, data workflows, and distributed systems
Participate in design discussions to ensure testability, document risks, and propose automation strategies aligned with engineering standards
Produce clean, reusable, and maintainable automation scripts following best practices
Implement unit, integration, contract, and E2E tests integrated with CI/CD pipelines
Conduct root-cause analysis for defects and drive preventive quality improvements
Perform debugging, reliability analysis, and optimization of automation suites
Own test execution pipelines from development through deployment and monitoring
Create automated dashboards, alerts, and quality signals to validate release readiness
Collaborate in production issue investigations by building automated repros and validation scripts

Fulltime

Principal Software Engineer

We are developing Manufacturing and Engineering AI tools that help employees gai...

Location

India, Hyderabad

Salary:

Not provided

Amgen

Expiration Date

Until further notice

Requirements

13-17 years of engineering experience building or platforming cloud services or developer platforms, with 3+ years leading engineering teams or technical programs
Proven experience designing and operating cloud-native platforms using Kubernetes, containers, microservices, and related distributed system patterns
Hands-on experience with LLM serving or adjacent model-serving patterns, including inference endpoints, routing, scaling, batching, and latency/cost optimization
Practical knowledge of API gateway patterns, authentication and authorization, and secure integrations
Familiarity with cost attribution and FinOps concepts for cloud and AI workloads
Strong track record partnering with product managers and senior technical stakeholders to deliver platform capabilities and roadmaps
Excellent communication skills with the ability to explain technical tradeoffs clearly to both technical and non-technical audiences
Experience with observability and SRE practices, including metrics, tracing, logging, incident management, and production support
Master's / Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience

Job Responsibility

Define the technical vision and reference architecture for AI platforms supporting chatbots, agents, orchestration, and related enterprise services
Translate product and business requirements into scalable platform capabilities, including agent hosting, model access, AI gateways, observability, and operational tooling
Drive platform decisions around LLM serving, model endpoints, caching, batching, latency-versus-cost tradeoffs, and multi-model support
Lead architecture for manufacturing integrations and industrial data connectivity, including patterns for SCADA, Data Historian, MES, ERP, LIMS, APIs, event streams, and document-based knowledge sources
Own platform reliability, scalability, and cost by defining SLIs/SLOs, capacity planning, cost attribution, and FinOps practices
Collaborate with Product Owners, Principal Engineers, and stakeholders to define roadmap, acceptance criteria, and delivery milestones
Lead and mentor engineers delivering platform services, integrations, CI/CD for agents and models, and marketplace/catalog capabilities
Establish standards for security, compliance, and model governance, including data handling, access controls, logging, auditability, and traceability
Be hands-on when needed to prototype architectures, review designs, troubleshoot production incidents, and participate in code and design reviews

What we offer

In addition to the base salary, Amgen offers competitive and comprehensive Total Rewards Plans that are aligned with local industry standards

Fulltime
