IT Monitoring & Observability Engineer Job at Robert Half (Washington, DC)

IT Monitoring & Observability Engineer

We are seeking an experienced IT Monitoring & Observability Engineer to support ...

Location

United States, Washington, DC

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Minimum of 7 years of relevant experience in IT monitoring, observability, or infrastructure operations
Hands‑on experience with OpenText Operations Bridge (OBM) and related tools including: Operations Bridge Manager, SiteScope, AI Operations Management, Optic
Extensive knowledge of multi‑vendor server operating systems
Direct experience with monitoring protocols such as SNMP and WMI
Scripting experience using PowerShell, VBScript, and/or other scripting languages
Experience managing monitoring environments with: 250+ hosts and/or 3,000+ sensors
Experience with additional monitoring platforms such as: Zenoss, PRTG, Zabbix, Nagios
Strong background monitoring: Servers, Storage, Databases, Networks, Applications
Proven ability to engineer monitoring solutions and provide technical leadership

Job Responsibility

Support and manage a unified Configuration Management Database (CMDB), ensuring accuracy and standardization
Collect, aggregate, and analyze monitoring and performance data to support ITIL processes including: Configuration, Event, Capacity, Availability, Demand, Incident and Problem Management
Assess, tune, and optimize monitoring capabilities to deliver accurate, actionable alerts for 24x7 operations teams
Design, create, and maintain intuitive dashboards showing real‑time and historical service health and performance
Configure, maintain, and optimize monitoring dashboards across diverse infrastructure components
Deploy, manage, and update Management Packs, connectors, and monitoring policies
Perform event correlation, suppression, and filtering to reduce alert noise and improve incident triage
Integrate data from third‑party monitoring tools into a centralized event console
Conduct proactive performance and availability monitoring, identify root causes, and implement preventive measures
Support continuous improvement of monitoring strategy, tooling, and operational effectiveness

What we offer

Medical, vision, dental, life, and disability insurance
Eligibility to enroll in the company 401(k) plan

Observability and Monitoring Engineer

We are seeking a highly skilled Observability and Monitoring Engineer to design,...

Location

United States, Pennington

Salary:

115000.00 USD / Year

Realign

Expiration Date

Until further notice

Requirements

Senior Application Programmer
3–5 years of experience in supporting IT Operations
Strong knowledge of monitoring tools (Dynatrace, Splunk)
Experience with scripting languages (Python, Perl, Unix shell)
Creative problem solver who thrives in a fast-paced environment
Must be a team player and demonstrate ability to communicate effectively with both technical and non-technical individuals
Excellent verbal and written communication skills
Clear oral communication and strong English proficiency
Self-starter, motivated, innovative, capable of handling a team and providing technical solutions
Ability to deal with complex information, processes, and relationships to derive simple solutions

Job Responsibility

Deploy and configure Dynatrace across diverse environments (Windows, Linux, Mainframe)
Onboard applications into Splunk using forwarders, source types, and indexing best practices
Define and implement tagging strategies, dashboards, and alerting policies for Dynatrace and Splunk
Enable full-stack monitoring, including APM, infrastructure, logs, and synthetic monitoring
Implement distributed tracing, anomaly detection, and performance baselining
Develop scripts and workflows for automated onboarding and configuration using APIs
Integrate monitoring solutions with ticketing tools for incident management
Establish retention policies and data governance for logs and metrics
Document onboarding processes, SOPs, and troubleshooting guides
Partner with application teams, infrastructure, and CIO stakeholders to align monitoring strategies

Fulltime

Monitoring and Observability Engineer

A Monitoring and Observability Engineer is a strategic professional who stays ab...

Location

India, Pune

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

OpenShift/Kubernetes Administration: Experience deploying, managing, and troubleshooting containerized applications on OpenShift/Kubernetes, including resource management and networking
Proficiency in administering Geneos ITRS at scale
Proficiency in administering Grafana (user management, data sources, dashboards, alerts)
Working knowledge of Grafana backend components: Mimir (metrics), Loki (logs), and Tempo (traces)
Experience with Prometheus for metric collection and PromQL for querying
Helm Chart Management: Experience with Helm for deploying applications, including creating, modifying, and managing Helm charts, library charts, and dependencies
Technical Documentation: Ability to create clear and concise documentation for systems and processes
6-10 years of experience
Practical problem solving and strategic thinking skills
Demonstrated leadership, interpersonal skills and relationship building skills

Job Responsibility

Operating with a global footprint
Collaborating across various organizations within Citi to understand and develop observability solutions for enterprise-wide deployment at scale
Managing the legacy monitoring stack across the Production Management organization within Citi
Driving the strategic delivery of end-to-end Observability solutions in Citi
Providing in-depth analysis with interpretive thinking to define problems and develop innovative solutions
Directly impacting the business by influencing strategic functional decisions through advice, counsel, or provided services
Persuading and influencing others through strong and comprehensive communication and diplomacy skills
Performing other duties and functions as assigned

Fulltime

Monitoring and Observability Engineer

A Monitoring and Observability Engineer is a strategic professional who stays ab...

Location

United Kingdom, Belfast

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

OpenShift/Kubernetes Administration: Experience deploying, managing, and troubleshooting containerized applications on OpenShift/Kubernetes, including resource management and networking
Grafana & Observability Stack: Proficiency in administering Geneos ITRS at scale
Proficiency in administering Grafana (user management, data sources, dashboards, alerts)
Working knowledge of Grafana backend components: Mimir (metrics), Loki (logs), and Tempo (traces)
Experience with Prometheus for metric collection and PromQL for querying
Helm Chart Management: Experience with Helm for deploying applications, including creating, modifying, and managing Helm charts, library charts, and dependencies
Technical Documentation: Ability to create clear and concise documentation for systems and processes

Job Responsibility

Operating with a global footprint
Collaborating across various organizations within Citi to understand and develop observability solutions for enterprise-wide deployment at scale
Managing the legacy monitoring stack across the Production Management organization within Citi
Driving the strategic delivery of end-to-end Observability solutions in Citi
Providing in-depth analysis with interpretive thinking to define problems and develop innovative solutions
Directly impacting the business by influencing strategic functional decisions through advice, counsel, or provided services
Persuading and influencing others through strong and comprehensive communication and diplomacy skills
Performing other duties and functions as assigned

What we offer

27 days annual leave (plus bank holidays)
A discretionary annual performance-related bonus
Private Medical Care & Life Insurance
Employee Assistance Program
Pension Plan
Paid Parental Leave
Special discounts for employees, family, and friends
Access to an array of learning and development resources

Fulltime

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...

Location

Egypt, Giza

Salary:

Not provided

Rackspace

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering/computer science or equivalent
Senior-level experience with Site Reliability Engineering, DevOps, code-level application support and troubleshooting, AWS infrastructure design, implementation, and optimization, and automation for deployment, scaling, and reliability
Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
Proactive approach to identifying problems and solutions
Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
Experience with Terraform or CloudFormation scripting
Experience with configuration management tools like Ansible, Chef or Puppet
Experience with standard software development best practices and tools such as code repositories (Git preferred)
Experience executing in an agile software development environment

Job Responsibility

Work with customers and implement Observability solutions
Build and maintain scalable systems and robust automation that supports engineering goals
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
Collaborate with team members to document and share solutions
Maintain a deep understanding of the customer’s business as well as their technical environment
Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues

Fulltime

Monitoring Engineer

In this role, you will be on-call monitoring platform performance, coordinating ...

Location

United States, San Francisco

Salary:

Not provided

Adyen

Expiration Date

Until further notice

Requirements

At least 5 years of experience with incident management, problem management, incident client communication, and platform monitoring operations
Experience with problem management practices - identifying trends across incidents, conducting root cause investigations and driving preventative action
Solid communication skills and the ability to develop strong working relationships throughout the organization, able to translate technical situations clearly and concisely to a diverse audience via data-visualizing dashboards and written documents
Willingness to participate in the on-call rotation and work in a fast-paced, dynamic environment
Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, etc.
Experience with observability platforms like Datadog, Dynatrace, Splunk
Excellent analytical and problem-solving skills, with the ability to analyze complex systems and spot the root cause of issues
Thrives in an environment where collaboration is crucial and where a global approach is key for successful implementation of processes and projects
Passion for defining and standardizing processes to drive strategic improvement, and ability to translate complex technical concepts with ease for non-technical audiences
Natural ability for handling complex situations and multiple responsibilities simultaneously

Job Responsibility

On-call: Observe platform and merchant performance and detect any issues proactively to mitigate risks in partnership with Engineering teams
Incident Management: Coordinate the mitigation, recovery, and resolution of high-impact incidents, ensuring a rapid and effective response across teams. Represent the customer perspective during incidents, maintaining a strong customer-centric approach
Communication: Be an expert in communicating with merchants in real time during an incident and present the most accurate and up-to-date information to keep them informed. Escalate critical incidents when needed and provide structured communication to senior management
Problem Management: Go beyond reactive incident response by analyzing incident trends to identify recurring issues and systemic weaknesses. Partner with engineering and product teams to advocate for long-term fixes over repeated short-term patches
Working together with Operations, Product, and Engineering teams to integrate, grow, and continuously improve our monitoring strategy and increase our reliability
Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
Mitigate merchant impact risk by acting on alerts in partnership with Engineering teams, and contribute to the monitoring playbook by documenting your learnings
Improve operations by leading or project-managing initiatives and/or the development of automation tools for effective monitoring
Focus on ruthlessly prioritizing, automating, and scaling every aspect of our detection capabilities

Fulltime

Monitoring Engineer / Incident Manager

A team within Engineering under the Platform Excellence pillar exhibits an unwav...

Location

Netherlands, Amsterdam

Salary:

Not provided

Adyen

Expiration Date

Until further notice

Requirements

At least 5 years of experience with incident management, problem management, incident client communication, and platform monitoring operations
Experience with problem management practices - identifying trends across incidents, conducting root cause investigations and driving preventative action
Solid communication skills and the ability to develop strong working relationships throughout the organization, able to translate technical situations clearly and concisely to a diverse audience via data-visualizing dashboards and written documents
Willing to participate in the on-call rotation and work in a fast-paced, dynamic environment
Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, etc.
Experience with observability platforms like Datadog, Dynatrace, Splunk
Excellent analytical and problem-solving skills, with the ability to analyze complex systems and spot the root cause of issues
Thrive in an environment where collaboration is crucial and where a global approach is key for successful implementation of processes and projects
Passion for defining and standardizing processes to drive strategic improvement, and ability to translate complex technical concepts with ease for non-technical audiences
Natural ability for handling complex situations and multiple responsibilities simultaneously

Job Responsibility

Participate in 24/7 on-call monitoring: observe platform and merchant performance and proactively detect issues to mitigate risks in partnership with Engineering teams
Coordinate the mitigation, recovery, and resolution of high-impact incidents, ensuring a rapid and effective response across teams
Represent the customer perspective during incidents, maintaining a strong customer-centric approach
Communicate with merchants in real time during an incident and present the most accurate and up-to-date information to keep them informed
Escalate critical incidents when needed and provide structured communication to senior management
Go beyond reactive incident response by analyzing incident trends to identify recurring issues and systemic weaknesses and partner with engineering and product teams to advocate for long-term fixes
Work together with Operations, Product, and Engineering teams to integrate, grow, and continuously improve monitoring strategy and increase reliability
Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
Mitigate merchant impact risk by acting on alerts in partnership with Engineering teams, and contribute to the monitoring playbook by documenting learnings
Improve operations by leading or project-managing initiatives and the development of automation tools for effective monitoring

Fulltime

Software Engineer, Observability

As a Software Engineer in Observability, you’ll be responsible for our metrics a...

Location

India, Bengaluru

Salary:

Not provided

Dialpad

Expiration Date

Until further notice

Requirements

Background in Systems and/or Software Engineering
Experience in designing, automating, maintaining, and optimizing observability platforms (logging, metrics, and tracing)
Experience with configuration management tools such as Ansible, Terraform, etc.
Experience with Public Cloud environments such as GCP, AWS, etc.
Familiarity with languages such as Python, Go, Rust, etc.
Previous direct experience with Grafana, Loki, Prometheus
Experience with Linux
Experience with Kubernetes (including GKE/EKS) and building containerized applications
Undergraduate degree in Computer Science or Engineering

Job Responsibility

Develop and improve instrumentation for monitoring and logging the health and availability of services
Develop and maintain the observability stack within Dialpad engineering
Define best practices and standards around making systems and services measurable, and work with various teams to get those best practices applied
Create tools and libraries for other engineering teams to enable them to build self-monitoring capabilities
Create and own internal documentation used by the other engineering teams
Stay up-to-date with the latest trends in observability, logging, monitoring, and cloud technologies
Collaborate with different engineering teams to integrate observability practices into their workflows
Participate in a rotating on-call within the larger Infrastructure Engineering division

What we offer

Competitive salary
Comprehensive benefits
Real opportunities for growth
Cutting-edge AI tools
Robust training program

Fulltime
