IT Monitoring & Observability Engineer

United States, Washington, DC · Job Posted March 21, 2026

Job Description

We are seeking an experienced IT Monitoring & Observability Engineer to support enterprise monitoring, performance, and availability across a complex IT environment. This role is responsible for managing and optimizing a unified monitoring and event management platform, driving actionable insights, improving alert quality, and supporting 24x7 operations.

Job Responsibility

  • Support and manage a unified Configuration Management Database (CMDB), ensuring accuracy and standardization
  • Collect, aggregate, and analyze monitoring and performance data to support ITIL processes including: Configuration, Event, Capacity, Availability, Demand, Incident and Problem Management
  • Assess, tune, and optimize monitoring capabilities to deliver accurate, actionable alerts for 24x7 operations teams
  • Design, create, and maintain intuitive dashboards showing real‑time and historical service health and performance
  • Configure, maintain, and optimize monitoring dashboards across diverse infrastructure components
  • Deploy, manage, and update Management Packs, connectors, and monitoring policies
  • Perform event correlation, suppression, and filtering to reduce alert noise and improve incident triage
  • Integrate data from third‑party monitoring tools into a centralized event console
  • Conduct proactive performance and availability monitoring, identify root causes, and implement preventive measures
  • Support continuous improvement of monitoring strategy, tooling, and operational effectiveness

Requirements

  • Minimum of 7 years of relevant experience in IT monitoring, observability, or infrastructure operations
  • Hands‑on experience with OpenText Operations Bridge (OBM) and related tools including: Operations Bridge Manager, SiteScope, AI Operations Management, Optic
  • Extensive knowledge of multi‑vendor server operating systems
  • Direct experience with monitoring protocols such as SNMP and WMI
  • Scripting experience using PowerShell, VBScript, and/or other scripting languages
  • Experience managing monitoring environments with: 250+ hosts and/or 3,000+ sensors
  • Experience with additional monitoring platforms such as: Zenoss, PRTG, Zabbix, Nagios
  • Strong background monitoring: Servers, Storage, Databases, Networks, Applications
  • Proven ability to engineer monitoring solutions and provide technical leadership

Nice to have

  • Experience supporting 24x7 operations environments
  • Experience acting as a technical lead during major incidents or service outages
  • Systems administration experience with Windows and/or Linux
  • Advanced scripting and automation expertise
  • Experience integrating monitoring tools with ServiceNow
  • Experience automating alert‑to‑ticket workflows
  • Strong understanding of ITIL and ITSM concepts including monitoring, capacity, availability, and demand management
  • ITIL certification (Foundation or higher) strongly preferred
  • Experience producing executive‑level dashboards and performance reports
  • Experience with data visualization and computational tools

What we offer

  • Medical, vision, dental, life, and disability insurance
  • Eligibility to enroll in the company 401(k) plan
  • Free online training

Similar Jobs for IT Monitoring & Observability Engineer

8 matching positions

IT Monitoring & Observability Engineer

We are seeking an experienced IT Monitoring & Observability Engineer to support ...
Location: United States, Washington, DC
Salary: Not provided
Company: Robert Half (https://www.roberthalf.com)
Expiration Date: Until further notice

Observability and Monitoring Engineer

We are seeking a highly skilled Observability and Monitoring Engineer to design,...
Location: United States, Pennington
Salary: 115000.00 USD / Year
Company: Realign (realign-llc.com)
Expiration Date: Until further notice
Requirements
  • Senior Application Programmer
  • 3–5 years of experience in supporting IT Operations
  • Strong knowledge of monitoring tools (Dynatrace, Splunk)
  • Experience with scripting languages (Python, Perl, Unix shell)
  • Creative problem solver who thrives in a fast-paced environment
  • Must be a team player and demonstrate ability to communicate effectively with both technical and non-technical individuals
  • Excellent verbal and written communication skills
  • Clear oral communication and strong English proficiency
  • Self-starter, motivated, innovative, capable of handling a team and providing technical solutions
  • Ability to deal with complex information, processes, and relationships to derive simple solutions
Job Responsibility
  • Deploy and configure Dynatrace across diverse environments (Windows, Linux, Mainframe)
  • Onboard applications into Splunk using forwarders, source types, and indexing best practices
  • Define and implement tagging strategies, dashboards, and alerting policies for Dynatrace and Splunk
  • Enable full-stack monitoring, including APM, infrastructure, logs, and synthetic monitoring
  • Implement distributed tracing, anomaly detection, and performance baselining
  • Develop scripts and workflows for automated onboarding and configuration using APIs
  • Integrate monitoring solutions with ticketing tools for incident management
  • Establish retention policies and data governance for logs and metrics
  • Document onboarding processes, SOPs, and troubleshooting guides
  • Partner with application teams, infrastructure, and CIO stakeholders to align monitoring strategies
Employment Type: Full-time

Monitoring and Observability Engineer

A Monitoring and Observability Engineer is a strategic professional who stays ab...
Location: India, Pune
Salary: Not provided
Company: Citi (https://www.citi.com/)
Expiration Date: Until further notice
Requirements
  • OpenShift/Kubernetes Administration: Experience deploying, managing, and troubleshooting containerized applications on OpenShift/Kubernetes, including resource management and networking
  • Proficiency in administering Geneos ITRS at scale
  • Proficiency in administering Grafana (user management, data sources, dashboards, alerts)
  • Working knowledge of Grafana backend components: Mimir (metrics), Loki (logs), and Tempo (traces)
  • Experience with Prometheus for metric collection and PromQL for querying
  • Helm Chart Management: Experience with Helm for deploying applications, including creating, modifying, and managing Helm charts, library charts, and dependencies
  • Technical Documentation: Ability to create clear and concise documentation for systems and processes
  • 6-10 years of experience
  • Practical problem solving and strategic thinking skills
  • Demonstrated leadership, interpersonal skills and relationship building skills
Job Responsibility
  • Operating with a global footprint
  • Collaborating across various organizations within Citi to understand and develop observability solutions for enterprise-wide deployment at scale
  • Managing the legacy monitoring stack across the Production Management organization within Citi
  • Driving the strategic delivery of end-to-end Observability solutions in Citi
  • Providing in-depth analysis with interpretive thinking to define problems and develop innovative solutions
  • Directly impacting the business by influencing strategic functional decisions through advice, counsel, or provided services
  • Persuading and influencing others through strong and comprehensive communication and diplomacy skills
  • Performing other duties and functions as assigned
Employment Type: Full-time

Monitoring and Observability Engineer

A Monitoring and Observability Engineer is a strategic professional who stays ab...
Location: United Kingdom, Belfast
Salary: Not provided
Company: Citi (https://www.citi.com/)
Expiration Date: Until further notice
Requirements
  • OpenShift/Kubernetes Administration: Experience deploying, managing, and troubleshooting containerized applications on OpenShift/Kubernetes, including resource management and networking
  • Grafana & Observability Stack: Proficiency in administering Geneos ITRS at scale
  • Proficiency in administering Grafana (user management, data sources, dashboards, alerts)
  • Working knowledge of Grafana backend components: Mimir (metrics), Loki (logs), and Tempo (traces)
  • Experience with Prometheus for metric collection and PromQL for querying
  • Helm Chart Management: Experience with Helm for deploying applications, including creating, modifying, and managing Helm charts, library charts, and dependencies
  • Technical Documentation: Ability to create clear and concise documentation for systems and processes
Job Responsibility
  • Operating with a global footprint
  • Collaborating across various organizations within Citi to understand and develop observability solutions for enterprise-wide deployment at scale
  • Managing the legacy monitoring stack across the Production Management organization within Citi
  • Driving the strategic delivery of end-to-end Observability solutions in Citi
  • Providing in-depth analysis with interpretive thinking to define problems and develop innovative solutions
  • Directly impacting the business by influencing strategic functional decisions through advice, counsel, or provided services
  • Persuading and influencing others through strong and comprehensive communication and diplomacy skills
  • Performing other duties and functions as assigned
What we offer
  • 27 days annual leave (plus bank holidays)
  • A discretionary annual performance-related bonus
  • Private Medical Care & Life Insurance
  • Employee Assistance Program
  • Pension Plan
  • Paid Parental Leave
  • Special discounts for employees, family, and friends
  • Access to an array of learning and development resources
Employment Type: Full-time

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace (rackspace.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering, DevOps, Code level application support and troubleshooting, AWS Infrastructure design, implementation and optimization, Automation for deployment, scaling and reliability
  • Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
Employment Type: Full-time

Monitoring Engineer

In this role, you will be on-call monitoring platform performance, coordinating ...
Location: United States, San Francisco
Salary: Not provided
Company: Adyen (adyen.com)
Expiration Date: Until further notice
Requirements
  • At least 5 years of experience with incident management, problem management, incident client communication, and platform monitoring operations
  • Experience with problem management practices - identifying trends across incidents, conducting root cause investigations and driving preventative action
  • Solid communication skills and the ability to develop strong working relationships throughout the organization, able to translate technical situations clearly and concisely to a diverse audience via data-visualizing dashboards and written documents
  • Willingness to participate in the on-call rotation and work in a fast-paced, dynamic environment
  • Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, etc.
  • Experience with observability platforms like Datadog, Dynatrace, Splunk
  • Excellent analytical and problem-solving skills, with the ability to analyze complex systems and spot the root cause of issues
  • Thrives in an environment where collaboration is crucial and where a global approach is key for successful implementation of processes and projects
  • Passion for defining and standardizing processes to drive strategic improvement, and the ability to translate complex technical concepts with ease for non-technical audiences
  • Natural ability for handling complex situations and multiple responsibilities simultaneously
Job Responsibility
  • On-call: Observe platform and merchant performance and detect any issues proactively to mitigate risks in partnership with Engineering teams
  • Incident Management: Coordinate the mitigation, recovery, and resolution of high-impact incidents, ensuring a rapid and effective response across teams. Represent the customer perspective during incidents, maintaining a strong customer-centric approach
  • Communication: Be an expert in communicating with merchants in real time during an incident, presenting the most accurate and up-to-date information to keep them informed. Escalate critical incidents when needed and provide structured communication to senior management
  • Problem Management: Go beyond reactive incident response by analyzing incident trends to identify recurring issues and systemic weaknesses. Partner with engineering and product teams to advocate for long-term fixes over repeated short-term patches
  • Working together with Operations, Product, and Engineering teams to integrate, grow, and continuously improve our monitoring strategy and increase our reliability
  • Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
  • Mitigate merchant impact risk by actioning on alerts in partnership with Engineering teams, and contribute to the monitoring playbook by documenting your learnings
  • Improve operations by leading or project-managing initiatives and/or developing automation tools for effective monitoring
  • Focus on ruthlessly prioritizing, automating, and scaling every aspect of our detection capabilities
Employment Type: Full-time

Monitoring Engineer / Incident Manager

A team within Engineering under the Platform Excellence pillar exhibits an unwav...
Location: Netherlands, Amsterdam
Salary: Not provided
Company: Adyen (adyen.com)
Expiration Date: Until further notice
Requirements
  • At least 5 years of experience with incident management, problem management, incident client communication, and platform monitoring operations
  • Experience with problem management practices - identifying trends across incidents, conducting root cause investigations and driving preventative action
  • Solid communication skills and the ability to develop strong working relationships throughout the organization, able to translate technical situations clearly and concisely to a diverse audience via data-visualizing dashboards and written documents
  • Willing to participate in the on-call rotation and work in a fast-paced, dynamic environment
  • Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, etc.
  • Experience with observability platforms like Datadog, Dynatrace, Splunk
  • Excellent analytical and problem-solving skills, with the ability to analyze complex systems and spot the root cause of issues
  • Thrive in an environment where collaboration is crucial and where a global approach is key for successful implementation of processes and projects
  • Passion for defining and standardizing processes to drive strategic improvement, and the ability to translate complex technical concepts with ease for non-technical audiences
  • Natural ability for handling complex situations and multiple responsibilities simultaneously
Job Responsibility
  • Participate in 24/7 on-call monitoring and observe platform and merchant performance and detect any issues proactively to mitigate risks in partnership with Engineering teams
  • Coordinate the mitigation, recovery, and resolution of high-impact incidents, ensuring a rapid and effective response across teams
  • Represent the customer perspective during incidents, maintaining a strong customer-centric approach
  • Communicate with merchants in real time during an incident and present the most accurate and up-to-date information to keep them informed
  • Escalate critical incidents when needed and provide structured communication to senior management
  • Go beyond reactive incident response by analyzing incident trends to identify recurring issues and systemic weaknesses and partner with engineering and product teams to advocate for long-term fixes
  • Work together with Operations, Product, and Engineering teams to integrate, grow, and continuously improve monitoring strategy and increase reliability
  • Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
  • Mitigate merchant impact risk by actioning on alerts in partnership with Engineering teams and contribute to the monitoring playbook by documenting learnings
  • Improve operations by leading or project-managing initiatives and developing automation tools for effective monitoring
Employment Type: Full-time

Software Engineer, Observability

As a Software Engineer in Observability, you’ll be responsible for our metrics a...
Location: India, Bengaluru
Salary: Not provided
Company: Dialpad (dialpad.com)
Expiration Date: Until further notice
Requirements
  • Background in Systems and/or Software Engineering
  • Experience in designing, automating, maintaining, and optimizing observability platforms (logging, metrics, and tracing)
  • Experience with configuration management tools such as Ansible, Terraform, etc.
  • Experience with Public Cloud environments such as GCP, AWS, etc.
  • Familiarity with languages such as Python, Go, Rust, etc.
  • Previous direct experience with Grafana, Loki, Prometheus
  • Experience with Linux
  • Experience with Kubernetes (including GKE/EKS) and building containerized applications
  • Undergraduate degree in Computer Science or Engineering
Job Responsibility
  • Develop and improve instrumentation for monitoring and logging the health and availability of services
  • Develop and maintain the observability stack within Dialpad engineering
  • Define best practices and standards around making systems and services measurable, and work with various teams to get those best practices applied
  • Create tools and libraries for other engineering teams to enable them to build self-monitoring capabilities
  • Create and own internal documentation used by the other engineering teams
  • Stay up-to-date with the latest trends in observability, logging, monitoring, and cloud technologies
  • Collaborate with different engineering teams to integrate observability practices into their workflows
  • Participate in a rotating on-call within the larger Infrastructure Engineering division
What we offer
  • Competitive salary
  • Comprehensive benefits
  • Real opportunities for growth
  • Cutting-edge AI tools
  • Robust training program
Employment Type: Full-time