Principal Architect - Cloud and Observability

CVS Health

Location:
United States

Contract Type:
Employment contract

Salary:

144200.00 - 288400.00 USD / Year

Job Description:

We're building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

Position Summary:

We're hiring a Principal Architect to take ownership of how we do observability and hybrid cloud at CVS Health. This person will sit within our Enterprise Architecture organization and be responsible for the architecture, standards, and technical direction behind our observability platforms and our multi-cloud infrastructure posture.

We run workloads across on-prem private cloud (OpenShift, KVM, Dell PowerFlex), Azure, AWS, and GCP. We need someone who can build and maintain the reference architectures, telemetry standards, and instrumentation patterns that let our engineering teams monitor all of that consistently. We've committed to an OpenTelemetry-first approach and use the Grafana stack (Mimir, Loki, Tempo) as our primary backends, but we also operate Datadog, Splunk, and Dynatrace in various parts of the org.

On the cloud side, there is real work to do around workload identity, runtime selection, autoscaling guidance, and FinOps. Teams are asking for concrete standards they can follow.

This is a hands-on role. You'll write architecture docs, build proof-of-concepts, configure OTel pipelines, and present to leadership.

This position can work remotely from anywhere in the continental USA.

Job Responsibility:

  • Own the enterprise observability reference architecture covering metrics, logs, traces, and events across all environments (cloud and on-prem)
  • Drive the OpenTelemetry-first instrumentation strategy -- standard libraries, semantic conventions, collector topologies (DaemonSet, gateway, sidecar), and pipeline design
  • Build and operate telemetry pipelines on Grafana Mimir, Loki, and Tempo, including multi-tenant configurations, retention policies, and capacity planning
  • Define how we measure reliability: SLOs, SLIs, error budgets, and alerting frameworks -- consistently across all lines of business
  • Own the integration between observability tooling and incident management (ServiceNow ITOM, xMatters)
  • Drive telemetry schema standards to ensure teams emit data that is useful downstream, not just technically compliant
  • Build and maintain reference architectures for our hybrid footprint: OpenShift on-prem with KVM/libvirt and Dell PowerFlex storage, plus Azure, AWS, and GCP
  • Lead standards work around workload identity and federation using SPIFFE/SPIRE and cloud-native IAM patterns to move away from static secrets
  • Provide guidance on compute runtime selection -- containers vs. VMs vs. bare metal vs. serverless -- with a clear decision framework for teams
  • Help teams connect autoscaling and capacity planning behavior to actual telemetry signals
  • Push FinOps maturity forward by integrating cost data into the observability stack, establishing unit economics, and working toward open billing standards like FOCUS
  • Identify where AI/ML adds practical value in our observability stack -- anomaly detection, root cause analysis, log clustering, and smarter alerting
  • Define observability standards for AI-powered systems (agents, RAG pipelines) -- covering latency, token costs, model drift, and related signals
  • Ensure new AI-powered platforms are instrumented correctly from day one
  • Participate in cross-functional architecture working groups focused on observability and hybrid cloud standards
  • Publish architecture decision records and reference implementations that teams can actually use
  • Mentor architects and platform engineers, and conduct architecture reviews to raise the bar across the org
  • Work with security and compliance on HIPAA, SOX, and PCI requirements as they apply to telemetry and cloud infrastructure
  • Represent CVS Health in vendor evaluations and stay connected to the open-source ecosystem (CNCF, OpenTelemetry, Grafana Labs)

Requirements:

  • 10+ years in infrastructure, cloud architecture, platform engineering, or SRE
  • 8+ years of architecture work in observability, cloud infrastructure, or both at a large enterprise
  • Solid experience with at least two of Azure, AWS, or GCP -- including networking, identity, compute, and storage
  • 5+ years with Kubernetes in production (OpenShift, EKS, AKS, or GKE)
  • 5+ years with OpenTelemetry or similar frameworks (collectors, SDKs, semantic conventions, pipeline design)
  • 5+ years with observability platforms: Grafana/Mimir/Loki/Tempo, Prometheus, Datadog, Splunk, Dynatrace, or comparable tools
  • Experience defining SLOs/SLIs and building alerting strategies at an organizational level
  • Proven track record writing architecture standards that other teams adopted and followed
  • Able to communicate clearly with both engineers and senior leadership

Nice to have:

  • On-prem / private cloud experience (OpenShift Virtualization, KVM/libvirt, VMware, Dell PowerFlex or similar storage)
  • Workload identity (SPIFFE/SPIRE) and zero-trust networking
  • Infrastructure-as-code (Terraform, Pulumi, Helm, ArgoCD)
  • Streaming platforms such as Kafka or Confluent, especially in telemetry pipeline contexts
  • AIOps or ML-based anomaly detection experience
  • FinOps background -- cloud cost optimization, chargeback, unit economics
  • Service mesh (Istio, Envoy, Linkerd) or eBPF-based tools (Cilium, Pixie)
  • Involvement in open-source communities (CNCF, OpenTelemetry, etc.)
  • Healthcare, insurance, or financial services experience (HIPAA/SOX familiarity)
  • Cloud certifications are a plus but not required

What we offer:

  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
  • Other resources, based on eligibility
  • Bonus, commission, or short-term incentive program
  • Equity award program

Additional Information:

Job Posted:
April 24, 2026

Expiration:
June 29, 2026

Employment Type:
Full-time

Work Type:
Remote work

Similar Jobs for Principal Architect - Cloud and Observability

Principal Data Architect

We’re seeking a Principal Data Architect to lead strategic data modernization ad...
Location:
United States; Canada
Salary:
170000.00 - 200000.00 USD / Year
Terazo
Expiration Date:
Until further notice
Requirements:
  • 10+ years designing and delivering enterprise‑scale data solutions (with at least 5 years in cloud platforms such as Azure, Databricks, Snowflake, and/or AWS/GCP)
  • Proven experience working with Cloud Platforms and services
  • Proven experience migrating workloads to Cloud environments
  • Proven track record leading advisory and modernization programs end‑to‑end, from discovery through roadmap and into delivery oversight
  • Deep knowledge of enterprise data reference architecture, including Lakehouse/Medallion patterns, Mesh principles, CDC/streaming, and domain‑oriented data products
  • Expertise in data modeling (dimensional, 3NF, Data Vault), metadata/lineage, data quality, and observability
  • Hands-on governance with Unity Catalog (Databricks) and/or Microsoft Purview; pragmatic implementation of RBAC, lineage, and data masking
  • Strong grasp of governance (policy‑driven access, privacy/security, and regulatory expectations in banking such as GLBA, AML/KYC, BCBS 239), FinOps/cost management, and performance/scalability
  • Executive‑level communication and facilitation skills
Job Responsibility:
  • Partner with Sales/Accounts to shape opportunities, conduct discovery, craft SOWs and proposals, build estimates/approaches, and present to executives
  • Lead executive advisory engagements that connect business value (growth, efficiency, risk) to a pragmatic data modernization strategy and investment plan
  • Define and validate target-state enterprise data architecture. Produce a phased, value‑anchored roadmap with clear outcomes, success metrics, and FinOps/budget guardrails. Build additional architecture artifacts to describe desired-state architecture
  • Design governance frameworks (policy-driven access, quality, lineage, metadata, ownership/stewardship, regulatory alignment) and connect them to measurable outcomes
  • Design reference architectures based on Cloud principles, such as the Well-Architected Framework
  • Own the reference architecture across Lakehouse, Warehouse, and Mesh patterns leveraging industry leading technologies
  • Define design patterns for ingestion, transformation, orchestration, and observability (monitoring, logging, alerting) that scale across domains
  • Mentor and coach architects, engineers, and analysts
What we offer:
  • Pawreavement & Pawternity Leave: Two days off work if you bring a new furry friend home, and two days off if you have to say goodbye to one
  • Unlimited PTO

Principal AI Architect

We are seeking an experienced AI Architect to lead the design, implementation, a...
Location:
India, Bengaluru
Salary:
Not provided
EvoluteIQ
Expiration Date:
Until further notice
Requirements:
  • 12+ years of experience in data science, ML engineering and AI system architecture
  • Hands-on experience with Python, TensorFlow, PyTorch, Scikit-learn, spaCy and related AI/ML frameworks
  • Expertise in MLOps tools such as MLflow, Kubeflow, Vertex AI, or SageMaker
  • Proficiency in data processing technologies (Spark, Kafka, Airflow) and data modeling
  • Strong background in deploying models as APIs or services using Docker, Kubernetes, and REST/gRPC
  • Experience designing data pipelines and integrating AI with production systems
  • Understanding of prompt engineering, LLM fine-tuning, and vector stores (e.g. Pinecone, FAISS, Weaviate)
  • Knowledge of cloud AI services (AWS, GCP, Azure) and distributed computing architectures
  • Proven experience implementing observability for models (drift, accuracy, bias, and performance)
Job Responsibility:
  • Architect and oversee AI/ML pipelines covering data collection, preparation, training, validation, and inference
  • Define and implement scalable AI infrastructure for training, deployment, and continuous integration (MLOps)
  • Collaborate with data scientists, ML engineers, product managers, and product teams to translate business problems into AI-driven solutions
  • Establish frameworks for model governance, versioning, reproducibility, and explainability
  • Integrate models into production systems ensuring low latency, scalability, and reliability
  • Define data strategy, storage, and access patterns to support AI workloads
  • Build solutions to monitor model performance, drift, and data quality, implementing continuous retraining strategies
  • Ensure compliance with ethical AI, data privacy, and security best practices
  • Mentor AI/ML engineers and contribute to architectural decisions across the AI platform stack
What we offer:
  • Opportunity to shape the strategy of a next-gen hyper-automation platform
  • Work with a cross-disciplinary team in a fast-growing, innovation-driven environment
  • Competitive compensation and growth opportunities
  • A culture of innovation, ownership, and continuous learning
Employment Type:
Full-time

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location:
United States, Pleasanton, California
Salary:
251000.00 - 314500.00 USD / Year
BlackLine
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years leading MLOps or AIOps platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility:
  • Define enterprise-level standards and reference architectures for MLOps and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer:
  • Short-term and long-term incentive programs
  • Robust offering of benefit and wellness plans
Employment Type:
Full-time

Principal Data Engineer

We are on the lookout for a Principal Data Engineer to help define and lead the ...
Location:
United Kingdom
Salary:
Not provided
Dotdigital
Expiration Date:
Until further notice
Requirements:
  • Extensive experience delivering Python-based projects in the data engineering space
  • Extensive experience working with SQL and NoSQL database technologies (e.g. SQL Server, MongoDB & Cassandra)
  • Proven experience with modern data warehousing and large-scale data processing tools (e.g. Snowflake, dbt, BigQuery, ClickHouse)
  • Hands-on experience with data orchestration tools like Airflow, Dagster or Prefect
  • Experience using cloud environments (e.g. Azure, AWS, GCP) to process, store and surface large scale data
  • Experience using Kafka or similar event-based architectures (e.g. Pub/Sub via AWS SQS, Azure Event Hubs, AWS Kinesis)
  • Strong grasp of data architecture and data modelling principles for both OLAP and OLTP workloads
  • Capable in the wider software development lifecycle in terms of agile ways of working and continuous integration/deployment of data solutions
  • Experience as a Lead or Principal Engineer on large-scale data initiatives or product builds
  • Demonstrated ability to architect data systems and data structures for high volume, high throughput systems
Job Responsibility:
  • Lead the design and implementation of scalable, secure and resilient data systems across streaming, batch and real-time use cases
  • Architect data pipelines, models, and storage solutions that power analytical and product use cases, using primarily Python and SQL via orchestration tooling that runs workloads in the cloud
  • Leverage AI to automate both data processing and engineering processes
  • Assure and drive best practices relating to data infrastructure, governance, security and observability
  • Work with technologists across multiple teams to deliver coherent features and data outcomes
  • Support the data team to help adopt data engineering principles
  • Identify, validate and promote new tools and technologies that improve the performance and stability of data services
What we offer:
  • Parental leave
  • Medical benefits
  • Paid sick leave
  • Dotdigital day
  • Share reward
  • Wellbeing reward
  • Wellbeing Days
  • Loyalty reward
Employment Type:
Full-time

Principal Engineer I - Cloud Observability

We’re not just building better tech. We’re rewriting how data moves and what the...
Location:
India
Salary:
Not provided
Confluent
Expiration Date:
Until further notice
Requirements:
  • 15+ years of hands-on software development experience with the ability to anticipate future technical needs for the product and craft plans to realize them
  • Taking ideas to production is something we look for
  • Ready to roll up your sleeves - code, debug, design - do whatever it takes to ship the product to production
  • Experience building and operating large-scale systems. Solid understanding of basic systems operations (disk, network, operating systems, etc). Experience running production services in the cloud
  • Strong fundamentals in distributed systems design and development. Solid fundamentals in concurrent and multi-threaded programming
  • A self-starter with the ability to work effectively in teams. Able to proactively identify the symptoms of technical issues, reason about their causes, and fix the root causes
  • Timely shipping of deliverables; able to trade off short-term technical decisions against the long term. Move fast, build in increments, and iterate. A sense of urgency, a mindset towards achieving results, and excellent prioritization skills
  • Ability to influence the team, peers and upper management in technology decisions using effective communication and collaborative techniques
  • Degree in Computer Science, Engineering or equivalent experience. Understanding of various technologies, programming paradigms and frameworks is needed. Ability to be pragmatic and trade off their usage in production is essential
Job Responsibility:
  • You will work with a team of engineers and architects to help evolve Confluent Observability features
  • Work closely with product management, engineering leadership, and other key stakeholders across various teams in Confluent to build and drive the overall roadmap
  • Be a strong tech voice beyond the Confluent Observability team, across Confluent
  • Influence the overall domain health and operational hygiene for Confluent Observability
  • We need a tech champion for the observability capabilities we provide to our customers
  • You are expected to review designs and code and improve our technical standards
  • We are looking at you to lead the technology charter for our observability features in Confluent Cloud and in hybrid scenarios with Confluent Platform
  • Mentor a team of high-performing engineers and leads, helping them continue to grow their skill set through hands-on experience and mentorship
  • Be a strong technical leader and representative for engineering teams in India
  • Provide timely and productive feedback, encourage a growth mindset, and advise team members in setting and working toward personal development goals
What we offer:
  • Remote-First Work
  • Robust Insurance Benefits
  • Flexible Time Away
  • The Best Teammates
  • Experience Ambassadors
  • Open and Honest Culture
  • Well-Being and Growth
Employment Type:
Full-time

Principal Customer Success Manager

The Customer Success Architect position is a technical champion within the Custo...
Location:
United States
Salary:
115500.00 - 266000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 10-15 years of experience in IT operations management (ITOM)/APM fields
  • 5+ years of experience in senior customer-facing positions as Implementation Architect, Service Delivery Architect, or Lead Solution Architect
  • In-depth knowledge and hands-on experience in Observability, Process Automation, Patching, AIOps
  • Familiarity with cloud-native design patterns, microservices, and modern web-scale architectures
  • Excellent written and oral communication skills
  • Ability to perform proactive problem management, issue resolution, and manage customer expectations
  • Ability to quickly learn and get certified in newer technologies
Job Responsibility:
  • Drive adoption of OpsRamp products and best practices with customers
  • Manage technical health of Enterprise/GSI/OEM clients
  • Own structured adoption and outcomes leading to value realization, expansion, and growth
  • Work with customers' technical/operational decision-makers to identify and prioritize business problems
  • Define KPIs and use cases
  • Plan technical strategies and build solutions
  • Design solution and architecture
  • Serve as a trusted partner for customers on use cases and product functionality
  • Lead customers in application of OpsRamp products and services
  • Perform health checks during customer success engagement lifecycle
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Career development programs
  • Unconditional inclusion and flexible work arrangements
Employment Type:
Full-time

Principal Software Engineer, AI Cloud

At Docker, we make app development easier so developers can focus on what matter...
Location:
United States, Seattle
Salary:
232000.00 - 319000.00 USD / Year
Docker
Expiration Date:
Until further notice
Requirements:
  • 10+ years of software engineering experience, including 3+ years in technical leadership roles (Staff or Principal level)
  • Proven experience designing and building highly scalable distributed systems in production environments
  • Deep understanding of cloud infrastructure (AWS, Azure, GCP, or OCI), including compute, networking, and storage primitives
  • Proficiency in Go, Rust, or Java
  • Expertise in Kubernetes, microservices, and service mesh architectures
  • Strong foundation in observability, CI/CD, and infrastructure-as-code (Terraform, Pulumi, or CloudFormation)
  • Experience operating high-availability (99.99%+) production systems
  • Exceptional communication skills and ability to influence across technical and business domains
  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
Job Responsibility:
  • Define and drive the long-term technical strategy for Docker AI Cloud’s control and data plane services
  • Architect highly available, multi-region systems capable of operating seamlessly across multiple cloud providers
  • Design APIs and service abstractions that integrate Docker Desktop, Hub, and enterprise cloud services
  • Establish standards for reliability, scalability, and observability across the Docker AI Cloud platform
  • Lead cross-functional technical discussions and influence architectural decisions company-wide
  • Design and implement distributed systems for workload orchestration, service discovery, and lifecycle management
  • Build and operate control plane components that manage multi-tenant workloads and cloud networking
  • Develop infrastructure that delivers predictable performance, intelligent scaling, and automated failover
  • Ensure security, data integrity, and compliance across Docker’s global infrastructure footprint
  • Partner with platform and product teams to deliver developer-friendly APIs and cloud experiences
What we offer:
  • Freedom & flexibility: fit your work around your life
  • Designated quarterly Whaleness Days plus end of year Whaleness break
  • Home office setup: we want you comfortable while you work
  • 16 weeks of paid Parental leave
  • Technology stipend equivalent to $100 net/month
  • PTO plan that encourages you to take time to do the things you enjoy
  • Training stipend for conferences, courses and classes
  • Equity
Employment Type:
Full-time

Principal Architect, Core Platform

We are seeking a Principal Architect / Distinguished Engineer to lead the strate...
Location:
United States, San Diego
Salary:
Not provided
Teradata
Expiration Date:
Until further notice
Requirements:
  • 10+ years in large-scale distributed systems, cloud infrastructure, or platform engineering
  • Proven track record architecting high-scale, elastic, multi-region or multi-cloud systems
  • Deep understanding of cloud provider elasticity primitives, orchestration systems (Kubernetes, serverless, autoscalers), and modern infrastructure technologies
Job Responsibility:
  • Define and evolve the architectural blueprint for a highly elastic, self-optimizing cloud platform
  • Develop frameworks and systems that support real-time autoscaling, predictive scaling, capacity forecasting, and cost-aware elasticity
  • Maintain a deep understanding of cloud provider offerings (AWS, Azure, GCP, etc.) and relevant industry tooling, including their performance characteristics, constraints, and roadmap directions
  • Evaluate emerging technologies, vendor solutions, and open-source projects, identifying opportunities and gaps relative to our needs
  • Influence long-term engineering strategy, collaborating with product, operations, and security teams to align architectural decisions with business goals
  • Mentor senior technical leaders and engineering teams; drive engineering excellence and establish best practices for scalable cloud architecture
  • Establish patterns for distributed systems that optimize workload placement, resource utilization, reliability, and failure recovery
  • Focus on cost efficiency by designing elasticity mechanisms that reduce waste without compromising performance or customer experience
  • Lead the creation of reference implementations, proofs-of-concept, and performance experiments to validate design decisions
What we offer:
  • People-first culture
  • Flexible work model
  • Focus on well-being
  • Inclusive environment
Employment Type:
Full-time