Staff Observability Platform Engineer (SRE), Scottsdale
-
Scottsdale, USA
-
Posted: a week ago
Lead Platform Reliability Engineer

We're building a world of health around every individual, shaping a more connected, convenient and compassionate health experience.
At CVS Health, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do.
Join us and be part of something bigger, helping to simplify health care one person, one family and one community at a time.

CVS Health PBM is looking for hands-on, passionate people who want to join a high-energy, growing team and be at the forefront of digital innovation that aims to reinvent what a pharmacy and a health care company can be in the digital world.

As a Lead Platform Reliability Engineer, you will design and implement metrics and observability frameworks with a strong focus on service level objectives (SLOs), service level indicators (SLIs), error budgets, and cloud infrastructure scaling and capacity estimation.

This individual contributor role is critical to enhancing our monitoring and observability capabilities, while also driving automation initiatives related to quality gates within the release engineering process.
You will work closely with cross-functional teams to ensure the reliability, performance, and scalable growth of our cloud-based systems.

Expectations for the Role:
- Metrics Development: Define, implement, and maintain key performance metrics, SLOs, and SLIs to measure system reliability and performance. Ensure alignment with business objectives and operational goals.
- Error Budgets: Manage error budgets effectively, collaborating with development teams to balance reliability and feature delivery. Analyze incidents and outages to inform adjustments to error budgets.
- Monitoring and Observability: Design and implement comprehensive monitoring solutions to provide real-time visibility into system health. Use tools such as Prometheus, Grafana, Loki, Tempo, and other observability platforms to create dashboards and alerts.
- Cloud Infrastructure Scaling: Architect, design, and implement scalable cloud infrastructure capable of supporting multiple business applications, ensuring reliability, performance, and future growth.
- Quality Gates Automation: Develop and implement automated quality gates that ensure all releases meet defined reliability and performance standards. Lead the release DevOps team to integrate these gates into the CI/CD pipeline.
- Incident Management: Assist in incident response efforts by providing insights from metrics and monitoring tools. Conduct post-mortem analyses to identify root causes and recommend preventive measures.
- AIOps Insight Automation: Use AI to surface "what changed / what's abnormal / next best action" from metrics, logs, and traces, minimizing manual dashboard analysis.
- AI-Accelerated Incident Response: Apply GenAI to speed triage and RCA with fast signal summarization and guided investigation paths.
- AI/LLM Observability and Governance: Monitor AI workloads for quality, safety, cost, latency, and reliability with end-to-end tracing (request → prompt → tools → output) and secure logging/redaction.
- AI-Backed Release Quality Gates: Embed AI signal checks into CI/CD to flag SLO risk, latency/error drift, and regression patterns before production release.

Required Qualifications:
- 10+ years of experience in Software Engineering, Platform Engineering, or SRE.
- 7+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.
- 7+ years building production-grade backend services in Java/Python.
- 7+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.
- 7+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD).
- 7+ years working with public cloud platforms (AWS, GCP, or Azure).
- 5+ years designing and scaling distributed, high-volume data pipelines.
- 5+ years working with Grafana OSS or comparable observability backends (e.g., Grafana, Loki, Tempo, Prometheus).
- 5+ years with relational databases (PostgreSQL, MySQL).

Preferred Qualifications:
- Excellent analytical skills and the ability to communicate complex technical concepts to non-technical stakeholders
- Experience with service meshes and networking technologies such as Envoy and Istio
- Experience integrating or operating commercial observability platforms (Splunk, AppDynamics, etc.)
- Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies
- Familiarity with time-series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.)
- Experience with Infrastructure as Code tools such as Terraform or CloudFormation
- Experience with cost optimization and capacity planning for large-scale cloud infrastructure
- Experience with chaos engineering, resiliency testing, or fault injection
- Background in security-aware platform design, including secure service-to-service communication
- Experience mentoring senior engineers and influencing platform standards across organizations
- Strong operational experience supporting 24x7 production systems, including on-call responsibilities
- Knowledge of security best practices in cloud environments

Education:
Bachelor's degree or equivalent experience (HS diploma + 4 years relevant experience)

Pay Range:
The typical pay range for this role is:
$118,****** - $236,******
This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls.
The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors.
This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above.
This position also includes an award target in the company's equity award program.

Our people fuel our future.
Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.

Great benefits for great people

We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families.

This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families.
The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.

Additional details about available benefits are provided during the application process and on Benefits Moments.
-
Company name: CVS Health
-
Job position: Staff Observability Platform Engineer (SRE)
Staff Observability Platform Engineer (SRE) has been posted in the Tempe Engineering category on Locanto.