Job Description: Experienced DevOps/Site Reliability Engineer
 
 ***'s Digital Analytics and System Health software engineering team is seeking a talented and highly motivated DevOps / Site Reliability Engineer to join our team in Ridley Park, Pennsylvania; Hazelwood, Missouri; Plano, Texas; or Oklahoma City, Oklahoma.
 
 The ideal candidate will possess a strong foundation in DevOps and practical experience in owning and operating platform services and underlying infrastructure to help ensure the reliability, scalability, and performance of our systems.
You will work closely with a cross-functional team to implement automated monitoring, incident response, capacity planning, and runbooks.
You will contribute to the evolution of our reliability practices, instrumentation, and error budgets, while gaining hands-on experience with our production systems.
This role suits those who enjoy building scalable platforms, automating end-to-end processes, and improving the overall user experience.
As part of the team, you will tackle a broad range of complex tasks using modern tools and methodologies, contributing to the evolution of our digital and analytics solutions.
 
 
 At ***, we are all innovators on a mission to connect, protect, explore and inspire.
From the seabed to outer space, you'll learn and grow, contributing to work that shapes the world.
Find your future with us.
 
 
 Position Responsibilities
 Maintain and improve the reliability, availability, and performance of production services, with a focus on reducing incident frequency and recovery/restoration time.
 Design, implement, and operate monitoring, alerting, logging, and tracing solutions to provide end-to-end visibility of systems and dependencies.
 Respond to and resolve production incidents, participate in post-incident reviews, and help implement corrective actions.
 Build and maintain runbooks, standard operating procedures, and automation to reduce manual toil and improve operational consistency.
 Collaborate with software engineers to optimize code for reliability, scalability, and resilience, and assist with capacity planning and performance tuning.
 Implement and manage CI/CD pipelines, deployment strategies, and blue/green and canary release patterns to ensure safe and rapid software delivery.
 Manage infrastructure and assist with provisioning, scaling, and maintaining cloud resources.
 Enforce security and compliance best practices in the production environment, including access controls, secrets management, and secure logging.
 Participate in on-call coverage, rotate responsibilities, and communicate clearly with stakeholders about status and risks.
 Contribute to reliability-related projects, tooling, and initiatives that improve platform health and developer experience.
 Regularly assess and improve the reliability and resilience of core infrastructure components (networking, storage, compute, databases, caching layers), with emphasis on redundancy, fault tolerance, and scalable failover strategies.
 Participate in defining disaster recovery objectives (RPO, RTO), implement capabilities (backup/restore, cross-region failover, site failover), and conduct regular exercises to validate recovery procedures.
 Ensure robust backup/restore procedures, perform regular backup validation, and protect critical data across regions and environments.
 Forecast growth, model failure domains, and ensure capacity buffers and scalable architectures to withstand regional outages or component failures.
 
 
 Basic Qualifications (Required Skills/Experience)
 Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent practical experience).
 5-7 years of experience in DevOps or a related field.
 Strong Linux/Unix administration skills and proficiency in at least one scripting language (e.g., Python, Bash).
 Experience with cloud platforms (AWS, Azure, or GCP), containerization (Docker), and container orchestration (Kubernetes).
 Experience with monitoring and observability tools (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
 Solid understanding of incident management processes, on-call practices, and post-mortem analysis.
 Knowledge of CI/CD concepts and tooling (e.g., Jenkins, GitHub Actions, GitLab CI) and automation scripting.
 Strong problem-solving, debugging, and communication skills; ability to work in a collaborative, cross-functional environment.
 
 
 Preferred Qualifications (Desired Skills/Experience)
 Bachelor's degree in Information Technology, Computer Science, or a related field, or equivalent practical experience.
 ITIL/ITSM or similar service management certifications (ITIL Foundation or equivalent) are a plus.
 Knowledge of DoD or government security requirements or other regulated environments is a plus.
 1+ years of experience in the aerospace industry.