Site Reliability Engineer
This role is responsible for the reliability and resilience of cloud-based production systems through automation, incident management, and continuous improvement. The engineer will lead major incident response, develop monitoring and alerting strategies using tools such as Prometheus and Datadog, and build self-healing systems on AWS with Kubernetes and Docker. The engineer will also collaborate with engineering and product teams to improve platform stability and operational efficiency.