We are looking for a Senior Site Reliability Engineer (SRE) to help design, scale, and secure our rapidly growing platform infrastructure. You will work across all critical systems — from customer-facing applications and APIs to internal platforms and data services — ensuring availability, performance, and cost efficiency at scale. You'll be hands‑on with Kubernetes, observability, GitOps, automation, and cloud infrastructure, while partnering closely with application, platform, and data teams to deliver a highly reliable and self‑healing environment. This role is ideal for an engineer who thrives on complex distributed systems, loves to automate everything, and can balance speed, stability, and cost‑efficiency in production.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field — or equivalent work experience.
- Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS / AKS / GKE) clusters.
- Build self‑healing, auto‑scaling systems that minimize manual intervention and ensure uptime.
- Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) within Kubernetes environments.
- Implement backup, disaster recovery, replication, and failover strategies to meet RPO / RTO targets.
- Troubleshoot and recover Kubernetes Persistent Volumes (StorageClasses, CSI drivers, PVC issues).
- Optimize storage performance and cost through multi‑tier strategies, hot / cold data separation, and S3 / offloading lifecycle policies.
- Secure and scale object storage platforms (e.g., MinIO / S3‑compatible) for high‑throughput data pipelines.
- Manage block storage (EBS / io2 / gp3) and shared file systems (EFS, NFS) for resilience and cost balance.
- Collaborate with teams to optimize networking, ingress / egress traffic, and service mesh for secure communication.
Platform & Infrastructure Reliability
- Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS / AKS / GKE) clusters.
- Build self‑healing, auto‑scaling systems that minimize toil and manual intervention.
- Optimize networking, ingress / egress traffic control, and service mesh for secure and performant communication.
- Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) in Kubernetes environments.
- Own backup, disaster recovery, replication, and failover strategies to meet RPO / RTO targets for critical data services.
- Optimize storage performance and cost through multi‑tier strategies, hot / cold data separation, and S3 / offloading lifecycle policies.
- Troubleshoot and recover Kubernetes Persistent Volumes confidently during incidents (StorageClasses, CSI drivers, PVC issues).
- Secure and scale object storage platforms (e.g., MinIO / S3‑compatible) and integrate them with workloads for high‑throughput data pipelines.
- Work with block storage (EBS / io2 / gp3) and shared file systems (EFS, NFS) to balance performance, resiliency, and cost.
Automation & Delivery
- Champion GitOps and CI / CD best practices (ArgoCD, Flux, GitHub Actions).
- Build automation for infrastructure provisioning and upgrades using Terraform, Helm, and Kubernetes Operators.
- Reduce release risk through progressive delivery strategies (blue / green, canary, spot instance rolling updates).
Observability & Incident Response
- Own the monitoring and alerting stack (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch).
- Lead incident management and postmortems to prevent recurrence.
- Provide real-time visibility into system health, performance, and cost metrics.
Security & Compliance
- Implement least‑privilege IAM policies, secure service‑to‑service communication, and network ACLs / firewalls.
- Enforce Kubernetes RBAC, secret management, and a secure image supply chain.
- Participate in audit readiness and compliance efforts.
Performance & Cost Optimization
- Analyze and tune system performance under scale (CPU / memory / IO).
- Partner with product and platform teams to right‑size clusters, databases, and storage tiers.
- Introduce cost visibility dashboards for engineering leadership.
Preferred Qualifications
- Experience managing mission‑critical systems at scale (high traffic, multi‑region).
- Proven cost optimization in cloud / K8s environments.
- Familiarity with service mesh (Istio, Linkerd) or advanced networking / egress control.
- Experience with data platform components (Airflow, Debezium, ClickHouse, etc.) is a plus but not required.
- Strong communication and teamwork skills, with the ability to collaborate across engineering, DevOps, security, and product teams.
Requirements
- 8+ years in SRE / DevOps / Infrastructure Engineering roles.
- Deep Kubernetes expertise (multi‑cluster, Helm chart development, advanced networking).
- Strong command of GitOps workflows using ArgoCD / Flux.
- Expertise with AWS (preferred) or Azure / GCP, plus Infrastructure‑as‑Code (Terraform, Pulumi, CloudFormation).
- Advanced knowledge of SQL and NoSQL databases (MySQL / Aurora, PostgreSQL, MongoDB, Redis).
- Scripting / automation skills in Python, Bash, or Go.
- Solid background in monitoring / observability (Prometheus, Grafana, Loki, ELK / OpenSearch, VictoriaMetrics).
- Experience with CI / CD at scale and managing production incidents.
- Experience with streaming / messaging (Kafka, RabbitMQ, or similar).
Benefits
- Comprehensive training and development programs.
- Performance‑based bonus incentives.
- Flexible work‑from‑home options.