Senior Site Reliability Engineer (SRE)

Salla, Saudi Arabia
11 days ago
Job Description

Overview

We are looking for a Senior Site Reliability Engineer (SRE) to help design, scale, and secure our rapidly growing platform infrastructure. You will work across all critical systems — from customer-facing applications and APIs to internal platforms and data services — ensuring availability, performance, and cost efficiency at scale.

You’ll be hands-on with Kubernetes, observability, GitOps, automation, and cloud infrastructure, while partnering closely with application, platform, and data teams to deliver a highly reliable and self-healing environment.

This role is ideal for an engineer who thrives on complex distributed systems, loves to automate everything, and can balance speed, stability, and cost-efficiency in production.

Responsibilities

Platform & Infrastructure Reliability

  • Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS / AKS / GKE) clusters.
  • Build self-healing, auto-scaling systems that minimize toil and manual intervention.
  • Optimize networking, ingress / egress traffic control, and service mesh for secure & performant communication.
  • Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) in Kubernetes environments.
  • Own backup, disaster recovery, replication, and failover strategies to meet RPO / RTO targets for critical data services.
  • Optimize storage performance and cost through multi-tier strategies, hot / cold data separation, and S3 / offloading lifecycle policies.
  • Troubleshoot and recover Kubernetes Persistent Volumes confidently during incidents (StorageClasses, CSI drivers, PVC issues).
  • Secure and scale object storage platforms (e.g., MinIO / S3-compatible) and integrate with workloads for high-throughput data pipelines.
  • Work with block storage (EBS / io2 / gp3) and shared file systems (EFS, NFS) to balance performance, resiliency, and cost.

Automation & Delivery

  • Champion GitOps and CI / CD best practices (ArgoCD, Flux, GitHub Actions).
  • Build automation for infrastructure provisioning and upgrades using Terraform, Helm, and Kubernetes Operators.
  • Reduce release risk through progressive delivery strategies (blue / green, canary, spot instance rolling updates).

Observability & Incident Response

  • Own the monitoring and alerting stack (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch).
  • Lead incident management and postmortems to prevent recurrence.
  • Provide real-time visibility into system health, performance, and cost metrics.

Security & Compliance

  • Implement least-privilege IAM policies, secure service-to-service communication, and network ACLs / firewalls.
  • Enforce Kubernetes RBAC, secret management, and a secure image supply chain.
  • Participate in audit readiness and compliance efforts.

Performance & Cost Optimization

  • Analyze and tune system performance under scale (CPU / memory / IO).
  • Partner with product and platform teams to right-size clusters, databases, and storage tiers.
  • Introduce cost visibility dashboards for engineering leadership.

Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or a related field — or equivalent work experience.
  • Experience designing, deploying, monitoring, and maintaining production workloads across Kubernetes clusters (EKS / AKS / GKE).

Preferred Qualifications

  • Experience managing mission-critical systems at scale (high traffic, multi-region).
  • Proven cost optimization in cloud / K8s environments.
  • Familiarity with service mesh (Istio, Linkerd) or advanced networking / egress control.
  • Experience with data platform components (Airflow, Debezium, ClickHouse, etc.) is a plus but not required.
  • Strong communication and teamwork skills, with the ability to collaborate across engineering, DevOps, security, and product teams.
