Ensure System Uptime and Reliability : Monitor and maintain cloud-based applications and infrastructure, ensuring minimal downtime and efficient incident response.
Build and Optimize Monitoring and Alerting Systems : Set up and continuously improve comprehensive monitoring and alerting frameworks to detect and address issues proactively.
Cloud Infrastructure Management : Manage, optimize, and scale systems on the Azure cloud platform, ensuring high performance and cost-effectiveness.
Incident Management and Response : Act as the first line of defense in identifying, diagnosing, and resolving technical issues in real time, or escalating them to the appropriate teams.
Tooling and Observability : Leverage technologies such as Grafana for observability and Argo for CI/CD automation, enhancing our ability to respond swiftly and effectively to infrastructure needs.
Collaboration : Work closely with cross-functional teams to align on SRE best practices, share insights, and support development and operational goals.
Language Requirements : Fluent spoken and written Arabic/English.