We are hiring a Site Reliability Engineer to help define, measure, automate, and improve the uptime of critical customer-facing business processes.
This role is suitable for an engineer with strong SRE/DevOps experience. A software engineering background is not mandatory. The main focus is reliability from the customer and business perspective: service availability, transaction success, customer impact, operational continuity, and revenue protection.
You will work on SLA/SLO/SLI frameworks, business-flow monitoring, customer-impact alerting, automation, incident prevention, operational readiness, and reliability improvements together with development, product, observability, and platform teams.
The role includes participation in a structured on-call rotation for critical business services. You are expected not only to respond to incidents but also to improve monitoring, automation, escalation, and preventive reliability controls after each one.
We are looking for a proactive engineer with strong ownership and the desire to introduce new tools, practices, dashboards, and automation that improve uptime and reduce production instability.
Requirements:
- Experience with monitoring, alerting, observability, and production troubleshooting.
- Strong understanding of service reliability, uptime, availability, and customer impact.
- Ability to connect technical metrics with business impact.
- Experience building dashboards, alerts, automation scripts, or internal tools.
- Good understanding of APIs, microservices, distributed systems, and production environments.
- Practical experience with Prometheus, Grafana, Alertmanager, ELK/OpenSearch, OpenTelemetry, or similar tools.
- Basic hands-on experience with Kubernetes and production microservice environments.
- Scripting experience with Python, Bash, Go, or similar.
- Understanding of incident management, escalation, post-incident actions, and operational improvements.
- Willingness to participate in a structured on-call rotation and contribute to improving incident response quality.
- Strong ownership mindset and ability to drive improvements independently.
- Good communication skills for working with developers, product owners, engineering leads, and support teams.
Nice to have:
- Experience with SLA/SLO/SLI frameworks and error budgets.
- Experience with business-flow monitoring, customer journey monitoring, or synthetic checks.
- Experience with transactional systems such as payments, transfers, authentication, onboarding, or card operations.
- Experience with resilience patterns: retries, circuit breakers, fallback logic, timeout tuning, rate limiting, and graceful degradation.
- Experience with operational readiness checks, reliability reviews, postmortems, or incident management.
- Experience with CI/CD, GitOps, service mesh, Kafka, Redis, databases, or API gateways.
What we offer:
- Opportunities for professional growth and development.
- Competitive salary and bonuses.
- Comprehensive insurance coverage.
- Supportive work environment.
- Visa Premium salary card.
- Corporate discounts and events.
- Additional vacation days.
- Discounted education and employee loans.
Responsibilities:
- Define and improve uptime measurement for critical business processes and customer journeys.
- Build and improve SLA/SLO/SLI dashboards, reliability reports, and customer-impact monitoring.
- Create automation to detect, prevent, or reduce incidents and manual operational work.
- Improve alerting quality by focusing on actionable, business-critical, and customer-impacting signals.
- Analyze incidents from the business-impact perspective and drive preventive actions.
- Participate in a structured on-call rotation for critical business flows, including incident response, escalation, and follow-up reliability improvements.
- Participate in incident reviews and drive clear follow-up actions.
- Identify reliability gaps that may affect customers, transaction success, revenue, or service availability.
- Help define operational readiness requirements before services become part of critical business flows.
- Work with development, product, observability, platform, and engineering teams to improve end-to-end reliability.
- Introduce new reliability tools, dashboards, checks, and automation that improve uptime.
To learn about the work environment at Kapital Bank, additional opportunities, and other vacancies, visit the Kapital Bank Life page.