Senior Site Reliability Engineer

Remotive

Remote

11 hours ago

About

This description summarizes our understanding of the job posting. Click the 'Apply' button to find out more.

Role Description

This role involves joining an Identity Security Cloud software development team as a Senior Site Reliability Engineer (SRE). You will work closely with software engineers, infrastructure platform services, engineering managers, and other stakeholders to ensure the reliability, scalability, and performance of the team's services.

  • Work with development and service owners to solve performance issues and ensure system scalability.
  • Design, develop, and implement solutions to improve reliability, availability, performance, and scalability of systems.
  • Develop alerts and dashboards in collaboration with technical leaders and infrastructure platform services.
  • Own and improve key operational metrics (SLIs, SLOs, Error Budgets, monitoring and alerting).
  • Drive continuous improvement through post-incident reviews and blameless postmortems of non-functional issues.
  • Develop and maintain comprehensive monitoring and alerting to proactively identify and resolve issues.
  • Create and maintain dashboards, reviewing them regularly to find and close coverage gaps.
  • Collaborate with technical leads, DevOps/SRE, and infra teams for capacity planning.
  • Identify and address production performance bottlenecks through profiling, tuning, and optimization.
  • Automate repetitive tasks and processes to improve efficiency.
  • Work closely with Software, Performance, and Test Engineers to influence system design and architecture.
  • Review and contribute to documentation for systems, processes, runbooks, and procedures.
  • Participate in a 24/7 on-call rotation to gain subject matter expertise.
  • Lead incident postmortem efforts, ensuring timely compilation of reports.
  • Utilize excellent diagnostic and problem-solving skills to analyze complex systems and data.

Qualifications

  • Bachelor’s degree in computer science, a related field, or equivalent practical experience.
  • 5+ years of proven SRE experience.
  • Strong understanding of SRE principles and practices.
  • Experience with cloud platforms (AWS, GCP, or Azure).
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, Honeycomb, OpenSearch).
  • Coding experience beyond simple scripts in a programming language such as Go, Java, or Python.
  • Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Understanding of network protocols and security best practices.
  • Familiarity with DevOps culture and practices, and experience with CI/CD toolchains (Jenkins, ArgoCD, Spacelift).
  • Experience with Incident Response tools and processes (PagerDuty).
  • Experience with Infrastructure as Code (Terraform, Helm).
  • Strong problem-solving and troubleshooting skills.
  • Excellent communication and collaboration skills.
  • Ability to work independently and as part of a team.

Preferred Qualifications

  • Technology experience: Kafka, relational databases, performance tuning (JVM, Go).
  • Experience with Grafana k6 for continuous performance testing.

Onboarding Timeline

  • In the first 30 days you will:
    • Meet the team and understand its mission and vision.
    • Gain clarity on various roles and expectations.
    • Complete development environment setup.
    • Read guides and documentation, and complete mandatory training.
    • Learn company processes and benefits.
  • By 6 months you should:
    • Understand team goals and OKRs for the quarter and beyond.
    • Complete initial analysis and implementation of SRE team assignments.
    • Be comfortable with tools, systems, and processes used on a day-to-day basis.
    • Complete project work, both supervised and unsupervised.