Assignment Description

Our client is seeking a Senior SRE (Site Reliability Engineer) for the Platform Operations and Support team. The selected candidate will manage and dispatch incidents and service requests, provide high-quality support, and drive troubleshooting efforts and Root Cause Analyses (RCAs). As an advisor to the Development teams, the Senior SRE will play a crucial role in maintaining platform availability, reducing time to market for new features, and enhancing performance. The position is pivotal to troubleshooting and end-to-end quality assurance, with a focus on understanding, monitoring, and improving the production system to actively prevent future incidents. The role also involves leading continuous improvement and innovation within the team.

Overview of Responsibilities

System Support & Troubleshooting:

  • Guide and coordinate junior colleagues within the team.
  • Assist in the initial technical analysis for production incidents.
  • Support the development team in building capabilities for alerts and monitoring.
  • Conduct code reviews for reported cases, develop fixes, and oversee their delivery.

Infrastructure Automation and Configuration Management:

  • Develop and maintain automation tools, scripts, and configuration management systems.
  • Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes.
  • Collaborate with development and operations teams to automate build, test, and deployment processes.

Reliability Engineering and Resilience:

  • Design and implement systems and processes to enhance infrastructure reliability and resilience.
  • Continuously improve system reliability by analyzing logs and trends, identifying areas for improvement, and implementing preventative measures.

System Monitoring and Incident Response:

  • Develop and manage monitoring tools and systems to track software and infrastructure health, performance, security, and availability.
  • Set up alerts, dashboards, and metrics for proactive detection and response to incidents.
  • Investigate and diagnose root causes of incidents and work towards resolution in a timely manner.

Continuous Improvement and Collaboration:

  • Drive a culture of continuous improvement by identifying areas for automation and efficiency.
  • Document procedures, incidents, and best practices for knowledge sharing and team efficiency.
  • Stay updated on industry trends and emerging technologies to propose innovative solutions.
  • Collaborate closely with cross-functional teams to ensure smooth operation of systems.

Required Skills & Experience:

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience), and at least 5 years of experience in DevOps or SRE roles.
  • Proficiency in scripting/programming languages such as Python and Bash.
  • Experience with cloud platforms (AWS preferred).
  • Experience in DevOps practices, CI/CD, and monitoring tools.
  • Experience with automation tools and configuration management frameworks such as Terraform, AWS CDK, Puppet, or Ansible.
  • Strong troubleshooting and problem-solving skills with keen attention to detail.
  • Excellent communication and collaboration skills to work effectively in a cross-functional team environment.
  • Strong experience in system administration, infrastructure management, or site reliability engineering.

Additional Information Specifically for This Job Request:

  • A solid understanding of distributed systems and microservice architecture.
  • Technical background in IT system development/system administration.
  • Software engineering background and/or experience in tool development (e.g., Python, JavaScript, Java, or Kotlin).
  • Experience with monitoring and observability tools such as Prometheus and Grafana.
  • Good knowledge of SLAs, SLOs, and SLIs, and of how to use metrics to measure service levels against objectives.
  • Experience with centralized logging platforms (e.g., Elastic stack, Splunk, Datadog).
  • Experience with container orchestration (e.g., Kubernetes).

Desired Attributes:

  • A service-minded team player with a quality-driven approach.

Details

Reference: 49157

Location: Göteborg

Extent: 100%

Start date: 2024-05-20

End date: 2025-05-20

This position is no longer open for applications.