Site Reliability Engineer

Job Type: Full-time

Location: Canada (Remote)

Role and Responsibilities:

Design, build, and maintain highly reliable and scalable systems and infrastructure to ensure optimal performance and availability.
Develop and implement automation tools and frameworks to streamline operations, deployment, and configuration management processes.
Monitor, analyze, and optimize the performance of systems, applications, and network infrastructure to proactively identify and resolve bottlenecks and issues.
Collaborate with software engineering teams to design and implement scalable, fault-tolerant architectures for new and existing applications.
Troubleshoot and resolve production incidents, conduct root cause analysis, and implement preventive measures to reduce the recurrence of similar issues.
Implement and maintain monitoring, alerting, and logging systems to ensure the early detection of anomalies and to facilitate troubleshooting and debugging efforts.
Conduct capacity planning and performance testing to ensure systems can handle current and projected traffic loads.
Define and enforce best practices, standards, and policies for system reliability, security, and performance across the organization.
Collaborate with cross-functional teams to ensure proper backup, disaster recovery, and business continuity plans are in place and tested.
Participate in on-call rotations to respond to and resolve critical incidents outside of regular business hours.
Stay up to date with industry trends, emerging technologies, and best practices in site reliability engineering, and recommend improvements to system reliability and performance.

Qualifications:

Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
4 years of experience as a Site Reliability Engineer or in a similar role, working on large-scale distributed systems.
Strong experience with Linux/Unix systems administration, networking protocols, and troubleshooting.
Proficiency in scripting languages such as Python, Bash, or Ruby for automation and tooling.
Solid understanding of cloud platforms like AWS, Azure, or Google Cloud Platform, including provisioning, monitoring, and managing infrastructure.
Experience with containerization technologies such as Docker and container orchestration frameworks like Kubernetes.
Proficiency in implementing and managing monitoring and alerting systems like Prometheus, Grafana, ELK stack, or Splunk.
Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
Strong problem-solving and troubleshooting skills, with the ability to analyze complex systems and network issues.
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams and communicate technical concepts to non-technical stakeholders.