Site Reliability Engineer
Job Type: Full-time
Location: Canada (Remote)
Role and Responsibilities:
- Design, build, and maintain highly reliable and scalable systems and infrastructure to ensure
optimal performance and availability.
- Develop and implement automation tools and frameworks to streamline operations, deployment, and
configuration management processes.
- Monitor, analyze, and optimize the performance of systems, applications, and network infrastructure
to proactively identify and resolve bottlenecks and issues.
- Collaborate with software engineering teams to design and implement scalable, fault-tolerant
architectures for new and existing applications.
- Troubleshoot and resolve production incidents, conduct root cause analysis, and implement
preventive measures so that similar issues do not recur.
- Implement and maintain monitoring, alerting, and logging systems to ensure the early detection of
anomalies and to facilitate troubleshooting and debugging efforts.
- Conduct capacity planning and performance testing to ensure systems can handle current and future
loads and traffic.
- Define and enforce best practices, standards, and policies for system reliability, security, and
performance across the organization.
- Collaborate with cross-functional teams to ensure proper backup, disaster recovery, and business
continuity plans are in place and tested.
- Participate in on-call rotations to respond to and resolve critical incidents outside of regular
business hours.
- Stay up to date with industry trends, emerging technologies, and best practices in site reliability
engineering, and recommend improvements to system reliability and performance.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent
experience).
- 4 years of experience as a Site Reliability Engineer or in a similar role, working on large-scale
distributed systems.
- Strong experience with Linux/Unix systems administration, networking protocols, and
troubleshooting.
- Proficiency in scripting languages such as Python, Bash, or Ruby for automation and tooling.
- Solid understanding of cloud platforms like AWS, Azure, or Google Cloud Platform, including
provisioning, monitoring, and managing infrastructure.
- Experience with containerization technologies such as Docker and container orchestration frameworks
like Kubernetes.
- Proficiency in implementing and managing monitoring and alerting systems like Prometheus, Grafana,
ELK stack, or Splunk.
- Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
- Strong problem-solving and troubleshooting skills, with the ability to analyze complex systems and
network issues.
- Excellent communication and collaboration skills, with the ability to work effectively with
cross-functional teams and communicate technical concepts to non-technical stakeholders.