Description
With a career at The Home Depot, you can be yourself and also be part of something bigger.
Position Overview:
The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background in reliability reviews, performance engineering practices, production engineering, and operational support, with an emphasis on DevOps principles and GCP expertise.
Responsibilities:
- Leadership & Management:
- Lead and mentor a team of Site Reliability Engineers
- Foster a culture of continuous improvement and innovation
- Collaborate with cross-functional teams to align SRE practices with business objectives
- Reliability & Performance:
- Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability, particularly in GCP environments
- Implement and promote performance engineering practices to ensure optimal system performance on GCP
- Develop and maintain service level objectives (SLOs) and error budgets
- Production Engineering & Operational Support:
- Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability, leveraging GCP services and best practices
- Manage incident response and post-incident reviews to minimize downtime and improve system resilience
- Implement monitoring, alerting, and observability solutions to proactively identify and address issues
- Develop and maintain runbooks and playbooks for common operational tasks
- Coordinate with security teams to ensure compliance with security policies and best practices
- DevOps & Continuous Improvement:
- Drive DevOps initiatives to improve collaboration between development and operations teams, with a focus on GCP-native tools and services
- Implement and maintain CI/CD pipelines to streamline deployment processes in GCP environments
- Identify and implement automation opportunities to reduce manual tasks and improve efficiency
- Promote the use of Infrastructure as Code (IaC) to manage and provision cloud resources
- Continuously evaluate and integrate new tools and technologies to enhance DevOps practices
- Release Management:
- Implement and maintain release management best practices to minimize disruptions and maximize system stability
- Collaborate with DevOps teams to integrate release management into CI/CD pipelines
- Oversee release schedules, ensuring minimal impact on business operations
- Ensure a rigorous release readiness process is in place, including reviews and post-release retrospectives
- Maintain a release calendar and communicate release plans to stakeholders
- Strategic Planning:
- Create and maintain a strategic roadmap for SRE initiatives, aligning with business goals and technological advancements
- Refine and standardize Standard Operating Procedures (SOPs) to enhance operational efficiency and consistency
- Address customer pain points by developing and implementing solutions that improve user experience and system reliability
- Engage with stakeholders to understand their needs and incorporate feedback into strategic planning and execution
- Monitor industry trends and best practices to ensure the SRE team remains at the forefront of technology
Experience:
- Bachelor’s degree in Computer Science, Engineering, or a related field
- Strong problem-solving and analytical abilities
- Excellent communication and collaboration skills
- 4-6 years of relevant work experience, including significant experience with GCP
- Extensive experience with cloud infrastructure, GCP services and architecture
- Proven track record of managing and optimizing large-scale systems on GCP
- Proven ability to effectively communicate with individuals at all levels of the organization
- Ability to maintain relationships and negotiate with vendors
- Ability to operate in and leverage resources in a matrixed environment
- Ability to analyze and present data to support ideas
- Ability to communicate clearly with all levels of the organization