As a Senior Site Reliability Engineer at Treasure Data, you will be at the forefront of shaping the technical direction of our Kubernetes platform. This critical role involves working collaboratively with software developers across the company and maintaining a close connection with our internal customers. Your expertise will be pivotal in ensuring the reliability, scalability, and innovation of our services.
Responsibilities
- Help lead and define the technical roadmap for our Kubernetes platform to improve developer experience, reduce delivery time, and optimize cost.
- Collaborate with cross-functional teams to integrate Kubernetes into our broader service offerings.
- Partner with service teams to optimize performance, cost, and operability to address internal user needs.
- Drive stability and performance improvements with a data-driven approach.
- Troubleshoot and resolve complex technical issues in a multi-cluster Kubernetes environment.
- Be a technical leader within the SRE group by assisting and mentoring other engineers.
- Work with other technical and people leaders to define projects, track progress, identify risks, and highlight opportunities.
- Explore new technologies and enhance existing services to support our users.
- Diagnose complex networking, orchestration, and integration issues at the infrastructure and software level.
- Design and implement new services to automate everyday tasks and unlock new capabilities.
Requirements
- At least five years of working experience in a Software Engineering or Systems Engineering role with distributed systems at scale.
- Knowledge of:
  - Site Reliability Engineering and distributed systems
  - Architecting multi-cluster Kubernetes
  - Cloud computing providers like AWS, GCP, or Azure
  - Cloud networking concepts
  - One modern web development language (JavaScript/Node, Go, Python, Ruby, etc.)
  - Modern SaaS software development practices (CI/CD, GitOps, software testing, release workflow automation)
- Excellent English communication skills, with an ability to articulate technical concepts to non-SREs.
- Strong collaboration skills and experience in working with diverse and distributed teams.
Nice to haves
While not specifically required, tell us if you have any of the following.
- A history of involvement in the open-source community.
- Knowledge of high-scale data platforms (e.g. Hadoop) or relational databases (e.g. PostgreSQL).
- Experience speaking and/or writing in Japanese.
- Understanding of agile development practices (e.g. Scrum, Kanban).