At PayPay, we’re constantly improving our systems and processes to support PayPay’s exponential growth. As SREs at PayPay, we strive to empower our developers with the right tools and to ensure high availability and top-level performance so that our users have a great experience with our services.
Considering PayPay’s growth, we are looking for experienced SREs who can deliver insights into system bottlenecks and ensure system reliability and scalability as the number of services our company offers increases.
We are looking for individuals who bring informed and unique viewpoints, enjoy collaborating with a cross-functional team, and actively push boundaries to develop reliable and scalable solutions and positive user experiences.
Responsibilities
- Analyze current technologies used in the company and develop monitoring and notification tools to improve observability and visibility
- Ensure system stability by pre-emptively verifying failure scenarios, and implement solutions to reduce MTTR
- Develop solutions to improve system performance with a focus on high availability, scalability and resilience
- Establish SLAs for service uptime and integrate them with telemetry and alerting platforms to track and improve the reliability of systems
- Implement industry best practices for system development, configuration management and system deployment
- Ensure seamless flow of information between teams by documenting knowledge gained
- Stay up to date on modern technologies and trends, and advocate for their inclusion in products where they add value
Requirements
- Experience troubleshooting and tuning high-performance microservice architectures running on Kubernetes and AWS in highly available production environments
- 5+ years of experience in software development in Python, Java, Go, etc., with strong fundamentals in data structures, algorithms, problem solving and complexity analysis
- The SRE selection process includes a coding challenge
- Curious and proactive in finding and addressing performance bottlenecks and problem areas in scalability and resilience
- Experience with observability tools and gathering data
- Knowledge of databases such as RDS, NoSQL stores, and distributed databases like TiDB
- Excellent communication skills and a collaborative, get-things-done attitude
- Enjoy taking on a challenge and driving it to a conclusion
Nice to haves
These are not required, but tell us if you have any of the following.
- Container image management and optimization
- Experience in large distributed system architecture and capacity planning
- Understanding of IaC and automation tools such as Terraform and CloudFormation
- Background in SRE/DevOps concepts and implementation
- Experience managing monitoring tools like CloudWatch, New Relic and Prometheus, and reporting with Google BigQuery and Looker Studio
- In-depth knowledge of web technologies such as Cloudflare and Nginx
Compensation
7 to 14 million JPY annually.