We’re currently looking for a Site Reliability Engineer (SRE) to join our team and help us scale. This is an incredible opportunity to grow with a team of experts passionate about creating robust, scalable, and efficient systems.

We are looking for an SRE interested in making a big impact, both by collaborating closely with internal stakeholders (dev teams, product managers) and by demonstrating technical leadership (sharing technical knowledge and new ideas, and participating in design discussions).

We are also looking for someone friendly and easy to get along with – we have a very welcoming and positive engineering culture, and we value that highly!

Responsibilities

The SRE team is looking for a software engineer who can maintain the stability of systems that handle heavy traffic. The role involves automating systems, responding to system failures, and carrying out development work to improve the reliability, performance, and scalability of those systems.
Specific work responsibilities include the following:
- Contribute optimizations to the backend codebase
- Assist in designing, implementing, and maintaining our multi-cloud infrastructure, specifically:
  - GCP Cloud Run, Networking, GKE, Cloud SQL
  - AWS IVS and CloudFront
- Work with Docker containers and orchestration tools; knowledge of Kubernetes is a plus
- Utilize Terraform for infrastructure as code deployments and tool configuration
- Utilize CircleCI for continuous integration and delivery pipelines
- Fine-tune application performance monitoring with Datadog, proactively identifying and resolving issues as part of a department-wide on-call rotation
- Own production releases together with development teams
- Participate in incident management, post-mortem analysis, and system optimization
- Develop and maintain documentation on system configurations, operations, and troubleshooting procedures
- Fine-tune database deployments (MySQL with the InnoDB storage engine)

Requirements

- Bachelor's degree in Computer Science or equivalent practical experience
- Ability to work independently, learn quickly, and proactively collaborate
- Strong written and verbal communication skills
- Strong problem-solving and troubleshooting skills
- Ability to work calmly under pressure (such as during a production outage)
- Experience with at least one public cloud platform
- Experience with at least one programming language (preferably Python, Golang, or JavaScript/TypeScript)
- Familiarity with Linux administration and shell scripting
- Conversational Japanese skills

Nice to haves

While not specifically required, tell us if you have any of the following.

- Experience with monitoring and infrastructure tooling (Datadog, PagerDuty, Terraform, CircleCI)
- GCP design and administration experience
- Understanding of networking principles and protocols
- Ability to write clear, concise, and informative documentation
- Experience in developing and operating large-scale web applications
- Experience collaborating with product managers and designers
- AWS design and administration experience
- Kubernetes design and administration experience
- Experience with Docker, containerization, and microservice architectures
- Experience with Stripe or other payment processors
- Experience designing, implementing, and maintaining deployment tooling (canary, blue/green)
- Strong communication skills in Japanese, both written and verbal
- Passion for efficiency, scalability, and technology in general
- Authorization to work in Japan without requiring visa sponsorship

Compensation

¥7,000,000 to ¥11,000,000 annually.

Site Reliability Engineer

About THECOO
