Site Reliability Engineer - Observability

KOMOJU Musashino-Shi, Tokyo
  • 💴 No salary range given
  • 🏡 Fully remote (within Japan)
  • 🧪 3+ years experience required
  • 💬 No Japanese required
  • 🗾 Japan residents only
  • 🧳 No relocation support

About KOMOJU

The leading cross-border payment gateway for Japan. We power payments for companies like video game distribution platform Steam and the popular mobile app TikTok.

Key benefits

  • Developer-centric, inclusive culture
  • International at our core
  • Generous holiday policy

About the position

As our systems grow in complexity, scale, and traffic, maintaining their reliability and availability becomes increasingly challenging, and increasingly critical. We’re looking for a Site Reliability Engineer (SRE) with a focus on observability to help us meet these demands.

In this role, you’ll be at the forefront of ensuring that our infrastructure is not just running, but understandable and measurable. Observability is a core pillar of our reliability strategy: it’s how we detect issues before they impact our merchants and users, quickly understand the root causes of incidents, and continuously improve our systems’ performance and reliability.

You’ll design and evolve our observability platform, including metrics, logging, tracing, and alerting, and partner with development teams to embed observability into every stage of the software lifecycle. Your work will directly impact our ability to scale confidently and respond to incidents swiftly.

This is a key role for someone who wants to build resilient systems, empower teams with actionable insights, and make a real difference in how we operate at scale.

While we are a remote-first company, this position is based in Tokyo, and we expect candidates to be willing to relocate to Japan.

Responsibilities

  • Design, implement, and maintain our observability stack (metrics, logging, tracing, dashboards).
  • Define and monitor SLIs/SLOs to ensure service health and reliability.
  • Collaborate with engineering teams to instrument applications for better visibility.
  • Build and maintain dashboards and alerts that provide actionable insights and minimize alert fatigue.
  • Troubleshoot system performance and reliability issues using observability data.
  • Educate and guide engineering teams on best practices in monitoring, alerting, and incident response.
  • Contribute to postmortems and continuously improve system transparency and resiliency.

Requirements

  • 3+ years in SRE roles.
  • Hands-on experience with observability tools, preferably Datadog.
  • Proficiency in Terraform.
  • Background in software development.
  • Proficiency in at least one scripting or programming language (Ruby/Rails, Python, Go, Shell Script, etc.).
  • Experience working with AWS.
  • Familiarity with monitoring design principles: RED, USE, SLI/SLO, alert tuning.
  • Ability to analyze logs, metrics, and traces to diagnose issues and identify trends.

Nice to haves

While not specifically required, tell us if you have any of the following.

  • Knowledge of CI/CD pipelines and integrating observability into build and deploy processes.
  • Familiarity with incident response, on-call rotations, and post-incident reviews.
  • Business-level Japanese.

Compensation

Based on experience and skill level. Includes a rough estimate of profit share.

Meet KOMOJU's Developers

Head of Customer Engineering Makoto Mizukami describes the unconventional candidates his unique team is looking for.

Nicole joined KOMOJU in 2020, and worked her way up to be technical lead of the merchant management team. She shares her journey, how KOMOJU supported her career growth and how the company is adapting to its growing needs.

Nigel was fresh out of college when he joined KOMOJU as a developer. He's now risen to tech lead, where he's helped build out their payment platform while maintaining a healthy work-life balance.

Muhammad Denaw, Senior Site Reliability Engineer at KOMOJU, talks about his work and shares how KOMOJU's trust and support propelled him to a promotion.
