Member of Technical Staff - Inference Serving

AI& · Yokohama, Kanagawa · May 17, 2026
  • 💴 No salary range given
  • 🏡 Partially remote
  • 🌏 Apply from abroad / Relocate to Japan
  • 💬 No Japanese required; Business English
  • 🧪 Intermediate level; unspecified years of experience

About AI&


A vertically integrated AI platform from Japan, built for the global market. We recently launched with $50M in seed funding and more than $2B in committed infrastructure capital.

About the position

As an inference & serving engineer, your objective is to build a high-performance, multi-tenant serving stack that squeezes maximum utilization out of heterogeneous hardware. This involves navigating the trade-offs between state-of-the-art inference frameworks and engines, selecting and optimizing the right runtime for the right workload. The scope of work is not limited to Large Language Models; it extends to the frontier of generative AI, including high-throughput video generation and complex multimodal systems, where memory pressure and compute requirements are significantly more demanding.

Beyond just deploying models at scale, this role is responsible for building a robust system that bridges the gap between boutique, high-performance clusters and massive, multi-node deployments as the company grows. This requires a deep understanding of the “Inference Triangle”—constantly tuning the stack to find the optimal equilibrium between low latency (TTFT/ITL), high throughput, and inference quality (precision/quantization). The ideal candidate is a hands-on engineer who views the entire GPU fleet as a single, programmable compute fabric and is eager to get their hands dirty at every level of the stack.
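For concreteness, the two latency terms above (TTFT and ITL) are typically derived from per-token arrival timestamps collected while streaming a response. A minimal sketch, assuming a hypothetical `RequestTrace` record (the names and structure here are illustrative, not tied to any particular serving framework):

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (seconds) collected while streaming one request."""
    submitted_at: float
    token_times: list[float]  # arrival time of each generated token

def ttft(trace: RequestTrace) -> float:
    """Time To First Token: delay from submission until the first token."""
    return trace.token_times[0] - trace.submitted_at

def mean_itl(trace: RequestTrace) -> float:
    """Mean Inter-Token Latency: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)

trace = RequestTrace(submitted_at=0.0, token_times=[0.25, 0.30, 0.35, 0.40])
print(f"TTFT: {ttft(trace):.2f}s, mean ITL: {mean_itl(trace):.2f}s")
```

Tuning the triangle means accepting that optimizations which improve one vertex (e.g. larger batches for throughput) often degrade another (per-token latency).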

Responsibilities

  • Runtime Selection & Deep Optimization: Lead the evaluation, integration, and continuous tuning of diverse inference frameworks to ensure best-in-class performance across LLM, video, and multimodal workloads.
  • Latency & Throughput Engineering: Own the end-to-end performance profile of the model lifecycle, implementing advanced strategies such as disaggregated prefill/decode, speculative decoding, and continuous batching to minimize TTFT and maximize tokens-per-second.
  • Scalable Systems Evolution: Design and implement serving architectures that function seamlessly on small experimental clusters while providing a clear, robust path to massive-scale, multi-node deployments.
  • Advanced Memory & Cache Orchestration: Implement and optimize memory management techniques to maximize KV-cache reuse and minimize redundant computations in multi-turn or high-concurrency scenarios.
  • Day 0 Model Support: Work with the broader ecosystem to craft a Day-0 model support strategy, ensuring our stack provides stable, high-performance support for new models as soon as they are released.
  • Cross-Stack Integration: Collaborate with the Backend/Gateway and Compute Orchestration teams to ensure the inference engine’s telemetry, failure domains, and lifecycle management are perfectly aligned with the global load balancer and API layers.
  • Hands-on Technical Leadership: Maintain a high level of personal agency by writing production code, debugging complex distributed system “hangs,” and contributing to architectural decisions in a flat, fast-moving team environment.
  • Collaborative Communication: Function as a primary technical peer to engineering leads, translating complex hardware and model constraints into clear product and infrastructure strategies.
  • Inference Strategy & Trade-offs: Define the path forward when balancing model precision and quantization against the physical limits of HBM bandwidth and compute throughput.
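To illustrate the continuous-batching idea named above, here is a toy step-level scheduler: new requests join the running batch between decode steps, and finished sequences free their slots immediately instead of waiting for the whole batch to drain. This is a hedged sketch under simplifying assumptions (abstract "slots" rather than real paged KV-cache memory, and known generation lengths), not any particular engine's implementation:

```python
from collections import deque

def continuous_batching(requests, max_batch: int):
    """Toy scheduler: each request is (name, tokens_to_generate).

    Returns a log of which requests ran at each decode step.
    """
    waiting = deque(requests)
    running: dict[str, int] = {}   # name -> remaining tokens
    log = []
    while waiting or running:
        # Admit new requests as soon as slots free up -- the key difference
        # from static batching, which waits for the whole batch to finish.
        while waiting and len(running) < max_batch:
            name, tokens = waiting.popleft()
            running[name] = tokens
        log.append(sorted(running))
        for name in list(running):  # one decode step for every running seq
            running[name] -= 1
            if running[name] == 0:
                del running[name]   # slot freed mid-flight
    return log

log = continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch=2)
print(log)  # [['a', 'b'], ['b', 'c'], ['b', 'c']]
```

Note how "c" is admitted at step 2, immediately after "a" finishes; under static batching it would have waited until "b" also completed.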

Requirements

  • Inference Engine: Deep experience with the internals of modern runtimes; a prominent contributor to inference engine ecosystems, whether OSS projects or proprietary engines at top-tier AI labs.
  • Multimodal Domain Knowledge: Understanding of the specific challenges involved in serving Large Language Models alongside Video and Vision-based generative models.
  • Scale-First Engineering: A track record of building and managing distributed systems that have evolved from small-scale proofs-of-concept to large-scale production deployments.
  • Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.
