Member of Technical Staff - Networking

AI& Yokohama, Kanagawa May 17 2026
  • 💴 No salary range given
  • 🏡
    Partially remote
  • 🌏
    Apply from abroad
    Relocate to Japan
  • 💬
    No Japanese required
    Business English
  • 🧪
    Intermediate level
    Unspecified years of experience

About AI&

A vertically integrated AI platform from Japan for the global market. We officially launched recently with $50M in seed funding and more than $2B in committed infrastructure capital.

About the position

As a Network Engineer at ai&, you are the domain expert on the lossless networking fabrics that tie our GPU fleet together. AI at scale lives and dies on the network. Collective communication operations (AllReduce, AllGather, ReduceScatter) are on the critical path of every distributed training and inference workload we run. Your job is to make sure the fabric is fast, lossless, and never the bottleneck.

You will work across RoCE v2 and InfiniBand fabrics, tune NCCL and network interfaces, and own the end-to-end network performance of our compute clusters. You will work closely with the systems, kernel, and inference teams to ensure that what gets built at the physical layer translates directly into performance at the workload layer.

Responsibilities

  • Lossless Fabric Design & Operations: Design, deploy, and operate lossless networking fabrics across our data centers. Own RoCE v2 and InfiniBand (NDR/XDR) deployments end to end.
  • NCCL & Interface Tuning: Tune NCCL, NICs, and DPUs to guarantee maximum bandwidth and zero packet loss for distributed AI workloads. Own the performance of collective communication operations across the fleet.
  • Network Architecture: Design the network architecture for new data center deployments. Make topology, switch, and cabling decisions that scale from current clusters to future multi-site deployments.
  • Performance Monitoring & Optimization: Instrument the network for observability. Proactively identify and eliminate bottlenecks before they affect workloads. Own network performance benchmarks and drive continuous improvement.
  • Cross-Team Collaboration: Work closely with the systems, storage, and ML infrastructure teams to ensure the network fabric supports the demands of distributed training and inference at every scale.

Requirements

  • AI Networking Expertise: Deep experience designing and operating lossless AI networking fabrics. You have worked with InfiniBand and RoCE v2 at scale and you understand the trade-offs between them.
  • NCCL & Collective Communications: Hands-on experience tuning NCCL for distributed AI workloads. You understand how collective communication patterns interact with network topology and you know how to optimize for both bandwidth and latency.
  • NIC & DPU Proficiency: Experience configuring and tuning high-performance NICs and DPUs, such as the NVIDIA ConnectX and BlueField series.
  • Network Architecture Judgment: You make network design decisions that hold up at scale. Fat-tree topologies, rail-optimized designs, congestion control: you have an informed view on all of it.
  • Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.