As a Systems Engineer at ai&, you are responsible for the physical and software foundation that everything else runs on. You will plan, configure, and manage the bare-metal infrastructure that powers our data centers, from OS tuning and driver management to rack-scale GPU system provisioning. You are the person who makes sure the hardware is running at its full potential before the software teams ever touch it.

This is a hands-on role. You will work on some of the most advanced compute hardware available, including NVL72 and AMD Helios rack-scale systems, and you will be responsible for keeping them running at maximum efficiency. You think carefully about system configuration, firmware, and the low-level software decisions that compound into real performance differences at scale.

Responsibilities

Bare-Metal Infrastructure Management: Configure and manage bare-metal servers end to end. Own OS tuning, driver management, firmware upgrades, and CUDA configuration across the fleet.
Rack-Scale GPU System Operations: Lead the installation, provisioning, and continuous operation of high-density, liquid-cooled rack-scale GPU systems, including NVL72 and AMD Helios deployments.
System Architecture & Planning: Plan and architect the next generation of system configurations, including compute, storage, networking interconnects, routers, and switches. Make decisions that scale.
Performance Optimization: Tune system-level configurations to maximize hardware utilization and minimize overhead. Work closely with the kernel and inference teams to ensure software and hardware are fully aligned.
Cross-Team Collaboration: Work closely with the network, storage, and data center teams to ensure the physical infrastructure operates as a unified, high-performance system.

Requirements

Bare-Metal Operations Experience: Deep hands-on experience managing large-scale bare-metal server environments. You have configured OS, drivers, firmware, and CUDA at scale, and you know the failure modes.
GPU System Expertise: Experience provisioning and operating high-density GPU systems. Familiarity with NVIDIA NVLink, NVSwitch, and AMD MI-series architectures is a strong signal.
Low-Level Systems Knowledge: Strong understanding of Linux internals, kernel parameters, NUMA topology, PCIe configurations, and how these interact with AI workloads.
Infrastructure Judgment: You make system configuration decisions that hold up at scale. You think about maintainability, reproducibility, and failure recovery from the start.
Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.

Member of Technical Staff - Systems
