As a Systems Engineer at ai&, you are responsible for the physical and software foundation that everything else runs on. You will plan, configure, and manage the bare-metal infrastructure that powers our data centers — from OS tuning and driver management to rack-scale GPU system provisioning. You are the person who makes sure the hardware is running at its full potential before the software teams ever touch it.
This is a hands-on role. You will work on some of the most advanced compute hardware available, including NVL72 and AMD Helios rack-scale systems, and you will be responsible for keeping them running at maximum efficiency. You think carefully about system configuration, firmware, and the low-level software decisions that compound into real performance differences at scale.
Responsibilities
- Bare-Metal Infrastructure Management: Configure and manage bare-metal servers end to end. Own OS tuning, driver management, firmware upgrades, and CUDA configuration across the fleet.
- Rack-Scale GPU System Operations: Lead the installation, provisioning, and continuous operation of high-density, liquid-cooled rack-scale GPU systems including NVL72 and AMD Helios deployments.
- System Architecture & Planning: Plan and architect the next generation of system configurations including compute, storage, networking interconnects, routers, and switches. Make decisions that scale.
- Performance Optimization: Tune system-level configurations to maximize hardware utilization and minimize overhead. Work closely with the kernel and inference teams to ensure software and hardware are fully aligned.
- Cross-Team Collaboration: Work closely with the network, storage, and data center teams to ensure the physical infrastructure operates as a unified, high-performance system.
Requirements
- Bare-Metal Operations Experience: Deep hands-on experience managing large-scale bare-metal server environments. You have configured OS, drivers, firmware, and CUDA at scale and you know the failure modes.
- GPU System Expertise: Experience provisioning and operating high-density GPU systems. Familiarity with NVIDIA NVLink, NVSwitch, and AMD MI-series architectures is a strong signal.
- Low-Level Systems Knowledge: Strong understanding of Linux internals, kernel parameters, NUMA topology, PCIe configurations, and how these interact with AI workloads.
- Infrastructure Judgment: You make system configuration decisions that hold up at scale. You think about maintainability, reproducibility, and failure recovery from the start.
- Great Team Spirit: You bring a mission-driven approach to engineering and value clear communication, hands-on execution, and collective success over working in silos.