Member of Technical Staff - Inference Optimization

AI& · Yokohama, Kanagawa · May 17, 2026
  • 💴 No salary range given
  • 🏡
    Partially remote
  • 🌏
    Apply from abroad
    Relocate to Japan
  • 💬
    No Japanese required
    Business English
  • 🧪
    Intermediate level
    Unspecified years of experience

About AI&

AI& is a vertically integrated AI platform from Japan for the global market. We recently launched officially with $50M in seed funding and more than $2B in committed infrastructure capital.

About the position

As a Kernel Optimization Engineer, your objective is to extract every last bit of performance from heterogeneous GPU hardware. This means going below the framework layer: writing, profiling, and tuning the custom CUDA and ROCm/HIP kernels that sit at the heart of our inference and training stack. You will work across NVIDIA and AMD silicon, understanding the deep architectural differences between the two and writing code that is optimal for each.

This is not a role about deploying existing kernels. It is about authoring them. You will identify bottlenecks in the execution loop, including memory bandwidth saturation, warp divergence, occupancy limits, and cache thrashing, and build solutions from first principles. You will work closely with our inference and serving team to ensure that the kernels you build translate into real-world performance gains — but your domain is the kernel layer and everything below it.

The scope spans attention mechanisms, quantization primitives, custom activation functions, fused operators, and the communication kernels that tie multi-GPU systems together. The ideal candidate has a hardware-first intuition: they think in warps, tiles, and memory hierarchies before they think in frameworks. They are equally comfortable reading PTX and roofline charts. And they are never done optimizing.
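
To make "thinking in warps, tiles, and memory hierarchies" concrete, here is a minimal sketch (illustrative only; the kernel name, FP32 row-max use case, and one-warp-per-row launch shape are assumptions, not details from this posting) of the kind of warp-level primitive that sits inside an attention softmax: a per-row maximum computed entirely with shuffle intrinsics, so the reduction never touches shared or global memory.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Illustrative warp-level row-max reduction, launched as <<<rows, 32>>>.
__global__ void row_max_kernel(const float* __restrict__ x,
                               float* __restrict__ out,
                               int cols) {
    int row  = blockIdx.x;      // one warp (one 32-thread block) per row
    int lane = threadIdx.x;
    const float* row_ptr = x + (size_t)row * cols;

    // Each lane strides across the row with a private running max.
    float m = -FLT_MAX;
    for (int c = lane; c < cols; c += 32) {
        m = fmaxf(m, row_ptr[c]);
    }

    // Butterfly reduction: after five shuffle steps every lane holds the row max.
    for (int offset = 16; offset > 0; offset >>= 1) {
        m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, offset));
    }

    if (lane == 0) {
        out[row] = m;
    }
}
```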

Responsibilities

  • Custom Kernel Development: Design and implement high-performance kernels for core AI primitives including GEMM, attention, normalization, and convolution. Own the full cycle from profiling to production deployment across LLM inference, training, and generative model workloads.
  • Cross-Vendor Hardware Optimization: Develop deep expertise across NVIDIA and AMD GPU architectures. Understand the micro-architectural differences, including memory subsystems, scheduler behavior, and cache hierarchies, and write kernels that are genuinely optimal for each target. Optimize across heterogeneous compute units including SIMD, matrix engines, and DMA.
  • Attention & Linear Algebra Primitives: Build and tune fused attention kernels (Flash Attention variants, MLA, paged attention), GEMM primitives, and quantized compute paths (INT8, FP8, AWQ, GPTQ) that push the hardware to its limits.
  • Precision & Numerical Stability: Prototype and evaluate precision formats (FP16, BF16, and FP8 variants such as E5M2) and techniques such as stochastic rounding. Understand the accuracy and performance trade-offs at a deep level and make principled decisions about where each format belongs.
  • Profiling & Bottleneck Analysis: Use Nsight Compute, rocprof, Perfetto, VTune, and custom instrumentation to identify and eliminate performance bottlenecks. Translate profiling data into concrete architectural improvements.
  • Operator Fusion: Identify opportunities to fuse multi-step operations into single kernel launches, reducing memory round-trips and kernel launch overhead across the inference and training execution graphs (a minimal sketch follows this list).
  • Communication Kernel Optimization: Optimize collective communication primitives (AllReduce, AllGather, ReduceScatter) for multi-GPU and multi-node topologies, working closely with the infrastructure team.
  • Compiler & Runtime Integration: Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines. Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team where relevant.
  • Cross-Team Collaboration: Work closely with the inference and serving team to ensure kernel-level performance translates into system-level gains. Share profiling insights, align on optimization priorities, and contribute to architectural decisions across teams.
  • Technical Leadership: Maintain a high level of personal agency. Write production code, review kernel implementations, and contribute to architectural decisions in a flat, fast-moving team environment.
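
As an illustration of the operator fusion item above, here is a minimal sketch (the fused_bias_silu name, FP32 types, and SiLU activation are assumptions for illustration, not details from the posting) that folds a bias add and an activation into one launch, so the intermediate tensor stays in registers instead of making a round trip through HBM between two separate kernels.

```cuda
#include <cuda_runtime.h>

// Illustrative fused bias-add + SiLU kernel; replaces a bias kernel followed
// by an activation kernel, saving one global-memory round trip and one launch.
__global__ void fused_bias_silu(const float* __restrict__ in,
                                const float* __restrict__ bias,
                                float* __restrict__ out,
                                int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;

    // Bias is broadcast along rows; both the add and the activation happen in
    // registers, so the biased value is never written back to global memory.
    float v = in[idx] + bias[idx % cols];
    out[idx] = v / (1.0f + __expf(-v));   // SiLU(v) = v * sigmoid(v)
}
```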

Requirements

  • Deep Kernel Authorship: You have written production CUDA or ROCm kernels from scratch. You understand warp execution, shared memory bank conflicts, occupancy, and instruction-level parallelism at an intuitive level. Strong proficiency in C++11 or higher, CUDA, Triton, and ideally LLVM/MLIR.
  • Hardware Architecture Knowledge: Strong familiarity with NVIDIA Hopper/Ampere and AMD CDNA architectures. You know the differences between HBM bandwidth profiles, cache sizes, and execution units, and you write code that reflects that knowledge. Deep understanding of memory layout, vectorization, thread and block scheduling, and cache behavior.
  • Precision & Numerical Fluency: Solid grasp of numerical stability, mixed-precision arithmetic, and modern precision formats. Experience making principled trade-offs between precision and performance in production systems.
  • Profiling Fluency: Comfortable with Nsight Compute, rocprof, Perfetto, VTune, and roofline modeling (a worked roofline example follows this list). You do not guess where the bottleneck is. You measure it.
  • Parallel Programming Breadth: Strong background across parallel programming models including CUDA, Triton, SYCL, OpenCL, and OpenMP. Experience optimizing irregular algorithms such as sparse linear algebra or graph computations.
  • Systems Thinking: Ability to reason about how individual kernels compose into larger execution graphs, and how kernel-level decisions propagate up through the inference or training stack.
  • Great Team Spirit: A mission-driven approach to engineering that values clear communication, hands-on execution, and collective success over individual silos.
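
A quick sketch of the roofline reasoning referenced above (the device peaks and the 4096³ GEMM size are hypothetical placeholders to be replaced with measured numbers): compare a kernel's arithmetic intensity against the machine balance point to decide whether it is memory- or compute-bound before touching the code.

```cuda
#include <cstdio>

int main() {
    // Hypothetical device peaks; substitute measured values for the target GPU.
    const double peak_tflops = 60.0;     // FP32 TFLOP/s
    const double peak_bw_gbs = 2000.0;   // HBM GB/s
    const double balance = (peak_tflops * 1e12) / (peak_bw_gbs * 1e9); // FLOP per byte

    // FP32 GEMM C[M,N] = A[M,K] * B[K,N]: count one read of A and B, one write of C.
    const double M = 4096, N = 4096, K = 4096;
    const double flops = 2.0 * M * N * K;
    const double bytes = 4.0 * (M * K + K * N + M * N);
    const double intensity = flops / bytes;  // FLOP per byte moved

    std::printf("machine balance: %.1f FLOP/B, kernel intensity: %.1f FLOP/B\n",
                balance, intensity);
    std::printf("%s-bound at this size\n",
                intensity < balance ? "memory" : "compute");
    return 0;
}
```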