As the performance gap between GPUs and CPUs keeps widening, kernel launch overhead is becoming a first-order bottleneck for many ML workloads. NVIDIA introduced CUDA Graphs to mitigate this ...