
15-418 Final Project Proposal

CuGraphOpt: End-to-End CUDA Accelerated Factor Graph Optimization

URL: https://leolrg.github.io/cugraphopt/
Team: leolrg (solo project, exception requested).

Summary

I will implement a minimal, fully GPU-accelerated non-linear least-squares solver (Gauss-Newton) natively in CUDA, strictly scoped to 3D Pose Graph Optimization (PGO). The system will read standard pose-graph datasets, perform parallel linearization and sparse-matrix assembly on the device, and solve the resulting linear system with a custom CUDA Preconditioned Conjugate Gradient (PCG) solver to minimize host-device data transfer.

Background

Factor graph optimization is the mathematical backend for modern robotics and SLAM. The core bottleneck involves iteratively evaluating the error and Jacobians for thousands of constraints, assembling them into a massive global sparse Hessian matrix ($H = J^T J$) and gradient vector ($b = J^T e$), and solving the linear system $H \Delta x = -b$. Currently, state-of-the-art libraries perform much of this on the CPU. Moving the entire inner loop to the GPU avoids the severe latency of transferring the Hessian matrix across the PCIe bus every iteration.
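The iterate-linearize-solve structure above can be sketched in miniature. The example below is a deliberately tiny 1-D Gauss-Newton loop (fitting x to noisy observations of f(x) = x^2), not the pose-graph solver itself; the names and problem are illustrative, but the shape of each iteration, forming H = J^T J and b = J^T e, then solving H dx = -b, is exactly the loop the GPU pipeline will run per edge.

```cpp
#include <cmath>
#include <vector>

// Minimal 1-D illustration of the Gauss-Newton loop: fit x to noisy
// observations y_i of f(x) = x^2 by repeatedly linearizing, forming
// H = J^T J and b = J^T e, and solving H dx = -b. In the real solver,
// each term of these sums is one pose-graph edge, evaluated in parallel.
double gauss_newton_1d(const std::vector<double>& y, double x) {
    for (int iter = 0; iter < 20; ++iter) {
        double H = 0.0, b = 0.0;
        for (double yi : y) {
            double e = x * x - yi;   // residual for one observation
            double J = 2.0 * x;      // Jacobian de/dx
            H += J * J;              // H = J^T J
            b += J * e;              // b = J^T e
        }
        double dx = -b / H;          // the linear solve (scalar here)
        x += dx;
        if (std::fabs(dx) < 1e-12) break;  // converged
    }
    return x;
}
```

In the full system the accumulation over observations becomes a parallel reduction over edges, and the scalar division becomes the PCG solve.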

By narrowing the scope to PGO, every variable is uniformly structured as a 3D pose, allowing us to focus on the parallel systems challenges of memory coalescing and contention rather than handling dynamic, heterogeneous sensor types.
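Because every variable is a 3D pose, the state can be a flat array of fixed-size blocks and every update a uniform 6-vector (translation increment plus axis-angle rotation increment). The layout and names below are a sketch of that idea, assuming a (w, x, y, z) quaternion convention with left-multiplicative updates, not a committed design.

```cpp
#include <cmath>

// Illustrative fixed-size state block for PGO: every variable is a 3D
// pose (translation + unit quaternion), so the solver's update vector is
// a uniform 6-DoF increment per pose. Layout and conventions are a sketch.
struct Pose {
    double t[3];  // translation
    double q[4];  // rotation quaternion (w, x, y, z)
};

// Apply a 6-DoF increment d = [dt, dtheta] to a pose. The rotation part
// uses the quaternion exponential of the axis-angle vector dtheta and a
// left-multiplicative update q' = exp(dtheta) * q.
Pose apply_increment(Pose p, const double d[6]) {
    for (int i = 0; i < 3; ++i) p.t[i] += d[i];
    double theta = std::sqrt(d[3]*d[3] + d[4]*d[4] + d[5]*d[5]);
    double dq[4] = {1.0, 0.0, 0.0, 0.0};      // identity rotation
    if (theta > 1e-12) {
        double s = std::sin(theta / 2.0) / theta;
        dq[0] = std::cos(theta / 2.0);
        dq[1] = s * d[3]; dq[2] = s * d[4]; dq[3] = s * d[5];
    }
    // Hamilton product q' = dq * q.
    double w = dq[0]*p.q[0] - dq[1]*p.q[1] - dq[2]*p.q[2] - dq[3]*p.q[3];
    double x = dq[0]*p.q[1] + dq[1]*p.q[0] + dq[2]*p.q[3] - dq[3]*p.q[2];
    double y = dq[0]*p.q[2] - dq[1]*p.q[3] + dq[2]*p.q[0] + dq[3]*p.q[1];
    double z = dq[0]*p.q[3] + dq[1]*p.q[2] - dq[2]*p.q[1] + dq[3]*p.q[0];
    p.q[0] = w; p.q[1] = x; p.q[2] = y; p.q[3] = z;
    return p;
}
```

The uniform block size is what makes the GPU mapping clean: one thread (or warp) per pose applies its own 6-vector with perfectly regular, coalescable access.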

The Challenge

This project presents distinct parallel systems challenges around irregular memory access and divergence. Hessian assembly is a large scatter operation whose access pattern follows the graph topology, so it is irregular and hard to coalesce; many edges accumulate into the same pose's Hessian blocks, creating write contention that must be resolved with atomics, graph coloring, or a gather-based reduction. The PCG inner loop is dominated by block-sparse matrix-vector products, where variable row lengths cause warp divergence and load imbalance.
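The contention point in assembly can be made concrete with a serial reference of the per-edge accumulation step. The sketch below is CPU code standing in for the CUDA kernel: on the GPU each edge would be one thread, and the `+=` into shared pose blocks is exactly where `atomicAdd` (or a coloring/gather scheme) is needed. The flat 6x6-block layout and the placeholder contribution are illustrative.

```cpp
#include <vector>

// Serial reference of per-edge Hessian assembly. On the GPU this loop
// becomes a parallel-for over edges, and the accumulations below are the
// contention point: two edges touching the same pose race on its block.
struct Edge { int i, j; };  // indices of the two poses an edge constrains

// Accumulate a toy per-edge contribution into the block diagonal of H.
// H_diag holds one flattened 6x6 block per pose (n_poses * 36 doubles).
void assemble_diag(const std::vector<Edge>& edges,
                   std::vector<double>& H_diag) {
    for (const Edge& e : edges) {           // one GPU thread per edge
        for (int k = 0; k < 36; ++k) {
            // Placeholder value; the real kernel writes J^T Omega J here.
            H_diag[e.i * 36 + k] += 1.0;    // needs atomicAdd on device
            H_diag[e.j * 36 + k] += 1.0;    // needs atomicAdd on device
        }
    }
}
```

Note how a chain 0-1-2 already makes pose 1's block a hot spot: both edges write it, which is the contention the atomics-vs-coloring design decision addresses.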

Resources

I will utilize the NVIDIA GPUs available in the GHC/latedays clusters. The project will be built from scratch in C++ and CUDA.

For benchmarking and ground-truth validation, I will use GTSAM as the single-threaded CPU baseline. I will parse standard PGO benchmark datasets (e.g., sphere2500.g2o, parking-garage.g2o, torus3D.g2o, M3500.g2o) to feed the graph topology and initial estimates directly into both solvers.
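A minimal sketch of reading those datasets, assuming the standard g2o text format with `VERTEX_SE3:QUAT` and `EDGE_SE3:QUAT` records. For brevity it reads only the fields needed to seed both solvers and skips each edge's trailing information-matrix entries; the struct names are illustrative.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch of a g2o text-format line parser. Assumes the common layout
// "VERTEX_SE3:QUAT id tx ty tz qx qy qz qw" and
// "EDGE_SE3:QUAT i j tx ty tz qx qy qz qw <info entries>"; the
// information-matrix entries after the edge measurement are skipped here.
struct VertexSE3 { int id; double t[3], q[4]; };   // q = (x, y, z, w)
struct EdgeSE3   { int i, j; double t[3], q[4]; };

bool parse_g2o_line(const std::string& line,
                    std::vector<VertexSE3>& vertices,
                    std::vector<EdgeSE3>& edges) {
    std::istringstream ss(line);
    std::string tag;
    ss >> tag;
    if (tag == "VERTEX_SE3:QUAT") {
        VertexSE3 v;
        ss >> v.id >> v.t[0] >> v.t[1] >> v.t[2]
           >> v.q[0] >> v.q[1] >> v.q[2] >> v.q[3];
        if (ss) vertices.push_back(v);
        return bool(ss);
    } else if (tag == "EDGE_SE3:QUAT") {
        EdgeSE3 e;
        ss >> e.i >> e.j >> e.t[0] >> e.t[1] >> e.t[2]
           >> e.q[0] >> e.q[1] >> e.q[2] >> e.q[3];
        if (ss) edges.push_back(e);
        return bool(ss);
    }
    return false;  // unrecognized record type
}
```

Feeding the same parsed vertices and edges to both GTSAM and the CUDA solver keeps the benchmark apples-to-apples.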

Goals and Deliverables

Plan to Achieve:

Hope to Achieve:

Platform Choice

NVIDIA GPUs programmed with CUDA are the ideal target for this workload. Fine-grained control over shared memory, warp-level primitives, and memory coalescing is necessary to map the irregular, block-sparse structures of factor graphs onto parallel hardware without wasting memory bandwidth.
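As a reference for the custom solver, here is a dense, Jacobi-preconditioned CG loop for H dx = -b written as serial C++. It is a correctness oracle, not the GPU design: in the CUDA version the matrix is block-sparse, and each vector operation below (SpMV, dot products, axpy updates) maps to one device kernel or a fused kernel, so no vector ever crosses the PCIe bus.

```cpp
#include <cmath>
#include <vector>

// Dense serial reference of Jacobi-preconditioned CG for H dx = -b,
// with H symmetric positive definite. Each loop below corresponds to a
// CUDA kernel in the real solver (SpMV, dot product, axpy).
std::vector<double> pcg(const std::vector<std::vector<double>>& H,
                        const std::vector<double>& b,
                        int max_iter = 100, double tol = 1e-10) {
    int n = (int)b.size();
    std::vector<double> x(n, 0.0), r(n), z(n), p(n), Hp(n);
    for (int i = 0; i < n; ++i) r[i] = -b[i];           // r = -b - H*0
    for (int i = 0; i < n; ++i) z[i] = r[i] / H[i][i];  // Jacobi precond.
    p = z;
    double rz = 0.0;
    for (int i = 0; i < n; ++i) rz += r[i] * z[i];
    for (int it = 0; it < max_iter; ++it) {
        for (int i = 0; i < n; ++i) {                   // SpMV kernel
            Hp[i] = 0.0;
            for (int j = 0; j < n; ++j) Hp[i] += H[i][j] * p[j];
        }
        double pHp = 0.0;                               // dot kernel
        for (int i = 0; i < n; ++i) pHp += p[i] * Hp[i];
        double alpha = rz / pHp;
        double rnorm = 0.0;
        for (int i = 0; i < n; ++i) {                   // fused axpy
            x[i] += alpha * p[i];
            r[i] -= alpha * Hp[i];
            rnorm += r[i] * r[i];
        }
        if (std::sqrt(rnorm) < tol) break;
        for (int i = 0; i < n; ++i) z[i] = r[i] / H[i][i];
        double rz_new = 0.0;
        for (int i = 0; i < n; ++i) rz_new += r[i] * z[i];
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return x;
}
```

The Jacobi (block-diagonal) preconditioner is a natural first choice here because the Hessian's 6x6 diagonal blocks are already materialized by the assembly stage.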

Schedule