Flash-Recon — Real-Time Monocular SLAM (in progress)
Visual SLAM · CUDA · PyTorch · 3D Reconstruction
Real-time monocular SLAM based on DROID-SLAM, built from the ground up with fused CUDA kernels for bundle adjustment, achieving 2.5 cm ATE and 10 ms median latency. Runs DepthAnythingV2 and Gaussian Splatting concurrently on a single GPU, creating a new optimization landscape that motivated …
[code]
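The bundle-adjustment core can be illustrated with a toy Gauss-Newton refinement. This is only a sketch of the idea, not the fused-kernel implementation: it refines a single 3D landmark against pinhole observations with known camera poses, where the real system also optimizes poses and depth, and runs the solve inside custom CUDA kernels. All names here (`project`, `refine_point`) are illustrative, not from the project.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of world point X into a camera (R, t) with intrinsics K."""
    x_cam = R @ X + t
    x_img = K @ x_cam
    return x_img[:2] / x_img[2]

def refine_point(K, poses, obs, X0, iters=10):
    """Gauss-Newton refinement of one landmark from 2D observations."""
    X = X0.astype(float).copy()
    for _ in range(iters):
        J_rows, r_rows = [], []
        for (R, t), z in zip(poses, obs):
            r = project(K, R, t, X) - z          # 2D reprojection residual
            # forward-difference Jacobian of the residual w.r.t. X (2x3)
            eps = 1e-6
            J = np.zeros((2, 3))
            for k in range(3):
                dX = np.zeros(3); dX[k] = eps
                J[:, k] = (project(K, R, t, X + dX) - z - r) / eps
            J_rows.append(J); r_rows.append(r)
        J = np.vstack(J_rows); r = np.hstack(r_rows)
        # normal equations: (J^T J) dx = -J^T r
        X += np.linalg.solve(J.T @ J, -J.T @ r)
    return X
```

Stacking many such residual blocks (over all landmarks and poses) into one sparse system is what makes BA a natural target for fused GPU kernels.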
Torq — Graph Compiler with CUDA-Level Interception (in progress)
CUDA · C · Python · Compilers · Quantization
Driver-level CUDA API interception via LD_PRELOAD for automatic graph capture, stream management, and contention detection. Building toward contention-aware dual-graph dispatch with static INT8 quantization.
[code]
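The static INT8 piece can be sketched in a few lines. This is a generic per-tensor, symmetric quantization scheme, assumed for illustration rather than taken from Torq's code: a calibration pass freezes the scale offline, and runtime values are quantized against that fixed scale (the "static" in static quantization).

```python
import numpy as np

def calibrate_scale(calibration_batches):
    """Static calibration: map the largest observed |x| to 127."""
    amax = max(np.abs(b).max() for b in calibration_batches)
    return amax / 127.0

def quantize(x, scale):
    """Symmetric INT8 quantization with saturation."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

For values inside the calibrated range, the round-trip error is bounded by half the scale; values outside it saturate, which is why calibration data should cover the real activation distribution.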
pperf — Hierarchical Profiler for Quick Iteration
Python · Profiling
Tiny hierarchical profiler that surfaces the worst bottlenecks with their full call tree. Pluggable metrics: latency, GPU memory, or anything you define. Built to profile flash-recon under real contention.
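A minimal sketch of how such a profiler can work, assuming nothing about pperf's actual API: nested context-manager scopes build a call tree, each node accumulates a pluggable metric (wall time by default), and the report sorts siblings worst-first.

```python
import time
from contextlib import contextmanager

class Node:
    def __init__(self, name):
        self.name, self.total, self.children = name, 0.0, {}

class Profiler:
    """Tiny hierarchical profiler: nested scopes form a call tree."""
    def __init__(self, metric=time.perf_counter):
        self.metric = metric            # pluggable: any monotone counter
        self.root = Node("root")
        self.stack = [self.root]

    @contextmanager
    def scope(self, name):
        node = self.stack[-1].children.setdefault(name, Node(name))
        self.stack.append(node)
        start = self.metric()
        try:
            yield
        finally:
            node.total += self.metric() - start
            self.stack.pop()

    def report(self, node=None, depth=0):
        """Lines of 'name: total', children sorted worst-first."""
        node = node or self.root
        lines = []
        for child in sorted(node.children.values(), key=lambda n: -n.total):
            lines.append(f"{'  ' * depth}{child.name}: {child.total:.6f}")
            lines += self.report(child, depth + 1)
        return lines
```

Swapping `metric` for a GPU-memory counter turns the same tree into a memory profiler, which is the "anything you define" part.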
VisionRT — Deterministic Inference via Vertical Optimization
Computer Vision · Real-time Systems · CUDA · Triton · PyTorch · V4L2
Real-time computer vision with deterministic performance: direct V4L2 integration, a custom PyTorch compiler backend, and CUDA graph capture achieving microsecond-level timing precision.
Inclusive Scan — GPU Prefix Sum That Beats CUB
CUDA · Parallel Algorithms
High-performance GPU prefix sum achieving 94% of theoretical DRAM bandwidth. Uses Kogge-Stone scans within each thread block and decoupled lookback for cross-CTA communication. Beats NVIDIA CUB at mid-range sizes.
[code]
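The intra-block half of the algorithm is easy to show in miniature. This Python sketch mimics how a CUDA block runs Kogge-Stone: at each of the log2(n) steps with offset d, every lane i ≥ d adds the value lane i − d held before the step (the copy plays the role of the barrier). The decoupled-lookback part, where each CTA publishes partial/inclusive sums for later CTAs to consume, is not shown.

```python
import numpy as np

def kogge_stone_inclusive_scan(x):
    """Inclusive prefix sum via Kogge-Stone: log2(n) data-parallel steps."""
    x = np.asarray(x).copy()
    d = 1
    while d < len(x):
        prev = x.copy()               # snapshot = barrier between steps
        x[d:] = prev[d:] + prev[:-d]  # lane i adds lane i - d, for i >= d
        d *= 2
    return x
```

Kogge-Stone does O(n log n) additions versus O(n) for a work-efficient scan, but every lane stays busy at every step, which is exactly the trade-off that suits a GPU warp or block.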
Transpose Scale — 3-6x Faster Than Intel MKL
C++ · SIMD · Performance Engineering
High-performance matrix transpose using double cache blocking, branch elimination, and vectorized in-register operations. Cross-platform SIMD support with AVX2, SSE, and NEON.
[code]
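The cache-blocking idea, stripped of the second blocking level and the SIMD register kernels, looks like this. A sketch only: walk the matrix in B×B tiles so that each tile's source rows and destination columns stay cache-resident while the tile is transposed, instead of striding through the whole matrix column by column.

```python
import numpy as np

def blocked_transpose(a, B=32):
    """Cache-blocked matrix transpose: process B x B tiles at a time."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i0 in range(0, n, B):
        for j0 in range(0, m, B):
            # one tile's reads and writes both fit in cache
            tile = a[i0:i0 + B, j0:j0 + B]
            out[j0:j0 + B, i0:i0 + B] = tile.T
    return out
```

In the C++ version the inner `tile.T` becomes an in-register shuffle (AVX2/SSE/NEON), and a second, larger blocking level keeps tiles within the L2 working set.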