Flash-Recon — Real-Time Monocular SLAM (in progress)

Visual SLAM CUDA PyTorch 3D Reconstruction

Real-time monocular SLAM modeled on DROID-SLAM but rebuilt from the ground up, with fused CUDA kernels for bundle adjustment, achieving 2.5 cm ATE and 10 ms median latency. Runs DepthAnythingV2 and Gaussian Splatting concurrently on a single GPU, a contention-heavy workload that motivated torq and pperf.
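For context on the headline number: ATE (absolute trajectory error) is the RMSE of per-frame translational error between the estimated and ground-truth camera trajectories. A minimal sketch, omitting the Sim(3)/SE(3) alignment step a real evaluation would perform first (function name and data are illustrative, not flash-recon's code):

```python
import math

def ate_rmse(estimated, ground_truth):
    """Absolute trajectory error: RMSE over per-frame translational
    errors between two lists of (x, y, z) positions. Real evaluations
    first align the trajectories (e.g. a Umeyama fit); that step is
    omitted here for brevity."""
    assert len(estimated) == len(ground_truth)
    sq = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(estimated, ground_truth):
        sq += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(sq / len(estimated))

# A constant 3 cm offset along x yields an ATE of exactly 0.03 m.
est = [(0.03 + t, 0.0, 0.0) for t in range(10)]
gt = [(float(t), 0.0, 0.0) for t in range(10)]
```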

torq architecture

Torq — Graph Compiler with CUDA-Level Interception (in progress)

CUDA C Python Compilers Quantization

Driver-level CUDA API interception via LD_PRELOAD for automatic graph capture, stream management, and contention detection. Building toward contention-aware dual-graph dispatch with static INT8 quantization.
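The interception itself happens in C via `LD_PRELOAD` and `dlsym(RTLD_NEXT, ...)`, but the graph-capture idea can be sketched in Python: interpose a wrapper around each API entry point and record every call, in order, into a trace. All names below are illustrative stand-ins, not torq's API:

```python
import functools

class CallRecorder:
    """Records (function name, args) in call order, the way an
    interposed CUDA shim might accumulate launches into a graph.
    Purely a sketch; torq does this at the driver-API level."""
    def __init__(self):
        self.trace = []

    def interpose(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            self.trace.append((fn.__name__, args))  # capture before forwarding
            return fn(*args, **kwargs)              # forward to the real call
        return wrapper

# Hypothetical "kernels" standing in for CUDA launches.
def launch_matmul(n):
    return n * n

def launch_relu(x):
    return max(0, x)

rec = CallRecorder()
launch_matmul = rec.interpose(launch_matmul)
launch_relu = rec.interpose(launch_relu)

out = launch_relu(launch_matmul(4) - 20)  # recorded as two ordered launches
```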

pperf output

pperf — Hierarchical Profiler for Quick Iteration

Python Profiling

Tiny hierarchical profiler that surfaces the worst bottlenecks along with their full call trees. Metrics are pluggable: latency, GPU memory, or anything you define. Built to profile flash-recon under real contention.
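The core of such a profiler fits in a few dozen lines: nested spans build a call tree, a pluggable metric (any zero-argument callable whose difference across a span is the cost) measures each node, and the worst leaf is reported with its full path. A minimal sketch under those assumptions, not pperf's actual API:

```python
import time
from contextlib import contextmanager

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children, self.value = [], 0.0

    def path(self):
        n, parts = self, []
        while n:
            parts.append(n.name)
            n = n.parent
        return "/".join(reversed(parts))

class Profiler:
    """Hierarchical profiler sketch. `metric` is any zero-argument
    callable; a span's cost is the metric delta across it (wall time
    by default, but GPU memory etc. plug in the same way)."""
    def __init__(self, metric=time.perf_counter):
        self.metric = metric
        self.root = Node("root")
        self.current = self.root

    @contextmanager
    def span(self, name):
        node = Node(name, self.current)
        self.current.children.append(node)
        self.current, start = node, self.metric()
        try:
            yield
        finally:
            node.value = self.metric() - start
            self.current = node.parent

    def worst(self):
        # Worst leaf by metric, reported with its full call path.
        leaves = [n for n in self._walk(self.root) if not n.children]
        return max(leaves, key=lambda n: n.value)

    def _walk(self, node):
        yield node
        for c in node.children:
            yield from self._walk(c)

p = Profiler()
with p.span("frame"):
    with p.span("track"):
        time.sleep(0.02)  # the hot leaf
    with p.span("map"):
        pass
```

Nesting falls out of the `current` pointer: entering a span parents it under the active node, exiting restores the parent, so the tree mirrors the dynamic call structure.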

visionrt latency histogram

VisionRT — Deterministic Inference via Vertical Optimization

Computer Vision Real-time Systems CUDA Triton PyTorch V4L2

Real-time computer vision with deterministic performance. Direct V4L2 integration, custom PyTorch compiler backend, and CUDA graph capture achieving microsecond-level timing precision.

inclusive-scan benchmark

Inclusive Scan — GPU Prefix Sum That Beats CUB

CUDA Parallel Algorithms

High-performance GPU prefix sum achieving 94% of theoretical DRAM bandwidth. Uses Kogge-Stone scans within each thread block and decoupled lookback for cross-CTA communication. Beats NVIDIA CUB at mid-range input sizes.

transpose-scale benchmark

Transpose Scale — 3-6x Faster Than Intel MKL

C++ SIMD Performance Engineering

High-performance matrix transpose using two-level cache blocking, branch elimination, and vectorized in-register operations. Cross-platform SIMD support for AVX2, SSE, and NEON.
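Cache blocking means transposing fixed-size tiles so both the read side and the write side stay resident in cache. A Python sketch of one blocking level on a flat row-major matrix (tile size illustrative; the C++ version adds a second blocking level and transposes each tile in SIMD registers):

```python
def blocked_transpose(src, rows, cols, tile=8):
    """Transpose a rows x cols row-major matrix (flat list) tile by
    tile. Each tile's reads and writes touch a small, cache-friendly
    footprint instead of striding across the whole matrix."""
    dst = [0] * (rows * cols)
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            # min() handles the ragged edge tiles; the C++ version
            # removes these bounds checks via branch elimination.
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    dst[j * rows + i] = src[i * cols + j]
    return dst
```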