Flash-Recon — Real-Time Monocular SLAM (in progress)

Visual SLAM CUDA PyTorch 3D Reconstruction

Real-time monocular SLAM modeled on DROID-SLAM but rebuilt from the ground up, with fused CUDA kernels for bundle adjustment, achieving 2.5 cm ATE and 10 ms median latency. Runs DepthAnythingV2 and Gaussian Splatting concurrently on a single GPU, a contention-heavy workload that motivated torq and pperf.
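For context on the headline number: ATE (absolute trajectory error) is the RMSE of per-frame translational error between the estimated and ground-truth camera trajectories. A minimal sketch, omitting the Sim(3)/SE(3) alignment step a real evaluation would perform first (function name and data are illustrative, not flash-recon's code):

```python
import math

def ate_rmse(estimated, ground_truth):
    """Absolute trajectory error: RMSE over per-frame translational
    errors between two lists of (x, y, z) positions. Real evaluations
    first align the trajectories (e.g. a Umeyama fit); that step is
    omitted here for brevity."""
    assert len(estimated) == len(ground_truth)
    sq = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(estimated, ground_truth):
        sq += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(sq / len(estimated))

# A constant 3 cm offset along x yields an ATE of exactly 0.03 m.
est = [(0.03 + t, 0.0, 0.0) for t in range(10)]
gt = [(float(t), 0.0, 0.0) for t in range(10)]
```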

torq architecture

Torq — Graph Compiler with CUDA-Level Interception (in progress)

CUDA C Python Compilers Quantization

Driver-level CUDA API interception via LD_PRELOAD for automatic graph capture, stream management, and contention detection. Building toward contention-aware dual-graph dispatch with static INT8 quantization.
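The interception itself happens in C via `LD_PRELOAD` and `dlsym(RTLD_NEXT, ...)`, but the graph-capture idea can be sketched in Python: interpose a wrapper around each API entry point and record every call, in order, into a trace. All names below are illustrative stand-ins, not torq's API:

```python
import functools

class CallRecorder:
    """Records (function name, args) in call order, the way an
    interposed CUDA shim might accumulate launches into a graph.
    Purely a sketch; torq does this at the driver-API level."""
    def __init__(self):
        self.trace = []

    def interpose(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            self.trace.append((fn.__name__, args))  # capture before forwarding
            return fn(*args, **kwargs)              # forward to the real call
        return wrapper

# Hypothetical "kernels" standing in for CUDA launches.
def launch_matmul(n):
    return n * n

def launch_relu(x):
    return max(0, x)

rec = CallRecorder()
launch_matmul = rec.interpose(launch_matmul)
launch_relu = rec.interpose(launch_relu)

out = launch_relu(launch_matmul(4) - 20)  # recorded as two ordered launches
```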

pperf output

pperf — Hierarchical Profiler for Quick Iteration

Python Profiling

Tiny hierarchical profiler that surfaces the worst bottlenecks along with their full call trees. Metrics are pluggable: latency, GPU memory, or anything you define. Built to profile flash-recon under real contention.
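The core of such a profiler fits in a few dozen lines: nested spans build a call tree, a pluggable metric (any zero-argument callable whose difference across a span is the cost) measures each node, and the worst leaf is reported with its full path. A minimal sketch under those assumptions, not pperf's actual API:

```python
import time
from contextlib import contextmanager

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children, self.value = [], 0.0

    def path(self):
        n, parts = self, []
        while n:
            parts.append(n.name)
            n = n.parent
        return "/".join(reversed(parts))

class Profiler:
    """Hierarchical profiler sketch. `metric` is any zero-argument
    callable; a span's cost is the metric delta across it (wall time
    by default, but GPU memory etc. plug in the same way)."""
    def __init__(self, metric=time.perf_counter):
        self.metric = metric
        self.root = Node("root")
        self.current = self.root

    @contextmanager
    def span(self, name):
        node = Node(name, self.current)
        self.current.children.append(node)
        self.current, start = node, self.metric()
        try:
            yield
        finally:
            node.value = self.metric() - start
            self.current = node.parent

    def worst(self):
        # Worst leaf by metric, reported with its full call path.
        leaves = [n for n in self._walk(self.root) if not n.children]
        return max(leaves, key=lambda n: n.value)

    def _walk(self, node):
        yield node
        for c in node.children:
            yield from self._walk(c)

p = Profiler()
with p.span("frame"):
    with p.span("track"):
        time.sleep(0.02)  # the hot leaf
    with p.span("map"):
        pass
```

Nesting falls out of the `current` pointer: entering a span parents it under the active node, exiting restores the parent, so the tree mirrors the dynamic call structure.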

visionrt latency histogram

VisionRT — Deterministic Inference via Vertical Optimization

Computer Vision Real-time Systems CUDA Triton PyTorch V4L2

Real-time computer vision with deterministic performance. Direct V4L2 integration, custom PyTorch compiler backend, and CUDA graph capture achieving microsecond-level timing precision.

inclusive-scan benchmark

Inclusive Scan — GPU Prefix Sum That Beats CUB

CUDA Parallel Algorithms

High-performance GPU prefix sum achieving 94% of theoretical DRAM bandwidth. Uses Kogge-Stone scans within each thread block and decoupled lookback for cross-CTA communication. Beats NVIDIA CUB at mid-range input sizes.

transpose-scale benchmark

Transpose Scale — 3-6x Faster Than Intel MKL

C++ SIMD Performance Engineering

High-performance matrix transpose using two-level cache blocking, branch elimination, and vectorized in-register operations. Cross-platform SIMD support for AVX2, SSE, and NEON.
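Cache blocking means transposing fixed-size tiles so both the read side and the write side stay resident in cache. A Python sketch of one blocking level on a flat row-major matrix (tile size illustrative; the C++ version adds a second blocking level and transposes each tile in SIMD registers):

```python
def blocked_transpose(src, rows, cols, tile=8):
    """Transpose a rows x cols row-major matrix (flat list) tile by
    tile. Each tile's reads and writes touch a small, cache-friendly
    footprint instead of striding across the whole matrix."""
    dst = [0] * (rows * cols)
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            # min() handles the ragged edge tiles; the C++ version
            # removes these bounds checks via branch elimination.
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    dst[j * rows + i] = src[i * cols + j]
    return dst
```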