Changelog

v0.4.3

Performance: torch.compile for all algorithms + cleanup

Enabled torch.compile for CRBA and ABA on CUDA (previously skipped when Triton kernels were available). Compilation now optimizes surrounding PyTorch operations while Triton handles the bottleneck 6x6 double-matmul. Result: CRBA 2.1x faster, ABA 1.7x faster at B=1 with compile enabled.
Moved Triton kernel import to module level, eliminating repeated import overhead in hot loops. _use_triton_kernels now properly gates on HAS_TRITON instead of being hardcoded to True.
Rewrote benchmarks/speed_benchmark.py for fair comparison: each bard algorithm timing now includes update_kinematics (end-to-end), matching ADAM’s self-contained calls. Speedup tables use ADAM as the primary baseline. Pinocchio C++ is included as a serial CPU reference.
Added autograd documentation to the Quick Start guide, with examples for d(M)/d(q) through CRBA and d(qdd)/d(tau) through ABA.
Removed obsolete files: quick_bench.py, basic_test.py, bard_basic_test.py, and root-level profiling/debug scripts.
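The HAS_TRITON gating described above can be pictured with a minimal sketch. This is not bard's actual source: the helper names use_triton_kernels and pick_backend are hypothetical, and the availability check uses the standard-library importlib rather than whatever import bard performs. It only illustrates resolving the optional dependency once at module level and deriving the kernel switch from it instead of hardcoding True.

```python
import importlib.util

# Resolved once at import time, so hot loops never re-run the import machinery.
# find_spec only locates the package; it does not import it.
HAS_TRITON = importlib.util.find_spec("triton") is not None


def use_triton_kernels(device_type: str, has_triton: bool = HAS_TRITON) -> bool:
    """Hypothetical gate: True only if Triton is installed and the device is CUDA."""
    return has_triton and device_type == "cuda"


def pick_backend(device_type: str, has_triton: bool = HAS_TRITON) -> str:
    """Hypothetical dispatcher: choose the implementation a hot loop should call."""
    return "triton" if use_triton_kernels(device_type, has_triton) else "pytorch"
```

With this shape, a machine without Triton falls back to the plain PyTorch path on every device, rather than failing when a kernel is first requested.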

Performance: Inline spatial cross products, eliminate per-node allocations

Inlined spatial cross products to eliminate per-node GPU allocations.
Reverted JIT functions to the torch.zeros pattern for torch.compile compatibility.
Eliminated hidden tensor allocations in core transform functions.
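To illustrate what inlining a spatial cross product means, here is a small pure-Python sketch. It is not bard's code. It assumes Featherstone's ordering (angular part first, then linear part), which may differ from bard's internal convention, and all function names are illustrative. The reference version materialises the 6x6 operator and multiplies; the inlined version computes the same result from three 3D cross products, which is the kind of rewrite that avoids allocating a temporary matrix per node.

```python
def cross3(a, b):
    """Ordinary 3D cross product a x b."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def skew(a):
    """3x3 skew-symmetric matrix such that skew(a) @ b == a x b."""
    return [
        [0.0, -a[2], a[1]],
        [a[2], 0.0, -a[0]],
        [-a[1], a[0], 0.0],
    ]


def motion_cross_matrix(v):
    """Reference version: build the full 6x6 operator [[w^, 0], [lin^, w^]]."""
    w, lin = skew(v[:3]), skew(v[3:])
    zero = [0.0, 0.0, 0.0]
    top = [w[i] + zero for i in range(3)]
    bottom = [lin[i] + w[i] for i in range(3)]
    return top + bottom


def matvec(mat, vec):
    """Dense matrix-vector product on nested lists."""
    return [sum(r * x for r, x in zip(row, vec)) for row in mat]


def motion_cross_inline(v, m):
    """Inlined version: same result without building the 6x6 matrix."""
    w, lin = v[:3], v[3:]
    m_w, m_lin = m[:3], m[3:]
    top = cross3(w, m_w)
    a = cross3(lin, m_w)
    b = cross3(w, m_lin)
    return top + [x + y for x, y in zip(a, b)]


v = [0.1, -0.2, 0.3, 1.0, 2.0, -1.0]
m = [0.5, 0.4, -0.3, -2.0, 0.0, 1.5]
reference = matvec(motion_cross_matrix(v), m)
inlined = motion_cross_inline(v, m)
```

Both paths agree up to floating-point rounding. In a batched tensor setting the inlined form trades one small matrix allocation per node for a few elementwise operations.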

Unified Model/Data API

Introduced Model and Data classes following the Pinocchio/MuJoCo pattern. All computations are now accessed through top-level free functions (bard.forward_kinematics(), bard.jacobian(), bard.rnea(), etc.) operating on a model + data pair.
Added bard.build_model_from_urdf() as the primary entry point for loading robots. Returns a Model object that holds the robot’s topology, inertias, and joint parameters.
Added bard.create_data() for creating pre-allocated computation workspaces. One Model can be used with multiple Data instances.
bard.update_kinematics(model, data, q, qd) performs a single tree traversal, caching transforms, spatial adjoints, and velocities. All subsequent algorithm calls reuse this cached data instead of recomputing it.
Breaking change: Removed RobotDynamics, KinematicsState, ForwardKinematics, SpatialAcceleration, Jacobian, RNEA, and CRBA classes. The build_chain_from_urdf() function is no longer part of the public API. See the Quick Start guide for migration examples.
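The Model/Data split can be sketched with a toy planar chain. None of the names below are bard's API: ToyModel, ToyData, and the toy_* functions are hypothetical stand-ins, and the toy update takes only joint angles rather than (q, qd). The sketch shows the pattern only: an immutable model, a separate mutable workspace, a single traversal that fills the cache, and queries that read the cache without traversing again.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyModel:
    """Immutable description of a planar serial chain (hypothetical)."""
    link_lengths: tuple


@dataclass
class ToyData:
    """Pre-allocated workspace holding cached kinematics (hypothetical)."""
    angles: list
    positions: list
    traversals: int = 0


def toy_create_data(model):
    """Allocate one workspace; a single model can back many of these."""
    n = len(model.link_lengths)
    return ToyData(angles=[0.0] * n, positions=[(0.0, 0.0)] * n)


def toy_update_kinematics(model, data, q):
    """Single pass over the chain that caches absolute angles and link tips."""
    theta, x, y = 0.0, 0.0, 0.0
    for i, (length, qi) in enumerate(zip(model.link_lengths, q)):
        theta += qi
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        data.angles[i] = theta
        data.positions[i] = (x, y)
    data.traversals += 1


def toy_tip_position(model, data):
    """Read the cached end-effector position; no traversal happens here."""
    return data.positions[-1]


def toy_tip_angle(model, data):
    """Read the cached end-effector orientation; no traversal happens here."""
    return data.angles[-1]


model = ToyModel(link_lengths=(1.0, 1.0))
data_a = toy_create_data(model)  # two independent workspaces, one model
data_b = toy_create_data(model)
toy_update_kinematics(model, data_a, [0.0, math.pi / 2])
toy_update_kinematics(model, data_b, [0.0, 0.0])
tip_a = toy_tip_position(model, data_a)
tip_b = toy_tip_position(model, data_b)
angle_a = toy_tip_angle(model, data_a)
```

Keeping the model immutable is what makes it safe to share across workspaces: each Data instance carries its own cached state, so two configurations can be evaluated side by side without interfering.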

Initial release with batched Forward Kinematics, Jacobian, RNEA, CRBA, and Spatial Acceleration.
URDF parsing support.
torch.compile compatibility.
Fixed-base and floating-base robot support.