Changelog

v0.4.3

Performance: torch.compile for all algorithms + cleanup

  • Enabled torch.compile for CRBA and ABA on CUDA (previously skipped when Triton kernels were available). Compilation now optimizes surrounding PyTorch operations while Triton handles the bottleneck 6x6 double-matmul. Result: CRBA 2.1x faster, ABA 1.7x faster at B=1 with compile enabled.

  • Moved Triton kernel import to module level, eliminating repeated import overhead in hot loops. _use_triton_kernels now properly gates on HAS_TRITON instead of being hardcoded to True.

  • Rewrote benchmarks/speed_benchmark.py for fair comparison: each bard algorithm timing now includes update_kinematics (end-to-end), matching ADAM’s self-contained calls. Speedup tables use ADAM as the primary baseline. Pinocchio C++ is included as a serial CPU reference.

  • Added autograd documentation to the Quick Start guide, with examples of differentiating through CRBA (d(M)/d(q)) and through ABA (d(qdd)/d(tau)).

  • Removed obsolete files: quick_bench.py, basic_test.py, bard_basic_test.py, and root-level profiling/debug scripts.
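
The HAS_TRITON gating described above can be sketched roughly as follows. This is a minimal illustration, not bard's actual code: the function use_triton_kernels and its on_cuda parameter are hypothetical stand-ins for the internal _use_triton_kernels flag, and only the HAS_TRITON name comes from the notes above.

```python
# The optional dependency is probed once, at import time, so hot loops
# never pay for a repeated import attempt.
try:
    import triton  # noqa: F401  (only the availability matters here)
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False


def use_triton_kernels(on_cuda: bool, has_triton: bool = HAS_TRITON) -> bool:
    """Hypothetical gate: take the Triton path only when it can actually run.

    A flag hardcoded to True would select the Triton path even on machines
    where the import above failed; gating on HAS_TRITON avoids that.
    """
    return on_cuda and has_triton
```

With a gate of this shape, a CPU-only install or a machine without Triton falls back to the plain PyTorch path without raising at call time.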

v0.4.2

Performance: Inline spatial cross products, eliminate per-node allocations

  • Inlined spatial cross products to eliminate per-node GPU allocations.

  • Reverted JIT functions to the torch.zeros pattern for torch.compile compatibility.

  • Eliminated hidden tensor allocations in core transform functions.
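
To illustrate what inlining a spatial cross product means, the sketch below compares building the full 6x6 motion cross-product operator with computing the same result from three 3-D cross products. It is written in plain Python so it runs anywhere, and it assumes an angular-first ordering (omega, v) of the 6-D vectors; bard's actual layout and implementation may differ, and all function names here are hypothetical.

```python
def cross3(a, b):
    """Ordinary 3-D cross product a x b."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def skew(a):
    """3x3 skew-symmetric matrix S such that S @ b == a x b."""
    return [
        [0.0, -a[2], a[1]],
        [a[2], 0.0, -a[0]],
        [-a[1], a[0], 0.0],
    ]


def motion_cross_matrix(v, m):
    """Reference path: materialize the 6x6 operator, then multiply."""
    w, lin = v[:3], v[3:]
    sw, sl = skew(w), skew(lin)
    zero = [0.0, 0.0, 0.0]
    # Block layout: [[skew(w), 0], [skew(lin), skew(w)]]
    rows = [sw[i] + zero for i in range(3)] + [sl[i] + sw[i] for i in range(3)]
    return [sum(r[j] * m[j] for j in range(6)) for r in rows]


def motion_cross_inline(v, m):
    """Inlined path: three 3-D cross products, no 6x6 intermediate."""
    w, lin = v[:3], v[3:]
    mw, mlin = m[:3], m[3:]
    top = cross3(w, mw)
    bottom = [a + b for a, b in zip(cross3(w, mlin), cross3(lin, mw))]
    return top + bottom
```

On a GPU the same idea applies to batched tensors: skipping the 6x6 intermediate removes one allocation per tree node, which is the saving this release describes.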

v0.3

Unified Model/Data API

  • Introduced Model and Data classes following the Pinocchio/MuJoCo pattern. All computations are now accessed through top-level free functions (bard.forward_kinematics(), bard.jacobian(), bard.rnea(), etc.) operating on a model + data pair.

  • Added bard.build_model_from_urdf() as the primary entry point for loading robots. Returns a Model object that holds the robot’s topology, inertias, and joint parameters.

  • Added bard.create_data() for creating pre-allocated computation workspaces. One Model can be used with multiple Data instances.

  • bard.update_kinematics(model, data, q, qd) performs a single tree traversal, caching transforms, spatial adjoints, and velocities. Subsequent algorithm calls reuse the cached data instead of recomputing it.

  • Breaking change: Removed RobotDynamics, KinematicsState, ForwardKinematics, SpatialAcceleration, Jacobian, RNEA, and CRBA classes. The build_chain_from_urdf() function is no longer part of the public API. See the Quick Start guide for migration examples.
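
The Model/Data split and the single cached traversal can be illustrated with a toy planar chain. This is a conceptual sketch only: ToyModel, ToyData, tip_position, and tip_jacobian are hypothetical names, the update function here takes no qd argument, and none of it reflects bard's real signatures or internals.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyModel:
    """Immutable robot description (here: link lengths of a planar chain)."""
    link_lengths: tuple


@dataclass
class ToyData:
    """Mutable workspace, allocated once and overwritten on every update."""
    angles: list = field(default_factory=list)  # absolute link angles
    points: list = field(default_factory=list)  # joint positions, base first
    traversals: int = 0                         # how often the chain was walked


def create_data(model):
    n = len(model.link_lengths)
    return ToyData(angles=[0.0] * n, points=[(0.0, 0.0)] * (n + 1))


def update_kinematics(model, data, q):
    """Single base-to-tip pass that fills the cache."""
    theta, x, y = 0.0, 0.0, 0.0
    for i, (length, qi) in enumerate(zip(model.link_lengths, q)):
        theta += qi
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        data.angles[i] = theta
        data.points[i + 1] = (x, y)
    data.traversals += 1


def tip_position(model, data):
    """Reads the cache; does not walk the chain again."""
    return data.points[-1]


def tip_jacobian(model, data):
    """2 x n positional Jacobian built from the cached joint positions."""
    tx, ty = data.points[-1]
    cols = [(-(ty - py), tx - px) for px, py in data.points[:-1]]
    return [[c[0] for c in cols], [c[1] for c in cols]]
```

One ToyModel can back several ToyData instances, for example one per configuration being evaluated, and both query functions read the same cache after a single update call.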

v0.2

  • Benchmarks and performance documentation.

  • Fixed lxml dependency version.

  • Removed Python 3.14 support from CI.

v0.1

  • Initial release with batched Forward Kinematics, Jacobian, RNEA, CRBA, and Spatial Acceleration.

  • URDF parsing support.

  • torch.compile compatibility.

  • Fixed-base and floating-base robot support.