Introduction
Problem: Modern warehouses still rely heavily on human labor for repetitive value-added services (VAS) tasks such as kitting, SKU reorientation, packing, and palletizing. Manual operations are time-consuming, error-prone, and difficult to scale, especially when product shapes and layouts frequently change. These workflows create throughput bottlenecks and increase labor expenses, remaining the weakest link even in partially automated facilities.
Solution: This project builds an adaptive robotic ecosystem for smart warehouses, replacing bottlenecked manual operations with a zero-touch, adaptive robotic workflow. Using a leader-follower SO-101 robotic arm platform, the system learns through imitation and adapts to changing scenarios on the fly. The architecture is driven by transformer-based policies: the Action Chunking Transformer (ACT) for low-level imitation learning and smolVLA, a Vision-Language-Action model, for high-level multimodal reasoning. ROS 2 and MoveIt 2 serve as a deterministic fallback that takes over when the confidence of the AI policy falls below a threshold.
System Architecture
The complete system integrates hardware, software, and AI models into a unified adaptive pipeline:
- Hardware Platform: Dual SO-101 robot arms (Leader and Follower) driven by STS3215 servos and controlled via a Wonrabai Servo Driver Board with an Arduino Mega. An NVIDIA Jetson Nano serves as the edge computer, running Ubuntu, ROS 2, and AI inference.
- Perception: An overhead Intel RealSense D415 camera and a wrist-mounted camera provide multi-view RGB-D sensing. YOLO OBB provides orientation-aware object detection for precise alignment.
- AI Layer: The LeRobot framework handles demonstration recording and dataset management. ACT converts demonstrations into compact action sequences for the Follower arm. VLA models (smolVLA) enable generalization across SKUs and control through natural-language instructions.
- Control and Safety: The Follower arm executes commands with continuous servo-level feedback. Safety devices include an emergency stop, fuse protection, and a protective frame enclosure.
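The orientation-aware detection listed under Perception can be illustrated with a small sketch. It assumes the detector returns each oriented box as four corner points in pixel coordinates, listed in order around the box (the exact output format used in this project is not stated). From those corners it derives the box centre and the angle of the longer edge, which a gripper could align to. The function name `obb_pose` is hypothetical.

```python
import math

def obb_pose(corners):
    """Return (cx, cy, angle_deg) for an oriented box given 4 ordered corners.

    angle_deg is the direction of the longer edge, folded into [0, 180).
    The corner format is an assumption, not the project's actual interface.
    """
    if len(corners) != 4:
        raise ValueError("expected four (x, y) corner points")
    cx = sum(x for x, _ in corners) / 4.0
    cy = sum(y for _, y in corners) / 4.0
    # Two adjacent edges: corner 0 -> 1 and corner 1 -> 2.
    (x0, y0), (x1, y1), (x2, y2) = corners[0], corners[1], corners[2]
    e1 = (x1 - x0, y1 - y0)
    e2 = (x2 - x1, y2 - y1)
    long_edge = e1 if math.hypot(*e1) >= math.hypot(*e2) else e2
    angle = math.degrees(math.atan2(long_edge[1], long_edge[0])) % 180.0
    return cx, cy, angle
```

Folding the angle into [0, 180) reflects that a rectangular item looks the same after a half turn, so the gripper only needs the edge direction, not its sign.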
Key Features
- Imitation Learning with ACT: Transforms teleoperated demonstrations into reusable action chunks, enabling data-efficient trajectory learning without hand-coded motions.
- Vision-Language-Action Models (smolVLA): Extends control beyond demonstrations to include natural-language instructions and semantic reasoning for unseen tasks.
- Hierarchical Control Pipeline: Three-level architecture: low level (ACT motion primitives), mid level (digital twin validation), and high level (smolVLA reasoning).
- Plug-and-Play Dashboard: The CEVA Control Dashboard simplifies teleoperation, recording, and replay for non-technical warehouse operators.
- Digital Twin Synchronization: A real-to-sim feedback loop via a ROS 2 bridge keeps Isaac Sim simulations aligned with the physical robot state.
- AI and Deterministic Fallback: When a transformer policy fails or loses confidence, control is handed off to ROS 2 and MoveIt 2 motion planning, so a policy failure does not halt the system.
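The handoff between learned policies and the deterministic planner can be sketched as a simple selection rule. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes the policy exposes a scalar confidence in [0, 1] (the source does not say how confidence is computed), the 0.6 threshold is arbitrary, and `PolicyOutput`, `select_controller`, and `fallback_planner` are hypothetical names. No real ROS 2 or MoveIt 2 calls are made; the planner is a plain callable standing in for a planning request.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class PolicyOutput:
    actions: Optional[List[float]]  # joint targets, or None if the policy failed
    confidence: float               # assumed score in [0, 1]

def select_controller(policy_out: PolicyOutput,
                      fallback_planner: Callable[[], List[float]],
                      threshold: float = 0.6) -> Tuple[str, List[float]]:
    """Use the learned policy when it is confident, else the deterministic planner."""
    if policy_out.actions is not None and policy_out.confidence >= threshold:
        return "policy", policy_out.actions
    # Stand-in for a ROS 2 + MoveIt 2 planning request (hypothetical hook).
    return "fallback", fallback_planner()
```

Returning the source label alongside the actions makes it easy to log how often the fallback is triggered, which is one way to judge whether the threshold is set sensibly.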
Benefits
- Reduces human involvement in repetitive warehouse tasks, moving toward zero-touch operations.
- Provides a scalable foundation for future adaptive robotic deployments across logistics operations.
- Improves robustness through a fallback pipeline from AI control to deterministic ROS 2 + MoveIt 2 motion planning, so a policy failure does not stop operations.
- Enables experimentation with advanced transformer models while remaining compatible with existing robotics stacks.
Skills
- Transformer-based robotic control (ACT, VLA): primary control layer
- ROS 2 and MoveIt 2 motion planning: fallback control layer
- Vision-based manipulation and dataset creation
- Digital twin creation and simulation workflows
- GPU-based training and optimization (HPC)
- System architecture for multi-component robotic ecosystems
How ACT Works (Action Chunking Transformer)
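ACT trains on demonstration sequences so that the robot predicts a chunk of upcoming actions at once rather than a single next step, using transformer attention to weigh the most relevant parts of its input context. Predicting in chunks shortens the effective decision horizon, which limits how quickly small errors compound. When ACT or VLA models fail or fall below confidence thresholds, control is handed off to ROS 2 + MoveIt 2 for deterministic execution.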
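One way to see how chunked predictions turn into a single command per timestep is temporal ensembling, the blending scheme described in the ACT paper. Whether this project enabled it is not stated, so the sketch below is illustrative only. It uses scalar actions for brevity, and the function name `temporal_ensemble` and the default `m=0.1` are assumptions.

```python
import math

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every prediction made for timestep t.

    chunks[s] is the chunk predicted at step s, so chunks[s][t - s] is that
    chunk's action for step t. Weights are exp(-m * i), with i = 0 for the
    oldest prediction, following the ACT paper's description.
    """
    # Collect predictions for step t, oldest chunk first.
    preds = [chunk[t - s] for s, chunk in enumerate(chunks)
             if 0 <= t - s < len(chunk)]
    if not preds:
        raise ValueError("no chunk covers timestep t")
    weights = [math.exp(-m * i) for i in range(len(preds))]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)
```

With `m = 0` all overlapping predictions are averaged equally; a larger `m` shifts weight toward the oldest prediction, which gives smoother motion at the cost of reacting more slowly to new observations.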
Media
Digital twin in Isaac Sim: simulation-to-real transfer
Final ACT model inference achieving stable motion
Learning Experience
This internship at CEVA Logistics provided deep insights into bridging academic research with industrial robotics. Key outcomes include:
Technical Skills
- ROS 2 and MoveIt 2: Built and tested simulation environments in Gazebo and Isaac Sim, learning the trade-offs between fidelity, real-time performance, and computational cost.
- YOLO OBB: Implemented orientation-aware object detection, gaining experience in annotation workflows and dataset preparation for robotic perception.
- ACT Imitation Learning: Trained and fine-tuned the Action Chunking Transformer, observing its strength in data efficiency and its sensitivity to demonstration quality.
- User Interface Design: Developed a plug-and-play CEVA Control Dashboard to simplify teleoperation, recording, and replay for non-technical users.
Research & Analytical Skills
- Sim-to-Real Gap: Learned to critically evaluate the limitations of simulation-only testing and to recognize when hardware-based validation is required for policy robustness.
- VLA Framework Exploration: Analyzed state-of-the-art Vision-Language-Action models (PI Zero, GR00T, smolVLA) for their warehouse automation potential.
- Research-to-Practice Translation: Strengthened the ability to deploy academic models into production-grade industrial workflows.
Professional Growth
- Project Management: Maintained continuity through alternative pipelines during equipment downtime, improving contingency planning.
- Stakeholder Communication: Explained complex technical concepts to diverse audiences through structured reporting and system visualizations.
- Domain Expertise: Gained deep understanding of warehouse VAS operations and identified high-impact areas for adaptive automation.
Credits
A special thank you to all those who contributed to the development of this project:
- CEVA Logistics Solutions Singapore (CMA CGM Group): Provided real-world warehouse use cases, the operational environment, and industry expertise for testing and validation.
- National University of Singapore: Provided the research platform, computational resources, and academic guidance for developing advanced robotic control systems.
- LeRobot Framework (Hugging Face): Open-source framework for demonstration recording and dataset management.
- smolVLA Model (Hugging Face and LeRobot): Vision-Language-Action model for multimodal reasoning and instruction-based control.
- NVIDIA Isaac Sim: Enabled high-fidelity simulation and digital twin creation for safe and efficient sim-to-real transfer.
- ROS 2 and MoveIt 2 Community: Open-source robotics frameworks forming the deterministic fallback control layer.
Conclusion & Future Work
This project established a strong foundation for adaptive robotic control in warehouse automation. The integration of ACT for imitation-based control, smolVLA for vision-language reasoning, and digital twin synchronization demonstrated a feasible pathway toward intelligent, real-world robot autonomy, progressing from simulation prototyping in Gazebo and Isaac Sim to teleoperation data collection and validation on the SO-101 hardware platform.
Next Phase:
- Extend smolVLA with a curriculum learning pipeline for progressively complex tasks
- Explore hierarchical skill learning to decompose warehouse operations into reusable motion primitives
- Improve wrist stability through action smoothing and low-pass filtering
- Extend validation to multi-object scenarios
- Integrate the control dashboard with AI inference modules for automated policy deployment
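The planned action smoothing could start from something as simple as a first-order low-pass filter over the commanded joint values. The sketch below is an assumption about one possible approach, not a description of what was built: the function name `low_pass` and the default `alpha` are hypothetical, and a real deployment would filter each joint of the wrist independently.

```python
def low_pass(samples, alpha=0.2):
    """First-order low-pass (exponential moving average) over a command sequence.

    alpha in (0, 1]: smaller values smooth more but add lag.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    state = None
    for x in samples:
        # The first sample initializes the filter; later ones are blended in.
        state = x if state is None else alpha * x + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed
```

For example, `low_pass([0.0, 1.0, 1.0], alpha=0.5)` turns a step command into a gradual approach, trading a little responsiveness for reduced jitter.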