Natural-Language Robot Manipulation via MCP: An Integrated Framework for Vision-Guided Pick-and-Place Automation
This work presents a unified software framework for controlling robotic manipulators through unconstrained natural language, built around the Model Context Protocol (MCP). Users issue commands via text, voice, or a web GUI; a large language model interprets each instruction and decomposes it into structured tool calls executed by a FastMCP server. The system integrates real-time open-vocabulary object detection, instance segmentation, and workspace coordinate transforms to ground language in the physical scene, enabling complex pick-and-place operations. A Redis-based pub/sub architecture decouples perception, reasoning, and control into independent processes, while a platform-agnostic hardware abstraction layer supports both the Niryo Ned2 and WidowX-250 robotic arms. In an evaluation across eight diverse manipulation tasks — including multi-step operations, spatial reasoning, and multilingual instructions — the system achieved an 83% success rate on a Niryo Ned2, with the primary bottleneck being open-vocabulary perception rather than LLM reasoning or motion control. All components are publicly available, providing a reproducible foundation for natural-language interfaces to cyber-physical systems.
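To make the control flow concrete, the sketch below illustrates the pattern the abstract describes: the LLM emits structured tool calls, and a server-side dispatcher routes each call to a perception or motion handler. All names and values here (`detect_object`, `pick_and_place`, the coordinates) are illustrative assumptions, not the framework's actual API; a real deployment would register these handlers as FastMCP tools and exchange messages over Redis.

```python
import json

# Hypothetical stand-ins for the framework's tools (names assumed,
# not taken from the paper's actual interface).

def detect_object(label: str) -> dict:
    # Stand-in for open-vocabulary detection: returns the object's
    # pose in workspace coordinates after the camera-to-robot transform.
    return {"label": label, "x": 0.25, "y": -0.10, "z": 0.02}

def pick_and_place(x: float, y: float, z: float, target: str) -> str:
    # Stand-in for the motion-control tool on the robot arm.
    return f"placed object from ({x}, {y}, {z}) onto {target}"

TOOLS = {"detect_object": detect_object, "pick_and_place": pick_and_place}

def dispatch(tool_call_json: str):
    """Execute one structured tool call: {"tool": ..., "args": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["tool"]](**call["args"])

# Two calls an LLM might emit for "put the red block on the tray":
pose = dispatch('{"tool": "detect_object", "args": {"label": "red block"}}')
result = dispatch(json.dumps({
    "tool": "pick_and_place",
    "args": {"x": pose["x"], "y": pose["y"], "z": pose["z"], "target": "tray"},
}))
print(result)
```

Decoupling the dispatcher from the handlers in this way is what lets the perception, reasoning, and control processes run independently and communicate only through serialized messages.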
