ViT-VS

On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing

Alessandro Scherl¹,², Stefan Thalhammer², Bernhard Neuberger², Wilfried Wöber², José García-Rodríguez¹
¹Department of Computer Technology, University of Alicante, Spain
²Industrial Engineering Department, UAS Technikum Vienna, Austria
Teaser figure: ViT-VS category-level object grasping. Left: ViT correspondence matching with a white mug as the desired image. Right: successful grasp of the target object.

Abstract

Visual servoing enables robots to precisely position their end-effector relative to a target object. Classical methods rely on hand-crafted features and are therefore universally applicable without task-specific training, but they often struggle with occlusions and environmental variations; learning-based approaches improve robustness yet typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also generalizing beyond the provided reference image. Our approach achieves full convergence in unperturbed scenarios and surpasses classical image-based visual servoing by up to a 31.2% relative improvement in perturbed scenarios. It also matches the convergence rates of learning-based methods while requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference image from the same category.

Method Overview

Our ViT-VS approach combines Vision Transformer (ViT) correspondence matching with classical image-based visual servoing (IBVS). The method addresses key challenges of using pretrained ViTs for robotic control, including compensating for the rotation invariance of ViT features and stabilizing commanded velocities for smoother trajectories.
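To make the classical IBVS component concrete, the sketch below computes a camera velocity twist v = -λ L⁺ e from matched, normalized image points, assuming point features and known depths. The correspondences are taken as given (in ViT-VS they come from ViT feature matching), and the paper's rotation compensation and velocity stabilization are not reproduced here; all function names and the numbers in the example are illustrative.

```python
# Minimal IBVS control sketch: given matched normalized image points in the
# current and desired views, compute the camera twist v = -lambda * L^+ * e,
# where L is the stacked point-feature interaction matrix.
import numpy as np

def interaction_matrix(x: float, y: float, Z: float) -> np.ndarray:
    """Interaction matrix of a normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def ibvs_velocity(current_pts, desired_pts, depths, lam=0.5):
    """Camera twist [vx, vy, vz, wx, wy, wz] from matched normalized points."""
    L_rows, errors = [], []
    for (x, y), (xd, yd), Z in zip(current_pts, desired_pts, depths):
        L_rows.append(interaction_matrix(x, y, Z))
        errors.append([x - xd, y - yd])
    L = np.vstack(L_rows)          # (2N, 6) stacked interaction matrix
    e = np.concatenate(errors)     # (2N,) feature error
    return -lam * np.linalg.pinv(L) @ e

# Example with three synthetic correspondences at an assumed depth of 0.5 m.
cur = [(0.10, 0.05), (-0.08, 0.12), (0.02, -0.07)]
des = [(0.00, 0.00), (-0.15, 0.10), (0.05, -0.10)]
print(ibvs_velocity(cur, des, depths=[0.5, 0.5, 0.5]))
```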

Experiments

Category-Level Object Grasping

Our method demonstrates robust performance across all object categories, achieving success rates of 100% for shoes, 90% for mugs, and 80% for toy cars. This experiment shows a representative grasping sequence of an unseen blue toy car.

Object Category Success Rate
Shoe 10/10 (100%)
Mug 9/10 (90%)
Toy Car 8/10 (80%)
Overall 27/30 (90%)

Category-Level Object Sorting

Building on our category-level grasping capabilities, we demonstrate a complete pick-and-place sorting task using only reference images from each category. The system identifies unseen objects from the same categories and successfully sorts them into designated locations with a high success rate.

This experiment highlights the practical application of our ViT-VS approach in a realistic object manipulation scenario, demonstrating how pretrained vision transformers can enable generalized robotic manipulation without task-specific training.
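As a rough illustration of how such a sorting loop can be organized, the sketch below pairs each category with a single reference image and a drop-off pose; all names, poses, and callbacks are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative category-level sorting loop: each detected object is matched to
# one category reference image, servoed to, grasped, and placed at a fixed
# drop-off pose. Names and poses are placeholders.
from dataclasses import dataclass

@dataclass
class DetectedObject:
    category: str      # e.g. "shoe", "mug", "toy_car"
    instance_id: int

CATEGORY_REFERENCE = {"shoe": "refs/shoe.png", "mug": "refs/mug.png", "toy_car": "refs/toy_car.png"}
DROP_OFF_POSE = {"shoe": (0.40, -0.30), "mug": (0.40, 0.00), "toy_car": (0.40, 0.30)}

def sort_objects(objects, servo_to_reference, grasp, place):
    """Run pick-and-place for every object using only category reference images."""
    for obj in objects:
        servo_to_reference(CATEGORY_REFERENCE[obj.category])  # ViT-VS alignment step
        grasp()
        place(DROP_OFF_POSE[obj.category])

# Usage with stand-in callbacks that only log the actions.
sort_objects(
    [DetectedObject("mug", 0), DetectedObject("shoe", 1)],
    servo_to_reference=lambda ref: print("servoing toward", ref),
    grasp=lambda: print("grasping"),
    place=lambda pose: print("placing at", pose),
)
```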

Industrial Box Manipulation

We demonstrate industrial box manipulation with a 100% success rate (n = 20) on a mobile robot, despite starting-position errors of up to ±10 cm. This experiment shows the robustness of our approach in real-world industrial scenarios.

Detailed Experiment

This experiment demonstrates a real-world evaluation on the "Hollywood poster" over 1500 iterations. A detailed visualization and analysis of the initial image, desired image, final image, camera velocities, and position errors is presented in our paper, demonstrating full convergence from a partially visible and heavily rotated initial pose.
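A minimal sketch of how per-iteration camera velocities and position errors could be logged for such an analysis is shown below; the loop structure, convergence threshold, and toy step function are assumptions, not the paper's code.

```python
# Hypothetical logging loop for a servoing run: record the camera velocity norm
# and feature-error norm each iteration and stop once the error settles.
import numpy as np

def run_and_log(step, max_iters=1500, err_tol=1e-3):
    """Call step() each iteration; it returns (camera_velocity, feature_error)."""
    history = []
    for i in range(max_iters):
        v, e = step()
        err_norm = float(np.linalg.norm(e))
        history.append((i, float(np.linalg.norm(v)), err_norm))
        if err_norm < err_tol:
            break
    return history

# Toy stand-in for one servoing step: the error decays geometrically to zero.
state = {"e": np.array([0.2, -0.1])}
def toy_step():
    state["e"] = 0.99 * state["e"]
    return -0.5 * state["e"], state["e"]

log = run_and_log(toy_step)
print(f"converged after {len(log)} iterations")
```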

Simulation Experiments

In simulation, our approach converges in all unperturbed trials and improves the perturbed convergence rate over classical IBVS by 31.2% relative (76.6% vs. 58.4%), while matching the convergence rates of learning-based methods without any task- or object-specific training.

Method Convergence Rate (Unperturbed/Perturbed)
ViT-VS (ours) 100.0% / 76.6%
Classical IBVS 89.6% / 58.4%
Deep Learning-based 100.0% / 76.0%
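For reference, the quoted 31.2% relative improvement follows directly from the perturbed-scenario rates in the table above (a quick arithmetic check, not additional results):

```python
# Relative improvement of ViT-VS over classical IBVS in perturbed scenarios,
# computed from the convergence rates reported above.
vit_vs, classical_ibvs = 76.6, 58.4
relative_improvement = (vit_vs - classical_ibvs) / classical_ibvs * 100
print(f"{relative_improvement:.1f}%")  # prints 31.2%
```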