RESEARCH

E-VAT: An Asymmetric End-to-End Approach to Visual Active Exploration and Tracking

A Dionigi, A Devo, L Guiducci, G Costante – IEEE Robotics and Automation Letters, 2022

The development of visual tracking systems is becoming a major goal for the Robotics community. Most of the works dealing with this topic focus exclusively on passive tracking, where the target is assumed to remain within the camera's field of view. Only a minority propose active approaches, capable not only of identifying the object to be tracked but also of producing motion control actions to maintain visual contact with it. However, all the methods introduced so far assume that the target is initially in the immediate proximity of the tracker. This represents an undesirable constraint on the applicability of these techniques. To overcome this limitation, we propose a novel End-to-End Deep Reinforcement Learning based system capable both of exploring the surrounding environment to find the target and of subsequently tracking it. To do this, we develop a network consisting of two sub-components: i) the Target-Detection Network, which detects the target in the camera's field of view, and ii) the Exploration and Tracking Network, which employs this information to switch between the exploration policy and the tracking policy, with the goal of exploring the environment, finding the target, and finally tracking it. Through different experiments, we demonstrate the effectiveness of our approach and its superior performance with respect to current state-of-the-art (SotA) methods.
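
As an illustration of the two-sub-component design described above, the following minimal PyTorch sketch gates between an exploration head and a tracking head using the detector's output. All layer sizes, the soft-switching scheme, and the module internals are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EVATSketch(nn.Module):
    """Minimal sketch of the two-sub-network structure: a detector decides
    whether the target is in view, and its output switches between the
    exploration and tracking policies. Sizes and soft gating are assumptions."""

    def __init__(self, n_actions: int = 6):
        super().__init__()
        # Shared visual encoder feeding both sub-components.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 64 * 9 * 9  # feature size for 84x84 RGB inputs
        # i) Target-Detection Network: probability the target is visible.
        self.detector = nn.Sequential(nn.Linear(feat, 256), nn.ReLU(),
                                      nn.Linear(256, 1), nn.Sigmoid())
        # ii) Exploration and Tracking Network: one policy head each.
        self.explore_head = nn.Linear(feat, n_actions)
        self.track_head = nn.Linear(feat, n_actions)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        z = self.encoder(frame)
        p_target = self.detector(z)  # detection signal
        # Soft switch: the tracking policy dominates when the target is seen.
        return p_target * self.track_head(z) + (1 - p_target) * self.explore_head(z)

policy = EVATSketch()
action_logits = policy(torch.rand(1, 3, 84, 84))  # one RGB observation
```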

Autonomous Single-Image Drone Exploration With Deep Reinforcement Learning and Mixed Reality

A Devo, J Mao, G Costante, G Loianno – IEEE Robotics and Automation Letters, 2022

Autonomous exploration is a longstanding goal of the robotics community. Aerial drone navigation has proven to be especially challenging: stringent requirements on cost, weight, maneuverability, and power consumption prevent exploration approaches from being easily employed or adapted to different types of environments. End-to-End Deep Reinforcement Learning (DRL) techniques based on Convolutional Network approximators, which grant constant-time computation and predefined memory usage while delivering high visual perception capabilities, represent a very promising alternative to current state-of-the-art solutions relying on metric environment reconstruction. In this work, we address the autonomous exploration problem for aerial robots with a monocular camera using DRL. Specifically, we propose a novel asymmetric actor-critic model for drone exploration that efficiently leverages ground-truth information provided by the simulator to speed up learning and enhance final exploration performance. Furthermore, to reduce the sim-to-real gap for exploration, we present a novel mixed reality framework that allows an easier, smoother, and safer simulation-to-real-world transition. Both aspects make it possible to further exploit the great potential of simulation engines and help reduce the risk associated with deploying algorithms on a physical platform with no intermediate step between simulation and the real world, a practice well known to raise safety concerns, especially with aerial vehicles. Experimental results with a drone exploring multiple environments show the effectiveness of the proposed approach.
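
The asymmetric actor-critic idea can be made concrete with a short PyTorch sketch: the actor acts from the monocular image alone, while the critic is additionally fed privileged ground-truth state that only the simulator can provide. The content of the privileged vector and all network sizes are assumptions.

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Actor sees only the camera image; the critic also receives
    privileged simulator state (here an assumed 16-dim vector, e.g.,
    a summary of the true exploration map), used only at training time."""

    def __init__(self, n_actions: int = 4, priv_dim: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(), nn.Flatten())
        feat = 64 * 9 * 9  # for 84x84 inputs
        self.actor = nn.Linear(feat, n_actions)
        self.critic = nn.Sequential(
            nn.Linear(feat + priv_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, image, privileged=None):
        z = self.backbone(image)
        logits = self.actor(z)       # deployable: image only
        value = None
        if privileged is not None:   # training in simulation
            value = self.critic(torch.cat([z, privileged], dim=-1))
        return logits, value
```

At deployment only the actor is kept, so the privileged information never has to be estimated on the real platform.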

The Role of the Input in Natural Language Video Description

S Cascianelli, G Costante, A Devo, TA Ciarfuglia, P Valigi, ML Fravolini – IEEE Transactions on Multimedia, 2019

Natural language video description (NLVD) has recently received strong interest in the computer vision, natural language processing (NLP), multimedia, and autonomous robotics communities. State-of-the-art (SotA) approaches obtained remarkable results when tested on benchmark datasets, but they generalize poorly to new datasets. In addition, none of the existing works focuses on the processing of the input to NLVD systems, which is both visual and textual. In this paper, an extensive study is presented on the role of the visual input, evaluated with respect to overall NLP performance. This is achieved by performing data augmentation of the visual component, applying common transformations to model the camera distortions, noise, lighting, and camera positioning that are typical of real-world operative scenarios. A t-SNE-based analysis is proposed to evaluate the effects of the considered transformations on the overall visual data distribution. For this study, the English subset of the Microsoft Research Video Description (MSVD) dataset, commonly used for NLVD, is considered. It was observed that this dataset contains a relevant amount of syntactic and semantic errors. These errors were amended manually, and the new version of the dataset (called MSVD-v2) is used in the experimentation. The MSVD-v2 dataset is released to help gain insight into the NLVD problem.
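
The kinds of input transformations the study applies can be sketched with torchvision; the operators and parameters below are plausible stand-ins for camera distortion, lighting changes, camera positioning, and sensor noise, not the paper's exact pipeline.

```python
import torch
from torchvision import transforms

def gaussian_noise(img: torch.Tensor, std: float = 0.02) -> torch.Tensor:
    """Additive sensor-like noise on a [0, 1] image tensor."""
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),   # camera distortion
    transforms.ColorJitter(brightness=0.3, contrast=0.3),        # lighting
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05)),  # camera positioning
    transforms.ToTensor(),
    transforms.Lambda(gaussian_noise),                           # sensor noise
])
# usage: augmented = augment(frame)  # frame: a PIL.Image video frame
```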

Enhancing Continuous Control of Mobile Robots for End-to-End Visual Active Tracking

A Devo, A Dionigi, G Costante – Robotics and Autonomous Systems, 2021

In recent decades, visual target tracking has been one of the primary research interests of the Robotics community. Recent advances in Deep Learning have made visual tracking approaches effective and applicable in a wide variety of applications, ranging from automotive to surveillance and human assistance. However, the majority of existing works focus exclusively on passive visual tracking, i.e., tracking elements in sequences of images under the assumption that no actions can be taken to adapt the camera position to the motion of the tracked entity. On the contrary, in this work we address visual active tracking, in which the tracker has to actively search for and track a specified target. Current State-of-the-Art approaches use Deep Reinforcement Learning (DRL) techniques to address the problem in an end-to-end manner. However, two main problems arise: (i) most of the contributions focus only on discrete action spaces, and those that consider continuous control do not achieve the same level of performance; and (ii) if not properly tuned, DRL models can be challenging to train, resulting in considerably slow learning progress and poor final performance. To address these challenges, we propose a novel DRL-based visual active tracking system that provides continuous action policies. To accelerate training and improve the overall performance, we introduce additional objective functions and a Heuristic Trajectory Generator (HTG) to facilitate learning. Through extensive experimentation, we show that our method matches and surpasses the performance of other State-of-the-Art approaches, and demonstrate that, even though trained exclusively in simulation, it can successfully perform visual active tracking in real scenarios.
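
One way to picture such a training setup is as a weighted sum of the standard actor-critic losses, an auxiliary objective, and a behavior-cloning term toward the continuous actions suggested by the HTG. The sketch below, with assumed weights and an assumed auxiliary task, only shows the shape of such a combined loss, not the paper's exact objectives.

```python
import torch.nn.functional as F

def combined_loss(policy_loss, value_loss, pred_action, htg_action,
                  aux_pred, aux_target, bc_w=0.5, aux_w=0.1):
    """policy_loss/value_loss: standard actor-critic terms.
    bc term: imitate the HTG's continuous actions to bootstrap learning.
    aux term: an assumed auxiliary regression (e.g., target position)."""
    bc_loss = F.mse_loss(pred_action, htg_action)
    aux_loss = F.mse_loss(aux_pred, aux_target)
    return policy_loss + value_loss + bc_w * bc_loss + aux_w * aux_loss
```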

Deep Reinforcement Learning for Instruction Following Visual Navigation in 3D Maze-Like Environments

A Devo, G Costante, P Valigi – IEEE Robotics and Automation Letters, 2020

In this work, we address the problem of visual navigation by following instructions. In this task, the robot must interpret a natural language instruction in order to follow a predefined path in a possibly unknown environment. Although different approaches have been proposed in recent years, they are all based on the assumption that the environment contains objects or other elements that can be used to formulate instructions, such as houses or offices. On the contrary, we focus on situations where environment objects cannot be used to specify a navigation path. In particular, we consider 3D maze-like environments as our test bench because they can be very large and offer very intricate structures. We show that without reference points, visual navigation and instruction following can be rather challenging, and that standard approaches cannot be applied successfully. For this reason, we propose a new architecture that explicitly learns both visual navigation and instruction understanding. We demonstrate with simulated experiments that our method can effectively follow instructions and navigate in previously unseen mazes of various sizes.
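
A minimal sketch of an architecture that explicitly learns both components: a recurrent encoder for the instruction, a convolutional encoder for the current view, and a policy driven by their fused features. The encoder choices and sizes are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class InstructionFollower(nn.Module):
    """A GRU encodes the instruction tokens, a CNN encodes the current view,
    and the concatenated features drive the navigation policy."""

    def __init__(self, vocab_size=1000, n_actions=4, h=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.lang = nn.GRU(64, h, batch_first=True)  # instruction understanding
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(), nn.Flatten(),
            nn.Linear(64 * 9 * 9, h), nn.ReLU())     # visual navigation features
        self.policy = nn.Linear(2 * h, n_actions)

    def forward(self, tokens, frame):
        _, h_lang = self.lang(self.embed(tokens))    # (1, B, h)
        z_img = self.vision(frame)                   # (B, h)
        return self.policy(torch.cat([h_lang.squeeze(0), z_img], dim=-1))
```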

Towards Generalization in Target-Driven Visual Navigation by Using Deep Reinforcement Learning

A Devo, G Mezzetti, G Costante, ML Fravolini, P Valigi – IEEE Transactions on Robotics 36 (5), 1546-1561, 2020

Among the main challenges in robotics, target-driven visual navigation has gained increasing interest in recent years. In this task, an agent has to navigate in an environment to reach a user-specified target using vision alone. Recent fruitful approaches rely on deep reinforcement learning, which has proven to be an effective framework for learning navigation policies. However, current state-of-the-art methods require retraining, or at least fine-tuning, the model for every new environment and object. In real scenarios, this operation can be extremely challenging or even dangerous. For these reasons, we address generalization in target-driven visual navigation by proposing a novel architecture composed of two networks, both trained exclusively in simulation. The first has the objective of exploring the environment, while the second that of locating the target. They are specifically designed to work together, yet trained separately to aid generalization. In this article, we test our agent in both simulated and real scenarios, and validate its generalization capabilities through extensive experiments with previously unseen goals and unknown mazes, even much larger than those used for training.
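
At deployment, the two separately trained networks can be composed in a simple control loop: explore until the target-locating network becomes confident, then follow its guidance. The environment interface, the confidence threshold, and the networks' output conventions below are all assumptions made for illustration.

```python
import torch

def navigate(env, exploration_net, locator_net, max_steps=500, conf=0.9):
    """Explore with one network; hand control to the other once it is
    confident it has located the target. `env` is a hypothetical interface
    whose step() returns (observation, done)."""
    obs = env.reset()
    for _ in range(max_steps):
        with torch.no_grad():
            p_target, goal_action = locator_net(obs)  # confidence + suggested action
            action = goal_action if p_target > conf else exploration_net(obs)
        obs, done = env.step(action)
        if done:  # target reached
            return True
    return False
```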