ABSTRACT
Modern advanced driver assistance systems (ADAS) rely on a range of sensors, including radar, ultrasound, cameras, and LIDAR. Active sensors such as radar are primarily used to detect traffic participants (TPs) and measure their distance. More expensive LIDAR sensors are used to estimate both TPs and scene elements (SEs). However, camera-based systems have the potential to achieve the same at a much lower cost, while enabling new capabilities such as determining TP and SE semantics, as well as their interactions, in complex traffic scenes.
In this talk, we present several recent developments. A common theme is overcoming the challenges posed by the lack of large-scale annotations in deep learning frameworks. We introduce approaches to correspondence estimation that are trained on purely synthetic data but adapt well to real data at test time. Posing the problem in a metric learning framework with fully convolutional architectures yields estimation accuracies that surpass the prior state of the art by large margins. We introduce object detectors that are light enough for ADAS, trained with knowledge distillation to retain the accuracy of deeper architectures. Our semantic segmentation is trained with weak supervision that requires only a tenth of the conventional annotation time. We propose methods for 3D reconstruction that use deep supervision to recover fine object part locations, yet rely on purely synthetic 3D CAD models for training. Further, we develop generative adversarial frameworks for reconstruction that alleviate the need to align 3D CAD models with images at training time. Finally, we present a framework for TP behavior prediction in complex traffic scenes that uses the above as inputs to predict future trajectories which fully account for TP-TP and TP-SE interactions. Our approach allows the prediction of diverse, uncertain outcomes and is trained to predict long-term strategic behaviors in complex scenes.
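To make the metric learning formulation for correspondence estimation concrete, the following is a minimal sketch, not the exact method from the talk: a per-pixel contrastive loss over dense descriptors produced by a fully convolutional network, written in PyTorch. The function name, margin value, and coordinate conventions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feat_a, feat_b, pos_a, pos_b, neg_b, margin=0.5):
    """Pull descriptors of matching pixels together, push non-matches apart.

    feat_a, feat_b: (C, H, W) dense feature maps from a fully convolutional net.
    pos_a, pos_b:   (N, 2) integer (x, y) coordinates of matching pixels.
    neg_b:          (N, 2) coordinates of non-matching pixels in image b.
    (All shapes and names are illustrative, not the talk's actual interface.)
    """
    # Gather and L2-normalize per-pixel descriptors: shape (C, N).
    da = F.normalize(feat_a[:, pos_a[:, 1], pos_a[:, 0]], dim=0)
    db = F.normalize(feat_b[:, pos_b[:, 1], pos_b[:, 0]], dim=0)
    dn = F.normalize(feat_b[:, neg_b[:, 1], neg_b[:, 0]], dim=0)

    pos_dist = (da - db).pow(2).sum(dim=0)  # matching pairs: drive distance to 0
    neg_dist = (da - dn).pow(2).sum(dim=0)  # non-matches: push beyond the margin
    return pos_dist.mean() + F.relu(margin - neg_dist).mean()
```

Because the loss operates on dense feature maps rather than cropped patches, descriptors for every pixel are produced in a single forward pass, which is what makes the fully convolutional formulation efficient at test time.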
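Similarly, the knowledge distillation setup can be illustrated with the standard soft-target loss: a lightweight student detector backbone is trained to match the temperature-softened class outputs of a deeper teacher alongside the ground-truth labels. This is a generic sketch of the technique rather than the talk's exact recipe; `T` and `alpha` are illustrative hyperparameters.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend soft-target matching (teacher at temperature T) with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),  # student log-probabilities
        F.softmax(teacher_logits / T, dim=1),      # teacher soft targets
        reduction="batchmean",
    ) * (T * T)                                    # standard temperature rescaling
    hard = F.cross_entropy(student_logits, labels) # ordinary supervised term
    return alpha * soft + (1.0 - alpha) * hard
```

The soft targets carry the teacher's inter-class similarity structure, which is how a shallow, ADAS-friendly student can approach the accuracy of a much deeper architecture.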