Unsupervised Learning of Multi-Frame Optical Flow with Occlusions
Learning optical flow with neural networks is hampered by the need for obtaining training data with associated ground truth. Unsupervised learning is a promising direction, yet the performance of current unsupervised methods is still limited. In particular, the lack of proper occlusion handling in commonly used data terms constitutes a major source of error. While most optical flow methods process pairs of consecutive frames, more advanced occlusion reasoning can be realized when considering multiple frames. In this paper, we propose a framework for unsupervised learning of optical flow and occlusions over multiple frames. More specifically, we exploit the minimal configuration of three frames to strengthen the photometric loss and explicitly reason about occlusions. We demonstrate that our multi-frame, occlusion sensitive formulation outperforms existing unsupervised two-frame methods and even produces results on par with some fully supervised methods.
On the Integration of Optical Flow and Action Recognition
Most of the top-performing action recognition methods use optical flow as a black box input. Here, we take a deeper look at the combination of flow and action recognition and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine-tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations:
1) optical flow is useful for action recognition because it is invariant to appearance,
2) optical flow methods are optimized to minimize end-point-error (EPE), but the EPE of current methods is not well correlated with action recognition performance,
3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance,
4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and
5) optical flow learned for the task of action recognition differs from traditional optical flow especially inside the human body and at the boundary of the body.
These observations may encourage optical flow researchers to look beyond EPE as a goal and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.
Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. Besides, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state-of-the-art in optical flow under various levels of motion blur.
Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of the-Art
Recent years have witnessed amazing progress in AI related fields such as computer vision, machine learning and autonomous vehicles. As with any rapidly growing field, however, it becomes increasingly difficult to stay up-to-date or enter the field as a beginner. While several topic specific survey papers have been written, to date no general survey on problems, datasets and methods in computer vision for autonomous vehicles exists. This paper attempts to narrow this gap by providing a state-of-the-art survey on this topic. Our survey includes both the historically most relevant literature as well as the current state-of-the-art on several specific topics, including recognition, reconstruction, motion estimation, tracking, scene understanding and end-to-end learning. Towards this goal, we first provide a taxonomy to classify each approach and then analyze the performance of the state-of-the-art on several challenging benchmarking datasets including KITTI, ISPRS, MOT and Cityscapes. Besides, we discuss open problems and current research challenges.
To ease accessibility and accommodate missing references, we also provide an interactive platform which allows to navigate topics and methods, and provides additional information and project links for each paper.
Deep Discrete Flow
Motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network’s output forms the data term for discrete MAP inference in a pairwise Markov random field. We provide an extensive empirical investigation of network architectures and model parameters. At the time of submission, our method ranks second on the challenging MPI Sintel test set.
Exploiting Object Similarity in 3D Reconstruction
Despite recent progress, reconstructing outdoor scenes in 3D from movable platforms remains a highly difficult endeavor. Challenges include low frame rates, occlusions, large distortions and difficult lighting conditions. In this paper, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape will occur in the scene. This is particularly true for outdoor scenes where buildings and vehicles often suffer from missing texture or reflections, but share similarity in 3D shape. We take advantage of this shape similarity by locating objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows us to reduce noise while completing missing surfaces as objects of similar shape benefit from all observations for the respective category. We evaluate our approach with respect to LIDAR ground truth on a novel challenging suburban dataset and show its advantages over the state-of-the-art.
Displets: Resolving Stereo Ambiguities using Object Knowledge
Stereo techniques have witnessed tremendous progress over the last decades, yet some aspects of the problem still remain challenging today. Striking examples are reflecting and textureless surfaces which cannot easily be recovered using traditional local regularizers. In this paper, we therefore propose to regularize over larger distances using object-category specific disparity proposals (displets) which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The proposed displets encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as non-local regularizer for the challenging object class ‘car’ into a superpixel based CRF framework and demonstrate its benefits on the KITTI stereo evaluation.