This article was contributed by Can Kocagil, data scientist at OREDATA.
From spatial to spatiotemporal visual processing
Instance-based classification, segmentation, and object detection in images are fundamental problems in computer vision. In contrast to image-level information retrieval, video-level problems aim at detecting, segmenting, and tracking object instances in the spatiotemporal domain, which has both space and time dimensions.
Video domain learning is a crucial task for spatiotemporal understanding in camera- and drone-based systems, with applications in video editing, autonomous driving, pedestrian tracking, augmented reality, robot vision, and many more. Furthermore, it helps us decode raw spatiotemporal data into actionable insights, since video has richer content than purely spatial visual data. With the addition of the temporal dimension to our decoding process, we get further information from the video frames about
- Viewpoint variations
- Local ambiguities
Because of this, video-level information retrieval has gained popularity as a research area, and it continues to attract the community along several lines of research in video understanding.
Conceptually speaking, video-level information retrieval algorithms are mostly adapted from image-level ones by adding heads that capture temporal information. Aside from simpler video-level classification and regression tasks, video object detection, video object tracking, video captioning, and video instance segmentation are the most common tasks.
To start with, let’s recall the image-level instance segmentation problem.
Image-level instance segmentation
Instance segmentation not only groups pixels into different semantic classes, but also groups them into different object instances. A two-stage paradigm is usually adopted, which first generates object proposals using a Region Proposal Network (RPN), and then predicts object bounding boxes and masks using aggregated RoI features. Different from semantic segmentation, which segments different semantic classes only, instance segmentation also segments the different instances of each class.
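To make the semantic-vs-instance distinction concrete, here is a toy sketch (not the RPN-based two-stage pipeline described above): given a binary semantic mask where 1 marks the "person" class, a simple flood fill splits the single semantic class into separate object instances.

```python
from collections import deque

def label_instances(semantic_mask):
    """Split a binary semantic mask into instances by grouping
    4-connected foreground pixels under separate instance IDs."""
    h, w = len(semantic_mask), len(semantic_mask[0])
    instance_ids = [[0] * w for _ in range(h)]
    next_id = 0
    for y in range(h):
        for x in range(w):
            if semantic_mask[y][x] == 1 and instance_ids[y][x] == 0:
                next_id += 1                      # new object instance found
                instance_ids[y][x] = next_id
                queue = deque([(y, x)])
                while queue:                      # flood-fill this component
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and semantic_mask[ny][nx] == 1
                                and instance_ids[ny][nx] == 0):
                            instance_ids[ny][nx] = next_id
                            queue.append((ny, nx))
    return instance_ids, next_id

# Two separate "person" blobs: semantic segmentation sees one class,
# instance segmentation distinguishes two objects.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
ids, count = label_instances(mask)  # count == 2
```

Real instance segmentation models predict the masks themselves; the point here is only that pixels of the same class end up under different instance IDs.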
Video classification
The video classification task is a direct adaptation of image classification to the video domain. Instead of single images, video frames are given to the model to learn from. By nature, sequences of temporally correlated images are fed to learning algorithms that combine spatial and temporal visual features to produce classification scores.
The core idea is that, given specific video frames, we want to identify the type of video from pre-defined classes.
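A minimal sketch of this idea, assuming some backbone has already produced per-frame class scores (the scores and class names below are made up for illustration): the simplest temporal aggregation is to average scores over frames and pick the highest-scoring class.

```python
def classify_video(frame_scores, class_names):
    """Average per-frame class scores over time, then return the
    argmax class. frame_scores has shape (num_frames, num_classes)."""
    num_classes = len(class_names)
    avg = [0.0] * num_classes
    for scores in frame_scores:
        for c, s in enumerate(scores):
            avg[c] += s / len(frame_scores)
    best = max(range(num_classes), key=lambda c: avg[c])
    return class_names[best], avg

# Hypothetical per-frame scores from a backbone, for a 3-frame clip.
classes = ["cooking", "swimming", "running"]
scores = [
    [0.2, 0.1, 0.7],
    [0.3, 0.1, 0.6],
    [0.1, 0.2, 0.7],
]
label, avg = classify_video(scores, classes)  # label == "running"
```

Modern video classifiers learn the temporal aggregation (3D convolutions, temporal attention) rather than averaging, but the input/output contract is the same.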
Video captioning
Video captioning is the task of generating captions for a video by understanding its actions and events, which enables efficient retrieval of the video through text. The idea here is that, given specific video frames, we want to generate natural language that describes the concept and context of the video.
Video captioning is a multidisciplinary problem that requires algorithms from both computer vision (to extract features) and natural language processing (to map extracted features to natural language).
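To see how the vision and language halves connect, here is a deliberately naive template-based sketch: the "vision" side is stubbed out as hypothetical action and object scores aggregated over a clip, and the "language" side is a fixed sentence template. Real systems replace both with learned models (a video encoder feeding a language decoder).

```python
def caption_from_detections(actions, objects):
    """Toy template captioner: turn detected action/object labels
    (name -> confidence) into a natural-language sentence."""
    action = max(actions, key=actions.get)            # most confident action
    names = sorted(objects, key=objects.get, reverse=True)[:2]
    if len(names) == 2:
        subject = f"a {names[0]} and a {names[1]}"
    else:
        subject = f"a {names[0]}"
    return f"{subject} {action} in the video"

# Hypothetical detections aggregated over the whole clip.
actions = {"running": 0.8, "sitting": 0.1}
objects = {"dog": 0.9, "person": 0.7, "tree": 0.2}
caption = caption_from_detections(actions, objects)
# caption == "a dog and a person running in the video"
```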
Video object detection (VOD)
Video object detection aims to detect objects in videos; it was first proposed as part of the ImageNet visual challenge. Even though associating detections and providing identities across frames can improve detection quality, this challenge is evaluated with spatially preserved, per-frame detection metrics and does not require joint object detection and tracking. In other words, unlike the richer video-level semantic tasks below, there is no joint detection, segmentation, and tracking.
The difference between image-level object detection and video object detection is that a time series of images is given to the machine learning model, which therefore carries temporal information that image-level processes lack.
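One way that extra temporal information pays off is score stabilization. The sketch below (a simplified take on rescoring ideas such as Seq-NMS; the numbers are invented) averages a tracked detection's confidence over a sliding temporal window, so a single blurry frame no longer drops the object below the detection threshold.

```python
def smooth_scores(track_scores, window=3):
    """Rescore one tracked detection by averaging its per-frame
    confidence over a sliding temporal window."""
    smoothed = []
    for t in range(len(track_scores)):
        lo = max(0, t - window // 2)
        hi = min(len(track_scores), t + window // 2 + 1)
        smoothed.append(sum(track_scores[lo:hi]) / (hi - lo))
    return smoothed

# The detection briefly drops below a 0.5 threshold in frame 2
# (say, due to motion blur) but survives after temporal smoothing.
raw = [0.9, 0.8, 0.3, 0.85, 0.9]
smoothed = smooth_scores(raw)  # every smoothed score is above 0.5
```

A per-frame (image-level) detector has no access to neighboring frames, so it cannot perform this kind of recovery.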
Video object tracking (VOT)
Video object tracking is the process of both localizing objects and tracking them across the video. Given an initial set of detections in the first frame, the algorithm assigns a unique ID to each object and tries to match it consistently at every timestep. For instance, if a particular object has the ID "P1" in the first frame, the model tries to assign that same "P1" ID to the object in all remaining frames.
Video object tracking tasks are generally categorized as detection-based and detection-free tracking approaches. In detection-based tracking algorithms, objects are jointly detected and tracked such that the tracking part improves the detection quality, whereas in detection-free approaches we’re given an initial bounding box and try to track that object across video frames.
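The ID-matching step at the heart of detection-based tracking can be sketched with greedy intersection-over-union (IoU) association (a simplification of trackers like SORT, which add motion models and optimal assignment; boxes below are invented `(x1, y1, x2, y2)` tuples):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_ids(prev_tracks, detections, threshold=0.3):
    """Greedy tracking step: match each new detection to the existing
    track ID with the highest IoU; unmatched detections get new IDs."""
    assigned, used = {}, set()
    next_id = max(prev_tracks, default=0) + 1
    for det in detections:
        best_id, best_iou = None, threshold
        for tid, box in prev_tracks.items():
            score = iou(det, box)
            if tid not in used and score >= best_iou:
                best_id, best_iou = tid, score
        if best_id is None:                       # new object enters the scene
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        assigned[best_id] = det
    return assigned

# Frame 2 brings two slightly moved boxes plus one brand-new object.
tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
frame2 = [(1, 0, 11, 10), (21, 20, 31, 30), (50, 50, 60, 60)]
assigned = match_ids(tracks, frame2)  # IDs 1 and 2 persist; ID 3 is new
```

In a detection-free setting the same matching idea applies, except the initial box is given by a human rather than a detector.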
Video instance segmentation (VIS)
Video instance segmentation is a recently introduced computer vision research topic that aims at joint detection, segmentation, and tracking of instances in the video domain. Because video instance segmentation is supervised, it requires high-quality human annotations: bounding boxes and binary segmentation masks with predefined categories. Since it demands both segmentation and tracking, it is a more challenging task than image-level instance segmentation. Hence, as opposed to the previous fundamental computer vision tasks, video instance segmentation requires multidisciplinary, aggregated approaches. VIS is a contemporary all-in-one computer vision task composed of these more general vision problems.
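The "composition" nature of VIS can be illustrated by its tracking step operating on masks instead of boxes. In this toy sketch (masks are represented as sets of pixel coordinates, an assumption for illustration only), instance IDs are carried to the next frame's segmentation masks by highest mask IoU:

```python
def mask_iou(a, b):
    """IoU of two instance masks represented as sets of (y, x) pixels."""
    return len(a & b) / len(a | b) if a | b else 0.0

def propagate_ids(prev_masks, new_masks, threshold=0.3):
    """One VIS tracking step: assign each new mask the existing instance
    ID with the highest mask IoU; unmatched masks get fresh IDs."""
    out, used = {}, set()
    next_id = max(prev_masks, default=0) + 1
    for mask in new_masks:
        best_id, best = None, threshold
        for iid, pm in prev_masks.items():
            score = mask_iou(mask, pm)
            if iid not in used and score >= best:
                best_id, best = iid, score
        if best_id is None:                      # instance appears for the first time
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        out[best_id] = mask
    return out

# Instance 1 shifts one pixel right between frames; a second instance appears.
frame1 = {1: {(0, 0), (0, 1), (0, 2)}}
frame2 = [{(0, 1), (0, 2), (0, 3)}, {(5, 5), (5, 6)}]
out = propagate_ids(frame1, frame2)  # mask 1 keeps ID 1; new mask gets ID 2
```

Detection provides the candidate instances, segmentation provides the masks, and this association step provides the tracking: all three components of the VIS task in one loop.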
Knowledge brings value: Video-level information retrieval in action
Understanding the technical boundaries of video-level information retrieval tasks improves your grasp of business concerns and customer needs from a practical perspective. For example, when a client says, "we have videos and want to extract only the locations of pedestrians from the videos," you'll recognize that your task is video object detection. What if they want to both localize and track them in videos? Then your problem translates to video object tracking. Let's say they also want to segment them across videos. Your task is now video instance segmentation. However, if a client says they want to generate automatic captions for videos, from a technical point of view your problem can be formulated as video captioning. Understanding the scope of the project and deriving technical requirements depends on the kind of insights clients want to extract, and formulating the issue as a well-posed optimization problem is crucial for technical teams.