VOST consists of more than 700 high-resolution videos captured in diverse environments. The videos are 20 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex transformations and capture their full temporal extent.
Source: Breaking the “Object” in Video Object Segmentation