no code implementations • 21 Dec 2023 • Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals.
Ranked #4 on Text-to-Video Generation on MSR-VTT
no code implementations • 14 Dec 2023 • Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.
no code implementations • 12 Dec 2023 • Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets.
no code implementations • 11 Dec 2023 • Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling.
Ranked #1 on Video Prediction on Kinetics-600 12 frames, 64x64
1 code implementation • 9 Nov 2023 • Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen
Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction.
Ranked #2 on Surface Normals Estimation on NYU Depth v2
no code implementations • 9 Oct 2023 • Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation.
Ranked #2 on Video Prediction on Kinetics-600 12 frames, 64x64
no code implementations • 13 Feb 2023 • James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan
In particular, we ask "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?"
no code implementations • 20 Dec 2022 • Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross
Detecting actions in untrimmed videos should not be limited to a small, closed set of classes.
1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
1 code implementation • 22 Dec 2021 • Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin
We propose OpenSeg to address the above issue while still making use of scalable image-level supervision of captions.
4 code implementations • ICLR 2022 • Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
Ranked #2 on Open Vocabulary Object Detection on Objects365
1 code implementation • 26 Nov 2019 • Xiuye Gu, Weixin Luo, Michael S. Ryoo, Yong Jae Lee
Cameras are prevalent in our daily lives, and enable many useful systems built upon computer vision technologies such as smart cameras and home robots for service applications.
2 code implementations • CVPR 2019 • Xiuye Gu, Yijie Wang, Chongruo Wu, Yong-Jae Lee, Panqu Wang
We present a novel deep neural network architecture for end-to-end scene flow estimation that directly operates on large-scale 3D point clouds.
no code implementations • 16 Nov 2017 • Deng Cai, Xiuye Gu, Chaoqi Wang
However, there are serious flaws in the evaluations of existing deep hashing papers: (1) The datasets they used are too small and simple to simulate the real CBIR situation.
1 code implementation • CVPR 2017 • Maheen Rashid, Xiuye Gu, Yong Jae Lee
Instead of directly finetuning a network trained to detect keypoints on human faces to animal faces (which is sub-optimal since human and animal faces can look quite different), we propose to first adapt the animal images to the pre-trained human detection network by correcting for the differences in animal and human face shape.