Search Results for author: Yuan Gong

Found 31 papers, 22 papers with code

Generic Knowledge Boosted Pre-training For Remote Sensing Images

1 code implementation • 9 Jan 2024 • Ziyue Huang, Mingming Zhang, Yuan Gong, Qingjie Liu, Yunhong Wang

Deep learning models are essential for scene classification, change detection, land cover segmentation, and other remote sensing image understanding tasks.

Change Detection General Knowledge +4

Joint Audio and Speech Understanding

1 code implementation • 25 Sep 2023 • Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass

Humans are surrounded by audio signals that include both speech and non-speech sounds.

316

Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning

1 code implementation • 19 Sep 2023 • Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass

How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning?

Instruction Following Language Modelling +5

ToonTalker: Cross-Domain Face Reenactment

no code implementations • ICCV 2023 • Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, Yujiu Yang

Moreover, since no paired data is provided, we propose a novel cross-domain training scheme using data from two domains with the designed analogy constraint.

Face Reenactment Talking Face Generation

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

1 code implementation • 13 Jul 2023 • Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen

For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.

Retrieval Video Generation +2

241

TaleCrafter: Interactive Story Visualization with Multiple Characters

1 code implementation • 29 May 2023 • Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang

Accurate story visualization requires several necessary elements, such as identity consistency across frames, alignment between plain text and visual content, and a reasonable layout of objects in images.

Story Visualization Text-to-Image Generation

245

SAIL: Search-Augmented Instruction Learning

no code implementations • 24 May 2023 • Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass

Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information.

Denoising Fact Checking +3

Listen, Think, and Understand

2 code implementations • 18 May 2023 • Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass

On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but they lack audio perception capabilities.

Ranked #3 on Music Question Answering on MusicQA (using extra training data)

Language Modelling Large Language Model +1

316

3D GAN Inversion with Facial Symmetry Prior

no code implementations • CVPR 2023 • Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, Yujiu Yang

It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion.

Image Reconstruction Neural Rendering

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

1 code implementation • CVPR 2023 • Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang

Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets.

Contrastive Learning Image-text matching +9

Contrastive Audio-Visual Masked Autoencoder

1 code implementation • 2 Oct 2022 • Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.

Ranked #1 on Sound Prompted Semantic Segmentation on ADE20K

Audio Classification Audio Tagging +6

209

Rethinking Knowledge Distillation via Cross-Entropy

1 code implementation • 22 Aug 2022 • Zhendong Yang, Zhe Li, Yuan Gong, Tianke Zhang, Shanshan Lao, Chun Yuan, Yu Li

Furthermore, we smooth students' target output to treat it as the soft target for training without teachers and propose a teacher-free new KD loss (tf-NKD).

Knowledge Distillation

197

UAVM: Towards Unifying Audio and Visual Models

1 code implementation • 29 Jul 2022 • Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

Conventional audio-visual models have independent audio and video branches.

Ranked #2 on Multi-modal Classification on AudioSet (using extra training data)

Audio Classification audio-visual learning +1

Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

1 code implementation • 6 May 2022 • Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass

Automatic pronunciation assessment is an important technology to help self-directed language learners.

Ranked #2 on Phone-level pronunciation scoring on speechocean762 (using extra training data)

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

131

Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

1 code implementation • 6 May 2022 • Yuan Gong, Jin Yu, James Glass

Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring.

Ranked #1 on Audio Classification on VocalSound

Audio Classification

Attentions Help CNNs See Better: Attention-based Hybrid Image Quality Assessment Network

2 code implementations • 22 Apr 2022 • Shanshan Lao, Yuan Gong, Shuwei Shi, Sidi Yang, Tianhe Wu, Jiahao Wang, Weihao Xia, Yujiu Yang

Image quality assessment (IQA) algorithms aim to quantify the human perception of image quality.

Ranked #1 on Image Quality Assessment on MSU FR VQA Database

Generative Adversarial Network Image Quality Assessment +1

268

MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment

2 code implementations • 19 Apr 2022 • Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, Yujiu Yang

No-Reference Image Quality Assessment (NR-IQA) aims to assess the perceptual quality of images in accordance with human subjective perception.

Ranked #8 on Video Quality Assessment on MSU SR-QA Dataset

No-Reference Image Quality Assessment NR-IQA +1

268

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

2 code implementations • 13 Mar 2022 • Yuan Gong, Sameer Khurana, Andrew Rouditchenko, James Glass

Audio classification is an active research area with a wide range of applications.

Audio Classification Knowledge Distillation

1,040

Focal and Global Knowledge Distillation for Detectors

1 code implementation • CVPR 2022 • Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, Chun Yuan

Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation.

Ranked #1 on Knowledge Distillation on MS COCO

Image Classification Knowledge Distillation +2

337

SSAST: Self-Supervised Audio Spectrogram Transformer

3 code implementations • 19 Oct 2021 • Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST.

Ranked #1 on Spoken Command Recognition on Speech Command v2

Audio Classification Emotion Recognition +4

1,040

AST: Audio Spectrogram Transformer

4 code implementations • 5 Apr 2021 • Yuan Gong, Yu-An Chung, James Glass

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels.

Ranked #1 on Audio Classification on Speech Commands

Audio Classification Audio Tagging +4

1,040

PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation

1 code implementation • 2 Feb 2021 • Yuan Gong, Yu-An Chung, James Glass

Audio tagging is an active research area and has a wide range of applications.

Ranked #6 on Audio Classification on FSD50K (using extra training data)

Audio Classification Audio Tagging +2

129

Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method

2 code implementations • 18 Mar 2020 • Yuan Gong, Jian Yang, Christian Poellabauer

With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks.

Second-order Non-local Attention Networks for Person Re-identification

no code implementations • 31 Aug 2019 • Bryan (Ning) Xia, Yuan Gong, Yizhe Zhang, Christian Poellabauer

Recent efforts have shown promising results for person re-identification by designing part-based architectures to allow a neural network to learn discriminative representations from semantically coherent parts.

Person Re-Identification

Real-Time Adversarial Attacks

1 code implementation • 31 May 2019 • Yuan Gong, Boyang Li, Christian Poellabauer, Yiyu Shi

In recent years, many efforts have demonstrated that modern machine learning algorithms are vulnerable to adversarial attacks, where small, but carefully crafted, perturbations on the input can make them fail.

Adversarial Attack BIG-bench Machine Learning

ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems

2 code implementations • 6 Apr 2019 • Yuan Gong, Jian Yang, Jacob Huber, Mitchell MacKnight, Christian Poellabauer

This paper introduces a new database of voice recordings with the goal of supporting research on vulnerabilities and protection of voice-controlled systems (VCSs).

Voice Anti-spoofing

Towards Learning Fine-Grained Disentangled Representations from Speech

no code implementations • 8 Aug 2018 • Yuan Gong, Christian Poellabauer

Learning disentangled representations of high-dimensional data is currently an active research area.

Representation Learning

Topic Modeling Based Multi-modal Depression Detection

no code implementations • 28 Mar 2018 • Yuan Gong, Christian Poellabauer

Major depressive disorder is a common mental disorder that affects almost 7% of the adult U.S. population.

Depression Detection

An Overview of Vulnerabilities of Voice Controlled Systems

no code implementations • 24 Mar 2018 • Yuan Gong, Christian Poellabauer

These systems have been shown to be vulnerable to various types of voice spoofing attacks.

General Classification

How do deep convolutional neural networks learn from raw audio waveforms?

no code implementations • ICLR 2018 • Yuan Gong, Christian Poellabauer

Prior work on speech and audio processing has demonstrated the ability to obtain excellent performance when learning directly from raw audio waveforms using convolutional neural networks (CNNs).

Crafting Adversarial Examples For Speech Paralinguistics Applications

no code implementations • 9 Nov 2017 • Yuan Gong, Christian Poellabauer

Computational paralinguistic analysis is increasingly being used in a wide range of cyber applications, including security-sensitive applications such as speaker verification, deceptive speech detection, and medical diagnostics.

Medical Diagnosis Speaker Verification
