Spatial Token Mixer
4 papers with code • 0 benchmarks • 0 datasets
Spatial Token Mixer (STM) is a module for vision transformers that aims to make token mixing more efficient. It is a type of depthwise convolution that operates on the spatial dimension of the tokens, and it serves as a drop-in replacement for the token-mixing layers in vision transformers.
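To make the description concrete, the following is a minimal sketch of depthwise spatial mixing over a grid of tokens, written in plain Python for readability. It is not taken from any of the papers listed on this page. The function name `depthwise_token_mixer`, its parameters, the zero padding, and the example kernels are all assumptions chosen for illustration. As is conventional in deep learning, the "convolution" here is a cross-correlation (the kernel is not flipped).

```python
def depthwise_token_mixer(tokens, height, width, kernels):
    """Mix tokens spatially with a per-channel (depthwise) k x k convolution.

    Hypothetical helper for illustration only.
    tokens:  list of height*width tokens, each a list of C floats,
             laid out row-major on a height x width grid
    kernels: list of C kernels, one per channel, each a k x k nested list (k odd)
    Returns a new list with the same shape as `tokens`.
    """
    channels = len(tokens[0])
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * channels for _ in range(height * width)]
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                acc = 0.0
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        # Zero padding: neighbours outside the grid contribute nothing.
                        if 0 <= yy < height and 0 <= xx < width:
                            # Channel c only ever reads channel c of its neighbours.
                            acc += kernels[c][dy][dx] * tokens[yy * width + xx][c]
                out[y * width + x][c] = acc
    return out


# A 2x2 grid of tokens with 2 channels each.
tokens = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
mean3x3 = [[1.0 / 9.0] * 3 for _ in range(3)]                    # averages neighbours
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]   # leaves tokens as-is
mixed = depthwise_token_mixer(tokens, 2, 2, [mean3x3, identity])
```

In this sketch, channel 0 of every output token blends information from its spatial neighbours, while channel 1 passes through unchanged. Because the output has the same shape as the input, a layer of this kind can stand in for another token-mixing layer without changing the surrounding block.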
Most implemented papers
WaveMix: A Resource-efficient Neural Network for Image Analysis
The whole architecture is a stack of self-similar and resolution-preserving WaveMix blocks, which allows architectural flexibility for various tasks and levels of resource availability.
Demystify Transformers & Convolutions in Modern Image Deep Networks
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs, but performance differences persist among different STMs.
CARD: Semantic Segmentation with Efficient Class-Aware Regularized Decoder
Extensive experiments and ablation studies conducted on multiple benchmark datasets demonstrate that the proposed CAR can boost the accuracy of all baseline models by up to 2.23% mIOU with superior generalization ability.
UniNeXt: Exploring A Unified Architecture for Vision Recognition
Interestingly, the ranking of these spatial token mixers also changes under our UniNeXt, suggesting that an excellent spatial token mixer may be stifled by a suboptimal general architecture, which further shows the importance of studying the general architecture of vision backbones.