| Task | Model | Structure | #Params (M) | FLOPs (G) | Top-1 Acc (%) |
| --- | --- | --- | --- | --- | --- |
| Image classification | ResNet-101 [3] | CNN | 45 | 7.9 | 79.8 |
| | ViT-B [20] | Transformer | 86.6 | 17.6 | 77.9 |
| | PVT-S [25] | Architecture reference | 24.5 | 3.8 | 79.8 |
| | CSWin-S [29] | Architecture reference | 35 | 6.9 | 83.6 |
| | HRT-B [30] | Architecture reference | 50.3 | 13.7 | 82.8 |
| | DeiT-B [35] | Knowledge distillation | 86 | 17.5 | 81.8 |
| | CoAtNet-0 [37] | Serial concatenation | 25 | 4.2 | 81.6 |
| | ConTNet-B [40] | Serial concatenation | 39.6 | 6.4 | 81.8 |
| | MobileViT-S [41] | Serial concatenation | 5.6 | - | 78.4 |
| | MobileViTV2-2.0 [42] | Serial concatenation | 10.6 | 4 | 80.4 |
| | ConFormer-S [43] | Parallel concatenation | 37.7 | 10.6 | 83.4 |
| | Mobile-Former-508M [44] | Parallel concatenation | 14 | 0.508 | 79.3 |
| | ViTc-1GF [47] | Embedding block replacement | 17.8 | 4 | 79.1 |
| | CCT [48] | Embedding block replacement | 22.36 | 11.06 | 80.67 |
| | LocalViT-S [49] | Feed-forward layer replacement | 22.4 | 4.6 | 80.8 |
| | ConViT-S [50] | Self-attention layer replacement | 27 | 5.4 | 81.3 |
| | BoTNet-S1-50 [51] | Self-attention layer replacement | 20.8 | 8.54 | 84.7 |
| | LeViT-384 [52] | Architecture reference + local replacement | 39.1 | 2.353 | 82.6 |
| | CvT-21 [53] | Architecture reference + local replacement | 32 | 7.1 | 82.5 |
| | CeiT-S [54] | Embedding block replacement + feed-forward layer replacement | 24.2 | 4.5 | 82 |
| | Edgevits-S [56] | Architecture reference + local replacement | 11.1 | 1.9 | 81 |
| | CMT-S [57] | Architecture reference + local replacement | 25.1 | 4 | 83.5 |
| | ParC-Net-S [58] | Architecture reference + local replacement | 5 | 3.5 | 78.6 |
| | Next-ViT-S [60] | Local replacement + novel hybrid strategy | 31.7 | 5.8 | 82.5 |
| | EdgeNeXt-S [61] | Architecture reference + local replacement | 5.6 | 1.93 | 78.8 |
| | Swin-ACmix-S [62] | Novel hybrid module | 51 | 9 | 83.5 |