Hybrid Capsule Networks & Vision Transformers for CIFAR-100 Image Classification

Description

  • January 2025

We built and benchmarked three image classification models on the CIFAR-100 dataset (50K train / 10K test): ResNet-34, ViT-Tiny, and a CapsNet implementation based on Sabour et al. To keep the comparison fair, all models were trained in PyTorch with CUDA acceleration using the same training setup and augmentation strategy. Building on these results, we proposed a hybrid architecture, CapsViT, that connects capsules and transformer tokens through cross-attention, with capsules acting as queries over the transformer tokens. CapsViT reached 62.6% top-1 accuracy, outperforming ResNet-34/CapsNet (58.3%) and ViT-Tiny (51.3%), while reducing capsule routing overhead.
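
To illustrate the capsule-to-token cross-attention idea, here is a minimal PyTorch sketch in which capsule vectors serve as queries and transformer tokens supply keys and values. It is not the CapsViT implementation: the class name CapsuleTokenCrossAttention, the capsule count (10), capsule dimension (16), token width (192), token count (64), head count (4), and the residual-plus-LayerNorm update are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class CapsuleTokenCrossAttention(nn.Module):
    """Hypothetical block: capsule vectors attend over transformer tokens."""

    def __init__(self, caps_dim=16, token_dim=192, num_heads=4):
        super().__init__()
        # Queries come from capsules; keys and values come from tokens.
        # kdim/vdim let the tokens have a different width than the capsules.
        self.attn = nn.MultiheadAttention(
            embed_dim=caps_dim,
            num_heads=num_heads,
            kdim=token_dim,
            vdim=token_dim,
            batch_first=True,
        )
        self.norm = nn.LayerNorm(caps_dim)

    def forward(self, capsules, tokens):
        # capsules: (batch, num_capsules, caps_dim)
        # tokens:   (batch, num_tokens, token_dim)
        attended, weights = self.attn(query=capsules, key=tokens, value=tokens)
        # Residual connection (an assumption) preserves the incoming capsule state.
        return self.norm(capsules + attended), weights


torch.manual_seed(0)
block = CapsuleTokenCrossAttention()
capsules = torch.randn(2, 10, 16)   # assumed: 10 capsules of dimension 16
tokens = torch.randn(2, 64, 192)    # assumed: 64 patch tokens of width 192
updated, weights = block(capsules, tokens)
print(updated.shape, weights.shape)
```

In this sketch each capsule receives one attention distribution over the tokens, computed in a single pass rather than through an iterative agreement loop. Whether CapsViT replaces routing this way or only shortens it is not stated in the description above.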