---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- vision
- perceiver
- adaptive-computation
license: mit
datasets:
- timm/imagenet-12k-wds
---

# AdaPerceiver (Logit + Feature Distilled from ViT-H CLIP)

This repository hosts the **logit + feature distilled AdaPerceiver model**, introduced in
**“AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens”**.

📄 Paper: https://arxiv.org/abs/2511.18105
📦 Code: https://github.com/pjajal/AdaPerceiver
📚 Model Collection: https://huggingface.co/collections/pjajal/adaperceiver-v1

This model is distilled from a [ViT-H CLIP model](https://huggingface.co/timm/vit_huge_patch14_clip_224.laion2b_ft_in12k).

---

## Model Description

**AdaPerceiver** is a Perceiver-style transformer architecture designed for **runtime-adaptive computation**.
A single trained model can dynamically trade off **accuracy and compute** by adjusting:

- the **number of latent tokens** (`num_tokens`),
- the **effective depth** (`depth`), and
- the **embedding dimension**, i.e. width (`mat_dim`).

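To get a feel for how the three axes interact, the sketch below ranks a few configurations by a back-of-envelope cost proxy. The formula is an assumption for illustration: it counts multiply-accumulates for a standard transformer block (self-attention plus an FFN with expansion ratio 4) and ignores the Perceiver cross-attention between inputs and latents, so it is not a measurement from the paper. `relative_cost` and the grid values are hypothetical, and not every grid value is necessarily a supported setting for this checkpoint.

```python
from itertools import product


def relative_cost(num_tokens: int, mat_dim: int, depth: int, ffn_ratio: int = 4) -> int:
    """Rough multiply-accumulate count for `depth` standard transformer blocks.

    Assumption: each block runs self-attention over `num_tokens` latents of
    width `mat_dim`, followed by an FFN with hidden size `ffn_ratio * mat_dim`.
    """
    projections = 4 * num_tokens * mat_dim * mat_dim      # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * mat_dim     # QK^T and the attention-weighted sum
    ffn = 2 * ffn_ratio * num_tokens * mat_dim * mat_dim  # the two FFN linear layers
    return depth * (projections + attention + ffn)


# Hypothetical grid over (num_tokens, mat_dim, depth).
grid = list(product([128, 256], [64, 128], [6, 12]))
largest = max(relative_cost(*cfg) for cfg in grid)

# Print configurations from cheapest to most expensive under the proxy.
for tokens, dim, depth in sorted(grid, key=lambda cfg: relative_cost(*cfg)):
    share = relative_cost(tokens, dim, depth) / largest
    print(f"num_tokens={tokens:>3} mat_dim={dim:>3} depth={depth:>2} relative cost={share:.3f}")
```

Under this proxy, cost is linear in depth, roughly quadratic in width, and between linear and quadratic in the number of tokens, which is why width is often the most effective single knob for cutting compute.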

This specific checkpoint corresponds to the **logit + feature distilled AdaPerceiver model**, trained on **ImageNet-12K** using a ViT-H teacher. It exposes both classification **logits** and **feature representations**.

---

## Training Details

- **Training Data:** ImageNet-12K
- **Training Objective:** Logit distillation + feature distillation
- **Teacher Model:** [ViT-H/14 CLIP model](https://huggingface.co/timm/vit_huge_patch14_clip_224.laion2b_ft_in12k)
- **Architecture:** Adaptive Perceiver with block-masked attention and Matryoshka FFNs
- **Adaptivity Axes:** Tokens, Depth, Width

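A combined logit + feature objective can be pictured as a weighted sum of two terms. The sketch below uses one common formulation, temperature-softened KL divergence for the logits and mean squared error for the features, written on plain Python lists so it runs without dependencies. The paper's actual losses, temperature, and weights may differ; `alpha`, `beta`, `temperature`, and all function names here are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2  # conventional T^2 scaling used in knowledge distillation


def feature_distillation(student_feats, teacher_feats):
    """Mean squared error between student and teacher feature vectors."""
    squared = [(s - t) ** 2 for s, t in zip(student_feats, teacher_feats)]
    return sum(squared) / len(squared)


def distillation_loss(s_logits, t_logits, s_feats, t_feats, alpha=1.0, beta=1.0):
    """Weighted sum of the logit term and the feature term (weights are assumptions)."""
    return (alpha * logit_distillation(s_logits, t_logits)
            + beta * feature_distillation(s_feats, t_feats))


loss = distillation_loss([2.0, 0.5, -1.0], [2.5, 0.0, -1.5], [0.1, 0.4], [0.0, 0.5])
print(f"combined loss: {loss:.4f}")
```

Both terms vanish when the student matches the teacher exactly, so the loss measures only the remaining gap between the two models.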

For full training details, see Appendix D of the paper.

---

## How to Use

This model can be loaded using the AdaPerceiver Hub-compatible class.

```python
import torch

# Assumes the AdaPerceiver code (https://github.com/pjajal/AdaPerceiver) is importable.
from hub.networks.adaperceiver_distill import DistillAdaPerceiver

model = DistillAdaPerceiver.from_pretrained("pjajal/adaperceiver-v1")

# forward(
#   x: input image tensor (B, C, H, W)
#   num_tokens: number of latent tokens to process (optional)
#   mat_dim: embedding dimension (optional)
#   depth: early-exit depth (optional)
#   token_grans: block-mask granularities (optional)
# )
with torch.no_grad():  # inference only, so skip gradient tracking
    out = model(
        torch.randn(1, 3, 224, 224),
        num_tokens=256,
        mat_dim=128,
        depth=12,
    )

print(out.logits.shape, out.features.shape)
```
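Because one checkpoint serves every configuration, a small timing sweep is a natural way to choose an operating point. The helper below times any callable that accepts the keyword arguments shown in the forward call above. `sweep`, `configs`, and the stand-in `fake_model` are hypothetical names introduced for illustration, and the configuration values are placeholders rather than confirmed supported settings; swap in the loaded model and a real input tensor to use it.

```python
import time


def sweep(model, x, configs, repeats=3):
    """Return the best wall-clock time in seconds for each configuration."""
    timings = {}
    for cfg in configs:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            model(x, **cfg)
            best = min(best, time.perf_counter() - start)
        # Dicts are unhashable, so key the result by the sorted (name, value) pairs.
        timings[tuple(sorted(cfg.items()))] = best
    return timings


def fake_model(x, num_tokens=None, mat_dim=None, depth=None):
    """Stand-in for the real model: loops num_tokens * depth times."""
    total = 0
    for i in range(num_tokens * depth):
        total += i * mat_dim
    return total


# Placeholder configurations; replace with values the checkpoint supports.
configs = [
    {"num_tokens": 64, "mat_dim": 128, "depth": 6},
    {"num_tokens": 256, "mat_dim": 128, "depth": 12},
]

for key, seconds in sweep(fake_model, x=None, configs=configs).items():
    print(dict(key), f"{seconds * 1e3:.3f} ms")
```

When timing on a GPU, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously and the measurement would otherwise capture only the launch overhead.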

## Reference

If you use this model, please cite the AdaPerceiver paper:

```bibtex
@article{jajal2025adaperceiver,
  title={AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens},
  author={Jajal, Purvish and Eliopoulos, Nick John and Chou, Benjamin Shiue-Hal and Thiruvathukal, George K and Lu, Yung-Hsiang and Davis, James C},
  journal={arXiv preprint arXiv:2511.18105},
  year={2025}
}
```