I get this error when I pass an image in the message; when I pass no image, it works just fine. I have tried everything and nothing seems to work. I'm on Google Colab using a T4 GPU.

#24
by fzzrxx - opened

[Screenshot of the error traceback]

Hey @fzzrxx ,

The issue you're running into is a data type mismatch between your image inputs and the model's weights. To optimise GPU memory and speed, the model is typically loaded in bfloat16, but the standard image processor defaults to producing float32 tensors. When PyTorch attempts the 2D convolution in the vision encoder, it strictly requires the input and the weights to share a dtype, which is why it throws the RuntimeError.
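You can reproduce the mismatch with plain PyTorch (a minimal sketch; the layer and tensor shapes below are arbitrary stand-ins, not your model's actual vision encoder):

```python
import torch

# Stand-in for the vision encoder's first conv layer, loaded in bfloat16
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(torch.bfloat16)

# Image processors emit float32 tensors by default
pixel_values = torch.randn(1, 3, 32, 32)

try:
    conv(pixel_values)
except RuntimeError as e:
    # e.g. "Input type (float) and bias type (c10::BFloat16) should be the same"
    err = e
```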

To fix this, you just need to explicitly set the data type when you initialise your Hugging Face pipeline by passing torch_dtype=torch.bfloat16. This will instruct the pipeline to automatically cast your processed image and text tensors to match the model's bfloat16 weights before running inference, resolving the crash.
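What that kwarg does internally amounts to casting the processed image tensor to the weights' dtype before the forward pass. A self-contained sketch of the same cast, again using a stand-in conv layer rather than the real vision encoder:

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(torch.bfloat16)  # weights in bfloat16
pixel_values = torch.randn(1, 3, 32, 32)                        # float32, as a processor would emit

# Casting the input to match the weights resolves the RuntimeError
out = conv(pixel_values.to(torch.bfloat16))
print(out.dtype)  # → torch.bfloat16
```

With the pipeline, you don't do this cast yourself; passing `torch_dtype=torch.bfloat16` at construction time has the same effect on the processed inputs.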

Thank you!
