Special tokens: purpose

#59

by tdeboissiere - opened Jul 30, 2024

Discussion

tdeboissiere

Jul 30, 2024

•

edited Aug 1, 2024

Hello!

I was curious about the special tokens (e.g. < od >, < /od >, < ocr >, < /ocr >) in the Florence2Processor.

These tokens don't seem to be used anywhere, so what is their purpose?
Related: how was Florence-2 initially trained for, say, object detection? Were the inputs to the model the image + a text prompt such as "Locate the objects with category name in the image." + the category + the actual location of the objects in the image?

ZappY-AI

Aug 1, 2024

Those special tokens are for object detection. They can be used to separate class names in the input prompt.

tdeboissiere

Aug 1, 2024

•

edited Aug 1, 2024

Wouldn't it make more sense for special tokens like < od > and < /od > to mark the start and end of an object detection task, and similarly for < ocr > < /ocr > and so on?

So at training time, a data point used for object detection would look like this:

image tokens, followed by
< od > dog < loc 100 > < loc 200 > < loc 200 > < loc 300 > cat < loc 200 > < loc 400 > < loc 400 > < loc 600 > < /od >

while a data point for captioning would look like this

image tokens, followed by
< cap > my cool caption < /cap >

And if that's what was done at training time, why doesn't processing_florence2.py automatically prepend those special tokens at inference time?
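
To make the layout proposed in this post concrete, here is a minimal Python sketch that serializes labeled boxes into such a string. It only illustrates the hypothesis above and is not Florence-2's confirmed training format. The helper names `quantize` and `format_od_target`, the 1000-bin coordinate quantization, and the `<loc_N>` token spelling are all assumptions made for the example.

```python
def quantize(value, size, num_bins=1000):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value * num_bins / size)
    return min(max(bin_index, 0), num_bins - 1)


def format_od_target(objects, image_size, num_bins=1000):
    """Serialize (label, (x1, y1, x2, y2)) pairs as '<od>label<loc_..>...</od>'.

    Hypothetical format following the post above; not a confirmed training format.
    """
    width, height = image_size
    parts = ["<od>"]
    for label, (x1, y1, x2, y2) in objects:
        bins = [
            quantize(x1, width, num_bins),
            quantize(y1, height, num_bins),
            quantize(x2, width, num_bins),
            quantize(y2, height, num_bins),
        ]
        parts.append(label + "".join(f"<loc_{b}>" for b in bins))
    parts.append("</od>")
    return "".join(parts)


# Same boxes as in the post, on an assumed 1000x1000 image.
objects = [("dog", (100, 200, 200, 300)), ("cat", (200, 400, 400, 600))]
target = format_od_target(objects, image_size=(1000, 1000))
print(target)
```

Whether the released checkpoint actually saw the wrapping < od > ... < /od > pair during training is exactly the open question here; the sketch only shows what such a target would look like if it did.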

sagar2126

Aug 21, 2024

•

edited Aug 21, 2024

Same doubt. I don't think the task "tags" are mapped to special tokens; rather, the model knows it should perform object detection from the natural-language prompt that corresponds to the tag.
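
A rough Python sketch of the behavior described in this reply: a task tag is swapped for a plain-text prompt before tokenization, so the lowercase special tokens never enter the input. The `<OD>` prompt is the one quoted earlier in this thread. The tag spellings, the other two entries, and the helper name `expand_task_tag` are assumptions recalled from memory and should be checked against processing_florence2.py.

```python
# Assumed tag-to-prompt table. Only the <OD> prompt text is quoted in this thread;
# the other entries are unverified and included for illustration.
TASK_PROMPTS = {
    "<OD>": "Locate the objects with category name in the image.",
    "<CAPTION>": "What does the image describe?",
    "<OCR>": "What is the text in the image?",
}


def expand_task_tag(text):
    """Replace a known task tag with its natural-language prompt; pass other text through."""
    return TASK_PROMPTS.get(text.strip(), text)


print(expand_task_tag("<OD>"))
print(expand_task_tag("Describe the dog."))
```

If the processor works this way, the model only ever sees ordinary words for the task instruction, which would explain why tokens like < od > appear unused.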
