AI & ML interests
Exploring smol models (for text, vision and video) and high quality web and synthetic datasets
Recent Activity
View all activity
Papers
View all Paperscmpatinoย
updated a
model 5 days ago
cmpatinoย
published a
model 5 days ago
clefourrierย
authored a
paper 16 days ago
qgallouedecย
posted an update about 1 month ago
Post
2920
@CohereLabs just released ๐ฟ Tiny Aya: a fully open-source 3B parameter model that speaks 70+ languages ๐! But thereโs a catch:
Tiny Aya is just a language model. It doesnโt support tool calling, the key capability that turns frontier models into powerful *agents*.
So the real question is:
How hard is it to turn Tiny Aya into an agent?
Turns outโฆ itโs simple, thanks to Hugging Face TRL.
Weโre sharing a hands-on example showing how to train Tiny Aya to turn it into a tool-calling agent using TRL, unlocking what could become the first *massively multilingual open agent*.
Small model. Global reach. Agent capabilities.
๐ https://github.com/huggingface/trl/blob/main/examples/notebooks/sft_tool_calling.ipynb
Tiny Aya is just a language model. It doesnโt support tool calling, the key capability that turns frontier models into powerful *agents*.
So the real question is:
How hard is it to turn Tiny Aya into an agent?
Turns outโฆ itโs simple, thanks to Hugging Face TRL.
Weโre sharing a hands-on example showing how to train Tiny Aya to turn it into a tool-calling agent using TRL, unlocking what could become the first *massively multilingual open agent*.
Small model. Global reach. Agent capabilities.
๐ https://github.com/huggingface/trl/blob/main/examples/notebooks/sft_tool_calling.ipynb
lewtunย
submitted a
paper to Daily Papers about 1 month ago
lewtunย
submitted a
paper to Daily Papers about 2 months ago
cfahlgren1ย
submitted a
paper to Daily Papers about 2 months ago
Post
3958
๐ What happened in AI in 2025? ๐
We prepared the 2025 version of the HF AI Timeline Grid, highlighting open vs API-based model releases, and allowing you to browse and filter by access, modality, and release type!
Play with it here:
2025-ai-timeline/2025-ai-timeline
Here's my personal quarterly TL;DR:
1๏ธโฃ Q1 โ Learning to Reason
Deepseek not only releases a top-notch reasoning model, but shows how to train them and compete with closed frontier models. OpenAI debuts Deep Research.
Significant milestones: DeepSeek R1 & R1-Zero, Qwen 2.5 VL, OpenAI Deep Research, Gemini 2.5 Pro (experimental)
2๏ธโฃ Q2 โ Multimodality and Coding
More LLMs embrace multimodality by default, and there's a surge in coding agents. Strong vision, audio, and generative models emerge.
Significant milestones: Llama 4, Qwen 3, Imagen 4, OpenAI Codex, Google Jules, Claude 4
3๏ธโฃ Q3 โ "Gold" rush, OpenAI opens up, the community goes bananas
Flagship models get gold in Math olympiads and hard benchmarks. OpenAI releases strong open source models and Google releases the much anticipated nano-banana for image generation and editing. Agentic workflows become commonplace.
Significant milestones: Gemini and OpenAI IMO Gold, gpt-oss, Gemini 2.5 Flash Image, Grok 4, Claude Sonnet 4.5
4๏ธโฃ Q4 โ Mistral returns, leaderboard hill-climbing
Mistral is back with updated model families. All labs release impressive models to wrap up the year!
Significant milestones: Claude Opus 4.5, DeepSeek Math V2, FLUX 2, GPT 5.1, Kimi K2 Thinking, Nano Banana Pro, GLM 4.7, Gemini 3, Mistral 3, MiniMax M2.1 ๐คฏ
Credits
๐ NHLOCAL for the source data https://github.com/NHLOCAL/AiTimeline
๐ซก @reach-vb for the original idea, design and recipe
๐ @ariG23498 and yours truly for compiling and verifying the 2025 edition
๐ฅณ Here's to 2026, wishing it becomes the best year ever for open releases and on-device-first use-cases! ๐ฅ
We prepared the 2025 version of the HF AI Timeline Grid, highlighting open vs API-based model releases, and allowing you to browse and filter by access, modality, and release type!
Play with it here:
2025-ai-timeline/2025-ai-timeline
Here's my personal quarterly TL;DR:
1๏ธโฃ Q1 โ Learning to Reason
Deepseek not only releases a top-notch reasoning model, but shows how to train them and compete with closed frontier models. OpenAI debuts Deep Research.
Significant milestones: DeepSeek R1 & R1-Zero, Qwen 2.5 VL, OpenAI Deep Research, Gemini 2.5 Pro (experimental)
2๏ธโฃ Q2 โ Multimodality and Coding
More LLMs embrace multimodality by default, and there's a surge in coding agents. Strong vision, audio, and generative models emerge.
Significant milestones: Llama 4, Qwen 3, Imagen 4, OpenAI Codex, Google Jules, Claude 4
3๏ธโฃ Q3 โ "Gold" rush, OpenAI opens up, the community goes bananas
Flagship models get gold in Math olympiads and hard benchmarks. OpenAI releases strong open source models and Google releases the much anticipated nano-banana for image generation and editing. Agentic workflows become commonplace.
Significant milestones: Gemini and OpenAI IMO Gold, gpt-oss, Gemini 2.5 Flash Image, Grok 4, Claude Sonnet 4.5
4๏ธโฃ Q4 โ Mistral returns, leaderboard hill-climbing
Mistral is back with updated model families. All labs release impressive models to wrap up the year!
Significant milestones: Claude Opus 4.5, DeepSeek Math V2, FLUX 2, GPT 5.1, Kimi K2 Thinking, Nano Banana Pro, GLM 4.7, Gemini 3, Mistral 3, MiniMax M2.1 ๐คฏ
Credits
๐ NHLOCAL for the source data https://github.com/NHLOCAL/AiTimeline
๐ซก @reach-vb for the original idea, design and recipe
๐ @ariG23498 and yours truly for compiling and verifying the 2025 edition
๐ฅณ Here's to 2026, wishing it becomes the best year ever for open releases and on-device-first use-cases! ๐ฅ
craffelย
authored a
paper 3 months ago
qgallouedecย
submitted a
paper to Daily Papers 3 months ago
alozowskiย
authored a
paper 4 months ago
Post
1818
๐๐๐Huge biotech data drop today๐๐๐
The largest drug-target dataset ever created was just released on Hugging Faceโand it's still growing...
EvE Bio is further updating the dataset every 8 weeks. Drug development dream.
Read the blog: https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction
Play with the data: eve-bio/drug-target-activity
The largest drug-target dataset ever created was just released on Hugging Faceโand it's still growing...
EvE Bio is further updating the dataset every 8 weeks. Drug development dream.
Read the blog: https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction
Play with the data: eve-bio/drug-target-activity
nouamanetaziย
posted an update 5 months ago
Post
4620
After training ๐๐ฆ๐จ๐ฅ๐๐๐ on ๐๐๐ ๐๐๐๐๐ฌ for nearly a month, I've come to realize something most people overlook: ๐ข๐ง๐๐ซ๐๐ฌ๐ญ๐ซ๐ฎ๐๐ญ๐ฎ๐ซ๐ ๐ข๐ฌ ๐ญ๐ก๐ ๐ฆ๐๐ค๐-๐จ๐ซ-๐๐ซ๐๐๐ค ๐๐๐๐ญ๐จ๐ซ ๐ข๐ง ๐๐๐ ๐ญ๐ซ๐๐ข๐ง๐ข๐ง๐ . ๐ฅ
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious ๐๐๐๐ ๐๐ซ๐ซ๐จ๐ซ๐ฌ, or when your expensive GPU cluster is running at ๐๐% ๐๐๐๐ข๐๐ข๐๐ง๐๐ฒ, the problem isn't your model. It's most probably a ๐ฆ๐ข๐ฌ๐ฎ๐ฌ๐ ๐จ๐ ๐ญ๐ก๐ ๐ก๐๐ซ๐๐ฐ๐๐ซ๐. ๐ ๏ธ
Questions that seemed simple but had no clear answers: Why is ๐๐จ๐ ๐ญ๐ซ๐๐ข๐ง๐ข๐ง๐ ๐ฌ๐ฅ๐จ๐ฐ๐๐ซ ๐ญ๐ก๐๐ง ๐๐๐ง๐ฌ๐ ๐ฆ๐จ๐๐๐ฅ๐ฌ? Which ๐๐๐๐ ๐๐ฅ๐๐ ๐ฌ should we actually set? How often should we checkpoint without killing throughput?
That's why we built ๐๐ก๐ ๐๐ฆ๐จ๐ฅ ๐๐ซ๐๐ข๐ง๐ข๐ง๐ ๐๐ฅ๐๐ฒ๐๐จ๐จ๐ค ๐: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the ๐ข๐ง๐๐ซ๐๐ฌ๐ญ๐ซ๐ฎ๐๐ญ๐ฎ๐ซ๐ ๐ฅ๐๐ฒ๐๐ซ that most teams get wrong.
We validated real vs theoretical bandwidth across the entire stack: ๐๐๐๐ ๐ก๐ข๐ญ๐ญ๐ข๐ง๐ ๐ ๐๐/๐ฌ, ๐๐๐๐ข๐ง๐ค ๐.๐ ๐ซ๐๐๐๐ก๐ข๐ง๐ ๐๐๐ ๐๐/๐ฌ, ๐๐๐๐ ๐๐๐ง๐ ๐๐ญ ๐๐.๐ ๐๐/๐ฌ. Then we ran collective operations across ๐๐๐ ๐๐๐๐ฌ (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from ๐๐๐ ๐๐/๐ฌ on a single node to ๐๐๐-๐๐๐ ๐๐/๐ฌ across 16 nodes.
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
๐๐ก๐ ๐๐ฆ๐จ๐ฅ ๐๐ซ๐๐ข๐ง๐ข๐ง๐ ๐๐ฅ๐๐ฒ๐๐จ๐จ๐ค: https://lnkd.in/e5MKXUHS
Shared with โค๏ธ by the HuggingFace team
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious ๐๐๐๐ ๐๐ซ๐ซ๐จ๐ซ๐ฌ, or when your expensive GPU cluster is running at ๐๐% ๐๐๐๐ข๐๐ข๐๐ง๐๐ฒ, the problem isn't your model. It's most probably a ๐ฆ๐ข๐ฌ๐ฎ๐ฌ๐ ๐จ๐ ๐ญ๐ก๐ ๐ก๐๐ซ๐๐ฐ๐๐ซ๐. ๐ ๏ธ
Questions that seemed simple but had no clear answers: Why is ๐๐จ๐ ๐ญ๐ซ๐๐ข๐ง๐ข๐ง๐ ๐ฌ๐ฅ๐จ๐ฐ๐๐ซ ๐ญ๐ก๐๐ง ๐๐๐ง๐ฌ๐ ๐ฆ๐จ๐๐๐ฅ๐ฌ? Which ๐๐๐๐ ๐๐ฅ๐๐ ๐ฌ should we actually set? How often should we checkpoint without killing throughput?
That's why we built ๐๐ก๐ ๐๐ฆ๐จ๐ฅ ๐๐ซ๐๐ข๐ง๐ข๐ง๐ ๐๐ฅ๐๐ฒ๐๐จ๐จ๐ค ๐: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the ๐ข๐ง๐๐ซ๐๐ฌ๐ญ๐ซ๐ฎ๐๐ญ๐ฎ๐ซ๐ ๐ฅ๐๐ฒ๐๐ซ that most teams get wrong.
We validated real vs theoretical bandwidth across the entire stack: ๐๐๐๐ ๐ก๐ข๐ญ๐ญ๐ข๐ง๐ ๐ ๐๐/๐ฌ, ๐๐๐๐ข๐ง๐ค ๐.๐ ๐ซ๐๐๐๐ก๐ข๐ง๐ ๐๐๐ ๐๐/๐ฌ, ๐๐๐๐ ๐๐๐ง๐ ๐๐ญ ๐๐.๐ ๐๐/๐ฌ. Then we ran collective operations across ๐๐๐ ๐๐๐๐ฌ (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from ๐๐๐ ๐๐/๐ฌ on a single node to ๐๐๐-๐๐๐ ๐๐/๐ฌ across 16 nodes.
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
๐๐ก๐ ๐๐ฆ๐จ๐ฅ ๐๐ซ๐๐ข๐ง๐ข๐ง๐ ๐๐ฅ๐๐ฒ๐๐จ๐จ๐ค: https://lnkd.in/e5MKXUHS
Shared with โค๏ธ by the HuggingFace team
anditoย
authored a
paper 5 months ago
Post
2470
Finally, our new paper is out! "๐๐ถ๐ป๐ฒ๐ฉ๐ถ๐๐ถ๐ผ๐ป: ๐ข๐ฝ๐ฒ๐ป ๐๐ฎ๐๐ฎ ๐๐ ๐๐น๐น ๐ฌ๐ผ๐ ๐ก๐ฒ๐ฒ๐ฑ"! ๐ฅณ
FineVision: Open Data Is All You Need (2510.17269)
If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box, making replicating SOTA work impossible.
We wanted to change that.
FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.
In the paper, we share how we built it:
๐ finding and cleaning data at scale
๐งน removing excessive duplicates across sources
๐ค decontaminating against 66 public benchmarks
My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets.
NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!
๐ To celebrate the paper, Iโm also releasing a concatenated and shuffled version of the full dataset! ๐
Itโs ready to stream, so you can start training your own models right away:
from datasets import load_dataset
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))
A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
FineVision: Open Data Is All You Need (2510.17269)
If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box, making replicating SOTA work impossible.
We wanted to change that.
FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.
In the paper, we share how we built it:
๐ finding and cleaning data at scale
๐งน removing excessive duplicates across sources
๐ค decontaminating against 66 public benchmarks
My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets.
NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!
๐ To celebrate the paper, Iโm also releasing a concatenated and shuffled version of the full dataset! ๐
HuggingFaceM4/FineVision_full_shuffled Itโs ready to stream, so you can start training your own models right away:
from datasets import load_dataset
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))
A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
Post
10214
deepseek-ai/DeepSeek-OCR is out! ๐ฅ my take โคต๏ธ
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient per vision tokens/performance ratio
> covers 100 languages
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient per vision tokens/performance ratio
> covers 100 languages
thomwolfย
authored a
paper 6 months ago
lvwerraย
authored a
paper 6 months ago
sashaย
authored a
paper 6 months ago