Model Card for FontDiffuser
Model Details
Model Type
- Architecture: Diffusion-based Font Generation Model
- Framework: PyTorch + Hugging Face Diffusers
- Scheduler: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
- Guidance: Classifier-free guidance
- Base Model: FontDiffuser with Content and Style Encoders
Model Components
- UNet: Main diffusion model for image generation
- Content Encoder: Extracts character structure information
- Style Encoder: Extracts font style features
- DDPM/DPM Scheduler: Noise scheduling for diffusion process
Training Configuration
- Resolution: 96×96 pixels
- Batch Size: 4-8 (configurable)
- Inference Steps: 15 (default, configurable)
- Guidance Scale: 7.5 (default, configurable)
- Precision: FP32, with optional FP16
- Device: CUDA/GPU recommended
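The guidance scale listed above controls classifier-free guidance. As a rough sketch of the usual formulation (not necessarily how this pipeline implements it), the guided noise estimate extrapolates from the unconditional prediction toward the conditional one. The function name `apply_cfg` and the plain-list representation are illustrative assumptions; a real pipeline would operate on UNet output tensors.

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale=7.5):
    """Combine unconditional and conditional noise predictions element-wise.

    guidance_scale = 1.0 returns the conditional prediction;
    larger values push further away from the unconditional one.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(eps_uncond, eps_cond)
    ]


# Toy 1-D "noise predictions" standing in for UNet outputs.
guided = apply_cfg([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5)
print(guided)
```

Higher scales generally follow the conditioning more strictly at some cost in diversity, which is why the value is exposed as a configurable parameter.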
Model Usage
Installation
pip install diffusers torch torchvision safetensors
pip install lpips scikit-image pytorch-fid # Optional: for evaluation
Basic Generation
from argparse import Namespace

from sample_batch import (
    FontManager,
    batch_generate_images,
    load_fontdiffuser_pipeline,
)

# Initialize font manager
font_manager = FontManager("path/to/font.ttf")

# Load pipeline
args = Namespace(
    ckpt_dir="path/to/checkpoints",
    device="cuda",
    num_inference_steps=15,
    guidance_scale=7.5,
    batch_size=4,
    # ... other args
)
pipe = load_fontdiffuser_pipeline(args)

# Generate images
characters = ['A', 'B', 'C', '中', '国']
style_paths = ['style1.png', 'style2.png']
results = batch_generate_images(
    pipe, characters, style_paths,
    output_dir="output",
    args=args,
    # Assumption: no evaluator is defined in this snippet, so None is passed;
    # supply an evaluator instance here to compute metrics.
    evaluator=None,
    font_manager=font_manager,
)
Batch Generation with Checkpointing
python sample_batch.py \
  --characters "characters.txt" \
  --start_line 1 \
  --end_line 100 \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --batch_size 4 \
  --num_inference_steps 15 \
  --guidance_scale 7.5 \
  --save_interval 10 \
  --device cuda
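The `--start_line` and `--end_line` flags select a slice of the characters file. The sketch below shows one way such a slice could be read. It assumes one character per line, 1-based inclusive line numbers, and that blank lines are skipped; `read_characters` is a hypothetical helper, not a function confirmed to exist in `sample_batch.py`.

```python
from pathlib import Path


def read_characters(path, start_line=1, end_line=None):
    """Return characters from lines start_line..end_line (1-based, inclusive).

    Assumes one character per line; blank lines are skipped.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    selected = lines[start_line - 1:end_line]
    return [line.strip() for line in selected if line.strip()]
```

Splitting a large character list into line ranges like this makes it possible to run several generation jobs over disjoint slices of the same file.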
Resume from Checkpoint
python sample_batch.py \
  --characters "characters.txt" \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --resume_from "my_dataset/train_original/results_checkpoint.json"
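Conceptually, resuming means skipping every (character, style) pair that the checkpoint already records. The following sketch illustrates that idea under the assumption that the checkpoint file contains a `generations` list whose entries carry `char_index` and `style_index`. The helper names are hypothetical and the actual resume logic in `sample_batch.py` may differ.

```python
import json


def completed_pairs(checkpoint):
    """Collect (char_index, style_index) pairs already recorded in a checkpoint dict."""
    return {
        (g["char_index"], g["style_index"])
        for g in checkpoint.get("generations", [])
    }


def remaining_pairs(num_chars, num_styles, done):
    """List the pairs that still need to be generated, style by style."""
    return [
        (c, s)
        for s in range(num_styles)
        for c in range(num_chars)
        if (c, s) not in done
    ]


# In-memory stand-in for a checkpoint; a real one would come from json.load(file).
checkpoint = json.loads(
    '{"generations": [{"char_index": 0, "style_index": 0},'
    ' {"char_index": 1, "style_index": 0}]}'
)
todo = remaining_pairs(num_chars=3, num_styles=2, done=completed_pairs(checkpoint))
print(todo)
```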
Model Performance
Supported Tasks
- ✅ Single-character font generation
- ✅ Multi-character batch generation
- ✅ Multi-font support
- ✅ Multi-style transfer
- ✅ Index-based tracking for large-scale generation
- ✅ Checkpoint and resume support
Output Format
output_dir/
├── ContentImage/               # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/                # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
├── results.json                # Comprehensive generation metadata
├── results_checkpoint.json     # Intermediate checkpoint (if save_interval > 0)
└── results_interrupted.json    # Emergency checkpoint (if interrupted)
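The naming scheme in this layout can be expressed as two small path helpers. This is a sketch inferred from the directory tree only; the function names are hypothetical, and `PurePosixPath` is used so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def content_image_path(output_dir, char_index):
    """Path of the content image for a character index."""
    return PurePosixPath(output_dir) / "ContentImage" / f"char{char_index}.png"


def target_image_path(output_dir, style_index, char_index):
    """Path of the generated image for a (style, character) pair."""
    style = f"style{style_index}"
    return PurePosixPath(output_dir) / "TargetImage" / style / f"{style}+char{char_index}.png"


print(target_image_path("output_dir", 0, 1))
```

Because both paths are derived from indices alone, a content image and its generated counterparts can be matched without consulting `results.json`.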
Results Metadata Structure
{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
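Several fields in this structure are derivable from others, which allows a quick sanity check after a run. The sketch below assumes that `total_possible_pairs` equals `total_chars * total_styles` and that `time_per_image` equals `total_time / num_images`, as the sample values suggest. `check_results` is a hypothetical helper.

```python
def check_results(results):
    """Return a list of human-readable problems found in a results dict."""
    problems = []
    if len(results["characters"]) != results["total_chars"]:
        problems.append("total_chars does not match len(characters)")
    if len(results["styles"]) != results["total_styles"]:
        problems.append("total_styles does not match len(styles)")
    expected_pairs = results["total_chars"] * results["total_styles"]
    if results["total_possible_pairs"] != expected_pairs:
        problems.append("total_possible_pairs != total_chars * total_styles")
    for entry in results.get("metrics", {}).get("inference_times", []):
        expected = entry["total_time"] / entry["num_images"]
        if abs(entry["time_per_image"] - expected) > 1e-6:
            problems.append(f"time_per_image mismatch for {entry['style']}")
    return problems


sample = {
    "characters": ["A", "B", "C"],
    "styles": ["style0", "style1"],
    "total_chars": 3,
    "total_styles": 2,
    "total_possible_pairs": 6,
    "metrics": {
        "inference_times": [
            {"style": "style0", "total_time": 2.45, "num_images": 100,
             "time_per_image": 0.0245}
        ]
    },
}
print(check_results(sample))
```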
Evaluation Metrics
Supported Metrics
- LPIPS: Learned perceptual image patch similarity (lower is better)
- SSIM: Structural similarity index (higher is better)
- FID: Fréchet Inception Distance (lower is better)
- Inference Time: Per-image generation time
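Per-image scores such as LPIPS or SSIM are reported as summary statistics (mean, std, min, max) in the `metrics` section of the results metadata. A minimal sketch of that aggregation follows. Using the population standard deviation is an assumption, since the source does not say which variant is used, and `summarize` is a hypothetical name.

```python
import statistics


def summarize(scores):
    """Aggregate per-image metric values into mean/std/min/max."""
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),  # population std; an assumption
        "min": min(scores),
        "max": max(scores),
    }


print(summarize([0.1, 0.2, 0.3]))
```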
Generate with Evaluation
python sample_batch.py \
  --characters "characters.txt" \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --evaluate \
  --ground_truth_dir "ground_truth/" \
  --compute_fid
Dataset
Dataset Source
- Name: font-diffusion-generated-data
- Link: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data
- Format: ContentImage + TargetImage per style
- Supports: Multi-font, multi-character, multi-style generation
Dataset Structure
FontDiffusion Dataset/
├── train_original/
│   ├── ContentImage/    # Character structure images
│   ├── TargetImage/     # Style-specific font renderings
│   └── results.json
├── val_original/
└── test_original/
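When loading a split for training, each target image has to be paired with its content image. Assuming the dataset follows the `style{j}+char{i}.png` and `char{i}.png` naming from the Output Format section, the pairing can be recovered from file names alone. The helpers below are hypothetical and only parse names; they do not touch the file system.

```python
import re

TARGET_NAME = re.compile(r"^style(\d+)\+char(\d+)\.png$")


def parse_target_name(filename):
    """Return (style_index, char_index) for names like 'style0+char12.png', else None."""
    match = TARGET_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def content_name_for(filename):
    """Return the matching content image name for a target image name, else None."""
    parsed = parse_target_name(filename)
    return None if parsed is None else f"char{parsed[1]}.png"


print(parse_target_name("style0+char12.png"), content_name_for("style0+char12.png"))
```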
Training & Fine-tuning
Fine-tuning from Checkpoint
python my_train.py \
  --ckpt_dir "checkpoints/" \
  --data_dir "my_dataset/train_original" \
  --output_dir "finetuned_ckpt/" \
  --num_epochs 5 \
  --learning_rate 1e-4 \
  --batch_size 4
Convert & Upload Fine-tuned Models
python finetune_and_upload.py \
  --ckpt_dir "finetuned_ckpt/" \
  --hf_token "hf_xxxxx" \
  --hf_repo_id "username/font-diffusion-finetuned" \
  --num_epochs 5
Technical Features
Optimizations
- ✅ Batch Processing: Process multiple characters per style
- ✅ Memory Efficiency: Attention slicing (optional)
- ✅ FP16 Support: Reduced precision for faster inference
- ✅ Torch Compile: Optional model compilation
- ✅ Channels Last Format: Memory-optimized tensor layout
- ✅ XFormers Support: Fast attention implementation
Robustness
- ✅ Checkpoint & Resume: Resume from interruptions
- ✅ Index-based Tracking: Handle large character sets (100K+)
- ✅ Multi-font Support: Process characters across multiple fonts
- ✅ Error Recovery: Graceful handling of missing fonts
- ✅ Automatic Indexing: Consistent char_index and style_index
Monitoring
- ✅ Weights & Biases Integration: Real-time tracking
- ✅ Progress Bars: Detailed generation progress
- ✅ Checkpoint Saving: Periodic intermediate saves
- ✅ Quality Metrics: LPIPS, SSIM, FID computation
Known Limitations
- Requires CUDA-capable GPU for practical generation speeds
- Characters must exist in at least one loaded font
- Style images should be normalized (96×96 or resizable)
- Very large character sets (>100K) may require memory optimization
- FID computation requires representative ground truth dataset
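The requirement that each character exist in at least one loaded font can be checked before generation starts. The sketch below assigns every character to the first font that covers it and reports the rest as missing. It assumes coverage is available as a set of Unicode code points per font (such a set could, for example, be built from the keys of fontTools' `TTFont.getBestCmap()`). `assign_fonts` is a hypothetical helper, not part of `FontManager`.

```python
def assign_fonts(characters, font_coverage):
    """Map each character to the first font covering it; collect the rest as missing.

    font_coverage: dict of font name -> set of supported Unicode code points.
    Each character is assumed to be a single code point.
    """
    assigned, missing = {}, []
    for ch in characters:
        font = next(
            (name for name, cps in font_coverage.items() if ord(ch) in cps),
            None,
        )
        if font is None:
            missing.append(ch)
        else:
            assigned[ch] = font
    return assigned, missing


coverage = {"Latin": {ord("A"), ord("B")}, "CJK": {ord("中")}}
print(assign_fonts(["A", "中", "国"], coverage))
```

Reporting missing characters up front avoids discovering gaps only after a long batch run has finished.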
Citation
@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Zhenhua Yang and Dezhi Peng and Yuxin Kong and Yuyi Zhang and Cong Yao and Lianwen Jin},
  year={2023}
}
License
This model is licensed under the Apache License 2.0. See the LICENSE file for details.
Contact & Support
For issues, questions, or contributions:
- GitHub: [FontDiffusion Repository]
- Hugging Face: [Model Card]
- Dataset: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data