Model Card for FontDiffuser

Model Details

Model Type

  • Architecture: Diffusion-based Font Generation Model
  • Framework: PyTorch + Hugging Face Diffusers
  • Scheduler: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
  • Guidance: Classifier-free guidance
  • Base Model: FontDiffuser with Content and Style Encoders

Model Components

  1. UNet: Main diffusion model for image generation
  2. Content Encoder: Extracts character structure information
  3. Style Encoder: Extracts font style features
  4. DDPM/DPM Scheduler: Noise scheduling for diffusion process

Training Configuration

  • Resolution: 96×96 pixels
  • Batch Size: 4-8 (configurable)
  • Inference Steps: 15 (default, configurable)
  • Guidance Scale: 7.5 (default, configurable)
  • Precision: FP32/FP16 (optional)
  • Device: CUDA/GPU recommended
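
The guidance scale listed above controls how strongly the conditional noise prediction is pushed away from the unconditional one. A minimal sketch of the standard classifier-free guidance combination, in plain Python, might look like the following. The function name `apply_cfg` and the list-based inputs are illustrative assumptions, not code from this repository; the actual pipeline presumably applies the same formula to PyTorch tensors.

```python
from typing import List, Sequence


def apply_cfg(eps_uncond: Sequence[float],
              eps_cond: Sequence[float],
              guidance_scale: float = 7.5) -> List[float]:
    """Combine unconditional and conditional noise predictions.

    Standard classifier-free guidance:
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # Element-wise extrapolation from the unconditional toward the conditional prediction.
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With `guidance_scale=1.0` the result equals the conditional prediction, and with `0.0` it equals the unconditional one; larger values extrapolate further along the same direction.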

Model Usage

Installation

pip install diffusers torch torchvision safetensors
pip install lpips scikit-image pytorch-fid  # Optional: for evaluation

Basic Generation

from sample_batch import (
    FontManager, 
    batch_generate_images,
    load_fontdiffuser_pipeline
)
from argparse import Namespace

# Initialize font manager
font_manager = FontManager("path/to/font.ttf")

# Load pipeline
args = Namespace(
    ckpt_dir="path/to/checkpoints",
    device="cuda",
    num_inference_steps=15,
    guidance_scale=7.5,
    batch_size=4,
    # ... other args
)
pipe = load_fontdiffuser_pipeline(args)

# Generate images
characters = ['A', 'B', 'C', '中', '国']
style_paths = ['style1.png', 'style2.png']

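# No evaluator is constructed in this snippet. Passing None is assumed
# to skip metric computation; supply your own evaluator object otherwise.
results = batch_generate_images(
    pipe, characters, style_paths,
    output_dir="output",
    args=args,
    evaluator=None,
    font_manager=font_manager
)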

Batch Generation with Checkpointing

python sample_batch.py \
  --characters "characters.txt" \
  --start_line 1 \
  --end_line 100 \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --batch_size 4 \
  --num_inference_steps 15 \
  --guidance_scale 7.5 \
  --save_interval 10 \
  --device cuda

Resume from Checkpoint

python sample_batch.py \
  --characters "characters.txt" \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --resume_from "my_dataset/train_original/results_checkpoint.json"
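
Conceptually, resuming means skipping every (character, style) pair that the checkpoint already records. The sketch below illustrates that idea only. It assumes the checkpoint JSON follows the results metadata layout documented in this card (a `generations` list whose entries carry `char_index` and `style_index`); the helper names `completed_pairs` and `pending_pairs` are hypothetical and are not functions from `sample_batch.py`.

```python
import json
from typing import List, Set, Tuple


def completed_pairs(checkpoint_text: str) -> Set[Tuple[int, int]]:
    """Return the (char_index, style_index) pairs recorded in a checkpoint."""
    data = json.loads(checkpoint_text)
    return {(g["char_index"], g["style_index"])
            for g in data.get("generations", [])}


def pending_pairs(num_chars: int, num_styles: int,
                  done: Set[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """List the pairs still to generate, grouped by style, then by character."""
    return [(c, s)
            for s in range(num_styles)
            for c in range(num_chars)
            if (c, s) not in done]
```

Grouping the remaining work by style mirrors the per-style batching described elsewhere in this card, although the real ordering used by the script is not confirmed here.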

Model Performance

Supported Tasks

  • ✅ Single-character font generation
  • ✅ Multi-character batch generation
  • ✅ Multi-font support
  • ✅ Multi-style transfer
  • ✅ Index-based tracking for large-scale generation
  • ✅ Checkpoint and resume support

Output Format

output_dir/
├── ContentImage/              # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/               # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
├── results.json               # Comprehensive generation metadata
├── results_checkpoint.json    # Intermediate checkpoint (if save_interval > 0)
└── results_interrupted.json   # Emergency checkpoint (if interrupted)
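
The naming scheme in the tree above can be expressed as two small path helpers. This is a sketch that follows the layout as shown; the function names are hypothetical, and `PurePosixPath` is used only so the example produces forward slashes on every platform.

```python
from pathlib import PurePosixPath


def content_image_path(output_dir: str, char_index: int) -> PurePosixPath:
    """Path of the content (character) image for a given character index."""
    return PurePosixPath(output_dir) / "ContentImage" / f"char{char_index}.png"


def target_image_path(output_dir: str, style_index: int,
                      char_index: int) -> PurePosixPath:
    """Path of the generated image for a given style and character index."""
    style = f"style{style_index}"
    return (PurePosixPath(output_dir) / "TargetImage" / style
            / f"{style}+char{char_index}.png")
```

Because both paths depend only on indices, a content image and its generated counterparts can be matched without reading `results.json`.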

Results Metadata Structure

{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
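
A few fields in this structure are redundant, which makes a quick consistency check possible after loading the file with `json.load`. The sketch below assumes, based on the example values above, that `total_possible_pairs` equals `total_chars * total_styles` and that `time_per_image` equals `total_time / num_images`; `check_results` is a hypothetical helper, not part of the repository.

```python
from typing import Any, Dict, List


def check_results(meta: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems: List[str] = []
    if meta["total_chars"] != len(meta["characters"]):
        problems.append("total_chars does not match len(characters)")
    if meta["total_styles"] != len(meta["styles"]):
        problems.append("total_styles does not match len(styles)")
    if meta["total_possible_pairs"] != meta["total_chars"] * meta["total_styles"]:
        problems.append("total_possible_pairs is not total_chars * total_styles")
    if len(meta["generations"]) > meta["total_possible_pairs"]:
        problems.append("more generations than possible pairs")
    # Timing entries: time_per_image should be total_time / num_images.
    for entry in meta.get("metrics", {}).get("inference_times", []):
        if entry["num_images"] > 0:
            expected = entry["total_time"] / entry["num_images"]
            if abs(expected - entry["time_per_image"]) > 1e-6:
                problems.append(f"time_per_image mismatch for {entry['style']}")
    return problems
```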

Evaluation Metrics

Supported Metrics

  • LPIPS: Learned perceptual image patch similarity (lower is better)
  • SSIM: Structural similarity index (higher is better)
  • FID: Fréchet Inception Distance (lower is better)
  • Inference Time: Per-image generation time
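
Per-image scores for metrics such as LPIPS and SSIM are reported in the results metadata as `mean`, `std`, `min`, and `max`. A minimal sketch of that aggregation using only the standard library is shown below. Whether the project reports a population or a sample standard deviation is not stated in this card; the population form (`statistics.pstdev`) is used here as an assumption, and `summarize` is a hypothetical helper name.

```python
import statistics
from typing import Dict, Sequence


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-image metric values into summary statistics."""
    if not scores:
        raise ValueError("no scores to summarize")
    return {
        "mean": statistics.fmean(scores),
        # Assumption: population standard deviation.
        "std": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }
```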

Generate with Evaluation

python sample_batch.py \
  --characters "characters.txt" \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --evaluate \
  --ground_truth_dir "ground_truth/" \
  --compute_fid

Dataset

Dataset Source

Dataset Structure

FontDiffusion Dataset/
├── train_original/
│   ├── ContentImage/          # Character structure images
│   ├── TargetImage/           # Style-specific font renderings
│   └── results.json
├── val_original/
└── test_original/
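
When iterating over a split's `TargetImage/` directory, the style and character indices can be recovered from each file name. The sketch below assumes the files follow the `style{i}+char{j}.png` pattern shown in the Output Format section; `parse_target_name` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches names such as "style0+char1.png".
_TARGET_NAME = re.compile(r"^style(\d+)\+char(\d+)\.png$")


def parse_target_name(filename: str) -> Optional[Tuple[int, int]]:
    """Return (style_index, char_index), or None if the name does not match."""
    match = _TARGET_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Returning `None` for unexpected names lets a data loader skip stray files rather than fail partway through a directory scan.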

Training & Fine-tuning

Fine-tuning from Checkpoint

python my_train.py \
  --ckpt_dir "checkpoints/" \
  --data_dir "my_dataset/train_original" \
  --output_dir "finetuned_ckpt/" \
  --num_epochs 5 \
  --learning_rate 1e-4 \
  --batch_size 4

Convert & Upload Fine-tuned Models

python finetune_and_upload.py \
  --ckpt_dir "finetuned_ckpt/" \
  --hf_token "hf_xxxxx" \
  --hf_repo_id "username/font-diffusion-finetuned" \
  --num_epochs 5

Technical Features

Optimizations

  • ✅ Batch Processing: Process multiple characters per style
  • ✅ Memory Efficiency: Attention slicing (optional)
  • ✅ FP16 Support: Reduced precision for faster inference
  • ✅ Torch Compile: Optional model compilation
  • ✅ Channels Last Format: Memory-optimized tensor layout
  • ✅ XFormers Support: Fast attention implementation

Robustness

  • ✅ Checkpoint & Resume: Resume from interruptions
  • ✅ Index-based Tracking: Handle large character sets (100K+)
  • ✅ Multi-font Support: Process characters across multiple fonts
  • ✅ Error Recovery: Graceful handling of missing fonts
  • ✅ Automatic Indexing: Consistent char_index and style_index

Monitoring

  • ✅ Weights & Biases Integration: Real-time tracking
  • ✅ Progress Bars: Detailed generation progress
  • ✅ Checkpoint Saving: Periodic intermediate saves
  • ✅ Quality Metrics: LPIPS, SSIM, FID computation

Known Limitations

  • Requires CUDA-capable GPU for practical generation speeds
  • Characters must exist in at least one loaded font
  • Style images should be normalized (96×96 or resizable)
  • Very large character sets (>100K) may require memory optimization
  • FID computation requires representative ground truth dataset
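
The requirement that each character exist in at least one loaded font can be checked before generation starts. The sketch below works on plain code-point mappings so that it runs without any font file. In practice, such a mapping could be obtained with fontTools via `TTFont(path).getBestCmap()`, which returns a dictionary from Unicode code points to glyph names; using fontTools here is an assumption, and the repository's `FontManager` may perform this check differently.

```python
from typing import Iterable, List, Mapping


def missing_characters(characters: Iterable[str],
                       cmaps: Mapping[str, Mapping[int, str]]) -> List[str]:
    """Return the characters that no loaded font can render.

    `cmaps` maps a font name to its code-point -> glyph-name table.
    Each entry of `characters` is expected to be a single character.
    """
    return [ch for ch in characters
            if not any(ord(ch) in cmap for cmap in cmaps.values())]
```

Reporting the missing characters up front avoids discovering gaps only after a long batch run.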

Citation

@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Zhenhua Yang and Dezhi Peng and Yuxin Kong and Yuyi Zhang and Cong Yao and Lianwen Jin},
  year={2023}
}

License

This model is licensed under the Apache License 2.0. See LICENSE file for details.

Contact & Support

For issues, questions, or contributions:

