Fine-Tuning EfficientNet-B0 for Painting Style Classification
Introduction
I fine-tuned EfficientNet-B0 to classify artworks into 9 painting styles using the Hugging Face dataset keremberke/painting-style-classification. The aim was to build a fully custom PyTorch training pipeline — covering dataset preparation, augmentation, transfer learning, and evaluation — to understand both what works and what limits accuracy for this task.
Model card: milliyin/painting-style-classification
Dataset Preparation
The full dataset was downloaded directly from Hugging Face in ZIP format for train, validation, and test splits. Folder structure:
dataset/
images/train
images/validation
images/test
jsonl/train.jsonl
jsonl/validation.jsonl
jsonl/test.jsonl
Images were extracted, renamed with zero-padded IDs, and assigned numeric labels based on their original folder names (e.g., baroque → 4, renaissance → 5, surrealism → 8). I generated .jsonl files containing metadata for each split and implemented a custom dataset loader (FolderDataset) that reads these JSONLs and can access splits like dataset['train']. A second wrapper (PaintingDataset) applied transforms and returned (image, label) pairs for PyTorch.
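The metadata step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the record fields (id, file_name, label), and the assumption that images sit in one subfolder per style are all hypothetical. Only the three label pairs quoted above are included, since the full nine-class mapping is not given here.

```python
import json
from pathlib import Path

# Only these three pairs appear in the text; the remaining six are not listed there.
PARTIAL_LABEL_MAP = {"baroque": 4, "renaissance": 5, "surrealism": 8}


def write_split_jsonl(split_dir, jsonl_path, label_map):
    """Write one JSON record per image found under split_dir/<style>/.

    Hypothetical layout: one subfolder per style, named as in label_map.
    Returns the number of records written.
    """
    split_dir = Path(split_dir)
    records = []
    for style_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        label = label_map[style_dir.name]  # a KeyError flags an unmapped folder
        for image_path in sorted(p for p in style_dir.iterdir() if p.is_file()):
            records.append({
                "file_name": image_path.relative_to(split_dir).as_posix(),
                "label": label,
            })
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for idx, record in enumerate(records):
            record["id"] = f"{idx:06d}"  # zero-padded ID
            f.write(json.dumps(record) + "\n")
    return len(records)


def read_jsonl(jsonl_path):
    """Load a .jsonl file into a list of dicts, skipping blank lines."""
    with open(jsonl_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A loader like FolderDataset could then call read_jsonl once per split and keep the results in a dict keyed by split name.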
Data Augmentation
For training:
- Resize to 224×224 (EfficientNet-B0 input size)
- Random horizontal flip (50% probability)
- Random rotation up to 15°
- Color jitter (brightness, contrast, saturation, hue)
- Random affine translation
- Normalization to ImageNet stats
For validation and test: only resizing and normalization.
Model Architecture
I started from torchvision.models.efficientnet_b0 with ImageNet-pretrained weights and replaced the final classifier layer with:
- Dropout (0.2)
- Fully-connected layer to 9 output classes
Transfer Learning Strategy
At the start, all layers up to roughly layer 100 were frozen to speed up convergence and avoid catastrophic forgetting. The backbone was then unfrozen gradually:
- Epoch 10: freeze_until_layer=50
- Epoch 20: unfreeze all layers for full fine-tuning
This stepwise approach let the classifier head adapt before the earlier convolutional blocks were updated.
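One way to express this schedule is sketched below. It rests on two assumptions that the text does not settle: that a "layer" means the index of a parameter tensor in model.parameters(), and that epoch numbers are 1-based. Both helper names are hypothetical.

```python
def freeze_until_for_epoch(epoch):
    """Freeze boundary for a given epoch (assumed 1-based), per the schedule above."""
    if epoch < 10:
        return 100
    if epoch < 20:
        return 50
    return 0  # everything trainable


def apply_freeze(model, freeze_until_layer):
    """Freeze parameter tensors with index < freeze_until_layer, unfreeze the rest.

    Assumption: "layer" is counted as a position in model.parameters().
    Returns the number of trainable parameter tensors.
    """
    trainable = 0
    for idx, param in enumerate(model.parameters()):
        param.requires_grad = idx >= freeze_until_layer
        trainable += int(param.requires_grad)
    return trainable
```

Calling apply_freeze(model, freeze_until_for_epoch(epoch)) at the top of each epoch keeps the boundary in sync with training. Note that setting requires_grad to False does not stop BatchNorm layers from updating their running statistics while the model is in train mode.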
Training Setup
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5)
- Batch size: 32
- Epochs: 50
- Device: CUDA
The training loop tracked train loss/accuracy and validation loss/accuracy each epoch. If validation accuracy improved, the model was saved as best_efficientnet_b0.pth.
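The loop described above might look roughly like this. It is a simplified sketch rather than the notebook's code: run_epoch and fit are hypothetical names, and the loaders, model, criterion, optimizer, and scheduler are assumed to be built as in the setup above.

```python
import torch


def run_epoch(model, loader, criterion, device, optimizer=None):
    """One pass over loader: trains when an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss = criterion(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen


def fit(model, train_loader, val_loader, criterion, optimizer, scheduler,
        device, epochs, ckpt_path="best_efficientnet_b0.pth"):
    """Train for a fixed number of epochs, checkpointing on best validation accuracy."""
    best_val_acc = 0.0
    for epoch in range(1, epochs + 1):
        train_loss, train_acc = run_epoch(model, train_loader, criterion, device, optimizer)
        val_loss, val_acc = run_epoch(model, val_loader, criterion, device)
        scheduler.step(val_acc)  # mode='max', so the scheduler watches accuracy
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), ckpt_path)
        print(f"epoch {epoch}: train {train_loss:.4f}/{train_acc:.4f} "
              f"val {val_loss:.4f}/{val_acc:.4f}")
    return best_val_acc
```

Because the scheduler was created with mode='max', it has to be stepped with a metric that should increase, which is why validation accuracy rather than validation loss is passed to scheduler.step.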
Evaluation & Results
Best validation accuracy: 60.15% after 50 epochs.
I also generated a classification report and plotted loss/accuracy curves to analyze overfitting patterns. Inference was tested on individual images, reporting the top-1 predicted style and its confidence score.
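Single-image inference of this kind can be sketched as follows, assuming the image has already been passed through the evaluation transform and that class_names is a list ordered by label index. The helper name is hypothetical, and the confidence is taken to be the softmax probability of the top class.

```python
import torch


@torch.no_grad()
def predict_top1(model, image_tensor, class_names):
    """Return (style_name, confidence) for one preprocessed image of shape (C, H, W)."""
    model.eval()
    logits = model(image_tensor.unsqueeze(0))  # add a batch dimension
    probs = torch.softmax(logits, dim=1)[0]
    confidence, index = probs.max(dim=0)
    return class_names[index.item()], confidence.item()
```

With label smoothing in the loss, these softmax scores tend to be less peaked than they would otherwise be, so they are best read as relative rankings rather than calibrated probabilities.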
Why Did It Plateau Around 60%?
- High Inter-Class Similarity — Certain styles (e.g., Romanticism vs. Realism) share strong visual overlap.
- Label Noise — Open datasets may have inconsistent labels.
- Data Imbalance — Some styles had fewer samples, causing uneven learning.
- Limited Early Unfreezing — Freezing many layers for too long limited domain adaptation from natural photos to paintings.
- Moderate Augmentation — Could be stronger to handle variations in scan quality, lighting, and framing.
- Model Size — EfficientNet-B0 is compact; larger backbones may better capture fine texture differences.
How to Improve
- Earlier & Gradual Unfreezing — Allow backbone adaptation sooner.
- Stronger Augmentations — Use RandAugment, CutMix, Mixup, or color-space perturbations.
- Class-Balanced Sampling — Reduce bias toward majority classes.
- Bigger Backbone — Try EfficientNet-B2/B3, ConvNeXt-Tiny, or ViT models.
- Curated Splits — Avoid artist overlap between train/validation to measure generalization accurately.
- TTA & Ensembling — Small accuracy gains from combining predictions.
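Of these ideas, class-balanced sampling is probably the cheapest to try. A minimal sketch of the weighting step, using inverse class frequency (one common choice, not something the original pipeline did), with a hypothetical helper name:

```python
from collections import Counter


def class_balanced_weights(labels):
    """Per-sample weights equal to 1 / (number of samples sharing that label).

    Every class then carries the same total weight, so a sampler that draws
    in proportion to these weights sees each class equally often on average.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]
```

The resulting list can be passed to torch.utils.data.WeightedRandomSampler with replacement=True, and that sampler handed to the training DataLoader in place of shuffle=True. The validation loader should keep the natural class distribution so that reported accuracy stays comparable to the 60.15% baseline.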
Code Link
Complete training pipeline, dataset processing, and fine-tuning notebook:
painting-style-classification-finetune/finetune.ipynb
Conclusion
This project provided a hands-on look at training image classification models for nuanced visual categories like art styles. With a solid baseline at ~60% validation accuracy, there's plenty of room to iterate — particularly on augmentation, layer unfreezing, and backbone scaling — to push well beyond this mark.