SketchFusion:

Learning Universal Sketch Features through Fusing Foundation Models

1SketchX, CVSSP, University of Surrey, United Kingdom 2iFlyTek-Surrey Joint Research Centre on Artificial Intelligence

Left: Apart from high-resolution image generation, text-to-image diffusion models (e.g., Stable Diffusion (SD)), with their innate object-understanding capability, have shown remarkable performance across a wide range of image-based vision tasks (e.g., segmentation, depth estimation, etc.). However, upon analysing the PCA representation of SD's intermediate UNet features, we observe that it struggles to achieve similar results when working with freehand abstract sketches. Unlike pixel-perfect photos, highly abstract freehand sketches are sparse and lack detailed textures and colours, making it harder for the SD model to extract meaningful features. Furthermore, investigating the SD denoising process in the frequency domain (via the Fourier transform), we observe a predominance of high-frequency (HF) components over their low-frequency (LF) counterparts, which are crucial for capturing comprehensive semantic context. To mitigate this inherent bias within SD, we reinforce the diffusion process with another pretrained model (i.e., CLIP) whose bias is complementary to SD's (i.e., it focuses on LF). Consequently, the proposed extractor can extract semantically meaningful and accurate features from both sketches and photos that encapsulate a broader frequency spectrum (i.e., HF and LF). Right: Testing the proposed method on different sketch-based discriminative and dense prediction tasks (requiring knowledge of both sketch and image), we find a marked improvement over a baseline SD+CLIP hybrid feature extractor.
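The frequency-domain analysis mentioned above can be approximated in a few lines of PyTorch. The sketch below is illustrative only: it assumes an intermediate UNet feature map given as a (C, H, W) tensor and uses a simple radial cut-off (the lf_radius value is an arbitrary choice, not one taken from the paper) to split spectral energy into low- and high-frequency shares.

import torch

def lf_hf_energy(feat: torch.Tensor, lf_radius: float = 0.25):
    # feat: (C, H, W) intermediate feature map, e.g. from a UNet decoder block.
    C, H, W = feat.shape
    # Centre the spectrum so that low frequencies sit in the middle.
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    power = spec.abs() ** 2
    # Radial boolean mask: True inside the low-frequency disc.
    yy = torch.linspace(-1, 1, H).view(H, 1).expand(H, W)
    xx = torch.linspace(-1, 1, W).view(1, W).expand(H, W)
    lf_mask = (yy ** 2 + xx ** 2).sqrt() <= lf_radius
    lf = power[:, lf_mask].sum()
    hf = power[:, ~lf_mask].sum()
    total = lf + hf
    return (lf / total).item(), (hf / total).item()

Comparing the returned LF/HF shares for photo versus sketch inputs, and across denoising timesteps, gives an informal check of the bias described above.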

Abstract

While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
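To make the "dynamic injection" idea concrete, here is a rough, non-authoritative sketch of one way CLIP image tokens could be fed into a frozen Stable Diffusion UNet during a denoising step, using a diffusers-style API. The function name, the concatenation-based injection point, and the assumption that the CLIP features are already projected to the UNet's cross-attention width (768 for SD v1.x) are illustrative assumptions, not the paper's exact mechanism.

import torch

def denoise_step_with_clip(unet, scheduler, latents, text_emb, clip_img_tokens, t):
    # latents: (B, 4, h, w) noisy latents; text_emb: (B, N_txt, 768) text conditioning.
    # clip_img_tokens: (B, N_img, 768) CLIP image features, assumed pre-projected
    # to the UNet's cross-attention width (hypothetical projection).
    cond = torch.cat([text_emb, clip_img_tokens], dim=1)  # extend the conditioning sequence
    noise_pred = unet(latents, t, encoder_hidden_states=cond).sample  # frozen SD UNet
    return scheduler.step(noise_pred, t, latents).prev_sample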

Architecture

Keeping the SD and CLIP models frozen, the proposed method learns only the aggregation network, 1D convolutional layers, and branch weights from sketch-photo pairs, using a different loss for each downstream task.
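Below is a minimal, hypothetical sketch of what such a trainable fusion head could look like, assuming multi-level SD UNet features and a spatial CLIP feature map as inputs. The module and parameter names (FeatureAggregator, level_logits, branch_logit) are ours; only the general recipe (softmax-weighted aggregation over semantic levels, 1x1 Conv1d projections, and a learnable SD/CLIP branch weight) follows the description above. Both backbones stay frozen; only this head would receive gradients from the task-specific losses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregator(nn.Module):
    # Hypothetical fusion head: frozen SD/CLIP backbones supply features;
    # only the level weights, 1D convolutions, and branch weight are trained.
    def __init__(self, sd_dims, clip_dim, out_dim):
        super().__init__()
        # Learnable softmax weights over the SD UNet feature levels.
        self.level_logits = nn.Parameter(torch.zeros(len(sd_dims)))
        # 1D convolutions project every branch to a common channel width.
        self.sd_proj = nn.ModuleList(nn.Conv1d(d, out_dim, kernel_size=1) for d in sd_dims)
        self.clip_proj = nn.Conv1d(clip_dim, out_dim, kernel_size=1)
        # Learnable scalar balancing the SD and CLIP branches.
        self.branch_logit = nn.Parameter(torch.zeros(1))

    def forward(self, sd_feats, clip_feat):
        # sd_feats: list of (B, C_i, H_i, W_i) tensors; clip_feat: (B, D, H, W).
        B, _, H, W = clip_feat.shape
        level_w = F.softmax(self.level_logits, dim=0)
        fused_sd = 0.0
        for w, proj, f in zip(level_w, self.sd_proj, sd_feats):
            f = F.interpolate(f, size=(H, W), mode="bilinear", align_corners=False)
            fused_sd = fused_sd + w * proj(f.flatten(2))  # (B, out_dim, H*W)
        clip_tok = self.clip_proj(clip_feat.flatten(2))   # (B, out_dim, H*W)
        alpha = torch.sigmoid(self.branch_logit)          # SD/CLIP branch weight
        fused = alpha * fused_sd + (1.0 - alpha) * clip_tok
        return fused.view(B, -1, H, W)                    # (B, out_dim, H, W)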

Results

Sketch-photo correspondence results on PSC6K. Green circles and squares depict source and ground-truth (GT) points, respectively, while red squares denote predicted points.

Qualitative results for sketch-based image segmentation. Given a query sketch, our method generates separate segmentation masks for all images of that category.

Quantitative results.
BibTeX

@Inproceedings{koley2025sketchfusion,
  title={{SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models}},
  author={Subhadeep Koley and Tapas Kumar Dutta and Aneeshan Sain and Pinaki Nath Chowdhury and Ayan Kumar Bhunia and Yi-Zhe Song},
  booktitle={CVPR},
  year={2025}
}

Copyright: CC BY-NC-SA 4.0 © Subhadeep Koley | Last updated: 14 March 2025 | Good artists copy, great artists steal.