S²COPE: Self-Supervised Concept Discovery via Preference Learning

Can unlabeled data teach foundation models interpretable concepts?

Shilong Xiang, Zirui Zhang, Chengzhi Mao
Rutgers University
Figure 1 Self-Supervised Visual Concept Discovery. (Left) Standard contrastive learning yields discriminative but opaque, high-dimensional feature vectors that act as uninterpretable "black boxes." (Right) In contrast, S²COPE discovers explicitly interpretable concepts (e.g., "bright yellow body") directly from unannotated images. By utilizing Vision-Large-Language Models as a broad semantic prior, our self-supervised preference optimization loop grounds raw visual features into discrete, human-readable attributes, yielding transparent representations that improve classification accuracy on unseen data.

Abstract

Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation.

We introduce Self-Supervised Concept discOvery via Preference lEarning (S²COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S²COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label.

Extensive experiments across natural, medical, and physics domains demonstrate that S²COPE extracts domain-specific concepts that standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective, rather than relying on static generation and disjoint filtering, we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data.

iNaturalist, CUB, HAM10000, MedMNIST, Galaxy 10, Gravity Spy
Max Top-1 Gain: +24pt | Avg Top-1 Gain: +16 | Datasets: 8

Method

Figure 2 Overview of the S²COPE Discovery Loop. Our framework operates as an end-to-end, self-supervised discovery process. In iteration k, the VLLM policy πk uses high-temperature sampling to hypothesize diverse candidate concepts C(x) for an unlabeled image x. To evaluate these proposals without human labels, we compute a self-supervised, cross-modal contrastive reward R(c, x) based on visual invariance. A candidate concept receives a high reward only if it is stable across augmented views (the positive set) while maintaining specificity against unrelated batch images. This automatically filters out generic, noisy descriptions (Answer A) in favor of discriminative, structured attributes (Answer B). An Easy-Negative pairing strategy (selecting pairs with the largest reward gap) converts these rewards into preference pairs (cw, cl) to form dataset 𝒟k. Finally, Direct Preference Optimization (DPO) internalizes this invariance by updating the VLLM concept generator's weights, yielding a refined policy πk+1 that iteratively transforms the VLLM into a self-supervised concept miner.
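
The steps in Figure 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the reward is assumed to be the mean cosine similarity between a concept and the augmented views of x, minus its mean cosine similarity to unrelated batch images; the embeddings are toy 2-D vectors standing in for an image-text encoder; and the function names (`contrastive_reward`, `easy_negative_pair`, `dpo_loss`) are hypothetical. The last function is the standard per-pair DPO loss, with the log-probabilities supplied by hand rather than by a VLLM.

```python
import numpy as np

def _unit(v):
    """L2-normalize along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def contrastive_reward(concept_emb, pos_view_embs, neg_img_embs):
    """Assumed form of R(c, x): stability across augmented views of x
    minus similarity to unrelated batch images."""
    c = _unit(concept_emb)
    pos = _unit(pos_view_embs) @ c   # similarity to each augmented view
    neg = _unit(neg_img_embs) @ c    # similarity to each unrelated image
    return float(pos.mean() - neg.mean())

def easy_negative_pair(concepts, rewards):
    """Easy-Negative pairing: return (c_w, c_l) with the largest reward gap,
    i.e. the highest-reward and lowest-reward candidates."""
    order = np.argsort(rewards)
    return concepts[order[-1]], concepts[order[0]]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.logaddexp(0.0, -margin))

# Toy data (hypothetical): two augmented views of x, two unrelated images.
views = np.array([[1.0, 0.0], [0.9, 0.1]])
others = np.array([[0.0, 1.0], [-1.0, 0.0]])
candidates = {"bright yellow body": [1.0, 0.0], "a photo": [0.0, 1.0]}

names = list(candidates)
rewards = [contrastive_reward(candidates[n], views, others) for n in names]
c_w, c_l = easy_negative_pair(names, rewards)
print(c_w, "|", c_l)  # winner: bright yellow body, loser: a photo
```

In a full loop, the pairs (c_w, c_l) collected over the unlabeled set would form the dataset for iteration k, and the DPO loss would be minimized with respect to the policy's weights, with the previous policy as the reference.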

Results


Visualizing Self-Supervised Concept Discovery. For each sample, we contrast the top concepts generated by the VLLM baseline (top list) with those from our S²COPE-optimized model (bottom list). Red text marks concepts that are incorrect for recognizing the image's category. The S²COPE-optimized model suppresses these nuisance concepts and extracts precise, physically grounded attributes.

Ablation Studies on Reward Formulation. (a) Reward Components: Impact of isolating the positive and negative signals of the contrastive reward. Eliminating the positive signal causes a performance collapse, while removing the negative signal yields a suboptimal accuracy plateau. (b) Reward Modality: Comparison of cross-modal image-text grounding versus unimodal text-text consensus. Cross-modal grounding against physical image features achieves better performance than relying on unimodal textual consensus.
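
To make the reward-component ablation in (a) concrete, the sketch below switches the positive and negative terms of a contrastive reward on and off. It assumes a cosine-similarity reward (positive term minus negative term) and uses toy 2-D embeddings; the function `reward_variant` and its flags `use_pos` and `use_neg` are hypothetical, and the example does not reproduce the reported results. It only shows why a generic concept that matches every image is harder to separate from a specific one when the negative term is dropped.

```python
import numpy as np

def _unit(v):
    """L2-normalize along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def reward_variant(concept, pos_views, neg_imgs, use_pos=True, use_neg=True):
    """Assumed contrastive reward with switchable components."""
    c = _unit(concept)
    pos = float((_unit(pos_views) @ c).mean()) if use_pos else 0.0
    neg = float((_unit(neg_imgs) @ c).mean()) if use_neg else 0.0
    return pos - neg

# Toy data (hypothetical): views of x, and unrelated batch images.
views = [[1.0, 0.0], [0.9, 0.1]]
others = [[0.6, 0.8], [0.8, 0.6]]
specific = [1.0, 0.0]   # matches x far better than the other images
generic = [0.8, 0.6]    # matches nearly every image

# Gap between the specific and generic concept under each reward variant.
gap_full = (reward_variant(specific, views, others)
            - reward_variant(generic, views, others))
gap_pos_only = (reward_variant(specific, views, others, use_neg=False)
                - reward_variant(generic, views, others, use_neg=False))
print(gap_full > gap_pos_only)  # True: the negative term widens the gap
```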

BibTeX

@article{xiang2026scope,
  title={S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning},
  author={Xiang, Shilong and Zhang, Zirui and Mao, Chengzhi},
  journal={arXiv preprint arXiv:2606.14586},
  year={2026}
}