NOVA

Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment

¹Goethe University Frankfurt, ²German Cancer Research Center (DKFZ), ³German Cancer Consortium (DKTK). *Equal contribution

NOVA overview. Multiple augmented views of a chest X-ray are encoded by a learnable vision encoder and predictor. A frozen ClinicalBERT text encoder provides the semantic anchor. Predicted visual embeddings are aligned to the text embedding with MSE, while SIGReg regularizes the joint embedding distribution.

Abstract

Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches such as CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization.

NOVA predicts frozen, domain-specific text embeddings from augmented image views and enforces an isotropic Gaussian structure through Sketched Isotropic Gaussian Regularization (SIGReg). This removes the need for negative sampling, momentum encoders, and stop-gradients, and leaves the training objective with a single trade-off hyperparameter. On zero-shot chest X-ray classification, NOVA outperforms CLIP and MedCLIP-style baselines across MIMIC-CXR, ChestX-ray14, and CheXpert, with substantially lower variance across training runs.

TL;DR: NOVA aligns chest X-ray images to ClinicalBERT embeddings without contrastive negatives. The ViT vision encoder is trained from scratch, the objective has one main hyperparameter, and zero-shot AUC improves on both in-distribution and out-of-distribution benchmarks.


Method

NOVA extends the LeJEPA idea to vision-language alignment. Instead of contrasting each image-text pair against other samples in the batch, the model directly predicts the text embedding from multiple image crops. The text branch is a frozen ClinicalBERT encoder with a learnable projection head, so the visual model is trained against a stable clinical semantic target.

Vision Encoder

A randomly initialized ViT-Small or ViT-Base encodes global and local crops of each chest X-ray.

Predictor

A learnable 3-layer MLP maps visual features into the shared 64-dimensional embedding space.

Text Anchor

A frozen ClinicalBERT encoder maps report impressions and pathology prompts to domain-specific target embeddings.
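To make the tensor shapes concrete, here is a minimal NumPy sketch of the predictor's forward pass. The three layers and the 64-dimensional output come from the description above. The input width of 384 (the standard ViT-Small token width), the hidden width of 512, the GELU activation, and the random weights are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; the activation choice is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_predictor(rng, in_dim=384, hidden_dim=512, out_dim=64):
    # Three linear layers: in -> hidden -> hidden -> out (widths are hypothetical)
    dims = [in_dim, hidden_dim, hidden_dim, out_dim]
    return [
        (rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def predictor(params, feats):
    # Apply each linear layer, with a nonlinearity between layers but not after the last
    h = feats
    for i, (weight, bias) in enumerate(params):
        h = h @ weight + bias
        if i < len(params) - 1:
            h = gelu(h)
    return h

rng = np.random.default_rng(0)
params = init_predictor(rng)
view_feats = rng.normal(size=(8, 384))  # stand-in encoder features for 8 views of one image
pred = predictor(params, view_feats)
print(pred.shape)  # (8, 64)
```

Each row of `pred` plays the role of one predicted view embedding in the shared space.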

Each of the \(n\) predicted view embeddings \(P_{V_i}\) is aligned to the text embedding \(E_T\) with a squared error averaged over views:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \|P_{V_i} - E_T\|_2^2$$
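The alignment term can be sketched in a few lines of NumPy. The function name `mse_alignment` and the toy 2-dimensional embeddings are illustrative assumptions. The computation follows the formula above: a squared L2 distance per view, then a mean over the \(n\) views.

```python
import numpy as np

def mse_alignment(pred_views, text_emb):
    """pred_views: (n, d) predicted view embeddings; text_emb: (d,) text target."""
    diffs = pred_views - text_emb          # broadcast the single target over all views
    sq_dists = np.sum(diffs**2, axis=1)    # squared L2 norm per view
    return float(np.mean(sq_dists))        # average over the n views

# Toy example: three views, two of them one unit away from the target
pred_views = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_emb = np.array([1.0, 1.0])
loss = mse_alignment(pred_views, text_emb)
print(loss)  # (1 + 1 + 0) / 3
```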

SIGReg regularizes the joint set of image predictions and text embeddings and prevents collapse by pushing their distribution toward an isotropic Gaussian. The final loss weights the two terms with a single trade-off parameter \(\lambda\):

$$\mathcal{L}_{\text{NOVA}} = (1 - \lambda)\mathcal{L}_{\text{MSE}} + \lambda\mathcal{L}_{\text{SIGReg}}$$
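The following is a rough sketch of how such an objective could be assembled, not the paper's implementation. The regularizer here is a simplified stand-in for SIGReg: it projects embeddings onto random unit directions and compares each projection's empirical characteristic function with that of a standard normal, under a Gaussian weight. The function names, the number of directions, the integration grid, and the restriction of \(\lambda\) to \([0, 1]\) are all assumptions.

```python
import numpy as np

def sigreg_sketch(emb, num_dirs=64, num_points=17, t_max=3.0, seed=0):
    """Simplified stand-in for SIGReg on emb of shape (n, d); smaller means closer to N(0, I)."""
    rng = np.random.default_rng(seed)
    d = emb.shape[1]
    dirs = rng.normal(size=(d, num_dirs))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # random unit directions
    proj = emb @ dirs                                    # (n, num_dirs) one-dimensional sketches
    t = np.linspace(-t_max, t_max, num_points)
    phase = proj[:, :, None] * t                         # (n, num_dirs, num_points)
    ecf_real = np.cos(phase).mean(axis=0)                # empirical characteristic function
    ecf_imag = np.sin(phase).mean(axis=0)
    target = np.exp(-0.5 * t**2)                         # characteristic function of N(0, 1)
    integrand = ((ecf_real - target) ** 2 + ecf_imag**2) * np.exp(-0.5 * t**2)
    dt = t[1] - t[0]
    stat = 0.5 * dt * (integrand[:, :-1] + integrand[:, 1:]).sum(axis=1)  # trapezoid rule
    return float(stat.mean())                            # average over directions

def nova_loss(mse, sigreg, lam):
    # Weighted sum from the formula above; lam in [0, 1] is an assumption
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return (1.0 - lam) * mse + lam * sigreg

rng = np.random.default_rng(1)
gaussian = rng.normal(size=(1024, 64))   # well-spread embeddings
collapsed = np.zeros((1024, 64))         # fully collapsed embeddings
reg_gauss = sigreg_sketch(gaussian)
reg_collapsed = sigreg_sketch(collapsed)
print(reg_gauss, reg_collapsed)
total = nova_loss(mse=2.0, sigreg=4.0, lam=0.25)
```

In this toy run the collapsed embeddings receive a much larger penalty than the Gaussian ones, which is the behavior a collapse-preventing regularizer needs. In NOVA the regularizer is applied to the joint set of image predictions and text embeddings, so `emb` would be their concatenation.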

Zero-Shot Classification

NOVA is evaluated on the five CheXpert competition pathologies using text prompts such as "atelectasis" and "no atelectasis". Both NOVA variants reach an average AUC of about 76.2, against 72.44 for the strongest baseline (MedCLIP with ViT-S), and they also lead on the two out-of-distribution datasets, ChestX-ray14 and CheXpert.
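One plausible way to turn a positive and a negative prompt into a zero-shot score is sketched below. The source does not state the scoring rule, so the cosine similarity, the two-way softmax, and the names `zero_shot_prob` and `pairwise_auc` are assumptions. The AUC helper uses the standard pairwise-ranking definition.

```python
import numpy as np

def zero_shot_prob(image_emb, pos_emb, neg_emb):
    """Probability-like score that the finding is present (assumed scoring rule)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(image_emb, pos_emb), cos(image_emb, neg_emb)])
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])                      # weight on the positive prompt

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy embeddings: the image matches the positive prompt exactly
p = zero_shot_prob(np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0]))
auc = pairwise_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
print(p, auc)
```

Because AUC depends only on the ranking of scores, any monotone rescaling of the similarity difference gives the same result.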

| Framework | Algorithm | Model | Samples | Average | MIMIC-CXR | ChestX-ray14 | CheXpert |
|---|---|---|---|---|---|---|---|
| CLIP | Base | ViT-B | 1.28M | 46.56 | 48.61 | 50.63 | 40.44 |
| CLIP | InfoNCE | ViT-B | 1.41M | 66.29 | 66.78 ± 1.79 | 65.53 ± 1.46 | 66.56 ± 3.98 |
| CLIP | SigLIP | ViT-B | 1.41M | 68.19 | 68.49 ± 1.97 | 66.39 ± 1.47 | 69.70 ± 3.20 |
| MedCLIP | - | ViT-S | 130K | 72.44 | 72.07 ± 1.10 | 67.95 ± 0.52 | 77.30 ± 0.32 |
| MedCLIP | - | ViT-B | 130K | 71.07 | 71.61 ± 1.11 | 66.70 ± 1.38 | 74.91 ± 1.38 |
| NOVA | - | ViT-S | 130K | 76.23 | 75.49 ± 0.23 | 73.04 ± 0.29 | **80.15 ± 0.08** |
| NOVA | - | ViT-B | 130K | **76.25** | **75.78 ± 0.15** | **73.17 ± 0.48** | 79.79 ± 0.32 |

Zero-shot AUC (×100) across in-distribution and out-of-distribution chest X-ray benchmarks. Highest performance is bolded.
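The Average column is consistent with an unweighted mean of the three per-dataset AUCs. The short check below recomputes it from the table values. The dictionary layout and variable names are illustrative, and the unweighted-mean interpretation is inferred from the numbers rather than stated in the source.

```python
# (reported average, [MIMIC-CXR, ChestX-ray14, CheXpert]) copied from the table
rows = {
    "CLIP Base ViT-B": (46.56, [48.61, 50.63, 40.44]),
    "CLIP InfoNCE ViT-B": (66.29, [66.78, 65.53, 66.56]),
    "CLIP SigLIP ViT-B": (68.19, [68.49, 66.39, 69.70]),
    "MedCLIP ViT-S": (72.44, [72.07, 67.95, 77.30]),
    "MedCLIP ViT-B": (71.07, [71.61, 66.70, 74.91]),
    "NOVA ViT-S": (76.23, [75.49, 73.04, 80.15]),
    "NOVA ViT-B": (76.25, [75.78, 73.17, 79.79]),
}

# Largest gap between a reported average and the recomputed mean
max_gap = max(abs(avg - sum(scores) / len(scores)) for avg, scores in rows.values())
print(max_gap < 0.005)  # every row agrees up to two-decimal rounding
```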

Average zero-shot AUC across datasets

Per-Pathology Results

NOVA improves consistently across the evaluated thoracic findings. The largest gains appear on subtle or diffuse findings such as Atelectasis, Effusion, and Consolidation, where local image evidence benefits from the multi-crop training signal.

ChestX-ray14. NOVA reaches the highest AUC on all five pathologies.

CheXpert. NOVA shows strong performance, especially on Atelectasis and Consolidation.


Training Stability

NOVA trains for 100 epochs with AdamW, a cosine learning rate schedule from \(1 \times 10^{-4}\) to \(1 \times 10^{-5}\), batch size 256, and \(n = 8\) image views per sample. The paper reports smooth convergence across seeds, while MedCLIP-style training peaks early and degrades after roughly 10 epochs.

This stability follows from the combination of a fixed text anchor and SIGReg. The text embeddings provide a consistent target throughout optimization, and the distributional regularizer keeps the shared embedding space from collapsing without requiring negatives, stop-gradients, or momentum teachers.
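For reference, a cosine schedule with the endpoints quoted above can be written as follows. The 100 epochs and the learning rates \(1 \times 10^{-4}\) and \(1 \times 10^{-5}\) are from the text. Updating once per epoch, omitting warmup, and the function name `cosine_lr` are assumptions.

```python
import math

def cosine_lr(epoch, total_epochs=100, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max at epoch 0 to lr_min at epoch total_epochs."""
    progress = epoch / total_epochs  # fraction of training completed, in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(epoch) for epoch in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The schedule starts at the maximum rate, passes through the midpoint \(5.5 \times 10^{-5}\) at epoch 50, and ends at the minimum rate.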

BibTeX

@misc{kuhn2026noncontrastivevisionlanguagelearningpredictive,
      title={Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment}, 
      author={Lukas Kuhn and Giuseppe Serra and Florian Buettner},
      year={2026},
      eprint={2602.00653},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.00653}, 
}