NOVA overview. Multiple augmented views of a chest X-ray are encoded by a learnable vision encoder and predictor. A frozen ClinicalBERT text encoder provides the semantic anchor. Predicted visual embeddings are aligned to the text embedding with MSE, while SIGReg regularizes the joint embedding distribution.
Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches such as CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization.
NOVA predicts frozen, domain-specific text embeddings from augmented image views and enforces an isotropic Gaussian structure through Sketched Isotropic Gaussian Regularization (SIGReg). This removes negative sampling, momentum encoders, and stop-gradients, reducing the training objective to a single trade-off parameter. On zero-shot chest X-ray classification, NOVA outperforms CLIP and MedCLIP-style baselines across MIMIC-CXR, ChestX-ray14, and CheXpert while showing substantially more consistent training runs.
TL;DR: NOVA aligns chest X-ray images to ClinicalBERT embeddings without contrastive negatives. It trains from scratch with a ViT backbone, uses one main hyperparameter, and improves zero-shot AUC across in-distribution and out-of-distribution benchmarks.
NOVA extends the LeJEPA idea to vision-language alignment. Instead of contrasting each image-text pair against other samples in the batch, the model directly predicts the text embedding from multiple image crops. The text branch is a frozen ClinicalBERT encoder with a learnable projection head, so the visual model is trained against a stable clinical semantic target.
A randomly initialized ViT-Small or ViT-Base encodes global and local chest X-ray crops.
A learnable 3-layer MLP maps visual features into the shared 64-dimensional embedding space.
Frozen ClinicalBERT encodes report impressions and pathology prompts into domain-specific targets.
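As a rough illustration of the predictor component listed above, the following NumPy sketch maps per-view visual features into the 64-dimensional shared space with a 3-layer MLP. Only the layer count and the 64-dimensional output come from the text. The input width of 384 (the usual ViT-Small token width), the hidden width, the ReLU activation, the initialization, and the names `init_predictor` and `predict` are assumptions made for this example.

```python
import numpy as np

EMBED_DIM = 64      # shared embedding size stated in the text
FEATURE_DIM = 384   # usual ViT-Small token width; assumed to be the encoder output size
HIDDEN_DIM = 512    # hypothetical hidden width, not given in the text


def init_predictor(rng, in_dim=FEATURE_DIM, hidden=HIDDEN_DIM, out_dim=EMBED_DIM):
    """Create weights for a 3-layer MLP (two hidden layers plus an output layer)."""
    dims = [in_dim, hidden, hidden, out_dim]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def predict(params, features):
    """Map visual features of shape (..., in_dim) to embeddings of shape (..., out_dim)."""
    h = features
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU between layers; the activation is an assumption
    return h


rng = np.random.default_rng(0)
params = init_predictor(rng)
# Random stand-in for encoder outputs: a batch of 4 studies with 8 views each.
view_features = rng.normal(size=(4, 8, FEATURE_DIM))
predicted = predict(params, view_features)
print(predicted.shape)  # (4, 8, 64)
```

Each view gets its own 64-dimensional prediction, so all views of one study can be compared against the same text embedding.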
All predicted views \(P_{V_i}\) are aligned to the text embedding \(E_T\) with mean squared error:
$$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \|P_{V_i} - E_T\|_2^2$$
SIGReg regularizes the joint set of image predictions and text embeddings, preventing collapse by encouraging an isotropic Gaussian embedding structure. The final loss is:
$$\mathcal{L}_{\text{NOVA}} = (1 - \lambda)\mathcal{L}_{\text{MSE}} + \lambda\mathcal{L}_{\text{SIGReg}}$$
NOVA is evaluated on five CheXpert competition pathologies using text prompts such as "atelectasis" and "no atelectasis". It achieves the strongest average AUC and the best out-of-distribution results across ChestX-ray14 and CheXpert.
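The prompt-pair evaluation can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: it assumes each image is scored by a two-way softmax over negative squared distances to the positive and negative prompt embeddings, and it uses random vectors in place of real ClinicalBERT and image embeddings. The names `zero_shot_scores` and `roc_auc` are hypothetical helpers.

```python
import numpy as np


def zero_shot_scores(image_emb, pos_text_emb, neg_text_emb):
    """Score in (0, 1) for how strongly each image matches the positive prompt."""
    d_pos = np.sum((image_emb - pos_text_emb) ** 2, axis=1)
    d_neg = np.sum((image_emb - neg_text_emb) ** 2, axis=1)
    # A two-way softmax over negative squared distances reduces to a sigmoid.
    return 1.0 / (1.0 + np.exp(d_pos - d_neg))


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


rng = np.random.default_rng(0)
pos_prompt = rng.normal(size=64)  # stand-in for the embedding of "atelectasis"
neg_prompt = rng.normal(size=64)  # stand-in for the embedding of "no atelectasis"
labels = rng.integers(0, 2, size=200)
# Synthetic image embeddings: the matching prompt embedding plus noise.
centers = np.where(labels[:, None] == 1, pos_prompt, neg_prompt)
image_emb = centers + 0.5 * rng.normal(size=(200, 64))

scores = zero_shot_scores(image_emb, pos_prompt, neg_prompt)
auc = roc_auc(labels, scores)
print(f"AUC x100: {100 * auc:.2f}")
```

Because the synthetic embeddings are well separated by construction, the AUC printed here says nothing about real performance. The sketch only shows how one pathology's prompt pair turns into a per-image score and then into the AUC (×100) reported in the table.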
| Framework | Algorithm | Model | Samples | Average | MIMIC-CXR | ChestX-ray14 | CheXpert |
|---|---|---|---|---|---|---|---|
| CLIP | Base | ViT-B | 1.28M | 46.56 | 48.61 | 50.63 | 40.44 |
| CLIP | InfoNCE | ViT-B | 1.41M | 66.29 | 66.78 ± 1.79 | 65.53 ± 1.46 | 66.56 ± 3.98 |
| CLIP | SigLIP | ViT-B | 1.41M | 68.19 | 68.49 ± 1.97 | 66.39 ± 1.47 | 69.70 ± 3.20 |
| MedCLIP | - | ViT-S | 130K | 72.44 | 72.07 ± 1.10 | 67.95 ± 0.52 | 77.30 ± 0.32 |
| MedCLIP | - | ViT-B | 130K | 71.07 | 71.61 ± 1.11 | 66.70 ± 1.38 | 74.91 ± 1.38 |
| NOVA | - | ViT-S | 130K | 76.23 | 75.49 ± 0.23 | 73.04 ± 0.29 | **80.15 ± 0.08** |
| NOVA | - | ViT-B | 130K | **76.25** | **75.78 ± 0.15** | **73.17 ± 0.48** | 79.79 ± 0.32 |
Zero-shot AUC (×100) across in-distribution and out-of-distribution chest X-ray benchmarks. Highest performance is bolded.
NOVA improves consistently across the evaluated thoracic findings. The largest gains appear on subtle or diffuse findings such as Atelectasis, Effusion, and Consolidation, where local image evidence benefits from the multi-crop training signal.
ChestX-ray14. NOVA reaches the highest AUC on all five pathologies.
CheXpert. NOVA shows strong performance, especially on Atelectasis and Consolidation.
NOVA trains for 100 epochs with AdamW, a cosine learning rate schedule from \(1 \times 10^{-4}\) to \(1 \times 10^{-5}\), batch size 256, and \(n = 8\) image views per sample. The paper reports smooth convergence across seeds, while MedCLIP-style training peaks early and degrades after roughly 10 epochs.
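The learning rate schedule described above can be written down in a few lines. This sketch assumes the rate is updated once per epoch with no warmup, neither of which the text specifies. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=100, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max at epoch 0 to lr_min at the final epoch."""
    progress = epoch / (total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


schedule = [cosine_lr(epoch) for epoch in range(100)]
print(f"first: {schedule[0]:.1e}, last: {schedule[-1]:.1e}")
```

In a training loop, the value for the current epoch would be assigned to the AdamW optimizer's learning rate before that epoch's updates.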
This stability follows from the combination of a fixed text anchor and SIGReg. The text embeddings provide a consistent target throughout optimization, and the distributional regularizer keeps the shared embedding space from collapsing without requiring negatives, stop-gradients, or momentum teachers.
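To make the roles of the two loss terms concrete, here is a minimal NumPy sketch of \(\mathcal{L}_{\text{NOVA}}\). The MSE term follows the formula given earlier, averaged over the batch as well as the views. The regularizer `sigreg_standin` is a simplified stand-in written for illustration, not the SIGReg statistic used by the authors: it projects embeddings onto random unit directions and compares the empirical characteristic function of each projection with that of a standard normal. All function names, the number of directions, the evaluation grid, and the example value of \(\lambda\) are assumptions.

```python
import numpy as np


def mse_alignment(pred_views, text_emb):
    """Mean squared L2 distance between each predicted view and its text embedding.

    pred_views has shape (batch, n_views, dim); text_emb has shape (batch, dim).
    """
    sq_dist = np.sum((pred_views - text_emb[:, None, :]) ** 2, axis=-1)
    return float(sq_dist.mean())


def sigreg_standin(embeddings, rng, num_directions=16, t_points=None):
    """Simplified stand-in for SIGReg (an assumption, not the paper's exact statistic)."""
    if t_points is None:
        t_points = np.linspace(-3.0, 3.0, 13)
    dim = embeddings.shape[1]
    directions = rng.normal(size=(dim, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = embeddings @ directions                      # (samples, directions)
    phase = proj[:, :, None] * t_points[None, None, :]  # (samples, directions, t)
    ecf_real = np.cos(phase).mean(axis=0)               # empirical characteristic function
    ecf_imag = np.sin(phase).mean(axis=0)
    target = np.exp(-0.5 * t_points ** 2)               # characteristic function of N(0, 1)
    return float(np.mean((ecf_real - target) ** 2 + ecf_imag ** 2))


def nova_loss(pred_views, text_emb, lam, rng):
    """(1 - lam) * L_MSE + lam * regularizer on predictions and text embeddings jointly."""
    flat_views = pred_views.reshape(-1, pred_views.shape[-1])
    joint = np.concatenate([flat_views, text_emb], axis=0)
    reg = sigreg_standin(joint, rng)
    return (1.0 - lam) * mse_alignment(pred_views, text_emb) + lam * reg


rng = np.random.default_rng(0)
text = rng.normal(size=(256, 64))
views = text[:, None, :] + 0.1 * rng.normal(size=(256, 8, 64))
collapsed = np.zeros((2048, 64))        # every embedding identical: full collapse
gaussian = rng.normal(size=(2048, 64))  # isotropic Gaussian embeddings

print(sigreg_standin(collapsed, rng) > sigreg_standin(gaussian, rng))  # True
print(nova_loss(views, text, lam=0.05, rng=rng))
```

The comparison on the second-to-last line shows why such a term discourages collapse: a constant embedding has a characteristic function far from the Gaussian target, so it is penalized even though no negatives, stop-gradients, or momentum teachers are involved. In practice the same computation would be written in an autodiff framework so that gradients reach the vision encoder and predictor.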
@misc{kuhn2026noncontrastivevisionlanguagelearningpredictive,
title={Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment},
author={Lukas Kuhn and Giuseppe Serra and Florian Buettner},
year={2026},
eprint={2602.00653},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.00653},
}