Microsoft Research · Healthcare
BiomedCLIP
A biomedical vision-language model trained on 15 million figure-caption pairs from PubMed Central for medical image-text understanding.
Overview
BiomedCLIP adapts the CLIP framework to the biomedical domain by training on PMC-15M, a dataset of 15 million figure-caption pairs extracted from PubMed Central. It achieves state-of-the-art performance across a wide range of biomedical vision-language tasks, including image classification, image-text retrieval, and visual question answering. By embedding images and text in a shared representation space, the model unifies medical imaging and natural language understanding in a single framework.
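As an illustrative sketch of how the model is used in practice, the snippet below loads the published checkpoint from the Hugging Face Hub through the open_clip library and scores an image-caption pair. The image path and caption are placeholders, not part of the official examples.

```python
# Minimal sketch: score an image-caption pair with BiomedCLIP.
# Assumes `pip install open_clip_torch`; the image path and caption
# below are illustrative placeholders.
import torch
import open_clip
from PIL import Image

CKPT = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(CKPT)
tokenizer = open_clip.get_tokenizer(CKPT)
model.eval()

image = preprocess(Image.open("figure.png")).unsqueeze(0)  # placeholder path
text = tokenizer(["hematoxylin and eosin stained tissue section"],
                 context_length=256)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    # Both encoders project into a shared space; cosine similarity of
    # the normalized embeddings scores how well the caption fits the image.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    similarity = (img_emb @ txt_emb.T).item()

print(f"image-caption similarity: {similarity:.3f}")
```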
Architecture: CLIP (ViT + text encoder)
Training Data: PMC-15M (15M figure-caption pairs)
Image Encoder: Vision Transformer (ViT-B/16)
Text Encoder: PubMedBERT
License: MIT
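Two details encoded in the checkpoint name are worth noting: the text tower's context length was extended to 256 tokens to accommodate long PMC captions, and the ViT-B/16 image tower expects 224x224 inputs. A quick sanity check (a sketch; the caption string is illustrative):

```python
# Sketch: inspect the input specs of the BiomedCLIP checkpoint
# (256-token text context, 224x224 image inputs).
import open_clip

CKPT = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(CKPT)
tokenizer = open_clip.get_tokenizer(CKPT)

tokens = tokenizer(["benign fibrous histiocytoma"], context_length=256)
print(tokens.shape)  # (1, 256): captions are padded/truncated to 256 tokens
print(preprocess)    # transform pipeline resizing/cropping images to 224x224
```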
Capabilities
Biomedical image classification
Medical image-text retrieval
Visual question answering for medical images
Zero-shot medical image recognition
Cross-modal biomedical search
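The zero-shot recognition capability works by scoring an image against a set of candidate label prompts; below is a sketch using a CLIP-style prompt template and assumed labels (both are placeholders, not a validated label set).

```python
# Sketch: zero-shot classification with BiomedCLIP (no fine-tuning).
# Labels, prompt template, and image path are illustrative placeholders.
import torch
import open_clip
from PIL import Image

CKPT = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(CKPT)
tokenizer = open_clip.get_tokenizer(CKPT)
model.eval()

labels = ["chest X-ray", "brain MRI", "histopathology slide",
          "retinal fundus photograph"]
prompts = tokenizer([f"this is a photo of a {label}" for label in labels],
                    context_length=256)
image = preprocess(Image.open("scan.png")).unsqueeze(0)  # placeholder path

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(prompts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    # Scaled similarities -> probabilities over the candidate labels.
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.2%}")
```

Prompt phrasing matters in zero-shot use; small template changes can shift accuracy, so candidate prompts should be checked against a labeled sample where possible.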
Use Cases
Searching medical image databases using natural language queries
Classifying medical images without task-specific fine-tuning
Building multimodal search engines for biomedical literature
Automating figure annotation in medical publications
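For the natural-language search use case, image embeddings can be computed once and ranked against each text query. A sketch, assuming a hypothetical figures/ directory of PNG files and a placeholder query:

```python
# Sketch: natural-language search over a small image collection.
# Directory path and query are placeholders; image embeddings are
# computed once and reused across queries.
from pathlib import Path
import torch
import open_clip
from PIL import Image

CKPT = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(CKPT)
tokenizer = open_clip.get_tokenizer(CKPT)
model.eval()

paths = sorted(Path("figures/").glob("*.png"))  # placeholder corpus
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in paths])
    img_embs = model.encode_image(images)
    img_embs = img_embs / img_embs.norm(dim=-1, keepdim=True)

    query = tokenizer(["axial CT showing a pulmonary nodule"],
                      context_length=256)  # placeholder query
    q_emb = model.encode_text(query)
    q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)

    # Rank the corpus by cosine similarity to the query and print the top 5.
    scores = (img_embs @ q_emb.T).squeeze(1)
    for idx in scores.argsort(descending=True)[:5].tolist():
        print(f"{scores[idx].item():.3f}  {paths[idx]}")
```

Because the two towers are independent, the image embeddings can be indexed offline (e.g. in a vector store), and only the query needs encoding at search time.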
Pros
- State-of-the-art biomedical vision-language understanding
- Zero-shot capability reduces the need for labeled medical data
- Open source with a permissive MIT license
- Trained on the largest biomedical image-text dataset to date
Cons
- Performance varies across medical imaging modalities
- Primarily trained on published figures, not raw clinical imaging
- Requires paired image-text data for best results
- Not designed for diagnostic-grade clinical image analysis
Pricing
Free and open source under the MIT license. Model weights are available on Hugging Face, and the ViT-B/16 + PubMedBERT backbone is compact enough for inference on a single standard GPU.