Microsoft Research · Healthcare
PubMedBERT
A BERT model pre-trained from scratch on PubMed abstracts with a biomedical-specific vocabulary, achieving state-of-the-art results on biomedical NLP benchmarks.
Overview
PubMedBERT distinguishes itself from other biomedical BERT variants by being pre-trained entirely from scratch on biomedical text rather than initialized from general-domain BERT weights. Because its WordPiece vocabulary is also built from biomedical text, domain terms are kept intact rather than fragmented into generic subwords, which yields substantial improvements on biomedical NLP tasks. It consistently outperforms BioBERT and other mixed-domain models on the Biomedical Language Understanding and Reasoning Benchmark (BLURB).
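As a quick sketch of how the model can be loaded for masked-token prediction via Hugging Face transformers (the model ID below is an assumption based on the historical Hub naming; the repository may since have been renamed, so verify it on the Hub):

```python
# Hedged sketch: masked-token prediction with PubMedBERT.
# The model ID is an assumption -- check the Hugging Face Hub for the
# current repository name before relying on it.
from transformers import pipeline

fill = pipeline(
    "fill-mask",
    model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
)

# The pipeline returns the top candidates for the [MASK] position,
# each a dict with "token_str" and "score".
preds = fill("The patient was treated with [MASK] for hypertension.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```

Because the model is uncased, inputs are lowercased by the tokenizer; the `[MASK]` token itself must be written exactly as shown.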
Parameters
110M
Architecture
BERT-Base (from-scratch pretraining)
Training Data
PubMed abstracts (3.1B words)
Vocabulary
Custom biomedical WordPiece (30K tokens)
License
MIT
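The value of the custom vocabulary can be illustrated with a toy greedy longest-match-first WordPiece tokenizer. Both mini-vocabularies below are invented for this sketch (real BERT and PubMedBERT vocabularies each hold ~30K entries); the point is that a general-domain vocabulary shreds a biomedical term into subwords while a biomedical vocabulary keeps it whole:

```python
# Toy illustration of why a domain-specific WordPiece vocabulary matters.
# Both mini-vocabularies are made up for this sketch.

def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece tokenization of one word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

general_vocab = {"ace", "##ty", "##l", "##cho", "##line"}
biomed_vocab = {"acetylcholine"}

print(wordpiece("acetylcholine", general_vocab))  # fragmented into subwords
print(wordpiece("acetylcholine", biomed_vocab))   # kept as one token
```

Fewer, more meaningful tokens give the model shorter input sequences and representations aligned with actual biomedical terms.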
Capabilities
Biomedical named entity recognition
Biomedical relation extraction
Biomedical question answering
Sentence similarity in medical context
Document classification for medical literature
Use Cases
Extracting gene-disease associations from research papers
Classifying clinical trial eligibility criteria
Building biomedical knowledge graphs from literature
Semantic search across medical publication databases
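For semantic search, a common pattern is to pool PubMedBERT's hidden states into one vector per document and rank by cosine similarity against a query vector. The 4-dimensional vectors below are made up so the sketch runs without the model; in practice they would be 768-dimensional encoder outputs:

```python
# Minimal semantic-search ranking sketch. The embeddings are hypothetical
# placeholders; real ones would come from pooling PubMedBERT outputs.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

corpus = {
    "paper_a": [0.9, 0.1, 0.0, 0.2],
    "paper_b": [0.1, 0.8, 0.3, 0.0],
    "paper_c": [0.85, 0.2, 0.05, 0.1],
}
query = [1.0, 0.1, 0.0, 0.1]

# Rank documents by similarity to the query embedding, best first.
ranked = sorted(corpus, key=lambda doc: cosine(query, corpus[doc]), reverse=True)
print(ranked)
```

At scale the ranking step is usually replaced by an approximate-nearest-neighbor index, but the scoring logic is the same.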
Pros
- Top performance on the BLURB benchmark for biomedical NLP
- Domain-specific vocabulary captures biomedical terminology better
- Lightweight (110M parameters) and efficient for production deployment
- Well-supported with extensive documentation and benchmarks
Cons
- Encoder-only model; cannot generate text
- Limited 512-token context window
- Focused on abstracts; may underperform on full-text clinical documents
- Requires task-specific fine-tuning
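The 512-token limit is typically worked around by splitting long documents into overlapping windows, encoding each window separately, and aggregating the predictions. A minimal sketch of the windowing step, operating on token IDs (the window size and stride here are illustrative; in practice a couple of positions are also reserved for `[CLS]` and `[SEP]`):

```python
# Sliding-window chunking for inputs longer than the 512-token limit.
# Window size and stride are illustrative choices, not fixed by the model.

def sliding_windows(token_ids, max_len=512, stride=128):
    """Split token_ids into windows of up to max_len tokens, where adjacent
    windows overlap by `stride` tokens so no span loses its context."""
    step = max_len - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # last window already reaches the end of the sequence
    return windows

chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
print([len(c) for c in chunks])
```

Each window is then encoded independently, and per-token outputs in the overlap region can be merged (for NER) or per-window scores averaged (for classification).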
Pricing
Free and open-source. Available on Hugging Face. Self-hosting costs depend on infrastructure; runs efficiently on a single GPU.