Datasets

We have curated the following datasets, which together form the Harvard Ophthalmology AI Datasets, to facilitate AI research.

Harvard Glaucoma Detection with 500 Samples (Harvard-GD500): The Harvard-GD500 dataset includes 500 samples from 500 patients for glaucoma detection. It is provided so that the results in our paper “Artifact-Tolerant Clustering-Guided Contrastive Embedding Learning for Ophthalmic Images in Glaucoma”, published in the Journal of Biomedical and Health Informatics, can be confirmed. Here is the data download link for Harvard-GD500. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard Glaucoma Detection and Progression with 1,000 Samples (Harvard-GDP1000): The Harvard-GDP1000 dataset includes 1,000 samples from 1,000 patients for glaucoma detection and progression forecasting. It is provided so that the results in our paper “Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning”, published in the 2023 International Conference on Computer Vision, can be confirmed. The corresponding code is available on our GitHub repository Harvard-GDP. Here is the data download link for Harvard-GDP1000. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard Glaucoma Fairness with 3,300 Samples (Harvard-GF3300): The Harvard-GF3300 dataset includes 3,300 samples from 3,300 patients for fairness learning in glaucoma. It is used in our manuscript “Harvard Glaucoma Fairness: A Retinal Nerve Disease Dataset for Fairness Learning and Fair Identity Normalization”. The corresponding code is available on our GitHub repository Harvard-GF. Here is the data download link for Harvard-GF3300. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairVision with 30,000 Samples (Harvard-FairVision30k): The Harvard-FairVision30k dataset includes 30,000 samples from 30,000 patients covering three major eye diseases (age-related macular degeneration, diabetic retinopathy, and glaucoma) that affect 380 million patients globally. The dataset includes 10,000 subjects for each of the three diseases and is used in our manuscript “FairVision: Equitable Deep Learning for Eye Disease Screening via Fair Identity Scaling”. Here is the data download link for Harvard-FairVision30k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairSeg with 10,000 Samples (Harvard-FairSeg10k): The Harvard-FairSeg10k dataset includes 10,000 samples from 10,000 patients for studying fairness in medical image segmentation, with the task of cup and rim segmentation on fundus photos for glaucoma. This dataset is used in our paper “FairSeg: A Large-scale Medical Image Segmentation Dataset for Fairness Learning with Fair Error-Bound Scaling”, published in the 2024 International Conference on Learning Representations. The corresponding code is available on our GitHub repository Harvard-Fairseg. Here is the data download link for Harvard-FairSeg10k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairVLMed with 10,000 Samples (Harvard-FairVLMed10k): The Harvard-FairVLMed10k dataset includes 10,000 samples from 10,000 patients for studying fairness in medical vision-language models. This dataset is used in our paper “FairCLIP: Harnessing Fairness in Vision-Language Learning”, published in the 2024 Conference on Computer Vision and Pattern Recognition. The corresponding code is available on our GitHub repository FairCLIP. Here is the data download link for Harvard-FairVLMed10k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.