Datasets

We have curated the following datasets, which together form the Harvard Ophthalmology AI Datasets, to facilitate AI research.

Harvard Glaucoma Detection with 500 Samples (Harvard-GD500): This Harvard-GD500 dataset includes 500 samples from 500 patients for glaucoma detection to confirm results in our paper “Artifact-Tolerant Clustering-Guided Contrastive Embedding Learning for Ophthalmic Images in Glaucoma” published in the Journal of Biomedical and Health Informatics. Here is the data download link for Harvard-GD500. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard Glaucoma Detection and Progression with 1,000 Samples (Harvard-GDP1000): This Harvard-GDP1000 dataset includes 1,000 samples from 1,000 patients for glaucoma detection and progression forecasting to confirm results in our paper “Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning” published in the 2023 International Conference on Computer Vision. The corresponding code is available on our GitHub repository Harvard-GDP. Here is the data download link for Harvard-GDP1000. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard Glaucoma Fairness with 3,300 Samples (Harvard-GF3300): This Harvard-GF3300 dataset includes 3,300 samples from 3,300 patients for fairness learning in glaucoma and is used in our manuscript “Harvard Glaucoma Fairness: A Retinal Nerve Disease Dataset for Fairness Learning and Fair Identity Normalization”. The corresponding code is available on our GitHub repository Harvard-GF. Here is the data download link for Harvard-GF3300. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairVision with 30,000 Samples (Harvard-FairVision30k): This Harvard-FairVision30k dataset includes 30,000 samples from 30,000 patients covering three major eye diseases that affect 380 million patients globally: age-related macular degeneration, diabetic retinopathy, and glaucoma. The dataset includes 10,000 subjects for each of the three diseases and is used in our manuscript “FairVision: Equitable Deep Learning for Eye Disease Screening via Fair Identity Scaling”. Here is the data download link for Harvard-FairVision30k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairSeg with 10,000 Samples (Harvard-FairSeg10k): This Harvard-FairSeg10k dataset includes 10,000 samples from 10,000 patients to study fairness in medical image segmentation, using the task of cup and rim segmentation on fundus photos for glaucoma. This dataset is used in our paper “FairSeg: A Large-scale Medical Image Segmentation Dataset for Fairness Learning with Fair Error-Bound Scaling” published in the 2024 International Conference on Learning Representations. The corresponding code is available on our GitHub repository Harvard-Fairseg. Here is the data download link for Harvard-FairSeg10k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairVLMed with 10,000 Samples (Harvard-FairVLMed10k): This Harvard-FairVLMed10k dataset includes 10,000 samples from 10,000 patients to study the fairness issue in medical vision-language models. This dataset is used in our paper “FairCLIP: Harnessing Fairness in Vision-Language Learning” published in the 2024 Conference on Computer Vision and Pattern Recognition. The corresponding code is available on our GitHub repository FairCLIP. Here is the data download link for Harvard-FairVLMed10k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairDomain with 20,000 Samples (Harvard-FairDomain20k): This Harvard-FairDomain20k dataset includes data for both segmentation and classification tasks for studying fairness under domain shift. The segmentation task and the classification task each include 10,000 samples from 10,000 patients. The samples in the Harvard-FairDomain20k dataset are derived from Harvard-FairSeg10k and Harvard-FairVLMed10k, with en-face fundus images added as a second imaging modality alongside the scanning laser ophthalmoscopy (SLO) fundus images originally in the two datasets. This dataset is used in our paper “FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification” published in the 2024 European Conference on Computer Vision. The corresponding code is available on our GitHub repository FairDomain. Here is the data download link for Harvard-FairDomain20k. This dataset can only be used for non-commercial research purposes. At no time shall the dataset be used for clinical decisions or patient care. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.