Datasets

Public Datasets

Harvard Glaucoma Detection with 500 Samples (Harvard-GD500): The Harvard-GD500 dataset includes 500 samples from 500 patients for glaucoma detection and supports reproducing the results in our paper “Artifact-Tolerant Clustering-Guided Contrastive Embedding Learning for Ophthalmic Images in Glaucoma” published in the Journal of Biomedical and Health Informatics. Here is the data download link for Harvard-GD500. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard Glaucoma Detection and Progression with 1000 Samples (Harvard-GDP1000): The Harvard-GDP1000 dataset includes 1,000 samples from 1,000 patients for glaucoma detection and progression forecasting and supports reproducing the results in our paper “Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning” published at the 2023 International Conference on Computer Vision. The corresponding code is available on our GitHub repository Harvard-GDP. Here is the data download link for Harvard-GDP1000. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard Glaucoma Fairness with 3300 Samples (Harvard-GF3300): The Harvard-GF3300 dataset includes 3,300 samples from 3,300 patients for fairness learning in glaucoma and is used in our manuscript “Harvard Glaucoma Fairness (Harvard-GF): A Retinal Nerve Disease Dataset for Fairness Learning and Fair Identity Normalization”. The corresponding code is available on our GitHub repository Harvard-GF. Here is the data download link for Harvard-GF3300. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard Eye Fairness with 30,000 Samples (Harvard-EF30k): The Harvard-EF30k dataset includes 30,000 samples from 30,000 patients covering three major eye diseases (age-related macular degeneration, diabetic retinopathy, and glaucoma), which affect 380 million patients globally. The dataset includes 10,000 subjects for each of the three diseases and is used in our manuscript “Harvard Eye Fairness: A Large-Scale 3D Imaging Dataset for Equitable Eye Diseases Screening and Fair Identity Scaling”. Here is the data download link for Harvard-EF30k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Harvard FairSeg with 10,000 Samples (Harvard-FairSeg10k): The Harvard-FairSeg10k dataset includes 10,000 samples from 10,000 patients for studying fairness in medical image segmentation, with the task of cup and rim segmentation on fundus photos for glaucoma. This dataset is used in our paper “FairSeg: A Large-scale Medical Image Segmentation Dataset for Fairness Learning with Fair Error-Bound Scaling” published at the 2024 International Conference on Learning Representations. The corresponding code is available on our GitHub repository Harvard-Fairseg. Here is the data download link for Harvard-FairSeg10k. The data use license is CC BY-NC-ND 4.0. If you have any questions about this dataset, please email harvardophai@gmail.com.

Privileged Datasets

Massachusetts Eye and Ear Dataset: We have demographic and clinical information for 1.69 million patients, 1.91 million fundus photos from 67,000 patients, 900,000 optical coherence tomography scans from 85,000 patients, and 450,000 visual fields from 78,000 patients.

Glaucoma Research Network Dataset: We have 602,000 24-2 visual fields from 129,000 patients and 54,000 10-2 visual fields from 12,000 patients. The Glaucoma Research Network comprises Massachusetts Eye and Ear at Harvard Medical School, Wilmer Eye Institute at Johns Hopkins University, New York Eye and Ear Infirmary at Icahn School of Medicine at Mount Sinai, Bascom Palmer Eye Institute at University of Miami, Wills Eye Hospital at Thomas Jefferson University, Edward S. Harkness Eye Institute at Columbia University, and Hamilton Eye Institute at University of Tennessee Health Science Center.

LIFE Dataset: We have data from 10,000 patients at baseline and 2,000 patients at 5-year follow-up from the LIFE-Adult Study, a population-based cohort study conducted by the Leipzig Research Centre for Civilization Diseases (LIFE) at the University of Leipzig in Germany. Participants were randomly selected from the 550,000 residents of Leipzig. All participants underwent fundus photography and optical coherence tomography examination, in addition to an extensive core assessment including physical examinations, cognitive function tests, genetic data, biospecimen tests, structured interviews, questionnaires, and brain magnetic resonance imaging scans.

American Academy of Ophthalmology IRIS Registry Dataset: The American Academy of Ophthalmology IRIS Registry (Intelligent Research in Sight) is the nation’s first comprehensive EHR-based clinical registry for eye diseases and conditions and is also the world’s largest clinical data registry. As of May 2021, the IRIS Registry had participation from over 14,000 ophthalmologists and their 3,300 employed optometrists and included approximately 400 million patient encounters from about 70 million unique patients. Its reach continues to grow, providing ophthalmologists with clinical benchmarks and practice patterns. The Academy developed it as part of the profession’s shared goal of continual improvement in the delivery of eye care. Massachusetts Eye and Ear is one of the four institutions with full access to the entire dataset.

UK Biobank Dataset: We have data from 68,000 patients at baseline and 19,000 patients at follow-up, including fundus photos, optical coherence tomography scans, physical examinations, cognitive function tests, genetic data, biospecimen tests, self-reported health conditions, ICD codes, brain, cardiac, and abdominal magnetic resonance imaging scans, dual-energy X-ray absorptiometry scans, and carotid ultrasound scans. We access the UK Biobank data by paying an access fee.