The term ‘artificial intelligence’ was coined by a group of scientists during a workshop known as the “Dartmouth Summer Research Project on Artificial Intelligence” in 1956.1 The concept was based on the idea that “intelligent human behavior consisted in processes that could be formalized and reproduced in machine”.1 The subfield of machine learning (ML) aims to train algorithms to recognize patterns from a large amount of data using extracted features. However, these features require manual extraction, which is labour intensive. Examples of this method include random forests, support vector machines and decision trees.2 Currently, deep learning (DL) is considered the state-of-the-art ML technique, and these DL algorithms are largely preferred in medical image analyses. In contrast to classical ML, DL algorithms do not require manual feature extraction and typically involve multi-layered artificial neural networks (NN).2–4 In ophthalmology, DL models have been successfully applied to various imaging modalities, including colour fundus photography (CFP), optical coherence tomography (OCT) and visual field (VF) testing.
Broadly speaking, DL models have been deployed in ophthalmology for the following purposes: classification, segmentation and prediction. Most of the published studies to date have focused on classification tasks, such as classifying whether a particular colour fundus image contains referable diabetic retinopathy (DR). In recent years, predictive DL models have become an area of particular interest for researchers, as they could be used as clinical decision-support tools. They could also, within certain contexts, make predictions that are beyond the capabilities of human clinicians.
In this review, we aim to highlight predictive DL models by organizing published manuscripts in this area into the following themes: structure-structure prediction, structure-function prediction, disease onset/progression prediction and treatment response prediction. In addition, we focus on three major diseases that can lead to blindness, namely age-related macular degeneration (AMD), DR and glaucoma.
The PubMed database was searched for original investigations published between January 2017 and March 2023 using the following keywords: “deep learning”, “artificial intelligence”, “prediction”, “age-related macular degeneration”, “diabetic retinopathy” and “glaucoma”. Initially, 77 original research articles were identified. Studies that included only classical ML were excluded from our final review.
CFP, an easily accessible imaging tool, is widely used as a screening modality for retinal and optic nerve head pathologies.5–7 The advent of OCT has further revolutionized ophthalmology, as it can image ocular tissue noninvasively, with micrometre-level resolution. Compared with CFP, OCT images can provide much more valuable medical information due to the higher resolution and 3D nature of OCT volumes. However, OCT imaging is limited by its relatively narrow field of view and the costly, non-portable nature of OCT machines. Therefore, DL models that can predict OCT metrics or characteristics directly from CFP can be invaluable.
In the management of DR, increased central foveal thickness (CFT) due to diabetic macular oedema (DMO) detected on OCT is an important indication for anti-vascular endothelial growth factor (anti-VEGF) therapy. Studies have shown that eyes with DMO visible on OCT images may not have obvious features, such as lipid exudates, in CFP.8 To address the limited sensitivity in detecting DMO from CFP, two studies trained DL models to predict CFT and quantitative retinal fluid metrics on OCT directly from CFP.9,10 Both studies showed promising results, with an area under the receiver operating characteristic curve (AUC) ranging from 0.89 (95% confidence interval [CI] 0.87–0.91) in predicting centre-involving DMO to 0.97 (95% CI 0.89–1.00) in predicting CFT above 250 microns on spectral-domain OCT (SD-OCT).9,10
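To make the AUC figures above concrete, the following minimal sketch computes an AUC using the rank-based (Mann-Whitney) formulation. Only the 250 µm threshold is taken from the text; the function name `auc_from_scores`, the CFT values and the model scores are hypothetical and are not data from the cited studies.

```python
def auc_from_scores(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Ties count as half a win. `labels` are 0/1; `scores` are model outputs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical OCT-measured CFT values (µm) and hypothetical model scores
# produced from the paired CFPs.
cft_um = [210, 230, 245, 260, 300, 410]
scores = [0.10, 0.35, 0.60, 0.55, 0.80, 0.95]

# Binary ground truth: CFT above 250 µm (threshold taken from the text).
labels = [1 if t > 250 else 0 for t in cft_um]

auc = auc_from_scores(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 would mean that every eye with CFT above the threshold received a higher score than every eye below it, whereas 0.5 corresponds to chance-level ranking.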
CFPs were also used for predicting retinal thickness on OCT images in glaucoma. Glaucomatous damage is defined by the thinning of the retinal nerve fibre layer (RNFL) and corresponding VF defects. Medeiros et al. trained a DL model to predict progressive thinning of RNFL on OCT directly from longitudinal CFP.11 The model predicted progressive damage, with an AUC of 0.86 (95% CI 0.83–0.88), and the predicted RNFL thickness values were significantly correlated with the observed RNFL values (r=0.76; 95% CI 0.70–0.80).11 Similarly, a hybrid DL + classical ML model (pre-trained DL model + support vector machine) was developed by Lee et al. to predict macular ganglion cell layer-inner plexiform layer (mGCL-IPL) thickness on OCT using red-free RNFL photography.12 Their model’s predictions were strongly correlated with the measurements taken by human experts (correlation coefficient r=0.739, mean absolute error [MAE] 4.76 µm; p<0.001).12
The vast majority of the published studies to date on this topic centred on using DL and CFPs to predict OCT characteristics and metrics. This has huge implications for decentralized monitoring in a non-ophthalmology setting. In general, OCT machines are much more expensive than devices capturing CFPs, and OCT machines are typically only available at ophthalmology clinics. In contrast, a plethora of options exist for capturing CFPs, including low-cost, portable colour fundus cameras, and these devices could be paired with structure-structure predictive DL models for at-home monitoring. For example, a patient with known DR could undergo regular fundus imaging at home, and the captured CFPs could be analysed by a DL model to monitor for central macular thickening due to DMO. Similarly, a patient with known glaucoma could undergo regular fundus imaging at home, and the captured CFPs could be analysed by a DL model to monitor for progressive RNFL thinning. As the next steps, DL models could also be trained to predict OCT angiography metrics, such as vascular density, from CFPs.
DL models have been developed to predict various visual functions directly from images. Herein, we present four examples of models used in retinal diseases.
Age-related macular degeneration
Using OCT images from the phase III HARBOR (ClinicalTrials.gov identifier: NCT00891735) clinical trial,13,14 which involved monthly visits for patients with neovascular AMD (nAMD) undergoing anti-VEGF injections, Kawczynski et al. developed a DL model15 to predict visual acuity (VA) at every concurrent visit and VA at 12 months from baseline. The model achieved better overall results in predicting VA of the fellow eyes (AUC 0.98 at concurrent visits and AUC 0.96 at 12 months) compared with study eyes (AUC 0.92 at the concurrent visits and AUC 0.84 at 12 months).15
Balaskas et al. aimed to predict VA in patients with geographic atrophy (GA) under standard and low-luminance conditions.16 First, the OCT images were segmented using DL techniques. Then, a random forest regression model was trained using the segmented images to predict VA in Early Treatment Diabetic Retinopathy Study (ETDRS) letters.16 The model achieved an r2 of 0.40 (MAE 11.7 ETDRS letters) in predicting standard-luminance VA and an r2 of 0.25 (MAE 12.1 ETDRS letters) in predicting low-luminance VA from OCT images.
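As a rough illustration of how regression metrics of this kind are obtained, the sketch below computes the MAE and the coefficient of determination for a handful of VA predictions. It assumes the conventional definition, 1 minus the ratio of residual to total sum of squares; the cited study may have defined its statistic differently (for example, as a squared Pearson correlation). The letter scores are hypothetical.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted VA, in ETDRS letters.
observed_letters = [40, 55, 60, 70, 85]
predicted_letters = [50, 50, 65, 65, 75]

mae = mean_absolute_error(observed_letters, predicted_letters)
r2 = r_squared(observed_letters, predicted_letters)
print(f"MAE = {mae:.1f} letters, R2 = {r2:.2f}")
```

The two metrics answer different questions: the MAE is expressed in the clinical unit (letters), while the coefficient of determination describes the share of between-patient variance that the model explains.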
Microperimetry (MP) is another important visual function assay that produces retinal sensitivity results comparable to standard automated perimetry but also has better anatomical-functional correspondence.17,18 Additionally, MP can effectively detect residual visual function in various ophthalmic conditions such as glaucoma, DR and AMD.19–21 In their review, Midena et al. also concluded that MP was superior to VA in detecting functional changes in patients with AMD.19 Seebock et al. trained a DL model (ReSensNet) on images from healthy individuals and patients with nAMD and GA to directly predict retinal sensitivity on MP from OCT images.22 The model was then tested on an external dataset, consisting of eyes with DMO, retinal vein occlusion and epiretinal membrane. The MAE for point-wise sensitivity was 2.73 decibels (dB), and the MAE for mean sensitivity was 1.66 dB.22
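The gap between a point-wise MAE and a mean-sensitivity MAE can be illustrated with a short sketch. The function names and the sensitivity values below are hypothetical (four stimulus locations per exam, for brevity), and the cited study may have aggregated its errors differently.

```python
def pointwise_mae(observed_exams, predicted_exams):
    """MAE pooled over every stimulus location of every exam (dB)."""
    errors = [
        abs(o - p)
        for obs, pred in zip(observed_exams, predicted_exams)
        for o, p in zip(obs, pred)
    ]
    return sum(errors) / len(errors)


def mean_sensitivity_mae(observed_exams, predicted_exams):
    """MAE of the per-exam mean sensitivity (dB)."""
    errors = [
        abs(sum(obs) / len(obs) - sum(pred) / len(pred))
        for obs, pred in zip(observed_exams, predicted_exams)
    ]
    return sum(errors) / len(errors)


# Hypothetical sensitivities (dB): two exams, four locations each.
observed = [[20, 22, 18, 10], [25, 24, 26, 23]]
predicted = [[22, 20, 17, 13], [24, 26, 25, 22]]

pw = pointwise_mae(observed, predicted)
ms = mean_sensitivity_mae(observed, predicted)
print(f"point-wise MAE = {pw:.3f} dB, mean-sensitivity MAE = {ms:.3f} dB")
```

Because over- and under-estimates at individual locations partly cancel when sensitivities are averaged within an exam, the mean-sensitivity error cannot exceed the point-wise error when every exam has the same number of locations.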
Lin et al. trained a DL model to predict visual impairment from OCT images of eyes with DMO.23 Adequate VA was defined as a decimal VA of ≥0.05, and impaired VA was defined as a decimal VA of <0.05. The model achieved an AUC of 0.80 in predicting adequate versus impaired VA.23
Beyond retinal diseases, the majority of the structure-function prediction studies pertained to glaucoma, specifically to predicting VF results from images. VF testing is the gold standard for quantifying functional deficits in patients with glaucoma and is central to their management.
Various studies focused on predicting threshold sensitivity values in 24–2 standard automated perimetry from segmented OCT images.24–28 While the majority of these studies used SD-OCT imaging, Park et al. used swept-source OCT images, and the global root mean squared prediction error of their model was 4.44 dB.25 Christopher et al. used RNFL en face images, scanning laser ophthalmoscopy images and RNFL thickness measurements to predict 24–2 VF, and their model achieved an R2 of 0.70 and MAE of 2.5 dB, outperforming a model that was only trained with RNFL thickness measurements.28 Two other studies developed DL models to predict 24–2 VF results from unsegmented SD-OCT images and achieved similar results. The model developed by Hemelings et al. had an MAE of 4.82 (4.45–5.22) dB.29 Kihara et al. trained their EfficientNet B2 model with both unsegmented OCT images and infrared reflectance images to predict each of the 52 sensitivity points on the 24–2 VF, and their model had an MAE of 0.485 (0.438–0.533).30
Other glaucoma studies focused on the central 10° region and on predicting 10–2 VF sensitivity values.31–35 Xu et al. trained DL models with segmented SD-OCT images (mGCL-IPL, RNFL and outer segment + retinal pigment epithelium) to predict VF sensitivity at each point.34 The MAE for the whole VF was 2.72 ± 2.60 dB for convolutional neural networks-tensor regression, one of the DL models.34 Hashimoto et al. developed a convolutional neural network with pattern-based regularization, trained with segmented SD-OCT images (mGCL-IPL, RNFL and outer segment + retinal pigment epithelium), and their proposed model outperformed classical ML models, achieving an MAE of 2.84 (±2.98) dB.31 Moon et al. used swept-source OCT images to develop two DL models to predict 10–2 VF.35 The MAE for global prediction was 3.10 dB for the model that was trained with mGCL-IPL thickness maps and wide-field en face images and 3.17 dB for the model that was trained with mGCL-IPL thickness maps and RNFL thickness.35
Christopher et al. used macular SD-OCT images to estimate 10–2 and 24–2 VF results. The model, which took into account six segmented retinal layers simultaneously, performed the best in predicting mean deviation values and achieved an MAE of 1.9 dB (95% CI 1.6–2.4 dB) for 10–2 VF and 2.1 dB (95% CI 1.8–2.5 dB) for 24–2 VF testing.36
Lastly, Shamsi et al.37 aimed to predict retinal contrast sensitivity in patients with AMD and glaucoma from segmented OCT images. The model was trained on healthy individuals, patients with AMD and patients with glaucoma. The authors reported that mGCL and IPL thicknesses and reflectivity of retinal ganglion cells were significantly correlated with contrast sensitivity, and this correlation was corroborated by class activation maps of the images in the test set. The model achieved an MAE of 0.13 ± 0.011 in predicting Pelli–Robson contrast sensitivity values for all subjects.37
In ophthalmology, the most widely used functional metric is VA. However, this metric has limitations since, in many common ophthalmic conditions such as glaucoma and retinitis pigmentosa, VA is not affected until end-stage disease has set in. As alternatives to VA, many functional assays, such as VF and contrast sensitivity, are used in clinical trials and in routine clinical practice, but these alternative assays are typically time consuming and labour intensive. They are also limited by patient and operator variability. In contrast, imaging tests are generally more readily available, reliable and repeatable than functional tests. Therefore, DL models that can predict functional status from objective imaging hold great promise in revolutionizing the way that we assess and monitor functional endpoints in various ophthalmic diseases. A major limitation of this approach is the sheer number of possible combinations of imaging and functional tests. Datasets with paired data points will be needed to train such DL models: for example, paired optic nerve OCT and VF for glaucoma and paired fundus autofluorescence (FAF) and contrast sensitivity for GA. Collecting and curating such datasets requires significant resources and effort, so for a given condition, it will be more practical and scalable for experts to agree on and perhaps standardize the optimal combination of imaging and functional tests first before proceeding with large-scale data collection in the future.
Disease onset/progression prediction
Age-related macular degeneration
Several studies used OCT images to predict progression from early/intermediate to advanced AMD in the fellow eye of patients with nAMD in one eye.38–40 Russakoff et al. pre-processed the OCT images in the training set using segmentation and improved the performance of their model (AMDnet) to an AUC of 0.89 at the scan level and 0.91 at the volume level.38 On the other hand, Yim et al. used both segmented and raw OCT volumes for training, and their model achieved an AUC of 0.745 in predicting imminent (6 months) conversion to nAMD; in a head-to-head comparison, this system outperformed the three retinal specialists and two optometrists and showed an equivalent performance to the remaining human expert on the panel.39 Finally, Banerjee et al. developed a hybrid model, combining patient demographic information, VA and OCT image features.40 The model achieved an AUC of 0.82 in predicting conversion to nAMD within 3 months and an AUC of 0.68 in predicting conversion to nAMD within 21 months.
A number of studies used CFPs from the Age-Related Eye Disease Study (AREDS) clinical trial to train their neural networks.41–45 Bhuiyan et al. applied a two-step approach: (1) classification of images according to disease severity using DL and (2) prediction of progression to advanced AMD using classical ML.43 The hybrid model achieved 84% accuracy in predicting advanced AMD development (GA and nAMD) within 2 years.43 Ganjdanesh et al. developed a generative adversarial network (GAN) that learned from temporal, longitudinal changes in CFPs; this model achieved an accuracy of 0.762 (95% CI 0.733–0.792) in simultaneously grading disease severity at the current time point and in predicting progression to late AMD at a future time point.45 Lastly, Liefers et al. trained their DL model with automatically segmented CFPs to predict GA growth rate.46 The dataset included patients from the Rotterdam Study and Blue Mountains Eye Study for training and patients from the AREDS trial for validation.47,48 The model reached an intraclass correlation of 0.83 between predicted and ground-truth GA areas.46
Other studies focused on the progression of dry AMD and GA. Gigon et al. developed a DL model to predict en face retinal pigment epithelium and outer retinal atrophy (RORA) progression on OCT, achieving a Dice score that ranged from 0.46 to 0.72 in predicting the RORA growth regions.49 Zhang et al. developed a bi-directional long short-term memory prediction module that had an average Dice score ranging between 0.86 and 0.92 in predicting GA growth on SD-OCT images under different scenarios, and their model gained 10% in accuracy after the integration of time-related factors.50 Anegondi et al. developed two DL models for predicting GA growth rate: one trained with FAF images only and the other trained with both FAF and OCT images.51 Interestingly, the model trained with FAF images only performed better, with an AUC of 0.98 (0.97–0.99) in predicting GA growth rate.51 Kalra et al. focused on predicting degeneration of the ellipsoid zone on SD-OCT images, and the at-risk ellipsoid zone areas identified by their DL model showed an intraclass correlation of 0.83 with the ground truth.52
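For readers less familiar with the Dice score used in these atrophy-growth studies, the following minimal sketch computes it for two small binary masks. The masks and the function name `dice_score` are hypothetical; real en face maps are far larger, and the cited studies may have applied additional pre-processing.

```python
def dice_score(predicted_mask, observed_mask):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks
    given as equally sized lists of rows containing 0/1 values."""
    pred = {
        (r, c)
        for r, row in enumerate(predicted_mask)
        for c, v in enumerate(row)
        if v
    }
    obs = {
        (r, c)
        for r, row in enumerate(observed_mask)
        for c, v in enumerate(row)
        if v
    }
    if not pred and not obs:
        return 1.0  # Convention: two empty masks agree perfectly.
    return 2 * len(pred & obs) / (len(pred) + len(obs))


# Hypothetical 3 x 4 en face masks: 1 marks a pixel inside the growth region.
predicted_growth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
observed_growth = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]

dice = dice_score(predicted_growth, observed_growth)
print(f"Dice = {dice:.2f}")
```

A Dice score of 1 indicates perfect overlap and 0 indicates none, so the metric penalizes both missed growth and falsely predicted growth.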
Compared with AMD, fewer studies investigated disease onset/progression within the context of DR. A model created by Bora et al. aimed to predict the development of DR within 2 years from baseline fundus photographs that did not contain any DR and achieved an AUC of 0.70 (95% CI 0.67–0.74) during external validation.53 Arcadu et al. used ETDRS 7-field CFPs from patients with DR at baseline to predict two-step worsening on the ETDRS severity scale at month 12, but their model only achieved a modest performance, with a mean AUC of 0.61.54
Several studies trained DL models with CFPs to predict the onset of glaucoma. Both Thakur et al. and Lin et al. used CFPs to predict whether a person will develop glaucoma within a certain time frame in the future.55,56 While Thakur et al. reported moderate performance (AUC 0.88 [0.86–0.91] for onset in 1–3 years and AUC 0.77 [0.75–0.78] for onset in 4–7 years),55 the Multi-scale Multi-structure Siamese Network (MMSNet) developed by Lin et al. achieved an AUC of 0.93 for predicting onset of primary open-angle glaucoma in 2 years and AUC of 0.95 for predicting onset in 5 years.56
Hou et al. employed vision transformers, a cutting-edge DL architecture, to predict VF worsening using longitudinal OCT images, and their most robust model achieved an AUC of 0.97 (95% CI 0.88–1.00).57 Similarly, Herbert et al. used vision transformers to predict rapid worsening in VF defects (more than 1 dB decrease per year globally) from OCT images, and their best-performing model had an AUC of 0.87 (95% CI 0.77–0.97).58
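A threshold such as "more than 1 dB decrease per year" implies a rate of change estimated from serial VF tests. The sketch below shows one plausible way to derive such a label, using an ordinary least-squares slope of a global index over time. The use of mean deviation as the global index, the function names and the example values are all assumptions for illustration; the cited studies may have defined their labels differently.

```python
def ols_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    if n < 2:
        raise ValueError("At least two visits are needed to estimate a slope")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("Visits must occur at different times")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


def is_rapid_worsening(years, global_index_db, threshold_db_per_year=-1.0):
    """True when the fitted slope is steeper than the threshold,
    i.e. a loss of more than 1 dB per year by default."""
    return ols_slope(years, global_index_db) < threshold_db_per_year


# Hypothetical follow-up: visit times (years) and mean deviation (dB).
years = [0.0, 1.0, 2.0, 3.0]
fast_md = [-2.0, -3.5, -5.0, -6.5]    # loses 1.5 dB per year
stable_md = [-2.0, -2.2, -2.4, -2.6]  # loses 0.2 dB per year

print(is_rapid_worsening(years, fast_md))
print(is_rapid_worsening(years, stable_md))
```

In a predictive setting, labels of this kind would serve as the ground truth, and the DL model would be asked to predict them from OCT images.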
Among the four kinds of prediction highlighted in our review, DL models capable of predicting disease onset and progression will likely have the most far-reaching impact on public health by identifying the most at-risk patients on a population level. In most medical conditions, early detection and timely initiation of treatment lead to better outcomes. For example, in nAMD, the presenting VA predicts the long-term VA. In glaucoma, optimal intraocular pressure control at the first sign of glaucomatous optic neuropathy can halt further damage. While being able to predict which patients will develop the earliest stage of diabetic eye disease (i.e. ETDRS level 20 [microaneurysms only]) does not affect DR management currently, as there is no approved therapy, this ability may become more relevant in future, should neuroprotective agents become available.
Treatment response prediction
Most studies published to date in this area pertain to predicting response to intravitreal anti-VEGF injections.
Age-related macular degeneration
Lee et al. trained a GAN with baseline OCT images and fundus fluorescein angiography/indocyanine green angiography images to generate post-therapeutic OCT images in patients undergoing anti-VEGF injections for nAMD.59 The synthetically generated images were compared with their authentic counterparts for the presence or absence of four biomarkers: pigment epithelial detachment, intraretinal fluid, subretinal fluid and subretinal hyper-reflective material. The best-performing model showed an accuracy ranging from 80.7% to 96.3% in generating the appropriate biomarkers in simulated post-therapeutic OCT images.59 Liu et al. also used a GAN, trained with pairs of pre- and post-therapeutic OCT images of patients with nAMD, to predict treatment response.60 The synthetically generated OCT images were compared with actual post-therapeutic OCT images, and the generative DL model achieved an accuracy of 0.85 (95% CI 0.74–0.95) in predicting the final status of the macula (wet versus dry) and an accuracy of 0.81 (95% CI 0.69–0.93) in predicting whether there would be a complete resolution of sub/intraretinal fluid after one injection.60
Yeh et al. aimed to predict VA improvement at month 12 after the initiation of anti-VEGF injections.61 The model, HDF-Net, was trained with unsegmented baseline OCT images and non-image-based clinical data of treatment-naïve patients with nAMD and achieved an AUC of 0.98 (95% CI 0.97–0.99) in predicting VA improvement of ≥2 lines.61 Fu et al. used a least squares regression model to predict VA at 3 and 12 months after the initiation of anti-VEGF treatment. Their model achieved R2=0.80 (MAE 5.0 ETDRS letters) at month 3 and R2=0.7 (MAE 7.2 ETDRS letters) at month 12. In addition, the model was able to predict incremental VA change after the first and third injections.62 Romo-Bucheli et al. aimed to predict treatment burden over 2 years using baseline OCT images, and their model achieved an AUC of 0.81 in predicting high treatment burden, defined as ≥16 injections over 2 years.63
Studies evaluating treatment response in DMO focused on predicting the reduction of macular thickness after anti-VEGF injections.64–66 Both Rasti et al. and Alryalat et al. used DL models to identify good responders versus poor responders based on OCT criteria, and the models developed by these two groups showed an AUC ranging from 0.81 to 0.86.64,65 Furthermore, Liu et al. used an ensemble model, comprising both DL and classical ML techniques, to predict post-treatment OCT CFT and VA using baseline OCT and clinical data. The MAE for predicting good anatomical outcome was 68.08 µm, and the MAE for predicting good functional outcome was 0.13 logMAR in the external validation dataset.66 Lastly, Xu et al. used a GAN to generate synthetic post-treatment OCT images that were compared with real post-treatment OCT images, and the MAE between synthetic and real images was 24.51 ± 18.56 μm for CFT.67
The burden of VEGF-driven retinal diseases, such as AMD and DR, is expected to increase exponentially as the population ages and the incidence of diabetes continues to rise. Accordingly, the ability to risk stratify and tailor treatment plans for patients requiring anti-VEGF therapy will become increasingly important. On an individual level, more personalized therapy could lead to better final visual outcomes. On a systems level, given there is a large difference in cost between the different anti-VEGF agents, being able to determine which patients would respond equally well to both inexpensive and expensive medications would have implications from a cost-effectiveness point of view.
Conclusion and future directions
DL applications in ophthalmology have gradually shifted from classification to predictive tasks. Such predictive DL models hold the promise of revolutionizing our field by providing insights that may elude even the most astute clinicians. From a technical point of view, we anticipate the more widespread use of vision transformers and generative DL techniques in training these predictive models.
Structure-function predictions may involve the simultaneous correlation of multiple visual functional endpoints with a single imaging modality. Further, future studies involving the prospective validation of models trained with retrospective data will be invaluable. Finally, for predicting treatment response, clinical trials comparing standard-of-care versus DL-based clinical decision support tools will help establish whether DL tools could improve patient outcomes.