1. Introduction
Ear infections, particularly acute infections of the middle ear (i.e., acute otitis media, AOM), are a major health problem in the pediatric population [1]. The otoscope is the basic diagnostic instrument for examining the ear canal and the eardrum, or tympanic membrane (TM), the structure that separates the ear canal from the middle ear. Nevertheless, the diagnostic accuracy of both clinicians and computerized systems is heavily limited by the shortcomings of otoscopy, such as a small field of view [2], poor illumination, and partial occlusion [3], e.g., by hair or wax. Clinicians achieve around 75% diagnostic accuracy [4,5,6,7,8] when viewing single otoscopic images captured with digital otoscopes.
Most of the computer-assisted methods in this field analyze two-dimensional images captured by traditional otoscopes and oto-endoscopes. Unfortunately, these methods can only distinguish among a limited number of TM abnormalities. For instance, Kuruvilla et al. proposed a method to distinguish AOM from other abnormalities such as otitis media with effusion (OME) [5]. However, it is difficult to generalize their approach to more than two categories, because additional handcrafted features would need to be designed to identify other TM abnormalities such as TM perforation (a hole in the eardrum), TM retraction (a condition in which part of the eardrum lies deeper within the ear than its normal position), and tympanosclerosis (a condition involving scarring or accumulation of calcium deposits within the TM) [9].
To explore the computer-assisted detectability of a wide range of eardrum abnormalities, we employed both deep learning techniques [10] and traditional approaches requiring hand-crafted features [11,12] in our previous work. While these prior studies reported promising results, they relied on a single image rather than raw video to assess eardrum abnormalities [13]. Manually selecting a representative frame from even a few seconds of such video is extremely time consuming and subject to high inter- and intra-reader variability. Moreover, the complex topology of the eardrum requires multiple images at varying depths of focus to capture it properly. For this reason, even if a best frame could be selected from the video, it might not contain the entire view of the TM. Therefore, a clear and comprehensive composite image generated automatically from otoscope video clips would be helpful for accurate and automated diagnosis of eardrum abnormalities.
The aim of this study is to develop and validate an automated image segmentation and stitching framework, called SelectStitch, to create enhanced composite images from otoscope videos. For this purpose, a semantic segmentation-based framework is proposed. The segmentation and subsequent stitching enable us to automatically select meaningful frames from otoscope videos and discard irrelevant ones (e.g., those heavily blurred or containing an excessive amount of cerumen). A meaningful frame is one that contains at least a specified proportion of the eardrum. We propose a modified U-Net-based semantic segmentation approach [14] to identify meaningful frames, which are then used by our stitching engine to create a composite image. We then compare the diagnostic decisions of three ear, nose, and throat (ENT) physicians after reviewing these new composite images, relative to composite images generated from all frames of the video recorded by traditional handheld otoscopy (i.e., composite images generated without frame selection).
Figure 1 gives an overview of our complete analysis pipeline in this study.
The rest of this paper is organized as follows: Section 2 describes the data and the proposed framework. Experimental results are presented in Section 3. The discussion and concluding remarks are included in Section 4 and Section 5, respectively.
2. The Proposed Methodology
2.1. Materials
A centralized database of high-resolution digital otoscopy images of adult and pediatric patients was created for this project. The data were captured at Ear, Nose, and Throat (ENT) clinics and primary care settings at the Ohio State University (OSU) and Nationwide Children's Hospital (NCH) in Columbus, Ohio, USA, in accordance with the OSU Institutional Review Board (IRB)-approved protocol (project identification codes: 2016H0011 and 2018H0395). A high-definition (HD) video otoscope (JEDMED Horus+ HD Video Otoscope, St. Louis, MO, USA) was used to capture and record the video data. The video frames are 1440 by 1080 pixels and are recorded in MOV format.
Our dataset included 136 otoscope videos. First, we randomly chose 36 otoscope videos from our database, and an image analyst then selected 764 images from the frames extracted from these videos (Dataset 1). The images were chosen to capture the eardrum under a variety of conditions, including blurriness and glare. Two otolaryngologists (Aaron C. Moberly and Charles Elmaraghy) then annotated the selected frames (Figure 2), and these frames were used to develop the semantic segmentation model. The second, independent set (Dataset 2) contained the remaining 100 videos and was used for the stitching process and for a reader study assessing the efficiency of the proposed method.
Table 1 shows the distribution of major diagnostic categories as reported by the expert physicians. An example frame for each of the six major abnormality categories in our dataset is shown in Figure 3.
2.2. U-Net Based Semantic Segmentation
U-Net was introduced in biomedical imaging to improve the precision and localization of segmentations of microscopic images of neuronal structures. The architecture builds upon the fully convolutional network [15] and is similar to the deconvolutional network [16]. In a deconvolutional network, a stack of convolutional layers, each of which halves the spatial size of the image while doubling the number of channels, encodes the image into a small and deep representation. That encoding is then decoded back to the original image size by a stack of up-sampling layers. U-Net adds skip connections between layers at the same hierarchical level in the encoder and decoder, which allows low-level information to flow directly from the high-resolution input to the high-resolution output.
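For illustration, a minimal sketch of a U-Net-style encoder-decoder with skip connections is given below. It is written in PyTorch and is not the implementation used in this study; the channel widths and two-convolution blocks are conventional U-Net choices, and the specific modifications of our network are not reproduced here.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the conventional U-Net design.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNetSketch(nn.Module):
    """Illustrative U-Net-style encoder-decoder with skip connections (encoder depth 4)."""
    def __init__(self, in_ch=3, num_classes=2, base=64):   # e.g., eardrum vs. background
        super().__init__()
        chs = [base * 2 ** i for i in range(4)]             # 64, 128, 256, 512
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:                                       # each stage doubles the channels
            self.encoders.append(conv_block(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)                         # halves the spatial size
        self.bottleneck = conv_block(chs[-1], chs[-1] * 2)
        self.upconvs = nn.ModuleList()
        self.decoders = nn.ModuleList()
        prev = chs[-1] * 2
        for c in reversed(chs):                             # decoder mirrors the encoder
            self.upconvs.append(nn.ConvTranspose2d(prev, c, 2, stride=2))
            self.decoders.append(conv_block(c * 2, c))      # *2 because of the skip concatenation
            prev = c
        self.head = nn.Conv2d(chs[0], num_classes, 1)       # per-pixel class scores

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                                 # keep high-resolution features
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([skip, x], dim=1))            # skip connection
        return self.head(x)
```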
We trained a U-Net architecture with an encoder depth of four to automatically find eardrum regions in each otoscope video frame. To reduce the effects of overfitting, we used data augmentation [17], which involved random horizontal flips, image sharpening, affine rotations between −45 and 45 degrees, and elastic transformations with three different α intervals of (45, 50), (55, 60), and (65, 80) with σ = 5, as described in [18]. The augmentation was implemented with the imgaug library [19]. In addition, images were randomly scaled (shrunk or enlarged) within a range of (−0.5, +0.5). After augmentation, we had 15,280 additional images, i.e., 20 augmented images for each of the 764 images in Dataset 1, to train the segmentation network.
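A hedged sketch of such an augmentation pipeline with imgaug is shown below; the sharpening strength and the mapping of the ±0.5 scaling to imgaug scale factors are assumptions rather than the exact parameters used in the study.

```python
import imgaug.augmenters as iaa

# Illustrative imgaug pipeline approximating the augmentations described above.
augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                   # random horizontal flips
    iaa.Sharpen(alpha=(0.0, 1.0)),                     # image sharpening
    iaa.Affine(rotate=(-45, 45), scale=(0.5, 1.5)),    # rotation and random scaling
    iaa.OneOf([                                        # elastic transform, one alpha interval
        iaa.ElasticTransformation(alpha=(45, 50), sigma=5),
        iaa.ElasticTransformation(alpha=(55, 60), sigma=5),
        iaa.ElasticTransformation(alpha=(65, 80), sigma=5),
    ]),
])

# The same geometric transforms must be applied to each frame and its segmentation mask:
# `images` would be HxWx3 uint8 frames and `segmaps` imgaug SegmentationMapsOnImage objects.
# aug_images, aug_masks = augmenter(images=images, segmentation_maps=segmaps)
```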
The performance of the segmentation algorithm was evaluated using k-fold cross-validation [20] (with k = 10). It should be emphasized that the segmentation framework was exposed neither to the independent validation videos (Dataset 2) nor to the augmented images derived from the held-out validation images.
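As a sketch of how such a leakage-free split can be organized (the fold count matches the study, but variable names such as `augmented_of` are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold

# Split the 764 original images into 10 folds and keep every augmented copy in the same
# fold as its source image, so augmented versions of a held-out image never leak into training.
n_originals = 764
kf = KFold(n_splits=10, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kf.split(np.arange(n_originals))):
    # augmented_of[i] would list the 20 augmented images generated from original image i
    # train_images = [originals[i] for i in train_idx] + [a for i in train_idx for a in augmented_of[i]]
    # val_images   = [originals[i] for i in val_idx]    # evaluate on un-augmented originals only
    print(f"fold {fold}: {len(train_idx)} training originals, {len(val_idx)} validation originals")
```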
2.3. Image Stitching
Image stitching was performed with the OpenCV image-processing library in a stepwise process of correcting optical distortion, cropping the circular TM field of view, and estimating affine transforms between neighboring fields by matching Speeded Up Robust Features (SURF) [21] keypoints. These transforms describe the translation, rotation, and skew of each peripheral field relative to the central field. Finally, a full-resolution mosaic is generated using the estimated transforms, and overlapping regions are linearly blended.
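The following is an illustrative sketch of pairwise SURF-based affine estimation and linear blending with OpenCV; it is not the exact stitching code used in the study, its parameters (Hessian threshold, ratio test, RANSAC) are assumptions, and it requires the opencv-contrib build because SURF is not included in the default OpenCV package.

```python
import cv2
import numpy as np

def estimate_affine(frame_a, frame_b):
    """Match SURF keypoints between two frames and estimate the affine transform
    mapping frame_b onto frame_a (translation, rotation, and skew)."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp_a, desc_a = surf.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, desc_b = surf.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)

    # Ratio-test matching of descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = [m for m, n in matcher.knnMatch(desc_b, desc_a, k=2)
               if m.distance < 0.75 * n.distance]

    src = np.float32([kp_b[m.queryIdx].pt for m in matches])
    dst = np.float32([kp_a[m.trainIdx].pt for m in matches])
    affine, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    return affine

def blend_pair(base, warped):
    """Linearly blend the overlapping, non-black regions of two aligned images."""
    overlap = (base.sum(axis=2) > 0) & (warped.sum(axis=2) > 0)
    out = np.where(warped > 0, warped, base)
    out[overlap] = (0.5 * base[overlap] + 0.5 * warped[overlap]).astype(base.dtype)
    return out
```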
To save time on developing a customized image stitching algorithm and to focus on the clinical impact of the proposed framework, we used the Image Composite Editor (ICE) 2.0 (v2.0.3, Microsoft, Redmond, WA, USA, 2015) [22] software package, created by the Microsoft Research Interactive Visual Media Group, which generates seamlessly combined composite images. Microsoft ICE is distributed as freeware for non-commercial use.
Microsoft ICE cannot stitch images from entirely different scenes. Therefore, reducing the number of TM-irrelevant frames, as proposed in this study, was necessary to utilize this tool. Our solution provides a feasible and practical alternative to manual frame selection. The proposed framework, SelectStitch, consists of two main processes, i.e., semantic segmentation and image stitching, as depicted in Figure 4. Figure 5 demonstrates the image stitching process for three of the selected frames. In this example, image stitching produced a single image containing a more comprehensive view of the pathology.
2.4. Post-Processing
Post-processing, consisting of two steps, cropping and image enhancement, was applied to the output of the composite image generator. Most composite images produced by ICE included redundant black background areas, which were removed by blurring with a Gaussian filter [23], gray-level thresholding [24], and foreground detection on the resulting binary image.
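A minimal sketch of this cropping step with OpenCV is given below; the kernel size and the use of Otsu thresholding are assumptions rather than the exact parameters used in the study.

```python
import cv2

def crop_black_background(composite, blur_ksize=5):
    """Smooth, threshold, find the largest foreground region, and crop the composite
    image to its bounding box (removes the redundant black background)."""
    gray = cv2.cvtColor(composite, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (blur_ksize, blur_ksize), 0)            # Gaussian filtering
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)           # gray-level thresholding
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)                  # OpenCV 4.x signature
    if not contours:
        return composite
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))        # foreground detection
    return composite[y:y + h, x:x + w]
```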
An image captured in an outdoor scene can be highly degraded by poor or excessive lighting, or by suspended particles such as water droplets or dust. These particles scatter or absorb the irradiance coming from the object, producing haze, smoke, or fog; the resulting images are degraded, and their color and contrast are shifted from the original irradiance at the time of capture. Such an image needs to be de-hazed before it can be analyzed. Although our ear images were not captured outdoors, inverting the low-light images common in eardrum imaging yields images that resemble hazy scenes. Therefore, we borrowed a de-hazing technique [25] to address the low-light image conditions in this study. Figure 6 shows before and after images for a sample output of the cropping and de-hazing steps.
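A hedged sketch of this idea is given below: the low-light image is inverted, de-hazed with a simple dark-channel-prior estimate, and inverted back. The cited technique may differ in its details, and all parameter values here are illustrative.

```python
import cv2
import numpy as np

def dark_channel(img, patch=15):
    # Per-pixel minimum over color channels followed by a local minimum filter (erosion).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(img.min(axis=2), kernel)

def dehaze(img_uint8, omega=0.8, t0=0.1, patch=15):
    """Basic dark-channel-prior de-hazing; omega, t0, and patch size are illustrative."""
    img = img_uint8.astype(np.float32) / 255.0
    dc = dark_channel(img, patch)
    # Atmospheric light: mean color of the brightest ~0.1% of dark-channel pixels.
    n = max(1, int(dc.size * 0.001))
    idx = np.unravel_index(np.argsort(dc, axis=None)[-n:], dc.shape)
    A = np.maximum(img[idx].mean(axis=0), 1e-3)
    t = 1.0 - omega * dark_channel(img / A, patch)      # transmission estimate
    t = np.clip(t, t0, 1.0)[..., None]
    return np.clip((img - A) / t + A, 0.0, 1.0)         # recovered radiance in [0, 1]

def enhance_low_light(bgr_uint8):
    """Invert the low-light image, de-haze it, and invert back."""
    inverted = 255 - bgr_uint8
    enhanced = 1.0 - dehaze(inverted)
    return (np.clip(enhanced, 0.0, 1.0) * 255).astype(np.uint8)
```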
2.5. Experimental Setting
Segmentation experiments were conducted on Wake Forest Baptist Medical Center's high-performance computing cluster using a 16 GB Nvidia Tesla P100 PCI-E GPU (Nvidia, Santa Clara, CA, USA). We used the Deep Learning Toolbox of MATLAB R2018b (MATLAB 9.5, MathWorks, Natick, MA, USA, 2018) to implement the U-Net architecture, and the Dice coefficient [26,27] to evaluate segmentation performance.
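For reference, the Dice coefficient between a predicted and a manually annotated eardrum mask can be computed as in the following sketch (not the MATLAB code used in the study):

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask, eps=1e-7):
    """Dice coefficient between binary masks: 2*|P ∩ G| / (|P| + |G|).
    The epsilon guards against the degenerate case of two empty masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```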
The pixel-wise cross-entropy loss was minimized with the Adam optimizer [28] at a learning rate of 0.001, using mini-batches of size 16 for U-Net. Early stopping was employed to avoid over-fitting [29,30].
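For illustration, the sketch below shows an equivalent training setup in PyTorch with the Adam optimizer, a learning rate of 0.001, and patience-based early stopping. The study itself used MATLAB's Deep Learning Toolbox, and `model`, `train_loader` (assumed to yield mini-batches of 16 frames with their masks), and `val_loader` are hypothetical placeholders.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=100, patience=5, device="cuda"):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                        # pixel-wise cross-entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_val, epochs_without_improvement = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        for frames, masks in train_loader:                   # mini-batches of size 16
            optimizer.zero_grad()
            loss = criterion(model(frames.to(device)), masks.to(device).long())
            loss.backward()
            optimizer.step()

        # Early stopping on the validation loss.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(f.to(device)), m.to(device).long()).item()
                           for f, m in val_loader) / len(val_loader)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                        # stop when validation stops improving
    return model
```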
In this study, we generated 100 composite images from the corresponding videos in Dataset 2. The videos in Dataset 2 were categorized into different eardrum abnormalities by four expert physicians (three ENT specialists and one pediatrician) based on physical examination and evaluation of the patient at the time of the encounter.
Base process: In the base process, we stitch all frames of the video data, regardless of their content, to create composite images. The base process incorporates neither human intervention nor a computerized technique to process or analyze the video frames before the stitching step, i.e., every frame is used in the resulting composite image. Unlike SelectStitch, the resulting composite images benefit from neither segmentation-based frame selection nor post-processing.
Two rounds of evaluation: The composite images produced by SelectStitch were compared against those of the base process by three ENT specialists in terms of their diagnostic capability. This reader study provides an evaluation of a new measurement or analysis method [31]. Each specialist graded each composite image using one of three categories: 1 (Poor) indicates that the reader cannot make a reliable diagnosis or that the image has serious problems; 2 (Moderate) indicates an image of low quality that is still usable for diagnosis; and 3 (Good) indicates a good-quality composite image that can be used for diagnosis. Collecting these assessments from three ENTs provided a measure of inter-reader variability, and repeating the study in a second round provided an estimate of intra-reader variability.
2.6. Statistical Analysis
To evaluate changes in the scoring of the composite images generated by SelectStitch compared to those of the base process across the two rounds of assessments by three ENT specialists, we fit an ordinal logistic regression model on the image scores (levels 1, 2, or 3) to test the differences between the two methods (base process and SelectStitch) and between the two rounds. To evaluate inter-reader variability within rounds and intra-reader variability between rounds for each method, we calculated Kendall's coefficient of concordance, which ranges from 0 (no agreement) to 1 (perfect agreement).
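As a reference for the agreement measure, the sketch below computes the basic (untied) form of Kendall's coefficient of concordance for a readers-by-images score matrix; a full analysis of three-level scores, as performed in the study, would normally also apply a tie correction.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(scores):
    """Kendall's coefficient of concordance W for a readers-by-images score matrix
    (rows = readers, columns = composite images), without the tie correction."""
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape                                    # m readers, n images
    ranks = np.vstack([rankdata(row) for row in scores])   # rank each reader's scores
    rank_sums = ranks.sum(axis=0)
    S = ((rank_sums - rank_sums.mean()) ** 2).sum()        # dispersion of the rank sums
    return 12.0 * S / (m ** 2 * (n ** 3 - n))

# Example: three readers scoring five images on the 1-3 scale.
w = kendalls_w([[1, 2, 3, 3, 2],
                [1, 2, 3, 2, 2],
                [2, 2, 3, 3, 1]])
```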
3. Results
We trained U-Net to segment each frame of the otoscope videos. Using 10-fold cross-validation, we obtained a Dice coefficient of 0.84 ± 0.03. While a detailed investigation of the segmentation performance was beyond the scope of this study, a Dice coefficient greater than 0.70 generally indicates good overlap, as suggested by Zijdenbos et al. [32]. At test time, for composite image generation, a frame is considered relevant if the detected eardrum region covers at least an experimentally determined threshold of 20% of the frame area. Only TM-relevant frames were used to generate composite images (see the SelectStitch process in Figure 1).
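This frame-selection rule can be summarized by the following sketch, in which a frame is kept for stitching only if the predicted eardrum mask covers at least 20% of the frame area:

```python
import numpy as np

def is_relevant_frame(eardrum_mask, threshold=0.20):
    """Return True if the predicted eardrum mask covers at least `threshold` of the frame."""
    coverage = np.count_nonzero(eardrum_mask) / eardrum_mask.size
    return coverage >= threshold

# selected = [frame for frame, mask in zip(frames, masks) if is_relevant_frame(mask)]
```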
Table 2 gives the changes in the scoring of the composite images generated by SelectStitch compared to the base images across the two rounds of assessments. An examination of Table 2 shows that when SelectStitch was used, the readers gave fewer scores of 1 (Poor) and more scores of 3 (Good) than for the base composite images, regardless of round. For Round-1, Table 2 indicates that with the SelectStitch composite images, ENT-I upgraded 32 images from category 1 (Poor) to 2 (Moderate), 18 images from 2 to 3 (Good), and 8 images from 1 to 3; ENT-II upgraded 15 images from 1 to 2, 18 images from 2 to 3, and 24 images from 1 to 3; and ENT-III upgraded 24 images from 1 to 2, 30 images from 2 to 3, and 17 images from 1 to 3.
To illustrate the effectiveness of our proposed framework for otoscope video stitching, we present the composite images of three otoscope video clips in Figure 7.
We used an ordinal logistic regression model to compare image scoring between SelectStitch and the base method and between the two rounds. There is a significant difference in scoring between the base and proposed methods (p = 0.0007), and the base method has a 6.5 times higher risk of receiving a lower score than the proposed method.
The scoring probabilities are listed in Table 3. In Round-1, the base method has a probability of 0.43 of scoring below 2 and 0.84 of scoring below 3, whereas the proposed method has a probability of 0.10 of scoring below 2 and 0.44 of scoring below 3. In Round-2, the base method has a probability of 0.52 of scoring below 2 and 0.88 of scoring below 3, whereas the proposed method has a probability of 0.14 of scoring below 2 and 0.53 of scoring below 3. In summary, the base method has a higher probability of receiving a lower score than the proposed method in both rounds.
As with other medical tests, there are intra- and inter-reader variations in assessing eardrum abnormalities. Table 4 gives the inter- and intra-reader variability across the rounds for each method (values < 0 indicate no agreement; 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1 almost perfect agreement). We observed that the base method yielded higher inter- and intra-reader agreement when the readers assessed the diagnostic capability of the composite images. This could be explained by the fact that the ENTs agreed more readily that an image was of poor quality than on whether an improved image belonged to category 2 or 3.
4. Discussion
To the best of our knowledge, this is the first attempt at representing an otoscope video with a single composite image while preserving its diagnostic capability as much as possible. Three ENT experts evaluated the images on a scale of poor, moderate, and good, and the results (see Table 2) show that, averaged over experts and evaluation rounds, the diagnostic quality improved in 60.3% of the cases; in 15.3% of the cases the improvement was from poor to good, and in 18.0% from moderate to good. Only 3.0% of the cases deteriorated: 2.7% from good to moderate and only 0.3% from moderate to poor. There is a statistically significant difference in scoring between the base and proposed methods (p = 0.0007).
Exemplary composite images for cases with the same score (a score of 2) are shown in Figure 8, in which each column illustrates the evaluations of a different reader. As can be observed in Figure 8, it is likely that the proposed images would receive better scores on a wider assessment scale (e.g., 1–5).
Exemplary composite images for the cases in which the results deteriorated after SelectStitch are shown in Figure 9, where the proposed images were consistently scored lower than the base images. For the first column, ENT-I commented that the base image looks more natural and that the proposed image has unnatural lighting, making it difficult to see the disease, which is retraction. The drawback of the composite image generated by the proposed framework in the second column of Figure 9 was overexposure, yet the base image received a score of 3 even though it was noted to provide only a partial view. For the third column, ENT-III noted that the SelectStitch image has glare. We also noticed that the SelectStitch composite images in the fourth and fifth columns of Figure 9 were given a score of 2 by ENT-I and ENT-III, respectively, in both rounds, while the base score was 3. For the fourth column of Figure 9, the sample had multiple abnormalities, i.e., Perforation, Prosthesis, and Tympanosclerosis, and the reader noted that the proposed composite image has "a little unnatural lighting." For the fifth column, the sample was originally labeled as Perforation, Cholesteatoma, and Tympanosclerosis, and the reader evaluated the proposed image as "a little blurry." For these cases, we conjecture that the calcification and scarring present in tympanosclerosis made it even more difficult to generate an enhanced composite image with natural-looking colors.
We also observed that there were six, three, and three cases, respectively, in which the three ENTs gave lower scores to the SelectStitch images than to the base composite images in Round-1, and four, zero, and two such cases in Round-2. The composite images produced by the proposed framework that received a lower score than the base composite images suffered from over-exposure, as the de-hazing algorithm failed to preserve the color information. To improve the diagnostic capability of the composite image, the enhancement should preserve the original colors as much as possible. We also noted that in no evaluation did a SelectStitch image receive a score of 1 while the corresponding base image scored 3. There was a significant difference in scoring between the first and second rounds (p = 0.001), with the second round having a 1.44 times higher risk of lower scores than the first round. It is possible that the ENTs, having gained experience in Round-1, exercised better judgement in the second round.
Although this work provides a proof-of-concept approach to obtaining one representative composite image of the TM, intra-class variability is a substantial challenge. Thus, to obtain a robust segmentation, more training samples from scarred eardrums, e.g., tympanosclerosis, are required. With a larger database, multi-modal image segmentation could also be developed.
There are some limitations to this study. First, the relevant frames were determined according to the amount of eardrum they capture: if the amount of eardrum in a frame was above a certain threshold, the frame was considered relevant. While our study shows promising results, we expect that additional criteria could be developed to select the frames (or even parts of frames) to be stitched. This would increase the probability that all relevant frames (and parts of frames) are used to properly represent the pathologies present in eardrums.
In our study, we did not compare the diagnostic accuracy of the composite images against either single images selected from the video or the entire otoscopic video. Although we expect the proposed composite images to be superior (or at least equal) to manually selected single images in terms of diagnostic capability, such a comparison study is beyond the scope of the current work and will be the subject of our future studies.
5. Conclusions
In this study, we developed a framework to obtain enhanced composite images from otoscope video clips and evaluated its effectiveness with a reader study. The data used in our study were acquired with a hand-held HD video imaging system. The results have shown that appropriate frame selection applied to otoscopy videos can significantly improve the diagnostic quality of the composite images generated from the selected frames. We envision that the proposed approach has the potential to provide valuable diagnostic information in the form of normative TM information.
Our system could be valuable for providing clinicians with a comprehensive view of the eardrum, supporting more appropriate treatment. The composite images may also increase the accuracy and efficiency of automated image analysis systems: such a system could analyze, on a cloud-based platform, a composite image produced from a video-otoscope recording taken by a trained healthcare worker. These systems could be particularly helpful where there is a shortage of specialists who can accurately diagnose eardrum abnormalities, especially in low- and middle-income countries and/or in rural areas [9]. Moreover, a single composite image could be stored and transferred between health institutions or computing environments instead of a whole video sequence.
More data are needed to explore the generalization of the proposed framework to different kinds of eardrum abnormalities. For example, some abnormalities (TM retraction, perforation with discharge, and cholesteatoma) are more difficult to diagnose than others, such as a clear perforation [33]. Further research will be needed to build models that can analyze the whole spectrum of otoscopic videos encountered in clinical settings.