Laryngeal Image Processing of Vocal Folds Motion

This review provides a comprehensive compilation, from a digital image processing point of view, of the most important techniques currently used to characterize and quantify the vibratory behaviour of the vocal folds, along with a detailed description of the laryngeal image modalities currently used in the clinic. The review presents an overview of the most significant glottal-gap segmentation and facilitative playback techniques reported in the literature for this purpose, and discusses the drawbacks and challenges that remain unsolved in the development of robust vocal folds vibration analysis tools based on digital image processing.


Introduction
Voice is the main support of human communication, through which feelings, emotions and culture are transmitted. Therefore, voice is crucial to the quality of anyone's daily life, especially for professional voice users. However, because of voice misuse or overuse, changes in laryngeal structures may lead to voice disorders. In a cure-and-care approach, clinical voice assessment is required to diagnose the dysfunction and to plan appropriate treatment strategies [1]. According to the American Speech-Language-Hearing Association [2], a recommended protocol for instrumental assessment of voice has to include laryngeal endoscopic imaging, acoustic, and aerodynamic assessments (Figure 1). However, only laryngoscopy allows determining the etiology of an organic or functional voice disorder. Hence, visual inspection of the vocal folds vibratory characteristics is essential for diagnosis. Advanced assessment makes use of high-speed cameras. On such data, image processing is the most promising approach to investigate vocal folds vibration and laryngeal dynamics in speech and singing [3][4][5]. The most recurrent image-processing tasks are glottal segmentation and the synthesis of laryngeal videos into a family of representations named Facilitative Playbacks (FP). They are essential operations for the accurate characterization of vocal folds vibrations. This process allows identifying different phonation features, e.g., periodicity and amplitude of vocal folds vibration, Mucosal Wave (MW), glottal closure, closed state, symmetry of vibration, presence of non-vibrating portions of the vocal folds, etc. Clinicians and voice scientists use the obtained information to support their clinical diagnosis of the presence or absence of a disease and to follow the dynamics of anatomical features of interest in a more intuitive way, revealing content that is often hidden to the human eye [6][7][8].
Reviews on laryngeal imaging research and its clinical implications have appeared in the literature of the past twenty years [4,8,[11][12][13][14][15][16][17][18][19][20]. However, to our knowledge, there are no reviews devoted to analyzing the different glottal-gap segmentation and FP techniques. This paper is intended as a comprehensive state-of-the-art review of glottal-gap segmentation approaches and FP, restricted to two widely available image modalities: Laryngeal High-Speed Videoendoscopy (HSV) and Laryngeal Videostroboscopy (VS) [21]. This paper is organized into six sections. Section 2 introduces the different image modalities in laryngeal imaging. Section 3 addresses the glottal gap segmentation problem and presents the literature devoted to synthesizing vocal folds dynamics (FP). Section 4 highlights the challenges in glottal segmentation and FP techniques. Section 5 introduces several clinical and voice research works that make use of glottal-gap segmentation and FP approaches. We conclude with a discussion, pointing to future research directions and open problems related to glottal-gap segmentation and FP representation.

Laryngeal Image Modalities
Current laryngeal image modalities include Laryngeal VS, Videokymography (VKG), High-Speed Digital Kymography (DKG) and Laryngeal HSV. In addition, new modalities are emerging for analyzing the vocal folds dynamics [22], as well as devices for 3D visualization of the vibration of the vocal folds superior margins [23][24][25][26]. This paper is focused on two laryngeal image modalities: VS and HSV.

Laryngeal Videostroboscopy
During phonation, the vocal folds vibrate at a rate faster than can be perceived by the human eye. Therefore, techniques that create an apparent slow-motion view of the periodic vibratory cycles have been necessary. Recording a VS involves a video camera attached to a rigid (transoral) or flexible (transnasal) endoscope, where the illumination is provided by a strobe light that flashes at a rate synchronized with the patient's fundamental frequency during sustained vowel production. Therefore, VS is an estimated version of the vocal folds vibration, acquired by sampling their motion. Figure 2 illustrates the complete procedure to record a VS: (1) and (3) represent the endoscopes used to capture the vocal folds motion, rigid (90°) and flexible, respectively; (2) represents the vocal folds; (4) is the real vibratory pattern of the vocal folds; (5) is the stroboscopic light; and (6) is the estimated slow-motion version of the vocal folds vibration. Some of the advantages and limitations of Laryngeal Videostroboscopy are mentioned below [18].
Figure 2. Illustration of the stroboscopic sampling. Adapted from [27]. (1) and (3) are the rigid and flexible endoscope, respectively; (2) are the vocal folds; (4) represents the real vibratory pattern; (5) illustrates the strobe light; (6) is the estimated version of the vibratory pattern.

Advantages
• It can be used with Distal Chip Laryngoscopy (DCL) and Flexible Fiberoptic Laryngoscopy (FOL) [28] during articulated speech and singing. DCL is identical to FOL regarding diagnostic accuracy, but DCL is superior to FOL in image quality and interrater reliability. Although flexible endoscope optics have lower quality than rigid ones, it is possible to obtain an image quality comparable to that of the rigid laryngoscope using DCL [29,30].
• It can be coupled with high-definition cameras, providing a higher spatial resolution of the vocal fold structures involved in phonatory vibration (e.g., mucosa, superficial vasculature, etc.) [18]. The High-Definition Digital Stroboscopy System by KayPENTAX, Model 9310HD, for example, records interlaced video frames with a full-HD spatial resolution of 1920 × 1080 pixels, allowing real-time viewing of the exam video in uncompressed HD. The 4K video format, with a spatial resolution of 3840 × 2160 pixels, has also been tested in VS; although its clinical implementation is feasible, its usefulness remains to be proven [31].

Limitations
• It does not provide a real view of the vocal folds' vibratory pattern, so it is restricted to stable and periodic vocal folds vibrations. Therefore, VS is incapable of revealing vocal fold vibratory patterns in patients with severe dysphonia [32]. In addition, VS limits scientific and diagnostic knowledge of the vocal function during voice onset and voice offset.
• It is more sensitive to camera rotation, side movement of the laryngoscope, and patient movements, which produce dislocation of the vocal folds in the image [33,34].

Laryngeal High-Speed Videoendoscopy
HSV has revolutionized laryngeal imaging, increasing the understanding of glottal dynamics during the phonation process [17]. HSV is the only technique able to acquire the true intra-cycle vibratory behavior, allowing the study of cycle-to-cycle glottal variations. In HSV, images are sampled at constant intervals due to the use of a continuous light source, so no information is lost between frames. Lastly, image sequences can be slowed down to frame rates that can be perceived by the human eye.
Nowadays, due to the fast growth of high-speed technology, it is possible to find cameras that can reach frame rates over 20,000 Frames per Second (fps), recording in color with high spatial resolution and excellent image quality over long recording times. With respect to the minimum frame rate requirements of HSV for clinical voice assessment, frame rates of 8000 fps are recommended, with a minimum requirement of 4000 fps [35,36]. HSV recordings at rates below 4000 fps for women and 3200 fps for men have to be interpreted with caution. Figure 3 illustrates the principle of sampling in HSV for two different frame rates. It can be observed that every single cycle is sampled, in contrast to VS, where the samples are taken from different cycles (see Figure 2). Some of the advantages and limitations of HSV are detailed below [4,14,27]:

Advantages
• It captures the true intra-cycle vibratory behavior of the vocal folds, providing a more reliable and accurate objective quantification of vocal-fold vibrations. Therefore, it is possible to study the aperiodic movements of the vocal folds observed in some voice disorders [4,37].
• The combination of HSV with acoustics and other physiological signals may provide complementary, high-precision measures that can improve clinical practice [38,39].
• It is used to examine the basic physiology of different singing styles, such as extremely high-pitched singing, throat singing, or different pop and rock styles [40], and to study the oscillatory characteristics of the vocal folds in classical soprano and tenor singers [41,42].
• It is useful to gain insight into tissue vibratory characteristics, the influence of aerodynamic forces and muscular tension, vocal fold length, phonatory breaks, laryngeal spasms, vocal folds contact, and the evaluation of normal laryngeal functioning in situations of rapid pitch change such as onset and offset of voicing or glides [5,43].

Limitations
• Due to the huge amount of data acquired, storage and visualization are major problems. For instance, a 10 s recording at a speed of 10,000 fps would require 2 h and 46 min to view in full at a speed of 10 fps [4,44].
• It is not possible to provide real-time audiovisual feedback due to the high temporal resolution of HSV. However, it is possible to align HSV with acoustics and EGG signals to provide more precise measures that can improve clinical diagnosis.
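The viewing-time figure in the first limitation follows from simple frame-count arithmetic; a quick check of the example:

```python
# Frames captured in a 10 s recording at 10,000 fps
frames = 10 * 10_000              # 100,000 frames
# Time needed to play them back at 10 fps, in seconds
view_s = frames / 10              # 10,000 s
h, rem = divmod(view_s, 3600)
m = rem // 60
print(f"{int(h)} h {int(m)} min")  # → 2 h 46 min
```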

Figure 3. Principle of sampling in HSV under a continuous light source, for two different frame rates.
Figure 4 illustrates the procedure used to study the vocal folds motion based on laryngeal image processing techniques. Glottal gap segmentation and the FP (third and fourth rows in Figure 4) have been extensively used in HSV and VS to simplify the visualization of the vocal folds dynamics and to provide objective parameters to accurately identify the presence of organic voice disorders [45,46], classify functional voice disorders [47,48] and vibratory patterns [49,50], and discriminate early stages of malignant and precancerous vocal folds lesions [51], among others [7,39]. Despite the progress achieved in describing the vocal folds dynamics using laryngeal image processing, the task is still challenging and poses a difficult computer vision problem.
Figure 4. Graphical illustration of the procedure followed to extract glottal features from vocal folds motion during phonation. First row: coronal view of the vocal folds; second row: top view of the vocal folds obtained from the Laryngeal Videostroboscopy (VS) or HSV recording; third row: glottal segmentation; fourth row: glottal area waveform (GAW) playback.

Glottal Gap Segmentation and Facilitative Playbacks
Laryngeal imaging techniques have been used to study the complex interaction of vocal folds structures [52,53], tissue mechanics [6], muscle contraction [54], voice treatment effects in Parkinson's disease [55], and to objectively characterize voice disorders [7,39]. Section 3.1 introduces the two most widespread image-processing methods used for glottal gap segmentation and FP representation. Numerous works are described in Section 3.2, using a single image-processing method or a combination of them for image enhancement, Region of Interest (ROI) detection, and glottal gap delimitation. Lastly, Section 3.3 presents the literature related to the FP representations.

Image-Processing Methods
Image-processing methods refer to a set of operations on an image, aimed at obtaining an enhanced image or at extracting useful information from it. It is a type of signal processing in which the input is an image and the output may be an image or characteristics/features associated with that image. Let us denote a laryngeal image sequence as I(x, t), where x = (x, y) ∈ R² represents the position of the pixels and t represents the time instants of the sequence. Hence, a single frame at instant t k , k = {1, 2, . . . , N}, is denoted as I(x, t k ). Therefore, the intensity of a given pixel x ij = (x i , y j ), i = {1, 2, . . . , n} and j = {1, 2, . . . , m}, at time t k can be defined as I(x ij , t k ) ∈ R, which represents a pixel in gray-scale. For a color image sequence, the components can be represented by superscripts. For instance, in the RGB space, the image sequence is denoted as I R,G,B (x, t), where R, G, and B are the three components of the space. For single images that do not belong to an image sequence, the notation is simplified to I(x) or I(x, y) for gray-scale images and I R,G,B (x) or I R,G,B (x, y) for images in the RGB space. Figure 5 summarizes graphically the notation for a gray-scale laryngeal image sequence.
Two of the most common tasks in image processing are image segmentation and motion estimation. The first subdivides an image into its constituent regions or objects, where the level of detail of each subdivision depends on the problem being solved [56]. Image segmentation methods can be classified, roughly speaking, into the following categories: thresholding [57], edge-based [58], region-based [59], classification-based [60], graph-based [61,62] and deformable models [63]. Meanwhile, motion estimation algorithms model the motion in the scene, which is typically represented using motion vectors, also known as displacement vectors.
The motion estimation techniques can be grouped into pixel-based (direct) and feature-based (indirect) methods. The direct methods derive motion vectors for each pixel in the scene and can be categorized into Phase Correlation [64,65], Block Matching [66,67], Pel-Recursive [68,69], and Optical Flow (OF) [70]. On the other hand, the indirect methods use feature matching between frames to compute the motion vectors [71].
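The notation above maps naturally onto array indexing; a minimal NumPy sketch with synthetic data (the array dimensions are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Hypothetical sequence dimensions: N frames of m x n pixels
N, m, n = 100, 256, 256
rng = np.random.default_rng(0)
I = rng.random((N, m, n))          # I[k] plays the role of the frame I(x, t_k)

frame_k = I[0]                     # a single frame, I(x, t_1)
pixel = I[0, 10, 20]               # gray-scale intensity I(x_ij, t_k)

# A color sequence adds a channel axis: shape (N, m, n, 3),
# whose last axis holds the R, G and B components
I_rgb = rng.random((N, m, n, 3))
red_of_frame_0 = I_rgb[0, :, :, 0]
print(frame_k.shape, red_of_frame_0.shape)
```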

Glottal Gap Segmentation Techniques
The literature reports different techniques for the glottis segmentation task depending on the degree of user intervention. They can be grouped into semi-automatic and automatic approaches. The semi-automatic techniques let the user intervene as many times as needed in order to correct any problem that might appear during the segmentation process. On the contrary, the automatic techniques process all the data without any previous setting or user intervention. From a clinical point of view, both types of methods present advantages and disadvantages, but it is worth mentioning that semi-automatic methods are more time-consuming for clinicians, although their accuracy is expected to be better.
In semi-automatic segmentation, the users select an arbitrary set of images within the video sequence. These frames are representative cases of the glottal cycle, for instance, maximal glottal opening [72] or minimal glottal opening [73]. Later, the user selects one or multiple seed points belonging to the glottal area [74,75]; the posterior and anterior commissures of the vocal folds [72,[76][77][78]; or a region of interest (ROI) that includes the glottal gap area [11,79]. Lastly, thresholding, region growing or active contour techniques are used to compute the glottis segmentation. A summary of the main semi-automatic algorithm contributions is presented in Table 1. With regard to automatic segmentation, only a few of the existing approaches are designed to be fully automatic [88][89][90][91]. However, in the last few years, fully automatic glottal segmentation has become an active research field with growing interest [92][93][94][95][96][97]. Up to now, there is no standardized procedure to automatically segment the glottal gap from endoscopic high-speed sequences, in spite of the extensive literature devoted to solving this problem. Roughly speaking, the authors divide the problem into three main stages: image enhancement, identification of the ROI, and glottal gap delimitation (see Figure 6). A summary of the main automatic algorithm contributions is presented in Table 2.

Image Enhancement
Image enhancement refers to the manipulation or transformation of an image I R,G,B (x, t) with the aim of increasing its usefulness or visual appearance. There are no general criteria behind the enhancement, and the techniques used often depend on the application [56]. In laryngeal images, the glottis has darker intensity levels than its surrounding tissues. However, these images often have low contrast and heterogeneous profiles due to the illumination conditions. Modelling the histogram of the HSV with a statistical distribution, such as a Rayleigh distribution as in [82], or finding the darkest region, produces errors due to the non-uniform contrast of the image, the lighting conditions, and artifacts introduced by the recording equipment. For this reason, it is necessary to simultaneously reduce the effect of the low contrast and highlight the object of interest (i.e., the glottis). Thus, the use of image enhancement techniques is expected to improve the characteristics of the image for further processing. The literature reports the use of different enhancement techniques as a prior step to glottis segmentation.

Figure 6. Graphic representation of the three common steps followed to segment the glottal gap: image enhancement, ROI detection, and glottal gap delimitation.
In [98], the authors combine anisotropic diffusion with an FFT-based band-pass filter in order to obtain a smoother image without losing edge information (second row of Figure 7). In [84], Lagrange interpolation is combined with a Gaussian filter in order to smooth the images, reduce noise and eliminate unwanted details. In [86], the authors use global thresholding to obtain a binary image and eliminate worthless information. However, this strategy cannot be generalized to noisy and poor-quality HSV recordings.
Another alternative is to manipulate the histogram of the image. The most common histogram-based processing techniques are Histogram Equalization (HE), Adaptive Histogram Equalization (AHE), Contrast Limited Histogram Equalization (CLHE) and Contrast Limited Adaptive Histogram Equalization (CLAHE). CLAHE is used in [91], providing more detail in the glottal area while avoiding significant noise introduction (third row of Figure 7). CLAHE highlights the details over a small neighborhood while preventing the over-amplification of noise that can arise from AHE.
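The clip-and-redistribute principle behind contrast-limited equalization can be sketched in NumPy (a global CLHE for brevity; CLAHE applies the same operation per tile and interpolates between tiles). This is an illustrative sketch, not the implementation used in [91]:

```python
import numpy as np

def clhe(img, clip_limit=0.02, bins=256):
    """Contrast-limited histogram equalization of a uint8 image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    hist = hist / img.size                        # normalized histogram
    excess = np.clip(hist - clip_limit, 0, None)  # mass above the clip limit
    # Clip each bin and redistribute the excess uniformly over all bins
    hist = np.minimum(hist, clip_limit) + excess.sum() / bins
    cdf = np.cumsum(hist)
    lut = np.round(255 * cdf / cdf[-1]).astype(np.uint8)
    return lut[img]

rng = np.random.default_rng(1)
frame = rng.integers(60, 120, size=(64, 64), dtype=np.uint8)  # low contrast
enhanced = clhe(frame)
print(np.ptp(frame), "->", np.ptp(enhanced))  # dynamic range increases
```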
In [92], the intensity of HSV is modeled as the product of an illumination function and a reflectance function. The illumination is estimated by temporally averaging the intensity at each pixel location in the glottis, and the reflectance is obtained by dividing the image intensity by the illumination function. The reflectance distribution has a bimodal structure, which the authors use to threshold the HSV.
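The decomposition described in [92] can be sketched on synthetic data (a minimal NumPy illustration of the idea, not the authors' implementation; dimensions and the lighting gradient are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, n = 50, 32, 32
# Synthetic sequence: a fixed illumination gradient times a varying reflectance
illum_true = np.linspace(0.5, 1.0, n)[None, :] * np.ones((m, n))
refl = 0.3 + 0.7 * rng.random((N, m, n))
I = illum_true[None] * refl                   # I(x, t) = L(x) * R(x, t)

illum_est = I.mean(axis=0)                    # temporal average per pixel
refl_est = I / illum_est[None]                # divide out the illumination

# The estimated reflectance is free of the left-to-right lighting gradient
col_means = refl_est.mean(axis=(0, 1))
print(col_means.std(), "<", I.mean(axis=(0, 1)).std())
```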
One of the most widespread methods is based on point-wise intensity transformations. A point-wise transformation operates directly on the intensity values of an image, processing each pixel separately and independently. This transformation can be linear, piecewise linear, or nonlinear. Reference [100] establishes a methodology for pre-processing VS as a preliminary step for edge detection. The authors mention the drawbacks that exist in the acquisition due to the flashing effect at the recording instants, which reduces the accuracy of the segmentation algorithm. The same procedure is used in [83] to highlight the glottal area and to reduce the influence of flashes in HSV.
Considering the increasing use of deep learning techniques in medical image processing, an implementation of a Convolutional Neural Network (CNN) to enhance low-light HSV is proposed in [107]. The training data include dark frames created by adding Perlin noise to the HSV, and data augmentation is used to increase the number of training videos.
The quality of the laryngeal videos has a critical impact on an accurate diagnosis, examination, and identification of lesions or tissue changes [4]. Sequences with low quality are inadequate for quantitative analysis, reducing the amount of data available for research. Furthermore, current HSV technology requires special hardware and is particularly susceptible to image degrading factors.

Region of Interest
A ROI is a part of the image that encapsulates important features that can be used for further analysis. ROI detection has been studied for many years. Most algorithms use either feature-based or object-based approaches. Feature-based methods find pixels that share similar features to form the ROI. Meanwhile, object-based methods detect the ROI at a higher level than the pixel-by-pixel approach of feature-based systems, using information such as target shape or structure. Figure 8 depicts some examples of ROI detection in six different HSV sequences using the algorithm presented originally in [108] and extended in [95]. The literature devoted to solving the glottal gap segmentation reports attempts to detect a ROI. However, most of the studies require user intervention [76,78,79,81,82,84,86] and, even more importantly, few works take into account the temporal information of the sequence. Some approaches assume that a previous frame segmentation is available; then, pixel values are set to 1 when the difference between the current frame and the previous one is larger than a fixed threshold [83]. A similar strategy is implemented in [73], which applies an algorithm based on differences between consecutive frames. Other works [11,98,99] use motion estimation techniques to compute the ROI, based on the fact that the region with the most salient motion features corresponds to the vocal folds.
In [91], an edge-based morphological approach is used to process some frames extracted from the HSV, called keyframes. The authors search for the object with the largest and most nearly vertically oriented area by applying a Sobel filter. Lastly, a morphological closing operation is carried out over the gradient map to connect small related regions, and the object with the largest area and vertical orientation is chosen. In [92,95], a criterion based on the temporal change of the spatial intensity profile is used to detect the ROI for a fixed set of frames. The square area to be tracked is selected adaptively based on the variations of the image intensity and the inter-frame disparity for an appropriate set of frames. By taking advantage of the continuous light source used in HSV, the area with the largest variability in time is identified as the glottis region. However, this approach presents drawbacks in cases such as phonation onset, phonation offset, and total or partial paralysis of the vocal folds.
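The temporal-variability criterion can be illustrated on a synthetic sequence (a hedged sketch of the general idea only, not the algorithm of [92,95]; the oscillating dark region and the variance threshold are our own simplifications):

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, n = 60, 64, 64
I = 0.6 + 0.02 * rng.random((N, m, n))        # nearly static background

# Simulate an oscillating dark "glottal gap" in the image center
for k in range(N):
    width = int(3 + 3 * np.abs(np.sin(2 * np.pi * k / 20)))
    I[k, 24:40, 32 - width:32 + width] = 0.1  # opening/closing dark region

var_map = I.var(axis=0)                       # per-pixel temporal variance
thresh = var_map.mean() + 2 * var_map.std()
ys, xs = np.nonzero(var_map > thresh)
roi = (ys.min(), ys.max(), xs.min(), xs.max())  # bounding box of the ROI
print("ROI rows", roi[0], "-", roi[1], "cols", roi[2], "-", roi[3])
```

Note that only the pixels whose intensity actually changes over time (the opening/closing margins) exceed the variance threshold, which is enough to localize the region.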
In laryngeal images, the vocal folds, and thus the detected ROI, usually cover less than 25% of the entire image. Therefore, ROI detection eliminates non-relevant information and reduces the number of false detections, making it an important step prior to the segmentation process.

Glottal Gap Delimitation
The image segmentation methods can be classified, roughly speaking, into the following categories: thresholding, edge-based, region-based, classification-based, graph-based and deformable models. With respect to the glottal gap delimitation, the most common approaches are based on thresholding, region growing and deformable models (parametric and geometric).
The studies based on thresholding assume that the glottis has darker intensity levels than the vocal fold tissues [72,82]. However, laryngeal images often have low contrast and heterogeneous profiles. Hence, selecting a global threshold results in an erroneous delimitation of the glottal gap, since the intensity distribution is not bimodal. The studies based on region growing require a solid criterion for seed selection and relatively well-delimited edges in order to converge towards the glottal gap. Furthermore, these algorithms split objects with inhomogeneous regions into multiple sub-regions, resulting in over-segmentation [74,75]. With respect to deformable models, they have the advantage of adapting appropriately to non-rigid and amorphous contours through the iterative minimization of an energy function. However, the initialization process is not a trivial task; therefore, many authors use manual procedures to initialize the active contours [78,79,81].
There are other approaches that combine different image-processing techniques [76,87,90,92,95,98], use feature extraction and training [96], or make use of more recent paradigms such as Deep Neural Networks [97]. For instance, the authors of [90] present an automatic glottis segmentation approach using deformable shape models and region growing. The approach starts with an initial coarse segmentation by means of the region growing technique. The seed points are determined based on a simple linear relationship between the average gray level of the image and the optimal seed points obtained from the training examples. Lastly, the non-glottal regions are eliminated using the reliability score factor from the trained shape models.
In [87], the textures of the background (laryngeal structures) and foreground (glottal gap) are smoothed. Then, the glottis is detected based on the temporal intensity variation of the laryngeal videos and segmented using a background subtraction method. In [96], a fully automatic method is used to segment the glottis using local color and shape information. The approach is divided into three modules: training, recognition and segmentation. In the training module, 60 different glottis shapes are manually segmented and a set of descriptors is computed. The recognition module is designed to recognize, delineate and determine the optimal starting glottis regions. The last module segments the glottis based on properties of the previous frame. Hence, the glottis is continuously tracked across the vibration cycles of the video by a frame-by-frame segmentation technique.
In [97], a Deep Neural Network is implemented to segment the glottal gap in VS. The glottis color structure and its neighboring color pattern are employed to create the feature vectors. These vectors are then used to train the network and to classify whether each pixel belongs to the glottal region or not.
One of the most widespread methods to perform glottal gap segmentation is presented in [74]. The advantage of the method lies in its semi-automatic nature, since a wrong detection of the glottal gap can be fixed during the segmentation process. An open-source version of this algorithm, called "GlottalImageExplorer", is available in [109]. Through a user-friendly interface, the tool lets the user enter three threshold points; the threshold progression for the seeded region growing is then defined by linear interpolation between these points.
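The threshold-progression idea can be sketched as follows. The seeded region growing below is a simplified single-frame illustration with hypothetical threshold points and our own BFS growing scheme, not the "GlottalImageExplorer" code:

```python
import numpy as np
from collections import deque

def region_grow(img, seed, threshold):
    """Grow a 4-connected region from `seed` over pixels darker than `threshold`."""
    m, n = img.shape
    mask = np.zeros((m, n), dtype=bool)
    queue = deque([seed])
    while queue:
        i, j = queue.popleft()
        if 0 <= i < m and 0 <= j < n and not mask[i, j] and img[i, j] <= threshold:
            mask[i, j] = True
            queue.extend([(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)])
    return mask

# Threshold progression: linear interpolation between (frame index, threshold)
# points chosen by the user, applied across the whole sequence
points = [(0, 60), (50, 90), (99, 70)]        # hypothetical user selections
frames_idx = np.arange(100)
thresholds = np.interp(frames_idx, [p[0] for p in points], [p[1] for p in points])

img = np.full((32, 32), 200, dtype=np.uint8)
img[10:22, 14:18] = 40                        # synthetic dark glottal gap
mask = region_grow(img, seed=(15, 15), threshold=thresholds[0])
print(mask.sum(), "pixels segmented in frame 0")
```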

Assessment of Vocal Folds Segmentation
Assessing the glottal segmentation is not trivial due to the huge number of frames to evaluate and the need to take into account the spatio-temporal information of the video sequences. The evaluation is even more complicated bearing in mind that there are neither standard metrics to evaluate the distinct algorithms nor public databases that could be used for benchmarking and comparison purposes [110].
A simple way to evaluate the glottal segmentation is by visual inspection. However, it requires an intensive frame-by-frame evaluation over a large set of images, including the contribution of several experts to minimize inter-evaluator bias. The literature proposes subjective ways to evaluate the segmentation by grading the quality of the glottal gap delineation from 0 to 5 points [74,91,100]. A more complete evaluation combines segmentation quality, readability of the playbacks (GVG, PVG, GAW) and shape similarity between VFT and VKG [87].
The literature also refers to different objective evaluations that compare the segmented image against a Ground-Truth (GT). The degree of similarity between the human- and machine-generated images determines the quality of the segmentation. Several such metrics have been used in the literature [75,87,89,91,[94][95][96][97]101]. The main drawback of objective evaluation is the manual generation of a ground truth, since it is a subjective and time-consuming task. Table 3 summarizes the main studies carried out to objectively and subjectively assess the vocal fold segmentation.
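The metric sets differ between the cited studies; purely as an illustration, overlap measures such as the Dice coefficient and Intersection over Union (IoU) are commonly used to compare a binary segmentation against a GT (a generic sketch, not the metric set of any particular study above):

```python
import numpy as np

def dice(seg, gt):
    """Dice coefficient between two binary masks (1.0 = perfect agreement)."""
    inter = np.logical_and(seg, gt).sum()
    return 2.0 * inter / (seg.sum() + gt.sum())

def iou(seg, gt):
    """Intersection over union (Jaccard index) of two binary masks."""
    inter = np.logical_and(seg, gt).sum()
    union = np.logical_or(seg, gt).sum()
    return inter / union

gt = np.zeros((32, 32), dtype=bool);  gt[10:20, 14:18] = True
seg = np.zeros((32, 32), dtype=bool); seg[11:21, 14:18] = True  # shifted 1 px
print(f"Dice = {dice(seg, gt):.3f}, IoU = {iou(seg, gt):.3f}")
```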

Facilitative Playbacks Representation
FP improve the quantification accuracy, facilitate visual perception, and increase the reliability of visual rating while preserving the most relevant characteristics of the glottal vibratory patterns [4]. Depending on the way they assess the glottal dynamics, they can be grouped into local- or global-dynamics FP. Figure 9 illustrates eight FP from the same high-speed sequence.

Local-Dynamics Facilitative Playbacks
Local-dynamics FP analyze the vocal folds behavior along a single line, perpendicular to the main glottal axis. In this category, the most used by clinicians and researchers are DKG, VKG and Vocal Folds Trajectories (VFT). They have been successfully applied to demonstrate the change of glottal dynamics in the case of damaged tissues, such as lesions, scars, discoloration of the vocal folds and voice disorders [4,7,48,[111][112][113][114].
VKG and DKG provide a clear visualization of the glottal cycle opening and closing phases, of the MW traveling across the vocal folds upper surface, and of the displacement of the upper and lower margins of the vocal folds [111,115,116]. Given a video sequence I(x, t), let us denote a horizontal line at time t k and position y j as KG(t k ), where KG(t k ) is the row vector [I(x nj , t k ) I(x n−1j , t k ) I(x n−2j , t k ) · · · I(x 1j , t k )]. Then, the kymographic matrix I DKG (x, y) (or I R,G,B DKG (x, y) for color) is constructed with a set of N vectors {KG(t k ) ∈ R n , k = 1, · · · , N}, where each KG(t k ) forms a column of I DKG (x, y) (see Equation (1)).
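Following this definition, a kymogram is obtained by extracting the same image row from every frame and stacking the resulting vectors as columns; a minimal NumPy sketch on a synthetic sequence (the scan-line position and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
N, m, n = 200, 64, 64
I = rng.random((N, m, n))            # gray-scale sequence, I[k] = frame at t_k

j = 32                               # scan-line position y_j
# KG(t_k) gathers the pixels of row j, listed from x_n down to x_1 as in the
# definition above; stacking the N vectors as columns yields I_DKG
I_DKG = np.stack([I[k, j, ::-1] for k in range(N)], axis=1)   # shape (n, N)
print(I_DKG.shape)
```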
In order to interpret the kymographic vibratory pattern, Figure 10 depicts the schematic view of DKG compared with the traditional displays of the vocal folds oscillations. The first row shows eight phases of a glottal cycle in the frontal section, starting with the vocal folds opening and ending with a complete vocal folds closure. The second row presents the same eight phases viewed from above the vocal folds using HSV. The third row shows the DKG playback at a position y j . The kymographic image depicts two cycles of the vocal folds oscillation.
The VFT (δ l,r seg (pc, t)) synthesizes the HSV in a single image that describes the deflections of the vocal folds edges perpendicular to the glottal main axis [74]. Hence, the vocal folds edges (C l,r (t)) have to be computed in advance. Then, a trajectory line L(t k ) at time t k , intersecting perpendicularly with the glottal main axis (G(t k )) at a predefined point g pc (t k ), is defined. The current position of g pc (t k ) is updated at every frame to compensate for the relative movement of the endoscope, of the larynx, or of changes in the vocal folds length, via Equation (2).
Later, the intersection between the vocal folds edges C l,r (t k ) and the trajectory line L(t k ) is computed by Equation (3), and the vocal folds trajectory is obtained by Equation (4), where δ l,r seg (pc, t k ) are the deflections of the vocal folds edges at the point c l,r pc (t k ), and pc indicates the position of g pc (t k ) on the glottal main axis. The VFT is illustrated in Figure 11 and expressed in vector notation in Equation (5):
δ l,r seg (pc, t) = [δ l,r seg (pc, t 1 ) δ l,r seg (pc, t 2 ) · · · δ l,r seg (pc, t k ) · · · δ l,r seg (pc, t N )] (5)

Figure 11. Illustration of the VFT playback. First row: the image sequence of one glottal cycle. Second row: vocal folds trajectories, DKG and DKG+VFT playbacks of five glottal cycles.
There are other local FP that have been less explored in the literature, such as Mucosal Wave Kymography (MKG), Optical Flow Based Waveform (OFW) and Optical Flow Kymogram (OFKG). These FP compute motion vectors to assess the fine details of the MW, including the propagation of the mucosal edges during the opening and closing glottal phases. They rely on motion estimation techniques, where the motion field W(x, t_k) at time t_k is obtained from the intensity variation between consecutive frames (t_k and t_{k+1}). The matrix representation of the total motion field at time t_k is depicted in Equation (6).
where w(x_{ij}, t_k) = (u(x_{ij}, t_k), v(x_{ij}, t_k)) is the displacement vector of x_{ij} at time t_k, with one component along the x-axis (u(x_{ij}, t_k)) and another along the y-axis (v(x_{ij}, t_k)). For instance, in OFKG, a vector OFKG(t_k) = [u(x_{n,j}, t_k) u(x_{n-1,j}, t_k) u(x_{n-2,j}, t_k) ... u(x_{1,j}, t_k)] is selected from a horizontal line at time t_k and position y_j. Then, the OFKG matrix I^{R,G,B}_OFKG(x, y) is constructed from the set of N vectors {OFKG(t_k) ∈ R^n, k = 1, ..., N}, where each OFKG(t_k) forms the k-th column of I^{R,G,B}_OFKG(x, y). The u(x_{ij}, t_k) component is encoded with red and blue tonalities: for rightward movements, the displacement direction angle lies in [−π/2, π/2] and is coded with red intensities, whereas for leftward displacements the angle lies in [π/2, 3π/2] and is coded with blue tonalities.

Global-Dynamics Facilitative Playbacks
The global-dynamics FP analyse the vocal folds behaviour along the whole glottal length. Most of them are focused on vocal folds edge motion by means of glottal segmentation algorithms. The most widespread are: Glottal Area Waveform (GAW), Phonovibrogram (PVG) and Glottovibrogram (GVG).
GAW uses the glottal segmentation to compute a glottal gap area function over time, from which several parameters can be estimated [8,39]. Let I_seg(x, t) be a segmented HSV, having the same size as the original video I(x, t). The segmented HSV is a set of binary images, where 1 is assigned to pixels belonging to the glottal gap area (foreground) and 0 to pixels belonging to the other laryngeal structures (background). Equation (8) defines I_seg(x, t_k) at time t_k. The glottal gap area at t_k is then computed via Equation (9), and GAW can be expressed in vector notation by Equation (10), where each element represents the area of the glottal gap at a particular instant of time.
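Under the binary-mask convention of Equation (8), the GAW of Equations (9)-(10) reduces to a per-frame foreground count; a minimal sketch (with a toy cycle and an illustrative open-quotient estimate) could look like:

```python
# Minimal GAW sketch under the 0/1-mask convention of Equation (8): the glottal
# area at t_k is the foreground pixel count (Equation (9)). The open-quotient
# estimate and the toy masks are illustrative additions.

def glottal_area_waveform(masks):
    """GAW: glottal gap area (foreground pixel count) at each instant t_k."""
    return [sum(sum(row) for row in mask) for mask in masks]

def open_quotient(gaw):
    """Fraction of the cycle during which the glottis is open (area > 0)."""
    return sum(1 for a in gaw if a > 0) / len(gaw)

# Toy cycle of 5 binary 3 x 3 masks: closed, opening, maximum, closing, closed.
masks = [
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
]
gaw = glottal_area_waveform(masks)
print(gaw, open_quotient(gaw))  # [0, 2, 5, 2, 0] 0.6
```

The other quotients mentioned below (closed quotient, speed quotient) follow the same pattern, replacing the open-frame count with the durations of the relevant cycle phases.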
The GAW playback measures the glottal area function throughout the glottal cycle, making it possible to compute features such as the opening and closing phases of the vocal folds oscillations, maximum and minimum glottal areas, Open Quotient (OQ), closed quotient and speed quotient, among others. Figure 12 illustrates a GAW normalized within the interval [0,1]. The peaks represent the open-states of the vibratory cycles, while the valleys represent the closed-states. The maximum and minimum amplitudes of the vibratory cycles can be computed by finding the maximum and minimum glottal area, respectively. The period of the GAW playback is equivalent to the duration of the glottal cycles, i.e., to the sum of the opening and closing phase durations.

PVG is an extension of the representation presented in [118] and of the glottal shape representation proposed by [119]. It is a 2-D diagram introduced in [120] where a set of segmented contours of the moving vocal folds is unambiguously transformed into a set of geometric objects that represents the entire HSV sequence. Let us consider that the video sequence I_seg(x, t) was computed in advance by any segmentation algorithm. Then, the frames with maximal glottal gap area are identified and named keyframes I_key(x, t) (Equation (11)).
For each keyframe I_key(x, t), a linear regression line is computed to identify the main orientation of the glottal gap area. The regression line intersects C^{l,r}(t_k) at the points p(t_k) and a(t_k), which are used to split the vocal folds edges into the left C^l(t_k) and right C^r(t_k) folds.
Since the vocal folds contours were computed independently for each frame, it is necessary to derive a continuous representation of the vocal folds vibrations that links the posterior and anterior points across all images of the HSV sequence. To do so, it is assumed that the positions of the posterior and anterior glottal points do not change dramatically in the intermediate images between two consecutive keyframes within a single oscillatory cycle. Therefore, such points are approximated by linear interpolation via Equation (12), where t_O and t_{O+1} indicate two consecutive open-states.
By connecting the vocal folds edges C^{l,r}(t_k) to the approximated positions of p(t_k) and a(t_k), a continuous representation of the vocal folds edges is obtained, also in the parts missed by the segmentation methods. Later, the glottal main axis G(t_k) and the vocal folds edges C^{l,r}(t_k) are equidistantly sampled with pc ∈ [0, M], where M corresponds to the total percentage length of G(t_k) (100%). Then, for each image, the deflections of the vocal folds edges δ^{l,r}_seg(pc, t_k) are obtained ∀pc ∈ [0, M] and ∀t ∈ [1, N]. δ^{l,r}_seg(pc, t_k) is positive if the left/right fold contour is correctly located on the ipsilateral side of the glottal main axis; conversely, if the vocal fold edges cross the glottal main axis laterally, δ^{l,r}_seg(pc, t_k) becomes negative. Furthermore, the vocal folds are split longitudinally and the left vocal fold is rotated 180° around the posterior commissure p(t_k). Lastly, all the computed δ^{l,r}_seg(pc, t_k) are stored in a matrix I_PVG(x, y) ∈ R^{(2M+1)×N} (Equation (13)).
In order to visualize I_PVG(x, y), each element is represented by color tonalities: red represents positive deflections and blue represents negative ones. PVG allows distinguishing left- and right-fold movements, and is thus more sensitive to the accuracy of the glottal main axis [91]. The complete procedure to compute the PVG is illustrated in Figure 13, where five glottal cycles are depicted.
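The final stacking step into I_PVG can be sketched as below; the exact handling of the shared posterior-commissure row is an assumption made for this toy example, and the deflection tables are hypothetical:

```python
# Hedged sketch of the PVG stacking step: per-frame deflections delta[pc][k],
# sampled at M + 1 positions along the glottal main axis for each fold, are
# stacked into a (2M + 1) x N matrix. Mirroring the left fold and sharing the
# posterior row once are the assumptions made for this toy example.

def build_pvg(delta_left, delta_right):
    """Rows 0..M: left fold mirrored (rotated around the posterior point);
    rows M..2M: right fold. Columns index time t_k."""
    pvg = list(reversed(delta_left))   # left fold, mirrored
    pvg += delta_right[1:]             # posterior row (pc = 0) shared once
    return pvg

# Toy data: M = 2 (three sampled positions), N = 2 frames.
left = [[1, 2], [3, 4], [5, 6]]
right = [[1, 2], [7, 8], [9, 10]]
pvg = build_pvg(left, right)
print(len(pvg), pvg)  # 5 rows = 2M + 1
```

Colour-mapping the resulting matrix (red for positive, blue for negative entries) yields the butterfly-like PVG patterns shown in Figure 13.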
The GVG playback was proposed to overcome the difficulties in interpreting the PVG and its strong dependence on the detection of the glottal main axis [91,121]. The GVG synthesizes the HSV in one single image and its computation follows an approach similar to the PVG formulation. However, unlike PVG, GVG computes the distance between the vocal folds edges themselves. Firstly, the vocal folds edges C^{l,r}(t) are equidistantly sampled with pc ∈ [0, M]. Then, the deflections between the vocal folds edges are computed by Equation (14), where δ_GVG(pc, t_k) represents the distance between the left c^l_pc(t_k) and right c^r_pc(t_k) fold at position pc and time t_k:

δ_GVG(pc, t_k) = ||c^l_pc(t_k) − c^r_pc(t_k)||_2 ; ∀k and ∀pc (14)

Lastly, all the distances are stored in a matrix I_GVG (Equation (15)) and normalized within the interval [0, 1], with 0 corresponding to zero distance and 1 to the maximal distance. For visualization purposes the matrix is coded with a grayscale map. The GVG depicts a well-shaped form of the vocal folds vibration even when detection errors occur during the segmentation or glottal main axis detection, providing a more intuitive representation of vocal folds dynamics (see Figure 14). However, GVG is not suitable for analyzing symmetry between the vocal folds.
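Equation (14) and the normalization of Equation (15) translate almost directly into code; the contour tables below are toy data:

```python
import math

# Minimal sketch of Equation (14): the Euclidean distance between matched
# left/right edge points at each sampled position pc and frame t_k, normalized
# to [0, 1] as in Equation (15). The (x, y) contour tables are toy data.

def build_gvg(left_pts, right_pts):
    """left_pts[pc][k] and right_pts[pc][k] are (x, y) points; returns the
    matrix of fold-to-fold distances normalized by the maximal distance."""
    gvg = [[math.dist(l, r) for l, r in zip(row_l, row_r)]
           for row_l, row_r in zip(left_pts, right_pts)]
    d_max = max(max(row) for row in gvg) or 1.0
    return [[d / d_max for d in row] for row in gvg]

# Toy data: 2 sampled positions along the axis, 3 frames (open, wider, open).
left = [[(8, 0), (7, 0), (8, 0)], [(9, 1), (8, 1), (9, 1)]]
right = [[(12, 0), (13, 0), (12, 0)], [(11, 1), (12, 1), (11, 1)]]
gvg = build_gvg(left, right)
print(gvg)  # raw distances 4,6,4 / 2,4,2 divided by the maximum (6)
```

Because only inter-fold distances enter the matrix, no glottal main axis is required, which is precisely the robustness advantage of GVG noted above.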
There exist other global-dynamics FP that depend on vocal folds contour segmentation. The authors in [119] concatenate the vocal folds deflections δ^{l,r}_seg(pc, t_k) to obtain a 3D representation of the entire vocal folds dynamics; the representation is simple to implement but difficult to interpret. In [81], a playback named Vibration Profiles (VP) is introduced, where not only the edges but the whole vocal folds are segmented, and the deflections between the medial axis of each vocal fold and its edges are computed. The resulting representation is related to the PVG playback, but attempts to preserve information on MW propagation. Empirical Orthogonal Eigenfunctions Analysis (EOF) is another playback based on the glottal edges computation: the authors extract independent spatio-temporal structures from the vocal-fold displacements using empirical orthogonal functions. The technique appears to be an appropriate tool to extract the principal glottal vibration modes from high-speed recordings. In [122], a Singular Value Decomposition (SVD) analysis of the vocal folds edges was implemented to capture the spatial and temporal behavior of the vibrations. The first spatial eigenfold captures the average shape of the vocal folds, the second the closing pattern of the folds, and the third the motion along the longitudinal direction of the folds. On the other hand, the first temporal eigenfold captures the amplitude differences between the left and right folds, and the second the phase difference between them. The separation of spatial and temporal eigenfolds offers a compact visualization but does not allow localization of vibratory features in space and time. There are also representations that use information from other FP to compute new ones.
For instance, the authors in [123] computed the Hilbert transform of the GAW to create an analytic phase plot that depicts the effects of jitter and shimmer, as well as harmonic distortion, in terms of perturbation and periodicity. Nevertheless, analyzing GAW signals does not differentiate between left and right vocal fold vibration and, consequently, asymmetries remain unconsidered. Alternatively, reference [124] uses phase space reconstruction and correlation dimension over the glottal area time series (GAW) to investigate the dynamic characteristics of the vocal folds. In [125], PVG and GAW were combined to construct a 3D representation named Phonovibrographic Wavegram (PVG-wavegram). This playback allows visualizing the intra- and inter-cycle characteristics of vocal fold vibrations within a single three-dimensional scalar volume field for long phonation sequences. One of the latest FP proposed in the literature is the Glottocorrelogram (GCG), which visualizes the correlation coefficients obtained from the glottovibrogram, revealing phase and frequency variations along the vocal folds edges.
Only a few FP do not depend on a segmentation approach. Reference [126] proposed a method for extracting spatial features based on pixel-wise Fourier analysis of the time-varying brightness curve of each pixel across images, a FP named Laryngotopography (LGT).
LGT displays spatial information related to the natural frequencies of a portion of the larynx, the amplitude of vibration, and the propagation speed of the MW. In [127], Principal Component Analysis (PCA) was applied to the pixel intensities of the high-speed sequence, and the first two PCA coefficients were visualized. The resulting Glottaltopogram (GTG) reveals the overall spatial synchronization pattern of the vocal folds vibrations over the entire laryngeal area, rather than focusing on a specific location or frequency. Although full spatial resolution is maintained, the time information is not preserved.
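The pixel-wise Fourier analysis underlying LGT can be sketched as a per-pixel DFT of the brightness curve; the toy video and the choice of analysis bin are illustrative:

```python
import cmath
import math

# Hedged sketch of the LGT idea: for every pixel, take the DFT of its
# brightness time series and map the magnitude at one analysis bin back onto
# the image plane. The frame layout and bin choice are illustrative.

def lgt_map(video, bin_k):
    """Magnitude of DFT bin `bin_k` of each pixel's brightness curve."""
    n_t = len(video)
    rows, cols = len(video[0]), len(video[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            series = [video[t][y][x] for t in range(n_t)]
            coeff = sum(s * cmath.exp(-2j * cmath.pi * bin_k * t / n_t)
                        for t, s in enumerate(series))
            out[y][x] = abs(coeff)
    return out

# Toy clip of 8 frames, 1 x 2 pixels: the first pixel oscillates with period 4
# (i.e., DFT bin 2 of an 8-sample series), the second stays constant.
video = [[[100 + 50 * math.cos(2 * math.pi * t / 4), 100]] for t in range(8)]
amp = lgt_map(video, bin_k=2)
print(amp)  # the oscillating pixel responds strongly, the constant pixel ~0
```

Mapping such magnitudes (or the corresponding phases) over the whole image plane yields the spatial amplitude and phase maps that LGT displays.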
Since the purpose of HSV analysis is to characterize the motion of the vocal folds by identifying their movements from one frame to the following ones, the authors in [128] used Optical Flow (OF) computation to track the vocal folds solely based on their motion, with no need for additional segmentation techniques. OF aims at understanding and interpreting the dynamic behavior of moving objects at a low level by estimating a displacement vector for each pixel in the image, creating a dense motion field W(x, t_k) (Equation (6)) [70]. The direction of the motion field is expected to be inwards in the closing phase and outwards during the glottal opening. The authors implemented two global FP from W(x, t_k), named Optical Flow Glottovibrogram (OFGVG) and Glottal Optical Flow Waveform (GOFW). OFGVG is derived from GVG and represents the velocity of the glottal movement per cycle: it is obtained by averaging each row of the u(x) component of the flow and representing it as a column vector. The OFGVG complements the spatio-temporal information provided by the common techniques (GVG, PVG) by adding velocity information for each displacement of the vocal folds. On the other hand, GOFW is a 1D representation based on GAW, in which the total magnitude of the velocity is computed over a ROI for each instant of time, representing the change of velocity along time. Both FP share the limitation of GVG and GAW, since they are not suitable for analyzing symmetry between the vocal folds.
A summary of the main studies carried out to synthesize the vocal folds vibratory patterns is presented in Table 4. Table 4. Summary of the main studies that synthesise the vocal folds dynamics in FP.


Challenges in Glottal Gap Segmentation and Facilitative Playbacks
Automatic methods for glottal gap segmentation and for the representation of FP are still being investigated. From a pure image processing perspective, the task of tracking the vocal folds edges along an entire video sequence appears to be a standard tracking task. Thus, naively, it may seem that once the glottal area in a frame is delineated, it can easily be identified in successive frames, or that the glottal gap can be obtained intuitively by subtracting successive frames. However, glottal-gap segmentation is not easy due to many different factors: limited spatial resolution, contrast inadequacies during acquisition (see Figure 15a), camera rotation (see Figure 15b), side movement of the laryngoscope, movements of the patient, brightness changes (see Figure 15c), depth differences between videos, and demanding cases such as hourglass closure, irregular closure, vocal hemorrhage (see Figure 15d), nodules (see Figure 15e), polyps, cysts, scars, mucus, specular reflection, and vocal folds occlusion (see Figure 15f), among others. In addition, there are no official guidelines for laryngeal imaging footage analysis; the literature is limited by the relatively low or nonexistent correlations among measures of irregularity in vocal folds vibration and acoustic parameters; and there is a lack of normative values of the parameters and intervals, which are needed to determine the severity of pathological voice production. Since the pattern of vocal folds vibration is difficult to evaluate by simply observing the successive video frames, researchers have introduced the concept of FP to visualize the features of vocal folds dynamics more easily. The purpose of FP is to synthesize the time-varying data in a few static images, or in a unidimensional temporal signal, which permits visualizing hidden features.
Despite the great advances reached in condensing the data coming from laryngeal imaging, many of the FP in the literature present drawbacks that restrict their applicability. For instance, the motion analysis is often focused only on those points belonging to the glottal contours. Additionally, some FP rely on the computation of the glottal main axis, which strongly depends on the geometry of the detected glottal gap and can be difficult to identify accurately in the presence of a posterior glottal chink [74,81,91,120,125]. Other FP do not preserve spatial information about vocal folds vibration, limiting their applicability for interpreting vibratory features such as asymmetry [3,124,129,130]. Furthermore, there are FP that restrict the information about the dynamics of the vocal folds to one single line [4,115,128,131]. Lastly, some FP are less intuitive to interpret [126,127], which probably explains why they have not been widely used.
The current challenge is to provide new data visualization methods that overcome the drawbacks of the existing ones, while simultaneously providing features that integrate time dynamics, such as velocity, acceleration, instants of maximum and minimum velocity, vocal folds displacements during phonation and motion analysis of MW propagation. These methods should include not only the points belonging to the glottal edges but also the regions that originate such movements. All these issues make glottal segmentation and FP challenging tasks. A successful image processing approach will have to overcome them robustly in order to maintain high quality and accuracy in clinical diagnosis.

Voice Research and Clinical Applications
In voice research, several works make use of glottal segmentation and FP to investigate the correlation between vocal folds vibratory patterns, speaker voice quality, and voice onset and offset. For instance, the authors in [3,132,133] use glottal segmentation to obtain the aforementioned GAW, from which new quantitative measures are derived to evaluate the regularity of vocal folds vibration during phonation. In [134], GAW has been used to compute the Voice Onset Time (VOT), which is a clinical indicator of correct laryngeal functionality. The authors in [135] computed 20 parameters from the acoustic signal and the GAW, aiming to identify mathematical dependencies between parameters and to suggest which of them best describe the properties of the GAW and the corresponding acoustic signal. Subsequent works [7,39,49,72,136,137] pointed out the advantages of studying the vocal folds vibratory function to detect asymmetries, transients, breaks, opening and closing phase events, and irregularities.
The literature also reports the combined use of laryngeal image processing of the vocal folds motion with biomechanical models (so-called Lumped Mass Models (LMMs)) to investigate the effect of parameters such as applied subglottal air pressure, vibrating masses, tissue stiffness and elongation characteristics on the dynamic vocal folds behavior. In [6,138,139], the Two Spring-Coupled Masses Model (2MM) proposed in [140] was combined with glottal segmentation to describe the vocal folds oscillations as a function of time, from which physiological parameters such as vocal folds tensions and masses were estimated. In [141], an automatic optimization procedure was developed to fit a multi-mass model [142] to the observed vocal folds oscillations obtained via segmentation, with the aim of inferring an approximation of the stiffness and mass distribution along the entire vocal folds. In [143], a numerical 2MM was adapted to the VFT playback by varying the model masses, stiffness and subglottal pressure; the obtained model was used to quantify gender- and age-related differences. More recent work uses a recurrent neural network to train a 2MM on VFT data to estimate laryngeal pressure [144]. The authors claim that the proposed methodology is transferable to the estimation of other laryngeal parameters that will aid diagnosis and treatment selection in voice disorders.
The image processing of vocal folds motion has been used to highlight the importance of visualizing MW propagation for an accurate diagnosis and optimal treatment of voice disorders. In [145,146], the MW was detected and quantified by combining glottal segmentation with physiological knowledge of mucosal lateral movements. An important conclusion was that 2000 fps is insufficient to assess MW features when the frequency of vocal fold vibration is high. The authors in [147] discussed the benefits, disadvantages, and clinical applicability of different MW measurement techniques, and found that additional research is necessary to broaden the use of laryngeal image processing for accurate and objective diagnosis of voice disorders. The authors in [148] analyzed the OQ from glottal-gap segmentation and Electroglottography (EGG). They observed that measuring the OQ from three parts of the glottis helps to diagnose and localize glottal vocal-fold lesions, and also found that the OQ varies depending on the type of organic dysphonia. In [149,150], MW analysis is used for the diagnosis of different vocal fold lesions and for evaluating the success of treatment procedures, hydration effects, or mucosal healing after phonosurgery. Besides human vocal folds, mucosal waves have been documented on the vocal folds of many mammals and also in birds [151][152][153][154]. The authors observed that the fundamental frequencies of many mammals are similar to those seen in humans, and that their phonation consists of flow-induced self-sustaining oscillations of laryngeal tissues.
Further, the appearance of MW was shown to be a crucial component of the myoelastic-aerodynamic theory of phonation, explaining the mechanisms of the self-sustained vocal fold oscillations.
Different singing styles have been analyzed using laryngeal image processing. For instance, the mechanism of the bass type of Mongolian throat singing (called Kargyraa) was studied using FP [155]: both the true and the false vocal folds were found to vibrate during singing, and the vibration of the false folds adds subharmonics to the acoustic content. In [156], the characteristics of rock singing, also known as distorted singing, were investigated. The authors found modulations of the vocal folds vibrations through periodic or aperiodic motion in the supraglottic mucosa, which presumably adds special expressivity to loud and high tones in rock singing. The authors in [157] proposed FP to visualize glottal dynamics during singing sequences consisting of vowels sung at different pitches and loudness levels, exploring the four laryngeal mechanisms, and found additional information on the temporal dynamics of glottal vibratory movements during the glottal closing and opening phases. The Laodan and Qingyi roles in Peking Opera were studied in [158], using image processing to evaluate glottal changes, the relative length of the glottal fissure and the relative maximum mucosal amplitude. To understand the vibration of the vocal folds at the very high soprano range, several acoustic and glottal parameters were extracted from laryngeal videos [159,160]. The studies reveal three regions of pitch jumps or instabilities in the high soprano range, above the M1-M2 transition or primo passaggio. They also found that the whole length of the membranous part of the vocal folds took part in the oscillatory process and that, even at these highest pitches, a total closure of the vocal folds was observed.
There are also studies devoted to describing the properties of laryngeal oscillation patterns with regard to vocal registers and their transitions [41,42,161,162]. In [161], the vocal fold oscillatory patterns during transitions from modal to falsetto register, and from modal register to stage voice above the passaggio, were analyzed in professional tenor opera soloists using PVG and DKG. The transitions were found to be associated with large modifications of the supraglottic cavities and with retraction of the epiglottis; therefore, the authors suggest the use of flexible transnasal high-speed laryngoscopy to improve the visibility of the vocal folds. In [42], variations of vocal fold vibration patterns were found in the first and second passaggi. Four distinct patterns emerged: smooth transitions with either increasing or decreasing durations of glottal closure, abrupt register transitions, and intermediate loss of vocal fold contact. In addition, the laryngeal configuration and vocal fold behavior during different voice qualities have been explored in several studies [163,164]. The results suggest that the voice qualities were produced by independently manipulating the adduction of the arytenoid cartilages and the thickening of the vocal folds.
Glottal segmentation and FP have also been used to diagnose functional voice disorders [45,48,[165][166][167]. For instance, the authors in [47] presented a computer-aided method for automatically and objectively classifying individuals with normal and abnormal vocal folds vibration patterns. First, a set of image processing techniques was employed to visualize the vocal folds dynamics; then, numerical features capturing the dynamic behavior and the symmetry of the vocal folds oscillation were derived; lastly, a support vector machine was applied to classify between normal and pathological vibrations. The results indicate that an objective analysis of abnormal vocal folds vibration can be achieved with considerably high accuracy. Laryngeal imaging techniques have also been used to detect early-stage vocal folds cancer: the authors in [51] report a procedure to discriminate between malignant and precancerous lesions by measuring the characteristics of vocal folds dynamics through a computerized analysis of laryngeal videos, finding that vocal folds dynamics are significantly affected by the presence of precancerous lesions.
Recently, laryngeal image processing of vocal folds motion has been used to visualize the 3D superior vocal fold vibrations from laryngeal recordings [23,24,[168][169][170]. The authors showed that healthy phonation does not depend on symmetric oscillation patterns, since considerable asymmetric vertical dynamics were observed, and concluded that a thorough evaluation of vocal folds vibrations has to include a 3D visualization of the vibratory function.

Discussion and Future Directions
The study of the vocal folds vibratory behavior reveals pathophysiological evidence and explains voice dysfunction. Over the last decade, a great number of papers have been published in the field of laryngeal imaging, focusing on glottal gap segmentation and FP representation in different image modalities.
There are some open challenges that should be addressed in future research, such as the lack of unified benchmarks and guidelines to evaluate glottal segmentation and FP. The studies in the literature have been performed using their own private datasets, and it is not straightforward to evaluate and numerically compare different studies without shared public databases for benchmarking purposes. The studies to date can only be judged on their reported results, since they use different datasets, various evaluation methods, and multiple performance metrics. For a numerical comparison of the studies, it is necessary to build benchmark datasets and to establish a set of guidelines. Additionally, the datasets should be clinically validated: they should comprise samples coming from a large number of patients, annotated by different specialists to remove subjective variations in the annotations. Such an effort would make possible the numerical comparison of the results obtained by different studies and the identification of distinguishing features. In view of these limitations, and in order to partially circumvent these problems, [87] proposes a set of guidelines to measure the accuracy and efficiency of the segmentation algorithms. The guidelines are divided into three groups according to their nature: analytical, subjective, and objective.
Another problem is related to the accurate and precise detection of the glottal gap along the video sequence. Most of the methods in the literature only use local intensity information, without any prior knowledge about the object to be segmented or the temporal information of the laryngeal sequence. Furthermore, the task of identifying the glottal gap is often carried out by semi-automatic methods. In this context, and with the exponential growth of computing power and the constant improvement of image processing algorithms, automatic segmentation has made great progress. However, many of the techniques found in the literature still have weaknesses that make them impractical in a clinical environment, in which automatization and reliability are fundamental. Up to now, there is no standardized procedure to automatically segment the glottal gap from HSV and VS, in spite of the extensive literature devoted to solving such a problem. Roughly speaking, the common approach divides the glottal segmentation problem into three main stages: image enhancement, identification of the ROI, and glottal gap delimitation.
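The three-stage pipeline can be illustrated with deliberately simplistic stand-ins for each stage (a linear contrast stretch and a darkest-region threshold); real systems use far more robust operators:

```python
# Illustrative sketch of the three-stage pipeline: (1) image enhancement,
# (2)-(3) ROI identification and gap delimitation collapsed into a
# darkest-region threshold. Both operators are simplistic stand-ins.

def stretch_contrast(frame):
    """Stage 1: linear contrast stretch to the full 0-255 range."""
    lo = min(min(row) for row in frame)
    hi = max(max(row) for row in frame)
    span = (hi - lo) or 1
    return [[(p - lo) * 255 // span for p in row] for row in frame]

def dark_threshold(frame, thresh=64):
    """Stages 2-3: the glottal gap is the darkest structure in the image,
    so mark pixels below `thresh` as foreground (1)."""
    return [[1 if p < thresh else 0 for p in row] for row in frame]

# Toy frame: bright tissue (200) with a dark glottal slit (40) in the middle.
frame = [[200, 200, 200],
         [200, 40, 200],
         [200, 40, 200],
         [200, 200, 200]]
mask = dark_threshold(stretch_contrast(frame))
print(mask)  # only the two slit pixels are foreground
```

The sketch makes the fragility discussed above tangible: a specular reflection inside the gap or a global brightness change would immediately break the darkest-region assumption, which is why published methods add priors and temporal information.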
The laryngeal imaging techniques in combination with image processing are the most promising approach to investigate vocal folds vibration and laryngeal dynamics. Therefore, researchers have introduced objective parameters that can be used by clinicians to accurately identify the presence of organic voice disorders [45], classify functional voice disorders [47], vibratory patterns [49], discriminate early stage of malignant and precancerous vocal folds lesions [51], among others [7,39]. However, despite the progress achieved to describe the vocal folds dynamics, its use in the clinical routine is limited due to several restrictive factors: there are no official guidelines for footage analysis; the literature is limited by the relatively low or nonexistent correlations among measures of irregularity in vocal folds vibration and acoustic parameters; and there is a lack of normative values of parameters and intervals, which are needed to determine the severity of pathological voice production.
In order to improve laryngeal diagnosis, the concept of FP has been introduced by voice researchers. These representations synthesize the time-varying data to visualize hidden features that are not easily observed in the laryngeal sequences. However, many of the FP presented in the literature have drawbacks that restrict their applicability, since they rely on glottal-area segmentation (GAW, GVG, PVG, PVG-wavegram, VFT, VP, EFA and HTA). Additionally, most of the FP consider only those points belonging to the glottal contours and neglect the MW contribution. Others, such as GVG, PVG, PVG-wavegram, VFT and VP, rely on the computation of the glottal main axis, which strongly depends on the geometry of the detected glottal area. FP such as GAW, HTA, NDA and PGAW are based on the glottal area waveform computation, so they do not preserve spatial information about vocal folds vibration. On the other hand, FP such as OFW, OFKG, DKG and MKG restrict the information to one single line. The challenge is to provide representations that overcome the drawbacks of the existing ones, while simultaneously offering features that integrate the time dynamics, such as velocity, acceleration, instants of maximum and minimum velocity, vocal folds displacements during phonation and motion analysis of the MW.
To the best of our knowledge, only a few works study MW propagation [47,145,147,171]. Future studies should analyze in more detail the objective detection and quantification of MW propagation, since its existence and magnitude provide valuable information about the coupling between the mucosa and the underlying vocal folds muscle. Additionally, it is widely demonstrated that MW activity is a useful indicator of the quality of voice production and of the presence of voice disorders. One possible solution to address these shortcomings could be to create kinematic models that simulate the MW based on physical models and FP, similar to the one presented in [171]; in this way, it would be possible to explore the variations of the MW under different constraints. Besides, the potential of 3D visualization of the vocal folds could be exploited to understand the propagation of the mucosal wave, using motion estimation techniques such as 3D feature tracking or dense 3D motion fields [172,173].
Further development should explore other techniques, such as image registration [65], Haar wavelets [174], Haar-like features [175], Local Binary Patterns (LBP) [176], Histograms of Oriented Gradients (HOG) [177] or supervised machine learning, to characterize the kinematics of the vocal folds. Image registration is one of the fundamental tasks in image processing, in which an optimal geometric transformation is sought between two corresponding images; it would therefore allow the deformation of the vocal folds to be modeled for each particular laryngeal mechanism. HOG captures the local shape of a target object from the directions and magnitudes of the image derivatives, and could thus be used to model the wave-like motion of the vocal folds during different vocal tasks. HOG could also be combined with Haar wavelets, Haar-like features and LBP to create a new feature descriptor for classifying different vocal folds vibratory patterns [178]. On the other hand, supervised machine-learning techniques, and especially deep learning [179], are part of state-of-the-art systems in many disciplines, particularly computer vision and image processing, and have already been used to support critical decisions in medical diagnosis. Neural networks could be trained to extract the 3D structure of the vocal folds, to segment the glottal gap, to compensate camera rotations and translations, to enhance the quality of laryngeal imaging, to synthesize video information, to classify different vibratory patterns, to generate synthetic datasets, to track glottal structures (e.g., arytenoids, epiglottis), or to perform image data augmentation to increase the video frame rate, among other tasks.
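As an example of one such texture descriptor, a basic 8-neighbour LBP code can be computed with NumPy alone. This is a simplified sketch of the operator in [176] (fixed 3x3 neighbourhood, no uniform-pattern mapping), not a production implementation:

```python
import numpy as np

def lbp_3x3(img):
    """Basic 8-neighbour Local Binary Pattern on a 2-D uint8 array.
    Each interior pixel gets an 8-bit code: one bit per neighbour
    whose value is >= the centre value.  Border pixels are dropped."""
    c = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy,
                 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    return code

# Usage: the histogram of codes over a patch is the texture feature
# that could feed a vibratory-pattern classifier.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
hist = np.bincount(lbp_3x3(patch).ravel(), minlength=256)
```

Concatenating such histograms with HOG or Haar-like responses would give the kind of combined descriptor suggested in [178].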
Lastly, it might be helpful to show how the FP and the objective parameters provide complementary information, and to combine them so that each places the other in context. In addition, a systematic comparison with other glottal-activity signals, such as Electroglottography (EGG), would be very valuable to provide a more complete assessment of vocal folds vibration.
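One simple form such a comparison could take is estimating the temporal lag between an image-derived GAW and a simultaneously recorded EGG trace by cross-correlation. A sketch with synthetic stand-in signals; the function name is hypothetical, and aligning real GAW/EGG data would additionally require resampling both signals to a common rate:

```python
import numpy as np

def signal_lag(ref, sig):
    """Delay (in samples) of sig relative to ref, estimated as the
    peak of the full cross-correlation of the zero-mean signals."""
    ref = ref - ref.mean()
    sig = sig - sig.mean()
    xcorr = np.correlate(sig, ref, mode="full")
    return int(np.argmax(xcorr)) - (len(ref) - 1)

# Synthetic stand-ins: a periodic 'GAW' cycle and an 'EGG' trace that
# is simply the same waveform delayed by 5 samples.
n = np.arange(200)
gaw = np.sin(2 * np.pi * n / 40)
egg = np.roll(gaw, 5)

lag = signal_lag(gaw, egg)  # recovers the 5-sample delay
```

With real recordings, the recovered lag (and the shape differences that remain after alignment) would quantify how the image-based and electrical views of glottal contact complement each other.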

Conflicts of Interest:
The authors declare no conflict of interest.