Article

VID2META: Complementing Android Programming Screencasts with Code Elements and GUIs

by
Mohammad D. Alahmadi
Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia
Mathematics 2022, 10(17), 3175; https://doi.org/10.3390/math10173175
Submission received: 2 August 2022 / Revised: 22 August 2022 / Accepted: 27 August 2022 / Published: 3 September 2022

Abstract
The complexity of software projects and the rapid technological evolution make it such that developers often need additional help and knowledge to tackle their daily tasks. For this purpose, they often refer to online resources, which are easy to access and contain a wealth of information in various formats. Programming screencasts hosted on platforms such as YouTube are one such online resource that has seen a growth in popularity and adoption over the past decade. These screencasts usually have some metadata such as a title, a short description, and a set of tags that should describe the main concepts captured in the video. Unfortunately, metadata are often generic and do not contain detailed information about the code showcased in the tutorial, such as the API calls or graphical user interface (GUI) elements employed, which could lead to developers missing useful tutorials. Having a quick overview of the main code elements and GUIs used in a video tutorial can be very helpful for developers looking for code examples involving specific API calls, or looking to design applications with a specific GUI in mind. This work aims to make this information easily available to developers and proposes VID2META, a technique that automatically extracts Java import statements, class names, method information, GUI elements, and GUI screens from videos and makes them available to developers as metadata. VID2META is currently designed to work with Android screencasts. It analyzes video frames using a combination of computer vision, deep learning, optical character recognition, and heuristic-based approaches to identify the needed information in a frame, extract it, and present it to the developer. VID2META has been evaluated in an empirical study on 70 Android programming videos collected from YouTube. The results revealed that VID2META can accurately detect and extract Java and GUI elements from Android programming videos with an average accuracy of 90%.

1. Introduction

Developing modern software systems is challenging due to the increasing complexity of software and the vast technological knowledge that developers need to obtain in order to effectively implement it. To obtain this knowledge, developers often need to refer to external resources and frequently consult informal online documentation in the form of question and answer (Q&A) websites, tutorials, and API documentation due to the wide availability and variety of these resources. Recently, programming screencasts have become a popular resource among developers [1,2] due to their engaging and interactive nature [3,4,5]. In addition, the number of programming video tutorials is rapidly growing, since it is generally much easier for programmers to prepare them than to create a text-based tutorial [2,6].
Although the amount of Android programming screencasts at the disposal of developers is vast, the code information presented in these videos cannot be easily accessed, searched, and navigated. This is due to the nature of the data, as videos contain a sequence of frames or images, which make it hard to fully explore the content of the videos without manually skimming through them. To find a video of interest, a developer typically writes a query, and a list of videos is retrieved based on the video metadata which includes the title, description, and tags. Then, developers decide which video is the closest match to their information needs based on (i) reading the video metadata and/or (ii) manually skimming over the content of the video. Unfortunately, currently available video metadata are not always descriptive to developers, typically containing general information about a programming task rather than code-related details [5,7]. At the same time, manually skimming over videos is not efficient and can lead to missing useful parts of videos [8,9], given that people spend less than a second analyzing an online resource to determine its relevance [10]. Therefore, there is a need to complement Android programming screencasts with additional metadata that specifically describe the code and GUI screens appearing in the videos, making them easier to search, inspect, discover, and navigate.
To augment videos with code- and GUI-related metadata, this paper introduces VID2META, an approach that operates on video frames and analyzes their visual and textual content to automatically extract and list Java code and GUI elements (represented in XML code), as well as GUI screens. To begin with, VID2META employs an object detector to locate the regions that contain Java code and GUI elements (I refer to these regions as code-editing windows) and GUI screens (GUIs). Then, it uses optical character recognition (OCR) to extract the text present in the code-editing windows and proceeds to classify this text as either Java or XML code using deep learning (DL) methods. Last, VID2META uses heuristic-based techniques to fix the syntax of the OCRed Java and XML code, locate and extract Java and GUI elements, and detect the incorrect ones using cross-frame information. Note that Java elements include import statements, class names, method names, and method calls, whereas GUI elements are identified by their names, such as ListView and ProgressBar.
I conducted four empirical evaluations to assess (i) the accuracy of localizing the code-editing window and GUIs using the proposed approach as opposed to a previous work [5]; (ii) the ability of the proposed approach to correctly classify the content of the code-editing window as either Java or XML code; (iii) the effectiveness of the proposed techniques to first find and extract Java and GUI elements from the code-editing window and second detect the incorrectly extracted elements; and (iv) the accuracy of identifying and eliminating duplicate GUIs in order to remove noise and cut down on processing time. To carry out these empirical evaluations, I collected a total of 70 videos from YouTube, with all their individual video frames. I used the frames in 20 of the videos to train and evaluate the approaches for localizing the code-editing window and classifying its contents, and the remaining 50 videos to evaluate the proposed techniques that extract Java and GUI elements. The results of these evaluations illustrate that VID2META is able to successfully extract and correct Java and GUI elements from Android videos, achieving an accuracy above 94% in localizing code-editing windows, project windows, and GUIs. In addition, VID2META is able to successfully detect the content of code-editing windows and reduce the errors in the OCRed Java and GUI elements by an average of 39.5%. Moreover, the proposed approach for detecting and eliminating duplicate GUIs achieved an F-Score of 84%.
The major contributions of this paper are as follows:
  • VID2META is proposed which, to the best of my knowledge, is the first approach that complements the metadata of Android programming screencasts with Java and GUI elements extracted from the videos;
  • Through extensive experiments, I show that VID2META can accurately (i) locate the code-editing window and GUIs in the frames of Android screencasts, (ii) classify the content of a code-editing window, and (iii) extract Java and GUI elements;
  • I provide a replication package (https://zenodo.org/record/5014890, accessed on 20 July 2022) containing the complete dataset, source code, VID2META’s output, and the detailed results of the experiments.

2. Approach

In this section, VID2META is introduced, which aims to complement the metadata of Android screencasts by analyzing their video frames to extract meaningful Java and GUI elements, as well as entire GUIs.

2.1. VID2META Overview

As shown in Figure 1, VID2META takes a video as input, analyzes the visual and textual content presented in the video frames, and produces Java and GUI elements (I collectively refer to these elements as code elements). To extract code elements, VID2META needs to locate the code-editing window that contains the Java or XML code and then extract the appropriate code elements. Since the two programming languages require different approaches to extract code elements, we need to first determine what programming language we are dealing with and then apply the proper approach accordingly. I initially envisioned two approaches to classify the content of the editing window, as follows: (i) train an object detector with two classes (Java and XML) to directly determine the programming language of the code in the editing window, or (ii) train an object detector to localize the code-editing window and then use another approach to classify the text inside the editing window as Java or XML. Through a preliminary experiment, I found that the first approach failed to distinguish between Java or XML due to the similarity between the visual appearance of code written in the two languages (e.g., similar font type and color, background color, and integrated development environment window structure). Therefore, I proceeded with the second approach as follows. First, VID2META locates the bounding box of the selected file in the project window using computer vision techniques and uses OCR to extract the file extension of the selected file, and thereby detect the content type of the editing window. Second, VID2META uses a deep learning binary text classifier to predict the programming language of the code in the editing window (e.g., Java or XML). In case the content of the editing window is Java, VID2META extracts import statements, class names, and method names and calls, whereas if it is XML, VID2META extracts GUI elements from the file. Besides extracting code elements, VID2META extracts GUIs and removes duplicates using computer vision techniques. Next, more details about the pipeline of VID2META are presented.

2.2. Localizing Code-Editing Windows, Project Windows, and GUIs

A mobile programming screencast consists of n frames, where each frame is typically divided into windows/regions of different sizes (e.g., a frame showing an IDE could have several regions such as an editing window, a project window, a GUI layout, etc.). To extract Java and GUI elements from those frames, we need to locate the bounding box of the code-editing windows precisely. This ensures that when we apply OCR only on those windows, we extract only the code information (i.e., we avoid extracting irrelevant information from other windows in an IDE such as a project window, an output window, etc.). The windows that are located are defined as follows.
  • Code-Editing Window (CEW): This window contains Java or XML texts.
  • Project window (PW): A project window is typically displayed on the left side of an IDE and includes resource files of a project in the form of a tree view.
  • Graphical user interface (GUI): Visual GUI elements are placed on GUIs where users can interact with the app.
Formally, given a video V = {f_1, f_2, …, f_n}, where f_i is the frame captured at the ith second and n is the total number of seconds in the video, I aim to detect the location of each window w_i in f_i, if any. Note that this is a multi-window detection and localization problem in video frames, in which each frame f_i could contain any combination of the pre-identified windows. For example, in a mobile programming screencast, the screen of a narrator could show an IDE where the project window, an XML editing window, and one or more GUI previews are displayed at the same time. Regardless of the number of windows displayed at the same time, all windows must be located accurately.
To recognize the windows in a frame, we first need to obtain each window’s features and feed them into a learning model. For this task, a convolutional neural network (CNN), called Inception-ResNet V2 [11], with a region-based object detector, called Faster R-CNN [12], is utilized. Note that pre-trained CNNs have been successfully adopted for analyzing programming video frames [13,14,15], and Faster R-CNN has shown promising results in detecting mobile GUI components [16,17]. The choice of the Inception-ResNet V2 network with Faster R-CNN is based on the empirical evaluation performed in [18], which found that this combination outperformed several other object detectors that used various CNN architectures. Several parameters impact the network’s performance during the training process. The choice of the optimizer plays an essential role in updating the weights after each epoch through back-propagation (one forward and backward pass of every training sample through the neural network). Following the optimizer recommended for Faster R-CNN, momentum, a variant of stochastic gradient descent (SGD) [19], is used. The default values of the network layers’ hyperparameters are used, such as a stride value of 16.
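For illustration, the sketch below shows how a trained detector of this kind could be run on a single video frame, assuming the model was exported with the TensorFlow Object Detection API; the label map, paths, and score threshold are placeholders rather than the exact configuration used in this work.

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Hypothetical label map for the three annotated window types.
LABELS = {1: "CEW", 2: "PW", 3: "GUI"}

def detect_windows(frame_path, saved_model_dir, score_threshold=0.5):
    """Run an exported Faster R-CNN SavedModel on one frame and return
    the predicted windows as (label, score, [ymin, xmin, ymax, xmax])."""
    detector = tf.saved_model.load(saved_model_dir)
    image = np.array(Image.open(frame_path).convert("RGB"))
    # Exported detectors expect a batched uint8 tensor.
    inputs = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.uint8)
    outputs = detector(inputs)

    boxes = outputs["detection_boxes"][0].numpy()       # normalized coordinates
    scores = outputs["detection_scores"][0].numpy()
    classes = outputs["detection_classes"][0].numpy().astype(int)

    windows = []
    for box, score, cls in zip(boxes, scores, classes):
        if score >= score_threshold and cls in LABELS:
            windows.append((LABELS[cls], float(score), box.tolist()))
    return windows
```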
To train the object detector, we need to manually collect videos and annotate code-editing windows with their coordinates. Next, the data collection and annotation processes are explained.

2.2.1. Datasets: Mobile Programming Screencasts

To prepare the data for the object detector’s training, I used 20 videos manually chosen from YouTube. To select the videos, I searched for various topics using the keyword Android, combining it with words such as Google Map, game development, or ListView. The final selection was limited to a maximum of three videos from each topic. A total of 11 different categories were collected. Most importantly, I ensured diversity by selecting a maximum of one video from each playlist. A total of 18 unique authors produced the videos. Concerning the visual content of the video datasets, I ensured that there was a variety of (i) regions presented in the video (two GUI previews shown at the same time, XML with UI, and others) and (ii) IDE background colors (15 white and five dark greys). After the videos were chosen, they were downloaded to the server with the highest available quality using the youtube-dl (https://github.com/rg3/youtube-dl, accessed on 10 June 2021) tool. The mean and median length of the collected videos were 608 s and 592 s, respectively.

2.2.2. Annotating Code-Editing Windows, Project Windows, and GUIs

To train a model to locate a specific window/region, we have to collect several images that contain those regions and feed them into a neural network with their coordinates. Because the regions are spatially located within an image, the neural network learns to extract the relevant spatial features (through convolutional layers) for each region, which maps to its coordinates.
The mobile programming screencast dataset was used to annotate three windows. Note that a video frame containing an IDE could have none to several windows (e.g., a project window, an editing window, and a GUI preview displayed simultaneously).
Two annotators annotated each window belonging to one of the categories with bounding box information (e.g., coordinates) and a label (e.g., CEW, PW, or GUI). Previous works that analyzed programming screencasts extracted a frame per second and classified each frame individually [13,15,20]. They also found that the manual labeling task requires time-intensive human effort (e.g., 100 students were required for the annotation process in [13]). In our case, we classified a frame, defined the bounding box of each window, and labelled its class, which adds more time and effort to the manual annotation process. Based on the finding that programming screencasts are more static than other types of videos [21], we found it much more effective to annotate the entire video while it plays instead of extracting and annotating each of its individual frames. On average, annotating a single region of a frame takes about seven seconds.
To facilitate the process of video annotation, we used a cloud-based web tool called DataTurks (https://dataturks.com/, accessed on 15 June 2021) and uploaded the 20 videos. Next, we defined the class labels. Note that since we need to evaluate the proposed approach (explained in Section 2.3) in distinguishing between Java and XML, we labeled the CEW with either Java or XML. Yet, when the network was trained, Java and XML were considered as one class belonging to CEW. We played each video and paused it once a region of interest (RoI) appeared. We labeled each region with a class name and drew a corresponding bounding box around it (i.e., there could be more than one region in each snapshot). The video then continued playing while our regions and class names were still drawn; thus, we did not need to annotate each snapshot/frame individually. We paused the video when our RoIs were no longer shown to remove them accordingly.
After annotating all videos, we downloaded the annotated datasets in the JSON format. This file included all information we needed to create a PASCAL VOC [22] format file for each image, including the ground-truth annotation information. This file is required when training an object detector. We parsed the JSON file and extracted the annotation information for each window: StartTime, EndTime, className, and BoundingBox. Then, for each second between the start and end time, we (i) extracted a frame from the corresponding video such that the frameNumber = fps * currentSecond and (ii) saved the frameNumber, className, the bounding box, and the currentSecond information. Once this process was finished, we grouped different regions based on the currentSecond and saved them in the PASCAL VOC format file. Using this step, a single PASCAL VOC file for each image could contain multiple regions/windows. To confirm that our annotation was accurate, we drew a bounding box with a class label on each region based on the ground-truth information. A second annotator checked each frame, and we revised the ones that did not have an accurate annotation. The Cohen’s Kappa coefficient was 0.95, which indicated a high agreement between the annotators.
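The expansion of time-ranged annotations into per-second frame records can be sketched as follows; the JSON field names and the OpenCV-based frame extraction are assumptions for illustration, as the exact export format of the annotation tool may differ, and writing the actual PASCAL VOC XML is omitted.

```python
import json
from collections import defaultdict

import cv2

def annotations_per_second(json_path, video_path, out_dir):
    """Expand time-ranged region annotations into one record per second and
    save the corresponding frame, mirroring the PASCAL VOC generation step."""
    with open(json_path) as f:
        regions = json.load(f)   # assumed: list with StartTime, EndTime, className, BoundingBox

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    per_second = defaultdict(list)   # currentSecond -> regions visible at that second
    for r in regions:
        for second in range(int(r["StartTime"]), int(r["EndTime"]) + 1):
            frame_number = int(fps * second)   # frameNumber = fps * currentSecond
            per_second[second].append((frame_number, r["className"], r["BoundingBox"]))

    # Save one frame image per annotated second; a PASCAL VOC file grouping all
    # regions of that second would be written alongside it.
    for second, items in per_second.items():
        frame_number = items[0][0]
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_dir}/frame_{frame_number}.png", frame)
    cap.release()
    return per_second
```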
Table 1 shows the number of collected images and the total number of windows/regions in each image. Because not all images contained only one region (RoI), the table is divided into multi-region and single-region cases. The majority of the annotated images contain a project window (PW) and Java regions. There are a few cases in which more than one GUI was displayed simultaneously, which brings the total number of annotated GUI regions to 7390. GUI regions could also be displayed together with other regions, such as a PW and XML, in 1138 images. There are a few cases where only a single region, such as Java, was displayed on the screen. In total, the number of annotated images and regions was 7798 and 16,085, respectively.

2.3. Detecting Programming Language

The approach proposed above locates the bounding box of the code-editing window regardless of its content, which could be a Java or XML file. Since different approaches are needed to extract code or UI elements, we first need to predict the editing window’s content. Two different approaches were proposed to predict the content of the editing window, as follows. First, the selected file was located in the project window together with its file extension (e.g., .java or .xml). This approach cannot be applied in several cases, such as when the project window is hidden or when there is no selected file in the project window. Thus, a second approach was proposed based on the textual content of the editing window. More specifically, a deep learning model was employed to predict the programming language of the text written in the editing window. Let us first explain the approach that locates the selected file in the project window using a pipeline of three main steps.
First, the background color of the selected file is typically different from that of other files. Based on this, we needed to accurately find the background color (BG) of each file listed in the project window (PW). To detect the BG, we needed to first detect the bounding box of each text entry in the PW. Text detection differs from object detection in that, for the former, the region of interest (RoI) is text composed of characters, numbers, or symbols of different sizes, whereas for the latter, the RoI has a well-defined closed boundary [23], such as the entire PW region in our case. Thus, in this task, a fine-grained text-based detection approach [24,25] was utilized. In particular, I leveraged the architecture of a connectionist text proposal network (CTPN) [26] to detect the position of each text entry (file or directory) presented in the PW as follows. First, the input images (PWs) were fed into a CNN-based feature extractor, called VGG-16 [27], to extract the feature maps. Second, to produce fine-scale text proposals, the CTPN detector densely slid a 3 × 3 window over the map to detect the text line and then produced text proposals. Last, to further increase the accuracy of the text proposals and find meaningful context information, the CTPN connected sequential text proposals in the feature map using a recurrence mechanism. It has been shown that using sequential context information on words improves the recognition performance [28]. Ultimately, a PW_i predicted by the trained model (explained in Section 2.2) was fed into the CTPN, which outputs a set of closed bounding boxes for each textual entry (e.g., the names of files, images, or directories) presented in PW_i.
Second, this step’s input is the output of the previous step, which is a set of predicted bounding boxes for each textual entry presented in PW_i. Each predicted bounding box is considered as a candidate that might or might not contain the region of interest (the selected file). An objectness score was defined as either zero (entry not selected) or one (entry selected). To determine the objectness score, the approach assumes that the selected entry’s BG color is different from the others. Note that the detected bounding box covers only the presented text, whereas the background color spans the PW’s width.
For this reason, each predicted bounding box was expanded to the full width of the PW (i.e., a refined bounding box (RBB)). Note that the bounding box of each PW_i was automatically detected using the previous approach (explained in Section 2.2). For each RBB_i, the dominant color was obtained by computing the RGB color of each pixel and counting the frequency of each color.
Last, the dominant color of each RBB_i was compared to that of RBB_j, where j ≥ i + 1 and j ≤ n (n being the total number of RBBs). One trivial solution is to compute the similarity between two RGB colors based on a distance metric (e.g., the Euclidean distance). However, two RGB colors might look very similar to a human yet have a large Euclidean distance, because the RGB color space does not reflect perceived color differences. The CIE Lab color space, in contrast, was designed to approximate human vision based on the color’s lightness [29]. For this reason, the color differences between two RBBs were quantified by computing ΔE following the guidelines of the International Commission on Illumination (CIE), as shown in Equation (1). Formally, the dominant background color of each RBB was converted into (L*a*b*), where (i) L represents the lightness value and ranges from zero (dark black) to 100 (bright white), (ii) a represents the green (negative direction) and red (positive direction) components, and (iii) b represents the blue (negative direction) and yellow (positive direction) components. The background color of RBB_i is considered similar to that of RBB_j if their ΔE is less than or equal to a threshold. By the end of this comparison, there should be only one unique RBB, which embodies the selected file or directory.
$$\Delta E^{*}_{ab} = \sqrt{(L_2 - L_1)^2 + (a_2 - a_1)^2 + (b_2 - b_1)^2} \qquad (1)$$
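A minimal sketch of this comparison is shown below, assuming scikit-image for the RGB-to-Lab conversion; the ΔE threshold value is a placeholder rather than the tuned one.

```python
from collections import Counter

import numpy as np
from skimage import color

def dominant_color(rbb_pixels):
    """Most frequent RGB triple in a refined bounding box (H x W x 3, uint8)."""
    flat = [tuple(p) for p in rbb_pixels.reshape(-1, 3)]
    return Counter(flat).most_common(1)[0][0]

def delta_e(rgb1, rgb2):
    """CIE76 color difference (Equation (1)) between two RGB colors."""
    lab = color.rgb2lab(np.array([[rgb1, rgb2]], dtype=np.uint8) / 255.0)[0]
    return float(np.linalg.norm(lab[0] - lab[1]))

def is_selected_entry(candidate_rbb, other_rbbs, threshold=9.0):
    """Flag an entry as selected when its dominant background color differs
    from every other entry's dominant color by more than the threshold."""
    c = dominant_color(candidate_rbb)
    return all(delta_e(c, dominant_color(o)) > threshold for o in other_rbbs)
```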
The second approach to detecting the code-editing window’s content is based on a DL text classifier. I leveraged GuessLang (https://github.com/yoeo/guesslang, accessed on 20 June 2021), which is a deep neural network with three classification layers and a customized linear classifier. I opted to employ GuessLang as it has been successfully used in the field of SE [30,31]. GuessLang has already been trained on 30 programming languages; however, the list of programming languages that the model was trained on did not include Android development (e.g., Java and XML). For this reason, I fine-tuned the architecture of GuessLang and trained it on a new dataset. The new dataset contains a total of 15,000 Java and 15,000 XML files extracted from 675 open-source Android projects found in the Play Store and collected by [32]. During the training, the source code files were transformed into feature vectors in which bigrams and trigrams were computed before they were fed into the network.
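As an illustration of the character n-gram idea (and not of GuessLang’s actual network), the following scikit-learn sketch classifies a code snippet as Java or XML from character bigrams and trigrams; a linear classifier stands in for the deep model, and the training snippets are only toy examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative samples; in practice the 30,000 Java/XML files would be used.
train_snippets = [
    'import android.os.Bundle; public class MainActivity extends AppCompatActivity {',
    '<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android">',
]
train_labels = ["java", "xml"]

# Character bigrams/trigrams stand in for GuessLang's n-gram features.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_snippets, train_labels)

print(clf.predict(['<TextView android:id="@+id/title" />']))  # likely ['xml']
```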

2.4. Extracting Java and UI Elements

To extract Java and UI elements from videos, I applied OCR on the predicted editing window and predicted its content using the approaches proposed above. Next, I present the approaches that extract Java and UI elements.

2.4.1. Java Elements

SrcML [33] is utilized to map the extracted Java code to an XML document. This document contains syntactic information about the source code, as it wraps the text with meaningful tag names such as imports, classes, and methods. Although srcML has been successfully used in the field of SE where the entire code is available in text format [34,35], it has never been applied to source code embedded in video frames. This poses two main limitations: (i) the OCRed code typically contains several errors, since most OCR engines are sensitive to image quality and do not work well with images that contain source code [5,36,37], and (ii) each video frame contains only part of a Java element (i.e., an incomplete element), since the code in programming screencasts is typically written on the fly. To detect OCR errors, I devised several steps to fix these issues, as follows.
First, Java elements are not recognized by srcML due to errors in the syntax of the OCRed code. There are several cases where this could occur, such as when (i) the class name is not detected because the semicolon is missing before the class definition, (ii) there is an illegal space in the definition of a Java element (e.g., an import statement), or (iii) opening and/or closing curly brackets are missing. To solve this, I wrote a script to pre-process all OCRed code and fix syntax errors as follows. First, I removed illegal extra spaces in import statements, class names, method names, and method calls. Second, I added a semicolon at the end of each statement if one did not exist or changed a colon to a semicolon where appropriate. Third, I fixed missing curly brackets or parentheses (i.e., some curly brackets are extracted as square brackets).
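A minimal sketch of such pre-processing rules is given below; the regular expressions are illustrative and do not reproduce the exact rules of the actual script.

```python
import re

def fix_ocr_syntax(ocr_code: str) -> str:
    """Best-effort repair of common OCR artifacts before running srcML."""
    fixed_lines = []
    for line in ocr_code.splitlines():
        # (1) remove illegal spaces around dots, e.g. "android. os .Bundle"
        line = re.sub(r"\s*\.\s*", ".", line)
        # (2) turn a stray trailing colon into a semicolon
        line = re.sub(r":\s*$", ";", line)
        #     and add a missing semicolon after import/package statements
        if re.match(r"\s*(import|package)\s+[\w.]+\s*$", line):
            line = line.rstrip() + ";"
        # (3) OCR sometimes reads curly brackets as square brackets
        line = re.sub(r"\[\s*$", "{", line)
        line = re.sub(r"^(\s*)\]", r"\1}", line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)

print(fix_ocr_syntax("import android. os .Bundle\npublic class MainActivity [\n]"))
```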
Second, most modern IDEs feature code suggestion, where a popup window appears as developers write code to help them save time; as such, srcML may extract method names and calls from a popup window that are not part of the transcribed code. To detect whether an extracted method name or call was part of the transcribed code, I checked whether its full definition matches the correct format (i.e., is syntactically correct). For example, a method name must have a modifier, return value, name, and parentheses. Otherwise, it is considered a method extracted from a popup window. It is worth mentioning that this step is essential, as the extracted Java elements must be part of the transcribed code; otherwise, irrelevant Java elements are displayed and indexed.
Last, Java elements are incorrectly extracted due to OCR errors, which is very common when OCR is applied to an IDE to extract source code [5,36,37,38]. This is due to the fact that OCR is very sensitive to (i) the colorful background of the IDE, (ii) the low resolution of video frames, and (iii) the low quality of video frames. Let us denote the initial set of Java elements extracted from a video as V = {e_{1,1}, e_{1,2}, …, e_{2,1}, e_{2,2}, …, e_{i,j}, …, e_{m,n}}, in which e_{i,j} is the jth extracted element from the ith frame (to be precise, the cropped editing window). As import statements, class names, method names, and method calls are extracted, I create one set for each element type (i.e., four different sets for each video). Each set contains unique elements and their counts. For example, if the same method name (e.g., onOptionsItemSelected) was extracted from a total of k frames, it has a count of k. The k value determines the confidence score of whether an element was correctly extracted or not. The similarity between each pair of elements was computed using a character-level similarity metric. Specifically, the normalized Levenshtein distance (NLD) (as described in Equation (2)) was used, and if the similarity between two elements is greater than a threshold, I discarded the element that was extracted fewer times (based on the percentage change). The rationale is that an incorrect (or incomplete) element is typically extracted fewer times than a similar correct element, because OCR is less likely to confuse some characters. For example, if an element onOptionsLtemSelected was incorrectly extracted, appeared only k times, and was very similar to onOptionsItemSelected, which appeared more times than k, the first term was assumed to be incorrectly extracted.
$$NLD(e_1, e_2) = 1 - \frac{LD(e_1, e_2)}{\max(len(e_1), len(e_2))} \qquad (2)$$
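A sketch of this cross-frame filtering is shown below, with a plain dynamic-programming edit distance and placeholder thresholds (the actual thresholds were tuned as described in Section 3.3).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def nld(e1: str, e2: str) -> float:
    """Normalized Levenshtein similarity, as in Equation (2)."""
    return 1.0 - levenshtein(e1, e2) / max(len(e1), len(e2))

def filter_elements(counts: dict, sim_threshold: float = 0.8) -> set:
    """Drop the less frequent element of every highly similar pair,
    assuming the rarer variant is an OCR error."""
    kept = set(counts)
    by_count = sorted(counts, key=counts.get, reverse=True)
    for i, frequent in enumerate(by_count):
        for rare in by_count[i + 1:]:
            if frequent in kept and rare in kept and nld(frequent, rare) > sim_threshold:
                kept.discard(rare)
    return kept

# The misread method name appears in fewer frames and is discarded.
print(filter_elements({"onOptionsItemSelected": 12, "onOptionsLtemSelected": 2}))
```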

2.4.2. UI Elements

An Android app contains a layout with an invisible container, such as RelativeLayout, that defines the visible UI elements through which users view and interact with the app. The UI elements can be declared based on (i) a list of pre-built UI elements provided by Android, such as TextView, DatePicker, ProgressBar, and several others; (ii) third-party API libraries, for which a user must import the required package and/or create a dependency entry in the build.gradle file; and (iii) a customized class, for which a developer must create custom Java code and define the behavior of the element. Fortunately, we can create a list of valid UI elements and check whether an OCRed UI element is in that list to detect incorrectly extracted UI elements. I created a list of valid UI elements as follows. First, I utilized the official online Android documentation (https://developer.android.com/, accessed on 30 June 2021) to crawl all UI elements. Second, I used all import statements defined in Java files, obtained using the approach that extracts import statements, as well as a regular expression to find all third-party dependencies defined in the build.gradle file. Note that this file’s content is written in a code-editing window, and thereby we could easily extract its content and find UI elements defined as dependencies. Last, I applied OCR on the predicted project window to extract a list of file names, which contains the names of potential user-defined UI elements. Ultimately, if an OCRed UI element is incomplete (e.g., Imagevie) or incorrect (e.g., viewelipper), it will not exist in the list of valid UI elements, and thereby it is marked as an incorrect element.
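A minimal sketch of this validity check is shown below; the whitelist entries and the tag regex are illustrative only.

```python
import re

# Assembled from (i) the official Android widget list, (ii) third-party
# dependencies found in build.gradle, and (iii) file names OCRed from the
# project window; the entries below are merely examples.
VALID_UI_ELEMENTS = {"TextView", "ImageView", "ListView", "ProgressBar",
                     "DatePicker", "ViewFlipper", "YouTubePlayerView"}

def check_ui_elements(ocr_xml: str):
    """Split OCRed XML tags into valid and incorrect/incomplete UI elements."""
    candidates = re.findall(r"<\s*([A-Za-z][\w.]*)", ocr_xml)
    valid = [c for c in candidates if c in VALID_UI_ELEMENTS]
    invalid = [c for c in candidates if c not in VALID_UI_ELEMENTS]
    return valid, invalid

print(check_ui_elements("<ListView/> <Imagevie/> <viewelipper/>"))
# -> (['ListView'], ['Imagevie', 'viewelipper'])
```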

2.5. Eliminating Duplicate GUI Screens

Extracting one frame per second from each video would result in duplicate frames, especially for video programming tutorials, as the content displayed in these videos is more static than in other types [21]. The trained model locates and crops GUI screens from each frame, which results in redundant GUI screens. To detect redundant GUI screens, I adopted a mature image descriptor, the scale-invariant feature transform (SIFT) [39], to extract robust features that are invariant to rotation, translation, and scaling. The keypoints are detected using a Hessian matrix approximation, and SIFT assigns a distinctive descriptor (feature vector) to each keypoint.
To demonstrate the proposed approach for selecting unique UI frames, consider an input video V = {f_1, f_2, …, f_n}, in which each f_i is processed through the first phase (see Section 2.2) to output the cropped UI_i, if any. Note that for each f_i there could be many GUI screens or none at all. For each video, there is a new set that contains the extracted GUIs, denoted as UI = {UI_1, UI_2, …, UI_m}, where m indicates the total number of UIs extracted from the video. Duplicate frames were detected and removed as follows. First, the feature vector was extracted from each UI_i and saved in memory (to extract the features only once). Second, the Euclidean distance between each feature extracted from UI_i and those of UI_j was computed, where j ≥ i + 1 and j ≤ m. Two UIs were considered similar if the percentage of matched descriptors was above a pre-defined threshold.
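A sketch of this duplicate check is given below, assuming OpenCV’s SIFT implementation; Lowe’s ratio test is used here as one common way to decide which descriptors count as matched, and the match-ratio threshold is a placeholder for the tuned value from Section 3.4.

```python
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()

def match_ratio(img1, img2):
    """Fraction of SIFT descriptors in img1 with a confident match in img2."""
    _, des1 = sift.detectAndCompute(img1, None)
    _, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None or len(des2) < 2:
        return 0.0
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / len(des1)

def is_duplicate(gui_i_path, gui_j_path, threshold=0.6):
    """Two cropped GUI screens are duplicates when enough descriptors match."""
    img_i = cv2.imread(gui_i_path, cv2.IMREAD_GRAYSCALE)
    img_j = cv2.imread(gui_j_path, cv2.IMREAD_GRAYSCALE)
    return match_ratio(img_i, img_j) >= threshold
```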
To support the claim that SIFT performs best at removing duplicate UI frames from video programming tutorials, an empirical evaluation was performed on four different methods. In particular, I compared SIFT to another image descriptor, speeded-up robust features (SURF) [39], and two pixel-wise algorithms. I utilized two of the most popular pixel-wise comparisons: (i) perceptual image differences (PIDs), used to detect mobile GUI changes [40,41], and (ii) the structural similarity index (SSIM), used to detect changes between programming video frames [14]. PID compares two images based on models of the human visual system [42], whereas SSIM compares the pixel intensities of two images [43]. The four methods remove duplicates based on a similarity threshold, and to ensure a fair comparison, I experimented with each method using different thresholds ranging from 0.5 to 0.95 with a step size of 0.05.

3. Empirical Evaluation

In this section, the empirical evaluation of VID2META is introduced, covering the collection of the datasets, the classification and annotation of the frames, and the evaluation itself in terms of (i) its accuracy in locating the RoIs within video frames and predicting the programming language, (ii) its effectiveness in extracting and fixing Java and UI elements, and (iii) its accuracy in removing duplicate GUI screens. As such, the research questions (RQs) were formulated as follows:
RQ 1  
How accurately can VID2META classify and localize code-editing windows, project windows, and GUIs in Android programming screencasts?
RQ 2  
Does VID2META outperform previous work that localized the code-editing window?
RQ 3  
How accurately can VID2META detect the programming language of the text inside the code-editing window?
RQ 4  
How accurately can VID2META extract and fix Java and UI elements?
RQ 5  
Which computer vision method performs the best in eliminating duplicate GUIs?
All RQs, except the fourth one, use the dataset collected to train the object detector (Section 2.2.1). The methodology to answer the first two RQs is explained in Section 3.1. The data collection process used to train the deep learning classifier and answer RQ3 is introduced in Section 3.2. Section 3.3 presents details about the process of collecting brand new videos (with source code) and creating the ground-truth Java and UI elements to answer RQ4. Last, Section 3.4 shows the methodology used to evaluate different CV techniques to remove duplicate GUIs and answer RQ5.

3.1. Methodology: Localizing Code-Editing Windows, Project Windows, and GUIs

To answer RQ1, I used the annotated frames from the 20 videos and performed 10-fold cross-validation across videos. To ensure that the averaged results of the 10-fold cross-validation converge to a stable value, previous works [44,45] have identified this precision problem and suggested repeating the 10-fold cross-validation until the results converge; accordingly, I determined a precision level of 0.01 in the experiments. During each experiment, I split the videos into 80% training, 10% validation, and 10% testing. Thus, we can be certain that we are testing a new set of frames from untrained videos. I trained Faster R-CNN with the re-sized input image (600 × 1024) for a total of 10,000 iterations (i.e., until the validation loss was stable). Note that new network weights were initialized for each fold to ensure that no information (features) learned from other folds was reused. Through back-propagation, the network weights were adjusted after each iteration using a momentum optimizer. A batch size of 16 with a stride of (8 × 8) was used for the hyperparameters of the network layers. As part of data augmentation to improve the network’s learning performance, I randomly scaled some of the training regions. The implementation was based on the TensorFlow API (https://github.com/tensorflow/tensorflow, accessed on 30 June 2021) and ran on a machine with an Intel Xeon 3.40 GHz processor, 128 GB RAM, and a GeForce GTX 1080 GPU with 8 GB of memory for ∼150 h for all folds (10 different experiments).
To answer RQ2, I compared VID2META to the open-source implementation of codemotion [46]. Unlike previous works [9,36], which only applied the Canny edge detector [47] to detect the code-editing window, codemotion proposed seven heuristics to filter out noisy edges and enhance the detection process. The edges in the input images were detected using Canny, and I followed the heuristic steps of codemotion to detect the code-editing window, which include (i) detecting the endpoints of horizontal and vertical line segments using the Hough transform [48], (ii) discarding the endpoints that are not close to the center of the image, (iii) connecting points in such a way that they closely form horizontal and vertical lines, and (iv) detecting the surrounding rectangle in addition to the inner rectangle detected in the previous step. The output of these steps is a set of candidate rectangles, and we are interested in the ones that contain code; therefore, OCR was applied to each rectangle to predict the one that is most likely to contain code.
The classification performance to answer RQ1 was computed as follows. I used the standard metrics of precision, recall, and F1 score. Precision is defined as $P = \frac{T_p}{T_p + F_p}$, recall is computed as $R = \frac{T_p}{T_p + F_n}$, and the F1 score is the harmonic mean of precision and recall, defined as $F_1 = \frac{2 \cdot P \cdot R}{P + R}$. I additionally evaluated the localization accuracy based on the standard intersection over union (IoU) metric [22,49,50]. Formally, IoU is defined as $IoU^{pred}_{gt} = \frac{Area(pred \cap gt)}{Area(pred \cup gt)}$, where pred is the predicted location and gt is the location of the ground-truth annotation. An IoU threshold determines the overall prediction performance (i.e., if an IoU is above a predefined threshold, the prediction is considered correct). Since a lower threshold would typically result in higher accuracy and a higher threshold would generally decrease the overall accuracy, I computed the results at a high IoU threshold (0.90). I computed average precision (AP) to show the overall accuracy of the model. AP is used as a standard metric in several object detection competitions [51,52]. The proposed model predicts the location of each key window with a confidence score, and the predictions are sorted by this score in descending order. Then, a prediction is considered correct if its IoU is greater than the threshold. I also computed the overall accuracy of the prediction by dividing the total number of correct predictions by the total number of predictions at the 0.90 IoU threshold.
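A minimal sketch of the IoU computation and the accuracy at a fixed threshold is shown below; boxes are assumed to be in [xmin, ymin, xmax, ymax] format.

```python
def iou(pred, gt):
    """Intersection over union of two boxes given as [xmin, ymin, xmax, ymax]."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.90):
    """Fraction of predictions whose IoU with the matched ground truth
    exceeds the threshold (0.90 in the reported experiments)."""
    correct = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return correct / len(predictions)

print(iou([0, 0, 100, 100], [10, 10, 110, 110]))  # ~0.68
```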

3.2. Methodology: Detecting Programming Language

VID2META is designed to detect the programming language written in the code-editing window of the IDE using two different approaches as follows.
First, the selected file in the project window was located, and OCR was applied to extract the file’s name with its extension. Using the file extension, a prediction of the opened code-editing window’s content was made (e.g., Java or XML). As a prior step, we needed to evaluate the approach’s accuracy in finding the selected file in the first place. Therefore, I used all annotated PWs shown in Table 1 and fed them into the CTPN network. The CTPN outputs a prediction file that includes the coordinates of each text entry presented in a PW. I parsed each file and obtained each coordinate (xmin, xmax, ymin, and ymax). Only the vertical coordinates were required, as we needed to re-draw the bounding box to the full width of the PW (i.e., xmin = 0 and xmax = width). To this end, we ensured that each bounding box covers the entire background. Then, the RGB values of each pixel in the bounding box were obtained to find the most dominant one. Finally, the ΔE*_ab distance was computed (see Equation (1)) between all pairs of dominant colors. Two dominant colors were considered similar if the distance between them was less than a threshold. The threshold was defined empirically by normalizing the delta (i.e., (ΔE*_ab/100) < 9).
Second, I employed a GuessLang network that uses a deep learning model with natural language processing techniques to guess a given text’s programming language. Although GuessLang has been trained to detect 30 programming languages, none of these languages were related to Android programming (e.g., Java and XML). Thereby, I fine-tuned GuessLang and trained it with two classes: Java and XML. I initially used the Android dataset collected by Businge et al. [32]. I opted to choose this dataset for several reasons: (i) the apps have already been published in the play store and (ii) it contains a diversity of apps that belong to more than 15 categories. I filtered out the files that contain code other than Java or XML, and ensured that Java and XML files are not empty. I selected a total of 15,000 Java and 15,000 XML files from 675 Android repositories. I split the dataset into 20,000 training and 10,000 testing sets. Then, GuessLang was trained on this dataset for a total of 50 epochs.
I evaluated the performance of detecting the programming language on a total of 4656 code-editing windows using the above two approaches. Note that I already annotated the content of each code-editing window with Java or XML (Section 2.2.2). For the first approach, I simply predicted the code-editing window’s programming language based on the file extension. To ensure the second approach’s practicality, I did not use the ground-truth coordinates of each code-editing window; instead, I used the cross-video code-editing predictions. I OCRed the predicted code-editing window and the entire frame. Then, I used the model to predict the OCRed editing window’s programming language and the OCRed entire frame. Note that I applied OCR on the editing window and the entire frame to compare their results. For the two approaches, the precision, recall, and F-Score were computed.

3.3. Methodology: Extracting Java and UI Elements

I used the 20 videos collected in Section 2.2.1 to determine the best thresholds for detecting incorrect Java and UI elements. The choice of these thresholds impacts the accuracy of detecting which elements are correct or incorrect. For example, the NLD threshold determines the similarity between two elements, and one of them is deleted based on the NLD threshold and the count threshold (Section 2.4.1). The ground-truth Java and UI elements for each video were manually created. The ground-truth Java elements contain all unique import statements, class names, method names, and calls, whereas the ground-truth UI elements contain the names of all unique UI elements. I then applied OCR (https://www.google.com/drive/, accessed on 30 June 2021) on the predicted code-editing window and experimented with different NLD and count thresholds. In each experiment, the Jaccard similarity index (see Equation (3)) was computed, which quantifies the similarity between the list of ground-truth elements and the list of predicted elements by dividing their intersection by their union. Eventually, the NLD and count thresholds that maximized the Jaccard index over the 20 videos were chosen.
$$Jaccard(e_1, e_2) = \frac{|e_1 \cap e_2|}{|e_1 \cup e_2|} = \frac{|e_1 \cap e_2|}{|e_1| + |e_2| - |e_1 \cap e_2|} \qquad (3)$$
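A sketch of the Jaccard computation and the threshold search is shown below; the grid of candidate thresholds and the extract_elements callback are illustrative assumptions.

```python
def jaccard(ground_truth: set, predicted: set) -> float:
    """Jaccard similarity index between two element sets (Equation (3))."""
    union = ground_truth | predicted
    return len(ground_truth & predicted) / len(union) if union else 1.0

def tune_thresholds(videos, extract_elements, nld_grid, count_grid):
    """Pick the NLD/count threshold pair that maximizes the mean Jaccard index
    over the tuning videos; extract_elements(video, nld_t, count_t) is assumed
    to run the extraction pipeline with the given thresholds."""
    best = (None, None, -1.0)
    for nld_t in nld_grid:
        for count_t in count_grid:
            scores = [jaccard(v.ground_truth, extract_elements(v, nld_t, count_t))
                      for v in videos]
            mean_score = sum(scores) / len(scores)
            if mean_score > best[2]:
                best = (nld_t, count_t, mean_score)
    return best
```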
To ensure a robust evaluation and avoid data snooping biases on the thresholds, I manually collected 50 brand new Android programming screencasts from YouTube. To facilitate the process of creating the ground-truth Java and UI elements, I chose Android videos that already had an attached source code. I manually validated the correctness of the attached source code for each video and made changes accordingly. There were two cases: (i) videos did not show one or more files that were attached (I simply deleted those files) and (ii) some files were opened in the videos but not attached (those files were manually transcribed). After I corrected the ground-truth source code files, I created the ground-truth information for Java and UI elements as follows. First, I employed srcML to map each Java file to an XML document. This XML document contains information about each element presented in the Java file, such as import statements, class names, method names, and method calls. As such, I used XPath queries to extract a ground-truth list for each element type (e.g., imports). Second, I created the ground-truth UI elements by applying a simple regex to the ground-truth XML files. To validate whether a UI element is correct, the UI element has to be found in the Android API (https://developer.android.com, accessed on 30 June 2021) (i.e., I crawled all valid UI elements from the official Android documentation). I manually added third-party and custom-defined UI elements to each video’s ground-truth UI elements (I found only three third-party UI elements and one custom-defined element in the 50 videos).
After preparing the ground-truth information, I processed each video as follows: (i) I extracted one frame per second, (ii) I used the trained model to predict and crop the code-editing window, and (iii) I applied OCR on those windows and categorized the content of the code-editing window using the approach discussed in Section 3.2.
To answer RQ4, several experiments were performed as follows. First, I did not apply the error-fixing approaches to the elements extracted from (i) the OCRed predicted code-editing window and (ii) the OCRed entire frame. Second, I applied the approaches (explained in Section 2.4) to pre-process the OCRed text from (i) the OCRed predicted code-editing window and (ii) the OCRed entire frame and to fix/detect errors in the extracted Java and XML elements. It should be noted that I performed some experiments on the entire frame to quantify the importance of detecting the code-editing window prior to the extraction process. Eventually, I computed the number of correct, incorrect, and missing elements for each experiment. In addition, I determined whether the results of the approaches in extracting Java elements from the predicted code-editing window were statistically significant compared to the results of blindly extracting the Java elements from the entire frame and from the predicted code-editing window. To perform the statistical analysis, I conducted the non-parametric Wilcoxon rank-sum test with a confidence interval of 95%. I used the Wilcoxon test as it only makes the assumptions of independence and equal variance, but it does not assume the data have a known distribution. I also measured the magnitude of the effect using Cliff’s delta (δ) effect size [53]. Cliff’s δ ranges in the interval [−1, 1], and its magnitude is interpreted as defined in Equation (4).
$$\text{effect size} = \begin{cases} \text{negligible} & \text{if } |\delta| \le 0.147 \\ \text{small} & \text{if } 0.147 < |\delta| \le 0.33 \\ \text{medium} & \text{if } 0.33 < |\delta| \le 0.474 \\ \text{large} & \text{if } |\delta| > 0.474 \end{cases} \qquad (4)$$
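A sketch of this statistical comparison is shown below, assuming SciPy for the Wilcoxon rank-sum test and a direct implementation of Cliff’s delta; the per-video Jaccard scores are placeholder values.

```python
from scipy.stats import ranksums

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    lesser = sum(1 for x in xs for y in ys if x < y)
    return (greater - lesser) / (len(xs) * len(ys))

def magnitude(delta):
    """Interpretation of |delta| following Equation (4)."""
    d = abs(delta)
    if d <= 0.147:
        return "negligible"
    if d <= 0.33:
        return "small"
    if d <= 0.474:
        return "medium"
    return "large"

# Per-video Jaccard scores of the two compared configurations (placeholders).
jaccard_cew = [0.82, 0.75, 0.90, 0.88]
jaccard_frame = [0.41, 0.35, 0.52, 0.47]

stat, p_value = ranksums(jaccard_cew, jaccard_frame)
delta = cliffs_delta(jaccard_cew, jaccard_frame)
print(p_value < 0.05, magnitude(delta))
```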

3.4. Methodology: Eliminating Duplicate GUI Screens

To answer RQ5, I used ten randomly selected videos from the dataset that included GUI screens. I extracted one frame per second for each video and used the model to predict and crop the UI region. This resulted in a total of 3594 GUI frames, most of which are duplicates. To evaluate the CV methods’ performance in detecting duplicate GUIs, we must have a set of ground-truth frames (i.e., a set of unique and sufficient GUI screens for each video). To create the ground-truth set of each video, another annotator (who was not aware of the CV methods) clustered similar UI frames together. An effective CV method must keep exactly one frame from each cluster. Keeping more than one frame from a cluster means that the algorithm failed to identify duplicate frames, whereas not selecting a frame from a cluster means that the algorithm removed an important UI frame. I performed the empirical evaluation using SURF, SIFT, PID, and SSIM with different similarity thresholds between 0.50 and 0.95 with a step size of 0.05. Two images were considered similar if the similarity between them was greater than a pre-defined threshold.
The accuracy was measured in terms of precision, recall, and F-Score. The true positives (TP) indicate the total number of selected frames, in which each frame belongs to only one cluster. The false positives (FP) indicate the total number of duplicate frames (e.g., more than one frame was selected from a cluster). The false negatives (FN) indicate the total number of unselected frames that should have been selected (e.g., none of the frames belonging to a cluster was selected). Formally, precision is defined as $P = \frac{T_p}{T_p + F_p}$, in which $T_p + F_p$ is the total number of selected frames. Recall is defined as $R = \frac{T_p}{T_p + F_n}$, in which $T_p + F_n$ is the total number of clusters. The results are reported in terms of F-Score because it is a synthetic measure that considers both precision and recall. The average F-Score is computed over the ten videos.

4. VID2META’s Results

This section presents the results of the empirical evaluations to answer the five research questions.

4.1. RQ1,2: Localizing Code-Editing Windows, Project Windows, and GUIs

To answer RQ1, I computed the classification and localization accuracy and reported the results in Table 2. As it is very important that the predicted bounding box overlaps well with the actual bounding box, I evaluated the localization accuracy at an IoU threshold of 0.90. Table 2 demonstrates the results and clearly shows that the model can accurately classify and locate the three windows. Even when I evaluated the model on unseen videos (i.e., cross-video evaluation), it still performed very well, which indicates that the features of the code-editing windows and their shapes and outlines are predictable.
To answer RQ2, I computed the IoU of the bounding box predicted by codemotion and that of the proposed approach for each input image that contained a code-editing window (i.e., 2590 Java and 2066 XML code-editing windows). The results of the evaluation are presented as boxplots (Figure 2). The proposed approach significantly outperformed codemotion, with median and mean IoU values of 97% and 96%, as compared to 62% and 64%, respectively. I inspected the results to see why codemotion failed to accurately find the code-editing window in some frames and found that, even with the heuristics, Canny edge detection was still very sensitive to the frame’s noise and failed to detect some edges of the code-editing windows.

4.2. RQ3: Detecting Programming Language

In total, the CTPN network predicted 104,852 bounding boxes for the text entries presented in 6731 PWs. The results in Table 3 show the precision, recall, and F-Score values for each category as follows. First, the precision of the selected category shows that the approach was successful in locating a high number of selected entries in the PW (i.e., a total of 6700 out of 6731). Second, the proposed approach found unselected files with a 99% F-Score. Note that the background color of the unselected files ranges from white to light colors such as yellow, but since a similarity threshold was defined, they were detected as very similar to the background of the other unselected files and different from the selected one (typically, the selected one has a darker color). Thus, the results yield a high accuracy for both categories, which answers RQ3. I also computed the results in terms of precision, recall, and F-Score for the two methods used to predict the code-editing window’s content, as follows. First, using only the selected file’s extension from the project window to determine the content (e.g., file.xml refers to XML content), I obtained 89% for precision, recall, and F-Score for Java, whereas they were 90% for XML. I noticed that the results were highly impacted by the fact that (i) a PW is sometimes minimized, (ii) there was sometimes no selected file in the PW, and (iii) a programmer sometimes switches to another file through the tab that is typically located above the code-editing window. Given these limitations, I introduced the results of the second method. Table 4 further answers RQ3 and shows that the trained model is very accurate in distinguishing between Java and XML. I evaluated the model using the extracted text from the predicted bounding box as well as from the entire frame. The results indicate that when the model guesses the entire frame’s programming language, a lot of noise is introduced, yielding a low F-Score. On the other hand, guessing the programming language of the predicted bounding box improved the accuracy by at least 26%.

4.3. RQ4: Extracting Java and UI Elements

Table 5 reveals the results of extracting Java elements from the predicted code-editing window before and after using the approaches to pre-process and detect incorrect elements. I computed the Jaccard similarity index (see Equation (3)) between the ground-truth Java elements and the extracted Java elements before and after error detection. Although the number of correctly extracted Java elements before error detection was relatively high, the number of incorrect ones was also high, which yielded a low Jaccard similarity. This is due to the issues I previously discussed in Section 2.4.1, which were mainly caused by OCR errors. I applied the approaches to pre-process the OCRed code and detect any potentially incorrect elements. As shown in Table 5, the number of incorrect Java elements was reduced significantly, which indicates that the approaches can reliably detect and discard incorrect elements. The Jaccard similarity improved by 20%, 23%, 18%, and 52% for the import statements, class names, method names, and method calls, respectively. Trivially, I could extract the Java elements and detect OCR errors without even locating the code-editing window. Unfortunately, applying OCR on the entire frame and fixing OCR errors is a non-trivial challenge [5,36]. Table 6 confirms this, as I extracted Java elements from the entire OCRed frame and from the OCRed predicted code-editing window and applied the proposed approaches to detect OCR errors. Notably, the lowest Jaccard index was obtained for the Java elements extracted from the entire frame without error detection, followed by those extracted from the code-editing window without error detection, followed by those extracted from the entire frame with error detection. The highest Jaccard index was achieved when Java elements were extracted from the code-editing window and the proposed approach was applied to detect OCR errors (see the underlined results in Table 6), which improved the overall accuracy by an average of 40%. The results with statistically significantly lower Jaccard scores (at the 0.05 significance level) compared to the results of the proposed approach are marked with a superscript symbol, where a cross (‘†’) indicates a medium or large effect size and an asterisk (‘*’) indicates a small effect size. The results illustrate that the proposed approach produces statistically significant improvements in extracting all Java elements.
The results of extracting UI elements and detecting OCR errors in them are as follows. A total of 72 XML files were presented in 41 videos (out of 50), containing a total of 257 UI elements. Of these, 253 widget names came from the Android API (e.g., button, fragment), three came from third-party APIs (e.g., youtubeplayerview), and one was a custom-defined UI element. I computed the Jaccard similarity index to compare the list of UI elements extracted from the OCRed XML editing window with the ground-truth annotation. As the pre-defined list contained all possible valid UI elements from the three resources, an accuracy of 100% was obtained. Had the approach for detecting incorrect UI elements not been applied, the Jaccard index would have been only 52%.
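The allow-list check described above can be illustrated with the short sketch below; the VALID_UI_ELEMENTS set and the example widget names are hypothetical stand-ins for the pre-defined list built from the Android API, third-party widgets, and custom-defined elements.

```python
# Sketch of filtering OCRed UI element names against a pre-defined allow-list.
# Only a few illustrative names are shown; the real list would cover the full
# Android API plus third-party and custom widgets.
VALID_UI_ELEMENTS = {
    "Button", "TextView", "EditText", "RecyclerView", "Fragment",
    "com.google.android.youtube.player.YouTubePlayerView",  # third-party widget example
}

def filter_ui_elements(ocr_names):
    """Keep only OCRed widget names that appear in the pre-defined list."""
    return [name for name in ocr_names if name in VALID_UI_ELEMENTS]

print(filter_ui_elements(["Button", "Buttom", "TextView"]))  # ['Button', 'TextView']
```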

4.4. RQ5: Eliminating Duplicate GUI Screens

Figure 3 answers RQ5 and shows that SIFT outperformed every other method, achieving the best average F-Score of 84% at a similarity threshold of 60%. The second-best method was also based on an image’s distinctive feature descriptors (SURF) and achieved a 78% score at a 55% similarity threshold. The highest scores achieved by PID and SSIM were 71% and 75%, respectively. The precision values of SIFT and SSIM were 91% and 78% at their best-performing similarity thresholds. This means that SSIM kept more duplicate frames than SIFT, as it compares the values at the same pixel positions of two images rather than their features (i.e., it is inaccurate when there is a slight translation and very sensitive to noise). The SIFT descriptor, on the other hand, is invariant to translation, rotation, and scaling. I also measured each algorithm’s running time when applied pairwise to 100 GUI frames (i.e., the average time complexity is Θ(n²)). SSIM was the fastest with only 68 s, whereas PID was the slowest with 553 s. The running times of SURF and SIFT were similar, at 126 s and 129 s, respectively.
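For illustration, the following OpenCV-based sketch shows how a SIFT similarity score between two GUI frames could be computed and thresholded at 60%. The similarity measure (ratio of good matches to the smaller keypoint count) and the 0.75 Lowe ratio are assumptions and may differ from the exact implementation used here.

```python
# Hedged sketch of SIFT-based duplicate detection between two GUI frames
# (requires opencv-python >= 4.4, where SIFT is available without the contrib package).
import cv2

def sift_similarity(img_a, img_b):
    """Return a similarity score in [0, 1] based on distinctive SIFT matches."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test keeps only matches that are clearly better than the runner-up.
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    return len(good) / max(min(len(kp_a), len(kp_b)), 1)

def is_duplicate(path_a, path_b, threshold=0.60):
    """Treat two GUI frames as duplicates if their SIFT similarity reaches the threshold."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    return sift_similarity(img_a, img_b) >= threshold
```

Frames whose similarity exceeds the threshold would be treated as duplicates and discarded, keeping a single representative GUI screen.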

5. Related Work

This section presents the related work (summarized in Table 7) on (i) code extraction from programming screencasts, (ii) mining programming screencasts, and (iii) mining image-based software artifacts using CNNs.

5.1. Code Extraction from Programming Screencasts

Several works have been devoted to extracting source code from programming screencasts by applying OCR to video frames [5,36,37,38,54]. Yadid et al. [36] and Bao et al. [38] trained a statistical language model [78] on source code and consolidated the code across frames to detect and correct OCR errors. Ponzanelli et al. [9,55] used an island parser to cope with the noise in the OCRed code. None of these works extracted information related to the Android layout, such as UI elements. Furthermore, I extracted fine-grained information such as API calls, which is more suitable for indexing videos and complementing them with metadata.

5.2. Mining Programming Screencasts

A sizable body of work has been dedicated to facilitating the extraction of source code from programming screencasts [6,9,13,36]. Ott et al. [13] proposed a CNN-based classifier to predict the presence of source code in a video frame. That work addressed only the classification of the entire frame, without locating the source code’s bounding box or extracting the source code. Several works locate the source code’s bounding box by applying the Canny edge detection algorithm [79] to the entire IDE frame and using heuristics to guess which bounding box contains the source code [5,36,54]. A replication of Codemotion [5], which relies on the Canny edge algorithm, shows that the proposed approach is more accurate (see Figure 2). Furthermore, none of these works extracted UI elements from Android videos.

5.3. Mining Image-Based Software Artifacts Using CNNs

It has become increasingly common within the software engineering community to fine-tune CNN models to solve image classification problems. Zhao et al. [14] proposed ActionNet, which uses Inception ResNet V2 [11] to analyze the changes that occur when a developer writes code on the fly throughout a programming screencast. Ott et al. [13] proposed an approach that classifies video frames obtained from programming screencasts into four categories using VGG-16 [27]. Along the same lines, Bao et al. [38] trained the top layers of the VGG network to classify frames into code and non-code categories. Yang et al. [80] proposed an approach that helps developers find live-streamed programming videos by automatically classifying video frames as belonging to live-streamed or pre-recorded videos. The work most closely related to VID2META was proposed by Alahmadi et al. [15,81], who leveraged an object detector to locate the code bounding box. While the code-localization step is similar, VID2META makes a further contribution by finding, extracting, and fixing Java and UI elements, which better complement Android programming screencasts.

6. Conclusions and Future Work

In this paper, VID2META is proposed, an automated approach that takes an Android video tutorial as input and analyzes its visual and textual content to extract meaningful code elements and GUIs. Extracting these elements enables developers to (i) search for code elements, (ii) decide whether a video is relevant, and (iii) navigate within a video or to external API resources. I performed an extensive evaluation of VID2META on a total of 70 videos (20 videos to evaluate the localization accuracy and 50 videos to assess the accuracy of extracting correct code elements). The results illustrate that VID2META can accurately locate the code-editing window and extract code elements with an average accuracy above 90%.
Several extensions can be derived from this work to further advance the state of the art in mining programming screencasts. I plan to create concise and meaningful tags for programming videos based on the code information presented in the frames, the audio, and the metadata of the videos. In addition, I will study whether current search engines (e.g., YouTube) can be improved by indexing the extracted metadata for mobile programming screencasts.

Funding

This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-22-DR-22). The author, therefore, acknowledges the University of Jeddah for its technical and financial support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Storey, M.A.; Singer, L.; Cleary, B.; Figueira Filho, F.; Zagalsky, A. The (R) evolution of social media in software engineering. In Proceedings of the on Future of Software Engineering; ACM: New York, NY, USA, 2014; pp. 100–116. [Google Scholar] [CrossRef]
  2. MacLeod, L.; Bergen, A.; Storey, M.A. Documenting and sharing software knowledge using screencasts. Empir. Softw. Eng. 2017, 22, 1478–1507. [Google Scholar] [CrossRef]
  3. Lin, Y.T.; Yeh, M.K.C.; Tan, S.R. Teaching Programming by Revealing Thinking Process: Watching Experts’ Live Coding Videos with Reflection Annotations. IEEE Trans. Educ. 2022, 1–11. [Google Scholar] [CrossRef]
  4. Pongnumkul, S.; Dontcheva, M.; Li, W.; Wang, J.; Bourdev, L.; Avidan, S.; Cohen, M.F. Pause-and-play: Automatically linking screencast video tutorials with applications. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, 16–19 October 2011; pp. 135–144. [Google Scholar]
  5. Khandwala, K.; Guo, P.J. Codemotion: Expanding the design space of learner interactions with computer programming tutorial videos. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, London, UK, 26–28 June 2018; pp. 1–10. [Google Scholar] [CrossRef]
  6. MacLeod, L.; Storey, M.A.; Bergen, A. Code, camera, action: How software developers document and share program knowledge using YouTube. In Proceedings of the 23rd IEEE International Conference on Program Comprehension (ICPC’15), Washington, DC, USA, 18–19 May 2015; pp. 104–114. [Google Scholar]
  7. Parra, E.; Escobar-Avila, J.; Haiduc, S. Automatic tag recommendation for software development video tutorials. In Proceedings of the 26th Conference on Program Comprehension, Gothenburg, Sweden, 28–29 May 2018; pp. 222–232. [Google Scholar]
  8. Pavel, A.; Reed, C.; Hartmann, B.; Agrawala, M. Video digests: A browsable, skimmable format for informational lecture videos. In Proceedings of the UIST 2014, Honolulu, HI, USA, 5–8 October 2014; ACM: New York, NY, USA, 2014; Volume 10, pp. 2642918–2647400. [Google Scholar]
  9. Ponzanelli, L.; Bavota, G.; Mocci, A.; Di Penta, M.; Oliveto, R.; Hasan, M.; Russo, B.; Haiduc, S.; Lanza, M. Too Long; Didn’t Watch!: Extracting Relevant Fragments from Software Development Video Tutorials. In Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), Austin, TX, USA, 14–22 May 2016; pp. 261–272. [Google Scholar] [CrossRef]
  10. Granka, L.A.; Joachims, T.; Gay, G. Eye-tracking analysis of user behavior in WWW search. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 25–29 July 2004; pp. 478–479. [Google Scholar]
  11. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv 2016, arXiv:1602.07261. [Google Scholar] [CrossRef]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
  13. Ott, J.; Atchison, A.; Harnack, P.; Bergh, A.; Linstead, E. A deep learning approach to identifying source code in images and video. In Proceedings of the 15th IEEE/ACM Working Conference on Mining Software Repositories, Gothenburg, Sweden, 28–29 May 2018; pp. 376–386. [Google Scholar]
  14. Zhao, D.; Xing, Z.; Chen, C.; Xia, X.; Li, G.; Tong, S.J. ActionNet: Vision-based workflow action recognition from programming screencasts. In Proceedings of the 41st IEEE/ACM International Conference on Software Engineering (ICSE’19), Montreal, QC, Canada, 25–31 May 2019. [Google Scholar]
  15. Alahmadi, M.; Hassel, J.; Parajuli, B.; Haiduc, S.; Kumar, P. Accurately predicting the location of code fragments in programming video tutorials using deep learning. In Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering—PROMISE’18; ACM Press: Oulu, Finland, 2018; pp. 2–11. [Google Scholar]
  16. Bernal-Cárdenas, C.; Cooper, N.; Moran, K.; Chaparro, O.; Marcus, A.; Poshyvanyk, D. Translating Video Recordings of Mobile App Usages into Replayable Scenarios. arXiv 2020, arXiv:2005.09057. [Google Scholar]
  17. Chen, C.; Feng, S.; Xing, Z.; Liu, L.; Zhao, S.; Wang, J. Gallery DC: Design Search and Knowledge Discovery through Auto-created GUI Component Gallery. Proc. ACM Hum. Comput. Interact. 2019, 3, 1–22. [Google Scholar] [CrossRef]
  18. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE CVPR, Honolulu, HI, USA, 21–26 July 2017; Volume 4. [Google Scholar]
  19. Qian, N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999, 12, 145–151. [Google Scholar] [CrossRef]
  20. Ott, J.; Atchison, A.; Harnack, P.; Best, N.; Anderson, H.; Firmani, C.; Linstead, E. Learning lexical features of programming languages from imagery using convolutional neural networks. In Proceedings of the 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), Gothenburg, Sweden, 27 May–3 June 2018. [Google Scholar]
  21. Ellmann, M.; Oeser, A.; Fucci, D.; Maalej, W. Find, understand, and extend development screencasts on YouTube. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Software Analytics; ACM: New York, NY, USA, 2017; pp. 1–7. [Google Scholar]
  22. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef] [Green Version]
  23. Cheng, M.M.; Zhang, Z.; Lin, W.Y.; Torr, P. Binarized normed gradients for objectness estimation at 300 fps. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  24. Tian, K.; Revelle, M.; Poshyvanyk, D. Using Latent Dirichlet Allocation for Automatic Categorization of Software. In Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, Vancouver, BC, Canada, 16–17 May 2009; pp. 163–166. [Google Scholar]
  25. Huang, W.; Qiao, Y.; Tang, X. Robust scene text detection with convolutional neural networks induced mser trees. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; p. 3. [Google Scholar]
  26. Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 56–72. [Google Scholar]
  27. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  28. He, P.; Huang, W.; Qiao, Y.; Loy, C.C.; Tang, X. Reading scene text in deep convolutional sequences. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  29. Robertson, A.R. The CIE 1976 color-difference formulae. Color Res. Appl. 1977, 2, 7–11. [Google Scholar] [CrossRef]
  30. Buchgeher, G.; Cuénez, M.; Czech, G.; Dorninger, B.; Exler, M.; Moser, M.; Pfeiffer, M.; Pichler, J. Software Analytics and Evolution Team Report 2017. 2018. Available online: https://www.researchgate.net/publication/312297650_Software_Analytics_and_Evolution_-_Team_Report_2016 (accessed on 10 June 2022).
  31. Di Sipio, C.; Rubei, R.; Di Ruscio, D.; Nguyen, P.T. A Multinomial Naïve Bayesian (MNB) Network to Automatically Recommend Topics for GitHub Repositories. In Proceedings of the Evaluation and Assessment in Software Engineering, Trondheim, Norway, 15–17 April 2020; pp. 71–80. [Google Scholar]
  32. Businge, J.; Openja, M.; Kavaler, D.; Bainomugisha, E.; Khomh, F.; Filkov, V. Studying Android App Popularity by Cross-Linking GitHub and Google Play Store. In Proceedings of the 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hangzhou, China, 24–27 February 2019; pp. 287–297. [Google Scholar]
  33. Collard, M.L.; Decker, M.J.; Maletic, J.I. Lightweight transformation and fact extraction with the srcML toolkit. In Proceedings of the 2011 IEEE 11th International Working Conference on Source Code Analysis and Manipulation, Williamsburg, VA, USA, 25–26 September 2011; pp. 173–184. [Google Scholar]
  34. Medeiros, F.; Lima, G.; Amaral, G.; Apel, S.; Kästner, C.; Ribeiro, M.; Gheyi, R. An investigation of misunderstanding code patterns in C open-source software projects. Empir. Softw. Eng. 2019, 24, 1693–1726. [Google Scholar] [CrossRef]
  35. Abid, N.J.; Sharif, B.; Dragan, N.; Alrasheed, H.; Maletic, J.I. Developer reading behavior while summarizing java methods: Size and context matters. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), Montreal, QC, Canada, 25–31 May 2019; pp. 384–395. [Google Scholar]
  36. Yadid, S.; Yahav, E. Extracting code from programming tutorial videos. In Proceedings of the 6th ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!’16); ACM: Amsterdam, The Netherlands, 2016; pp. 98–111. [Google Scholar]
  37. Khormi, A.; Alahmadi, M.; Haiduc, S. A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts. In Proceedings of the 17th IEEE/ACM Working Conference on Mining Software Repositories, Seoul, Korea, 29–30 June 2020; pp. 376–386. [Google Scholar]
  38. Bao, L.; Xing, Z.; Xia, X.; Lo, D.; Wu, M.; Yang, X. psc2code: Denoising Code Extraction from Programming Screencasts. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2020, 29, 1–38. [Google Scholar] [CrossRef]
  39. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  40. Moran, K.; Li, B.; Bernal-Cárdenas, C.; Jelf, D.; Poshyvanyk, D. Automated reporting of GUI design violations for mobile apps. arXiv 2018, arXiv:1802.04732. [Google Scholar]
  41. Moran, K.; Watson, C.; Hoskins, J.; Purnell, G.; Poshyvanyk, D. Detecting and summarizing GUI changes in evolving mobile apps. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Montpellier, France, 3–7 September 2018; pp. 543–553. [Google Scholar]
  42. Yee, H.; Pattanaik, S.; Greenberg, D.P. Spatiotemporal sensitivity and visual attention for efficient rendering of dynamic environments. ACM Trans. Graph. (TOG) 2001, 20, 39–65. [Google Scholar] [CrossRef]
  43. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  44. Du, X.; Wang, T.; Wang, L.; Pan, W.; Chai, C.; Xu, X.; Jiang, B.; Wang, J. CoreBug: Improving effort-aware bug prediction in software systems using generalized k-core decomposition in class dependency networks. Axioms 2022, 11, 205. [Google Scholar] [CrossRef]
  45. Qu, Y.; Zheng, Q.; Chi, J.; Jin, Y.; He, A.; Cui, D.; Zhang, H.; Liu, T. Using K-core Decomposition on Class Dependency Networks to Improve Bug Prediction Model’s Practical Performance. IEEE Trans. Softw. Eng. 2019, 47, 348–366. [Google Scholar] [CrossRef]
  46. Karlson, A.K.; Meyers, B.R.; Jacobs, A.; Johns, P.; Kane, S.K. Working overtime: Patterns of smartphone and PC usage in the day of an information worker. In Proceedings of the International Conference on Pervasive Computing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 398–405. [Google Scholar]
  47. Canny, J. A computational approach to edge detection. In Readings in Computer Vision; Elsevier: Amsterdam, The Netherlands, 1987; pp. 184–203. [Google Scholar]
  48. Matas, J.; Galambos, C.; Kittler, J. Robust detection of lines using the progressive probabilistic hough transform. Comput. Vis. Image Underst. 2000, 78, 119–137. [Google Scholar] [CrossRef]
  49. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  50. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
  51. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
  52. Shrivastava, A.; Gupta, A. Contextual priming and feedback for faster r-cnn. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 330–348. [Google Scholar]
  53. Romano, J.; Kromrey, J.D.; Coraggio, J.; Skowronek, J. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys. In Proceedings of the Annual Meeting of the Florida Association of Institutional Research, Cocoa Beach, FL, USA, 1–3 February 2006; pp. 1–33. [Google Scholar]
  54. Ponzanelli, L.; Bavota, G.; Mocci, A.; Di Penta, M.; Oliveto, R.; Russo, B.; Haiduc, S.; Lanza, M. CodeTube: Extracting relevant fragments from software development video tutorials. In Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), Austin, TX, USA, 14–22 May 2016; pp. 645–648. [Google Scholar]
  55. Ponzanelli, L.; Bavota, G.; Mocci, A.; Oliveto, R.; Di Penta, M.; Haiduc, S.C.; Russo, B.; Lanza, M. Automatic identification and classification of software development video tutorial fragments. IEEE Trans. Softw. Eng. 2017, 45, 464–488. [Google Scholar] [CrossRef]
  56. Moslehi, P.; Adams, B.; Rilling, J. Feature location using crowd-based screencasts. In Proceedings of the 15th International Conference on Mining Software Repositories—MSR ’18, Gothenburg, Sweden, 28–29 May 2018; ACM Press: Gothenburg, Sweden, 2018; pp. 192–202. [Google Scholar] [CrossRef]
  57. Bao, L.; Pan, P.; Xing, X.; Xia, X.; Lo, D.; Yang, X. Enhancing Developer Interactions with Programming Screencasts through Accurate Code Extraction. In Proceedings of the 28th ACM/SIGSOFT International Symposium on Foundations of Software Engineering (FSE’20), Virtual Event, 8–13 November 2020; ACM: Sacramento, CA, USA, 2020. [Google Scholar]
  58. Bao, L.; Li, J.; Xing, Z.; Wang, X.; Xia, X.; Zhou, B. Extracting and analyzing time-series HCI data from screen-captured task videos. Empir. Softw. Eng. 2017, 22, 134–174. [Google Scholar] [CrossRef]
  59. Bao, L.; Li, J.; Xing, Z.; Wang, X.; Zhou, B. Reverse engineering time-series interaction data from screen-captured videos. In Proceedings of the 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Montreal, QC, Canada, 2–6 March 2015; pp. 399–408. [Google Scholar]
  60. Bao, L.; Xing, Z.; Xia, X.; Lo, D. VT-Revolution: Interactive programming video tutorial authoring and watching system. IEEE Trans. Softw. Eng. 2018, 45, 823–838. [Google Scholar] [CrossRef]
  61. Bao, L.; Xing, Z.; Xia, X.; Lo, D.; Li, S. VT-revolution: Interactive programming tutorials made possible. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, FL, USA, 4–9 November 2018; pp. 924–927. [Google Scholar]
  62. Poché, E.; Jha, N.; Williams, G.; Staten, J.; Vesper, M.; Mahmoud, A. Analyzing user comments on YouTube coding tutorial videos. In Proceedings of the 25th International Conference on Program Comprehension, Buenos Aires, Argentina, 22–23 May 2017; pp. 196–206. [Google Scholar]
  63. McGowan, A.; Hanna, P.; Anderson, N. Teaching programming: Understanding lecture capture YouTube analytics. In Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, Arequipa, Peru, 9–13 July 2016; pp. 35–40. [Google Scholar]
  64. Chen, C.H.; Guo, P.J. Improv: Teaching programming at scale via live coding. In Proceedings of the Sixth (2019) ACM Conference on Learning@ Scale, Chicago, IL, USA, 24–25 June 2019; pp. 1–10. [Google Scholar]
  65. Eghan, E.E.; Moslehi, P.; Rilling, J.; Adams, B. The missing link—A semantic web based approach for integrating screencasts with security advisories. Inf. Softw. Technol. 2020, 117, 106197. [Google Scholar] [CrossRef]
  66. Best, N.; Ott, J.; Linstead, E. Exploring the Efficacy of Transfer Learning in Mining Image-Based Software Artifacts. arXiv 2020, arXiv:2003.01627. [Google Scholar] [CrossRef]
  67. Ott, J.; Atchison, A.; Linstead, E.J. Exploring the applicability of low-shot learning in mining software repositories. J. Big Data 2019, 6, 35. [Google Scholar] [CrossRef]
  68. Moran, K.; Bernal-Cárdenas, C.; Curcio, M.; Bonett, R.; Poshyvanyk, D. Machine Learning-Based Prototyping of Graphical User Interfaces for Mobile Apps. IEEE Trans. Softw. Eng. 2018, 46, 196–221. [Google Scholar] [CrossRef]
  69. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  70. Chen, C.; Feng, S.; Liu, Z.; Xing, Z.; Zhao, S. From Lost to Found: Discover Missing UI Design Semantics through Recovering Missing Tags. arXiv 2020, arXiv:2008.06895. [Google Scholar] [CrossRef]
  71. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  72. Zhao, D.; Xing, Z.; Chen, C.; Xu, X.; Zhu, L.; Li, G.; Wang, J. Seenomaly: Vision-Based Linting of GUI Animation Effects Against Design-Don’t Guidelines. In Proceedings of the 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea, 5–11 October 2020. [Google Scholar]
  73. White, T.D.; Fraser, G.; Brown, G.J. Improving random GUI testing with image-based widget detection. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, Beijing, China, 15–19 July 2019; pp. 307–317. [Google Scholar]
  74. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  75. Chen, J.; Xie, M.; Xing, Z.; Chen, C.; Xu, X.; Zhu, L.; Li, G. Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination? In Proceedings of the 28th ACM/SIGSOFT International Symposium on Foundations of Software Engineering (FSE’20), Virtual Event, 8–13 November 2020; ACM: Sacramento, CA, USA, 2020. [Google Scholar]
  76. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  77. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6569–6578. [Google Scholar]
  78. Rosenfeld, R. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE 2000, 88, 1270–1278. [Google Scholar] [CrossRef]
  79. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intel. 1986, PAMI-8, 679–698. [Google Scholar] [CrossRef]
  80. Yang, C.; Thung, F.; Lo, D. Efficient Search of Live-Coding Screencasts from Online Videos. arXiv 2022, arXiv:2203.04519. [Google Scholar]
  81. Alahmadi, M.; Khormi, A.; Parajuli, B.; Hassel, J.; Haiduc, S.; Kumar, P. Code Localization in Programming Screencasts. Empir. Softw. Eng. 2020, 25, 1536–1572. [Google Scholar] [CrossRef]
Figure 1. An overview of VID2META for extracting code elements and GUIs from Android screencasts.
Figure 2. Boxplots of the IoU obtained by applying the Codemotion [5] approach and the proposed approach to detect the code region.
Figure 3. The accuracy of removing duplicate GUI screens using four different methods (feature-wise and pixel-wise) with different similarity thresholds.
Table 1. Statistics of the dataset with respect to the annotated regions in terms of a single-region system (exactly one region in an image) and a multi-region system (at least two regions in an image).
System | Region | # of Images | # of Regions
Multi-Region | PW & Java | 2077 | 4154
Multi-Region | PW & GUI | 1683 | 3366
Multi-Region | PW & XML & GUI | 1138 | 3414
Multi-Region | PW & XML | 928 | 1856
Multi-Region | PW & GUI & GUI | 611 | 1833
Multi-Region | GUI & GUI | 101 | 202
Single-Region | Java | 513 | 513
Single-Region | GUI | 453 | 453
Single-Region | PW | 294 | 294
Total | | 7798 | 16,085
Overall | GUI | 3986 | 7390
Overall | XML | 2066 | 2066
Overall | Java | 2590 | 2590
Overall | PW | 6731 | 6731
Table 2. The classification and localization results using 10-fold cross-validation. The average precision (AP) and the accuracy are computed with IoU@90.
Category | Classification: Precision | Recall | F-Score | Localization: AP (IoU@90) | Accuracy (IoU@90)
Code-editing window | 0.99 | 0.99 | 0.99 | 0.96 | 0.97
Project window | 0.99 | 0.99 | 0.99 | 0.95 | 0.96
GUI | 0.94 | 0.99 | 0.96 | 0.94 | 0.94
Table 3. The classification results of finding the selected entry in the project window.
Category | Precision | Recall | F-Score
Selected | 0.99 | 0.98 | 0.99
Unselected | 0.99 | 1.00 | 0.99
Table 4. The classification results of predicting the written language of the entire frame as compared to that of the code-editing window (CEW).
Category | Entire Frame: Precision | Recall | F-Score | Predicted CEW: Precision | Recall | F-Score
Java | 1.00 | 0.58 | 0.73 | 0.99 | 1.00 | 0.99
XML | 0.53 | 1.00 | 0.69 | 0.96 | 1.00 | 0.98
Table 5. The results of extracting Java elements from programming screencasts before and after applying the approaches to detect incorrect elements from the predicted code-editing window.
Java Element (GT Total) | Without Error Detection: #Correct | #Missing | #Incorrect | Jacc. | With Error Detection: #Correct | #Missing | #Incorrect | Jacc.
Import Statements (185) | 176 | 9 | 109 | 0.68 | 173 | 12 | 17 | 0.88
Class Names (60) | 50 | 10 | 29 | 0.72 | 58 | 2 | 4 | 0.95
Method Names (176) | 168 | 8 | 110 | 0.70 | 166 | 10 | 18 | 0.88
Method Calls (524) | 529 | 13 | 2338 | 0.28 | 528 | 14 | 163 | 0.80
Table 6. The Jaccard index similarity between the set of Java elements extracted from the entire frame and from the predicted code-editing window and that of the ground-truth annotation, computed with and without error detection.
Java Element (GT Total) | Without Error Detection: Entire Frame | Code-Editing Window | With Error Detection: Entire Frame | Code-Editing Window
Import Statements (185) | 0.65 † | 0.68 † | 0.72 † | 0.88
Class Names (60) | 0.45 † | 0.72 * | 0.76 † | 0.95
Method Names (176) | 0.58 † | 0.70 † | 0.78 * | 0.88
Method Calls (524) | 0.25 † | 0.28 † | 0.64 † | 0.80
Results marked with a superscript are statistically significantly worse than the best result in each row (code-editing window with error detection) at the 0.05 significance level; ‘†’ = medium or large effect size; ‘*’ = small effect size.
Table 7. This table surveys the related work on code extraction, mining programming screencasts, and the works that applied transfer learning in classifying/localizing image artifacts.
Code Extraction from Programming Screencasts
Paper | # of Videos | Programming Language | Frame Comparison | Detect Frame with Code | OCR Engine
Ponzanelli et al. [9,54,55] | 150 | Java | Pixel-Wise | Heuristics | Tesseract
Yadid et al. [36] | 40 | Java (Android) | N/A | Canny-Edge | Tesseract
Moslehi et al. [56] | 5 | PHP (WordPress) | OCR | N/A | Google Vision
Khandwala et al. [5] | 20 | Java, Javascript, Python, PHP, Ruby, C, C# | N/A | Canny-Edge | Tesseract
Bao et al. [38,57] | 50 | Java | Pixel-Wise | CNN Classifier | Tesseract
Khormi et al. [37] | 300 | Java, Python, C# | N/A | Manual | GDrive, ABBYY, GOCR, Tesseract, OCRAD
Mining Programming Screencasts
Paper | # of Videos | Type of Information | Aim (Method)
Zhao et al. [14] | 50 | Video Frames | Action Detection (CNN)
Bao et al. [58,59] | N/A (29-h Videos) | Video Frames | Extract HCI data (CV)
Bao et al. [60,61] | 3 | Operating System | Tutorial Authoring System (OS)
Ott et al. [13] | 40 | Video Frames | Frame Classification (CNN)
Ott et al. [20] | 100 | Video Frames | Prog. Lang. Detection (CNN)
Parra et al. [7] | 75 | Video Metadata | Automatic Tagging (IR and NLP)
Poche et al. [62] | 12 | Video Comments | Comments Classification (ML)
McGowan et al. [63] | N/A (Two Java Courses) | Student’s Engagement | Analysis of Viewing Behaviours (User Study)
Chen et al. [64] | 30 | IDE | Combine Slides With IDE (Web Technologies)
Eghan et al. [65] | 48 | Video Frames, Audio, and Metadata | Link Video to Other Artifacts (Knowledge Modelling)
Mining Image-Based Software Artifacts Using CNNs
Paper | Image Artifact | Classification or Localization? | Network
Zhao et al. [14] | Programming Video Frames | Classification | Inception Resnet V2 [11]
Ott et al. [13,20] and Bao et al. [38,57] | Programming Video Frames | Classification | VGG-16 [27]
Best et al. [66] and Ott et al. [67] | UML Diagrams | Classification | VGG-16
Moran et al. [68] | Mobile UI Components | Classification | AlexNet [69]
Chen et al. [70] | Mobile UI Components | Classification | Resnet [71]
Zhao et al. [72] | Mobile Image-Based Buttons | Classification | Resnet-101 [71]
Bernal-Cárdenas et al. [16] | Touch Indicator in Mobile UI | Both | Faster R-CNN [12] and AlexNet
Chen et al. [17] | Mobile UI Components | Both | Faster R-CNN
White et al. [73] | Mobile UI Components | Both | YOLOv2 [74]
Chen et al. [75] | Mobile UI Components | Both | Faster R-CNN, YOLOv3 [76], CenterNet [77]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
