Category Maps Describe Driving Episodes Recorded with Event Data Recorders †

This study was conducted to create driving episodes using machine-learning-based algorithms that address long-term memory (LTM) and topological mapping. This paper presents a novel episodic memory model for driving safety according to traffic scenes. The model incorporates three important features: adaptive resonance theory (ART), which learns time-series features incrementally while maintaining stability and plasticity; self-organizing maps (SOMs), which represent input data as a map with topological relations using self-mapping characteristics; and counter propagation networks (CPNs), which label category maps using input features and counter signals. Category maps represent driving episode information that includes driving contexts and facial expressions. The bursting states of respective maps produce LTM created on ART as episodic memory. For a preliminary experiment using a driving simulator (DS), we measure gazes and face orientations of drivers as their internal information to create driving episodes. Moreover, we measure cognitive distraction according to effects on facial features shown in reaction to simulated near-misses. Evaluation of the experimentally obtained results shows the possibility of using recorded driving episodes with image datasets obtained using an event data recorder (EDR) with two cameras. Using category maps, we visualize driving features according to driving scenes on a public road and an expressway.


Introduction
Drivers adjust their focus and their behavior according to traffic conditions to maintain safety. For example, drivers carefully devote attention to pedestrians or bicycles when they drive near a school or a park. On an expressway, drivers devote attention to surrounding cars running at high speed.
Further, drivers will take extra care to avoid sleepiness when driving scenes do not often change. Therefore, prediction models for ensuring safety must adjust flexibly according to traffic changes, road conditions, environments, and situations. Advanced safety knowledge, danger prediction, and situational judgment are obtained not only from personal knowledge based on experiences and memory, but also from collective intelligence in terms of experience-based stories from family and friends, news from TV, radio, and newspapers, and lessons learned at driving schools [1]. However, existing prediction models are hindered by the limitations of event-based prediction, which relies on statistical information and probability models derived from sensor data and its histories.
Recently, automobile manufacturers, universities, research institutes, and Internet-related service companies have been investigating automatic driving cars that have autopilot-type assistance [2,3].
For wide-range and high-precision outside sensing, such cars use stereo cameras, millimeter-wave radar, and laser range finders for autopilot systems, which are in limited use on expressways. Moreover, real-time sensing and processing are actualized using originally customized processing devices. The performance of outside sensing is advancing steadily. For inside sensing, existing studies have targeted drowsy-driving detection [4], inattentive driving detection [5], cognitive distraction detection [6,7], and internal state estimation from facial expressions [8]. However, compared with outside sensing, numerous problems remain for human sensing in terms of effective measurements for visual differences among individuals, reproduction, and time-series changes of sensing targets [9]. Moreover, outside sensing and inside sensing are handled independently. No study results have been reported for both types of sensing together.
This study was conducted to create an episodic memory model in driving scenes using an event data recorder (EDR) with two cameras used simultaneously for inside and outside sensing. Several reports in the relevant literature [10,11] have described studies of driver gaze tracking and outside sensing for the detection of traffic lane deviation and driver carelessness. Nevertheless, no reports have described studies using an EDR: a simple device for sensing outside and inside environments together. Moreover, context recognition from scene images and facial expression recognition from face images have been studied individually in computer vision, human communication, and human-machine interfaces. Our report is the first to describe a trial of both images for creating episodic memory.
Studies of behavior prediction and intention understanding are extremely active in brain sciences. According to knowledge from brain sciences, one important role of human memory, with its accumulation and editing, is the prediction of future events [12]. Episodic memory [13], which combines the context of target scenes with emotions, makes an especially high contribution [14]. Maeno proposed an episodic memory model used for a robotic mind and an emotional system based on his originally proposed passive consciousness hypothesis [15]. He presented a basic conceptual model based on a thought experiment.
The aim of this study is the creation of a safety prediction model based on episodic memory using artificial neural networks of long-term memory (LTM) and topological mapping. This paper presents a novel episodic memory model for driving safety according to traffic scenes using machine-learning-based algorithms of three types: adaptive resonance theory (ART) networks [16], which learn time-series features incrementally while maintaining stability and plasticity; self-organizing maps (SOMs) [17], which represent input data as a map with topological relations using self-mapping characteristics; and counter propagation networks (CPNs) [18], which label category maps using input features and counter signals. Category maps represent driving episodes created from driving scenes and facial expressions using synchronized images obtained from outside and inside cameras on an EDR. The bursting states of respective maps produce LTM created on ART as episodic memory.
The remainder of this paper is organized as follows. Sections 2 and 3 present our proposed feature extraction and machine-learning-based methods. Section 4 addresses facial measurements in near-misses with cognitive distractions using a driving simulator (DS). As an evaluation experiment using an actual car, Section 5 addresses a prototype model of episodic memory using category maps. We analyze the relation between visualized category maps and actual near-misses using image datasets obtained using EDRs from three cars in summer and in winter. Finally, Section 6 concludes and highlights future work. We proposed the basic method in a preceding publication [19]. For this paper, we have improved the method and reviewed the detailed procedures, especially in Section 4.

The Procedure
Figure 1 depicts the procedures used for our proposed method. Using an EDR with two cameras, we obtained outside and inside time-series images simultaneously. Local features on high visual saliency regions are extracted from images obtained from an outside camera. For images obtained from an inside camera, facial expression features are extracted using Gabor wavelet descriptors after facial region detection. Extracted features are combined with our original machine-learning-based methods for representation as category maps. The following subsections outline the respective algorithms.

Saliency Maps
Visual information of various types is used for human recognition, understanding, and decisions. However, we do not use all information that we see momentarily. Humans have a mechanism to notice attentional objects or salient regions unconsciously. Itti et al. proposed saliency maps (SMs) [20] as a computational model for this attentional mechanism.
The brief procedure of SMs comprises four steps based on elemental computer vision algorithms [21–27]. The first step is low-level feature extraction using Gaussian filters after creating pyramid images at changed scales. The second step is to create component images of hue, brightness, and orientation. Subsequently, the third step is to create feature maps (FMs) that represent visual features of respective components after processing center-surround operations. The final step is to integrate the FMs into SMs with a linear combination. In the original model, winner-take-all (WTA) competition is conducted for extracting high saliency coordinates; for this study, we instead used high-saliency regions from SMs.
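The center-surround and linear-combination steps can be illustrated with a minimal sketch. It is not the implementation used in this study: a box mean stands in for the Gaussian pyramid, a single component image is assumed, and the names `box_mean`, `feature_map`, `saliency_map`, and `high_saliency_mask`, as well as the radii and the 0.5 threshold ratio, are hypothetical choices made only for illustration.

```python
def box_mean(img, r):
    """Mean over each pixel's (2r+1) x (2r+1) window, clipped at the borders.
    A box filter is used here as a simple stand-in for Gaussian smoothing."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = sum(window) / len(window)
    return out


def feature_map(component, r_center=1, r_surround=3):
    """Center-surround operation: |fine scale - coarse scale|, scaled to [0, 1]."""
    center = box_mean(component, r_center)
    surround = box_mean(component, r_surround)
    fm = [[abs(c - s) for c, s in zip(c_row, s_row)]
          for c_row, s_row in zip(center, surround)]
    peak = max(max(row) for row in fm)
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in fm]


def saliency_map(feature_maps, weights=None):
    """Linear combination of feature maps (equal weights unless given)."""
    weights = weights or [1.0 / len(feature_maps)] * len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(wt * fm[y][x] for wt, fm in zip(weights, feature_maps))
             for x in range(w)] for y in range(h)]


def high_saliency_mask(sm, ratio=0.5):
    """Binary mask of regions whose saliency is at least ratio * peak."""
    peak = max(max(row) for row in sm)
    return [[peak > 0 and v >= ratio * peak for v in row] for row in sm]
```

A region mask of this kind, rather than a single WTA coordinate, is what allows local features to be restricted to salient areas in a later step.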

AKAZE Descriptors
Gist [28] is a descriptor used to extract features from outdoor images, especially for global scenes such as mountains, lakes, and clouds in nature. Because of its rough granularity, it is unsuitable for describing features of driving scenes, in which roads and traffic signals are our target objects. As a part-based feature descriptor, scale-invariant feature transform (SIFT) [29] is used widely in computer vision studies, especially for generic object recognition. Using nonlinear scale spaces, KAZE descriptors [30], named after the Japanese word for wind, recorded higher performance than SIFT descriptors did. Recently, accelerated KAZE (AKAZE) descriptors [31] have attracted attention not only for excellent descriptive capability but also for low processing costs that suit real-time use. The rapid processing is actualized with several fundamental technologies [32–35] used for actual applications.

Face Detection
We used the face detection algorithm proposed by Viola et al. [36] to detect a driver's face from images obtained using an inside camera. Numerous face detection methods have been proposed as methods to detect objects from images [37]. The method proposed by Viola et al. [36] was an epoch-making method for real-time video image processing using a generally available computer. Now the method is the de facto standard for face detection. Numerous improvements have been applied to the method in terms of robustness for occluded images and non-frontal faces. The application range has been expanded to tablet computers and smartphones for real-time processing with low electric power consumption [38]. The mechanism of rapid processing is based on simple pattern features, called Haar-like features [39], and cascade-connected weak classifiers [40] created by AdaBoost [41].
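To illustrate why Haar-like features are cheap to evaluate, the following minimal sketch computes a two-rectangle feature with an integral image (summed-area table), the device on which the Viola-Jones detector relies. It covers only the feature computation, not the AdaBoost training or the classifier cascade, and the function names are hypothetical.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros.
    ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left corner (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_horizontal(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return left - right
```

Because each rectangle sum costs four table lookups regardless of rectangle size, a large number of weak classifiers can be evaluated per window in real time.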

Gabor Wavelets
Visual information obtained by the retinas is propagated to Visual Area 1 (V1) via the lateral geniculate nucleus (LGN) [42]. V1 comprises visual cells of two types: simple cells and complex cells. Simple cells in the LGN and V1 have receptive fields that compose a visual range in response to specific stimulation. Receptive fields respond to a specific figure size, length, orientation, color, and frequency. This feature is called selective response. Hubel and Wiesel found selective response for lines from their electrophysiological experiment using an anesthetized cat [43]. Selective orientation, which is one selective response, is examined specifically because similar features were achieved by Gabor wavelet filters as an engineering model [44]. Gabor wavelet filters have been used for various studies and applications in image processing and computer vision fields for enhancing specific features controlled by internal parameters.
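Orientation selectivity can be made concrete with a small sketch of the real part of a Gabor filter, a Gaussian envelope multiplied by a cosine carrier. The kernel size, `sigma`, `lam`, `gamma`, and the four orientations below are illustrative assumptions, not the parameters used in this study.

```python
import math


def gabor_kernel(size, sigma, theta, lam, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter on a size x size grid (size should be odd).
    theta: orientation in radians, lam: carrier wavelength in pixels,
    gamma: spatial aspect ratio, psi: phase offset."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates so the carrier varies along theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lam + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(patch, kernel):
    """Inner product of a patch and a kernel of the same size."""
    return sum(p * k for p_row, k_row in zip(patch, kernel)
               for p, k in zip(p_row, k_row))


# A hypothetical bank of four orientations: 0, 45, 90, and 135 degrees.
orientations = [k * math.pi / 4 for k in range(4)]
bank = [gabor_kernel(9, 2.0, th, 4.0) for th in orientations]
```

A patch of vertical stripes whose period matches `lam` yields a far larger response from the 0-degree kernel than from the 90-degree kernel, which is the engineering counterpart of the selective orientation described above.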

Adaptive Category Mapping Networks
Figure 2 portrays the network architecture of adaptive category mapping networks (ACMNs) [45] as a learning method for visualizing time-series features on a category map. ACMNs comprise three modules: a codebook module for vector quantization of input data, a labeling module for creating labels as candidates of categories, and a mapping module for visualizing spatial relations of categories on a category map. These modules respectively comprise self-organizing maps (SOMs) [17], adaptive resonance theory (ART) networks [16], and counter propagation networks (CPNs) [18]. Herein, SOMs and ART are unsupervised neural networks, whereas CPNs are supervised neural networks. ACMNs actualize both learning modes through an original mechanism that creates labels as candidates of categories. The following presents an overview of each module, followed by detailed explanations of the respective algorithms.
Input data are presented directly to the codebook module. This module is used if the dimensions of input features differ among datasets. For example, the dimensions vary according to the number of feature points for the scale-invariant feature transform (SIFT), which is used widely in generic visual object recognition as a part-based local feature. Using this module, input features are quantized to a fixed dimension and represented as histograms. Moreover, this module conducts vector quantization if the dimension of input features is high. For this process, data topology is preserved while changing to a low-dimensional space. Herein, this module is optional: it can be bypassed if the dimensions of the input features are fixed for all datasets. We use this bypass to reduce the learning load that arises when codebooks are created and updated incrementally.
The labeling module creates candidates of categories from input features adaptively and incrementally. Based on the learning of ART, this module actualizes incremental learning while maintaining plasticity and stability. Input data are assigned to an available category if similar features are included. A new unit is assigned on F2 as a new category candidate if no similar feature is included. The labeling module actualizes incremental learning through this mechanism. For supervised or semi-supervised learning modes, teaching signals are assigned as labels for units created using this module. Unit indexes are used as candidate labels in the unsupervised learning mode.
The mapping module produces category maps with the learning and mapping functions of CPNs using candidate labels of categories created from the labeling module. For this module, spatial relations among categories are visualized on category maps. Moreover, redundant labels, including noise signals that occurred partially, are removed using competitive learning in neighboring regions. When test datasets are presented, the decision process is conducted using this module, bypassing the labeling module. Herein, the module cannot learn incrementally, resembling the second layer of self-organizing incremental neural networks (SOINNs) [46]. The learning process occurs when a new dataset is presented to this module. However, this process uses training data obtained using not only candidate labels created from the labeling module but also labels in this module. This point differs from standard relearning. With this mechanism, ACMNs need not store training datasets for relearning. Rapid relearning is actualized using the minimum number of datasets.

Codebook Modules
For creating codebooks [47], k-means [48] is widely used. However, Vesanto et al. demonstrated that the clustering performance of SOMs is higher than that of k-means as a classic clustering method [49]. Moreover, Terashima et al. showed quantitatively that the false recognition rate is lower when using SOMs for clustering than when using k-means [50]. Therefore, SOMs are used for creating the codebooks that are utilized for this module.
Through a mechanism of neighborhood and competitive learning for self-mapping characteristics based on unsupervised learning, SOMs create clusters of similar input features. The SOM network architecture comprises two layers: the input layer and the mapping layer. For the input layer, the number of units equals the number of dimensions of the input features. The mapping layer comprises units that are arranged in a low dimension. For creating codebooks, we arranged the units on the mapping layer in one dimension because vector quantization is used for clustering. Learning is conducted so that one unit on the mapping layer bursts for each input datum.
The learning algorithm of SOMs is as follows. Let $x_i(t)$ and $w_{i,j}(t)$ respectively denote input data and weights from an input layer unit $i$ to a mapping layer unit $j$ at time $t$. Herein, $I$ and $J$ respectively denote the total numbers of units on the input layer and the mapping layer. $w_{i,j}(t)$ is initialized randomly before learning. The unit for which the Euclidean distance between $x_i(t)$ and $w_{i,j}(t)$ is the smallest is sought as the winner unit with index $c$ as

$$c = \underset{1 \le j \le J}{\operatorname{argmin}} \sqrt{\sum_{i=1}^{I} \left( x_i(t) - w_{i,j}(t) \right)^2}. \quad (1)$$

As a local region for updating weights, the neighborhood region $N_c(t)$ is defined around the winner unit $c$ as

$$N_c(t) = \left\lfloor \mu \cdot J \cdot \left( 1 - \frac{t}{O} \right) + 0.5 \right\rfloor. \quad (2)$$

Therein, $\mu$ ($0 < \mu < 1.0$) is the initial size of $N_c(t)$; $O$ is the maximum iteration for training. The coefficient 0.5 is added inside the floor function for rounding. Subsequently, $w_{i,j}(t)$ of $N_c(t)$ is updated to approach the input feature patterns as

$$w_{i,j}(t+1) = w_{i,j}(t) + \alpha(t) \left( x_i(t) - w_{i,j}(t) \right). \quad (3)$$

Therein, $\alpha(t)$ is a learning coefficient that decreases along with the progress of learning.
In the initial stage, the learning speed is high because this coefficient is high. In the final stage, the learning converges while the neighborhood range decreases.
For this module, the input features of $I$ dimensions are quantized into $J$ dimensions, which equals the number of units on the mapping layer. The module output is denoted as $y_j(t)$. This module is connected to the labeling module at the training phase. For the testing phase, its connection is switched to the mapping module. Moreover, this module is bypassed when input features are used directly without creating codebooks.
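The codebook module can be sketched as a one-dimensional SOM followed by a histogram of winner units. This is an illustrative sketch under stated assumptions, not the authors' implementation: the neighborhood radius is assumed to shrink linearly with floor rounding, $\alpha(t)$ is assumed to decay linearly, and the function names and default parameter values are hypothetical.

```python
import math
import random


def winner(weights, x):
    """Index of the mapping-layer unit with the smallest Euclidean distance to x."""
    return min(range(len(weights)),
               key=lambda j: math.sqrt(sum((xi - wi) ** 2
                                           for xi, wi in zip(x, weights[j]))))


def train_som_1d(data, n_units, iterations, mu=0.5, alpha0=0.5, seed=0):
    """Train a one-dimensional SOM; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(iterations):
        x = data[t % len(data)]
        decay = 1.0 - t / iterations
        radius = math.floor(mu * n_units * decay + 0.5)  # assumed shrinking rule
        alpha = alpha0 * decay                           # assumed linear decay
        c = winner(weights, x)
        for j in range(max(0, c - radius), min(n_units, c + radius + 1)):
            weights[j] = [w + alpha * (xi - w) for w, xi in zip(weights[j], x)]
    return weights


def codebook_histogram(weights, descriptors):
    """Quantize a variable number of descriptors into a fixed-length histogram."""
    hist = [0] * len(weights)
    for d in descriptors:
        hist[winner(weights, d)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)
```

With such a histogram, an image with any number of local descriptors is mapped to a vector whose length equals the number of mapping-layer units, which is what makes the input dimension fixed for the later modules.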

Labeling Module
The role of this module is to create labels used for category candidates. For this study, we created this module using ART, which is a theoretical model of unsupervised neural networks that creates labels adaptively and incrementally while preserving both plasticity and stability for time-series data.
Among ART of various types [51], we use ART-2 [16], which accepts continuous values as input. The network of ART-2 comprises two fields: Field 1 (F1) for feature representation and Field 2 (F2) for category representation. Here, F1 comprises six sub-layers: $p_i$, $q_i$, $u_i$, $v_i$, $w_i$, and $x_i$. The sub-layers actualize short-term memory (STM), which, acting as a filter, enhances features of input data and removes noise. Here, F2 actualizes long-term memory (LTM) based on finer or coarser recognition categories. LTM is created in each unit assigned to independent labels. The $j$-th unit of F2 and the sub-layer $p_i$ are connected by top-down weights $Z_{ji}$ and bottom-up weights $Z_{ij}$. The weights are initialized according to $J$, the number of units of F2. Subsequently, input data $x_i$ are presented to F1 and propagated through the sub-layers, where $a$ and $b$ respectively denote coefficients of the feedback loops from $u_i$ to $w_i$ and from $q_i$ to $v_i$.
$\theta$ is a parameter to control the noise detection level in $v_i$. $e$ is a coefficient to prevent zero from occurring in the denominator. Subsequently, the most active unit, with index $c$, is searched for on F2, and the weights for $c$ are updated. The vigilance threshold $\rho$ is used in the vigilance test (18) to ascertain whether input data belong correctly to a category, where $s$ is a coefficient for propagation from $p_i$ to $r_i$, and $d$ is a learning rate coefficient. Furthermore, $s \cdot \alpha/(1 - \alpha) \le 1$ is the constraint between them. When (18) is false, the active unit is reset and the next most active unit is searched for. If (18) is true, the process is repeated until the range of change of F1 is sufficiently small.
Herein, teaching signals are used for labels if ACMNs are used for supervised learning. The index $c$ is stored as a label if ACMNs are used for unsupervised learning.
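The vigilance-controlled creation of labels can be illustrated with a deliberately simplified sketch. It is not ART-2: the F1 sub-layer dynamics, noise suppression with $\theta$, and the separate top-down and bottom-up weights are all omitted, and cosine similarity between normalized vectors stands in for both the F2 activation and the vigilance test. The class name `SimpleArtLabeler` and its defaults are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


class SimpleArtLabeler:
    """Incremental labeling with a vigilance threshold (simplified, not ART-2)."""

    def __init__(self, rho=0.997, d=0.1):
        self.rho = rho          # vigilance threshold
        self.d = d              # learning rate for prototype updates
        self.prototypes = []    # one stored pattern per category unit

    def present(self, x):
        """Return the label (unit index) for x, creating a new unit if needed."""
        x = normalize(x)
        scores = [sum(a * b for a, b in zip(x, p)) for p in self.prototypes]
        if scores:
            c = max(range(len(scores)), key=scores.__getitem__)
            # In this simplified form the activation and the vigilance match are
            # the same quantity, so if the most active unit fails the test,
            # every other unit fails as well and no further search is needed.
            if scores[c] >= self.rho:
                p = self.prototypes[c]
                self.prototypes[c] = normalize(
                    [(1 - self.d) * pi + self.d * xi for pi, xi in zip(p, x)])
                return c
        self.prototypes.append(x)   # plasticity: a new category candidate
        return len(self.prototypes) - 1
```

Raising `rho` makes the test stricter, so more units are created; this is the sense in which the vigilance threshold controls classification granularity.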

Mapping Module
For this module, category maps are created as a learning result. We built this module using CPNs, which are supervised neural networks, to classify patterns into particular categories with the functions of competitive and neighborhood learning.
The network architecture of CPNs comprises three layers: an input layer, a mapping layer, and a Grossberg layer. The input layer and mapping layer in this module resemble those of SOMs. Teaching signals are presented to the Grossberg layer. For our method, labels that are assigned for F2 on ART-2 of the labeling module are used for teaching signals. Our method actualizes automatic labeling by combining CPNs with ART.
The order of units on F2 is assigned as labels used for teaching signals in the supervised learning mode. For the semi-supervised learning mode, mixed labels that include teaching signals and those without teaching signals created from ART are mapped on the category map. For the unsupervised learning mode, labels obtained using ART are used for learning CPNs. The usage of labels differs in each learning mode. Using the intermediate representation as labels, this module performs similar learning behaviors in the respective modes.
Learning results are represented as a category map on the mapping layer. Spatial relations among datasets based on similarity are visualized on a category map. ACMNs create it automatically without setting the number of categories. Moreover, redundant labels are removed through the process of competitive and neighborhood learning.
The learning algorithm of CPNs is as follows. Herein, for visualization characteristics of category maps, we set the mapping layer to a two-dimensional structure of $X \times Y$ units. We set the input and Grossberg layers to one dimension, although they can take any structure. The numbers of their units are $I$ and $K$, respectively. $u_{i,j(x,y)}(t)$ is the weight from an input layer unit $i$ to a mapping layer unit $j(x, y)$ at time $t$. $v_{j(x,y),k}(t)$ is the weight from a Grossberg layer unit $k$ to a mapping layer unit $j(x, y)$ at time $t$. These weights are initialized randomly before learning. $x_i(t)$ represents training data presented to the input layer unit $i$ at time $t$. The unit for which the Euclidean distance between $x_i(t)$ and $u_{i,j(x,y)}(t)$ is the smallest is sought as the winner unit, whose index is $c(x, y)$:

$$c(x, y) = \underset{(1,1) \le j(x,y) \le (X,Y)}{\operatorname{argmin}} \sqrt{\sum_{i=1}^{I} \left( x_i(t) - u_{i,j(x,y)}(t) \right)^2}.$$

The neighborhood region $N_{c(x,y)}(t)$ around $c(x, y)$ is defined using $\mu$ ($0 < \mu < 1.0$), the initial size of the neighborhood region, and $O$, the maximum iteration for training. The weights $u_{i,j(x,y)}(t)$ of $N_{c(x,y)}(t)$ are updated to approach the input feature patterns using Kohonen's learning algorithm as

$$u_{i,j(x,y)}(t+1) = u_{i,j(x,y)}(t) + \alpha(t) \left( x_i(t) - u_{i,j(x,y)}(t) \right).$$

Subsequently, $v_{j(x,y),k}(t)$ of $N_{c(x,y)}(t)$ is updated to approach the teaching signal patterns using Grossberg's learning algorithm as

$$v_{j(x,y),k}(t+1) = v_{j(x,y),k}(t) + \beta(t) \left( T_k - v_{j(x,y),k}(t) \right).$$

Therein, $T_k$ are teaching signals obtained using ART-2. $\alpha(t)$ and $\beta(t)$ are learning coefficients whose values decrease with the progress of learning from their initial values $\alpha(0)$ and $\beta(0)$. In the initial stage, the learning proceeds rapidly while the coefficients are high. In the final stage, the learning converges as the coefficients decrease. The category $L_k(t)$ is determined by searching for the $k$-th Grossberg unit for which $v_{j(x,y),k}(t)$ is the maximum. A category map is created after determining categories for all units. Test datasets are presented to the network that is created through learning. The mapping layer unit with the minimum Euclidean distance to the test data, that is, the unit whose feature pattern is most similar, bursts. The categories of these units are the recognition results of CPNs.
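The two update rules and the category search can be combined into a compact sketch of CPN training. It is an illustration under stated assumptions, not the authors' implementation: the neighborhood is taken as a square region whose radius shrinks linearly with floor rounding, both learning coefficients are assumed to decay linearly, teaching signals are one-hot vectors built from integer labels, and all function names and defaults are hypothetical.

```python
import math
import random


def winner(u, x):
    """Position (x, y) of the mapping-layer unit closest to input x."""
    return min(u, key=lambda pos: math.sqrt(
        sum((xi - ui) ** 2 for xi, ui in zip(x, u[pos]))))


def category_map_from(v):
    """Category of each unit: the Grossberg unit k with the largest weight."""
    return {pos: max(range(len(vk)), key=vk.__getitem__) for pos, vk in v.items()}


def train_cpn(data, labels, n_labels, X, Y, iterations,
              mu=0.5, alpha0=0.5, beta0=0.5, seed=0):
    """Train a CPN with an X x Y mapping layer; returns (u, category_map)."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [(x, y) for x in range(X) for y in range(Y)]
    u = {pos: [rng.random() for _ in range(dim)] for pos in units}
    v = {pos: [rng.random() for _ in range(n_labels)] for pos in units}
    for t in range(iterations):
        n = t % len(data)
        sample = data[n]
        teach = [1.0 if k == labels[n] else 0.0 for k in range(n_labels)]
        decay = 1.0 - t / iterations                       # assumed linear decay
        radius = math.floor(mu * max(X, Y) * decay + 0.5)  # assumed shrinking rule
        cx, cy = winner(u, sample)
        for (x, y) in units:
            if abs(x - cx) <= radius and abs(y - cy) <= radius:
                # Kohonen update toward the input feature pattern.
                u[(x, y)] = [ui + alpha0 * decay * (xi - ui)
                             for ui, xi in zip(u[(x, y)], sample)]
                # Grossberg update toward the teaching signal pattern.
                v[(x, y)] = [vk + beta0 * decay * (tk - vk)
                             for vk, tk in zip(v[(x, y)], teach)]
    return u, category_map_from(v)


def classify(u, category_map, x):
    """Category of the unit that bursts for test datum x."""
    return category_map[winner(u, x)]
```

Because a label survives on the map only where it wins the category search for at least one unit, labels that ART created for rare or noisy inputs can disappear, which corresponds to the removal of redundant labels described above.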

Measurement Setup
The purpose of this preliminary experiment is to evaluate facial measurements for the creation of driving episodes with simulated near-misses. Figure 3 depicts our DS, which has three displays and six actuators used to move the driver's seat. The realistic feeling of this DS is higher than that of DSs with a single display and a fixed seat. We used an RGB-D camera (Xtion pro Live; ASUSTeK Computer Inc., Taipei, Taiwan) to sense the driver's face. Using this camera, depth information is obtained from infrared dot patterns. Moreover, we used an eye tracking system (faceLAB; Seeing Machines, Canberra, Australia) for gaze motion measurements. Compared with an experiment using an actual car, we were able to take advanced measurements of facial features using a DS and the sensors above. Furthermore, we quantitatively evaluated drivers' biological information, especially facial feature changes for a near-miss situation under a distracted state. For this evaluation, we created near-miss scenarios of two cases. Both scenarios include a traffic scene at an intersection with a narrow perspective. We divide near-misses into two cases according to different movement patterns of a bicycle. Figure 4a depicts Case I: a bicycle runs in front of the car from the right to the left without braking. Figure 4b depicts Case II: a bicycle runs suddenly from the left of the intersection to the right along with the car after turning in the direction in front of the car. This experimental setup, regarding details pertaining to driving scenarios, simulation environments, and near-misses, was based on our former study [52].
For creating simulated cognitive distractions, we used questions that comprise the multiplication of single-digit numbers based on studies by Suenaga et al. [53] and Abe et al. [54]. We presented the questions to all subjects at 3 s intervals as information delivered vocally from a speaker. All subjects answered them verbally. We recorded answers using a microphone to calculate the correct rate. As simulation parameters, we set the weather to fine in the daytime, which provides high visibility. The subjects consisted of two women, Subjects A and B, and 10 men, Subjects C-L. All subjects were university students who had been licensed drivers for up to four years. We selected a subject for evaluation using questionnaires of driving characteristics because this study was conducted to create an individual episodic model.

Driving Characteristics
For measuring driving characteristics in advance, we used two questionnaires: a driving style questionnaire (DSQ) [55] and a workload sensitivity questionnaire (WSQ) [56] by the Research Institute of Human Engineering for Quality Life. The DSQ and WSQ respectively quantify driving styles, comprising driving attitudes and desires, and cognitive and driving burdens.

Measurement Results of Gaze and Face Orientation
Figure 6 depicts distribution results of gaze and face orientations for Subject C when the subject encountered a near-miss. The measurement range is from entry into an intersection to the termination of a right turn after stopping in front of a traffic sign. Results of Cases I and II reveal that the distribution complexity of face orientations is higher than that of gazes.
Face orientations varied in Case I because the driver moved his neck to check the non-visible intersection and the bicycle that ran from the right to the left. In contrast, face orientation changes to the right were slight in Case II because the bicycle that appeared suddenly from the left of the intersection passed along the same road as that used by cars. Face movements expanded upward and downward in Case II because the bicycle did not pass through the intersection as it did in Case I.
Using a DS, advanced and diverse information from drivers' faces can be measured without risk of a crash. However, it is still a challenging task to measure steady information from a driver in an actual car using a device such as faceLAB. From the experimentally obtained results, we found that face orientations show a tendency of diverse information similar to that of gazes. Positional values on the horizontal axes remain less than zero because the subject watched the lower half of the display. The result in Figure 7a shows that the face movements were slight for the near-miss. In contrast, Figure 7b, which is a result under the calculation task, depicts wide and rapid movements of face orientations after a slight delay of responses. The subject moved their head lower unnaturally while turning left. However, it is difficult to observe such responses repeatedly because most behaviors are mere single instances. For this study, we aimed to record these features as driving episodes because driving behaviors are actual measurement results.

Experimental Setup
EDRs are used widely not only for crash recording but also for the recording of driving scenes [58]. For this study, we used an EDR (GDR45DJ; Garmin Ltd., Schaffhausen, Switzerland) that comprises a front camera and a back camera connected by a USB cable. Synchronized time-series images are readily obtainable using the EDR as a cost-effective system. Table 1 presents major specifications of the EDR. The 132° diagonal and 120° horizontal viewing range allows for an expansive capture of the driving scene. Figure 8 depicts some obtained sample images. We installed the back camera on the dashboard of a car to take images of the driver's face, although its normal usage is rear-view monitoring. This installation provides face images captured from the lower side, as depicted in Figure 8b. The EDR has a function of saving a route using a global positioning system (GPS). One can check a route on an online map using an original tool provided by the manufacturer. This tool is used for a photographic navigation system using geotagging information that is included with each image.
We obtained image datasets from three cars. The acquisition period comprises two seasons: summer (July-August in 2015) and winter (January-February in 2016). The acquisition area was in Akita prefecture, which is an area of heavy snowfall in Japan. Therefore, our datasets include images obtained on snowy roads. Moreover, for extracting daily episodes from the drivers' daily car life, we obtained repeated data from similar routes for commuter trips.
The video files of this EDR were saved automatically as short video clips with maximum sizes of 255 MB, which corresponds approximately to a five-minute video clip for the two camera mode.
For this experiment, category maps were created for each video clip. We changed the sampling rate from 30 to 3 fps because high-rate image sequences include numerous mutually similar frames.
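As a worked example of this subsampling, reducing 30 fps to 3 fps keeps every tenth frame, so a clip of approximately five minutes (about 9000 frames per camera) yields about 900 images per category map. The helper below is a hypothetical sketch that assumes the source rate is an integer multiple of the target rate.

```python
def subsample_indices(n_frames, src_fps=30, dst_fps=3):
    """Indices of the frames kept when reducing src_fps to dst_fps."""
    if dst_fps <= 0 or src_fps % dst_fps != 0:
        raise ValueError("src_fps must be a positive integer multiple of dst_fps")
    step = src_fps // dst_fps   # 30 fps -> 3 fps keeps every 10th frame
    return list(range(0, n_frames, step))


# A five-minute clip at 30 fps: 5 * 60 * 30 = 9000 frames per camera.
kept = subsample_indices(5 * 60 * 30)
```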

Feature Extraction Results
Figure 9 depicts extracted scene features for an image obtained from the front camera. Figure 9b depicts an extraction result of AKAZE features from the original image depicted in Figure 9a. Feature points with orientations and scales are distributed on the road and over the traffic sign that is viewed by drivers. Figure 9c depicts high saliency regions extracted using SMs. Combining Figure 9b,c, Figure 9d depicts an extraction result of AKAZE features in high saliency regions, which are used as a binary mask image. Green lines are boundaries of highly salient regions that include traffic signs and white lines.
Figure 10 depicts face extraction results for images obtained from the back camera. Wide variation of face orientations or brightness leads to detection failure. However, drivers are fixed in their seat by a seat belt. Therefore, the variation of a face size and its position is less than that of normal face detection applications. Our method calculates the gap of size $s$ and center $(c_x, c_y)$ between the current region of interest (RoI) at time $t$ and the former RoI at $t - 1$, and corrects the RoI using the former RoI if the gap exceeds thresholds that were set in advance. Figure 10a,b respectively depict a successful example and a corrected example. The red RoI signifies that the former RoI was used for correction.
Figure 11 depicts GW images of four orientations used for input features. We created codebooks to convert the features of all images into input vectors of a fixed dimension. Figure 12 depicts codebooks for the images in Figure 8. The numbers of AKAZE features differ among images, although the descriptor comprises 61 dimensions. Using codebooks, features are integrated to a fixed dimension. In Figure 12a,b, we set the mapping layer on SOMs to 128 units, which determines the codebook dimension. The GW features after downsampling comprise 900 dimensions of 30 × 30 patches. We reduced them using codebooks to a dimension equivalent to that of the AKAZE codebooks. Herein, Figure 12c depicts codebooks created using k-means [48] for inside images as a visual comparison.
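The RoI correction can be sketched as follows. The threshold values, the use of Euclidean distance for the center gap, the fallback for a frame with no detection, and the names `RoI` and `correct_roi` are all assumptions made for illustration; the thresholds used in this study are not stated.

```python
import math
from typing import NamedTuple, Optional, Tuple


class RoI(NamedTuple):
    """Face region described by its center (cx, cy) and size s, in pixels."""
    cx: float
    cy: float
    size: float


def correct_roi(current: Optional[RoI], former: RoI,
                max_center_gap: float = 30.0,
                max_size_gap: float = 20.0) -> Tuple[RoI, bool]:
    """Return (RoI to use at time t, whether the former RoI was substituted)."""
    if current is None:
        # Assumed behavior: a failed detection also falls back to the former RoI.
        return former, True
    center_gap = math.hypot(current.cx - former.cx, current.cy - former.cy)
    size_gap = abs(current.size - former.size)
    if center_gap > max_center_gap or size_gap > max_size_gap:
        return former, True
    return current, False
```

This rule relies on the observation above that a seat belt keeps the face position and size nearly constant, so a sudden jump between consecutive frames is more likely a false detection than a real movement.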

Classification Granularity
Figure 13 depicts the numbers of ART labels and CPN categories according to changes in ρ from 0.9950 to 0.9980 in steps of 0.001. Although the number of ART labels increases monotonically, the number of CPN categories increases with some variation. The widening difference between the two numbers shows that the compression by CPNs works suitably for the redundancy of labels created by ART. Among the parameters, ρ most strongly determines classification granularity. For this study, we set ρ to 0.9970, which is the average of 0.9965 and 0.9975, as a steady range before the gap separating ART labels and CPN categories expands.
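The effect of ρ on granularity can be illustrated with a deliberately simplified sketch. It is not the ART network used in this study: it assumes ρ acts as a vigilance-style threshold on cosine similarity, keeps prototypes fixed after creation, and uses toy two-dimensional inputs. The function name `count_labels` is hypothetical.

```python
import numpy as np


def count_labels(features: np.ndarray, rho: float) -> int:
    """Count labels created by a threshold-based incremental clusterer.

    Simplification: an input joins an existing label when its cosine
    similarity to that label's prototype is at least rho; otherwise a new
    label is created. Prototypes are not updated afterwards.
    """
    prototypes = []
    for x in features:
        x = x / np.linalg.norm(x)
        similarities = [float(p @ x) for p in prototypes]
        if similarities and max(similarities) >= rho:
            continue  # match found: an existing label accepts the input
        prototypes.append(x)  # mismatch: create a new label
    return len(prototypes)


# Toy inputs: unit vectors at 0, 1, 2, and 10 degrees.
angles = np.deg2rad([0.0, 1.0, 2.0, 10.0])
toy = np.stack([np.cos(angles), np.sin(angles)], axis=1)

coarse = count_labels(toy, rho=0.98)
medium = count_labels(toy, rho=0.99)
fine = count_labels(toy, rho=0.9995)
```

On this toy input, raising the threshold splits the same data into more labels, which mirrors the trend of ART labels in Figure 13.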

Created Category Maps
First, we created individual category maps using images of driving scenes and facial expressions separately. We set the size of category maps to 50 × 50 = 2500 units to ensure a sufficient mapping space relative to the number of input images. Figure 14 depicts category maps for a dataset recorded on a public road. The category maps are represented using color temperatures from blue to red that correspond to low and high temperatures, respectively. Using color temperature, one can confirm the order of categories according to the distribution. The vertical bar beside the category maps shows the indexes of color temperature, which are divided by the number of categories. ART labels that CPNs removed through neighborhood and competitive learning are not included in the indexes of color temperature. The category maps in Figure 14 comprise 14 categories for outside images and nine categories for inside images. According to the color temperature distribution, the first half of the categories dominates in Figure 14a, whereas the last half dominates in Figure 14b.
Subsequently, Figure 15 depicts category maps for a dataset recorded on an expressway. Outside images and inside images were divided, respectively, into 13 categories and 8 categories. As with the results for a public road described above, the number of categories for outside images is greater than that for inside images. The categories of facial expressions are distributed throughout the map, although initial categories are distributed on the left and right regions of the map.
Figure 16 depicts category maps created using time-series images obtained from both outside and inside cameras. The codebook dimension of the combined input is double that of the respective inputs. The numbers of categories on a public road and an expressway are eight and seven, respectively. The color temperature shows that the first half of the labels occupies the majority of the distribution. Compared with the category maps in Figures 14 and 15, the effect of driving scenes is greater than that of facial expressions. We consider that this gap is caused by the differences between the codebooks depicted in Figure 12.
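The doubling of the input dimension can be illustrated as follows. This is a sketch under assumptions: it uses a bag-of-features histogram over nearest codewords as one common way to obtain a fixed-dimension vector from a codebook, and it uses random placeholder data in place of real codebooks and descriptors. The function name `codebook_histogram` is hypothetical.

```python
import numpy as np


def codebook_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map (n, d) descriptors to a normalized histogram over k codewords."""
    # Squared Euclidean distance from every descriptor to every codeword: (n, k).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)


rng = np.random.default_rng(0)
# Placeholder codebooks: 128 units each, 61-dimensional codewords.
outside_codebook = rng.random((128, 61))
inside_codebook = rng.random((128, 61))

# The number of descriptors varies per image; the histogram length does not.
outside_vec = codebook_histogram(rng.random((350, 61)), outside_codebook)
inside_vec = codebook_histogram(rng.random((90, 61)), inside_codebook)

# Combined input for one frame: 128 + 128 = 256 dimensions.
combined = np.concatenate([outside_vec, inside_vec])
```

Concatenation keeps the two sources in separate halves of the vector, so either camera's contribution can still be inspected on its own.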
Figure 17 depicts time-series category changes mapped as a color bar. The color map corresponds to the colors on the category map in Figure 16a. The color bar presents the categories according to their color temperature. The upper images respectively correspond to the 100th, 300th, 500th, and 700th frames. The route is outlined on the online map on the right side and proceeded as follows. First, the car ran through a residential area to a main street after starting from a parking lot. Then, the car stopped at a traffic light before an intersection. After a few minutes, the car moved forward again. The zone corresponding to the high-temperature color shows the state stopped in front of a traffic light. The high-temperature color categories are present between the 400th and 500th frames, which correspond to driving scenes in heavy traffic after entering a main road.
Figure 18 depicts time-series category mapping results obtained while driving on an expressway. Compared with the public road case, the category changes reflect monotonous driving scenes. A new category was created around the 620th frame because, at that point, the car entered a tunnel. Subsequently, category changes were slight, with momentary returns to existing categories.
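Reading such a time series amounts to locating where a category first appears and where transitions occur. The following minimal sketch shows one way to extract both from a per-frame category sequence; the function name `summarize_transitions` and the toy label sequence are illustrative assumptions, not data from the experiments.

```python
from typing import Dict, List, Sequence, Tuple


def summarize_transitions(
    labels: Sequence[int],
) -> Tuple[Dict[int, int], List[Tuple[int, int, int]]]:
    """Return first-appearance frames and (frame, from, to) transitions."""
    first_seen: Dict[int, int] = {}
    transitions: List[Tuple[int, int, int]] = []
    for frame, label in enumerate(labels):
        if label not in first_seen:
            first_seen[label] = frame  # a new category appears here
        if frame > 0 and label != labels[frame - 1]:
            transitions.append((frame, labels[frame - 1], label))
    return first_seen, transitions


# Toy sequence: category 2 appears late, after a return to category 0.
toy_labels = [0] * 5 + [1] * 3 + [0] * 2 + [2] * 4
first_seen, transitions = summarize_transitions(toy_labels)
```

A transition whose target category has its first appearance at that same frame marks the creation of a new category; all other transitions are returns to existing categories.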

Driving Episodes with Near-Misses
During the data acquisition from three cars, we obtained two actual near-misses, as shown in Table 2. Case I occurred during a summer evening. A compact car driven by an elderly man rushed suddenly out from a parking lot at a pachinko parlor. A crash was avoided by strong braking; the tires locked fully because the car had no anti-lock braking system (ABS). The compact car passed from east to west. We infer that the elderly driver did not notice the approaching car because of the bright evening sun. However, the actual reason was unclear; we did not hear any explanation from the driver. The dataset was classified into nine categories. In the first stage, categories were created throughout the map. In the middle stage, categories were created at the bottom of the map. The ninth category, created as the final category, appears in isolation in the upper right corner. Figure 20 depicts time-series classification results. The near-miss occurred between the 550th and 560th frames. The white circle depicted in Figure 21a corresponds to burst units related to the near-miss. Categories changed around category boundaries. Existing categories were used for the transition; no new category was created for this near-miss.
Case II was a slip accident on an icy road. The car reduced its speed immediately before it reached a corner. However, the brakes did not work sufficiently, although the driver applied full pressure to the brake pedal. The car went into a space covered with snow, and the ABS functioned automatically. There was no salient value from the acceleration sensor. Figure 21b depicts a category map of Case II. All six categories, fewer than in Case I, were distributed to several regions in small clusters. This distribution tendency indicates that the driving scenes differed partially but were similar globally. Figure 22 depicts the time-series classification results. Feature changes were salient because categories were mixed in a complex manner, with wide color temperature gaps. The near-miss occurred between the 390th and 400th frames. The category transition was active among existing categories. No new category was generated for the near-miss.

Conclusions
This study was undertaken to present driving episodes using machine-learning-based algorithms that address LTM and topological mapping. To this end, for preliminary experimentation using a DS, we measured the gazes and face orientations of drivers. Moreover, we measured the effects of cognitive distraction on facial features using simulated near-misses. Results show the possibility of recording driving episodes using image datasets obtained using an EDR with two cameras. Using category maps, we visualized driving features according to the driving scenes on a public road and an expressway. Moreover, we created original datasets that include near-misses and described the positions of these near-misses on category maps.
In future studies, we will apply our methods to unknown driving environments and integrate both the collective intelligence and the experiential knowledge of numerous drivers. Moreover, we will develop a new interface that can sense actual responses and use that information to support recognition, judgment, and safety evaluations for elderly drivers.

Figure 1. Procedures of our proposed method.

Figure 2. Overall architecture of adaptive category mapping networks (ACMNs), which comprise a codebook module, a labeling module, and a mapping module.

Figure 3. Driving simulator (DS) and measurement devices: (a) outside and (b) inside of the DS.

Figure 4. Simulated near-miss scenarios: (a) Case I and (b) Case II.

) false discovery. The respective questions are scored in four steps. The second, the WSQ, comprises 38 questions classified into 10 categories: (1) traffic condition posture, (2) road environmental posture, (3) hindrance of driving concentration, (4) decrease in body activity, (5) driving pace inhibition, (6) affliction, (7) driving path recognition and searching, (8) interior, (9) controls and operations, and (10) driving position. The respective questions are scored in five steps. Higher scores are interpreted as showing high driving sensitivity for each measurement category.

Figure 5 depicts measurement results obtained for the DSQ and the WSQ. Among all subjects, Subject C shows a salient tendency of worry and false discovery in the DSQ. Therefore, we analyzed Subject C in further detail.

Figure 6. Distribution of gaze and face orientations of Subject C for near-misses: (a) Case I and (b) Case II.

Figure 7 depicts time-series changes of face directions of Subject C for the near-miss scenario in Case II under normal driving and under a cognitive distraction [57] induced by a calculation task. Vertical and horizontal axes respectively portray the normalized DS display sizes and time t. Positional values on the horizontal axes maintain values of less than zero because the subject watched the lower half of the display. The result in Figure 7a shows that the face movements were slight for the near-miss. In contrast, Figure 7b, which is a result under the calculation task, depicts wide and rapid movements of face orientations after a slight delay of responses. The subject moved their head lower unnaturally while turning left. However, it is difficult to observe responses repeatedly because most behaviors are mere single instances. For this study, we aimed to record these features as driving episodes because driving behaviors are actual measurement results.

Figure 7. Time-series changes of face orientations of Subject C for a near-miss scenario in Case II: (a) normal driving and (b) cognitive distraction.

Figure 8. Sample images obtained using EDR: (a) front camera and (b) back camera.

Figure 12. Three 128-dimensional codebooks created using (a) self-organizing maps (SOMs) for outside images, (b) SOMs for inside images, and (c) k-means for inside images.

Figure 13. The relation between ρ and the numbers of adaptive resonance theory (ART) labels and counter propagation network (CPN) categories.

Figure 14. Category maps on a public road: (a) outside and (b) inside images.

Figure 15. Category maps on an expressway: (a) outside and (b) inside images.

Figure 16. Category maps of driving scenes and facial expressions inputted together: (a) a public road and (b) an expressway.

Figure 17. Time-series classification results obtained on a public road.

Figure 18. Time-series classification results on an expressway.

Figure 19 depicts salient signals of the acceleration sensor embedded in the EDR. The red arrow indicates the position of harsh braking related to the near-miss.

Figure 19. Output signals from the acceleration sensor. The red arrow indicates the position of braking during the near-miss of Case I.

Figure 20. Time-series category transition results in Case I.

Figure 21. Category maps: (a) Case I and (b) Case II. White circles show burst units for the near-misses.

Figure 22. Time-series category transition results in Case II.

Table 1. Major specifications of the event data recorder (EDR).