Point-Graph Neural Network Based Novel Visual Positioning System for Indoor Navigation

Abstract: Indoor localization is a basic element in location-based services (LBSs), including seamless indoor and outdoor navigation, location-based precision marketing, spatial recognition in robotics, augmented reality, and mixed reality. The popularity of LBSs in the augmented and mixed reality fields has increased the demand for a stable and efficient indoor positioning method. However, the problem of indoor visual localization has not been appropriately addressed, owing to the strict trade-off between accuracy and cost. Therefore, we use point cloud and RGB characteristic information for the accurate acquisition of three-dimensional indoor space. The proposed method is a novel visual positioning system (VPS) capable of determining the user's position by matching the pose information of an object estimated by an improved point-graph neural network (GNN) with the pose information label of the corresponding object in a voxel database addressed in predefined voxel units. We evaluated the performance of the proposed system considering stationary objects in indoor space. The results verify that high positioning accuracy and direction estimation can be achieved efficiently. Thus, spatial information of indoor space estimated using the proposed VPS can aid in indoor navigation.


Introduction
Visual positioning systems (VPSs) provide positioning information in situations that conventional navigation technology, such as the global positioning system (GPS), cannot handle. Therefore, VPSs are increasingly being used to improve the lifestyles of people worldwide. For instance, in places where GPS service is unavailable, such as indoor spaces, a VPS can use the user's mobile camera to identify the surroundings and visually determine directions [1].
In this study, we propose a novel VPS method that determines the position of the camera by matching the pose information of the object estimated using the improved point-graph neural network (GNN) with the pose information of the object in a predefined voxel database. The proposed method uses a sparsely down-sampled point cloud that can be obtained from mobile devices and uses RGB information as feature information. However, unlike images, point clouds are generally sparse and are not evenly distributed on a regular grid. This implies that placing the point cloud on a regular grid generates a non-uniform number of points in the grid cells, which may lead to potential information loss in congested cells or wasted operations in empty cells when the convolution operations of same kernel size are applied to such a grid [2]. Therefore, a GNN was used to structure the point cloud, as the graph can be directly applied to the structure of the point cloud without setting it up regularly. Additionally, GNN can reuse graph edges in all layers and need not extract the grouped and sampled points repeatedly [3]. The proposed method is a novel VPS based on point-GNN [4] that structures a three-dimensional (3D) point cloud to precisely acquire spatial information and detect objects using a GNN. The user's position can be determined by matching the pose information of the object estimated by the improved point-GNN with the pose information label of the object in the voxel database, which addresses the point cloud in the indoor space in voxel units of a certain size.
The contributions of this study can be summarized as follows: • The proposed method estimates the pose of an object in terms of nine degrees of freedom (DOF) by improving the point-GNN. • This is a novel method that determines the user's position as a voxel address in indoor space by matching the object pose label in the defined voxel database with the object pose estimated using the user's camera position.
Our method is a novel way of determining a position within a 3D voxel using the point cloud. Existing positioning methods are typically image-learning-based neural networks, and the common approach computes the IoU (intersection over union) of two 3D bounding boxes in seven degrees of freedom, with a single rotation (around the z-axis), using a bird's-eye view. In contrast, our method computes the IoU rotated around all three axes (x, y, z) in nine degrees of freedom (three translations, three dimensions, three rotations) for pose estimation in 3D space. After the 3D IoU is calculated, our formulas and algorithms use it to determine the position. We improved an object detection approach based on a graph neural network over the point cloud, yielding a novel method of structuring point clouds by addressing the 3D space with voxel labels.
The remainder of this paper is organized as follows. Section 2 provides a brief overview of related work. Section 3 describes the system architecture and the proposed method in detail. Experiments and performance evaluations are presented in Section 4. Section 5 discusses the inference of the obtained results, and Section 6 summarizes the study conclusions and suggests potential future applications.

Point Cloud-Based 3D Object Detection
Point cloud: 3D object detection methods that use point clouds can be divided into methods based on multi-view images, voxels, and points [5]. Both multi-view image-based and voxel-based methods rely on an empirically chosen quantization resolution to transform the point cloud into a regular representation.
3D object detection based on points: The introduction of PointNet [6] and PointNet++ [7] established a method of handling irregular and unaligned point sets. This involves learning the point-by-point function directly from the original point cloud without a representation transformation. A study [8] narrowed the search range in the point cloud using the frustum proposal of two-dimensional (2D) detection for the corresponding camera image and directly regressed the prediction based on the interior points. In the reported cascading frameworks [8,9], the performance of 3D object detection was severely limited by the results of 2D detection. Unlike methods that generate proposals from images [8,9], PointRCNN [10] generated 3D proposals with high recall performance over the entire point cloud. Furthermore, PointNet++ was directly applied to perform 3D semantic segmentation of the point cloud, and 3D proposals were generated simultaneously based on each foreground point. Subsequently, these 3D proposals were used to incorporate point-by-point semantics for further improvement. To improve the directional range of the cubic anchor, a study proposed a new spherical anchor for the point cloud space [11]. Additionally, several other methods [12,13] have applied PointNet++ to point cloud processing and improved the 3D object detection performance. As point-based learning can better use each point in the original point cloud, it performs 3D object detection directly on the points. Point cloud primarily refers to the data collected by a light detection and ranging (LiDAR) sensor, time-of-flight (ToF) sensor, or RGB-D sensor. These sensors transmit a light or radio signal to an object, record the return time, calculate the distance for each signal, and generate points to create a point cloud, which is a set of points spread through 3D space. The depth (z) information in point clouds differs from that of 2D images.
Although a point cloud by itself represents only the three coordinates of each point, a sensor may provide additional per-point information. For instance, a LiDAR sensor can add reflectance information, whereas an RGB-D sensor can provide color information. Therefore, we can learn the 9-DoF of an object using such a point cloud [14].
Point-GNN based on points: Point-GNN is a GNN that detects objects in a LiDAR point cloud. For object detection, point-GNN maintains the irregularity of the point cloud through a graphical representation rather than transforming the point cloud into a regular grid, such as an image or voxels. Unlike techniques that repeatedly sample and group points, point-GNN constructs the graph only once and uses a multilayer perceptron (MLP) to model its functions. Subsequently, point-GNN extracts the features of the point cloud by repeatedly updating the vertex features in the graph. As point-GNN is a single-step detection method, it efficiently encodes the point cloud in a fixed-radius near-neighbor graph. Point-GNN is a network that predicts the category and shape of the object to which each vertex of the graph belongs. The edges of the graph connect adjacent points within a fixed radius to ensure that feature information flows between adjacent points. GNNs reuse graph edges in all layers and avoid repetitive grouping and sampling of points. As Table 1 shows, on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) benchmark, point-GNN achieved high accuracy using only point clouds, similar to that of fusion-based algorithms [4,15-25].
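The fixed-radius near-neighbor graph described above can be sketched as follows. This is a minimal illustration using `scipy.spatial.cKDTree`, not the authors' implementation; the radius value is only an example.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_radius_graph(points, radius):
    """Connect every pair of points closer than `radius` (undirected edges)."""
    tree = cKDTree(points)
    pairs = tree.query_pairs(r=radius)          # set of (i, j) index pairs, i < j
    return np.array(sorted(pairs), dtype=int)

# Toy point cloud: two points within 0.2 m of each other, one far away.
pts = np.array([[0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0],
                [1.0, 1.0, 1.0]])
edges = build_radius_graph(pts, radius=0.2)
print(edges)   # -> [[0 1]]
```

Because the graph is built once and reused in every GNN layer, this step replaces the repeated grouping and sampling of point-set methods.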

3D Intersection over Union (IoU)
IoU is widely adopted as an evaluation indicator in several visual tasks, such as object detection, segmentation, and visual tracking. Additionally, it is generally computed independently of the neural network training loop without participating in the backpropagation process. To regress the parameters of the rotated bounding box, the conventional 3D object detection approach regresses the translation, scale, and yaw angle separately, using Smooth-L1 Loss [26] based on the L1-norm of the parameter distance. As IoU does not change with the scale of the problem, a study [27] attempted to learn the IoU directly, such as predicting the IoU as a metric of non-maximum suppression (NMS) rather than calculating the IoU between the detected bounding box and ground truth bounding box. Another study [28] proposed a fitting quality network (FQNet) to directly predict 3D IoU between the sample and object in monocular 3D object data. Additionally, an algorithm for estimating the exact 3D IoU value of the bounding box was proposed [29]. The accurate 3D IoU values of 3D boxes in 3D object detection can be computed using a three-part algorithm. First, the Sutherland-Hodgman polygon clipping algorithm calculates the intersection point between the faces of two boxes. Second, the volume of intersection uses the convex hull algorithm of all truncated polygons. Third, IoU is computed based on the intersection and union volumes of the two boxes. Figure 1a illustrates the computation of the intersection points of each side by clipping the polygon to the box, and Figure 1b depicts the computation of the intersection volume using the convex hull algorithm of all intersections (green).
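As a minimal illustration of the volume-based IoU definition, the sketch below covers only the axis-aligned special case (all rotation angles zero); the full rotated-box algorithm of [29] additionally requires the Sutherland-Hodgman clipping and convex-hull steps described above. Box layout (cx, cy, cz, l, h, w) is assumed for the example.

```python
def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, l, h, w).

    Zero-rotation special case only: the intersection of axis-aligned boxes
    is itself a box, so no polygon clipping or convex hull is needed.
    """
    inter = 1.0
    for c_a, c_b, s_a, s_b in zip(box_a[:3], box_b[:3], box_a[3:6], box_b[3:6]):
        lo = max(c_a - s_a / 2, c_b - s_b / 2)   # overlap interval per axis
        hi = min(c_a + s_a / 2, c_b + s_b / 2)
        inter *= max(0.0, hi - lo)
    vol_a = box_a[3] * box_a[4] * box_a[5]
    vol_b = box_b[3] * box_b[4] * box_b[5]
    return inter / (vol_a + vol_b - inter)

# Two unit cubes offset by half a side: intersection 0.5, union 1.5, IoU 1/3.
print(aabb_iou_3d((0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1, 1, 1)))  # -> 0.3333333333333333
```

For rotated boxes, the per-axis interval test no longer applies, which is exactly why the clipping-plus-convex-hull procedure is needed.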

VPS
Camera pose estimation in existing VPSs can be roughly divided into three categories [14,29]. Structure-based positioning methods use local features to match a query image to points on a 3D model (2D-to-3D), or an RGB-D image to a 3D model, from which the camera pose is estimated. We concluded that 3D-based methods provide more accurate pose estimation owing to precise model construction and maintenance [30]. In learning-based localization, a model is trained from specific images whose pose information serves as supervision, and a scene can be expressed using the learned model. Such learning-based positioning methods either predict matches for pose estimation [31-34] or directly regress camera poses, as in PoseNet [35], PoseNet2 [36], VlocNet [37], and VlocNet++ [38]. Table 2 presents the localization errors of existing image-learning-based CNN models with sub-meter-level accuracy (0.25 m). VlocNet [37] achieved a high accuracy of 0.048 m on the Microsoft 7-Scenes dataset, which consists of RGB-D images collected from seven different scenes in an indoor office environment.

Proposed Method
This section outlines the proposed method and describes the main modules and important algorithms in detail. This includes the object pose estimation and matching with a voxel database comprising voxel labeling and object pose estimation of an improved point-GNN in the point cloud. Figure 2 illustrates the overall architecture of the method, which comprises two components.
Pose estimation of the GNN illustrated in Figure 2a: To structure the input point cloud as a graph, the key-points selected by the fixed-radius algorithm are used as the vertices of the graph to express the graph coordinates, and the RGB data are learned by the GNN as features. The improved point-GNN, a one-stage method, estimates the classification and localization of the object box. The estimated localization, based on the key 3D IoU module, is represented as box = (x, y, z, l, h, w, ϕ, θ, ψ) in the 9-DoF format.
Voxel database illustrated in Figure 2b: The indoor point cloud is addressed to voxels based on the proposed algorithm to construct the voxel database. The trained improved point-GNN can estimate the 3D pose of an object in a new point cloud and determine the user's position by matching the estimated pose information with that of the object in the voxel database. As shown in Figure 2a, the overall architecture of the improved point-GNN comprises three components: graph construction, a GNN of T iterations, and bounding box IoU and prediction. We use the point cloud created by the ToF camera as the network input. Each point in the point cloud has 3D coordinates and a state value, a length-k vector representing the point's properties; here, the state value is the RGB color. A graph is constructed by using each point as a vertex and connecting the neighboring points within a predefined, fixed radius. After the graph is constructed, it is processed using three iterations (T = 3) of the GNN. To reduce the point cloud, we set the voxel size to 0.8 m for network training and 0.4 m for inference. For the initial vertex states, we set a fixed radius of r0 = 0.2 m, and Th = 0.01 is used in NMS. We train the proposed GNN end-to-end with a batch size of 1, using stochastic gradient descent with staircase learning-rate decay. For each object, we use an initial learning rate of 0.01 with a decay rate of 0.1, trained over 400 K steps. After T = 3 iterations of the graph neural network, the vertex state values are used to predict both the category and the bounding box of the object to which each vertex belongs. The classification branch calculates multi-class probabilities, and we use the mean cross-entropy loss as the classification loss. Finally, the network calculates the bounding box for each class. L1 regularization is added to each MLP to prevent overfitting. Since multiple vertices can lie on the same object, the neural network can output multiple bounding boxes for the same object.
We use the 3D IoU and NMS to merge these bounding boxes into one. The size of MLPcls is (64, #(classes)), and the size of MLPloc for each class is (64, 64, 9). MLPcls predicts the class of an object along with the Background and DoNotCare classes.
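The NMS merging step above can be sketched as follows. This is a minimal greedy NMS, not the authors' code; the `interval_iou` helper is a hypothetical 1D stand-in for the rotated 3D IoU (plug in a real 3D IoU via `iou_fn`), and the default threshold mirrors the Th = 0.01 setting.

```python
def nms_3d(boxes, scores, iou_fn, th=0.01):
    """Keep the highest-scoring box, suppress every remaining box whose IoU
    with it exceeds `th`, and repeat until no boxes are left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou_fn(boxes[best], boxes[i]) <= th]
    return keep

def interval_iou(a, b):
    """1D stand-in for the rotated 3D IoU: boxes are (center, length)."""
    lo = max(a[0] - a[1] / 2, b[0] - b[1] / 2)
    hi = min(a[0] + a[1] / 2, b[0] + b[1] / 2)
    inter = max(0.0, hi - lo)
    return inter / (a[1] + b[1] - inter)

# Boxes 0 and 1 overlap heavily, so only the higher-scoring one survives.
boxes = [(0.0, 1.0), (0.05, 1.0), (3.0, 1.0)]
scores = [0.9, 0.8, 0.7]
print(nms_3d(boxes, scores, interval_iou))  # -> [0, 2]
```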

Point-GNN for 3D Object Pose Estimation
This section explains the pose estimation of the object box. The input of the proposed system is a point cloud created using a 3D sensor, such as the ToF or LiDAR sensor of a mobile phone camera or another mobile platform. We improved the existing point-GNN, which detects objects in point clouds. It is a single-step detection method that maintains the irregularity of the point cloud using a graphical representation. A general GNN reuses graph edges in all layers, avoids repetitive grouping and sampling of points, and refines vertex features by aggregating them along the edges. In the (t + 1)th iteration, each vertex feature is updated as indicated in Equation (1), vi^(t+1) = g^t(ρ({eij^t : (i, j) ∈ E}), vi^t) with eij^t = f^t(vi^t, vj^t), where e^t and v^t denote the edge and vertex features of the t-th iteration, respectively. The function f^t(·) computes the edge feature between two vertices, ρ(·) is a set function that aggregates the edge features for each vertex, and g^t(·) updates the vertex feature using the aggregated edge features. Subsequently, the GNN outputs the vertex features or repeats the process in the next iteration. In the case of object detection, the GNN encodes the information on the object to which each vertex belongs and refines the states of the graph vertices. The state of a vertex is refined using the states of its neighboring vertices, and the relative coordinates of the neighbors are used as input to f^t(·) for edge feature extraction. As indicated in Equation (3), the relative coordinates are calculated between the key-point vertex and the neighboring vertices selected by the fixed-radius algorithm; they induce invariance to global translation of the point cloud. The point cloud of N points is defined as a set P = {p1, ..., pN}, where pi = (xi, si) is a point with 3D coordinates xi ∈ R^3 and a state value si ∈ R^k, the RGB color, representing the point property.
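The vertex update above can be illustrated with a toy numpy sketch. The learned MLPs f^t and g^t are replaced here by fixed random linear maps and ρ by max aggregation, so this shows only the data flow of one iteration, not the trained network; all sizes and the residual update form are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.random((5, 3))        # vertex 3D coordinates x_i
states = rng.random((5, 4))        # vertex state values s_i (e.g. RGB + extras)
edges = [(i, j) for i in range(5) for j in range(5)
         if i != j and np.linalg.norm(coords[i] - coords[j]) < 0.8]

# Stand-ins for the learned MLPs (single fixed random linear layers here).
W_f = rng.random((7, 8))           # f^t: [relative coords, neighbour state] -> edge feature
W_g = rng.random((8, 4))           # g^t: aggregated edge feature -> state update

new_states = states.copy()
for i in range(len(coords)):
    nbrs = [j for (a, j) in edges if a == i]
    if not nbrs:
        continue
    # f^t: edge features from relative coordinates and neighbour states.
    e = np.stack([np.concatenate([coords[j] - coords[i], states[j]]) @ W_f
                  for j in nbrs])
    # rho: max aggregation over incident edges; g^t: refine the vertex state.
    new_states[i] = states[i] + np.max(e, axis=0) @ W_g
print(new_states.shape)            # -> (5, 4)
```

Note the relative coordinates `coords[j] - coords[i]` feeding f^t, which is what makes the update invariant to global translation of the cloud.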
Object detection of point-GNN estimates the center point of the object box and the 7-DoF format of the length, height, width, and yaw values, where box = (x, y, z, l, h, w, θ). However, to estimate the pose of an object using the unfixed camera origin in a 3D indoor space, 9-DoF of the object must be estimated. Therefore, we improved two modules in the point-GNN.
First, to estimate the pose of the object, functions estimating the pitch and roll values were added to the existing yaw estimation. The bounding box of the object is estimated as box = (x, y, z, l, h, w, ϕ, θ, ψ) in 9-DoF using the improved point-GNN. Herein, the center coordinates (x, y, z) represent the center position of the bounding box; the object size (l, h, w) represents the box length, height, and width along the X-, Y-, and Z-axes, respectively; and ϕ, θ, and ψ denote the pitch, yaw, and roll angles, respectively. We encoded the bounding box with respect to the vertex coordinates (xv, yv, zv) as indicated in Equation (4), where lm, hm, wm, θ0, and θm represent constant scale factors. The localization branch predicts the encoded bounding box as δbox = (δx, δy, δz, δl, δh, δw, δϕ, δθ, δψ) for each class.
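A sketch of such a 9-DoF encoding follows. The functional form and scale constants are assumptions modeled on the original 7-DoF Point-GNN encoding extended with pitch and roll, not the exact Equation (4); all default values are hypothetical placeholders.

```python
import numpy as np

def encode_box(box, v, lm=1.0, hm=1.0, wm=1.0, ang0=0.0, angm=np.pi):
    """Encode a 9-DoF box relative to a vertex v = (xv, yv, zv).

    box = (x, y, z, l, h, w, pitch, yaw, roll). The scale constants are
    hypothetical stand-ins for the lm, hm, wm, theta0, theta_m of Eq. (4).
    """
    x, y, z, l, h, w, phi, theta, psi = box
    xv, yv, zv = v
    return np.array([
        (x - xv) / lm, (y - yv) / hm, (z - zv) / wm,     # centre offsets
        np.log(l / lm), np.log(h / hm), np.log(w / wm),  # log-scaled sizes
        (phi - ang0) / angm,                             # pitch
        (theta - ang0) / angm,                           # yaw
        (psi - ang0) / angm,                             # roll
    ])

delta = encode_box((1, 2, 3, 1, 1, 1, 0, np.pi / 2, 0), (0, 0, 0))
print(delta.shape)  # -> (9,)
```

The localization branch would regress these nine deltas per class, and decoding inverts the same transforms.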
Second, we improved the 3D IoU for calculating the NMS value of an object in the 3D space. To estimate a single box of objects, NMS was calculated using the 3D IoU, which is a key module. Algorithm 1 uses polygon clipping to calculate the 3D IoU for the box to determine the intersection on each side. Thus, the improved 3D IoU computes the convex hulls of all intersections and calculates the intersection volume to determine the NMS value.

Voxel Labeling in the Point Cloud Database
We divided the point cloud of the indoor space into voxel units of a certain size and constructed the pose labeling of objects in the divided voxel database space. To build the voxel database, the indoor space is scanned using a high-performance 3D sensor, and the total sizes along the X, Y, and Z axes of the point cloud are calculated using the maximum and minimum values of the scanned point cloud coordinates. After dividing the calculated X, Y, and Z extents of the indoor space by the set voxel size, we performed numbering to generate the voxel address that determines the user's spatial location.
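The division and numbering step can be sketched as follows. The linear numbering order (X first, then Y, then Z) and the example extents are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def voxel_address(p, cloud_min, voxel_size, nx, ny):
    """Linear voxel number of point p, counting along X, then Y, then Z.
    (The actual numbering order of the paper's database is an assumption.)"""
    i, j, k = np.floor((np.asarray(p) - cloud_min) / voxel_size).astype(int)
    return int(i + j * nx + k * nx * ny)

# Example: indoor space spanning 35 x 25 x 5 m, addressed in 0.2 m voxels.
cloud_min = np.array([0.0, 0.0, 0.0])
cloud_max = np.array([35.0, 25.0, 5.0])
nx, ny, nz = np.ceil((cloud_max - cloud_min) / 0.2).astype(int)
addr = voxel_address((1.05, 0.3, 0.0), cloud_min, 0.2, nx, ny)
print(nx, ny, nz, addr)  # -> 175 125 25 180
```

Looking up a user position then reduces to one floor-division per axis plus a table lookup by address.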
The object pose label was generated to estimate the object pose of the improved point-GNN in the built voxel database. Object pose labels comprise the voxel database object box (class, x, y, z, length, height, width, pitch, yaw, roll) and the voxel number address (No. n) of the object. The center point and eight corner vertices (corner[0] to corner[7]) of the 3D box were generated based on the position of the object box in 9-DoF. Figure 3a illustrates the interior space divided by voxel size and the pose labels of objects in the divided space, and Figure 3b depicts the pose label of the object box.

Matching the Object Pose Estimated by Deep Learning with that in the Voxel Database
The object estimated using the method presented in Section 3.1 and the object obtained in Section 3.2.1 are identical. This section explains how to generate a 3D box from each of the object labels obtained in Sections 3.1 and 3.2.1, match the center points of the two 3D boxes, and then match the eight vertices of each box. Figure 4 illustrates the three steps, and the detailed method can be summarized as follows.

Unification of the coordinate system illustrated in Figure 4a: The X, Y, and Z axes of the voxel database coordinate system comprise the XY plane on the ground and the Z-axis in the upward direction. In most camera coordinate systems, the Z-axis points along the camera optical axis, the X-axis to the right, and the Y-axis downward. To match the directions of the X, Y, and Z axes of the two coordinate systems, the camera coordinate system is rotated 90° counterclockwise about the X-axis to align with the voxel database coordinate system.
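The axis-unification rotation about the X-axis can be sketched as follows. The sign of the 90° rotation depends on the handedness convention, which the text does not pin down, so the direction used here is an assumption chosen so that the camera's down axis maps to the database's -Z (down).

```python
import numpy as np

def rot_x(deg):
    """Right-handed rotation matrix about the X-axis."""
    t = np.radians(deg)
    return np.array([[1, 0, 0],
                     [0, np.cos(t), -np.sin(t)],
                     [0, np.sin(t), np.cos(t)]])

# Camera frame: X right, Y down, Z forward (optical axis).
# A -90 deg right-handed rotation about X maps camera forward (+Z) to the
# database's horizontal +Y and camera down (+Y) to the database's -Z,
# leaving the database Z-axis pointing up.
R = rot_x(-90)
print(np.round(R @ np.array([0.0, 0.0, 1.0])))  # camera forward -> database +Y
print(np.round(R @ np.array([0.0, 1.0, 0.0])))  # camera down    -> database -Z
```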
Transformation of the center movement of the object box illustrated in Figure 4b: The Euler rotation matrix in Equation (5) is applied to the values estimated by the improved point-GNN for the object box (class, x, y, z, length, width, height, pitch, yaw, roll). The eight corner vertices, corner[0] to corner[7], of the 3D box rotated according to the estimate about the object's center coordinates are generated, and the camera origin is added to the estimated labels. The 3D box of the object in the indoor-space voxel database and that of the object estimated by the improved point-GNN are identical. The center point (0, 0, 0) of the estimated object 3D box is translated onto the center point of the 3D box in the voxel database, so that the center points of the two boxes coincide. In both 3D boxes, the corner vertices are relative coordinates with respect to the center point.
Center rotation transformation of the object box and camera origin rotation transformation illustrated in Figure 4c: After matching the center points of the two object boxes, the corner vertices of the object 3D box in the voxel database are used to calculate the rotation angle that matches each pair of corner vertices. The rotation transformation of the camera origin into the coordinate system of the database object box is calculated based on the estimated object 3D box center point. Algorithm 2 is the pseudocode for calculating the rotation value of the camera origin about the object's center point (0, 0, 0). One corner vertex of each of the two boxes (corner[0]) and a point on the diagonal (corner[6]) are used to calculate the cross product and dot product of the corresponding vectors. Once the rotation value is obtained from the rotation transformation matrix, the transformed coordinates of the camera origin rotated into the voxel database can be obtained. Equation (6) represents the rotation transformation matrix with the unit vector u = (ux, uy, uz) as the axis. When the camera origin is transformed by the rotation matrix Ru(θ), the transformed camera coordinates in the voxel database are calculated. Consequently, the voxel containing the addressed coordinates configured in Section 3.2.1 can be determined by applying the proposed method.
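The core step of Algorithm 2, recovering a rotation from a pair of matched corner vectors via their cross product (axis) and dot product (angle), can be sketched with the Rodrigues form of Ru(θ). This is an illustrative reconstruction, not the authors' code, and the degenerate anti-parallel case is omitted.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix taking direction a onto direction b, built from the
    cross product (rotation axis) and dot product (rotation angle) via the
    Rodrigues formula R = I + sin(t) K + (1 - cos(t)) K^2."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    u = np.cross(a, b)
    s = np.linalg.norm(u)                      # sin(theta)
    c = np.dot(a, b)                           # cos(theta)
    if s < 1e-12:                              # (anti-)parallel: sketch returns I
        return np.eye(3)
    u = u / s
    K = np.array([[0, -u[2], u[1]],
                  [u[2], 0, -u[0]],
                  [-u[1], u[0], 0]])           # skew-symmetric cross-product matrix
    return np.eye(3) + s * K + (1 - c) * (K @ K)

# Rotate the corner direction (1, 0, 0) onto (0, 1, 0): a 90 deg turn about Z.
R = rotation_between(np.array([1.0, 0, 0]), np.array([0, 1.0, 0]))
print(np.round(R @ np.array([1.0, 0, 0])))     # -> [0. 1. 0.]
```

Applying the resulting matrix to the camera origin yields its coordinates in the voxel-database frame, from which the voxel address follows.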

Experiments
We used deep learning datasets to select 10 types of objects for the evaluation, considering fixed objects with a low probability of position movement in indoor space. An Azure Kinect DK was used to generate the point clouds for training and evaluation, with 500 training samples and 100 test samples. Figure 5 depicts the objects selected for the self-performance evaluation, including a flower pot, refrigerator, round table, and podium table. The implementation was written in Python 3.6 using TensorFlow for graphics processing unit (GPU) computation. Inference times were measured on a desktop running Linux Ubuntu 16.04 LTS with a Ryzen 9 3900X CPU and two RTX 2080 SUPER 8G GPUs. To increase the speed of object pose estimation by reducing the number of computations in the dense point cloud generated by the Azure sensor, the improved point-GNN was trained on a sparse point cloud down-sampled to 20,000 points for each object in the indoor space. Figure 6a depicts the label of the object box during training, and Figure 6b illustrates the graph of key-points generated while estimating the 9-DoF box with the improved point-GNN. Figure 7 illustrates the training loss of the improved point-GNN box estimation using the proposed method; the trained model achieved a mean average precision (mAP) of 88.2% on our dataset. Table 3 presents the corresponding results. The parts of the point-GNN improved in our experiment are the code for estimating the 9-DoF of the object and the 3D IoU module. In our test environment, the VPS produced a result in 732 ms in a middle-scale indoor space. To improve speed further, our improved IoU module can be applied to a neural network with a faster operating speed.

Experimental Setup
For point cloud data generation, the interior space for the voxel database was captured using four Leica BLK360 imaging laser scanners, which are LiDAR 3D sensors with 3D panoramic features. The point cloud comprises approximately 4.5 million points with RGB information, covering an indoor space of length 35 m, height 5 m, and width 25 m. The x, y, and z extents of the point cloud were calculated using the minimum and maximum coordinates of the indoor-space point cloud scanned by the LiDAR 3D sensor, and the calculated real-valued extents were rounded up to integer values. Although the minimum voxel size that can be set is 0.1 m, smaller sizes lower the computing speed. During the evaluation, the voxel size was set to 0.2 m. Dividing the x, y, and z extents by the voxel size yields the number of voxels along the X, Y, and Z axes, and the voxels can then be addressed through numbering. Figure 8 depicts an interior view of the indoor space, the LiDAR 3D sensor camera, the point cloud divided into voxels, and a 3D doll-house view of the indoor space.
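The voxel-count calculation and its trade-off with voxel size can be sketched as follows, using the stated extents (35 m x 25 m x 5 m); the rounding-up rule mirrors the text, while treating the extents as exact is a simplification.

```python
import math

def voxel_counts(extent, voxel_size):
    """Number of voxels per axis: real-valued extents divided by the voxel
    size and rounded up to integers, as described in the text."""
    return tuple(math.ceil(e / voxel_size) for e in extent)

extent = (35.0, 25.0, 5.0)                # length, width, height in metres
for size in (0.1, 0.2):
    nx, ny, nz = voxel_counts(extent, size)
    print(size, nx * ny * nz)             # halving the voxel size gives 8x the voxels
```

Moving from 0.2 m to 0.1 m voxels multiplies the number of addressable cells by eight, which is the computing-speed cost the text alludes to.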

Results
Voxel information was obtained from the system by setting the voxel size based on the calculated size of the space and computing the number of voxels along the X, Y, and Z axes; the size of the indoor space was calculated from the point cloud. Furthermore, the log comprises the VPS information of the indoor database module, and Table 4 presents the VPS log. The pose estimation error of the improved point-GNN is proportional to the object's center coordinates (x, y, z) and rotation values, namely the pitch, yaw, and roll, and to the distance from the camera's origin coordinates (0, 0, 0). However, as the VPS output is a voxel address, the coordinates of the camera in the database space are treated as identical to any position within the cube of the labeled voxel. Figure 9 depicts the VPS error when the camera position estimated by the VPS and the actual camera position do not fall in the same voxel of the database space; on average, the error is the distance between the centers of the two voxels. The actual camera position for a selected object in the indoor space can be represented as (x2, y2, z2) ∈ Voxel No. x2, and the camera position predicted by the VPS based on the pose estimation of the improved point-GNN as (x1, y1, z1) ∈ Voxel No. x1. When x1 lies in the range of Equation (7) and is included in the same voxel, the proposed method reports the same position, absorbing the error within a voxel of size length = width = height = α cm. The final actual error distance within the same voxel space does not exceed the bound in Equation (7). However, when the voxel numbers do not match, the error distance is obtained using Equation (8).
VPS Error Distance = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²), where Voxel No. x2 ≠ Voxel No. x1 (8)
Table 5 summarizes the VPS error distance calculated using Equation (7) in the experiment described in Section 4.1. The proposed method can localize positions with sub-meter-level accuracy on our dataset. To test on indoor scenes, we also used the publicly available 7-Scenes dataset, with scenes shown in Figure 10. This dataset contains significant variation in camera height and was designed for RGB-D relocalization. We compared our method with the state-of-the-art method by creating a point cloud from the RGB-D images. As indicated in Tables 5 and 6, more than 80% of the test data were successfully located within 0.25 m of the ground truth location. Figure 11 depicts a case in which point-GNN object detection learning fails owing to the size of the object. Although the object was fixed in the indoor space, its size was relatively small; training the improved point-GNN yielded an mAP of approximately 0.3, with precision and recall values of 0, so the model was not trained successfully. This is because, when the graph of the improved point-GNN is generated, only a small number of key-points fall on the object, and the graph learned by the network is therefore not generated with sufficient accuracy. This can be addressed by increasing the number of computations applied to the point cloud. However, as the down-sampling and learning parameters of the point cloud were identical to those of the other objects, objects with dimensions of 30 × 30 × 30 cm or less were not learned. In addition, our method synchronized the classification and localization learning rates by setting the class-loss and localization-loss weight parameters for multi-object class learning in the dynamic positioning system.
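The error computation of Equations (7) and (8) can be sketched as follows: within a matching voxel the reported error is bounded by the voxel diagonal, and across voxels the Euclidean distance of Equation (8) applies. The same-voxel test via floor division is an assumed reading of the addressing scheme.

```python
import math

def vps_error(pred, actual, voxel_size):
    """Euclidean VPS error distance (Eq. 8). When both positions fall in the
    same voxel, the error is bounded by the voxel diagonal (Eq. 7's range)."""
    same_voxel = all(math.floor(p / voxel_size) == math.floor(a / voxel_size)
                     for p, a in zip(pred, actual))
    dist = math.dist(pred, actual)
    bound = math.sqrt(3) * voxel_size          # worst case inside one voxel
    return dist, same_voxel, (dist <= bound if same_voxel else None)

# Both points lie in voxel (0, 0, 0) of a 0.2 m grid: error stays within the bound.
print(vps_error((0.05, 0.05, 0.05), (0.15, 0.15, 0.15), 0.2))
```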

Conclusions
In this study, we proposed a novel VPS method for determining the position and pose of an object, estimated using deep learning, by matching it with the position and pose of an object in a predefined space. Initially, we estimated the position and pose of an object by learning the 3D point cloud obtained from the LiDAR camera through deep learning. Subsequently, the position and pose of the object in a predefined space were matched to determine the user's position. This confirms that the proposed system can ensure high location accuracy and direction estimation in a well-established space with structured voxels. The proposed system uses a down-sampled point cloud similar to a mobile device-based environment that can precisely acquire 3D data when GPS information is insufficient. Moreover, it enables deep learning-based visual positioning using fixed objects in an indoor environment. In summary, it is a decision system that can provide customized augmented reality content to users and also aid in indoor navigation.
Owing to the characteristics of augmented reality, which predominantly operates on mobile devices, it is necessary to investigate mobile GPU work scheduling to minimize the power consumption of the camera, CPU, and GPU on mobile devices. Thus, in future work, we intend to develop a system that complements deep learning object recognition with simultaneous localization and mapping (SLAM) for indoor navigation and augmented reality.
Data Availability Statement: Code and dataset will be made available on request to the first author's email with appropriate justification. The public site for each dataset is as follows. RGB-D Dataset 7-Scenes: https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/.