OLIMP: A Heterogeneous Multimodal Dataset for Advanced Environment Perception

Abstract: Reliable environment perception is crucial for autonomous driving, especially in dense traffic areas. Recent improvements and breakthroughs in scene understanding for intelligent transportation systems are mainly based on deep learning and the fusion of different modalities. In this context, we introduce OLIMP: A HeterOgeneous MuLtimodal Dataset for Advanced EnvIronMent Perception. This is the first public, multimodal and synchronized dataset that includes UWB radar data, acoustic data, narrow-band radar data and images. OLIMP comprises 407 scenes and 47,354 synchronized frames, presenting four categories: pedestrian, cyclist, car and tram. The dataset includes various challenges related to dense urban traffic such as cluttered environments and different weather conditions. To demonstrate the usefulness of the introduced dataset, we propose a fusion framework that combines the four modalities for multi-object detection. The obtained results are promising and encourage future research.


Introduction
In the last few decades, the evolution of autonomous cars has been driving disruptive innovations with life-changing impacts. In fact, it not only changes the way we drive, but its deployment will also have a direct social impact in terms of mobility, security, safety, environment, etc. The car is able to perceive, interpret and make decisions thanks to the employment of several sensors [1][2][3]. Thus, the development of such systems requires an accurate representation of the vehicle's environment and surroundings [4,5].
Driving in dense urban traffic is a challenging task. It involves complex road conditions and various traffic agents such as cars, pedestrians, motorcyclists, trains, etc. Moreover, these agents have diverse types and behaviors, which further increases the complexity of environment perception. To provide a safe autonomous driving service, the system should be able to detect the different road agents and predict their future paths in order to have a complete environment perception [6] and make the right navigation decision [7][8][9]. Therefore, obstacle detection is considered a crucial task for autonomous driving in complex urban traffic. To achieve this, diverse machine-learning-based algorithms have been developed [10]. Due to its impressive effectiveness in representing hierarchical features, deep learning is one of the most widely used tools in this domain.
Training such algorithms requires very large datasets. In this context, there exist various multimodal datasets dedicated to intelligent transportation systems (ITS) such as Kitti [11], Cityscapes [12], Kaist Multi-Spectral dataset [13], nuScenes [14], etc. These datasets have gained particular attention over mono-modal datasets, as an individual sensor is insufficient to provide a complete perception of the environment. Furthermore, the most exploited sensors in this field such as camera, lidar, radar, etc., offer complementary data and their collaboration can guarantee a better scene understanding [15].
The camera is the most widely used sensor for obstacle detection. It provides detailed information about the vehicle's surroundings such as edges, colors, etc. Lidar is more accurate in terms of depth information. With regard to radar, it is robust to weather and lighting conditions. The latter cited sensors are the most exploited in the field of autonomous vehicle, and based on the literature, the combination of the captured data from each sensor separately can improve performances [16].
Over the last decade, various autonomous driving datasets have been published in order to enhance research on environment perception. Most of these datasets are multimodal, combining different heterogeneous modalities.
While some of the existing datasets use narrow-band radar, the UWB radar carries richer information. The UWB radar provides a signal that results from the reflection of a transmitted UWB pulse on the object. The deformation of the initial wave represents the signature of the object. This signature contains information about the distance, the material and the shape of the object.
Moreover, different objects have distinguishable acoustic signatures that may help recognize each of them. In spite of the usefulness of the acoustic data, we notice that none of the state of the art ITS benchmarks uses acoustic modality.
The main contributions of this paper are:
• We introduce OLIMP (https://sites.google.com/view/ihsen-alouani), A HeterOgeneous MuLtimodal Dataset for Advanced EnvIronMent Perception, a new heterogeneous dataset collected using a camera, a UWB radar, a narrow-band radar and a microphone.
• We present an exhaustive overview of the available public environment perception databases.
• We propose a new fusion framework that combines data acquired from the different sensors used in our dataset to achieve better performance on the obstacle detection task. This fusion framework highlights the potential improvement that the community could obtain using our dataset.
The paper is organized as follows: In Section 2 we review existing multimodal environment perception datasets. In Section 3, we focus on detailing some related works especially on data fusion methods. We introduce and detail our proposed dataset in Section 4. In Section 5, we present the fusion framework and show experimental results to generate baselines for our new dataset. Section 6 provides a discussion. Finally, we conclude the paper in Section 7.

Existing Public Multimodal Environment Perception Databases
Public multimodal datasets are important for the advancement of autonomous driving. In the last decade, several datasets have been released for this purpose. The Kitti [11] and Cityscapes [12] datasets are considered the first benchmarks to have addressed real-world challenges. Until a few years ago, datasets that contained only sparsely annotated data were sufficient to treat several problems. Nowadays, with the evolution of deep learning techniques, such datasets are insufficient [17]. Training deep models requires datasets with a huge amount of labeled data, and collecting such an amount of data is not a trivial task. Hence, this requirement has led to the development of several new sophisticated autonomous driving datasets [18]. In this section, we review various existing public monomodal and multimodal environment perception databases by detailing and observing the characteristics of each one. Table 1 shows an overview of the reviewed datasets.
Table 1. Overview of some autonomous driving datasets ("-": no information is provided).
Kitti is a vision benchmark dataset that was released in 2012 and comprises stereo camera, Velodyne lidar and inertial sensors [11]. With the introduction of this database, various vision tasks were launched, such as pedestrian detection, road detection, optical and stereo flow, etc. Kitti was recorded in six different locations with cluttered scenes and it provides over 200k boxes that were manually labeled. Nevertheless, only 3D objects that appear in the frontal view are annotated and the dataset covers only daytime conditions. Moreover, the main limitation of the Kitti database is its small amount of data, which is not well suited to deep learning algorithms.
In the meantime, the University of Cambridge has introduced a new driving dataset named CamVid [19]. It was the first that contains videos with semantic segmentation labels related to 32 classes. However, the size of this dataset is small; it contains only four scenes.
In 2016, the Cityscapes dataset was published [12]. It covers urban traffic scenarios in 50 cities, in spring and summer only, and includes 30 categories. Cityscapes provides pixel-level and instance-level segmentation labels. It contains mainly images and a few videos: 5000 images have fine annotations and 20,000 images have coarse annotations.
In [20], BDD100k was recorded in 2016 in four different regions of the US. It is considered the largest driving video dataset, and offers diversity in terms of data and driving conditions. BDD100k comprises 100k videos containing almost 1000 hours recorded under different weather conditions. Only one image is selected from each video sequence for labeling, as in the Cityscapes dataset. Ten thousand images are labeled at pixel level and bounding box labels are provided for 100k images.
The Kaist Multi-Spectral dataset [13] is a multimodal database that was repeatedly collected in urban, residential and campus environments. Several sensors were fixed on the vehicle, namely: stereo camera, thermal camera, GNSS, 3D lidar and inertial sensors. Moreover, it covers diverse time slots (day, night, morning, sunset, etc.) and the annotation is provided in 2D. But compared to the most recently released datasets, the size of the Kaist dataset is limited.
Subsequently, another popular dataset named ApolloScape was released [21]. Compared with Kitti and Cityscapes, it contains an extensive amount of data and has many properties. It includes stereo driving sequences that amount to over one hundred hours of recording at diverse times of day and about 144k images. It also covers 2D and 3D pixel-level segmentation, instance segmentation, lane marking and depth. Further, in order to label such a database, the authors developed several tools customized mainly for the annotation process. However, data acquired from lidar is used to provide only static depth maps.
The H3D dataset [22] was introduced in 2019. It covers over 160 complex and congested scenes. Three cameras, a lidar, a GPS and inertial sensors were used to collect this dataset. The main challenge addressed in this dataset is 3D multi-object detection and tracking. It consists of 1.1M annotated 3D boxes over more than 27k frames. Eight classes were considered in this dataset: car, pedestrian, cyclist, truck, misc, animals, motorcyclist and bus. The dataset comprises rich scenes and annotations and is of notable size. Nevertheless, the data was recorded only in daytime conditions.
The dataset introduced in [23], entitled BLVD, does not focus on static obstacle detection only, but also on dynamic object detection. This dataset proposes a platform that involves 4D tracking (3D+temporal), 5D interactive event recognition and 5D intention prediction. It includes 3 classes: vehicle, pedestrian and rider. The data was recorded in daytime and night time conditions. It provides 120k frames with 5D semantic annotations and more than 249k 3D annotations.
The nuScenes (nuTonomy scenes) dataset [14] is the first dataset that involves the three main sensors exploited for autonomous driving: a lidar, 5 radars and 6 cameras. This database consists of 1000 scenes where the duration of each scene is 20 s. The annotations are 3D bounding boxes specified for 23 classes. The data were gathered under several lighting and weather conditions: daytime, night time and rain [14]. This dataset is rich in terms of sensors used, size, diversity of acquisition conditions, and amount of data with 1.4M frames. Yet, the main issue of this dataset is the imbalance in the number of samples between the different classes.
Other than the autonomous driving databases previously mentioned, additional datasets have been released for the same purpose, such as the Oxford Robotcar [24], Udacity [25] and DBNet Dataset [26]. These datasets are mainly dedicated to enhancing scene understanding and environment perception. We provide in Table 2 a categorization of the most important autonomous driving datasets according to particular tasks.
Table 2. Categorization of the reviewed autonomous driving datasets according to their tasks: CamVid [19], Kitti [11], Cityscapes [12], BDD100k [20], Kaist Multispectral [13], ApolloScape [21], H3D [22], BLVD [23], nuScenes [14] and OLIMP.
We notice that since 2016, the number of published datasets has increased because of their importance for the development of self-driving cars. The majority of the collected data is specialized in urban driving, and was recorded in different locations: Europe, the United States, Asian cities, etc. In terms of sensing modalities, all the examined datasets contain RGB images acquired from one or more cameras, or video in the HEVC (high efficiency video coding) standard or in another recent coding [27]. Narrow-band radar data is only present in the nuScenes dataset [14] and the newly released Oxford Radar RobotCar Dataset [28]. These benchmarks contain a limited number of radar samples, even though this sensor provides rich information and helps in perceiving the environment and taking the right decisions. For that reason, it has become essential to exploit radar sensors when developing autonomous driving datasets.
Depending on the principal aim of the published dataset, objects are labeled into various categories. Comparing the object classes in each dataset, we notice that the number of examples attributed to each class is imbalanced. For example, we compare the samples related to two classes, car and pedestrian, for the nuScenes, Kitti and Kaist Multispectral databases. We can observe that there are many more car samples than pedestrian labels, as shown in Figure 1. One of the important criteria for a complete dataset is the coverage of different lighting and weather conditions in order to represent various scenarios [29]. Although the Kitti dataset is broadly used in this field of research, the variety of its recording conditions is limited: it was gathered only in daytime on sunny days, similar to the CamVid, Cityscapes and H3D datasets. To enrich the lighting conditions, the datasets in [13,14,20,21,23] were collected in both daytime and night time. Concerning the diversity of weather conditions, only BDD100k and nuScenes cover rain and snow situations. Seasonal changes are not well covered, as the majority of the databases were recorded over short periods. From Table 2, we notice that most of the reviewed datasets were dedicated to multi-object detection as it is an unavoidable process in ITS. Likewise, several datasets were dedicated to object tracking, lane recognition, semantic recognition, SLAM and 3D vision.

Multi-modal Environment Perception Related Work
Complex driving situations often present various obstacles. Some works focus on 2D detection, while others deal with 3D object detection, which poses more challenges and has been made possible by the development of complex datasets.
To address this challenge, the combination of various modalities is of great interest. From this perspective, the majority of existing work fuses RGB images with lidar point clouds. A limited number of benchmarks couple RGB images with thermal images. However, we highlight that there is a lack of research on combining radar data with images. Furthermore, fusion of sensing modalities can be achieved at 3 possible stages: early, intermediate or late level. Moreover, based on the literature, five fusion operations are mainly used to fuse multiple modalities in a deep architecture [15]: (1) Addition, (2) Average, (3) Concatenation, (4) Ensemble: used to combine the regions of interest (ROIs) for object detection, (5) Mixture of Experts: this operation explicitly models the weights of the feature maps.
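As a rough illustration of four of the fusion operations listed above, the following NumPy sketch applies addition, averaging, concatenation and a mixture-of-experts style weighting to two toy feature maps. The function names, the channels-first layout and the fixed gating logits are assumptions made for this example; in a real mixture of experts the weights would be predicted by a gating network, and the ensemble operation is omitted because it acts on ROIs rather than on feature maps.

```python
import numpy as np

def fuse_addition(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return a + b

def fuse_average(a, b):
    """Element-wise mean of two feature maps with identical shapes."""
    return (a + b) / 2.0

def fuse_concatenation(a, b):
    """Stack the two feature maps along the channel axis (axis 0 here)."""
    return np.concatenate([a, b], axis=0)

def fuse_mixture_of_experts(a, b, gate_logits):
    """Weighted sum whose two weights come from a softmax over gate_logits.

    gate_logits is a hypothetical length-2 vector; it is passed in directly
    here instead of being predicted by a gating network.
    """
    logits = np.asarray(gate_logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    return weights[0] * a + weights[1] * b

# Toy feature maps with shape (channels, height, width).
camera_features = np.ones((4, 8, 8))
radar_features = np.full((4, 8, 8), 3.0)

added = fuse_addition(camera_features, radar_features)
averaged = fuse_average(camera_features, radar_features)
stacked = fuse_concatenation(camera_features, radar_features)
gated = fuse_mixture_of_experts(camera_features, radar_features, [0.0, 0.0])
```

Addition, averaging and the weighted sum keep the channel count unchanged, whereas concatenation doubles it, so the layer that follows a concatenation has to accept the larger input.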
In this section, we summarize various existing techniques for multi-modal environment perception, particularly focusing on multi-object detection which will be highlighted in Table 3.
• RGB images and lidar point clouds fusion: Many works proved that fusing images with lidar data improves the accuracy of the object detection process, particularly for far range and small obstacles [30]. There are three techniques to combine lidar point clouds with camera images. First, images and lidar points are merged. Second, targets are detected using camera images and these results are subsequently provided using lidar point clouds. The third method consists of defining regions of interest using lidar data, and the camera is used to detect the objects.
González et al. [31] used transformed depth maps and RGB images as inputs to detect pedestrians. The objects' poses in multiple views were taken into account, and intermediate and late level fusion were implemented. For the intermediate stage, they fused features extracted from Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP) descriptors, with a support vector machine (SVM) classifier. With regard to fusion at high level, they coupled decisions obtained from the training of a detector on each modality. In this case, the feature level fusion guaranteed better performance.
In [32], a point fusion method is proposed where lidar points are mapped onto the image plane and features are extracted from the image using a pre-trained 2D detector. Afterwards, features are concatenated via the VoxelNet architecture. In [33], an architecture based on two single-stage detectors is proposed. The information provided by lidar data (height, distance, intensity) is transformed into images. This data is processed by a VGG16 [34] network to provide features. Afterwards, an SSD [35] network is adapted to generate 2D bounding boxes of cars in foggy weather based on a deep feature exchange that relies principally on feature concatenation. In the work of Xu et al. [36], the raw data acquired from lidar are processed by the PointNet architecture and image features are extracted via a CNN (Convolutional Neural Network). The obtained results are then pooled in order to locate the 3D bounding box coordinates. Qi et al. [37] adopted a similar approach in their work. In [38], object proposals are generated using a segmentation method applied on lidar point cloud data and RGB images. Afterwards, the candidates generated from lidar data and images train two separate CNNs to classify the proposals. The output decisions are combined using a basic belief assignment (BBA) to associate bounding boxes. Finally, a CNN model is implemented to determine the final decision.

• Visible and thermal images fusion: While visual cameras are affected by weather and lighting conditions, thermal cameras are robust to night time and daytime circumstances since they detect the heat emitted by objects as infrared radiation. For this reason, the combination of the provided data can ensure a detailed scene understanding as they are correlated in terms of illumination conditions.
Hwang et al. [39] introduced an extension of the aggregated channel features (ACF) dedicated to pedestrian detection. The extended model consists of a multispectral ACF obtained from the augmentation of the thermal intensity with HOG features. In [40], visible and thermal images are fused according to two approaches to detect persons. The first is called DenseFuse and consists of encoding the two types of images using independent encoders. The encoded features are merged and then decoded back to generate a single fused image that serves as the input of a residual network (ResNet) architecture. The second method is an intermediate level fusion technique. ResNet-152 is employed separately for infrared and visual images, and the extracted features are then concatenated into a single array that serves as the input of the fully connected layer.
Early and late fusion based on a CNN architecture are investigated in [41] in order to couple infrared and visible images. The early fusion method consists of combining the pixels captured from the two modalities. In late fusion, in contrast, two sub-networks provide feature representations for the two modalities. These representations are fused in a supplementary fully connected layer. Besides, the proposals are generated using the ACF+THOG detector. According to the obtained results, a pre-trained late fusion method evaluated on the KAIST multispectral dataset guarantees better performance. In [42], an illumination-aware architecture is proposed based on Faster R-CNN [43]. Infrared and visible images are respectively the inputs of two separate sub-networks. Meanwhile, an illumination-aware network is developed for estimating an illumination value from color images, and an illumination weight layer is then integrated in order to determine the fusion weights for the two modalities. Consequently, the final decision is obtained by weighting the final results of the two sub-networks according to the estimated fusion weights.
• Narrow-band radar data and RGB images fusion: For obstacle detection, the radar and the camera are two complementary sensors, however only a few studies addressed this challenge. Similar to the other kinds of sensing combinations, the three types of fusion can be applied to couple these modalities.
In [44], narrow-band radar tracks generate the ROIs in the images. Following this, for the vision module, a symmetry algorithm and a contour detection technique are applied to the ROIs to identify vehicles. The goal of the work presented in [45] is to detect pedestrians. The narrow-band radar sensor provides a list of tracks and the ACF object detector is adopted to generate a list of identified pedestrians in the images. Subsequently, the fusion of the obtained decisions is ensured using the Dempster-Shafer method. Wang et al. [46] proposed a decision approach to fuse radar data and images. The You Only Look Once (YOLO) [47] network is employed in this work to detect vehicles from images. The radar sensor detects the centroid of the obstacles, and these detections are then projected onto the image plane. Finally, the results obtained from the two modalities are combined. A real-time Radar Region Proposal Network (RRPN) is developed in [48]. The network generates ROIs based only on radar detections. The tracks are mapped into images so that anchor boxes, inspired by the Fast R-CNN architecture, are proposed. Then, these boxes are scaled according to the distance of the objects to obtain accurate detections. Radar data is transformed into images in [49] in order to be combined with RGB images. These data are processed separately via a ResNet network. Accordingly, features are concatenated after the second block of ResNet.
To fuse different modalities for understanding the vehicle surroundings, many approaches employ deep neural network architectures while others are based on hand-crafted features. From the aforementioned reviewed works, we observe that the fusion performance depends mainly on the sensing modalities, the quality of data and the selected architecture. For fusion operations, feature concatenation is a widely exploited method, specifically at early and intermediate levels. Likewise, addition and mixture of experts are mainly employed for intermediate and high stages.
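To make the decision-level fusion mentioned for [45] more concrete, the sketch below implements Dempster's rule of combination for two sources. It is a generic textbook illustration, not the code of [45]: the frame of discernment (pedestrian versus background), the variable names and all mass values are invented for the example.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass,
    and the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (set_a, mass_a), (set_b, mass_b) in product(m1.items(), m2.items()):
        intersection = set_a & set_b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to incompatible hypotheses counts as conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize by the non-conflicting mass.
    return {hyp: mass / (1.0 - conflict) for hyp, mass in combined.items()}

PEDESTRIAN = frozenset({"pedestrian"})
BACKGROUND = frozenset({"background"})
UNKNOWN = PEDESTRIAN | BACKGROUND  # ignorance: either hypothesis

# Hypothetical masses for one radar track and one camera detection.
radar_mass = {PEDESTRIAN: 0.6, UNKNOWN: 0.4}
camera_mass = {PEDESTRIAN: 0.7, BACKGROUND: 0.2, UNKNOWN: 0.1}

fused = dempster_combine(radar_mass, camera_mass)
```

With these made-up inputs, the fused belief in the pedestrian hypothesis is higher than that of either source alone, which is the behavior a late-fusion stage relies on when both sensors agree.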

Proposed Dataset
The importance of multimodal perception techniques for ITS and the extent of research efforts in this direction emphasize the need for multimodal datasets that explore complementary sensors. In this section, we introduce OLIMP, a new dataset for road environment perception. To ensure a complete environment perception, our benchmark contains four complementary modalities, namely: UWB radar data, narrow-band radar streams, images and acoustic data. The camera is affected by degraded conditions such as foggy weather, while the UWB radar is influenced by neither luminosity nor weather conditions. The acoustic data is orthogonal to the vision field. The narrow-band radar provides position and velocity. This section presents the proposed dataset and details its implementation, challenges and opportunities.

Context
The main contribution of our work is to introduce such a multimodal dataset for this aim. To the best of our knowledge, OLIMP is the first benchmark that contains UWB radar data and acoustic data.
The introduced OLIMP dataset is a multimodal synchronized dataset that was collected using four heterogeneous sensors to better understand the vehicle's environment. The data was collected on the campus of the Polytechnic University Hauts-de-France in Valenciennes, France (Valenciennes is known for its foggy weather). Data was captured over 3 months and consists of 47,354 synchronized frames.

Hardware and Data Acquisition
We used four heterogeneous sensors: a monocular camera, a UWB radar, which is a short range radar, a narrow-band radar (ARS 404-21), which is a long range radar, and a microphone. In addition, we used the EFFIBOX platform to acquire data from the different sensors simultaneously [52]. In Table 4, we highlight the sensors' characteristics and technical details.
• UMAIN radar: it is a UWB radar. The exploited kit is called HST-D3 and is developed by the UMAIN corporation [53]. The kit comprises a UWB short range radar with a Raspberry Pi 3 for the acquisition. The received raw radar data are transmitted to the computer through the Raspberry Pi, which is connected via the TCP/IP protocol.
• Narrow-band radar (ARS 404-21): this premium sensor from Continental is a long range radar that is able to detect multiple obstacles up to 250 meters. It generates raw data that include distance, velocity and radar cross section (RCS) [54]. Data are transmitted to the EFFIBOX platform via CAN bus.
• The EFFIBOX platform: it is software developed in C/C++ and dedicated to the design of multi-sensor embedded applications. In addition, diverse development functionalities are available, such as acquiring and saving sensor streams, processing/post-processing, visualization, etc.

Sensors Embedding
With regard to the sensor configuration, we designed a structure where all the sensors are placed in the front view. To simplify the data fusion, the narrow-band and the UWB radars and the camera were mounted on the same vertical axis. Figure 2 shows the proposed data acquisition architecture, and Figure 3 highlights the structure setup.

Sensor Synchronisation
Sensor synchronisation is a challenging and unavoidable task when building an autonomous driving dataset. We developed our own method to achieve an accurate alignment between the modalities' data streams. In the simultaneous data recording process, we register timestamps relative to each sensor separately. We first start with synchronizing the radars and the camera. Since these sensors have different frequencies and time responses, we choose the narrow-band radar as the primary sensor. This is explained by the fact that the narrow-band radar is the slowest among these sensors; it has the highest latency for a complete measurement compared to the other modalities, as shown in Table 4. The narrow-band radar raw data is represented in the form of a stream of discrete measurements. Each one of these measurements comprises a main data frame including the number of obstacles, followed by successive information about each obstacle (distance, velocity, etc.). Once a narrow-band measurement is taken, we record its timestamp and look for the camera frame as well as the UWB signature that have the closest timestamp to the narrow-band synchronization timestamp.
Regarding the acoustic modality, the frame corresponds to an analog signal (sound). The challenge is to find the most suitable time window size that: (i) corresponds to the exact scene recorded at given timestamp, and (ii) is long enough to hold meaningful information about the scene. After thorough explorations, we empirically choose an optimal window size of 5 s for acoustic signal frame. This frame is recorded according to the narrow-band synchronization timestamp mentioned above.
Overall, the proposed algorithm consists of selecting the acquisition timestamp of every narrow-band measurement and finding the corresponding frames of the other sensors which have the closest timestamp. The frame synchronization step is illustrated in Figure 4.
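The nearest-timestamp matching described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the EFFIBOX implementation: timestamps are taken to be sorted lists of seconds, the function and key names are hypothetical, and the 5 s acoustic window is centered on the narrow-band timestamp, which the text does not specify.

```python
from bisect import bisect_left

def nearest_index(sorted_timestamps, t):
    """Return the index of the timestamp closest to t in a sorted list."""
    i = bisect_left(sorted_timestamps, t)
    if i == 0:
        return 0
    if i == len(sorted_timestamps):
        return len(sorted_timestamps) - 1
    before = sorted_timestamps[i - 1]
    after = sorted_timestamps[i]
    # Prefer the later sample only if it is strictly closer.
    return i if (after - t) < (t - before) else i - 1

def synchronize(narrowband_ts, camera_ts, uwb_ts, audio_window_s=5.0):
    """Build one synchronized frame per narrow-band radar measurement."""
    frames = []
    for t in narrowband_ts:
        frames.append({
            "timestamp": t,
            "camera_index": nearest_index(camera_ts, t),
            "uwb_index": nearest_index(uwb_ts, t),
            # Assumption: the audio window is centered on the radar timestamp.
            "audio_window": (t - audio_window_s / 2, t + audio_window_s / 2),
        })
    return frames

# Toy timestamps in seconds; the narrow-band radar is the slowest stream.
narrowband_ts = [0.00, 0.07]
camera_ts = [0.00, 0.033, 0.066, 0.10]
uwb_ts = [0.01, 0.06, 0.11]

frames = synchronize(narrowband_ts, camera_ts, uwb_ts)
```

Using the slowest sensor as the reference means every synchronized frame has a real narrow-band measurement, while faster streams simply contribute their closest sample.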

Labeling Process
In addition to the background, we consider four classes: pedestrian, cyclist, vehicle and tram, since these are the most likely to be encountered in an urban transport environment. The vehicle class contains cars, trucks, etc. For the labeling process, we manually annotate one image out of every three, as this task is time consuming and the changes between two successive images are practically negligible. We avoided fully automatic annotation in order to obtain a high-quality labeled ground truth. We used the Matlab Image Labeler toolbox, for which we hold a license, as a semi-automatic labeling tool.
Data annotation includes 2D bounding boxes that present respectively x, y, the width and the height of the object in pixels.
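As a small illustration of this box format, the sketch below converts an (x, y, width, height) annotation to corner coordinates and computes the intersection over union of two boxes, a common way to compare a detection with a ground-truth label. It assumes that (x, y) is the top-left corner in pixel coordinates, which the text does not state explicitly, and the function names are illustrative.

```python
def xywh_to_corners(box):
    """Convert (x, y, width, height) to (x_min, y_min, x_max, y_max).

    Assumption: (x, y) is the top-left corner of the box in pixels.
    """
    x, y, w, h = box
    return (x, y, x + w, y + h)

def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax1, ay1, ax2, ay2 = xywh_to_corners(box_a)
    bx1, by1, bx2, by2 = xywh_to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0

# Two hypothetical 10 x 10 boxes that overlap on a 5 x 5 region.
label = (0, 0, 10, 10)
detection = (5, 5, 10, 10)
overlap = iou(label, detection)
```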

Scenarios Selection and Data Formats
In order to collect raw sensor data, we carefully choose diverse driving situations. Scene durations differ and depend mainly on the complexity of the situation. While recording our dataset, we consider diverse challenges that are detailed in the following subsection. We emphasize data variety by using different locations (8 sites) that vary in terms of structures, environment, road markings, traffic signs, etc. Driving situations are carefully selected and collected under different lighting conditions, and we also covered sunny, cloudy and snowy weather. Regarding the data format, the dataset provides synchronized frames of each situation, and the data are stored as RGB images, .txt files containing UWB radar signals, .txt files containing the narrow-band radar data stream and .wav microphone files.

Challenges of the Dataset
With the intention of developing a complete dataset, we cover realistic conditions for environment perception such as cluttered environments, occlusions and varying lighting conditions. To overcome these challenges, exploiting several sensors is necessary to obtain redundant or complementary data that may compensate for the limitations of each sensor. Figure 5 highlights the main challenges introduced in our dataset. Object types exhibit immense variability since they vary in terms of appearance and movement and differ in class: pedestrian, vehicle, etc. When recording our data, we take this camera-radar challenge into consideration by considering 4 categories of obstacles. Furthermore, our dataset features several pedestrians and cyclists of different ages, looks, body sizes, etc. Vehicles are also varied: multiple cars, vans and trucks. This variety is visible in the UWB radar signatures shown in Figure 6, which correspond to each of the considered categories. Moreover, the considered objects can be static or dynamic. Distance is one of the fundamental challenges for autonomous driving, whether for the camera, the two radars or even the microphone. Accordingly, we consider two configurations when capturing our dataset, depending on the range: near and far obstacles.
A further challenge is presented: the cluttered environment since generally dense urban driving involves many traffic agents with a complex background. For UWB radar, multiple reflections can influence the quality of the signal in the presence of many objects. Concerning narrow-band radar, it generates many detections when various obstacles exist, thus a selection process is required to identify the relevant ones. So, we attempt to introduce several complex scenes during recording.
Furthermore, we consider diverse lighting conditions as we record data throughout the day (morning, afternoon and sunset). We collect our dataset under sunny, foggy and snowy weather to increase the diversity and cover the possible real driving situations. In fact, the camera is highly sensitive to the last mentioned challenges whereas the radar is robust against them.
Besides, the object detection task is extremely sensitive to occlusions between several classes, which occur frequently in cluttered scenes. OLIMP includes severe occlusion situations combining the four classes, such as pedestrians that are occluded by each other or by a cyclist, a vehicle or a tram, or the opposite. Figure 7 illustrates the inter- and intra-class challenges by presenting the synchronized data acquired from the camera, the UWB radar and the microphone.

Statistics and Dataset Organisation
OLIMP is organized in 6 subsets, from C0 to C5. C0 contains background only, and C1 includes one, two or a group of pedestrians. C2 comprises cyclists, while C3 and C4 include vehicles and trams, respectively. The final subset, C5, contains the different possible combinations of the aforementioned classes under various scenarios. We focus only on the main moving road objects that can be present in an urban traffic scene.
Our dataset was recorded with 93 pedestrians, 14 cyclists, 90 vehicles and 2 trams. In total, the dataset provides 47,354 frames for each sensor. For the evaluation protocol, 2/3 of the dataset is used for training and 1/3 for testing.
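As a rough illustration of the evaluation protocol, the following sketch splits frame indices into two thirds for training and one third for testing. This section does not state whether the split is drawn per frame or per scene, nor whether it is random; the contiguous per-frame split and the function name `split_two_thirds` are assumptions made for illustration only.

```python
def split_two_thirds(n_items):
    """Return (train_indices, test_indices) for a 2/3 - 1/3 split.

    Assumption: a simple contiguous split over item indices. The paper
    does not specify how its split was drawn.
    """
    n_train = (2 * n_items) // 3  # integer part of two thirds
    indices = list(range(n_items))
    return indices[:n_train], indices[n_train:]


# 47,354 synchronized frames per sensor, as reported for OLIMP
train_idx, test_idx = split_two_thirds(47354)
print(len(train_idx), len(test_idx))
```

Under this assumption, the 47,354 frames divide into 31,569 training frames and 15,785 test frames. A scene-level split would be preferable if consecutive frames of one scene must not appear on both sides.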

Fusion Framework
In our work, we focus especially on obstacle detection and recognition. Thus, in this section, we aim to evaluate each modality individually and propose, afterwards, a fusion-based system that takes advantage of each modality's contribution.

Image-based System
The multiple obstacle detection task can be divided into two steps: localization, which defines the bounding boxes, and recognition, which is performed via a probability estimation. Deep learning techniques have been widely adopted for image-based object detection.
Among the deep architectures known in the literature, we used a MobileNetV2 [56] model pretrained on a subset of the ImageNet dataset to detect objects in RGB images. MobileNetV2 is based mainly on depthwise separable convolutions and contains two types of blocks. The first is a residual block with a stride of 1 and the second is a downsizing block with a stride of 2. Both block types contain three convolution layers: a 1 × 1 convolution layer with ReLU6, a depthwise convolution and a further 1 × 1 convolution. The overall MobileNetV2 architecture contains 17 of these blocks, followed by a regular convolution layer, an average pooling layer and a fully connected classification layer. The network is 54 layers deep and uses 3.5 million parameters [57]. This model was chosen for its compromise between performance and execution time.
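The block structure described above can be made concrete by expanding the standard MobileNetV2 stage table into individual blocks. This is a minimal sketch: the stage table (expansion factor, output channels, repeats, stride) and the 32-channel input to the first block come from the public MobileNetV2 design with width multiplier 1, not from this paper, and we assume the pretrained model used here follows that standard configuration. The names `STAGES` and `list_blocks` are illustrative.

```python
# Standard MobileNetV2 stage table: (expansion t, output channels c,
# repeats n, stride s of the first block in the stage).
# Assumption: taken from the public MobileNetV2 design, not from this paper.
STAGES = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def list_blocks(stages, in_channels=32):
    """Expand the stage table into one entry per bottleneck block."""
    blocks = []
    for t, c, n, s in stages:
        for i in range(n):
            stride = s if i == 0 else 1  # only the first block of a stage may downsample
            blocks.append({
                "in": in_channels,
                "out": c,
                "expansion": t,
                "stride": stride,
                # a skip connection needs stride 1 and matching channel counts
                "residual": stride == 1 and in_channels == c,
            })
            in_channels = c
    return blocks


blocks = list_blocks(STAGES)
n_downsizing = sum(b["stride"] == 2 for b in blocks)
n_residual = sum(b["residual"] for b in blocks)
print(len(blocks), n_downsizing, n_residual)
```

The expansion yields the 17 blocks mentioned in the text. Under the standard table, the stride-1 blocks whose input and output channel counts differ carry no skip connection, so not every stride-1 block is residual.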
The results of training this architecture are presented in Table 5. The metrics chosen to evaluate the performance are precision (P), recall (R) and average precision (AP). As shown in Table 5, MobileNetV2 achieves significant results on the four categories in terms of precision. However, the image-based system provides low recall for all the classes, which indicates that the system generates too many false negatives.
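To make the link between low recall and false negatives explicit, the following sketch computes precision and recall from detection counts. The counts in the example are hypothetical and are not taken from Table 5. The function name `precision_recall` is illustrative, and AP is omitted because this section does not specify the interpolation scheme used.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: few false positives, many missed objects
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r)
```

In this example, most reported detections are correct (high precision), yet half of the objects are missed (low recall), which is the behaviour described for the image-based system.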

UWB Radar-based System
To demonstrate the importance of using the UWB radar, we propose a radar-based system to discriminate the four classes at short distances. First, we classified the whole signals using an SVM in order to distinguish the classes, yet the results were not promising, as the signals contain rich information with significant leakage at the beginning. For this reason, we use narrow-band radar data to achieve better performance. The proposed approach consists of selecting ROIs in the signals acquired from the UWB radar in order to localize the radar signatures that characterise the obstacle. These ROIs are then classified using an SVM. The narrow-band radar generates a list of targets with their position and velocity, so we inject the distances taken from the narrow-band radar data to define the ROIs in the UWB signals. At this stage, we focus on obstacles located less than 6 meters away, since our experiments showed that the UWB radar is less efficient beyond this range. Matching the narrow-band points with the signatures yields multiple ROIs. Accordingly, we exploit the velocity of each obstacle together with its distance to reduce the number of ROIs and to better localise the signature: two objects that are side by side and have the same velocity are considered as one target. In addition, we set an amplitude threshold to validate the ROIs. Figure 8 illustrates this process. The selected ROIs are classified using an SVM classifier with a radial basis function (RBF) kernel. The results of the UWB radar-based system are shown in Table 6. According to our experiments, the proposed radar-based system can better distinguish pedestrians and cyclists. Although the UWB radar provides a unique signature for each class, it is not able to classify trams and vehicles.
Since the results in Table 6 cover the whole test set, the accuracy for those two classes is remarkably low. For safety during the experiments, the tram and the vehicle are generally located far from the radar, at a range greater than 6 m. Thus, the magnitude of the reflections from these two classes is low compared with reflections from a cyclist or a pedestrian, which are usually closer to the radar. This explains the difference in accuracy between the latter two classes and the former two.
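The ROI selection procedure of this section can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the range-bin size, the ROI half-width, the merging tolerances and the amplitude threshold are hypothetical values, as are the function names `merge_targets` and `select_rois`. Only the 6 m range limit is taken from the paper. The SVM classification of the selected ROIs is not shown.

```python
MAX_RANGE_M = 6.0  # range limit stated in the text


def merge_targets(targets, dist_tol=0.5, vel_tol=0.2):
    """Merge neighbouring targets that move at the same velocity.

    targets: list of (distance_m, velocity_mps) from the narrow-band radar.
    dist_tol and vel_tol are hypothetical tolerances.
    """
    merged = []
    for dist, vel in sorted(targets):
        if merged:
            last_dist, last_vel = merged[-1]
            if abs(dist - last_dist) <= dist_tol and abs(vel - last_vel) <= vel_tol:
                continue  # treated as the same target as the previous one
        merged.append((dist, vel))
    return merged


def select_rois(signal, targets, bin_size_m=0.05, half_width=10, amp_threshold=0.3):
    """Return (start, end) sample ranges in the UWB signal, one per validated target."""
    # 1. keep only targets inside the useful range of the UWB radar
    near = [(d, v) for d, v in targets if d < MAX_RANGE_M]
    rois = []
    # 2. merge side-by-side targets that share the same velocity
    for dist, _vel in merge_targets(near):
        # 3. map the narrow-band distance to a window of UWB samples
        center = int(round(dist / bin_size_m))
        start = max(0, center - half_width)
        end = min(len(signal), center + half_width + 1)
        window = signal[start:end]
        # 4. validate the ROI with an amplitude threshold
        if window and max(abs(x) for x in window) >= amp_threshold:
            rois.append((start, end))
    return rois


# Synthetic UWB signal: 200 range bins with a single echo at 2.0 m
signal = [0.0] * 200
signal[40] = 1.0
targets = [(2.0, 1.2), (2.2, 1.2), (4.0, 0.0), (8.0, 5.0)]
rois = select_rois(signal, targets)
print(rois)
```

In this synthetic example, the target at 8 m is discarded by the range limit, the two targets near 2 m are merged because they share a velocity, and the target at 4 m is rejected because no echo exceeds the amplitude threshold in its window.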

Acoustic-Based System
According to the state of the art, MFCCs (Mel-Frequency Cepstral Coefficients) are widely used in sound processing and analysis, as they provide a better representation of the sound [58]. Hence, based on several experiments, we extract temporal features and MFCC-based spectral features from the acoustic data. These features are concatenated and classified using an SVM with an RBF kernel.
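The feature concatenation can be sketched as below. The text does not list which temporal features are used, so the zero-crossing rate and RMS energy are assumptions chosen for illustration. The MFCC vector is assumed to be computed beforehand by an audio library and is passed in as a plain list. The function names are hypothetical, and the SVM training step is not shown.

```python
import math


def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose sign changes."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(1, len(samples) - 1)


def rms_energy(samples):
    """Root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in samples) / len(samples)) if samples else 0.0


def build_feature_vector(samples, mfcc):
    """Concatenate temporal features with precomputed MFCCs (spectral features)."""
    temporal = [zero_crossing_rate(samples), rms_energy(samples)]
    return temporal + list(mfcc)


# Toy frame and a hypothetical 3-coefficient MFCC vector
frame = [1.0, -1.0, 1.0, -1.0]
features = build_feature_vector(frame, mfcc=[0.1, 0.2, 0.3])
print(features)
```

The resulting vector is what a classifier such as an RBF-kernel SVM would receive as input for each audio frame.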
As shown in Table 7, using acoustic data leads to better performance for the tram and vehicle categories. This is due to the characteristic sound generated by these two classes. In other words, the sound of a walking pedestrian carries little information compared with the sound of a tram. For this reason, the precision and recall rates for the tram and vehicle classes are higher than those for the other two.

Multi-Modalities Fusion System
To prove the significance of our dataset, we take advantage of the different sensors by proposing a fusion framework. This framework is built in light of the results obtained from the aforementioned systems: we identified the effectiveness of each sensor individually and its ability to differentiate one class from another according to the results presented in Sections 5.1-5.3. The architecture of the proposed fusion framework is represented in Figure 9. The first step consists of extracting the labels from the MobileNetV2 CNN. If the extracted label is a car or a tram, we use the acoustic-based system to verify the attributed label, and the labels are updated accordingly. If the CNN-extracted label is neither a tram nor a car, the distance of the object is calculated. If it is a far obstacle, we keep the label given by the CNN model. If it is a near obstacle, it is either a pedestrian or a cyclist, and we use the radar-based system to confirm the attributed label, since it can discriminate these categories at a range of less than 6 meters. The results of the fusion framework are given in Table 8.
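The serial decision logic of the framework can be sketched as a single function. This is an interpretation of the description above rather than the authors' code: the text does not detail how labels are updated when two systems disagree, so we assume here that the acoustic label overrides the CNN label for cars and trams, and that the radar label overrides it for near pedestrians and cyclists. The function name `fuse_label` and its signature are hypothetical.

```python
NEAR_RANGE_M = 6.0  # near/far boundary, taken from the UWB radar range limit


def fuse_label(cnn_label, distance_m, acoustic_label=None, radar_label=None):
    """Combine per-modality labels following the serial fusion scheme.

    Assumption: a verifying system overrides the CNN label only when it
    returns one of the classes it is trusted for.
    """
    if cnn_label in ("car", "tram"):
        # the acoustic system verifies large, loud objects
        if acoustic_label in ("car", "tram"):
            return acoustic_label
        return cnn_label
    if distance_m >= NEAR_RANGE_M:
        # far obstacle: keep the CNN decision
        return cnn_label
    # near obstacle: the UWB radar system confirms pedestrian vs cyclist
    if radar_label in ("pedestrian", "cyclist"):
        return radar_label
    return cnn_label


print(fuse_label("car", 10.0, acoustic_label="tram"))
print(fuse_label("pedestrian", 12.0, radar_label="cyclist"))
print(fuse_label("pedestrian", 3.0, radar_label="cyclist"))
```

Each object passes through at most one verifying system, which reflects the simple, serial nature of the framework.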

Discussion
We conducted various experiments using single and multiple modalities to validate our dataset and to open the way for future research. Three fusion levels are exploited in our work: low, intermediate and late. Low-level fusion occurs when projecting narrow-band data into UWB signals to define ROIs. Intermediate fusion consists of concatenating temporal and spectral features for acoustic data. Lastly, late fusion is used to combine the decisions of the individual systems into the final decision of the overall framework. From the fusion results presented in Table 8, we notice that the performance is clearly improved in terms of precision. The improvement brought by the acoustic system is larger than the contribution of the radar-based system. This is mainly because of the range and power limitations of the UWB radar. Despite this, it provides a unique signature for each type of object at a low price compared with newer sophisticated radars. For the acoustic system, the distance between the obstacle and the sensor is an important challenge. Moreover, obstacles like pedestrians and cyclists produce low-magnitude acoustic signals and cannot easily be detected by acoustic-based systems.
The environments considered in OLIMP are challenging and contain various confusing elements such as metal infrastructure, traffic signs and glass-surface buildings. The obtained results for object detection are promising and show the importance of multimodality for vehicle environment perception. To the best of our knowledge, this is the first dataset to exploit UWB technology and acoustic data, which shows the originality of our work. For this reason, we encourage research on new fusion networks that use two or more modalities to enhance vehicle environment perception. The proposed fusion framework is limited by its simple and serial design. We believe that this shortcoming could be overcome using advanced parallel fusion systems. This will be investigated in future work.

Conclusions
In this paper, we propose OLIMP, a multimodal dataset for environment perception. It includes four modalities: images, ultra-wideband radar signatures, narrow-band radar data streams and acoustic data. The acquired data is synchronized, and annotation is provided for the RGB images. This dataset is the first to introduce ultra-wideband radar and an acoustic sensor. It was captured in various environments and is dedicated mainly to dense urban traffic situations. To demonstrate the effectiveness of our dataset, we presented a fusion framework which takes advantage of the results obtained using each modality separately. In spite of its simplicity, the proposed framework yields a promising improvement in terms of precision. These experiments highlight the relevance of the proposed modalities.