On-Device Object Detection for More Efficient and Privacy-Compliant Visual Perception in Context-Aware Systems

Ambient Intelligence (AmI) encompasses technological infrastructures capable of sensing data from environments and extracting high-level knowledge to detect or recognize users’ features and actions, as well as entities or events in their surroundings. Visual perception, particularly object detection, has become one of the most relevant enabling factors for this context-aware user-centered intelligence, being the cornerstone of relevant but complex tasks, such as object tracking or human action recognition. In this context, convolutional neural networks have proven to achieve state-of-the-art accuracy levels. However, they typically result in large and highly complex models that demand computation offloading onto remote cloud platforms. Such an approach has security- and latency-related limitations and may not be appropriate for some AmI use cases where the system response time must be as short as possible, and data privacy must be guaranteed. In the last few years, the on-device paradigm has emerged in response to those limitations, yielding more compact and efficient neural networks able to address inference directly on client machines, thus providing users with a smoother and better-tailored experience, with no need to share their data with an outsourced service. Framed in that novel paradigm, this work presents a review of the recent advances made along those lines in object detection, providing a comprehensive study of the most relevant lightweight CNN-based detection frameworks, discussing the most paradigmatic AmI domains where such an approach has been successfully applied, the different challenges that have arisen, the key strategies and techniques adopted to create visual solutions for image-based object classification and localization, as well as the most relevant factors to bear in mind when assessing or comparing those techniques, such as the evaluation metrics or the hardware setups used.


Introduction
Today ambient intelligence (AmI) systems are experiencing unprecedented growth, essentially due to the fast development of two of their main enabling technologies: the so-called Internet of Things (IoT), focused on the exploitation of networked sensor infrastructures to remotely gather data and enable data exchange between several distributed ends, and Artificial Intelligence (AI), more oriented to the use of gathered data and the subsequent distillation of the knowledge necessary to make AmI systems adaptable and "aware" of their surroundings. Both fields, IoT and AI, have experienced outstanding progress almost side by side in the last two decades, a fact that is by no means fortuitous, but rather the result of a continuous interaction that is still ongoing.
Methods and technologies developed within AmI under the IoT paradigm facilitate communication and data exchange among the different devices that comprise the so-called intelligent environments, enabling the creation and proper exploitation of new network architectures based on connected sensing instruments and resource-constrained, energy-efficient end devices (i.e., embedded devices and mobile devices). Efforts made in that regard have been primarily aimed at the design and deployment of faster and more efficient network infrastructures, as well as at the development of more accurate sensing platforms and more powerful computing hardware, enabling the collection and storage of large volumes of data and, consequently, the support of increasingly sophisticated AI techniques, from traditional machine learning (ML) methods all the way to more recent deep learning (DL) approaches.
Modern advanced algorithmic solutions can process sensed data and derive high-level knowledge to allow intelligent computational systems to recognize and understand both user activity and behavior, as well as the different phenomena around them to extract context-tailored knowledge and consequently, provide more effective support. Artificial visual perception techniques, particularly object detection methods, have emerged as some of the most relevant enabling factors of that user-centered intelligence, providing a better understanding of the environment and the entities in it, thus constituting a fundamental pillar for vision tasks in the AmI context, such as object tracking [1] and human action recognition [2,3], among others.
Despite the qualitative leap forward in accuracy brought by the recent wide adoption of convolutional neural networks (CNNs) and the countless research efforts mobilized in the last decade in that direction, object detection remains a highly complex problem. CNNs, as a specialized type of DL network, are computationally expensive [4,5] and have a large memory footprint. Although cloud platforms have proven to be scalable and powerful enough to meet those needs, leveraging such infrastructures typically involves deploying neural networks as remote services, demanding a virtually persistent connection with the server side. That usually leads to latency (network transfer time + processing time) and security limitations that make those solutions unsuitable for conventional AmI scenarios where the system response time must be as short as possible, or data privacy needs to be ensured.
As an alternative to the cloud, many authors have focused their research on local processing strategies in response to the limitations mentioned, resulting in a novel DL trend called on-device machine learning. Such an approach pursues more compact and efficient models directly deployable into IoT devices, reducing the server-side computational load, the data traffic between endpoints, and the associated latency. Thus, it makes it possible to incorporate a layer of intelligence into the different end devices in AmI systems and as a result of the latter, it enables a more fluid experience better tailored to end users' needs, without compromising the privacy of their data.
While AmI is a mature discipline that has attracted a great deal of attention over the last fifteen years from scientists and practitioners in various fields, such as human-computer interaction (HCI), AI, and communication networks, the search for on-device lightweight detection solutions is a still-emerging area of study that has barely been around for five years. Both of them, however, have yielded a vast scientific production, as evidenced by the considerable number of surveys currently present in the related literature. Specifically concerning AmI, it is possible to find a wide range of studies: (i) papers of a general nature that overview the field, introducing key concepts, common use cases, and state-of-the-art techniques [6–11]; (ii) works circumscribed only to a subset of relevant related matters, such as security and privacy [12–15], sensing, computing, and networking technologies [16,17], as well as ethical implications and social issues [18], and (iii) research focused on particular application domains (e.g., architecture [19], workplaces [20], educational environments [21,22], and smart cities [18]), with healthcare being the one that has received the most attention [12,23–29]. For its part, the on-device object detection corpus is limited to literature published over the last five years (2017–2021). It encompasses (i) surveys that cover the on-device ML field from a generic perspective following a schema similar to the one indicated for AmI surveys [30–34], (ii) works exclusively circumscribed to the object detection problem [35–41], and last but not least, (iii) research on problem-specific matters, such as well-performing network architectures [37,39], appropriate training strategies [36,39], and techniques to enhance the representational power of neural networks [40].
Beyond the works referenced, the intersection of AmI and AI has already resulted in a substantial number of review articles [17,42–45]. They summarize the progress both fields have experienced jointly, detailing the several approaches explored for creating intelligent and context-aware environments (e.g., human activity and behavior recognition techniques [44], and classification methods for more accurate disease prediction and diagnosis [45]). Within this body of work, although some papers can be considered part of the on-device literature since they analyze different IoT communication architectures, as well as hardware technologies for local data processing, there is not yet any equivalent dealing either with the algorithmic part or, in any case, providing an overview of the advanced sensing techniques required for the implementation of the intelligence, context awareness, user adaptability and user privacy preservation principles that, according to Augusto et al. [46], should drive the building process of any AmI system.
To fill that gap, this paper reviews some of the most relevant research works recently produced in different AmI application areas of interest, aimed at the design and development of efficient, compact, locally executable, and consequently, privacy- and data-security-friendly object detection frameworks. In particular, this work is exclusively focused on CNNs, a deep learning algorithm family that has been shown to be the current state of the art for the target problem [36,38,39]. More specifically, the paper analyzes and provides the reader with a structured presentation of the most salient related approaches and techniques, giving details on both the different challenges or issues that the authors in the field have faced and the various CNN architectures exploited for that purpose, also discussing the specific strategies or approaches adopted and the different configurations or frameworks used to evaluate their performance.
The scope of the study is restricted to CNNs conceived according to the on-device paradigm. The paradigm itself and its relatively short lifetime represent a highly constraining filtering criterion for the source search process carried out. The publication year of MobileNets [47] (2017) was taken as the primary point-in-time reference, given that, to the best of the authors' knowledge, it is widely considered the very first compact CNN architecture explicitly designed to be deployed on low-powered devices, in particular, on mobile devices. Thus, the survey covers the major progress made in that direction in the last five years, only considering still image processing techniques and being restricted to application domains and use cases traditionally associated with the AmI field drawn from existing literature, such as [42,43]. Domain names together with terms such as "on-device", "object detection", "convolutional neural networks" or its acronym "CNN" were used as keywords in the search for pertinent resources, which was performed using the Google Scholar engine. The list of results originally obtained was later refined through an iterative process, discarding research papers outside the scope previously outlined, initially after merely reviewing the abstract and, in a subsequent step, after an in-depth reading of the work.
The rest of the document is organized as follows. Section 2 provides context by characterizing the most relevant studies in the field. Section 3 analyzes the different lightweight detection frameworks conceived, detailing the different strategies or approaches adopted, as well as the specific challenges or typical problems addressed in such context. Section 4 discusses the factors and metrics considered when evaluating the performance of the applied solutions or comparing them with existing alternatives. Finally, Section 5 summarizes the observations drawn from the state of the art and identifies research challenges to be addressed in future work.

On-Device Object Detection for Context Awareness in Ambient Intelligence Systems
Following a top-down approach, we start the review by first creating a taxonomy of the main AmI application domains where lightweight sensor-based solutions have been recently explored for building context-aware intelligent systems. Such systems can sense the environment and, as a result, both acquire static images and record videos. Thanks to an "augmented" visual perception layer (that assembles at the same level what is traditionally known as the perception layer in IoT, together with the application layer), they can successfully process such input data and extract the knowledge required to better understand the surroundings, as well as to interact with the entities (including users) that exist there in a more effective manner. The ability to detect such entities without offloading computation to external third-party systems endows AmI solutions with great flexibility and versatility, as evidenced by the significant number of domains and applications where on-device object detection has been explored, shown in Table 1.
It is possible to create a first classification in which we can distinguish, on the one hand, a number of general-scope works [48–66] and, on the other hand, a larger group of publications that present domain-specific research. The first group includes studies that, far from focusing on a particular application area, propose more open research involved in the search for efficient and lightweight detection solutions oriented to AmI scenarios. Overall, authors pursue advances in object detection within the on-device paradigm, trying, in general terms, to satisfy the requirements that the implementation or deployment of state-of-the-art detection techniques in resource-limited devices inevitably entails, but putting particular interest in, if not achieving real-time performance [49,50,54,64,66], obtaining at least an adequate trade-off between speed and accuracy [52,56,57,60,65]. The experimentation proposed along those lines is typically contextualized in a given AmI scenario or application. Such contextualization, however, has nothing to do with addressing the specificities of an AmI area or sub-field. It is used instead for practical purposes, just as a use case for quantitatively analyzing the performance of the algorithms and methods devised, as well as for assessing the feasibility of their deployment given the constraints of the hardware platforms used; it merely has implications for the datasets (the classes or object types considered) exploited for training, fine-tuning, and testing the CNNs designed. In that regard, widespread and cross-domain AmI applications on mobile or embedded devices prevail, such as face detection [48,58,60–62] (biometric security, surveillance), and vehicle and pedestrian detection (security, surveillance, autonomous vehicles, smart cities) both in cars [56,64,65] and unmanned aerial vehicles (UAVs) [49,52–55,57,59].
Furthermore, as far as domain-specific studies are concerned, it is possible to categorize them into five groups: intelligent transportation [68,69,77–79,82,85,92,96,102,105,106], surveillance [67,70,75,89,90,95,98,104,109], smart farming [76,80,87,91,93,103,107], healthcare [73,74,88,97,110] and smart cities [72,100,101,108], all of them scenarios where constant and real-time object detection is necessary for enabling context-awareness on end devices. While further information on each of those domains will be incorporated into the discussion in successive paragraphs to draw a clearer picture, it should be noted first that additional application areas, albeit almost residually with only one or two related works identified, have emerged in the analysis: (i) robotics [81,94], a domain where vision represents one of the most important communication channels with the environment, and where object detection has traditionally proven to be critical for the perception, modeling, planning, and understanding of unknown terrains [94]; (ii) defense, where object detection constitutes a major factor for controlling UAVs [84] and detecting ships in radar images [86]; (iii) smart logistics, with two distinct but equally representative examples of the use of sensing technologies, one on embedded platforms (in situ detection and recognition of ships for more efficient port management) [83], and the second one on mobile devices (barcode detection) [99]; and finally, (iv) human emotion recognition based on facial expression detection, as reported in [71]. Once the less representative options are covered, the remainder of this section will deal with the most common domains that have emerged in the analysis.
Intelligent transportation systems (ITS) have lately attracted considerable research interest driven by the modernization of transportation infrastructures and the development of autonomous driving and its supporting technologies. The reviewed AmI literature is good evidence of that. Although the works specifically identified along those lines [64,68,69,77–79,82,85,96,102,105,106] cover only a limited subset of the broad spectrum of applications resulting from the technological progress in the field, they provide good intuition on the relevance of object detection as a key factor for making both vehicles and infrastructures safer, more efficient, comfortable and reliable. In particular, in-vehicle ITS systems [64,77–79,85,96,102,105,106] (commonly known as Advanced Driving Assistance Systems or ADAS) stand out in the analysis as the central focus of interest. Such systems embed detection solutions conceived as safety mechanisms for monitoring both driver operations [77,78,85], preventing distractions and, ultimately, the loss of control of the vehicle, and on-road events [64,79,96,102,105,106], the latter being mainly implemented nowadays in the form of a warning instrument triggered in situations of potential collisions [79,96] or infractions [106], but also designed towards decision-making support in future autonomous vehicles [64,102,105]. For their part, the conventional ITS alternatives identified [68,69,82] are aimed at road traffic control and early response to emergencies, applications in which the use of UAVs has been intensively explored [68,69] due to their capabilities to record and communicate information in a non-intrusive way, as well as to survey areas inaccessible by other means.
A key challenge in this context lies in UAVs' built-in computing components, since they have to process the acquired images with almost no latency to make critical navigation decisions [68] while consuming as little power as possible to minimize the impact on battery life and hence on the system's flight time [69].
As with ITS, surveillance systems have similarly undergone a dramatic development, pushed by the advancement of information and communication technologies. Like the former, they require a real-time capacity to perform key tasks, such as monitoring and target tracking. In this regard, the studies analyzed within the domain [67,70,75,89,90,95,98,104,109] reveal a large body of works focused on the exploitation of efficient CNNs for human detection in a wide range of scenarios: street monitoring in urban areas [67], home surveillance [75], subway [95] and railway security [90], people search and rescue in natural disasters [104], and recognition of specific human behaviors [89]. Besides low latency, there is a couple of additional factors to be considered when designing such systems that have emerged in the review: privacy [70] and cost [75]. Individuals' privacy and anonymity should be a central tenet of surveillance systems. However, RGB cameras commonly used for such purposes are too invasive, and for that reason, we see how authors, such as Mithun et al. [70] explore the exploitation of alternatives, such as raw depth data. Furthermore, surveillance has traditionally been extremely important for business and home security, and while there is a decent number of technological solutions in the market, most of them rely on expensive proprietary cameras. Low-cost IoT hardware platforms, as suggested by [75], might be a perfectly valid approach to building solutions tailored to more austere budgets. Surveillance and development of low-cost systems can be also observed in smart farming literature, the third of the five main AmI domains identified. In particular, both of those elements are present in Seo et al.'s work [93]. 
In that work, detection frameworks are devised to automate pig monitoring, a use case where implementing low-cost solutions is tremendously important given the potential need for large-scale deployment of such systems and the more than likely high turnover rate due to their rapid physical deterioration. Nevertheless, except for Seo et al.'s work, the different observed approaches are all related to activities of the agricultural sector, a context where detection techniques stand out as an effective method to recognize plant diseases [80,107], as well as an essential service integrated into the control software of agricultural robots [76,87,91,103]. Specifically regarding the last point, leveraging mobile agricultural robots makes it possible to automate a considerable part of farmers' traditional, eminently repetitive tasks, such as fruit counting [87,103], harvesting, and picking [76]. Fruit detection, as well as the detection of any other typical element in such contexts like trunks or branches, is exploited not only for the indicated purpose but also to provide the robot with the information on the environment necessary to successfully navigate through the crop fields [91], which are often irregular or located in areas where the GPS signal is not reliable enough.
Finally, to conclude this first analysis focused on the current landscape of object-detection-based AmI applications for low-power devices, we add to the discussion the two application domains not yet covered from the group of five with greater representation in the study: healthcare [73,74,88,97,110] and the so-called smart cities [72,100,101,108]. With regard to healthcare, on-device detection techniques are shown to be effective in extending healthcare spaces beyond the traditional scenario of closed clinical environments, bringing the capabilities of (i) disease diagnosis [73,74], (ii) wound or injury zone delimitation [97] and (iii) patient monitoring and support [88] (available only in typically complex and expensive configurations until recently) to low-cost portable devices. Object detection, however, is not the final intended application. Detection solutions are conceived instead as enabling services for more sophisticated vision tasks, as shown in Table 1. In addition, as far as smart cities are concerned, detection mechanisms are exploited to provide support to more complex tasks as well, but, in this particular case, seeking more resource-efficient management of everyday services in urban areas, such as public transportation [101] and garbage collection [72,100,108].

Compact and Efficient Vision-Based Detection Frameworks
Nowadays, whether we tackle the cross-domain on-device space involving compact and efficient solutions or more specific research focused just on the current AmI landscape, dealing with the design of vision-based object detection frameworks involves the discussion of CNN architectures. While statistical classifiers, such as Support Vector Machines (SVM) [111], Random Forest [112], AdaBoost [113] and classical neural networks were for many years the computer vision (CV) standard and played a dominant role in object detection tasks, the advent of DL techniques has unquestionably represented a step forward compared to traditional detectors. The exploration of new approaches and the design of new CNN architectures capable of automating the extraction of representative features, boosted by challenges such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [114] and Pascal Visual Object Classes (VOC) [115], has resulted in a plethora of novel methods with increasingly higher performance in visual recognition tasks over the last decade, producing more robust frameworks capable of addressing object localization and classification tasks in highly complex scenarios.
Such challenges and the tremendous interest generated in the scientific community have pushed CNNs to an unprecedented evolution. Such progress, however, has led to increasingly complex architectures, as indicated in the introduction of the paper. Models, such as VGG-16 (138 M parameters) [5] or RetinaNet (built on ResNet-152 [116], with 60 M parameters) [117], while achieving high levels of accuracy, are generally based on bulky structures and high-dimensional parameter spaces, resulting in a high volume of both computed intermediate products and output values. In response to that reality, a fair number of investigations have focused on producing more efficient and effective detection frameworks during the last five years. The reduction of the computational cost of traditional detectors and the preservation of accuracy have driven the solution search process carried out by CV experts (both scientists and practitioners), resulting in a fair number of techniques and methods that, built upon existing restrained-complexity CNNs (e.g., SSD [118], the YOLO family [119–122], and Faster R-CNN [123]), have shown promising performance in detection tasks and have also proven to be applicable and relevant in different disciplines, including AmI. With the support of Tables 2 and 3, we provide an overview of the various detection frameworks devised within AmI along the described lines, covering both architectural details and traditionally related challenges together with the most relevant solutions explored.
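To give a sense of where a figure such as the 138 M parameters quoted for VGG-16 comes from, the following minimal sketch counts the weights and biases layer by layer. It assumes the standard 16-layer configuration (thirteen 3×3 convolutions and three fully connected layers), a 224×224 RGB input and 1000 output classes; the constant and function names are illustrative only and are not taken from the reviewed works.

```python
# Parameter count of VGG-16, assuming the standard 16-layer configuration,
# a 224x224 RGB input and 1000 output classes (ImageNet setting).

# Output channels of the 13 convolutional layers; "M" marks 2x2 max pooling.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
# Sizes of the three fully connected layers.
VGG16_FC = [4096, 4096, 1000]


def count_vgg16_params(in_channels=3, input_size=224, kernel=3):
    """Return (conv_params, fc_params), biases included."""
    conv_params = 0
    channels, size = in_channels, input_size
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2  # pooling halves the resolution and has no weights
            continue
        conv_params += kernel * kernel * channels * layer + layer
        channels = layer
    fc_params = 0
    features = channels * size * size  # flattened 512 x 7 x 7 feature map
    for units in VGG16_FC:
        fc_params += features * units + units
        features = units
    return conv_params, fc_params


conv, fc = count_vgg16_params()
total = conv + fc
print(f"conv: {conv:,}  fc: {fc:,}  total: {total:,}")
print(f"storage at 4 bytes per weight: {total * 4 / 1e6:.0f} MB")
```

Under these assumptions, most of the parameters sit in the fully connected layers rather than in the convolutions, which is one reason why compact architectures tend to shrink or drop them.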

Architectures
Intuitively, upon a first look at the data reported in Table 2, it is possible to observe the predominance of one-stage detectors (39 out of a total of 42) over the two-stage alternative [83,95,99]. Two-stage and single-stage detection frameworks (the latter also called unified detectors) are the two main categories typically considered for the classification of modern object detection pipelines. Two-stage detectors have typically featured high accuracy in both target localization and classification tasks thanks to the exploitation of a Region Proposal Network (RPN) [123] dedicated to producing Region of Interest (RoI) proposals, i.e., candidate object bounding boxes. However, region proposal generation has proven to be an insurmountable bottleneck when trying to produce less costly models. Therefore, many authors have focused their research on exploring unified detection strategies, seeking more efficient solutions. Such an approach models object detection as a single regression problem, predicting in a single step both the bounding boxes of the areas where potentially interesting objects might be found (localization) and the classes they belong to (classification). The result is a simplified and smaller architecture, conceived as a single feedforward neural network, suitable to derive high-inference-speed detection models. The unified approach, crystallized in compact architectures considered a standard nowadays, such as SSD [118] and YOLO in its different versions [119–122], constitutes the foundation on which modern lightweight object detection architectures have been built in AmI [48,49,52,54,55,63,65,66,68,70,72,73,75,77–82,86–88,90,91,94,100,105–107]. Such compact architectures, however, resulted from an eminently moderate structural optimization in a first attempt to make CNNs more manageable, prioritizing accuracy preservation over structural reduction and inference speed increase.
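As an illustration of what "a single regression problem" means in practice, the sketch below decodes the output of a hypothetical unified detector. It assumes a simplified YOLO-style layout in which the image is divided into a grid and each cell regresses a fixed number of boxes (x, y, w, h, confidence) plus one set of class probabilities; anchor boxes and non-maximum suppression are deliberately left out, and all names are illustrative.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max, class_id, score), coordinates relative to the image
Detection = Tuple[float, float, float, float, int, float]


def output_size(grid: int, boxes_per_cell: int, num_classes: int) -> int:
    """Number of values regressed by the network in one forward pass."""
    return grid * grid * (boxes_per_cell * 5 + num_classes)


def decode_cell(cell: List[float], row: int, col: int, grid: int,
                boxes_per_cell: int, threshold: float) -> List[Detection]:
    """Turn the raw vector of one grid cell into scored bounding boxes."""
    class_probs = cell[boxes_per_cell * 5:]
    class_id = max(range(len(class_probs)), key=class_probs.__getitem__)
    detections = []
    for b in range(boxes_per_cell):
        x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
        score = conf * class_probs[class_id]  # localization and class in one score
        if score < threshold:
            continue
        # x, y are offsets inside the cell; w, h are relative to the whole image.
        cx, cy = (col + x) / grid, (row + y) / grid
        detections.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                           class_id, score))
    return detections


# A 7x7 grid with 2 boxes per cell and 20 classes yields 7 * 7 * 30 values.
print(output_size(7, 2, 20))

# One cell with 2 boxes and 2 classes: only the first box passes the threshold.
example_cell = [0.5, 0.5, 0.2, 0.4, 0.9,   # box 1
                0.0, 0.0, 0.0, 0.0, 0.1,   # box 2
                0.1, 0.8]                  # class probabilities
print(decode_cell(example_cell, row=3, col=3, grid=7, boxes_per_cell=2, threshold=0.25))
```

Because every box and class score comes out of the same forward pass, no separate proposal stage is needed, which is the property the unified detectors discussed above rely on.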
Thus, although it is possible to state that those detectors represented an effective evolution towards more efficient techniques, especially if we compare them with two-stage frameworks, they were still insufficient for running such detection systems on limited-resource devices, mainly due to their considerable computational complexity and size. Only recently have works within the on-device paradigm actually delved into the compression and simplification of detection models and their underlying architectures, achieving a more pronounced decrease in both the number of parameters (and accordingly, the storage required) and the inference time of the detectors produced. Such research has resulted in significantly more efficient detection frameworks, making them directly deployable on low-power and low-memory hardware platforms but, on the flip side, causing a dramatic accuracy degradation. Despite the latter, these novel models, e.g., Tiny YOLO [119,120] and SSDLite [118], are already considered standard benchmarking options when designing new lightweight detectors in AmI [50,58,64,69,74,84,89,92,93,96,97,101–103,108–110].
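One common way in which such reductions in parameter count are obtained is by replacing standard convolutions with depthwise separable ones, the building block popularized by MobileNets. The text above does not detail this mechanism, so the following is only an illustrative sketch, with biases ignored and hypothetical function names.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise separable convolution.

    Depthwise step: one k x k filter per input channel.
    Pointwise step: a 1 x 1 convolution mixing c_in channels into c_out.
    """
    return kernel * kernel * c_in + c_in * c_out


def reduction_ratio(kernel: int, c_in: int, c_out: int) -> float:
    """Separable / standard weight ratio, equal to 1 / c_out + 1 / kernel**2."""
    return separable_conv_params(kernel, c_in, c_out) / standard_conv_params(kernel, c_in, c_out)


# Example layer: 3x3 kernel, 256 input channels, 256 output channels.
k, c_in, c_out = 3, 256, 256
print(standard_conv_params(k, c_in, c_out))    # 589824
print(separable_conv_params(k, c_in, c_out))   # 67840
print(round(reduction_ratio(k, c_in, c_out), 3))
```

For this example layer, the separable variant keeps roughly 11.5% of the weights, in line with the closed-form ratio 1/c_out + 1/k².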
The framework type constitutes the master blueprint adopted to build the detector. The building process is driven by the different design decisions aimed at modeling and tailoring the underlying CNN to the specific problem addressed. Such tasks can be tackled by directly reusing an existing standard network or designing a new detection pipeline from scratch. The utilization of off-the-shelf CNNs, whose implementation and weights are available in ML development frameworks, such as TensorFlow [124] and PyTorch [125], simplifies the development of new detection solutions considerably.
In this regard, pre-trained detectors can be either exploited right away or tuned for a second, typically more specific, target task through transfer learning. This transfer learning approach enables reusing already existing models [48,55,63,70,75,76,78–82,87,91,100,103,107], previously trained on public standard object detection benchmarks, such as Pascal VOC [115] (used in [48]), and Microsoft COCO [126] (used in [63,75,78,80,91,107]). Still, as shown in Table 2, it is also common practice to exploit not the entire detection model but merely the backbone [49,50,59,68,71,95,106,109], that is, the CNN embedded in the detection framework that is responsible for extracting from the input images the feature maps subsequently exploited by the deeper layers of the detector to predict the classes and bounding boxes produced as output. In any case, either globally or just circumscribed to the backbone, weights are initialized with values taken directly from pre-trained models, and then, through a fine-tuning process, the detector is re-trained on an application-specific dataset in order to adjust it to the specific use case to be addressed.
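The fine-tuning scheme described above can be illustrated with a deliberately tiny, dependency-free sketch. It is a toy stand-in under strong assumptions: the "pre-trained backbone" is a fixed two-unit ReLU layer with made-up weights, the "detector head" is a single logistic unit, and only the head is updated, which corresponds to just one possible fine-tuning policy (the backbone may also be re-trained). Real detectors would rely on a DL framework; every name and value below is hypothetical.

```python
import math

# Hypothetical "pre-trained" backbone weights, kept frozen during fine-tuning.
BACKBONE_W = ((1.0, -1.0), (-1.0, 1.0))


def backbone(x, weights=BACKBONE_W):
    """Frozen feature extractor: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def head(features, w, b):
    """Trainable task-specific head: a logistic unit on top of the features."""
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(data, w, b):
    """Average binary cross-entropy over the application-specific dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(head(backbone(x), w, b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def fine_tune_head(data, w, b, lr=0.5, epochs=100):
    """Stochastic gradient descent that updates the head parameters only."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            f = backbone(x)            # backbone weights are read, never updated
            grad = head(f, w, b) - y   # derivative of the loss w.r.t. the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


# Toy application-specific dataset: label 1 when the first input dominates.
DATA = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.8, 0.3], 1), ([0.2, 0.9], 0)]

w0, b0 = [0.0, 0.0], 0.0
w1, b1 = fine_tune_head(DATA, w0, b0)
print(f"loss before: {mean_loss(DATA, w0, b0):.3f}, after: {mean_loss(DATA, w1, b1):.3f}")
```

The point of the sketch is the division of roles: the features come from weights learned elsewhere, while the small task-specific part is fitted to the new dataset.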
As reflected in the "Baseline" column of Table 2, leaving aside a few works [56,61,83,85] that have explored the creation of new architectures from scratch, the most common approach along those lines involves using an existing architecture as a baseline and then either replacing the backbone [49,52,54,86] or introducing changes to the detector's architecture or the training process parameterization [55,58,62,69,84,89,90,93,96,98,99,101,104,108,110]. Such tweaks and their focus, as seen when the "Baseline" and "Enhancement Emphasis" columns are analyzed jointly, are closely related to the very nature of the detection framework adopted as a reference. Both the "classical" moderate single-stage detection frameworks and the modern on-device models constitute a valuable starting point that simplifies and speeds up the creation of new detectors. However, overall, they do not represent proper solutions for developing real-world AmI applications in austere technological environments. In particular, the more accuracy-preserving approaches, as just pointed out, fall short when it comes to optimizing the underlying network architecture and therefore require efforts mainly aimed at minimizing the memory and computational resources demanded [48,55,70,79,82,85,86,90,107]. For their part, lightweight detectors are the result of more aggressive compression and optimization strategies. A two-fold effort is required to move towards a better speed-accuracy trade-off: on the one hand, exploiting techniques capable of mitigating the accuracy degradation caused by the use of compact architectures and, on the other hand, adopting approaches that produce more expressive networks with greater detection capacity [51,58,59,62,64,76,84,92,93,96–98,102,103,108,109].
Furthermore, in addition to the dual path of improvement indicated, there is a third line of refinement that has pursued the creation of more robust detectors [49,52,54,58,62,63,65,72,75,78,80–84,88,91,93,96–98,102,103,105,108,109], whatever their nature. Overall, CNNs have typically shown limitations in generalizing their predictions beyond the training domain or the distribution of the dataset used. Consequently, tasks considered reasonably straightforward for humans, such as detecting faces in the wild (with different head poses and facial expressions), might be very challenging for vision-based detectors. Generalization and robustness against variations have traditionally been relevant in object detection, but they have proved to be even more critical for on-device detection solutions conceived for AmI. On the one hand, as made explicit in Section 2, AmI is a domain that covers a fair number of applications where a poorly performing detector can have an actual impact on the physical integrity, safety, and wellbeing of the users (e.g., in autonomous driving [92,102,105]), and even on their personal finances (e.g., in smart farming [76,80,87,91,93,103,107]); for that reason, detection systems must ideally ensure consistency in their behavior. On the other hand, on-device detection solutions are, by definition, lightweight DL models directly deployable on hardware platforms with limited resources, a category that includes portable devices, such as smartphones, and mobile machinery, such as robots and UAVs. Those devices have reportedly been used and proved helpful in different real-world scenarios where dynamic and challenging conditions are likely to alter how targets look, thereby contributing to higher intra-class variability.
Below we list, in decreasing order of occurrence, the factors identified in the AmI papers analyzed as common obstacles to detection robustness: (i) illumination [58,65,75,78,80,81,83,103,105,109], which has an impact on images at the pixel level, altering object colors and even producing a sense of occlusion in shadowed areas; (ii) total or partial target occlusion [58,80,81,83,103], including also object overlapping and truncation [102]; (iii) viewpoint [63,65,75,80,84], which makes a single object look completely different depending on the angle of view; (iv) weather conditions [83,84,105,109], which lead to illumination variations in the scene and also encompass phenomena (e.g., fog and rain) that cause distortion or blurring effects on the acquired image; (v) object pose and orientation [58,78,109], usually upward in the real world, but not necessarily so in implementations such as mobile augmented reality applications [63]; and (vi) background complexity [65,83,98], which usually arises in crowded or cluttered scenes and makes it difficult to separate targets located in the foreground from non-relevant objects located in the background.

Challenges and Explored Approaches
Environment variation is still one of the common challenges that need to be addressed when developing detectors and, as we have just seen, this also holds when addressing the object detection problem within AmI while pursuing more efficient and compact solutions. Although a fair number of such conditions have been successfully modeled by gathering representative samples as part of the datasets used for training the detection models, it is impossible to anticipate all the conditions that can potentially occur in the wild. Those context variations represent, however, only one of the various challenges observed in the AmI literature reviewed. Table 3 presents all the challenges resulting from the analysis, together with the different strategies or methods also noted. As far as the challenges are concerned, it is possible to categorize them into three distinct groups: AmI-related challenges, a group that comprises the environment-variation-specific issues mentioned earlier, but also both intra-class variation and scale variation challenges; performance objectives inherent to the on-device paradigm, such as model size and computational load decrease, as well as accuracy boost; and finally, data-specific issues, including scarcity and low quality, both of them classical DL challenges that arise when dealing with specific applications or domains. In addition, concerning the various solutions reported, we can group them into four categories or approaches: (i) design of novel CNN-based architectures, (ii) model compression and acceleration, (iii) creation of improved ad hoc datasets, and (iv) hyperparameter tuning. Below we discuss those categories in greater detail, providing information on the specific techniques and methods explored and focusing on the potential benefits that have contributed to effectively addressing the challenges previously mentioned.

Design of Novel Lightweight CNN-Based Architectures
The design of novel lightweight CNN architectures stands out in the review as the dominant line of work, represented by the first 20 rows out of the 29 in Table 3. As its name suggests, this approach aims primarily at creating more compact and efficient network architectures (as reflected by the number of cells resulting from the intersection of the corresponding row in Table 3 with the first two columns), but it also seeks to improve accuracy in order to achieve an appropriate trade-off between speed and accuracy. Size and latency reduction go hand in hand: the more weights a model has, the higher the number of operations involved and, consequently, the computational complexity and inference time required. Leaving aside a few exceptions [49,82,86], several strategies can be observed in the AmI research analyzed that pursue in the first place the reduction of the number of parameters, but that also affect the computational cost of the detector. In particular, it is possible to distinguish among them techniques eminently focused on the internal configuration of the layers in the network, as well as methods that, on the contrary, address global-scope factors focused on the organization of the layers or building blocks within the architecture.
More specifically, within the first group, two different types of approaches have been observed: (i) a set of practices dealing with properties at the convolution kernel level (kernel number decrease, whether this is networkwide [69,95] or more selective [50,89]; size reduction in the spatial dimensions [95], and channel number decrease, applied, again, both indiscriminately to all the layers in the network [53], and particularly to specific stages [86] or layers [90]) and (ii) a second subset of works exploring the use of less expensive operations [49,60,67,82,95,101], focused primarily on finding lighter alternatives to standard convolution operations, such as point-wise convolutions [95], depth-wise separable convolutions [60,67,101], and grouped convolutions [49].
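To make the savings offered by these lighter operations concrete, the following minimal sketch counts the weights of a standard convolution, a depth-wise separable convolution, and a grouped convolution. The function names and the layer dimensions (3 × 3 kernels, 128 input channels, 256 output channels, 4 groups) are illustrative assumptions, not values taken from the reviewed works, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One k x k depth-wise filter per input channel, followed by a
    1 x 1 point-wise convolution that mixes channels (bias ignored)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def grouped_conv_params(k, c_in, c_out, groups):
    """Each output channel only sees c_in / groups input channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return k * k * (c_in // groups) * c_out


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernels, 128 -> 256 channels.
    print("standard:           ", standard_conv_params(3, 128, 256))
    print("depth-wise separable:", depthwise_separable_params(3, 128, 256))
    print("grouped (4 groups): ", grouped_conv_params(3, 128, 256, 4))
```

For this hypothetical layer, the depth-wise separable variant needs roughly 8.7 times fewer weights than the standard convolution, and the grouped variant exactly 4 times fewer.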
With regard to the group dealing with higher-level matters, we find a greater heterogeneity of approaches that includes: (i) the design and integration into an existing architecture of building blocks, such as two-way dense connection modules [86], based on residual shortcut connections, or the so-called Fire modules [95,105], consisting of a compression layer and an expansion layer, jointly capable of reducing the number of parameters of the models produced while preserving accuracy; (ii) the replacement of heavyweight backbones, such as VGG [5], by lighter networks, such as ShuffleNet [49]; (iii) the replacement of costly layers, such as fully connected layers, in which each neuron of a given layer is connected to every neuron in the preceding layer, by alternatives with fewer connections and weights, such as convolution layers [95]; and (iv) shallow layer subsampling (reduction of spatial dimensions) [60], by using convolution layers with a stride value greater than 1 instead of pooling layers, which result in a higher loss of detailed information. Finally, this group also includes more simplistic techniques, such as reducing the input dimensions of the network [50,90] or eliminating some of its layers [76,79,82].
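As a rough illustration of why a compression layer followed by an expansion layer saves parameters, the sketch below counts the weights of a Fire-style module and of a plain 3 × 3 convolution with the same input and output widths. It assumes the layout published for SqueezeNet (a 1 × 1 squeeze layer feeding parallel 1 × 1 and 3 × 3 expand branches whose outputs are concatenated), which may differ from the exact variants used in the reviewed works; the channel sizes are hypothetical and biases are ignored.

```python
def fire_module_params(c_in, squeeze, expand1x1, expand3x3):
    """Weights of a Fire-style module (SqueezeNet layout assumed)."""
    squeeze_weights = c_in * squeeze              # 1 x 1 compression layer
    expand_weights = (squeeze * expand1x1         # 1 x 1 expand branch
                      + 9 * squeeze * expand3x3)  # 3 x 3 expand branch
    return squeeze_weights + expand_weights


def plain_conv3x3_params(c_in, c_out):
    """Weights of a single 3 x 3 convolution with the same in/out widths."""
    return 9 * c_in * c_out


if __name__ == "__main__":
    # Hypothetical sizes: 96 input channels, 128 output channels in total.
    fire = fire_module_params(96, squeeze=16, expand1x1=64, expand3x3=64)
    plain = plain_conv3x3_params(96, 64 + 64)
    print(fire, plain, round(plain / fire, 1))
```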
Regarding detection accuracy, an additional fair number of approaches can be observed in the literature analyzed. They are methods and techniques that pursue better object classification performance but, eminently, go after a detection accuracy boost, especially when dealing with multiscale or tiny targets. Thus, two distinct strategies can be distinguished along those lines: broader-scoped CNN-related methods addressing classification-related concerns and object-detection-specific techniques primarily aimed at achieving higher localization accuracy. In particular, the first group encompasses solutions devised to increase the representational capacity of the networks, improving both the network's learning capacity and the accuracy yielded. Although most of the actions reported in that direction, such as increasing networks' depth [84,96,107] and input size [69,102], late subsampling (i.e., reducing deep layers' size to obtain or maintain large receptive fields, thus enabling the coding of a higher volume of information [60,79]), or increasing the number of channels in convolution kernels [102], may seem misplaced within the on-device landscape, their underlying philosophy remains entirely pertinent for the search of lightweight solutions aimed at mitigating accuracy degradation. Furthermore, those methods are presented in the AmI body of works complemented by the exploitation of more versatile techniques, such as the utilization of residual shortcut connections, a building block that has proven to be effective at enabling a better information flow throughout the network, thus mitigating the vanishing gradient problem [49,105], and the exploitation of both convolution kernels [95] and receptive fields [55,59] with multiple sizes, aimed explicitly at achieving higher detection accuracy when targeting multiscale objects.
Target scale variation poses a challenge that commonly requires the fusion of low-level high-resolution features and high-level semantic features to obtain greater semantic richness. The data reported in this respect in Table 3 draw a scenario where the fusion of multiscale feature maps constitutes the dominant approach [53,56,57,62,64,68,82,98,102–105], substantiated in a considerable variety of strategies: channel concatenation [53], which provides a "trainable" way of fusing feature maps from different stages in the network; the exploitation of the pyramidal hierarchy naturally shaped by the characteristic size-descending feature maps in CNNs [102–104]; the use of 1 × 1 convolution kernels right after larger convolutions [56], which is equivalent to a linear combination of the corresponding pixels along the channels in the kernel; and the exploitation of Feature Pyramid Networks (FPN) [68,105], or any other variant, such as Deep Feature Pyramid Network (DFPN) [57], FPNLite [31], or Concatenated Feature Pyramid Network (CFPN) [82], as part of the detection framework, fusing semantic information from multiscale feature maps through an architecture comprising an upstream and a downstream path, the latter being responsible for building higher-resolution feature maps using a semantically rich map as a starting point. Lastly, in addition to the several techniques just mentioned, a secondary approach has been noted in this respect, focused on investigating novel blocks or architectural elements to facilitate information transfer throughout the network underlying the detection framework.
More specifically, the analyzed studies report using the Squeeze-and-Excitation blocks [62], which explicitly model channel relationships and interdependencies and include a form of self-attention on channels, and spatial attention blocks [98], which can emphasize smaller targets' features, thus capturing more important information and filtering out noise.
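The downstream path of a feature pyramid can be sketched in a few lines. The example below is a deliberately simplified, single-channel illustration that assumes nearest-neighbor 2× upsampling and element-wise addition; it omits the 1 × 1 lateral convolutions and the smoothing convolutions used in actual FPN implementations, and the function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_top_down(coarse, fine):
    """Add an upsampled, semantically rich map to a higher-resolution one."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the coarse size")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


if __name__ == "__main__":
    coarse = [[1, 2]]                      # deep, low-resolution map
    fine = [[0, 0, 0, 0], [1, 1, 1, 1]]    # shallow, high-resolution map
    print(fuse_top_down(coarse, fine))
```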

Model Compression and Acceleration
While architectural-tweak-based techniques and methods for designing more compact object detection frameworks have been widely studied in the on-device literature, they do not always lead to solutions suitable for austere environments. Depending on how those techniques perform, but also on the target hardware platform's capabilities, they might produce models too costly to be deployed in such environments. In this context, CNN compression has proven to be an effective approach orthogonal to lightweight CNN architecture design, accelerating and reducing the size of already existing networks. This approach, shown in Table 3 as the one with the second-highest number of occurrences, encompasses three distinct observed techniques: pruning [50,54,55,70,90,93,99,101], quantization [48,87,99], and knowledge distillation [64].
Pruning consists of a typically incremental process that compresses a trained model by eliminating redundant or non-informative elements of the network, these being either individual neural connections or larger structures within the network itself. It commonly entails three distinct stages: evaluating how important the parameters considered are, trimming the less relevant ones, and finally, fine-tuning the resulting model to recover part of the accuracy lost. Although this pipeline remains unchanged in the reviewed investigations, it is possible to observe in those studies two different pruning modalities: unstructured [99] and structured [50,54,55,70,90,93,101]. Unstructured pruning eliminates connections with lower-valued weights. Although conceptually intuitive, it results in network architectures that present an irregular structure, potentially unsuitable for exploitation in practical applications. Structured pruning, on the other hand, is capable of producing more regular network structures as output, resulting in a structured sparsity that is particularly beneficial for saving computational resources. In this modality, in turn, two types of methods are observed in the collection of papers reviewed: kernel pruning [50,54,70,93] and channel pruning [55,90,101]. Particularly concerning kernel pruning, three different techniques have been identified. Two of them, motivated by the intuition that kernels with lower-valued weights tend to produce feature maps with weaker activations, assess the relevance of every kernel present in each layer by computing the sum of the absolute values of its weights [50,54], while the third one leverages a kernel clustering algorithm to get rid of the kernels in the network that extract similar features [93]. Channel pruning [55,90,101], for its part, is, as its name suggests, an approach that reduces the number of channels in convolutional kernels, thus accelerating convolutions.
Each channel is first assigned a scale factor representing its relevance, and the network is then trained in a sparse manner so that channels with weight values below a given threshold can be deleted.
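The magnitude-based criterion used for kernel pruning can be illustrated as follows. This is a minimal sketch that ranks the kernels of a single layer by the sum of the absolute values of their weights and keeps a fixed fraction of them; the `keep_ratio` parameter, the function names, and the toy weights are assumptions made for the example, and the subsequent fine-tuning stage is not shown.

```python
def l1_norm(kernel):
    """Sum of absolute weights of a kernel given as (nested) lists."""
    if isinstance(kernel, (int, float)):
        return abs(kernel)
    return sum(l1_norm(part) for part in kernel)


def prune_kernels(kernels, keep_ratio):
    """Keep the kernels with the largest L1 norm.

    Returns the indices of the surviving kernels (in their original
    order) together with the surviving kernels themselves."""
    n_keep = max(1, int(len(kernels) * keep_ratio))
    ranked = sorted(range(len(kernels)),
                    key=lambda i: l1_norm(kernels[i]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [kernels[i] for i in kept]


if __name__ == "__main__":
    # Four hypothetical 2 x 2 single-channel kernels.
    layer = [
        [[0.1, -0.1], [0.0, 0.2]],
        [[1.0, -2.0], [0.5, 0.5]],
        [[0.01, 0.0], [0.0, 0.0]],
        [[-0.9, 0.9], [0.3, -0.3]],
    ]
    kept, _ = prune_kernels(layer, keep_ratio=0.5)
    print("kernels kept:", kept)
```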
The second main compression technique identified, data quantization, reduces the number of bits used to represent network parameters, reducing the size and computational complexity of the model, but resulting in a significant accuracy loss. In particular, 8-bit quantization, a standard practice in on-device ML, is exploited in three of the AmI research works analyzed [48,87,99]. Parameter conversion from 32-bit floating-point precision to 8-bit precision is shown to be an effective mechanism for compressing DL models, as well as a necessary one to obtain an implementation suitable for modern hardware accelerators, such as DSPs [48]. Specifically, in [87], quantization is applied to both kernel weights and data in the activation functions, while in [48] the authors report the application of quantization directly on the pre-trained model.
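The conversion from floating-point to 8-bit values can be sketched with a simple affine (scale and zero-point) scheme. This is one common post-training formulation, assumed here for illustration only; the reviewed works may rely on different schemes, and the function names and sample weights are hypothetical.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 inside the range
    scale = (hi - lo) / 255.0 or 1.0      # avoid a zero scale
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point))
                 for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]   # hypothetical 32-bit weights
    q, scale, zp = quantize_uint8(weights)
    print(q, dequantize(q, scale, zp))
```

Each stored weight shrinks from 32 to 8 bits, a fourfold reduction, while the reconstruction error of each value stays within about one quantization step.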
Finally, while both parameter pruning and quantization address the inherent parameter redundancy in DL models by deleting the less critical or relevant parameters, knowledge distillation is a training strategy able to generate compact neural networks that are nevertheless capable of producing an output similar to the one provided by more complex networks. This latter approach, represented in Table 3 by a single paper [64], goes beyond a mere compression technique in the strict sense of the word, producing lightweight models with better performance. Instead of training a compact network from scratch, it relies on a refinement strategy in which a lightweight model, namely the "student", is trained under the guidance of a more complex and sophisticated "teacher" model, resulting in a transfer of knowledge between the two through the matching of certain statistics.
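One widely used way of matching teacher and student statistics is to compare their temperature-softened class distributions. The sketch below assumes this soft-target formulation (a Kullback-Leibler divergence scaled by the squared temperature); it is not necessarily the variant adopted in [64], and the temperature value, function names, and logits are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0.0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]   # hypothetical class scores
    student = [1.0, 1.0, -0.5]
    print(distillation_loss(student, teacher))
```

In practice this term is usually combined with the ordinary supervised loss on the ground-truth labels.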

Improved Ad Hoc Datasets
Datasets represent a key factor when conceiving DL-based detection frameworks. The exploitation of datasets is an unavoidable consideration for model training and evaluation, as well as for comparing different algorithmic alternatives. Indeed, the access to massive collections of annotated images and videos through the Internet has certainly played a significant role in the development of those models. The widespread availability of the so-called big data and the ubiquitous nature of modern information access technologies have enabled the compilation of extensive datasets, such as the well-known Pascal VOC [115] or Microsoft COCO [126], representative of a fair number and diversity of objects, boosting object detection in the direction of increasingly complex problems and more sophisticated solutions.
Such a scenario responds, however, only to general-purpose object detection. In domains such as AmI, where object detection is applied to specific problems, the reality differs significantly from the one described in the previous paragraph. Overall, mainstream datasets are unable to capture the singularities of the distinct class instances considered for particular use cases (especially when they are very numerous or similar), nor can they capture the context or environment where those instances are commonly found. Particularly in AmI, as seen in Section 2, there is a fair number of paradigmatic use cases that have been extensively studied and for which, consequently, well-known and widely tested public databases can be found, e.g., KITTI [127] for autonomous driving systems, VisDrone [128] for object detection on images taken from UAVs, or WIDER FACE [129] for face detection. However, they do not cover in any case the broad spectrum of existing related applications in the field, and overall, they lack proper size and quality, especially when compared to general-purpose detection benchmarks.
The study revealed two different approaches devised in response to the noted data-related shortcomings: the creation of ad hoc datasets for the problem addressed [50,64,68,77–80,84,86,91,95,98,100,103,109], and data augmentation [48,50,52,53,57,60,61,64,70,71,73,74,76,81,84,85,87,97,98,100,106]. As shown in Table 3, both solutions have been widely adopted by the authors, a fact quite surprising considering the substantial effort typically required to address the various tasks involved in creating new datasets (image and video acquisition and homogenization, but fundamentally, the creation of annotations). In particular, the creation of customized datasets is featured in 15 of the 62 works reviewed, proving to be highly effective in dealing with the variations among the different objects within the same class and the changing environmental conditions that might affect their visual appearance. Data augmentation, for its part, is reported in 22 studies, supplementing datasets by directly increasing their number of samples and thus mitigating classical problems in DL, such as model overfitting and class imbalance.
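In object detection, augmenting an image also requires transforming its annotations consistently. As a minimal sketch, the example below applies a horizontal flip to an image stored as a list of pixel rows and mirrors its bounding boxes accordingly. The box format `(x_min, y_min, x_max, y_max)` in pixel-edge coordinates and the function name are assumptions made for the example; the reviewed works use a variety of augmentation techniques that are not detailed here.

```python
def horizontal_flip(image, boxes):
    """Flip an image left-right and mirror its bounding boxes.

    image: list of rows, each row a list of pixel values.
    boxes: list of (x_min, y_min, x_max, y_max) in pixel-edge coordinates,
           so that 0 <= x_min < x_max <= image width.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    mirrored = [(width - x_max, y_min, width - x_min, y_max)
                for (x_min, y_min, x_max, y_max) in boxes]
    return flipped, mirrored


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8]]
    boxes = [(0, 0, 1, 2)]          # object occupying the left-most column
    print(horizontal_flip(image, boxes))
```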

Hyperparameter Tuning
This fourth and last approach encompasses a set of methods that, although barely observed in the body of works studied, have traditionally had a significant impact on ML- and DL-based solutions, pursuing the proper parametrization of certain attributes of either the underlying network architectures or the specific algorithms exploited for training in order to yield a more comprehensive performance improvement of the models produced. Particularly in AmI, the literature reports in this regard, on the one hand, two different approaches that involve the configuration of the anchor boxes used for the prediction tasks performed by object detection frameworks [60,61,79,95] and, on the other hand, a better-represented approach that explores more appropriate or representative loss functions [53,60,64,92,95,105,107].
Anchor boxes or "priors" represent a concept intrinsic to modern CNN-based object detection frameworks. For every image provided as input, detectors typically produce a broad sample of anchor boxes, adopted as candidate regions that the algorithm examines to determine whether or not they contain any of the objects of interest, with this decision being made according to the overlap between the anchor boxes and the ground-truth boxes (a metric known as Intersection over Union, IoU). A candidate is assumed to be positive when the computed overlap value is greater than a predefined threshold, triggering the refinement of the candidate region boundaries to fit the enclosed object better and thus provide a more accurate prediction of the output bounding box. In that sense, although each feature map in the underlying CNN is likely to have several candidate regions, potentially only a small percentage of them actually contain objects, leading to a clear imbalance between positive and negative samples and, consequently, to a waste of resources. Anchor box filtering [79] addresses this problem by using an object-priority-based labeling mechanism to filter out a significant share of the negative samples. Hence, it reduces the object search space considerably, also mitigating the detector's computational overhead. Furthermore, anchor boxes' dimensions and aspect ratio largely determine the size and shape of the objects a detector is able to detect, to the point that if the priors do not match the objects in the dataset used (for instance, because they are tiny or irregularly shaped), the detector might miss them.
Motivated by this issue, two research works exploring different anchor box scaling mechanisms were identified in the analysis of on-device detectors within AmI: [59], which proposes a multiscale-box-based strategy for predicting targets with different scales, and [95], which explores the utilization of small anchors to detect small and partially occluded targets.
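The overlap test that decides whether an anchor box is treated as a positive sample can be written compactly. The following sketch computes the IoU of two axis-aligned boxes and labels a list of anchors against the ground truth; the `(x_min, y_min, x_max, y_max)` box format, the 0.5 threshold, and the function names are assumptions for illustration rather than settings taken from the reviewed detectors.

```python
def box_area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0.0 else 0.0


def label_anchors(anchors, ground_truth, threshold=0.5):
    """Mark an anchor as positive if it overlaps any ground-truth box
    with an IoU greater than the threshold."""
    return [any(iou(anchor, gt) > threshold for gt in ground_truth)
            for anchor in anchors]


if __name__ == "__main__":
    anchors = [(0, 0, 2, 2), (10, 10, 12, 12), (0, 0, 2, 3)]
    ground_truth = [(0, 0, 2, 2)]
    print(label_anchors(anchors, ground_truth))
```

In this toy case only the anchors near the object are positive, which mirrors the imbalance between positive and negative samples noted above.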
Lastly, with regard to the loss function, which drives the learning process of deep neural networks (DNN) and, therefore, of CNN-based object detection frameworks, it is a well-known element that has shown a significant impact on the accuracy of the predictions produced by such algorithmic solutions. In particular, when it comes to the object detection frameworks studied, several enhancement strategies have been observed on two different types of loss: the classification loss, necessary to train detectors on object class prediction tasks, and the localization loss, required to do the same with bounding box regression. It is precisely concerning the latter, the localization loss, that we find various specific optimization techniques within the AmI space. Three of them pursue better accuracy by exploring loss terms more representative of the particular problems addressed [92,105,107]. Both in [92] and [105], unlike the deterministic-regression-based approach usually employed in mainstream object detection frameworks, the authors opt to introduce uncertainty in localization prediction by modeling bounding box vertices using Gaussian parameters, while in [107] they explore an alternative metric to the standard IoU that better represents the overlap between two objects. Furthermore, accuracy boost aside, a more suitable definition of the loss function has proved to mitigate the class imbalance problem effectively. In particular, representing this line of action, Table 3 includes three works [53,60,64] that incorporate a focal loss factor into the loss function, thereby automatically down-weighting simple samples and thus increasing the relative weight of complex samples.
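The effect of a focal loss factor on easy and hard samples can be seen in a small numerical sketch. The binary formulation below, with a focusing parameter `gamma` and a balancing weight `alpha`, follows the commonly published definition of focal loss; the default values and sample probabilities are illustrative assumptions and may differ from those used in [53,60,64].

```python
import math


def focal_loss(prob, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    prob:  predicted probability of the positive class.
    label: 1 for a positive sample, 0 for a negative one.
    """
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)       # guard against log(0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print("easy positive:", focal_loss(0.95, 1))
    print("hard positive:", focal_loss(0.10, 1))
```

With `gamma` set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy.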

Study and Evaluation
To conclude the present review of research on the design of on-device object detection solutions in the context of AmI, it is necessary to provide a brief overview of the most significant aspects involved in the evaluation and comparison of such techniques. The study and evaluation of the models, in the same way as, for example, the design, coding, tuning, or training of the networks, constitute an unavoidable step when implementing a DL or ML solution. Creating a robust and reliable test setup is a fundamental task, not only to ensure the adequate performance of the detection framework conceived but also to guide the iterative refinement or improvement typically involved in the design of such complex models. It is, therefore, necessary to present, even if only briefly, the main factors or elements to consider both when creating a test setup and when analyzing the results obtained (besides the detector or detectors of interest and the various datasets selected for benchmarking). Accordingly, Table 4 collects the hardware platforms commonly used in the considered context of the study, a fundamental decision when assessing the performance of the solutions in real-world scenarios with memory and computational constraints, together with the main aspects or values representing detection frameworks' performance, as well as the most widely used associated metrics.
Table 4. Summary of evaluation frameworks used for assessing on-device object detection models' performance in the reviewed studies.

Work | Test Device | Hardware Acceleration | Accuracy | Speed | Model Size | Computational Complexity | Real Time

Hardware Platforms
In the last few years, a plethora of new cutting-edge mobile and embedded devices have appeared on the market as potential AI-supporting hardware platforms suited for the implementation and continuous improvement of on-device DL models due to their ever-increasing computational power and modest energy consumption compared to the more traditional server or desktop alternatives. In particular, mobile SoCs (System on a Chip), adopted in today's mass-produced embedded devices, such as mainstream single-board computers, have been and still are one of the main drivers of the recent on-device trend. Their physical design, based on the tight integration of computing, memory, and communication components in a single integrated circuit, not only optimizes the internal communication among components and maximizes energy efficiency but also minimizes the chips' waste heat, as well as their die area, enabling the miniaturization of the devices built on them. Table 4, in particular the field tagged as "Test device", lists the devices used in the AmI research analyzed for evaluation purposes. It is possible to classify those devices into three different groups: high-performance single-board computers [49,52–54,56,57,65,66,70,71,75–77,82,83,89,91–94,103,105,106,108], low-power single-board computers [60,67,69,72,75,78,95,100,103], and mobile devices [63,68,73,74,80,87,99]. This categorization, however, does not contemplate papers where the hardware setup used for model deployment is not properly detailed or not detailed at all [48,51,55,64,81,84,88,90,96–98,101,102,104,107], nor the ones that report using desktop systems for that purpose [58,59,61,62,79,85,86,109].
Moreover, in addition to those omissions, two isolated studies are also excluded from the previous classification because their authors propose detectors specially designed for specific-purpose hardware: humanoid robots programmed to play soccer [50], and a mixed reality headset used to overlay virtual information on the real world [110].
The group of high-performance single-board computers includes low-energy embedded devices designed specifically for accelerating ML applications through a dedicated built-in processor. As the data presented in Table 4 reflect, this category is monopolized by Nvidia's Jetson family. Up to 23 of the 47 papers detailing the evaluation hardware setup report using at least one device belonging to the Jetson hardware platform for AI processing on the edge, the latter being a distributed computation paradigm based on an intermediate layer of resource-constrained devices located physically closer to the data source, to which computation is partially offloaded, thus mitigating latency and reducing response times compared to the cloud alternative. In particular, the specific devices observed in this respect, listed in ascending order according to their computational power, are: Nvidia Jetson Nano [65,91,93,103], Nvidia Jetson TX1 [52,70], Nvidia Jetson TX2 [49,53,54,56,57,66,71,76,77,82,83,89,93,94], and Nvidia Jetson AGX Xavier [92,103,105,106]. All of them are based on CPUs with ARM architecture, ranging from four cores in the Jetson Nano and Jetson TX series to eight cores in the Nvidia Jetson AGX Xavier. However, their actual AI processing capability lies in the powerful GPUs they embed, delivering performance equivalent to desktop graphics chipsets of recent years. Their physical design, based on a fair number of cores (128 in the Jetson Nano, 256 in the Jetson TX1 and Jetson TX2, and 512 in the Jetson AGX Xavier), makes it possible to parallelize repetitive operations and thus perform matrix computation more efficiently than general-purpose CPUs. They present, however, more limited memory than their desktop counterparts, insufficient for unrolling and converting convolution operations into matrix operations.
For that reason, as a rule of thumb, models trained using desktop GPUs must be converted into models suitable for mobile GPUs, transforming matrix multiplications into dot product operations.
Still, within the high-performance single-board computers group, two alternatives to the Nvidia devices have been observed: the RK3399 pro [108] and Up Squared [75] boards. Although more modest than the Jetson series as far as CPU and GPU are concerned, the RK3399 pro integrates, unlike them, a Neural Processing Unit (NPU) as the primary AI inference acceleration component. Those NPUs, as we will see for other devices, can present different commercial names depending on the manufacturer, such as Tensor Processing Unit (TPU), Neural Network Processor (NNP), Intelligence Processing Unit (IPU), or Vision Processing Unit (VPU). They all refer to specialized circuits that implement all the control logic and arithmetic necessary to execute ML algorithms. In the particular case of RK3399 pro's SoC, the CPU embeds an NPU with enough power to support both 8-bit and 16-bit data, yielding a computing performance of up to 3.0 TOPS (Tera Operations per Second) with a power consumption that, according to the manufacturer, is only 1% of that of the GPU. For its part, the Up Squared board is the only device with an x64 architecture among all the ones identified in the review, standing out because it integrates an FPGA, an acceleration solution that, although less costly than NPUs or GPUs, has shown more potential than the latter in DL tasks, with higher flexibility to operate with different types of data (binary, ternary, and even customized) and a strong capacity to deal with the irregular parallelism that usually characterizes sparse DNN-based algorithms.
In the second category of hardware platforms exploited for evaluation, we find single-board embedded devices again, but, in this case, governed by more modest SoCs, embedding general-purpose mobile CPUs and GPUs with significantly lower cost and power consumption. Raspberry Pi (in its different versions) [60,67,69,72,75,78,100,103], Odroid XU4 [69], and the standard version of the RK3399 board [95], the three representative devices observed in the AmI literature analyzed, show, in general terms, limited hardware capabilities, especially if we compare them with the typically demanding requirements of DL tasks, yielding a performance far from the real-time operation pursued. Hence, with the end goal of compensating to some extent for such shortcomings, some plug-and-play ML-accelerating devices have recently emerged on the market, which, typically connected to the host device through a USB port, act as a dedicated coprocessor, enabling high-speed ML inference on a wide range of conventional systems. The Intel Neural Compute Stick, in its two versions, is shown in Table 4 as the most popular alternative [75,100,103]. Built on Intel's Movidius Myriad X VPU, it incorporates the inference capability of 16 programmable SHAVE cores and a dedicated neural computation engine into the end device. Still, if we take as a reference the data collected in Table 4, it can be stated that its use remains marginal, and its impact on the final performance obtained is not yet sufficient.
Finally, the third group includes only mobile devices, in the strict sense of the term. Android systems practically monopolize this last category (there is only one iOS device [99] in the eight papers where smartphones are used for evaluation), with Qualcomm and its Snapdragon SoC [68,74,78,87] being the preferred hardware configuration. Essentially through its 8 series, Snapdragon has not only led the "flagship" smartphone market in recent years but has also been one of the main spearheads of the technological development in the mobile world, progressively incorporating in the last few years new components specialized in AI: from the Qualcomm Hexagon DSP (Digital Signal Processor) introduced in the Snapdragon 835 [68], all the way to the recent and more sophisticated Qualcomm AI Engine embedded in the Snapdragon 865 [87], where the Qualcomm Hexagon Tensor Accelerator DSP, capable of performing 15 trillion operations per second, is jointly exploited with the Adreno GPU and the Kryo CPU cores as a comprehensive acceleration solution. DSPs can perform part of the computations involved in ML and DL processes with high efficiency, alleviating the workload of the other cores and thus reducing power consumption.
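As a rough aid for interpreting the peak throughput figures quoted above (3.0 TOPS for the RK3399 pro NPU and 15 trillion operations per second for the Snapdragon 865 accelerator), the sketch below derives the ideal per-frame latency they would imply. The operation count of the model is a hypothetical value chosen for illustration, and the assumption of full hardware utilization is deliberately optimistic: memory transfers, unsupported layers and scheduling overhead make measured latencies considerably higher.

```python
def ideal_latency_ms(ops_per_inference, peak_tops):
    """Lower bound on latency, assuming 100% utilization of peak throughput."""
    ops_per_second = peak_tops * 1e12
    return 1000.0 * ops_per_inference / ops_per_second


# Hypothetical detector requiring 5.6 billion operations per frame.
HYPOTHETICAL_OPS = 5.6e9

for name, tops in (("RK3399 pro NPU", 3.0), ("Snapdragon 865 accelerator", 15.0)):
    latency = ideal_latency_ms(HYPOTHETICAL_OPS, tops)
    print(f"{name}: at least {latency:.2f} ms per frame")
```

Such a bound is only useful as a sanity check: a reported latency below it would indicate an inconsistency in the operation count or in the hardware specification used.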
Accuracy and speed are closely related to the complexity of the trained model: a superior model representation capacity makes it possible to produce more accurate predictions, but, on the flip side, it generally results in a longer inference time. Such complexity, as done by a fair number of authors, might be intuitively evaluated in terms of model size, reporting the results produced either in the form of the number of parameters [53,54,55,60,61,62,70,71,74,85,90,95,99,100,109], their weight [48,51,54,55,56,60,64,67,70,86,87,98,100,101,102,105,107,109], or their memory footprint [60]. The number of parameters responds to the number of model weights learned during the training process, essentially considering in that respect the parameters in convolution and fully connected layers. Weight, in turn, stands for the size of the parameters counted, while the memory footprint, although sometimes erroneously used as a synonym of the former, corresponds to the estimation of the space in main memory required by the model at runtime, thus constituting the measure that best characterizes the requirements of such nature.
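The difference between the number of parameters and their weight can be illustrated with a short sketch. The layer shapes below are hypothetical, the helper names are not taken from any library, and bias terms are assumed to be present in both layer types. The memory footprint is not computed, since it also depends on the activations and runtime buffers, which this example does not model.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Learnable parameters of a 2D convolution layer with square kernels."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def fc_params(in_features, out_features, bias=True):
    """Learnable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def weight_bytes(num_params, bits_per_param=32):
    """Storage size of the parameters at a given numeric precision."""
    return num_params * bits_per_param // 8


# Hypothetical two-layer model: a 3x3 convolution (3 -> 16 channels)
# followed by a fully connected layer (1024 -> 10 outputs).
total = conv_params(3, 3, 16) + fc_params(1024, 10)
print(total)                    # number of parameters
print(weight_bytes(total, 32))  # size in bytes with 32-bit floats
print(weight_bytes(total, 8))   # size in bytes with 8-bit integers
```

The same parameter count therefore leads to different weights depending on the numeric precision used for storage, which is why both figures are worth reporting.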
Model size, particularly the number of parameters and their weight, is a straightforward metric capable of providing a first intuition of a model's resource requirements at a glance. However, when discussing an algorithmic solution's computational complexity in a more rigorous fashion, it is necessary, especially within the on-device paradigm, to know the way computations are carried out and their cost in terms of demanded resources. In this regard, despite the scarcity shown in the AmI works reviewed, it is possible to observe additional representative metrics, commonly reported in DL studies to supplement the accuracy and inference information provided, denoting a model's computational complexity by the number of operations involved, either directly as MAdds (number of multiplication-addition operations) [61,74], or indirectly as FLOPs (number of floating-point operations) [48,49,50,51,53,55,60,64,71,85,86,90,92,105,109].
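As an illustration of how such operation counts are obtained, the sketch below computes the MAdds of a single standard convolution layer and converts them to floating-point operations. The layer dimensions are hypothetical, bias additions are ignored, and counting each multiply-add as two floating-point operations is a common convention assumed here rather than one stated by the works reviewed; some authors report both quantities interchangeably.

```python
def conv_madds(kernel_size, in_channels, out_channels, out_height, out_width):
    """Multiply-add operations of a standard 2D convolution (bias ignored).

    Each output element needs kernel_size^2 * in_channels multiply-adds.
    """
    per_output = kernel_size * kernel_size * in_channels
    return per_output * out_channels * out_height * out_width


def madds_to_flops(madds):
    """Assumed convention: one multiply-add counts as two operations."""
    return 2 * madds


# Hypothetical layer: 3x3 kernels, 32 -> 64 channels, 56x56 output map.
madds = conv_madds(3, 32, 64, 56, 56)
print(madds)                  # 57802752
print(madds_to_flops(madds))  # 115605504
```

Summing this quantity over all layers gives the total count usually reported for a network, which depends on the input resolution as well as on the architecture.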

Conclusions and Future Work
This paper reviews the most relevant recent ambient intelligence research focused on the study and exploitation of lightweight object detection frameworks capable of addressing the inference process locally, thus ensuring more robust data security and better user privacy within intelligent environments. Specifically, the study carried out provides a comprehensive analysis of such frameworks, discussing in an organized and schematic way (i) the application domains where those techniques have proven to be particularly useful, (ii) the various CNN architectures designed explicitly for devising more efficient and compact detection systems, (iii) the challenges associated with the design process of such systems, together with (iv) the different approaches explored in response to them, and lastly, (v) the hardware setups and metrics adopted by researchers and AI practitioners for assessing the solutions proposed.
Ultra-compact detectors, such as Tiny YOLO in its different versions or SSDLite, emerge in the reviewed literature as the most salient on-device options among the several lightweight detectors exploited in AmI. Adopting a unified-detection-pipeline-based architecture or model as a starting point, primarily oriented towards higher inference speed, alongside the integration of a more refined and simplified network model as the backbone, makes it possible to build fast detection systems. Such an approach, however, delivers processing times that fail to achieve real-time performance in many scenarios and, what is even worse, results in a dramatic accuracy loss in comparison with existing state-of-the-art alternatives. In that sense, most of the research efforts observed have been aimed, overall, at finding mechanisms and strategies for a better accuracy-speed trade-off, focusing on accuracy or speed depending on the nature of the architecture or model adopted as a baseline, but also depending on the particular application pursued.
Networks resulting from more aggressive compression and optimization approaches have proven to fall short in use cases where accuracy is crucial (e.g., autonomous driving, and robotic systems for task automation), thus demanding methods capable of producing more expressive CNNs, either through richer feature hierarchies, such as the fusion of features extracted at different levels of the network, deeper network architectures or the exploitation of building blocks, such as the attention mechanism. On the other hand, detection frameworks based on compact networks but more conservative in terms of the accuracy provided typically feature an excessive size and resource consumption, which, while not hindering their deployment in austere systems or execution environments, do weigh down their performance significantly by degrading response speed. Therefore, they require optimization efforts to reduce the size of the models produced, as well as the computational power and memory space required for their execution. In this regard, architecture-tweaking approaches, such as exploiting more efficient convolution operations or decreasing the number of convolution kernels per layer, constitute today's mainstream practices. However, techniques such as pruning and quantization have also proven to be widespread complementary improvement options.
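The effect of the two architecture-tweaking practices mentioned above can be sketched numerically. The example takes the depthwise separable convolution, popularized by the MobileNet family, as one representative of the more efficient convolution operations; that choice, the layer dimensions and the function names are assumptions of this illustration. Bias terms are ignored.

```python
def standard_conv_madds(k, c_in, c_out, h_out, w_out):
    """Multiply-adds of a standard k x k convolution."""
    return k * k * c_in * c_out * h_out * w_out


def depthwise_separable_madds(k, c_in, c_out, h_out, w_out):
    """Multiply-adds of a depthwise k x k convolution plus a 1x1 pointwise one."""
    depthwise = k * k * c_in * h_out * w_out
    pointwise = c_in * c_out * h_out * w_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernels, 32 -> 64 channels, 56x56 output map.
standard = standard_conv_madds(3, 32, 64, 56, 56)
separable = depthwise_separable_madds(3, 32, 64, 56, 56)
print(standard, separable, separable / standard)

# Halving the number of kernels (output channels) halves the cost of the layer.
print(standard_conv_madds(3, 32, 32, 56, 56))
```

For this layer the separable variant needs roughly one eighth of the operations, in line with the general ratio 1/c_out + 1/k², whereas reducing the number of kernels scales the cost linearly. Both savings usually come with some loss of representation capacity, which is the accuracy-speed trade-off discussed above.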
In addition to speed- and accuracy-specific challenges and solutions, both closely related to object detection on low-power end devices, a collection of domain-specific issues arises when developing AmI applications. The reuse of models previously trained on large public datasets leads in the first instance to general-purpose solutions that, overall, do not fit the specificities of the AmI scenarios observed, resulting in less accurate and robust detectors. In this regard, transfer learning has proved to be effective in bridging that gap, successfully tailoring a given model to a particular use case by further training it on a new dataset more representative of the target task. Creating such datasets involves, however, a tedious and costly process, typically based on manual annotation and thus prone to human error. That usually results in datasets that, on the one hand, are not big enough for DL training and, on the other hand, present quality deficiencies, such as class imbalance. Furthermore, although a fair number of authors have focused part of their research efforts on acquiring and annotating images representative of the actual context where the resulting detection systems would be used, many of the datasets produced have reportedly been insufficient to cover the typical highly changing environmental conditions in the wild (e.g., brightness, viewpoint and weather) or to deal with the intra-class variance problem, regardless of whether the latter is due to the very nature of the targets (e.g., different facial expressions in face detection, and different clothing and poses in human detection [109]) or to alterations caused either by the environment itself or by the rest of the entities present in it (e.g., total or partial occlusion).
The selection of a proper lightweight detection framework, jointly with the implementation of improvement tweaks eminently oriented to both the reduction of size and complexity of the underlying CNN and the creation of a more robust and adequately sized dataset, has yielded promising results, as can be seen in the data presented in the column tagged as "Real-time" in Table 4. Nevertheless, although some works confirm the feasibility of real-time detection solutions on mobile and embedded devices in AmI contexts, an overwhelming majority of them either fail to achieve such efficiency or present important gaps in the evaluation report that degrade the robustness of the results. In particular, with respect to the latter point, the table reveals (i) research that did not consider execution speed as an evaluation metric and that only reports accuracy values; (ii) studies that omit information about the materials used for testing the models' performance on inference tasks, reporting in many of those cases only the hardware setups used for training; and (iii) works that report achieving real-time performance, but on high-end GPU-powered desktop systems far from the strong memory and computational limitations that characterize the on-device paradigm.
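A minimal sketch of how inference speed could be measured on the target device itself, so that it can be reported alongside accuracy, is given below. The `infer` callable stands for any detector and is hypothetical, as is the 30 frames-per-second threshold used to decide whether a result counts as real-time; the works reviewed do not necessarily share that threshold.

```python
import time


def measure_latency_ms(infer, inputs, warmup=3):
    """Mean per-input latency in milliseconds of the callable `infer`."""
    for sample in inputs[:warmup]:  # warm-up runs are excluded from timing
        infer(sample)
    start = time.perf_counter()
    for sample in inputs:
        infer(sample)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(inputs)


def frames_per_second(latency_ms):
    """Throughput implied by a per-frame latency."""
    return 1000.0 / latency_ms


def is_real_time(latency_ms, target_fps=30.0):
    """Assumed criterion: real-time means reaching at least `target_fps`."""
    return frames_per_second(latency_ms) >= target_fps


# Placeholder workload standing in for a real detector (hypothetical).
dummy_frames = list(range(20))
latency = measure_latency_ms(lambda frame: sum(range(1000)), dummy_frames)
print(f"{latency:.4f} ms per frame")
```

Reporting the device, the input resolution and the numeric precision together with such a measurement would address the reporting gaps listed in points (i) to (iii).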
Future research should address the deficiencies mentioned above, dealing with the computational complexity of the solutions devised, and also addressing hardware-specific concerns that may affect their final performance, generalizing the study of energy consumption (addressed only occasionally in [53,57,83]), as well as other interesting related matters, such as the proper use of parallelism strategies or how to jointly exploit modern multi-core architectures and AI acceleration hardware. Likewise, in order to overcome data scarcity, it will be necessary to either explore techniques to alleviate or streamline the dataset creation process (e.g., synthetic data generation based on Generative Adversarial Networks [130,131], or image and video acquisition in simulated environments [132]), or devise DL alternatives that demand a smaller volume of data (e.g., the so-called few-shot learning techniques [133,134]). Finally, although the body of works considered in the study represents a broad spectrum of applications within ambient intelligence, it does not cover paradigmatic scenarios in the field, such as workplaces, educational centers, or smart homes. Vision-based methods and techniques, such as those presented, featuring offline solving capabilities, will undoubtedly push forward the implementation of intelligent systems in such contexts, where data security and individuals' privacy are hard requirements. It will, however, be necessary to address the specificities of each of them in order to maximize the performance of the proposed solutions.

Conflicts of Interest:
The authors declare no conflict of interest.