For years, smartphones have increasingly become one of the most indispensable personal devices, allowing people to easily capture every desirable moment with a simple click. According to the Global Digital 2019 report, mobile phone users account for 67% of the global population, i.e., more than two-thirds of all people worldwide [1]. In Japan, statistics from the Ministry of Internal Affairs and Communications show that the smartphone ownership rate was about 60.9% in 2018, and exceeded 90% among people under 40 [2,3]. This ubiquity fuels the use of various Social Networking Services (SNS) such as Facebook and Twitter. In fact, about 3.5 billion people, accounting for 45% of the global population, use SNS [1]. As a result, a social problem arises when people are unintentionally captured in others’ photos that are then published on social networks. More seriously, such photos, along with the photographed person’s identity, can be exploited by photographers for their own purposes, severely violating the facial privacy of the photographed person. Recent advances in computer vision and machine learning make this problem even more serious: these techniques can automatically recognize people with extremely high accuracy, making it possible to search for a specific person in vast image collections [4].
To combat this privacy problem, numerous approaches have been introduced. One straightforward approach is to manually specify the regions containing subjects and obscure their appearance. However, this approach is time-consuming and unsuitable for real-time privacy protection. Existing automatic methods operate either on the photographer’s side or on the photographed person’s side. Methods in the former category leverage computer vision and machine learning techniques to hide the identities of photographed persons and prevent their identification [5]. For example, Google developed cutting-edge face-blurring technology that can blur the identifiable faces of bystanders in images [6]. Other solutions aim to automatically recognize photographed persons in images and obscure their identities [7,8,9]. Unfortunately, these approaches give the photographed person no control over his own privacy protection, since the whole process takes place on the photographer’s side.
Methods in the latter category attempt to proactively prevent photos of the photographed person from being taken. For example, some techniques enforce the photographed person’s privacy preferences as expressed through visual markers [10], hand gestures [11], or offline tags [12] that are visible to everyone. However, privacy preferences may vary widely among individuals and change over time, following patterns that cannot be conveyed by static visual markers [13]. More sophisticated techniques rely on cooperation between photographers and photographed persons, enabling the privacy preferences of photographed people to be detected by nearby photo-taking devices via peer-to-peer short-range wireless communications [14,15,16]. However, this approach requires photographed persons to broadcast their preferences, which raises further privacy concerns. More importantly, it may be ineffective when photographers deliberately switch off the communication function on their devices, or ignore the advertised privacy choices of nearby people, because they are intentionally and secretly taking photos of pre-targeted persons. Indeed, this is a common limitation of most existing studies, which mainly focus on protecting the privacy of the “bystander”, defined as either “a person who is present and observing an event without taking part in” [9], “a person who is unexpectedly framed in” [7], or “a person who is not a subject of the photo and is thus not important for the meaning of the photo, e.g., the person was captured in a photo only because they were in the field of view and was not intentionally captured by the photographer” [9]. In other words, the situation where the photographer intentionally takes photos of targeted persons has not been taken into account.
In this study, we propose an approach that allows the photographed person to proactively detect a situation in which someone is intentionally or unintentionally trying to take photos of him with a mobile phone, without broadcasting his privacy preferences or identifying information. The photographed person can then react appropriately to protect his privacy, for example by leaving the shared space or asking the photographer to stop taking photos. Importantly, to cover as many photo-taking cases as possible, we use the notion of “photographed person” instead of “bystander”. We assume that the photographed person has strong motivation to protect his privacy and is willing to use a wearable camera to monitor the surrounding environment. The behavior of likely photographers is recognized by analyzing their skeleton information obtained from the monitored video. Note that, in this study, we use only a normal camera to evaluate the potential of the proposed approach. In practice, a thermographic camera should be used to hide the facial identities of people captured in the monitored video. However, there is no technical difference between a normal camera and a thermographic camera for detecting the photo-taking behavior of the photographer, since the proposed method uses only the photographer’s skeleton information, which can be obtained precisely from both types of camera. On another front, we argue that misdetection can occur in the presence of behaviors, such as net-surfing on a phone, that resemble photo-taking behavior. Since the arm parts are believed to contribute most to the precise recognition of photo-taking behavior, the analysis focuses only on the skeleton information of the arms, namely segment lengths and angle transitions. In our study, such information is extracted by OpenPose [17] in real time.
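As a rough illustration of this feature-extraction step, the sketch below computes per-frame elbow angles and arm-segment lengths from pose keypoints, assuming the COCO keypoint layout produced by OpenPose (indices 2-4 for the right shoulder, elbow, and wrist, and 5-7 for the left); the function names and the shoulder-width normalization are illustrative assumptions rather than our exact implementation.

```python
# Sketch of arm-feature extraction from pose keypoints (assumed COCO layout).
import numpy as np

# COCO keypoint indices used by OpenPose for the arm joints.
R_SHOULDER, R_ELBOW, R_WRIST = 2, 3, 4
L_SHOULDER, L_ELBOW, L_WRIST = 5, 6, 7

def joint_angle(a, b, c):
    """Angle (radians) at joint b between segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def arm_features(keypoints):
    """keypoints: (18, 2) array of (x, y) positions for one frame.

    Returns, for each arm, the elbow angle and the upper-arm and forearm
    lengths, normalized by shoulder width so the features are roughly
    invariant to the person's distance from the camera (the normalization
    is an assumption, not necessarily the setting used in our experiments).
    """
    shoulder_width = np.linalg.norm(
        keypoints[R_SHOULDER] - keypoints[L_SHOULDER]) + 1e-8
    feats = []
    for s, e, w in ((R_SHOULDER, R_ELBOW, R_WRIST),
                    (L_SHOULDER, L_ELBOW, L_WRIST)):
        feats.append(joint_angle(keypoints[s], keypoints[e], keypoints[w]))
        feats.append(np.linalg.norm(keypoints[s] - keypoints[e]) / shoulder_width)
        feats.append(np.linalg.norm(keypoints[e] - keypoints[w]) / shoulder_width)
    return np.array(feats)  # shape (6,): one angle and two lengths per arm
```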
Afterwards, dynamic programming (DP) matching between the monitored data and reference data is performed to generate monitored DP scores, which are then compared with a pre-determined DP threshold. The comparison result decides whether the input data represents photo-taking behavior. The experimental results demonstrate that the proposed approach achieves an accuracy of 92.5% in recognizing photo-taking behavior.
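The sketch below illustrates the DP matching step under simple assumptions: a classic dynamic-programming alignment, in the style of dynamic time warping, compares the monitored feature sequence against a reference photo-taking sequence, and the length-normalized alignment cost serves as the DP score compared against the threshold; the Euclidean frame cost and the normalization are illustrative choices, not our exact settings.

```python
# Sketch of DP matching between a monitored sequence and a reference sequence.
import numpy as np

def dp_matching_score(monitored, reference):
    """monitored: (T, d) feature sequence; reference: (R, d) sequence.

    D[i, j] is the minimal cumulative distance aligning the first i
    monitored frames with the first j reference frames; the final score
    is normalized by the sequence lengths.
    """
    T, R = len(monitored), len(reference)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, R + 1):
            cost = np.linalg.norm(monitored[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, R] / (T + R)  # length-normalized DP score

def is_photo_taking(monitored, reference, dp_threshold):
    """A low DP score means the monitored motion closely matches the
    reference photo-taking motion; dp_threshold is set empirically."""
    return dp_matching_score(monitored, reference) <= dp_threshold
```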