2.1. Datasets and Definitions
The datasets provided to our group consist of timestamped PR records from several public outdoor WiFi networks in France. The data are not public, and site names have been anonymized. Please see the data availability statement at the end of the article for details.
For the present article, segments of data from three cities and a campground were explored. Data were recorded in 2020 and 2021. Some of the city data were split into two periods according to whether COVID-19 confinement was in effect or not. Overall, six files were created from the raw database, as detailed in Table 1. The cities studied cover a wide range of sizes and populations, which, as will be seen later, allows us to test our approach on PR counts varying over nearly an order of magnitude. The inclusion of the campground dataset, to be discussed in Section 3, which has logistics quite different from those of a city, further helps to ensure the general applicability of the proposed method.
A separate file stores connection detail records of clients who have at some time formally signed in to the WiFi service. In accordance with the GDPR, client MAC addresses were not retained during data acquisition and, in the files provided to us, are replaced by anonymized hash-coded strings. It is important to note that a fixed MAC address will always hash to the same string, while for randomized MACs, the hash code will be different for each PR a client emits.
Although the timestamps on each file record allow us to choose any desired time step, for the purposes of this article, PRs have been aggregated into daily totals. Then, by counting the number of times in a one-day period that a distinct anonymized MAC is seen and comparing it to a threshold, and by taking into account the aforementioned connection details file, we may define a day’s PRs as belonging to one of the three classes below. Since connected clients are separated out first, the classes are mutually exclusive.
Client Randomized, CR
The MAC string is seen from 1 to “threshold” times.
Client Fixed, CF
The MAC string is seen more than “threshold” times.
Client Connected, CC
The MAC string appears in the connection details file.
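The three-way split above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `classify_day`, the flat list of hashed MAC strings (one entry per PR), and the set of connected MACs are hypothetical choices about how the data might be laid out.

```python
from collections import Counter

def classify_day(pr_macs, connected_macs, threshold):
    """Count one day's PRs per class (CC, CR, CF).

    pr_macs        -- anonymized MAC strings, one per PR (assumed layout)
    connected_macs -- set of MAC strings found in the connection details file
    threshold      -- maximum daily sightings for a MAC to count as randomized
    """
    sightings = Counter(pr_macs)          # how often each MAC string is seen
    totals = {"CC": 0, "CR": 0, "CF": 0}
    for mac, n in sightings.items():
        if mac in connected_macs:         # connected clients are separated out first
            totals["CC"] += n
        elif n <= threshold:              # seen from 1 to `threshold` times
            totals["CR"] += n
        else:                             # seen more than `threshold` times
            totals["CF"] += n
    return totals

# Toy data: "a1" is seen 4 times, "b2" once, "c3" twice, "d4" is a connected client.
day = ["a1", "a1", "a1", "a1", "b2", "c3", "c3", "d4", "d4", "d4"]
print(classify_day(day, connected_macs={"d4"}, threshold=2))
```

Because the connected-client test comes first, each MAC string lands in exactly one class, which is what makes the classes mutually exclusive.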
If CR clients’ MAC addresses are truly randomized, every CR MAC address should be unique, and we would have a threshold = 1. An initial study, however, indicated the situation is more complicated. Figure 1 shows the fraction of total PR retained as a function of the threshold for a typical dataset. For thresholds from 30 down to about 4, the percentage of PR retained is almost independent of the threshold, meaning that most MAC addresses are seen many times; this is the CF class. Thresholds of 2 or 3 are somewhat of a gray area. The PRs retained could correspond to CRs whose MAC randomization was not perfect (techniques are, after all, vendor dependent and sometimes based on specific time windows), or they could be individual CFs that contribute little traffic. Most likely, both scenarios contribute to PRs in this region. Too high a threshold presents the danger of “polluting” the CR class with IoT devices, as explained in the next section. In contrast, too low a threshold might unnecessarily reduce the efficiency of CR. As a compromise, a threshold of 2 was chosen to separate the CR and CF classes. Picking a larger threshold has very little impact on the numerical results obtained; however, choosing a threshold of 1, which we believe is too restrictive, reduces the X conversion factors we measure by about 25%, which is a source of uncertainty but does not fundamentally alter the significance of the results that will be presented.
2.2. OUI, IoT, and General Trends
Despite the randomization of the MAC, its first 3 bytes, called the Organizationally Unique Identifier, or OUI, are retained and become part of the PR entry. The OUI is intended to provide information on the manufacturer of the WiFi network card producing the PR, available by interrogating a public OUI web repository. OUI usage, however, is unregulated and voluntary, and in practice, the OUI request will often return NaN, i.e., undefined. In the datasets studied, the percentage of NaN OUI responses was found to depend heavily upon the client class defined in Section 2.1, as illustrated in Figure 2.
Displayed are the percentages of NaN OUI responses for CR, CF, and CC client classes for PR in city 1 over a 6-month period in 2021. The figure shows that PRs of randomized clients, i.e., originating from GDPR-compliant operating systems, also tend to mask device manufacturer information. Fixed client OUIs are most often available, as expected; however, this observation comes with a caveat. Figure 3 shows that OUI responses include a baffling array of manufacturers that are sometimes difficult to identify clearly. A detailed investigation suggests that up to 80% are manufacturers of IoT devices, such as lighting, heating, cameras, etc., which have a network behavior quite distinct from that of true client devices. Connected clients exhibit mixed OUI behavior. In addition, CC is susceptible to containing ES clients, which we will return to in Section 3. These observations provide a strong motivation for basing GDPR-compliant crowd estimation solutions exclusively on randomized MAC clients.
2.3. Description of the Method
Figure 4 portrays the daily randomized MAC PR counts in the three cities over a period of several weeks. The figure shows the weekly periods punctuated by weekends, occasional departures from regularity corresponding to holidays, and some irregular periods. Scale shifts relating to periods of COVID-19 confinement, and apparent small upward drifts, most likely seasonally related, can also be discerned. The key concept of the proposed method is to exploit the observed periodic structure to reveal the underlying statistics of the PR data.
To implement the procedure, we first choose an interval where clean periodicity is apparent in the PR data for a particular site and consider the data as a set of experiments repeated over several weeks. A model of the data is then constructed in the form of a repeated weekly template composed of mean numbers of PRs for each of the 7 days of the week, to which we add a linear trend term to account for possible seasonal drifts. The model parameters are determined by minimizing, with respect to these parameters, the summed squared deviations of the data from the model, i.e., by performing a least-squares fit of the model to the data. An example of the model (red bars) superimposed on raw PR counts (blue curve) for city-C is shown in Figure 5, where the linear trend term is also explicitly indicated by a yellow line. The proposed algorithm, described below, is based on interpreting the squared deviations of the data points from the model, along with some additional assumptions.
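The weekly-template-plus-trend fit described above can be sketched as follows. This is an illustrative sketch only: the function name `fit_weekly_template` and the convention that day 0 of the series defines the first template slot are assumptions, and the fitting code actually used for Figure 5 is not given in the text. The sketch relies on a standard property of least squares with group indicators: the best-fit slope is obtained by centring both the day index and the counts within each day-of-week group.

```python
from statistics import mean

def fit_weekly_template(counts):
    """Least-squares fit of counts[t] ~ template[t % 7] + slope * t.

    Needs at least 8 daily values so that some weekday is observed twice.
    Returns (template, slope, model), where model[t] is the fitted value.
    """
    n = len(counts)
    if n < 8:
        raise ValueError("need at least 8 daily counts")
    num = den = 0.0
    t_bar, y_bar = {}, {}
    for d in range(7):
        idx = [t for t in range(n) if t % 7 == d]   # all days sharing weekday slot d
        t_bar[d] = mean(idx)
        y_bar[d] = mean(counts[t] for t in idx)
        for t in idx:
            # Centre within the weekday group, then accumulate the slope terms.
            num += (t - t_bar[d]) * (counts[t] - y_bar[d])
            den += (t - t_bar[d]) ** 2
    slope = num / den
    template = [y_bar[d] - slope * t_bar[d] for d in range(7)]
    model = [template[t % 7] + slope * t for t in range(n)]
    return template, slope, model

# Synthetic four-week series with a known template and drift (made-up numbers).
true_template = [900.0, 950.0, 940.0, 960.0, 1000.0, 700.0, 600.0]
data = [true_template[t % 7] + 1.5 * t for t in range(28)]
template, slope, model = fit_weekly_template(data)
residuals = [y - m for y, m in zip(data, model)]
```

The `residuals` list holds the deviations of the data from the model, which is the quantity the rest of the method interprets.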
To derive the algorithm, let us consider a site equipped with WiFi access points capable of capturing PRs from telephones carried by visitors to the site, also known as clients. The experiment has a duration T in some units (in this article, we will usually take T equal to 1 day), and the total number of visitors during the period is A. We let $t_b < T$, with b ∈ {1…A}, be the duration of client b’s stay at the site, and $x_b$ the mean number of PR emitted by client b’s telephone during a period of duration T. Indeed, it is well known that telephone PR emission rates vary widely depending on the telephone operating system used and the current state of the device, as discussed for example in [1]. We will then have, for the total number of PR emitted, P:
$$P = \sum_{b=1}^{A} \frac{t_b}{T}\, x_b = \frac{A}{T}\, \langle t\, x \rangle$$
where ⟨ ⟩ denotes the mean of the quantity within the brackets. Since we would expect the duration of a client’s stay at a site to be independent of the PR emission rate of his or her telephone, we may use the law of the mean of the product of two independent distributions to write:
$$P = \langle x \rangle\, \frac{A\, \langle t \rangle}{T} = X\, C$$
where we now implicitly take T to be one day, interpreting X as the mean, over all telephones, of the number of PR emitted by one client over a full day, and C as the number of effective client days in the experiment. For example, C = 2, that is, 2 client-days, could correspond to two clients both present for an entire day, four clients each present for half a day, etc. As another example, if we predict C = 500 clients for one day, of which 250 remain the entire day, while the remaining 250 are progressively replaced by 250 different clients, we still have C = 500, while A = 750.
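The distinction between client days C and distinct clients A in the example above can be checked with a short calculation. The sketch below is illustrative only: the helper name `client_days` is hypothetical, and the "progressive replacement" is modeled, as an assumption, by 500 clients who each stay half a day.

```python
def client_days(stays, period=1.0):
    """Effective client days: total time spent on site divided by the period T."""
    return sum(stays) / period

# 250 clients stay the whole day; 250 "slots" are each filled by two
# half-day clients (an assumed way of modeling progressive replacement).
stays = [1.0] * 250 + [0.5] * 500

print(len(stays), client_days(stays))   # 750 500.0
```

The first number is A, the count of distinct clients, and the second is C, the occupancy expressed in client days.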
One may ask why we choose to predict C in client days instead of A, the actual number of different clients. Indeed, due to MAC randomization, we cannot identify individual clients and thus only have access to a “bulk” figure, such as C. The choice of preferring A or C, however, is application dependent. For personal safety applications, for example, such as the occupation capacity of a building or a site, the essential figure is C, the number of persons present in a certain time window, not the total number of persons who passed through the site during the period. On the other hand, if one is interested in the number of tickets sold at a site, independently of how much time each visitor spends there, A is the more appropriate number. When local ground truth data (for example, gate receipts) are available, methods exist for estimating A from C on average (e.g., [4,7]), but these are not within the scope of this paper.
At this stage, we have a means of estimating the number of client days from the number of probe requests via a simple multiplicative factor X. It now remains to propose a method of obtaining X from the data. The technique proposed is based on the squared deviations of the PR data from the model. Deviations from the mean number of PR for a particular day will have as their source random fluctuations in the numbers of clients, C, of course, but also, conceivably, time-dependent fluctuations in the value of X. To quantify this, we make use of the expression for the variance of the product distribution, in this case, P = XC, which gives:
$$\sigma_P^2 = \sigma_X^2\, \sigma_C^2 + \sigma_X^2\, C^2 + \sigma_C^2\, X^2$$
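where the $\sigma^2$ terms are the variances of the parent and product distributions. As C arises from a counting experiment over a fixed time interval, we expect it to have a Poisson distribution. Since the variance of a Poisson distribution is equal to its mean, we may now write:
$$\sigma_P^2 = \sigma_X^2\, C + \sigma_X^2\, C^2 + C\, X^2$$
$$\sigma_P^2 \approx \sigma_X^2\, C^2 + C\, X^2 \tag{8}$$
$$\frac{\sigma_P^2}{P} \approx X + \frac{\sigma_X^2}{X^2}\, P \tag{9}$$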
where in (8), we have assumed that C ≫ 1, and in (9), we replace C with P/X in order to obtain an equation involving only X and P. The left-hand side of (9) is the Variance to Mean Ratio of P, also known as its Fano factor, which, when greater than one, reveals correlations in the deviations about a mean value, as would be expected for a population of clients each emitting multiple PRs. We see that here it provides a simple measure of the mean conversion factor X from clients to PR, along with an additional term, due to possible variability in the value of X, that may grow with the number of PR. We now proceed to study the application of this equation to our randomized MAC (i.e., CR) PR data.
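A short numerical check can make the large-C approximation concrete. The sketch below compares the exact variance-to-mean ratio of a product of independent variables, with Poisson C, against the approximate form read here as Var(P)/P ≈ X + (σ_X/X)² P. The function names and the numerical values are illustrative assumptions, not quantities taken from the datasets.

```python
def product_variance(x_mean, x_sd, c_mean):
    """Exact variance of P = X*C for independent X and C, with C Poisson."""
    c_var = c_mean                       # Poisson: variance equals mean
    return x_sd**2 * c_var + x_sd**2 * c_mean**2 + c_var * x_mean**2

def fano_approx(x_mean, x_sd, p_mean):
    """Large-C approximation of Var(P)/mean(P): X + (sigma_X / X)^2 * P."""
    return x_mean + (x_sd / x_mean) ** 2 * p_mean

# Illustrative values only: 10% variability in X, 1000 client days.
x_mean, x_sd, c_mean = 500.0, 50.0, 1000.0
p_mean = x_mean * c_mean

exact = product_variance(x_mean, x_sd, c_mean) / p_mean
approx = fano_approx(x_mean, x_sd, p_mean)
print(exact, approx)
```

With these inputs the two ratios come out at about 5505 and 5500. The difference, σ_X²/X, is the term neglected when C ≫ 1. When σ_X is set to zero, the approximate ratio reduces to X alone.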
2.4. Results
Using a model of the type illustrated in Figure 5, we now calculate the variance of P from the scatter of the data points about the constructed template and use it to calculate an X for each site. Since it arises from a variance, the calculation is sensitive to outliers. An outlier cut was designed to handle this problem by combining the X distributions from all sites into a single plot and choosing an appropriate threshold. The resulting distribution is shown in Figure 6. A drawback to this approach is that the X distributions might be different for the different sites; however, as the statistics in our samples are somewhat limited (one measurement per day over a period of a few weeks), the combined distribution seems a more reliable way to proceed. The resulting cut, placed at X = 4000, beyond the body of the distribution and at the beginning of the “tail,” preserves 95% of the data. The figure shows two outliers corresponding to French national holidays, which clearly must be removed from the sample. It also seems likely that there exist less “formal” sources of large correlated client population fluctuations (transit strikes or outages, social movements, etc.) that also contribute to the tail of the distribution.
Removing the outlier cut completely would be unreasonable and would skew the results dramatically. Simply adjusting the position of the cut would introduce a shift in mean X approximately equal to the current cut value, 4000, times the fraction of the data points gained or lost through the shift. As the current cut retains 95% of the data, a moderate change in its position might involve 1% or 2% of the data, thus resulting in a shift of the mean X by 40 to 80, which is comparable to the estimated standard deviations of X at the different sites, as discussed below in the context of Figure 7.
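One plausible reading of the procedure, which the text does not confirm, is that each day contributes its own estimate of X, namely the squared deviation from the model divided by the model value, and that the outlier cut is applied to these daily values before averaging. The sketch below follows that assumption; the function name `site_x_estimate` and the toy numbers are hypothetical.

```python
def site_x_estimate(counts, model, cut=4000.0):
    """Mean of per-day variance-to-mean estimates after an outlier cut.

    Assumption: the daily estimate is (count - model)^2 / model.
    Returns (mean X of retained days, fraction of days retained).
    """
    daily = [(y - m) ** 2 / m for y, m in zip(counts, model)]
    kept = [x for x in daily if x < cut]       # discard values in the tail
    return sum(kept) / len(kept), len(kept) / len(daily)

# Toy example: the last day deviates strongly (e.g., a holiday) and is cut.
counts = [1600.0, 400.0, 1000.0, 3500.0]
model = [1000.0, 1000.0, 1000.0, 1000.0]
print(site_x_estimate(counts, model))          # (240.0, 0.75)
```

Under this reading, moving the cut changes the mean mainly through the few days that cross it, which is consistent with the sensitivity estimate given above.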
The resulting mean X values for the five city datasets are shown as a function of P in Figure 7, with the standard deviation of each measured mean X indicated by an error bar. The figure shows that within the precision of the method, X remains relatively constant over the full range of P values in our datasets, nearly an order of magnitude. This implies that we may, at this stage in our study, ignore the term linear in P in Equation (9). The validations mentioned in Figure 7 will be discussed in Section 3.
As mentioned earlier, X values for the different sites might be expected to differ due to site-dependent Access Point (AP) numbers, AP layouts, and propagation effects; however, at present, the statistical uncertainties appear to mask any such differences. As such, it is useful to combine the data into an average overall value and standard deviation, which is indicated by the dashed line within the yellow band at X = 524 ± 47 in Figure 7.
Once a satisfactory X is obtained, the predicted number of client days for a particular date is simply given by C = P/X. If it is desired to study site occupation over the course of a day, the PR timestamps can be used to partition the number of client days into client hours by dividing X by 24, as shown for a typical day for city-A in Figure 8. The highlighted period 02:00–03:00 in the figure will be used in one of the validations in the following section.
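The conversion from PR counts to occupancy can be sketched as below, using the combined value X = 524 quoted above as a default. The function names and the sample counts are illustrative assumptions; the hourly version simply applies the per-hour factor X/24 to PR counts binned by hour.

```python
def client_days(daily_pr, x=524.0):
    """Predicted client days for one date: C = P / X."""
    return daily_pr / x

def clients_per_hour(hourly_pr, x=524.0):
    """Predicted occupancy for each hourly bin, using X / 24 PRs per client hour."""
    x_hour = x / 24.0
    return [p / x_hour for p in hourly_pr]

print(client_days(52400))                       # 100.0
print(clients_per_hour([200, 400], x=480.0))    # [10.0, 20.0]
```

Since X carries an uncertainty of roughly 9% (47 out of 524), the predicted occupancies inherit a relative uncertainty of about the same size.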