Estimating Vehicle and Pedestrian Activity from Town and City Traffic Cameras

Traffic cameras are a widely available source of open data that offer tremendous value to public authorities by providing real-time statistics to understand and monitor the activity levels of local populations and their responses to policy interventions such as those seen during the COrona VIrus Disease 2019 (COVID-19) pandemic. This paper presents an end-to-end solution based on the Google Cloud Platform with scalable processing capability to deal with large volumes of traffic camera data across the UK in a cost-efficient manner. It describes a deep learning pipeline to detect pedestrians and vehicles and to generate mobility statistics from these. It includes novel methods for data cleaning and post-processing using a Structure SImilarity Measure (SSIM)-based static mask that improves reliability and accuracy in classifying people and vehicles from traffic camera images. The solution resulted in statistics describing trends in the ‘busyness’ of various towns and cities in the UK. We validated time series against Automatic Number Plate Recognition (ANPR) cameras across North East England, showing a close correlation between our statistical output and the ANPR source. Trends were also favorably compared against traffic flow statistics from the UK’s Department of Transport. The results of this work have been adopted as an experimental faster indicator of the impact of COVID-19 on the UK economy and society by the Office for National Statistics (ONS).


Introduction
Today's traffic monitoring systems are highly integrated within urban infrastructures, providing near-ubiquitous coverage in the busiest areas for managing flows, maintaining safety, and performance analysis. Traffic cameras and Closed-Circuit TeleVision (CCTV) networks comprise much of these systems, supplying real-time information primarily intended for use by control room operators.
A major advantage of CCTV networks is their scalability. A monitoring system can range from a single camera to potentially thousands of cameras across a city or region accessible from a unified network. The imagery these cameras provide can be a further source of data when combined with machine learning and object detection. Through these, it is possible to obtain data for the modeling of traffic and pedestrian flows, as part of microand macro-simulation systems, ranging from the discrete movements and interactions of a single entity to overall flow characteristics on a regional scale by simultaneously analyzing conditions across multiple camera sensors, respectively [1][2][3][4]. Recent years have seen considerable research focus on improved machine learning for CCTV applications, and a range of open-source, research, and commercial implementations are now widespread.

•
The data are very timely. Cameras may provide continuous video, or stills updated many times each hour, and these images can be accessed almost immediately following capture. • A wide range of different types of moving objects can be determined, including cars, buses, motorcycles, vans, cyclists, and pedestrians, all of which can provide different useful information from changes in individual behavior to impacts on services and local economies.

•
The source is open. Data are widely accessible to groups with differing interests and reusing a public resource adds value to public investment without incurring additional collection cost.

•
Cameras provide coverage over a range of geographic settings: the center of towns and cities, as well as many areas that show either retail or commuting traffic and pedestrian flows.

•
There is a large network of traffic cameras capturing data simultaneously, allowing the selection of individual cameras for different purposes whilst still retaining acceptable coverage and precision. • Individuals and vehicles are, in almost all cases, not personally identifiable, e.g., number plates and faces, due to the low resolution of the images.
However, the source is also not without data quality problems. In order to generate useful mobility statistics, objects of interest within images need to be automatically detected. These exhibit wide variabilities due to the different styles of vehicles and their sizes in the image, varying by distance from the camera. The quality of individual images can vary substantially due to weather, time of day, camera setting, and image resolution and quality. These factors make robust detection problematic. The images are generated frequently, creating large data volumes, and are dispersed across multiple sites of different camera operators. This complicates their collation and processing into statistics. The geographic coverage of cameras is also biased according to the desire to show traffic flow at strategic sites and the open data policies of different camera operators in different locations.
In this paper, we investigated these issues within the context of a National Statistical Office producing early experimental data to monitor the impact of COVID-19 on the UK economy and society [5]. In Section 2, we describe the source of data used and methods for selecting cameras pertinent to meet our objectives. The system architecture is described in Section 3, comprising a scalable pipeline in the Cloud to process these data sources. This is followed by the description of a deep learning-based machine learning pipeline in Section 4, which allows the detection of pedestrians and vehicles from traffic camera images as a way of measuring the busyness of different towns and cities in the UK. Here, we discuss the comparative analysis on the cost, speed, and accuracy of different machine learning methods to achieve this, and present pre-and post-processing methods to improve the accuracy of results. Section 5 introduces the statistical processing of imputation and seasonal adjustment on the raw time series, and then the raw and processed time series are validated with other sources from ANPR and traffic activities from the Department for Transport (DfT) in Section 6. Discussion and conclusions are presented in Sections 7 and 8, respectively, with plans for further work highlighted. This work has been made available for public authorities and researchers to re-deploy on an open basis.

Background
Extracting information from traffic cameras requires semantic entities to be recognized in the images and videos they capture. This is a popular topic in computer vision, particularly the field of object detection [6][7][8][9][10][11]. Deep learning research has achieved great success in this task through the development of algorithms such as the You Only Look Once (YOLO) series [12] and Faster-Region-based Convolutional Neural Network (Faster-RCNN) [13,14].
Traditional computer vision machine learning techniques have been employed by authors such as Wen et al. [15]. Here, Haar-like features and AdaBoost classifiers are used to achieve accurate vehicle detection. Such techniques were also enhanced to make use of a vehicle indexing and search system [16]. Another well-established technique for accurate vehicle detection takes the combined approach of background extraction, foreground segmentation, and blob classification [17,18]. For even further performance and reliability, Haar-features, histograms of Oriented Gradients (HOG), and blob classification can be combined [19]. Techniques prior to 2011 can be found in a detailed review by [6].
A major benefit of deep learning-based methods over traditional detection techniques is robust performance within challenging monitoring conditions. This compensates for poor lighting conditions [20], or the substandard camera resolution [7,21] used in the YOLOv2 deep learning framework to recognize multiple vehicle categories. In [22], Mask-RCNN was utilized, an extension of Faster-RCNN, to define precise segmentation masks of vehicles within traffic footage. Similarly, authors Hua et al. [23], Tran et al. [24], and Huang [25] all made use of the Faster-RCNN variant, employing resnet101 for the feature extractor. In addition to these, Fan et al. [26] demonstrated the use of Faster-RCNN from a vehicle platform, utilizing the KITTI Benchmark dataset.
Researchers in the UK have applied these models to traffic cameras to see if the data can be used to improve decision-making in COVID-19 policies. The Alan Turing Institute worked with the Greater London Authority (GLA) to assess activity in London before and after the lockdown [27]. Their work is, in part, based on the air pollution estimation project of University College London (UCL) [28], which detects and counts vehicles from JamCam (London) traffic cameras to estimate air pollution using a deep learning model based on YOLOv3. Newcastle University's Urban Observatory [29] has worked on traffic cameras and Automatic Number Plate Recognition (ANPR) to measure traffic and pedestrians in North East England using the Faster-RCNN model [30,31]. Similarly, the Urban Big Data Centre at the University of Glasgow has explored using the spare CCTV capability to monitor activity level during the COVID-19 pandemic on streets in Glasgow [32,33]. A few companies have also undertaken related work using their own networks of cameras. Vivacity Labs [34] demonstrated the use of their commercial cameras and modeling to assess compliance with social distancing. Meanwhile, camera-derived data on retail footfall from Springboard [35] have also been used in ONS as part of experimental data [5] looking at the economic impacts of COVID-19. Further afield, researchers at the University of Cantabria, Spain, have analyzed the impact of COVID-19 on urban mobility by comparing vehicle and pedestrian flow data (pre-and post-lockdown restrictions) from traffic control cameras [36].
Despite not employing traffic cameras in their research, Zambrano-Martinez et al. [37] presented a useful approach in modeling an urban environment based upon busyness metrics such as vehicle loads and journey times. Likewise, Costa et al. [38] did not focus on utilizing existing traffic camera networks, but rather outlined a concept by which existing telecommunication infrastructure can be leveraged to achieve accurate traffic flow modeling and prediction.
There are a number of unresolved issues in realizing the use of open traffic camera imagery for real-world insight. Most notably, there has been substandard sets of quality training data and models when compared against similar computer vision applications within other domains. The versatility of these tools and resources are of critical consideration for scalable applications (where internal and external camera conditions are variable), and for monocular camera setups (where accurate 3D geometry extraction is often either impossible or impractical). Since 2017, the AI City Challenge has gathered researchers from around the globe to address these problems, as well as to solve the complex issues that limit accurate data retrieval within real-world contexts [39]. With each challenge, increasing focus has been given to analyzing multi-camera networks that simulate existing large-scale urban systems. For example, in 2020, participants were encouraged to compete in one or more of the following four categories: Multi-Class Multi-Movement Vehicle Counting, City-Scale Multi-Camera Vehicle Re-Identification, City-Scale Multi-Camera Vehicle Tracking, and Traffic Anomaly Detection [40]. Among successful experiments, object detection models from the YOLO series and Faster-RCNN were utilized for their superior speeds and performance on consumer-grade GPUs. Similarly, challenges have been organized by UA-DETRAC for the AVSS conference 2019, Taipei, Taiwan, whereby participants contributed innovative solutions for improving computer vision vehicle detection and counting performance [41]. Here, the UA-DETRAC Benchmark Suite for multi-object detection and tracking was used exclusively.
The development of benchmark standards is integral for measuring the relative performance of deep learning-aided traffic monitoring applications. Notable benchmarks include the CityFlow Benchmark dataset for multi-camera, multi-target vehicle reidentification, which contains almost 3.5 h of 960p footage, of traffic at highways and intersections within the US [42], and UA-DETRAC Benchmark Suite containing 10 h of footage extracted from 24 locations in Beijing and Tianjin, China [41].
As with traffic monitoring applications, several notable pedestrian detection and tracking benchmarks have been developed. The CrowdHuman benchmark specializes in pedestrian detection within densely populated and highly diverse scenes, making for many occlusion instances [43]. In total, the dataset contains 470,000 identified persons across 24,000 images, for training, testing, and validation purposes. The Multiple Object Tracking (MOT) benchmark extends upon basic detection principles by providing evaluation datasets for multi-target pedestrian tracking [44]. Data are crowdsourced and distributed periodically with each new benchmark challenge. Built upon the CityScapes dataset, the CityPersons benchmark dataset consists of 19,654 unique persons captured across 18 central-European cities [45,46]. Footage was recorded from a vehicle platform under varying weather conditions. The Caltech Pedestrian Detection Benchmark consists of 2300 labeled pedestrians [47]. Dataset footage has been captured from a road vehicle perspective during 'regular' Los Angeles metropolitan traffic conditions. The EuroCity Persons detection benchmark consists of 238,200 persons within 47,300 images from road vehicle perspectives [48]. Data have been collected from across 31 cities in Europe, in all seasonal weather and day/night lighting conditions. One combined traffic and pedestrian monitoring benchmark is KITTI [49]. This dataset was captured using an array of vehicle platform sensors, including a stereo camera rig (for computer vision dataset), as well as a Velodyne Laserscanner and Global Positioning System (GPS) (for spatial and temporal ground truth validations). From these, 3D object detection and tracking could be achieved. In total, the dataset consists of over 200,000 annotations, in some instances with up to 30 pedestrians and 15 cars per image, in urban and rural environments alike from the city of Karlsruhe, Germany.
Previous work left a research gap with a lack of publicly available low-resolution traffic camera imagery and test data sets, so we provide a publicly available dataset of 10,000 traffic images [50] and also aim to release 4000 annotated images.

Data
The UK has thousands of traffic cameras, particularly in major cities and towns; for example, in London alone, there are nearly 1000 accessible cameras (see Figure 1 from the JamCam website). These data are available in the public domain where automatic periodic static images are updated from traffic camera networks across the UK.

Data
The UK has thousands of traffic cameras, particularly in major cities and towns; for example, in London alone, there are nearly 1000 accessible cameras (see Figure 1 from the JamCam website). These data are available in the public domain where automatic periodic static images are updated from traffic camera networks across the UK. Appendix A lists open websites we have reviewed where camera images are publicly available. We selected a subset of locations from that list and built up a profile of historical activity from their cameras. Table 1 lists the sources we selected. These locations give broad coverage across the UK whilst also representing a range of different sized settlements in both urban and rural settings. Appendix A lists open websites we have reviewed where camera images are publicly available. We selected a subset of locations from that list and built up a profile of historical activity from their cameras. Table 1 lists the sources we selected. These locations give broad coverage across the UK whilst also representing a range of different sized settlements in both urban and rural settings.

Camera Selection
In this research, we were interested in measuring changes in the general busyness of places, rather than monitoring traffic flow per se. Not all cameras are useful for measuring this; many of the cameras are focused on major trunk roads or located on country roads without pavements. To select cameras that were representative of the busyness of local populations, we manually tagged cameras if they contained certain scene elements, e.g., pavements and cycle lanes, within the scenes they captured. The elements we tagged are shown in Figure 2. this; many of the cameras are focused on major trunk roads or located on country roads without pavements. To select cameras that were representative of the busyness of local populations, we manually tagged cameras if they contained certain scene elements, e.g., pavements and cycle lanes, within the scenes they captured. The elements we tagged are shown in Figure 2. We scored cameras based on their mix of these scene elements to determine which were more likely to represent 'busyness,' such as street scenes with high footfall potential. The higher the total score, the more likely the cameras were expected to include both pedestrians and vehicles. The scores, determined empirically, were given in Table 2. We used the total scores to decide if a camera was used; example images with total scores from 1 to 14 are presented in Figure 3, noting the increasing likelihood of pedestrians in the image as coverage area becomes more residential or with shops. We scored cameras based on their mix of these scene elements to determine which were more likely to represent 'busyness,' such as street scenes with high footfall potential. The higher the total score, the more likely the cameras were expected to include both pedestrians and vehicles. The scores, determined empirically, were given in Table 2. We used the total scores to decide if a camera was used; example images with total scores from 1 to 14 are presented in Figure 3, noting the increasing likelihood of pedestrians in the image as coverage area becomes more residential or with shops.
We examined the distribution of scores amongst image providers, with total scores of eight or above noted as having an even distribution amongst providers, and likely to contain shops or residential areas (a source of pedestrians). A threshold value of eight was chosen to represent a reasonable likelihood of capturing footfall whilst reducing sample size (such as excluding areas without pavements).
These types of scenes showed a reasonable spread for all image providers, meaning the different geographic areas covered by providers have a reasonably consistent distribution of street scenes. To avoid one area overwhelming the contribution from another, we also opted to keep each geographic area as a separate dataset. We therefore investigated the time series behavior for each area independently, rather than producing an aggregate time series for the UK. How to produce such an output using post-stratification is being considered as part of future work.  We examined the distribution of scores amongst image providers, with total scores of eight or above noted as having an even distribution amongst providers, and likely to contain shops or residential areas (a source of pedestrians). A threshold value of eight was chosen to represent a reasonable likelihood of capturing footfall whilst reducing sample size (such as excluding areas without pavements).
These types of scenes showed a reasonable spread for all image providers, meaning the different geographic areas covered by providers have a reasonably consistent distribution of street scenes. To avoid one area overwhelming the contribution from another, we also opted to keep each geographic area as a separate dataset. We therefore investigated the time series behavior for each area independently, rather than producing an aggregate time series for the UK. How to produce such an output using post-stratification is being considered as part of future work.

Annotation of Images
Annotated data are required to train and evaluate models. For this research, we created a dataset of manually labeled images. We used an instance [51] of the VGG Image

Annotation of Images
Annotated data are required to train and evaluate models. For this research, we created a dataset of manually labeled images. We used an instance [51] of the VGG Image Annotator (VIA) [52] application to label our data, hosted courtesy of the UN Global Platform [53]. We labeled seven categories: car, van, truck, bus, person, cyclist, and motorcyclist. Figure 4 shows an interface of the image annotation application.
Two groups of volunteers from the Office for National Statistics (ONS) helped label approximately 4000 images from London and North East England. One group labeled a randomly selected set of images without consideration of the time of day; the other labeled randomly selected images but with an assigned time of day. Random selection helped to minimize the effect of bias on training and evaluation. However, we also wanted to evaluate the impact of different illumination levels and so purposely stratified sampling by fixed time periods for some images. Annotator (VIA) [52] application to label our data, hosted courtesy of the UN Global Platform [53]. We labeled seven categories: car, van, truck, bus, person, cyclist, and motorcyclist. Figure 4 shows an interface of the image annotation application. Two groups of volunteers from the Office for National Statistics (ONS) helped label approximately 4000 images from London and North East England. One group labeled a randomly selected set of images without consideration of the time of day; the other labeled randomly selected images but with an assigned time of day. Random selection helped to minimize the effect of bias on training and evaluation. However, we also wanted to evaluate the impact of different illumination levels and so purposely stratified sampling by fixed time periods for some images. Table 3 presents counts on the labeled data. The Urban Observatory, Newcastle University provides a well-trained Faster-RCNN model based on 10,000 images from North East England traffic cameras, and the above datasets are thus used to evaluate the models.

Processing Pipeline: Architecture
A processing pipeline was developed to gather images and apply models to detect numbers of pedestrians and vehicles. This architecture can be mapped to a single machine or a cloud system; we opted to use Google Compute Platform (GCP), but other platforms such as Amazon's AWS or Microsoft's Azure would provide relatively equivalent services.  Table 3 presents counts on the labeled data. The Urban Observatory, Newcastle University provides a well-trained Faster-RCNN model based on 10,000 images from North East England traffic cameras, and the above datasets are thus used to evaluate the models. Table 3. Counts of labels generated from two collections of image data (London and North East England) by type of object annotated.

Processing Pipeline: Architecture
A processing pipeline was developed to gather images and apply models to detect numbers of pedestrians and vehicles. This architecture can be mapped to a single machine or a cloud system; we opted to use Google Compute Platform (GCP), but other platforms such as Amazon's AWS or Microsoft's Azure would provide relatively equivalent services.
The solution pipeline was developed in the Cloud, as this removed the need for machine maintenance, and provided capacity for large (managed) data storage and ondemand scalable compute. Remote, distributed working is also supported, which was extremely useful as we developed this during the pandemic and, hence, were all working remotely. Code in the pipeline was run in 'bursts'; as cloud computing is only charged when used, this was a potential saving in compute costs. Our overall architecture is presented in Figure 5, where blue boxes indicate a process and other colors represent storage, the color being used to remind the user that this is the same object being referred to as a dependency.
tremely useful as we developed this during the pandemic and, hence, were all working remotely. Code in the pipeline was run in 'bursts'; as cloud computing is only charged when used, this was a potential saving in compute costs. Our overall architecture is presented in Figure 5, where blue boxes indicate a process and other colors represent storage, the color being used to remind the user that this is the same object being referred to as a dependency. The system depends on a list of available camera image Uniform Resource Locator (URL)s, which are refreshed daily; these camera images are then downloaded every 10 min, and analyzed 20 min later (as we need a before/after image for filtering-see later in Section 4.3 Static Model). The analysis generates object counts that are stored in a database, which is processed weekly to generate a time series and related statistical reports.
Google Cloud Functions [54] were used. These are stand-alone, stateless code that can be called repeatedly without causing corruption-a key consideration to increase the robustness of the functions. A major advantage occurs when we make multiple requests to a cloud function. If it is busy servicing another request and there is sufficient demand, another instance will be spun up automatically to take up the load. These instances are not specifically charged for, as charges are for seconds of compute used and memory allocated. Hence, running 100 instances to perform a task costs little more than running a single instance 100 times.
This on-demand scalability enabled the code for downloading images to take advantage of GCP's internet bandwidth. Over a thousand images could be downloaded in under a minute and the system could easily scale to tens of thousands of images. To cope with a potentially enormous quantity of imagery, cloud storage buckets were used to hold The system depends on a list of available camera image Uniform Resource Locator (URL)s, which are refreshed daily; these camera images are then downloaded every 10 min, and analyzed 20 min later (as we need a before/after image for filtering-see later in Section 4.3 Static Model). The analysis generates object counts that are stored in a database, which is processed weekly to generate a time series and related statistical reports.
Google Cloud Functions [54] were used. These are stand-alone, stateless code that can be called repeatedly without causing corruption-a key consideration to increase the robustness of the functions. A major advantage occurs when we make multiple requests to a cloud function. If it is busy servicing another request and there is sufficient demand, another instance will be spun up automatically to take up the load. These instances are not specifically charged for, as charges are for seconds of compute used and memory allocated. Hence, running 100 instances to perform a task costs little more than running a single instance 100 times.
This on-demand scalability enabled the code for downloading images to take advantage of GCP's internet bandwidth. Over a thousand images could be downloaded in under a minute and the system could easily scale to tens of thousands of images. To cope with a potentially enormous quantity of imagery, cloud storage buckets were used to hold the data-these are named file storage locations within GCP that are secure and allocated to the project. Processing the images through a model to detect cars and pedestrians resulted in counts of different types of detected objects being written into a database. We used a managed database, BigQuery [55], to reduce maintenance. The database was used to share data between the data collection and time series analysis, reducing coupling. The time series analysis was a compute-intensive process taking hours to run due to imputation overhead (described in Section 5). Processing was scheduled daily using cloud functions. The system is highly flexible; it is not restricted to images but is merely a large-scale data processing pipeline where data are fed into filters and then processed by models to eventually generate object counts. For instance, the input data could include audio files where we are looking to detect signals, or text reports where we wish to count key phrases or themes.
It is also cost-efficient. The entire cloud infrastructure to produce weekly time series costs approximately GBP 10/day to run, processing~800 cameras and, hence,~115,000 images per day.

Deep Learning Model
The deep learning pipeline consists of data cleaning, deep learning-based object detection, and a static filter (post-processing). Figure 6 describes the modeling process. The data flow shown in green processes spatial information from a single image. Spatial information refers to information from pixels and their neighbors within the two-dimensional space or a single image (rather than its location in geographic space). Here, each image is processed independently without considering other images captured before or after it. The blue workflow processes both spatial information, of individual images, and temporal information from their most recent neighboring images in time. The data cleaning step uses the temporal information to check if an image is duplicated, dummy, or faulty. As it may be possible to identify number plates or faces on some high-resolution images, the data cleaning also down-samples images from high resolution to low resolution to blur these details. The object detection step uses spatial information to identify semantic objects in a single image using a deep learning model. The static mask is a post-processing step that uses temporal information over a sequence of images to determine whether these objects are static or moving.
used a managed database, BigQuery [55], to reduce maintenance. The database was used to share data between the data collection and time series analysis, reducing coupling. The time series analysis was a compute-intensive process taking hours to run due to imputation overhead (described in Section 5). Processing was scheduled daily using cloud functions. The system is highly flexible; it is not restricted to images but is merely a large-scale data processing pipeline where data are fed into filters and then processed by models to eventually generate object counts. For instance, the input data could include audio files where we are looking to detect signals, or text reports where we wish to count key phrases or themes. It is also cost-efficient. The entire cloud infrastructure to produce weekly time series costs approximately GBP 10/day to run, processing ~800 cameras and, hence, ~115,000 images per day.

Deep Learning Model
The deep learning pipeline consists of data cleaning, deep learning-based object detection, and a static filter (post-processing). Figure 6 describes the modeling process. The data flow shown in green processes spatial information from a single image. Spatial information refers to information from pixels and their neighbors within the two-dimensional space or a single image (rather than its location in geographic space). Here, each image is processed independently without considering other images captured before or after it. The blue workflow processes both spatial information, of individual images, and temporal information from their most recent neighboring images in time. The data cleaning step uses the temporal information to check if an image is duplicated, dummy, or faulty. As it may be possible to identify number plates or faces on some high-resolution images, the data cleaning also down-samples images from high resolution to low resolution to blur these details. The object detection step uses spatial information to identify semantic objects in a single image using a deep learning model. The static mask is a post-processing step that uses temporal information over a sequence of images to determine whether these objects are static or moving.

Data Cleaning
A number of issues were found when using these low-quality camera images. In principle, new images are fetched every 10 min from cameras, for example, via an API or from cloud storage container. However, in practice, images can be duplicated if the hosts have not updated them in time. Cameras can be periodically offline, and when this happens, they display placeholder images. Images can also be partly or completely distorted due to physical problems with the cameras. Some examples of problematic images are

Data Cleaning
A number of issues were found when using these low-quality camera images. In principle, new images are fetched every 10 min from cameras, for example, via an API or from cloud storage container. However, in practice, images can be duplicated if the hosts have not updated them in time. Cameras can be periodically offline, and when this happens, they display placeholder images. Images can also be partly or completely distorted due to physical problems with the cameras. Some examples of problematic images are shown in Figure 7. These types of images will impact counts, so data cleaning is an important preliminary step to improve the quality of statistical outputs.
By examining images from London and North East England traffic, two common characteristics of unusable images were identified: large continuous areas consisting of different levels of pure grey (R=G=B) (Figure 7a-c) and horizontal rows repeated in whole or part of the image (Figure 7d-f). To deal with the first issue, a subsample of the colors in the image was used to detect if a single color covered more than a defined threshold area. We found that sampling every 4th pixel both horizontally and vertically (1/16th of the image) gave sufficient coverage; further down-sampling pixels from 256 levels to 32 levels in each of the red, green, and blue was sensitive enough to correctly detect all nonphotograph images, such as 'Camera 00001.06592 in use keeping London moving' in Figure 7a, where a single color covered >33% of an image. For the partly or completed distorted images, a simple check to determine if consecutive rows contained a near-identical sequence of pixels (allowing for variation due to compression artefacts) was used. If a significant portion of the image contained such identical rows, we marked the image as 'faulty'. shown in Figure 7. These types of images will impact counts, so data cleaning is an important preliminary step to improve the quality of statistical outputs. By examining images from London and North East England traffic, two common characteristics of unusable images were identified: large continuous areas consisting of different levels of pure grey (R=G=B) (Figure 7a-c) and horizontal rows repeated in whole or part of the image (Figure 7d-f). To deal with the first issue, a subsample of the colors in the image was used to detect if a single color covered more than a defined threshold area. We found that sampling every 4th pixel both horizontally and vertically (1/16th of the image) gave sufficient coverage; further down-sampling pixels from 256 levels to 32 levels in each of the red, green, and blue was sensitive enough to correctly detect all nonphotograph images, such as 'Camera 00001.06592 in use keeping London moving' in Figure 7a, where a single color covered >33% of an image. For the partly or completed distorted images, a simple check to determine if consecutive rows contained a near-identical sequence of pixels (allowing for variation due to compression artefacts) was used. If a significant portion of the image contained such identical rows, we marked the image as 'faulty'.
Some local councils have deployed high-definition cameras, and this brings General Data Protection Regulation (GDPR) issues where number plates or faces are potentially identified. Figure 8 shows two modified examples of disclosive number plates captured on CCTV images. To assess the risk of data protection, we recorded a 24 h period of images from all the traffic cameras and manually examined the resolution and risk of recognizing number plates and faces. We observed that the most common image width of these datasets was 440 pixels, and no legible number plates were identified under this width. A width of 440 was thus chosen as our threshold to deem an image as being of 'higher resolution' if exceeded. The down-sampled height depends on the ratio of width before and after (440 pixels). Figure 9 shows some examples of images before and after down-sampling. Some local councils have deployed high-definition cameras, and this brings General Data Protection Regulation (GDPR) issues where number plates or faces are potentially identified. Figure 8 shows two modified examples of disclosive number plates captured on CCTV images. To assess the risk of data protection, we recorded a 24 h period of images from all the traffic cameras and manually examined the resolution and risk of recognizing number plates and faces. We observed that the most common image width of these datasets was 440 pixels, and no legible number plates were identified under this width. A width of 440 was thus chosen as our threshold to deem an image as being of 'higher resolution' if exceeded. The down-sampled height depends on the ratio of width before and after (440 pixels). Figure 9 shows some examples of images before and after down-sampling.
We have evaluated the object counts generated before and after down-sampling across all locations on the 24 h recorded images. Figure 10 shows the object counts in different locations before and after down-sampling during 24 h. Traditional evaluation on object recognition requires a time-costing manual label, as mentioned in Section 2.2. Alternatively, we adopted the Pearson correlation to compare two time series, where 1 means perfectly correlated, 0 means no correlation, and a negative number means opposite correlated. A high Pearson correlation score reflects the trivial influence from the down-sampling operation. In general, cars have a 0.96 correlation, whereas a person and cyclist have a 0.86 correlation, and this shows a significant close correlation before and after down-sampling.   We have evaluated the object counts generated before and after down-sampling across all locations on the 24 h recorded images. Figure 10 shows the object counts in different locations before and after down-sampling during 24 h. Traditional evaluation on object recognition requires a time-costing manual label, as mentioned in Section 2.2. Alternatively, we adopted the Pearson correlation to compare two time series, where 1 means perfectly correlated, 0 means no correlation, and a negative number means opposite correlated. A high Pearson correlation score reflects the trivial influence from the down-sampling operation. In general, cars have a 0.96 correlation, whereas a person and cyclist have a 0.86 correlation, and this shows a significant close correlation before and after down-sampling.
(a) Durham  We have evaluated the object counts generated before and after down-sampling across all locations on the 24 h recorded images. Figure 10 shows the object counts in different locations before and after down-sampling during 24 h. Traditional evaluation on object recognition requires a time-costing manual label, as mentioned in Section 2.2. Alternatively, we adopted the Pearson correlation to compare two time series, where 1 means perfectly correlated, 0 means no correlation, and a negative number means opposite correlated. A high Pearson correlation score reflects the trivial influence from the down-sampling operation. In general, cars have a 0.96 correlation, whereas a person and cyclist have a 0.86 correlation, and this shows a significant close correlation before and after down-sampling.

Deep Learning Model Comparison
Object detection is used to locate and classify semantic objects within images. We compared a number of widely used models including the Google Vision API, Faster-

Deep Learning Model Comparison
Object detection is used to locate and classify semantic objects within images. We compared a number of widely used models including the Google Vision API, Faster-RCNN, and the YOLO series. Google Vision API provides a pre-trained machine learning model to 'assign labels to images and quickly classify them into millions of predefined categories' [56]. It helps with rapidly building proof-of-concept deployments but is costly with slower speed and worse accuracy compared to the other methods. Faster-RCNN was first published in 2017 [14] and is the most widely used version of the Region-based CNN family. It uses a two-stage proposal and refinement of regions followed by classification. The model used in this pipeline was trained on 10,000 camera images from North East England, and the base model and training code are openly available [57,58]. YOLOv5 is the latest version of the YOLO series of models, released in May 2020 [59]. The YOLO architecture performs detection in a single pass with features extracted from cells of an image split into a grid feeding into a regression algorithm predicting objects' bounding boxes. YOLOv5 provides different pre-trained models to balance speed and accuracy. In our experiments, we chose the YOLOv5s and YOLOv5x, where YOLOv5s is the lightest model with the fastest speed but the least accuracy, whilst YOLOv5x is the heaviest model with the slowest speed but the most accurate results. All the models were run on a Google cloud virtual machine using NVIDIA GPU Tesla T4 [60]. Table 4 shows a comparison of processing speed using the above models.  Figure 11 shows some example results from different models where the Google Vision API missed a few objects whilst both YOLOv5x and Faster-RCNN perform better with fewer omissions with differing confidence scores.
A detailed comparison of performance quality is shown in Figure 12. This is based on an evaluation dataset of 1700 images captured from JamCam, London during the daytime detailed in Table 5. As the Google Vision API does not identify any cyclists and YOLOv5 does not provide a category of cyclist or motorcyclist, these classes have been omitted in the comparison. The performance of object detection is normally evaluated using mean Average Precision (mAP) over categories [61]. Both YOLOv5 and the Faster-RCNN achieved better mAP than the Google Vision API did.
Here, 1700 traffic camera images across London (see Table 5) were used to evaluate all the models. As it is important that the model is able to classify objects at different sizes, in relation to the distance a detected object is from the camera in a scene, we calculated the metrics to also consider an object size. This is shown by the horizontal axis, filter pixel size, in Figure 12. Here, zero means that the testing set includes all sizes of objects; 100 means that the testing set excludes any labeled objects smaller than 100 pixels and so on. Figure 12 shows that the bigger the object size, the better the performance. Importantly, it also shows there was a turning point of mAP against object sizes where a model becomes more stable, and this helps guide the camera settings. For example, in Figure 12a, Faster-RCNN became stable when the filter pixel size of vehicles was 500 pixels, while YOLOv5s became stable at 400 pixels. Faster-RCNN achieved the best performance amongst individual models in the category of vehicle, while YOLOv5x achieved the best against other individual models in the category of a person.
on an evaluation dataset of 1700 images captured from JamCam, London during the daytime detailed in Table 5. As the Google Vision API does not identify any cyclists and YOLOv5 does not provide a category of cyclist or motorcyclist, these classes have been omitted in the comparison. The performance of object detection is normally evaluated using mean Average Precision (mAP) over categories [61]. Both YOLOv5 and the Faster-RCNN achieved better mAP than the Google Vision API did.     Another important factor is the cost. Google Vision API is billed per call [62]. The cost of using Faster-RCNN to create the weekly time series was about GBP 10 per day, whilst the Google Vision API charges about GBP 120 per day for the weekly time series. The YOLOv5 costs were similar to Faster-RCNN.
The Faster-RCNN and YOLO series have the code repositories on GitHub and this gives users flexibility to tailor the models on their targeted semantic objects or improve the accuracy through fine-tuning or re-training. However, the deployment and implementation of the Google Vision API is hidden from the end users.  Another important factor is the cost. Google Vision API is billed per call [62]. The cost of using Faster-RCNN to create the weekly time series was about GBP 10 per day, whilst the Google Vision API charges about GBP 120 per day for the weekly time series. The YOLOv5 costs were similar to Faster-RCNN.
The Faster-RCNN and YOLO series have the code repositories on GitHub and this gives users flexibility to tailor the models on their targeted semantic objects or improve the accuracy through fine-tuning or re-training. However, the deployment and implementation of the Google Vision API is hidden from the end users.

Investigation of Ensemble of Models
The above accuracy comparison indicates that the YOLOv5 and the Faster-RCNN models appeared to have particular strengths for different object detection types. Figure 13 shows an example.

Investigation of Ensemble of Models
The above accuracy comparison indicates that the YOLOv5 and the Faster-RCNN models appeared to have particular strengths for different object detection types. Figure  13 shows an example. To exploit these differences and improve accuracy, an ensemble method based on the Faster-RCNN and YOLOv5x models was designed. Ideally, a pool with three or more models can lead to an easy ensemble solution, such as a voting scheme, when conflicts arise from different models. In this case, only two models were available, so this presented a challenge to solve conflicts between two models. Through analyzing these two models' behavior, we proposed a 'max' method that gives priority to the individual results with the maximum confidence scores from different models.
Recall the comparison of individual models in Figure 12 where Faster-RCNN achieved the best mAP on the category of vehicle, while YOLOv5x achieved the best on the category of person. Using the same evaluation dataset (Table 5), a comparison of the ensemble method against Faster-RCNN and YOLOv5x is shown in Figure 14 where the ensemble method achieves a better mAP than the individual models in general in both categories of vehicle and person. The ensemble method can be parallelized and the speed is limited by the slowest model in the pool. To exploit these differences and improve accuracy, an ensemble method based on the Faster-RCNN and YOLOv5x models was designed. Ideally, a pool with three or more models can lead to an easy ensemble solution, such as a voting scheme, when conflicts arise from different models. In this case, only two models were available, so this presented a challenge to solve conflicts between two models. Through analyzing these two models' behavior, we proposed a 'max' method that gives priority to the individual results with the maximum confidence scores from different models.
Recall the comparison of individual models in Figure 12 where Faster-RCNN achieved the best mAP on the category of vehicle, while YOLOv5x achieved the best on the category of person. Using the same evaluation dataset (Table 5), a comparison of the ensemble method against Faster-RCNN and YOLOv5x is shown in Figure 14 where the ensemble method achieves a better mAP than the individual models in general in both categories of vehicle and person. The ensemble method can be parallelized and the speed is limited by the slowest model in the pool.

Static Mask
The object detection process identifies both static and moving objects. As the aim is to detect activity, it is important to filter out static objects using temporal information from successive images as a post-processing operation. The images are usually sampled at 10 min intervals, but cameras may change angles irregularly, so traditional methods for background modeling in continuous video, such as mixture of Gaussians [63], were not suitable for such a long interval and fast camera view switches. The traditional background modeling normally requires a period of continuing video with a fixed view to build up a background image.
A Structure SImilarity Measure (SSIM) [64] approach was adopted to extract the background. SSIM is a perception-based model that perceives changes in structural information. Structural information is the idea that the pixels have strong interdependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene. SSIM is not suitable to build up background/foreground [65] with continuous video, because the changes occur between two sequential frames over a very short period, e.g., 40 ms for 25 frames per second video, and so are too subtle. For a 10 min interval typical of traffic camera images though, SSIM has enough information to perceive visual changes within structural patterns. Figure 14. Comparison of the mAP metric for the ensemble model, and components of Faster-RCNN and YOLOv5x models by minimum label size in pixels (horizontal axis) and label class (chart; above vehicles, below people) showing the superior performance of the ensemble method over its constituents.

Static Mask
The object detection process identifies both static and moving objects. As the aim is to detect activity, it is important to filter out static objects using temporal information from successive images as a post-processing operation. The images are usually sampled at 10 min intervals, but cameras may change angles irregularly, so traditional methods for background modeling in continuous video, such as mixture of Gaussians [63], were not suitable for such a long interval and fast camera view switches. The traditional background modeling normally requires a period of continuing video with a fixed view to build up a background image.
A Structure SImilarity Measure (SSIM) [64] approach was adopted to extract the background. SSIM is a perception-based model that perceives changes in structural information. Structural information is the idea that the pixels have strong interdependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene. SSIM is not suitable to build up background/foreground [65] with continuous video, because the changes occur between two sequential frames over a very short period, e.g., 40 ms for 25 frames per second video, and so are too subtle. For a 10 min interval typical of traffic camera images though, SSIM has enough information to perceive visual changes within structural patterns.
Any pedestrians and vehicles classified during object detection were set as static and removed from the final counts if they also appeared in the background (Figure 15). Figure 15a illustrates an example where parked cars have been identified as static. In Figure 15b, a rubbish bin was misclassified as a pedestrian by the object detection but was filtered out as a static background element. Any pedestrians and vehicles classified during object detection were set as static and removed from the final counts if they also appeared in the background (Figure 15). Figure  15a illustrates an example where parked cars have been identified as static. In Figure 15b, a rubbish bin was misclassified as a pedestrian by the object detection but was filtered out as a static background element.

Imputation
Once the counts of objects were produced, a number of subsequent processing steps were carried out to: (a) compensate for missing or low-quality images, (b) aggregate the cameras to a town or city level, and (c) remove seasonality to facilitate interpretation of the results. These processes are necessary to make the time series comparable over time and to ensure that technical faults, such as camera downtime due to maintenance or nonfunctionality, are not counted as actual drops in activity.
The imputation consisted of identifying missing periods of data from each of the cameras (usually missing due to technical faults), and then replacing the missing counts of each object type by estimated values. This ensures that missing images or outages in the original data do not lead to artificial drops in the estimated level of the time series. The imputation was carried out using seasonal decomposition. First, the seasonality of the series was removed from the series. Secondly, the mean of the series within a specified time window was imputed to the missing periods. Finally, the seasonality was reapplied to the imputed series. This results in a time series that displays the same seasonality in the missing periods as the original series. The imputation was applied using the seasonally decomposed missing value imputation method (seadec) from the ImputeTS package in R [66], and was found to be the most suitable imputation method out of those assessed. Other methods such as imputation by interpolation, the last observation carried forward, and Kalman Smoothing and State Space Models were considered; however, these yielded less satisfactory results and, in the case of imputation by Kalman Smoothing, the compute time was significantly longer and unfeasible for production systems.
Following imputation, the counts of each object type were aggregated to each of the locations of interest. For example, all the cars counted in the city of Reading by the traffic cameras (including imputed counts) were summed to produce the corresponding daily time series for Reading.

Imputation
Once the counts of objects were produced, a number of subsequent processing steps were carried out to: (a) compensate for missing or low-quality images, (b) aggregate the cameras to a town or city level, and (c) remove seasonality to facilitate interpretation of the results. These processes are necessary to make the time series comparable over time and to ensure that technical faults, such as camera downtime due to maintenance or nonfunctionality, are not counted as actual drops in activity.
The imputation consisted of identifying missing periods of data from each of the cameras (usually missing due to technical faults), and then replacing the missing counts of each object type by estimated values. This ensures that missing images or outages in the original data do not lead to artificial drops in the estimated level of the time series. The imputation was carried out using seasonal decomposition. First, the seasonality of the series was removed from the series. Secondly, the mean of the series within a specified time window was imputed to the missing periods. Finally, the seasonality was reapplied to the imputed series. This results in a time series that displays the same seasonality in the missing periods as the original series. The imputation was applied using the seasonally decomposed missing value imputation method (seadec) from the ImputeTS package in R [66], and was found to be the most suitable imputation method out of those assessed. Other methods such as imputation by interpolation, the last observation carried forward, and Kalman Smoothing and State Space Models were considered; however, these yielded less satisfactory results and, in the case of imputation by Kalman Smoothing, the compute time was significantly longer and unfeasible for production systems.
Following imputation, the counts of each object type were aggregated to each of the locations of interest. For example, all the cars counted in the city of Reading by the traffic cameras (including imputed counts) were summed to produce the corresponding daily time series for Reading.

Seasonal Adjustment
Once the series had been aggregated, they were seasonally adjusted to make the interpretation of the results clearer. Seasonal adjustment removes fluctuations due to the time of day or day of the week from the data, making it easier to observe underlying trends in the data. Given that these are higher-frequency time series relative to those traditionally used in official statistics, we applied seasonal adjustment using the TRAMO/SEATS method in JDemetra+, as implemented in the R JDemetra High Frequency package (rjdhf) [67]. The adjustment was applied on both the hourly and daily time series. The presence of different types of outliers was tested during seasonal adjustment to ensure that these do not distort the estimation when it is decomposed into trend, seasonal, and irregular components. Seasonally adjusting the data led to the possibility of achieving negative values for some time series in cases in which the seasonal component was negative and the observed count was very low (e.g., zero). This is a problem often found in time series with strong seasonal movements and, as traffic counts are a nonnegative time series, a transformation was required. A square root transformation was applied to the data, prior to seasonal adjustment, in order to remove the possibility of artificially creating negative counts.

Time Series Validation against ANPR
Although mean average precision and precision-recall measure the models' performance, it was important to check the consistency of time series data produced against other sources. Traffic counts from Automatic Number Plate Recognition (ANPR) data obtained from Newcastle University's Urban Observatory project and Tyne and Wear UTMC were used as ground truths to compare against the raw time series generated from the machine learning pipeline.
ANPR traffic counts are defined on road segments with a pair of cameras, where a vehicle count is registered if the same number plate is recorded at both ends of the segment. This has the effect of not including vehicles that turn off onto a different road before they reach the second camera; such vehicles would appear in traffic camera images and so counts may not be entirely consistent between the two sources if this happens often. To this end, we identified short ANPR segments that also contained traffic cameras. We used the Faster-RCNN model to generate traffic counts from these cameras and compared these against the ANPR data. Figure 16 compares time series from ANPR data and the Faster-RCNN model (with and without the static mask) from 1 March to 31 May 2020. Pearson correlation coefficients showed a significant positive correlation between the raw time series and the ANPR data as high as 0.96. The vehicle count from the pipeline is only calculated every 10 min, whilst ANPR is continuous, so to harmonize the rate at which the two methods capture vehicles, the ANPR count was downscaled by a factor of 20. This assumes a camera image covers a stretch of road that would take 30 s for a car to travel along. As the camera only samples every 10 min (600 s), we assume we have captured 30 s out of the 600 s the ANPR will have recorded, i.e., 1/20th. Figure 17 compares the local correlation coefficients for the time series over a rolling seven-day window. The series is nearly the same for the period of 1 March and 31 May with or without the static mask, though that with the static mask shows less variation. The average of these local Pearson correlations was 0.91 for Faster-RCNN and 0.93 for the Faster-RCNN+static mask.

Comparison with Department for Transport Data
The UK Department for Transport (DfT) produces statistics on road traffic aggregated for Great Britain from 275 automatic traffic count sites [68]. In addition, they provide statistics on cyclists, bus travel, and rail travel. Their vehicle counts are disaggregated into cars, light commercial vehicles, and heavy goods vehicles. These data are not directly comparable to the traffic camera data [5], as the coverage is different (Great Britain-wide as opposed to a selection of towns and cities), but they are useful to assess whether the traffic camera sources can be used to estimate inflection points in traffic. Figures 18 and 19 compare vehicle activity from the traffic camera sources from London and North East England, respectively, with DfT traffic data for Great Britain [69].

Comparison with Department for Transport Data
The UK Department for Transport (DfT) produces statistics on road traffic aggregated for Great Britain from 275 automatic traffic count sites [68]. In addition, they provide statistics on cyclists, bus travel, and rail travel. Their vehicle counts are disaggregated into cars, light commercial vehicles, and heavy goods vehicles. These data are not directly comparable to the traffic camera data [5], as the coverage is different (Great Britain-wide as opposed to a selection of towns and cities), but they are useful to assess whether the traffic camera sources can be used to estimate inflection points in traffic. Figures 18 and 19 compare vehicle activity from the traffic camera sources from London and North East England, respectively, with DfT traffic data for Great Britain [69].

Comparison with Department for Transport Data
The UK Department for Transport (DfT) produces statistics on road traffic aggregated for Great Britain from 275 automatic traffic count sites [68]. In addition, they provide statistics on cyclists, bus travel, and rail travel. Their vehicle counts are disaggregated into cars, light commercial vehicles, and heavy goods vehicles. These data are not directly comparable to the traffic camera data [5], as the coverage is different (Great Britain-wide as opposed to a selection of towns and cities), but they are useful to assess whether the traffic camera sources can be used to estimate inflection points in traffic. Figures 18 and 19 compare vehicle activity from the traffic camera sources from London and North East England, respectively, with DfT traffic data for Great Britain [69].
Considering the differences in coverage, and the location of count sites (highways, urban, and rural) and traffic cameras (solely urban), both sources provide similar inflection points in activity coinciding with the introduction of lockdown measures in late March 2020, during the month of November 2020 and from late December in 2020. Similarly, both sources provide a similar trend of increasing activity when restrictions are gradually lifted, with somewhat differing levels. The traffic cameras we applied cover different areas from DfT data and, moreover, they cover pedestrian counts in commercial or residential areas that the DfT data do not cover.  Considering the differences in coverage, and the location of count sites (highways, urban, and rural) and traffic cameras (solely urban), both sources provide similar inflection points in activity coinciding with the introduction of lockdown measures in late March 2020, during the month of November 2020 and from late December in 2020. Similarly, both sources provide a similar trend of increasing activity when restrictions are  Considering the differences in coverage, and the location of count sites (highways, urban, and rural) and traffic cameras (solely urban), both sources provide similar inflection points in activity coinciding with the introduction of lockdown measures in late March 2020, during the month of November 2020 and from late December in 2020. Similarly, both sources provide a similar trend of increasing activity when restrictions are

Discussion
Traffic camera data can offer tremendous value to public authorities by providing real-time statistics to monitor levels of activity in local populations. This can inform local and national policy interventions such as those seen during the COVID-19 pandemic, especially where regional and local trends may differ widely. This type of traffic camera analysis is best suited at detecting trends in 'busyness' or activity in town and city centers, but is not suited for estimating the overall amount of vehicle or pedestrian movements, or for understanding the journeys that are made given the periodic nature of the images provided. It provides high frequency (hourly) and timely (weekly) data, which can help to detect trends and inflection points in social behavior. As such, it provides local insights that complement other mobility, and supplement other transport, datasets that lack such local and temporal granularity. For example, Google's mobility data [70] are aggregated from Android users who have turned on location history on their smartphones but aggregated by region; Highways England covers motorways and some A-roads; commercial mobile photo cell data are not widely available on a continuous basis and can be expensive. These openly accessible traffic cameras are mainly focused on junctions, and residential and commercial areas, and thereby capture a large proportion of vehicles and pedestrians regardless of journey purpose in near real-time. Moreover, as these sensors are already in place and images are relatively accessible, it provides a scalable and efficient means of garnering statistics.
Realizing the value of the data within these images requires the application of data science, statistical, and data engineering techniques that can produce robust results despite the wide variability in quality and characteristics of the input camera images. Here, we have demonstrated methods that are reliable in achieving a reasonable degree of accuracy and coherence with other sources. However, despite the success of the developed pipeline in providing operational data and as a beacon for future development, there remains a number of challenges to address. Although we have demonstrated outputs from a range of locations across the UK, the coverage of open cameras is not comprehensive, and so the selection has been led by availability and ease of access. Within locations, cameras are sited principally to capture information about traffic congestion; hence, places without cars such as pedestrianized shopping districts are entirely absent in some areas. Validation of the data was successful but reliable metrics on accuracy are problematic as the accuracy of detecting different object types depends on many factors such as occlusion of objects, weather, illumination, choice of machine learning models, training sets, and camera quality and setting. Some factors such as camera settings are also outside of our immediate control. Current social distancing has meant that the density of pedestrians observed is sufficiently low for them to be satisfactorily detected. However, we would not expect the current models to perform so well for very crowded pavements as levels of occlusion, etc. would increase. The camera images are typically produced every 10 min, which limits the scope of the insights that can be produced and, therefore, its use for the management of flows should not be overstated. The data nonetheless have real value in capturing, describing, and expressing the relative change in the busyness in our cities and, therefore, the effectiveness of some policy interventions.

Conclusions
Using publicly available traffic cameras as a source of data to understand changing levels of activity within society offers many benefits. The data are very timely with images being made available in real-time. While we have demonstrated the processing of these in batch mode, there are no technical barriers to more frequent, near real-time updates. Data outputs can also easily be scaled to hourly, daily, and weekly summaries.
The camera as a sensor enables a large array of different objects to be detected: cars, buses, motorcycles, vans, pedestrians, etc. Additional objects, such as e-scooters and ebikes, can be added to the models with further training. These camera systems already exist widely and reusing and repurposing them make further use of existing public resources. There is no additional cost for the deployment and upkeep of expensive infrastructure as their maintenance and upgrades are managed as part of existing agreements for the primary purpose of traffic management and/or public safety. The additional cost for processing this data is also relatively modest for our architecture less than GBP 1000 (USD 1200) per month, which is significantly less than the cost of a single traffic counting device such as a radar-based device.
Whilst not always currently open, camera coverage of town and city centers, which is the locus of much economic activity, is generally comprehensive, and most cities, and many smaller towns, already have cameras in place that can readily provide information on commuting, retail, and pedestrian flows. It is feasible to carry this out on a national scale with respect to cost, coverage, and digital infrastructure. There are particular benefits of providing regional updates to national government when there are distinct differences in policies (e.g., localized lockdown regimes).
The data used here are in the public domain and uses low-resolution imagery from which it is difficult or impossible to personally identify individuals. The use of timeseparated still images means that tracking or other privacy-invasive practices are largely impossible. The processing pipeline design ensures that summaries and counts are the only things that are stored, and images, once processed, can be deleted. As these cameras provide data in near real-time, there is also the potential for low-latency applications with data provided within minutes of upload.
The further uptake of this approach and benefits could be realized if more local authorities and organizations made their traffic camera imagery data available on a low-resolution, infrequent, and open licensing basis; non-highways coverage would be especially helpful and could also help people to avoid crowding and queues at retail sites. Standardization in the way imagery is exposed would reduce the complexity of scaling the approach nationally or internationally and could ensure there is consistency in the data frequency and aggregation. Commercial pedestrian and vehicle counting systems are also available for CCTV, but additional transparency around the algorithms used and their efficacy is desirable before combining different approaches and processing pipelines.
As further work, we intend to make a number of technical enhancements, including improving the sampling design and scoring system for selecting cameras and investigating methods for analyzing crowds in images. We will also look at how the data, and related social and environmental factors, might predict activity at future time points, in locations without cameras, or in response to proposed public policy interventions.