1. Introduction
The Urban Heat Island (UHI) is commonly evaluated using Satellite-Derived Land Surface Temperature (SD-LST), which estimates Earth’s surface temperature from thermal infrared satellite bands [1,2]. The satellite’s image resolution, land cover type, albedo, climate, time of image capture, and other factors drive SD-LST estimation [3]. However, SD-LST quantification and Surface UHI classification face several challenges. Cloud cover can obscure satellite-based SD-LST estimation, and there can be unavoidable trade-offs between the spatial and temporal resolution of the satellites used to capture it [4,5,6]. Moreover, SD-LST is a two-dimensional representation of surface temperature (TS), which oversimplifies the complexity of three-dimensional thermal environments [7]. For all of these reasons, SD-LST provides a limited portrait of the thermal environment, especially when used for urban heat mitigation inquiry [8].
Despite these challenges, urban heat studies often rely on quantifying the thermal environment through these coarse SD-LST measurements, largely because these metrics are easy to access and are global in scope. A systematic literature review of the impacts of land use and land cover on SD-LST identified Landsat as the most commonly used satellite platform for computing both land cover composition and SD-LST, accounting for a majority of studies [9], with the resampled 30 m resolution product being used in the majority of recent papers investigating urban cooling [10]. Thus, while SD-LST is valuable for assessing macro- and meso-scale UHI impacts, such as in the creation of Local Climate Zones (LCZs) to monitor neighborhood-level conditions [11], its utility diminishes at hyper-local scales that capture site-specific variability within just a few meters. For biometeorologists, this often necessitates moving beyond SD-LST toward hyper-local measurements that more accurately reflect the thermal environment and pedestrian thermal comfort [12,13].
Recognizing this scale mismatch, urban climatologists and remote sensing scientists have developed methods for downscaling SD-LST to finer spatial resolutions using statistical, machine-learning, and multi-sensor fusion techniques. These efforts underscore that the gap between coarse satellite-derived and fine-scale thermal environments is well recognized [14]. Yet even when these methods successfully refine the spatial resolution of SD-LST, they still yield a metric that represents only one component of the thermal environment and does not directly capture the radiative load or convective conditions experienced by pedestrians. Thus, even when downscaled, SD-LST alone cannot represent the drivers of pedestrian thermal comfort and must ultimately be validated against micrometeorological conditions [15].
The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) defines thermal comfort as “the condition of the mind that expresses satisfaction with the thermal environment” [16]. It is often derived from micrometeorological data, such as air temperature (TAir) and Mean Radiant Temperature (TMRT) [17]. TMRT is defined as the radiant heat exchange between two surfaces, specifically a person and their environment [18]. Despite the challenges of its measurement, including variations in time, instruments, and settings of capture [19], TMRT is often considered the most important measurement for assessing human thermal comfort within biometeorological studies [20]. These micrometeorological measurements are further used in the calculation of thermal comfort indices, such as the Wet Bulb Globe Temperature (WBGT), the Physiological Equivalent Temperature (PET) [21], and the Universal Thermal Climate Index (UTCI) [22].
While SD-LST provides useful information at broad spatial scales, it is a poor proxy for thermal comfort indices such as UTCI or PET [23,24]. Nevertheless, Landsat SD-LST continues to be widely used as a coarse proxy for outdoor thermal comfort or pedestrian heat exposure in urban heat studies [25,26,27,28], largely because physiologically meaningful variables often cannot be obtained remotely or are not acquired at hyper-local scales [29]. This emphasis on remotely sensed and/or macro-scale measurements can produce unintended consequences. For instance, increasing surface albedo through reflective building materials can reduce SD-LST and even TAir [30]. Yet, in some instances, these materials intensify reflected shortwave radiation and raise TMRT, ultimately exacerbating heat stress and thermal discomfort [31,32].
Therefore, outdoor thermal comfort is largely determined by TMRT, which in turn is directly influenced by the hyper-local thermal environment, including surface temperatures (TS), the infrared energy of surrounding materials [33,34,35], and shade, often the most effective way to reduce pedestrian thermal stress [36]. Advances in thermal imaging cameras, whose costs have declined in recent decades, have made it possible to obtain finer-resolution, hyper-local estimates of TS [37], potentially overcoming the challenges still posed by using SD-LST as a measure of thermal comfort. One widely used instrument is the Forward-Looking Infrared (FLIR) camera, which measures the longwave infrared radiation that objects emit [38]. While this measurement remains a brightness temperature, FLIR cameras provide an approximate TS, which has allowed for higher-fidelity radiometric measurements of urban heat fluxes and a better understanding of how TS varies with specific patterns of neighborhood-scale urban morphology, such as differences in cooling rates between roofs and walls, as well as underneath tree canopy and shade structures [39,40]. Shade has also been found to be a better predictor than SD-LST in models estimating TMRT [12]. However, studies that examine individual thermal comfort alongside TS often occur indoors in climate-controlled chambers, so there is a clear need for infrared thermal imaging in outdoor thermal comfort studies [41].
Therefore, while TS serves as a crucial metric for evaluating both increased urban heat and thermal comfort, particularly in relation to the cooling effects of shade, a disconnect remains between measurement methodologies and urban heat mitigation strategies [42]. Moreover, shade from different sources often has differing effects on TS. Built structures typically provide greater TS reduction than tree shade in arid systems [43], yet this pattern does not hold for all climates, and different types of shade structures sometimes demonstrate synergistic effects [44].
A pressing need remains to study outdoor spaces through the lens of human thermal experience, as this directly shapes how such spaces are used [45]. Bus stops represent a critical setting in which to examine the intersection of pedestrian thermal comfort and thermal dynamics. Extreme heat events can reduce transit ridership, with only the most transit-dependent individuals continuing to ride [46]. While shade remains a key mitigation strategy for urban heat at bus stops [47], bus stops and their associated shelter designs remain understudied facets of transit user experience [48]. Designing thermally comfortable transit stops is challenging, as several factors influence thermal comfort. Local micrometeorology, surface materials, and shade availability via shelters and/or vegetation all shape transit user comfort [49]. Given the complex interactions among shade, heterogeneous transit infrastructure, and surface material in shaping thermal environments at bus stops, there is a growing need for hyper-local, multi-scalar methods to assess surface temperature and inform more effective interventions. This study explores whether hyper-local thermography offers a practical and cost-effective approach to meeting that need.
Amid calls from urban climatologists to integrate in situ measurements with SD-LST for more accurate assessment of UHI phenomena [6], and from urban ecologists to integrate emerging data collection tools into multi-scale urban systems research [50], we ask the following: do simplified, low-cost hyper-local measurements of TS provide more meaningful spatial and temporal information about pedestrian thermal comfort at bus stops than SD-LST? Our aim is not merely to show that FLIR is more spatially precise than SD-LST, but to examine whether hyper-local TS can serve as an adequate predictor of thermal comfort in pedestrian-relevant microenvironments. We sought to (1) examine the differences in TS as measured from Landsat 8 and 9 and from hyper-local FLIR photogrammetry, (2) test how strongly these TS measurements were correlated with human biometeorological measurements, including air temperature (TAir) and other commonly used indicators of thermal comfort, including TMRT, WBGT, PET, and the UTCI, and (3) assess how well FLIR thermography can predict these indices of thermal comfort, highlighting a simple and cost-effective way for cities to analyze TS at a hyper-local scale.
3. Methods
3.1. Sampling Design and Methodology
Data were collected over an extensive five-week field campaign in July and August 2023, the hottest summer months. Measurements were made twice each weekday, once in the morning and once in the afternoon, during peak commute hours (7:30–10:30, 14:00–18:00). Bus stops were randomly selected for each day of the week, with the goal of obtaining a total of six replications per study site: three in the morning and three in the afternoon.
Hyper-local thermal dynamics of each bus stop were captured with thermographic images and biometeorological measurements. FLIR images were captured from three positions at the bus stop, forming a surface area of approximately 48 m². FLIR image capture coincided with biometeorological measurements using a series of three Kestrel 5400 sensors (Figure 2).
3.2. Biometeorological Measurements
Biometeorological measurements taken with the three Kestrel sensors included air temperature (TAir), wind speed (Va), relative humidity (RelHum), Dry Bulb Globe Temperature (TGlobe), and Wet Bulb Globe Temperature (WBGT). The three sensors were set to report metric units and positioned 4.8 m apart on tripods at 1.1 m above ground level at each site, following methods designed by Dzyuban et al. [49], to capture the micrometeorology of the bus stop. Sensors were left to acclimate for five minutes prior to recording. They then recorded for two minutes, and their measurements were averaged.
Thermal comfort indices, including TMRT, were then calculated from these Kestrel measurements. TMRT was calculated using a modified method of the ISO black globe thermometer equation found in Ouyang et al. [54]. This modified method was specifically calibrated for the Kestrel sensor with a different convection coefficient and has the following equation:
where TGlobe and TAir are the globe temperature and air temperature in Celsius, respectively, and Va is the wind velocity in meters per second. The thermal comfort index Physiological Equivalent Temperature (PET) was calculated using the software RayMan Pro (Version 0.1) [55,56], with inputs from the Kestrel sensors, TMRT, and self-reported personal and biometric factors from transit users who were willing to report these metrics while waiting for their bus. These inputs included weight, height, sex, and clothing insulation expressed in clo units, a metric that assigns insulation values to different articles of clothing. The collection of these data was reviewed and approved by The University of British Columbia’s Behavioural Research Ethics Board under identification code H23-01399. Another thermal comfort index, the Universal Thermal Climate Index (UTCI), was calculated using the R package ‘comf’ (Version 0.1.12) [57] with TAir, TMRT, relative humidity, and wind velocity as inputs. A final thermal comfort index, WBGT, was obtained directly via the Kestrel sensors.
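As an illustration of how a globe-thermometer estimate of TMRT is formed, the following minimal Python sketch implements the general ISO 7726 forced-convection equation. It is not the Kestrel-calibrated version from Ouyang et al. [54], whose modified convection coefficient is not reproduced here. The function name, the default globe diameter (25.4 mm, assumed for the Kestrel globe), and the default emissivity (0.95) are assumptions made for this example only.

```python
def mean_radiant_temperature(t_globe_c, t_air_c, wind_ms,
                             globe_diameter_m=0.0254, emissivity=0.95):
    """Estimate TMRT (deg C) with the general ISO 7726 forced-convection form.

    Assumption: standard ISO coefficient (1.1e8), not the Kestrel-specific
    calibration used in the study. Diameter and emissivity are illustrative.
    """
    t_globe_k = t_globe_c + 273.15
    # Convective term scales with wind speed and globe size.
    convection = 1.1e8 * wind_ms ** 0.6 / (emissivity * globe_diameter_m ** 0.4)
    fourth_power = t_globe_k ** 4 + convection * (t_globe_c - t_air_c)
    if fourth_power <= 0:
        raise ValueError("inputs give a non-physical radiant temperature")
    return fourth_power ** 0.25 - 273.15


# Hypothetical afternoon reading: globe warmer than air under light wind.
example_tmrt = mean_radiant_temperature(t_globe_c=40.0, t_air_c=32.0, wind_ms=1.0)
```

When the globe and air temperatures are equal, the convective term vanishes and the estimate reduces to the globe temperature. A globe warmer than the air yields a TMRT above the globe temperature, with the difference growing with wind speed.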
3.3. FLIR Image Capture, Segmentation, and TS Measurements
All three FLIR images were taken facing the bus stop to capture the TS of the ground and horizontal surfaces, including buildings and bus stop infrastructure. The FLIR C5 thermal camera was set to the standard emissivity of 0.95. Camera positions were standardized as follows: FLIR 1 (F1), oriented to the left of the bus stop; FLIR 2 (F2), positioned in the street facing the bus stop; and FLIR 3 (F3), oriented to the right of the bus stop. F1 and F3 were taken 9.6 m from the stop’s center point (defined as the pole displaying the unique bus stop identification number), while F2 was captured 3 m from this pole (Figure 2). These distances were chosen to approximate areas where most transit users wait. Images were then segmented into polygons using the proprietary software FLIR Thermal Studio Suite (Version 2.0). Segmentation was based on both surface type and camera placement. For F1 and F3, segments encompassed all surface types between the camera position and the central Kestrel (K2). For F2, segmentation included all surface types. Seven surface categories were defined: asphalt, concrete, fine vegetation (herbaceous surfaces such as grass), coarse vegetation (woody tissue of street trees), bare soil, building (walls or fences), and bus stop infrastructure (shelters, poles, benches, and other street furniture). An example of the images and their segmentation can be seen below (Figure 3).
Due to the heterogeneous composition of each bus stop and its associated surfaces, the segmented polygon size was not standardized. From each segmented polygon, an average TS was determined from the software and was recorded within each image position and for each replicate. If multiple polygons of the same surface types were present within a single FLIR image (e.g., several polygons classified as grass), their TS values were averaged. This value was defined as FLIR image segment TS.
From these FLIR image segment (polygon) TS measurements, a grand mean was calculated for each of the three FLIR images to produce a FLIR image average for a given bus stop. If a surface category was absent from an image (e.g., no coarse vegetation or bare soil at that stop), it was assigned a null value and excluded from the grand mean. Thus, three averages were generated per replicate, one for each FLIR image. Importantly, this grand mean was not the overall pixel-based TS mean of each FLIR image, provided by the FLIR Thermal Studio, but the average of the defined surface-type segments within a given FLIR image. This approach allowed for comparison across camera positions and assessment of how consistently they measured TS from their position across bus stops. This value was defined as the FLIR image TS.
Finally, the average TS for each surface type was calculated across all three images. As with the FLIR image TS, a grand mean of bus stop TS was then derived by averaging TS values across all surface types at a given bus stop, with the surface averages representing the mean across the three camera positions. Importantly, this grand mean was not the overall mean of the three FLIR images, but rather an aggregate of surface type averages. This approach allowed assessment of the consistency of TS measurements across surface types. This value was defined as FLIR bus stop TS.
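The three levels of aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow of FLIR Thermal Studio: the function names and all temperature values are hypothetical, and absent surface categories are simply omitted, which mirrors their exclusion from the grand means.

```python
from statistics import mean


def segment_ts(polygons):
    """FLIR image segment TS: average polygons of the same surface type."""
    grouped = {}
    for surface, ts in polygons:
        grouped.setdefault(surface, []).append(ts)
    return {surface: mean(values) for surface, values in grouped.items()}


def image_ts(segments):
    """FLIR image TS: grand mean of the surface-type segments in one image."""
    return mean(list(segments.values()))


def bus_stop_ts(images):
    """FLIR bus stop TS: mean of per-surface averages across camera positions."""
    grouped = {}
    for segments in images:
        for surface, ts in segments.items():
            grouped.setdefault(surface, []).append(ts)
    surface_means = [mean(values) for values in grouped.values()]
    return mean(surface_means)


# Hypothetical polygon temperatures (deg C) for camera positions F1, F2, F3.
f1 = segment_ts([("asphalt", 40.0), ("asphalt", 44.0), ("concrete", 36.0)])
f2 = segment_ts([("asphalt", 46.0), ("fine vegetation", 30.0)])
f3 = segment_ts([("concrete", 34.0)])
images = [f1, f2, f3]

stop_ts = bus_stop_ts(images)                          # aggregate of surface averages
mean_of_images = mean([image_ts(s) for s in images])   # mean of the image averages
```

With these hypothetical values the two summaries differ, which illustrates why the text distinguishes the FLIR bus stop TS from the simple mean of the three FLIR image TS values.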
3.4. Analysis of FLIR Camera Position and Surface Type for TS Measurement Consistency, Variable Selection
With multiple FLIR images taken from different camera positions, and numerous thermographic images segmented, we wanted to ensure our method for calculating the TS of all surface types was consistent over the range of camera positions and segmented polygons at each bus stop. We first examined summary statistics (mean, median, standard deviation, coefficients of variation, and interquartile range) of our TS sample for each surface type, as well as the average TS for all FLIR images taken at a stop, to examine the relative variation between mean surface TS and mean image TS.
To determine whether we were measuring the TS of shared surfaces consistently, we calculated intraclass correlations using the two-way random effects model for the mean of k raters (ICC2k). ICC2k treats raters as a random sample from a larger population and estimates the reliability of their average rating. In our case, the “raters” were delineated FLIR image segments (polygons), which varied in size and spatial shading patterns across bus stops. Because these polygons can be considered randomly sampled subdivisions of a heterogeneous surface, ICC2k was appropriate for evaluating the consistency of FLIR camera position across surface types (FLIR image segment TS). We then calculated ICC3k, a two-way mixed-effects model for the mean of k raters, which assumes that the set of raters is fixed. Here, the raters were the three FLIR camera positions, which were held constant across all bus stops (same distances, same orientations). ICC3k was therefore used to examine the consistency of surface temperature measures across camera positions and to test how reliably the average of these three camera-derived values (one per image) represented the overall bus stop TS, relative to the grand average of all segmented polygons. Across all cases, we used ICC estimates of consistency rather than absolute agreement, as our focus was on whether FLIR-derived measures covaried reliably across positions and image segments, rather than whether they produced identical values in a heterogeneous thermal environment.
ICCs were run using the package ‘psych’ in R [58]. ICC coefficients were assessed using criteria in which values less than 0.5 indicate poor consistency, values between 0.5 and 0.75 indicate moderate consistency, values between 0.75 and 0.9 indicate good consistency, and values greater than 0.9 indicate excellent consistency [59]. The metric with the greatest consistency, that is, the highest ICC coefficient, was selected as our representative variable of TS values captured by FLIR. The ICC coefficients of this analysis are found in Appendix A. Ultimately, FLIR bus stop TS was determined to be consistently measured across camera positions and was adopted for use in this study.
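For readers unfamiliar with the statistic, the consistency form of ICC3k can be computed from the mean squares of a two-way layout as (MSR − MSE)/MSR, where MSR is the between-target mean square and MSE the residual mean square. The Python sketch below is an illustrative reimplementation under that standard definition, not the ‘psych’ code used in the study. The function names and the example ratings are hypothetical, and the handling of values falling exactly on a threshold is an assumption.

```python
def icc3k(ratings):
    """ICC3k (consistency, mean of k fixed raters).

    ratings: list of rows, one per target (e.g., a bus stop visit), each with
    k values, one per fixed rater (e.g., camera positions F1, F2, F3).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_error = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


def consistency_label(icc):
    """Map an ICC to the verbal categories used in the text.

    Assumption: boundary values are assigned to the higher category below 0.9.
    """
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Hypothetical TS values: cameras differ by a constant offset at every stop.
offset_only = [[30.0, 31.0, 32.0], [35.0, 36.0, 37.0], [40.0, 41.0, 42.0]]
label = consistency_label(icc3k(offset_only))
```

Because the consistency form ignores constant offsets between raters, the hypothetical cameras above are perfectly consistent even though they never report identical values. This is the distinction between consistency and absolute agreement drawn in the text.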
3.5. Statistical Analyses
A linear mixed-effects model was fit using the R package ‘lme4’ [60] to examine differences between Landsat’s TS (SD-LST) and FLIR TS. As Landsat captures images in the morning only, a smaller subset of the data was used: FLIR images captured in the morning that coincided with Landsat’s orbital cycle. Measurement method (FLIR TS or Landsat SD-LST) was included as a categorical fixed effect predicting bus stop TS. To account for repeated measurements and clustering, random effects included the date of image capture for both FLIR and Landsat, as well as the unique bus stop ID (BSID) for each study site location [61]. The equation for this model was thus as follows:
$$TS_{ij} = \beta_0 + \beta_1\,\mathrm{Method}_{ij} + b_{\mathrm{Date}_j} + b_{\mathrm{BSID}_i} + \varepsilon_{ij}$$
where $TS_{ij}$ is the bus stop surface temperature for BSID $i$ on date $j$, $\mathrm{Method}_{ij}$ is the measurement method (0 = FLIR, 1 = Landsat), $\beta_0$ is the mean TS for FLIR, the reference category, $\beta_1$ is the fixed effect of method, $b_{\mathrm{Date}_j}$ and $b_{\mathrm{BSID}_i}$ are the random effects for date and bus stop, respectively, and $\varepsilon_{ij}$ is the residual error.
To better understand the sources of variability in surface temperature measurements, we additionally examined the contribution of each component of the mixed-effects model separately. Specifically, we fit FLIR-only and Landsat-only models, including the same random effects for Date and BSID. This allowed us to quantify how much of the total variance was attributable to site-specific differences (BSID), day-to-day variation (Date), and residual measurement error, independently for each measurement method.
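The variance partitioning described here amounts to dividing each estimated variance component by their sum. The short Python sketch below shows that arithmetic. The function name and the variance values are hypothetical placeholders, not results from this study.

```python
def variance_shares(components):
    """Return each variance component as a proportion of the total variance."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


# Hypothetical variance components (deg C squared) from a single-method model.
example_components = {"BSID": 4.0, "Date": 2.0, "Residual": 2.0}
example_shares = variance_shares(example_components)
```

Comparing these shares between the FLIR-only and Landsat-only models indicates whether a method mainly distinguishes sites, mainly tracks day-to-day conditions, or is dominated by residual error.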
FLIR measurements and Landsat measurements were then compared to hyper-local Kestrel measurements and thermal comfort indices with a Pearson product-moment correlation matrix to examine correlations between these two methods and biometeorological measurements. Pearson correlations were first assessed for significance at an alpha of 0.05. Correlations with p-values above alpha were considered not linearly correlated. Pearson correlation coefficients were then assessed for strength, with coefficients of 0.8 or higher considered strong.
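The two-step screening described above can be sketched in plain Python. The example computes the Pearson coefficient and its t statistic, and applies the 0.8 strength threshold. The p-value itself would come from a t distribution with n − 2 degrees of freedom and is not computed here. The function names and data are hypothetical, and applying the threshold to the absolute value of r is an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def is_strong(r, threshold=0.8):
    """Assumption: strength is judged on the magnitude of r."""
    return abs(r) >= threshold


# Hypothetical paired observations: FLIR TS and a comfort index.
flir_ts = [1.0, 2.0, 3.0, 4.0, 5.0]
comfort = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(flir_ts, comfort)
t = t_statistic(r, len(flir_ts))
```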
We then examined how well the average TS, as captured by the FLIR, predicted hyper-local measurements of the bus stop, including TAir and indices of thermal comfort: WBGT, TMRT, UTCI, and PET. Another series of five linear mixed-effects models was generated, this time predicting these metrics:
$$Y_{ij} = \beta_0 + \beta_1\,TS_{\mathrm{FLIR},ij} + b_{\mathrm{Date}_j} + b_{\mathrm{BSID}_i} + \varepsilon_{ij}$$
where $Y_{ij}$ represents one of the five biometeorological metrics (TAir, WBGT, TMRT, UTCI, or PET), $\beta_0$ is the model intercept, $\beta_1$ is the fixed effect of FLIR TS, $b_{\mathrm{Date}_j}$ and $b_{\mathrm{BSID}_i}$ are the random effects of date and bus stop, respectively, and $\varepsilon_{ij}$ is the residual error.
As we were not limited by Landsat’s orbital cycle, the larger data set spanning the full field campaign was used. The exception was the model predicting PET, whose calculation required biometric information from willing transit users; as not all transit users were willing to report this, the sample size for PET was lower. Random effects initially included study site (BSID) and date. However, after likelihood ratio testing of nested models, BSID was dropped as a random effect to avoid singular fits: the random-effects structure was too complex for these data, resulting in an overfitted model. Lastly, Root Mean Squared Error (RMSE) values were calculated for each model, both for the fixed effects only and for the full mixed-effects model, to assess whether these models were reasonable for use under ISO 7726 standards for thermal comfort methods, for which an error of less than five degrees Celsius indicates adequate model fit [62]. Significance within all models was evaluated at an alpha (α) of 0.05, and model diagnostics were visualized to check model performance and assumptions.
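The RMSE check described above can be illustrated with a short Python sketch that compares fixed-effects-only predictions with predictions that also include a date-level random intercept. All coefficients, date effects, and observations below are hypothetical placeholders, and the function names are chosen for this example only.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def predict(ts_flir, beta0, beta1, date_effect=0.0):
    """Linear prediction; date_effect = 0 gives the fixed-effects-only value."""
    return beta0 + beta1 * ts_flir + date_effect


def meets_threshold(error_c, threshold_c=5.0):
    """Check an RMSE against the five-degree Celsius criterion cited in the text."""
    return error_c < threshold_c


# Hypothetical data: FLIR TS, date-level random intercepts, and observed TMRT.
flir_ts = [30.0, 35.0, 40.0, 45.0]
date_effects = [1.0, -1.0, 1.0, -1.0]
observed_tmrt = [46.5, 49.0, 57.0, 59.5]
beta0, beta1 = 15.0, 1.0  # hypothetical fixed-effect estimates

fixed_only = [predict(ts, beta0, beta1) for ts in flir_ts]
full_model = [predict(ts, beta0, beta1, d) for ts, d in zip(flir_ts, date_effects)]
rmse_fixed = rmse(observed_tmrt, fixed_only)
rmse_full = rmse(observed_tmrt, full_model)
```

Reporting both values shows how much of the predictive accuracy depends on knowing the date-level effect, which would not be available when applying the model to a new day.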
6. Conclusions
This study examined the differences in TS at bus stops between two measurement methods: macroscale SD-LST from the Landsat 8 satellite and hyper-local FLIR thermography. In addition, it linked TS derived from FLIR photogrammetry to commonly used thermal comfort indices, including TMRT, PET, and UTCI, to examine its use in predicting human thermal experience.
We demonstrated that TS, as measured by Landsat, is on average 10.7 degrees hotter than FLIR measurements. Additionally, FLIR measurements are strongly and significantly correlated with the micrometeorological measurements of the bus stop (r > 0.8, p < 0.001), while Landsat measurements show no significant correlations. Lastly, the average TS measured by the FLIR explained over 50% of the variation in TAir, WBGT, UTCI, PET, and TMRT. With these models having RMSE values below five degrees Celsius, segmented FLIR image averages are adequate for use in thermal comfort studies under standards put forth by the ISO.
Ultimately, we find that this novel method of utilizing thermal image photogrammetry is sufficient as a simple, low-cost alternative for analyzing the TS of bus stops, overcoming some of the challenges of scale posed by the Landsat satellites, whose measurements were not significantly correlated with hyper-local biometeorological measurements. Continuing to examine why these differences exist, along with other measurement methodologies for capturing SD-LST, would be helpful for advancing analyses derived from satellite imagery. In sum, hyper-local thermographic images are effective at predicting indices of human thermal comfort at bus stops, offering potential solutions for prioritizing heat-resilient transit design in semi-arid transit systems.