1. Introduction
Temporal synchronization of multi-view video streams is essential for a wide range of computer vision tasks, including 3D reconstruction [1, 2, 3], pose estimation [4, 5, 6], and scene understanding [7, 8, 9, 10]. The exact tolerance to synchronization errors depends heavily on the application. Semantic scene analysis tasks, such as surgical phase recognition, activity classification, or object detection, tend to aggregate information over extended temporal windows, making them relatively robust to minor temporal misalignments. In contrast, geometry-driven applications like multi-view 3D reconstruction and motion capture require that corresponding frames represent the same physical moment. In these cases, even single-digit millisecond offsets can cause significant reprojection errors, blurry artifacts in reconstructions, or inaccurate motion trajectories.
In laboratory environments, specialized cameras with dedicated interfaces for hardware-based synchronization [11, 12, 13] can be used to ensure that the shutters are perfectly synchronized. However, such solutions are costly, require complex wired setups, and greatly restrict the range of compatible devices. In contrast, real-world deployments often involve camera arrays that integrate a heterogeneous mix of consumer and professional devices, many of which lack hardware synchronization capabilities.
When hardware synchronization is not available, temporal alignment is often approximated by recording unsynchronized videos and then, during post-processing, estimating time offsets and pairing each frame from one camera with the closest frame from another stream.
In practice, audio-based methods, such as clapping or other transient sound cues, remain common due to their simplicity and widespread availability of built-in microphones [14, 15, 16]. However, these methods are sensitive to environmental noise and cannot be used with devices lacking audio capture. Moreover, they constrain the spatial arrangement of the cameras, as all cameras must be within audible range and preferably positioned at similar distances from the sound source; otherwise, the relatively low propagation speed of sound causes arrival-time differences that can compromise synchronization. For example, in a room measuring 10 m across, the sound of a clap will take roughly 30 milliseconds to travel from one side to the other, given the speed of sound in air (≈343 m/s). This delay is already on the order of an entire video frame at 30 fps.
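The arithmetic behind the clap example can be checked with a short sketch. The function names are illustrative only, and the speed of sound is the approximate value quoted above.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for air, as quoted in the text


def acoustic_delay_ms(distance_m: float) -> float:
    """Time for sound to travel distance_m, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


def delay_in_frames(distance_m: float, fps: float) -> float:
    """Express the acoustic delay as a fraction of one frame period."""
    return acoustic_delay_ms(distance_m) * fps / 1000.0


print(f"{acoustic_delay_ms(10.0):.1f} ms")                    # 29.2 ms
print(f"{delay_in_frames(10.0, 30.0):.2f} frames at 30 fps")  # 0.87 frames at 30 fps
```

Two cameras at opposite ends of such a room would therefore disagree by almost one full frame at 30 fps before any other error source is considered.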
One alternative solution is to use timecodes stored alongside the video [17, 18], but this requires external timecode systems or pre-synchronizing the cameras’ internal clocks before recording. More recently, content-based approaches have been proposed that estimate temporal relationships directly from image content, for example by tracking general scene features [19], motion patterns [20], lighting changes [21], or human poses [22] across multiple cameras. While some of these methods achieve millisecond-level synchronization accuracy, they are typically tailored to narrow use cases and rely on specific video content and overlapping fields of view, which limits their applicability in more diverse scenarios. Some systems address this by explicitly embedding visually encoded timecodes into the video [23]. However, existing approaches are constrained by the low and fixed refresh rates of consumer screens and are not suitable for scenarios involving infrared (IR) cameras.
In contrast to hardware synchronization, post hoc alignment methods cannot ensure perfect frame correspondence, since the camera shutters operate independently. The residual error is bounded by half the frame period and thus scales with frame rate. For example, at 30 fps, the maximum misalignment can reach 16.7 ms, which is significant for fast motion. Although higher frame rates reduce this error, they also increase storage, power consumption (especially critical for battery-powered cameras), bandwidth, and processing costs, which can be prohibitive for long video sequences such as four-hour surgical recordings. It is therefore beneficial to estimate time offsets with sub-frame precision, enabling accurate frame interpolation for many downstream tasks.
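As a rough illustration of the half-frame bound, the sketch below computes the worst-case nearest-frame misalignment for several frame rates and converts it into the distance travelled by a moving object. The 2 m/s speed is an assumed example value, not a figure from this work.

```python
def max_nearest_frame_error_ms(fps: float) -> float:
    """Worst-case misalignment when pairing nearest frames: half a frame period."""
    return 0.5 * 1000.0 / fps


def displacement_mm(speed_m_per_s: float, fps: float) -> float:
    """Distance an object moves during the worst-case misalignment.

    (m/s) * ms equals mm, so no further unit conversion is needed.
    """
    return speed_m_per_s * max_nearest_frame_error_ms(fps)


for fps in (30, 60, 120):
    err = max_nearest_frame_error_ms(fps)
    # 2.0 m/s is an assumed speed for a fast hand or tool motion.
    print(f"{fps:>3} fps: {err:.1f} ms, {displacement_mm(2.0, fps):.1f} mm at 2 m/s")
```

At 30 fps this gives 16.7 ms, during which an object moving at the assumed 2 m/s travels about 33 mm.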
To address this gap, we propose a low-cost, portable, and camera-agnostic method for achieving millisecond-level temporal alignment across heterogeneous camera systems, ranging from inexpensive action cameras to high-end optical tracking systems. The proposed solution is designed to be scalable, supporting configurations ranging from as few as two cameras to several dozen, and remains robust in real-world scenarios.
Central to our approach is a custom LED Clock, a device that visually encodes timestamps using LEDs (see Figure 1a). An operator sequentially positions the LED Clock in front of each camera. Timestamps are encoded via two complementary LED components (Figure 1b): a binary counter providing a global timestamp with 100 ms resolution, and a circular arrangement of LEDs illuminated sequentially at 1 ms intervals. Because camera sensors integrate light over a non-zero exposure time, the circular sequence appears in the image as an elliptical arc, whose start and end points indicate the beginning and end of the exposure window. For accurate spatial decoding, an ArUco marker [24] of known geometry is included at the center, enabling detection and homographic rectification of the captured LED components (i.e., binary counter and elliptical arc). Combining the binary timestamp with the arc-based exposure estimate yields a millisecond-accurate global timestamp for the exposure window. This process is repeated for at least two frames per video, enabling estimation of both offset and drift for each camera and allowing precise temporal alignment of the entire recordings. In practice, the LED Clock is shown at the beginning and end of a recording to minimize extrapolation errors. Importantly, cameras do not need overlapping fields of view or simultaneous visibility of the LED Clock. Each camera can be calibrated independently by sequentially presenting the LED Clock, enabling synchronization even across spatially separated or non-overlapping camera setups.
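The decoding and alignment idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: it assumes the ring completes one revolution per 100 ms counter tick, that the decoded counter value is the one valid at the start of the exposure, and that a camera's clock is well described by a linear model fitted to two readings. The `ClockReading` container and all function names are hypothetical.

```python
from dataclasses import dataclass

COUNTER_TICK_MS = 100  # binary counter resolution stated in the text
RING_PERIOD_MS = 100   # ASSUMPTION: one ring revolution spans one counter tick


@dataclass
class ClockReading:
    """Hypothetical container for one decoded view of the LED Clock."""
    frame_index: int  # video frame in which the clock was decoded
    counter: int      # binary counter value at exposure start (assumed)
    arc_start: int    # ring position where the lit arc begins, 0..99
    arc_end: int      # ring position where the lit arc ends, 0..99


def exposure_window_ms(r: ClockReading) -> tuple:
    """Global start and end of the exposure window in milliseconds."""
    start = r.counter * COUNTER_TICK_MS + r.arc_start
    # The modulo handles arcs that wrap past the last ring position.
    duration = (r.arc_end - r.arc_start) % RING_PERIOD_MS
    return start, start + duration


def mid_exposure_ms(r: ClockReading) -> float:
    """Midpoint of the exposure window, used as the frame timestamp."""
    start, end = exposure_window_ms(r)
    return (start + end) / 2.0


def fit_clock_model(first: ClockReading, second: ClockReading) -> tuple:
    """Return (offset_ms, frame_period_ms) such that
    global_time(frame) = offset_ms + frame_period_ms * frame."""
    period = (mid_exposure_ms(second) - mid_exposure_ms(first)) / (
        second.frame_index - first.frame_index
    )
    offset = mid_exposure_ms(first) - period * first.frame_index
    return offset, period


# Made-up readings from the beginning and end of a nominally 30 fps video.
start_reading = ClockReading(frame_index=100, counter=52, arc_start=10, arc_end=18)
end_reading = ClockReading(frame_index=9100, counter=3052, arc_start=13, arc_end=21)

offset, period = fit_clock_model(start_reading, end_reading)
drift_ppm = (period / (1000.0 / 30.0) - 1.0) * 1e6
print(exposure_window_ms(start_reading))  # (5210, 5218)
print(f"{period:.4f} ms/frame, drift {drift_ppm:.1f} ppm")
```

With the fitted offset and frame period, every frame index of a video maps to a global time, which is what allows recordings from different cameras to be aligned even though each camera saw the clock at a different moment.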
This approach allows multiple video streams to be synchronized with millisecond-level accuracy, regardless of the cameras’ native synchronization capabilities and scene content. Combined with established frame interpolation techniques [25, 26], it opens up the possibility of synthesizing temporally aligned frames across all viewpoints, facilitating high-quality 3D reconstruction of dynamic scenes using unsynchronized consumer-grade cameras.
Our contributions are as follows: (1) We present a novel method for synchronizing heterogeneous camera systems comprising a combination of different infrared and RGB(-D) cameras. (2) We thoroughly evaluate our approach through a series of experiments conducted in both controlled laboratory settings and in real-world scenarios involving up to 25 cameras. (3) We demonstrate the importance and benefits of sub-frame temporal synchronization. (4) We contribute to the scientific community by open-sourcing all software and hardware designs developed in this work.
5. Experiments
To evaluate the accuracy and robustness of our method, we designed a comprehensive testing framework that compares it against existing synchronization techniques across various camera configurations and combinations, both in laboratory and real-world scenarios. For each configuration, we designed specific experiments and selected an appropriate quantitative metric based on the capabilities and limitations of the cameras involved.
In our first experiment, we used two Kinect cameras with hardware synchronization as the ground-truth reference. This setup allows us to compute the exact residual synchronization error in the time domain.
In our second experiment, we test the wide applicability of our method by comparing it with existing synchronization methods for consumer-grade cameras. Since GoPro cameras lack support for hardware synchronization to provide a reliable ground-truth reference, we instead performed an external calibration for every synchronization method using a fast-moving checkerboard. The resulting reprojection error was then used as a practical metric for evaluating temporal alignment, as it is minimized when the video streams are perfectly synchronized, thus providing a direct and quantitative measure of the temporal alignment achieved by each approach. Additionally, we used the sub-frame accurate timestamps provided by our method to interpolate synchronized 2D positions of the checkerboard instead of simply using the closest frame for every video stream, further minimizing the reprojection error.
Then, in our third experiment, we evaluated the accuracy of our method across two different modalities, namely RGB video cameras and a professional-grade IR optical tracking system. This setup enabled us to assess the robustness and generalizability of our approach when applied to heterogeneous camera systems.
In the fourth and fifth experiments, we demonstrated the importance of sub-frame-accurate synchronization for multi-view computer vision tasks. Specifically, we synchronized a multi-view recording with our method and compared the results of 3D pose estimation and 3D reconstruction when using sub-frame interpolation versus nearest-frame matching.
Finally, in the sixth experiment, we tested robustness outside controlled laboratory conditions through a case study in large-scale surgical data collection involving 25 cameras. The data collection featured a highly heterogeneous camera array, comprising devices from various manufacturers and price ranges, with differing settings such as field of view, frame rate, shutter speed, and sensor type, demonstrating the broad applicability of our method.
Together, this rigorous evaluation allowed us to thoroughly assess the accuracy and versatility of our method across diverse laboratory and real-world scenarios.
5.1. Experiment 1: Validation Against Hardware Synchronization
In this experiment, we evaluate the absolute synchronization error of our method by comparing its output against a ground-truth reference. To obtain the ground truth, we leveraged the hardware synchronization capabilities of the Azure Kinect DK (Microsoft Corporation, Redmond, WA, USA), assuming it provides perfect synchronization. The recording procedure was designed to closely match real-world multi-view data collections. We applied our method to recompute the timestamps for both video streams separately and then calculated the root mean square error (RMSE) between them across all frames. This error should ideally be zero because the shutters are perfectly synchronized. However, it is important to note that the Microsoft Azure Kinect DK uses an OV12A10 RGB sensor with a rolling shutter, meaning that different image rows are exposed sequentially in time. Consequently, the LED pattern on the clock may differ slightly between frames (see the Results paragraph for details). Together with the relatively low frame rate of 30 fps, this setup constitutes a challenging scenario for hardware-synchronization-free methods in terms of synchronization accuracy.
Experimental setup. Two Kinect cameras were placed next to each other, facing opposite directions, and connected using a cable to synchronize their shutters. We then used the k4arecorder utility (Azure Kinect SDK version 1.4.2, https://github.com/microsoft/Azure-Kinect-Sensor-SDK, accessed on 27 November 2025) on a central laptop to record from both cameras simultaneously. The cameras were set to a resolution of
pixels at 30 fps with a fixed exposure time of 8.33 ms.
The recording followed a protocol representative of real-world, non-laboratory conditions, similar to those in large-scale data collections involving many cameras (e.g., the 25-camera setup used in Experiment 6), where not all cameras share overlapping visual content. We first showed the clock to the first camera for approximately 5 s, then waited 2 min, and subsequently showed it to the second camera for 5 s. The 2-min delay simulates the scenario of synchronizing more than two cameras, assuming that one full cycle of showing the clock to every camera takes approximately 2 min. The clock was positioned approximately 150 cm from the camera and centered in the field of view. After the initial synchronization cycle, the actual procedure of interest was simulated by waiting for 5 min before performing a second identical synchronization cycle. This allowed for a more accurate measurement of the drift over time. This process was repeated five times, resulting in five independent 14-min recordings from two cameras.
Evaluation score. The 10 videos were processed independently using our method to recompute global timestamps for every frame. Finally, for every video pair, we computed the root mean square error (RMSE) between the recomputed timestamps. The RMSE is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( t_i^{(1)} - t_i^{(2)} \right)^2},$$
where $N$ is the total number of frames, and $t_i^{(1)}$ and $t_i^{(2)}$ represent the recomputed timestamps of the $i$-th frame in the first and second video streams, respectively.
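A minimal sketch of this score, assuming the recomputed timestamps of both streams are available as equal-length sequences in milliseconds (the function name and the sample values are illustrative):

```python
import math


def timestamp_rmse(stream_a, stream_b):
    """Root mean square error between two equal-length timestamp sequences."""
    if len(stream_a) != len(stream_b) or not stream_a:
        raise ValueError("streams must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(stream_a, stream_b)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up timestamps (ms) for four frame pairs of two hardware-synchronized cameras.
cam_1 = [0.0, 33.0, 67.0, 100.0]
cam_2 = [1.0, 32.0, 69.0, 100.0]
print(f"{timestamp_rmse(cam_1, cam_2):.3f} ms")  # 1.225 ms
```

Because the two shutters are hardware-synchronized, any non-zero value here reflects the error of the timestamp recovery itself.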
Results. RMSE values over the five runs are reported in Table 1, with an average RMSE of 1.34 ms, indicating strong agreement with the ground-truth reference. This error is significantly lower than the duration of a single frame at 30 fps (33.3 ms), demonstrating that our method achieves meaningful sub-frame accuracy. We attribute the residual error primarily to the rolling shutter of the Kinect’s RGB sensor. Unlike a global shutter that exposes all pixels simultaneously, the rolling shutter captures each row sequentially over a short interval. This causes slight temporal offsets across the frame, meaning that the effective exposure time for a given pixel depends on its vertical position. Therefore, the timestamp encoded by the clock appears slightly shifted depending on the clock’s position within the frame. Additionally, since the LEDs are arranged in a ring and therefore at different vertical positions, the decoded timestamp can vary slightly based on which specific LED was illuminated during exposure, for example, whether the observed arc is predominantly horizontal or vertical. This is further supported by the slight variations in exposure times shown in Figure A1. To reduce this artifact, the clock can be held at a greater distance from the camera, effectively decreasing the vertical pixel distance between the highest and lowest LED.
5.2. Experiment 2: GoPro to GoPro
In this experiment, we compare our method to existing synchronization approaches for aligning two GoPro video streams. Since GoPro cameras do not offer built-in hardware synchronization that can serve as a ground-truth reference, we evaluate performance indirectly by performing stereo external calibration based on a moving checkerboard and reporting the resulting reprojection error. The underlying intuition is that when video streams are accurately synchronized, the reprojection error is minimized, thus providing a practical and quantitative measure of temporal alignment. We deliberately maintained the checkerboard in continuous rapid circular motion to amplify the reprojection error caused by temporal misalignment, making the results easier to interpret. We compare the following methods:
1. RocSync (our method);
2. Global illumination-based synchronization (toggling room lights);
3. Audio-based synchronization (clapping) using AudioAlign [42];
4. Timecode-based synchronization (GoPro Timecode Sync using QR code).
For global illumination-based synchronization, we implemented a simple script that estimates the temporal offset by cross-correlating the gradient of average global illumination between the two cameras. Audio-based synchronization was performed with the open-source software AudioAlign [42] by applying FP-EP followed by cross-correlation. Like most battery-powered cameras, GoPros include an internal real-time clock that maintains wall time. This clock is then used to generate per-frame timestamps with frame-level precision, which are encoded in the SMPTE timecode format. These timestamps can be used to identify the corresponding frames in multiple video streams. The GoPro Labs Firmware (https://gopro.com/en/us/info/gopro-labs, accessed on 27 November 2025) enables fast and accurate timecode synchronization using QR codes displayed on a smartphone.
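The illumination-based baseline can be illustrated with a small stand-in sketch. It is not the script used in the experiments: it assumes each video has already been reduced to one mean-brightness value per frame, and it searches a bounded range of integer frame lags for the one that maximizes the cross-correlation of the brightness gradients. All names and the synthetic signals are hypothetical.

```python
def gradient(signal):
    """Frame-to-frame difference of a per-frame mean brightness signal."""
    return [b - a for a, b in zip(signal, signal[1:])]


def estimate_lag(ref, other, max_lag):
    """Return the integer lag (in frames) such that other[i + lag] best
    matches ref[i], by maximizing the cross-correlation of the gradients."""
    g_ref, g_other = gradient(ref), gradient(other)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(g_ref):
            j = i + lag
            if 0 <= j < len(g_other):
                score += value * g_other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic brightness traces: the room light is off for 10 frames.
# The second camera observes the same event 3 frames later.
camera_a = [100.0] * 20 + [10.0] * 10 + [100.0] * 20
camera_b = [100.0] * 23 + [10.0] * 10 + [100.0] * 17
print(estimate_lag(camera_a, camera_b, max_lag=10))  # 3
```

Working on the gradient rather than the raw brightness emphasizes the on/off transitions. Note that an integer-lag search of this kind can only recover the offset to the nearest frame.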
Experimental setup. We placed two GoPro cameras (GoPro Hero 12 and GoPro Hero 10) approximately 1 m apart to capture the same scene. Both cameras were running the GoPro Labs Firmware and they were configured to record at a resolution of with 60 fps and a fixed exposure time of 1/480 s. The internal real-time clocks of the cameras were synchronized immediately prior to recording using an animated QR code displayed on a phone, a feature of the GoPro Labs Firmware.
The recording was then started manually on both cameras, after which we performed the calibration procedures required for each synchronization method. Specifically, we clapped, turned the room light off and on, and finally showed our LED Clock to each camera for 5 s. After completing these initial calibration steps, a 3D-printed precision checkerboard was continuously moved along a circular trajectory for 30 s, making sure that it remained visible in both camera views. Lastly, the initial calibration steps were repeated once more.
The Hero 12 was selected as the reference camera, and the Hero 10 was aligned to it using the different synchronization methods. For each method, the videos were synchronized independently, and the same set of 100 consecutive frames was then used for stereo external calibration with the stereoCalibrate function from the OpenCV library (version 4.12.0). In addition, all original videos were downsampled to 30 fps (half the native frame rate) using ffmpeg (version 7.1.2), and the entire process was repeated on these recordings.
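Evaluation score. The synchronization accuracy of each method is evaluated using the mean reprojection error (MRE) of all checkerboard corners in the 100 consecutive frames after stereo calibration. For two cameras, a fully symmetric MRE can be defined using both forward and inverse projection. Let $\pi_L$ and $\pi_R$ denote the projection functions of the left camera and right camera, respectively. The MRE is written as:
$$\mathrm{MRE} = \frac{1}{2NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \left\| \mathbf{x}_{ij}^{L} - \pi_L\!\left( \mathbf{R}_i \mathbf{X}_j + \mathbf{t}_i \right) \right\|_2 + \left\| \mathbf{x}_{ij}^{R} - \pi_R\!\left( \mathbf{R} \left( \mathbf{R}_i \mathbf{X}_j + \mathbf{t}_i \right) + \mathbf{T} \right) \right\|_2 \right),$$
where $N$ is the number of image pairs (views), $M$ is the number of checkerboard points, $\mathbf{x}_{ij}^{L}$ and $\mathbf{x}_{ij}^{R}$ are the observed image points in the left and right images, $\mathbf{X}_j$ is the 3D position of the $j$-th checkerboard point, $\mathbf{R}_i$ and $\mathbf{t}_i$ are the pose (rotation and translation) of the checkerboard in the left camera frame, and $\mathbf{R}$ and $\mathbf{T}$ are the rotation and translation from the left camera to the right one. This symmetric formulation measures the reprojection error in both directions, providing an assessment of calibration and temporal alignment.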
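A two-camera mean reprojection error of this kind can be sketched in plain Python. The sketch makes several assumptions that are not taken from this work: an undistorted pinhole model with intrinsics given as (fx, fy, cx, cy), rotations given as 3×3 nested lists, and an average taken over both cameras, all views, and all points. Function and parameter names are hypothetical.

```python
import math


def transform(R, t, X):
    """Apply the rigid transform X' = R @ X + t to a 3D point."""
    return [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]


def project(K, X):
    """Undistorted pinhole projection; K = (fx, fy, cx, cy) is an assumed layout."""
    fx, fy, cx, cy = K
    return (fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy)


def symmetric_mre(obs_left, obs_right, board_points, board_poses,
                  K_left, K_right, R_lr, t_lr):
    """Mean pixel distance between observed and reprojected corners,
    averaged over both cameras, all views, and all board points."""
    total, count = 0.0, 0
    for i, (R_i, t_i) in enumerate(board_poses):
        for j, X in enumerate(board_points):
            X_left = transform(R_i, t_i, X)         # board -> left camera frame
            X_right = transform(R_lr, t_lr, X_left)  # left -> right camera frame
            total += math.dist(obs_left[i][j], project(K_left, X_left))
            total += math.dist(obs_right[i][j], project(K_right, X_right))
            count += 2
    return total / count


# Synthetic example: one view, two board points, right detections off by 2 px.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (1000.0, 1000.0, 960.0, 540.0)
board = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
poses = [(I3, (0.0, 0.0, 1.0))]
t_lr = (-0.5, 0.0, 0.0)

left = [[project(K, transform(I3, poses[0][1], X)) for X in board]]
right = [[
    (u + 2.0, v)
    for u, v in (project(K, transform(I3, t_lr, transform(I3, poses[0][1], X)))
                 for X in board)
]]
print(symmetric_mre(left, right, board, poses, K, K, I3, t_lr))
```

In the synthetic example the left detections are exact and the right detections are shifted by 2 px, so the average over both cameras comes out at about 1 px. A temporal offset between the streams acts in a similar way: the board pose seen by one camera no longer matches the detections in the other.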
Results. Table 2 reports the mean reprojection error across five runs, comparing different synchronization methods at 30 fps and 60 fps. At 30 fps, our method with interpolation achieves the lowest errors overall, consistently outperforming existing approaches such as light- or audio-based synchronization, and far surpassing timecode alignment. Without interpolation, our method performs similarly to light-based synchronization, as both provide frame-accurate offset estimation. At 60 fps, the margin between our interpolated method and the closest competing approaches narrows, since the maximum error of frame-accurate synchronization is inherently bounded by half a frame interval. With interpolation, our method achieves nearly the same error at 30 fps as at 60 fps, clearly demonstrating its sub-frame accuracy.
It is important to note that this experiment represents an ideal scenario for light-based synchronization, with a single ceiling-mounted lamp that turns on and off nearly instantaneously. In many real-world settings and larger rooms, illumination often comes from multiple independent lamps and is neither synchronized nor uniform.
Sound-based synchronization exhibits a noticeable error, which we suspect arises from the fact that video and audio streams are encoded separately and only later combined in a container. Differences in audio/image pipeline latency can introduce timing offsets that vary across camera models. Because the error exceeds a single frame interval, sub-frame interpolation does not consistently improve the result.
5.3. Experiment 3: IR Pose-Tracking to RGB Camera
One of the advantages of our approach lies in its ability to synchronize heterogeneous camera systems, including those that rely on infrared imaging. To evaluate this capability, we tested the combination of a consumer action camera (GoPro Hero 12) and a high-end IR marker tracking system. The pose-tracking modality was represented by the Atracsys fusionTrack 500 (Atracsys LLC, Puidoux, Switzerland), a stereo infrared tracking system used for high-precision 3D localization of infrared-sensitive markers in surgical applications. To our knowledge, no existing hardware synchronization method supports this camera setup, so we use the reprojection error as a means to validate the proposed synchronization method.
Experimental setup. Both cameras were mounted on the same tripod. The GoPro camera was set to record at a resolution of
with 60 fps. Similar to the previous experiment, a 3D-printed multimodal checkerboard with embedded infrared fiducials was moved through the scene with rapid linear motions visible to both cameras. Since the fusionTrack system records 3D marker and fiducial coordinates instead of 2D images, we developed a dedicated pipeline for detecting and decoding our LED Clock in these recordings (see Appendix A). After synchronizing the recordings using our method, we performed stereo extrinsic calibration with the multimodal checkerboard following the approach of Hein et al. [27]. The experiment was repeated three times in total.
Evaluation score. We use the MRE again. This time, it evaluates how accurately the corners of a moving 3D checkerboard rigidly attached to an IR marker are reprojected onto their corresponding 2D detections in the RGB images. For each synchronized pair, the 3D points $\mathbf{X}_j$ correspond to the checkerboard corners expressed in the marker reference frame. Using the pose of the IR marker with respect to the IR camera $(\mathbf{R}_i^{(1)}, \mathbf{t}_i^{(1)})$, these points are first expressed in the IR camera frame, then transformed into the RGB camera frame using the calibrated extrinsic parameters $(\mathbf{R}, \mathbf{T})$, and finally projected into the RGB image through the pinhole model $\pi_2$. The MRE is computed as follows:
$$\mathrm{MRE} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \left\| \mathbf{x}_{ij}^{(2)} - \pi_2\!\left( \mathbf{R} \left( \mathbf{R}_i^{(1)} \mathbf{X}_j + \mathbf{t}_i^{(1)} \right) + \mathbf{T} \right) \right\|_2,$$
where $N$ is the number of synchronized IR–RGB pairs, $M$ is the number of checkerboard corners per view, $\mathbf{x}_{ij}^{(2)}$ are the 2D detections in the RGB camera (camera 2), $\mathbf{X}_j$ are the 3D coordinates of the checkerboard corners expressed in the marker reference frame, $\mathbf{R}_i^{(1)}$ and $\mathbf{t}_i^{(1)}$ represent the rotation and translation of the marker with respect to the IR camera (camera 1) for view $i$, estimated from the IR stereo tracking system, $\mathbf{R}$ and $\mathbf{T}$ are the rotation and translation from the IR to the RGB camera, and $\pi_2$ denotes the projection function of the RGB camera.
This formulation allows evaluating the overall alignment accuracy between both sensing modalities by directly comparing the projected positions of the 3D checkerboard corners to their detected 2D image locations in the RGB frames.
Results. Table 3 reports the mean reprojection error (MRE) across three runs. While no direct comparison to other synchronization methods is possible in this heterogeneous setup, the observed error of 1.81 px at 4K resolution is consistent with previously reported results for the same configuration, namely same checkerboard size and geometry, as well as comparable checkerboard-to-camera distances, using joint optimization of extrinsic parameters and time offset [6]. This result demonstrates high synchronization accuracy.
5.4. Experiment 4: 3D Pose Estimation
In this experiment, we evaluate the applicability of our method to a common real-world downstream task: multi-view 3D human hand pose estimation. We quantitatively assess performance by comparing triangulation error when using interpolation with sub-frame timestamps versus the nearest-frame strategy. Additionally, side-by-side visualizations of the triangulated geometries qualitatively demonstrate fewer abnormalities when using sub-frame interpolation.
Experimental setup. We employed four GoPro Hero 12 cameras arranged in a square with approximately 30 cm spacing between adjacent units. All cameras recorded at 4K resolution and 30 fps with a shutter speed of 1/240 s. The captured sequence depicts a simulated surgical scenario in which a surgeon manipulates a saw bone.
Synchronization using our method was performed in two rounds following the same procedure as in previous experiments. For each view, 2D hand poses were manually annotated by an experienced annotator. Six distinct moments (denoted to ) with significant hand motion, where one hand was fully visible, were selected, and 3D hand joints were triangulated twice per moment: first by using the nearest frame from each view, and second by interpolating the 2D coordinates between the two nearest frames in each view using the sub-frame accurate timestamps obtained through our method.
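The difference between the two strategies can be sketched as follows, assuming each view provides strictly increasing per-frame global timestamps (in ms) and one list of 2D joint positions per frame. Linear interpolation between the two neighbouring frames is used here as a simple stand-in, and all names are illustrative.

```python
from bisect import bisect_right


def nearest_keypoints(timestamps_ms, keypoints, target_ms):
    """Nearest-frame strategy: take the joints of the temporally closest frame."""
    idx = min(range(len(timestamps_ms)),
              key=lambda k: abs(timestamps_ms[k] - target_ms))
    return list(keypoints[idx])


def interpolate_keypoints(timestamps_ms, keypoints, target_ms):
    """Sub-frame strategy: linearly interpolate each 2D joint between the two
    frames that bracket target_ms. Requires at least two frames."""
    if not timestamps_ms[0] <= target_ms <= timestamps_ms[-1]:
        raise ValueError("target time lies outside the recording")
    hi = min(bisect_right(timestamps_ms, target_ms), len(timestamps_ms) - 1)
    lo = hi - 1
    t0, t1 = timestamps_ms[lo], timestamps_ms[hi]
    w = (target_ms - t0) / (t1 - t0)  # 0 at the earlier frame, 1 at the later one
    return [
        ((1.0 - w) * x0 + w * x1, (1.0 - w) * y0 + w * y1)
        for (x0, y0), (x1, y1) in zip(keypoints[lo], keypoints[hi])
    ]


# Made-up data: one joint moving 40 px to the right between two frames.
times = [0.0, 40.0]
joints = [[(100.0, 200.0)], [(140.0, 200.0)]]
print(nearest_keypoints(times, joints, 10.0))      # [(100.0, 200.0)]
print(interpolate_keypoints(times, joints, 10.0))  # [(110.0, 200.0)]
```

Evaluating every view at one common target time before triangulation removes most of the motion-induced inconsistency that nearest-frame pairing leaves behind, provided the motion between consecutive frames is close to linear.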
In a second step, the four viewpoints were partitioned into two sets (GoPro 1 & 2 and GoPro 3 & 4), and moment was triangulated separately for each set using only two cameras (again with and without interpolation). The resulting 3D poses were then compared to assess the discrepancy between two independent 3D pose estimations, namely from the two camera subsets, corresponding to the same moment .
Results. For all six moments, we evaluated the triangulation error of the 3D hand joints in terms of the mean reprojection error. The results are reported in Table 4. We obtain a mean triangulation error of 4.4 px when using interpolation, compared to 8.8 px when using the nearest-frame synchronization strategy. These results highlight the clear benefit of sub-frame-accurate synchronization.
In the partitioned sparse-view scenario, we evaluated the discrepancy between the two resulting 3D poses. As shown in Figure 3, sub-frame synchronization reduced the mean Euclidean distance between poses from 26.07 mm without interpolation to 7.52 mm with interpolation, confirming its clear benefit.
Finally, we present a qualitative comparison with and without sub-frame synchronization in Figure 4a,b, respectively, associated with the scene shown at the top of Figure 5a. We observe a clear discrepancy in hand shape under the nearest-frame strategy, whereas the sub-frame synchronization case produces a reconstruction consistent with the actual hand shape.
5.5. Experiment 5: 3D Reconstruction
In this experiment, we visually demonstrate the benefits of sub-frame-accurate synchronization for multi-view 3D reconstruction. We combined our synchronization method with a state-of-the-art video interpolation technique to generate temporally aligned frames across all viewpoints, thereby significantly enhancing the quality of 3D reconstruction of dynamic scenes, particularly at lower frame rates.
Experimental setup. We reused the recordings from Experiment 4 and selected two moments with significant motion. The method of Jin et al. [26] was applied to generate temporally aligned frames across all viewpoints. MASt3R [43] was then applied to both the interpolated and non-interpolated frames. The reconstructions are shown in Figure 5.
Results. For both selected moments, the interpolated frames produce noticeably higher-quality reconstructions with fewer artifacts compared to the nearest-frame strategy. We conclude that sub-frame interpolation using our synchronization method provides a practical and effective alternative to hardware synchronization for scenes involving fast-moving objects.
5.6. Large-Scale Data Collections
In addition to controlled experiments, our method has been successfully deployed in multiple surgical data collections, recording several hours of surgery footage using more than 25 cameras. These collections featured a highly heterogeneous camera array, including near-field and far-field GoPro cameras operating in linear lens mode, head-mounted Aria glasses capturing a wide-angle view, Canon CR-N300 cameras capturing close-up views, different iPhone models, and a high-end infrared surgical navigation system (Atracsys fusionTrack 500) for tool tracking. The devices exhibited significant variations in shutter speeds, frame rates, sensor types, and price points.
This diversity posed considerable synchronization challenges that are not addressed by existing methods. The successful synchronization of all video streams highlights the flexibility and robustness of our approach in complex, real-world environments. Synchronization was manually confirmed by visually comparing frames with significant motion across all cameras.
Figure 6 illustrates the relative exposure times of 25 unsynchronized cameras used simultaneously in a single recording. Our method accurately determines the exact start and end of each exposure. Note that the exposure durations differ across cameras, as their configurations were adjusted according to the amount of light within each field of view. Specifically, the Near and Ego groups have considerably shorter exposure times.
To validate our synchronization method, each video was visually inspected at a minimum of two checkpoints. We extracted frames from the aligned video streams and checked for light switch events, where the surgical lamp or the room light was turned off and on at the beginning and end of each recording.
In cases where a camera stopped recording early (e.g., egocentric cameras worn by surgeons leaving the operating room), the second light-switch event was not available. In those situations, we relied on distinct, fast movements or other visible markers to verify the synchronization at the second checkpoint. Synchronization between the IR and RGB streams was validated separately in Experiment 3.
Out of 73 total recordings, only one instance failed synchronization. This was associated with the Aria glasses worn by the lead surgeon. The failure was not due to our method, but rather to frame drops in the recordings from the Aria glasses.
6. Conclusions
We have presented a novel tandem hardware and software solution for millisecond-accurate temporal multi-view video synchronization that is compatible with a wide range of camera systems and has been successfully validated in a broad range of scenarios.
Through extensive experiments, we have demonstrated that, for synchronizing two GoPro action cameras, our method outperforms the light-, audio-, and timecode-based approaches we compared against. Moreover, our approach remains effective for systems that include infrared cameras.
We have also demonstrated that sub-frame-accurate synchronization can be achieved without hardware synchronization and can enhance the performance of many downstream computer vision tasks through frame interpolation. This allows operation at lower frame rates while still recovering precise information, reducing storage and computational requirements and enabling more efficient large-scale data collection and analysis.
Finally, our method has been successfully tested outside of the lab with 25+ cameras, demonstrating its robustness and real-world applicability. Together, these results confirm that our approach provides accurate, reliable, and broadly applicable sub-frame synchronization across diverse camera systems and practical scenarios.
We acknowledge that our approach introduces a dedicated hardware component and requires a short manual calibration step, which limits applicability in fully automated or unattended deployments. This represents a trade-off between operational convenience and universality. In practice, calibration is lightweight and performed only at the beginning and end of a recording, typically taking only a few seconds per camera, which we found acceptable even in large-scale setups.