Bidirectional audio and video streaming is a fundamental capability of a mobile robot designed for tele-operation and remote interaction. This section describes the generic bidirectional audio and video transmission system developed for the APR. The system is specifically designed to connect two Android devices: the tablet that controls the APR and the remote operator’s smartphone or tablet.
4.2.3. Video Communications
Video transmission and reception is an indispensable feature of a tele-operated assisted-living robot. Video communication is computationally expensive and must be efficient, reliable and robust, with low transmission delay.
Currently, one of the most commonly used encoders for live streaming is H.264/MPEG-4, owing to its high compression capabilities. This encoder is designed to generate stable and continuous transmissions with low bandwidth requirements, but it has several drawbacks. On the one hand, it can introduce a variable communication delay of 3 to 5 s, and high-motion scenes can appear blurred because of the dependence between consecutive frames [17]. On the other hand, it requires highly optimized libraries that are difficult to integrate into a custom application. The bidirectional video communication method implemented in the APR overcomes these drawbacks by transmitting video as a sequence of JPEG-compressed images instead of using a common video streaming format. This approach simplifies the implementation and control of the communication. Consequently, the APR has to acquire, compress and send images of its surroundings while receiving, decompressing and displaying the images sent by the remote device.
The procedure for image acquisition in the Android environment starts in the CameraSurfaceView class, which extends the Android class SurfaceView. The SurfaceView class is used as a canvas providing a dedicated drawing area where the preview of the camera is displayed. Once the camera has been initialized, the onPreviewFrame(byte[] data, Camera camera) callback is triggered each time a new image is obtained, retrieving an array containing the YUV image data encoded with the NV21 Android native format for camera preview as well as the camera instance that has captured the image. The images are then compressed to the standard JPEG format (which reduces the file size at the expense of quality and computing time) to be submitted as an image packet.
Table 5 lists typical image resolutions together with the size, quality and time required for the JPEG compression of the image, and the theoretical maximum rate of frames per second (fps) that can be obtained in communication with the tablet of the APR. Video communication can use any resolution available in the telecontrol device. Under normal conditions, the time required to compress one image is similar to the time needed to decompress it. The large size reduction achieved with JPEG compression (Table 5) arises because the original YUV-NV21 image format is already a reduced format with averaged and grouped U and V color planes, so the image has less spatial variability than a conventional raw RGB color image and can be described with less information after compression.
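The grouping of the U and V planes mentioned above can be illustrated with a small plain-Java sketch of the NV21 buffer layout: a full-resolution Y plane followed by one interleaved V,U pair per 2 × 2 block of pixels (12 bits per pixel on average). The class and method names are hypothetical and are not part of the APR code; even image dimensions are assumed.

```java
/** Illustrative helper for the NV21 buffer layout (hypothetical names, even dimensions assumed). */
final class Nv21Layout {
    private Nv21Layout() {}

    /** Total bytes: one Y byte per pixel plus one V,U byte pair per 2x2 pixel block. */
    static int bufferSize(int width, int height) {
        return width * height + (width / 2) * (height / 2) * 2;
    }

    /** Index of the luma (Y) sample of pixel (x, y). */
    static int yIndex(int width, int x, int y) {
        return y * width + x;
    }

    /** Index of the V sample shared by the 2x2 block containing (x, y); U is stored at index + 1. */
    static int vIndex(int width, int height, int x, int y) {
        int chromaStart = width * height;      // chroma plane follows the Y plane
        return chromaStart + (y / 2) * width   // each chroma row holds width/2 pairs = width bytes
                + (x / 2) * 2;                 // two bytes (V, U) per block column
    }
}
```

For a 320 × 240 preview frame, bufferSize returns 115,200 bytes, half the size of the same frame stored as 24-bit RGB.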
The implementation of a videoconference system requires image acquisition, image compression, transmission, reception, and image decompression in both directions. The real frame rate of the image acquisition devices is usually 15 fps. Some images may be lost during this process: (a) because a new image becomes available while the previous image is still being compressed, so the new image must be discarded; (b) because of packet loss in the UDP transmission over the network; and (c) because a new image is received while the previous image is still being decompressed, so the new image must be discarded. However, current smartphones and tablets usually include multiple CPUs and use different isolated threads for transmission and reception, so they can achieve higher videoconference frame rates than when only one CPU is available for processing.
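Cases (a) and (c) both amount to dropping a frame whenever the stage that should process it is still busy. A minimal sketch of such a gate, assuming a single atomic flag shared between the producing and the processing thread, is given below; BusyGate is a hypothetical name and does not come from the APR implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical drop-when-busy gate shared by a frame producer and a processing stage. */
final class BusyGate {
    private final AtomicBoolean busy = new AtomicBoolean(false);

    /** Returns true if the caller may process the frame; false means the frame should be dropped. */
    boolean tryEnter() {
        // Atomically switch idle -> busy; fails if another frame is still being processed.
        return busy.compareAndSet(false, true);
    }

    /** Marks the stage as idle again once the current frame has been handled. */
    void leave() {
        busy.set(false);
    }
}
```

A frame callback would call tryEnter() before compressing or decompressing, skip the frame when it returns false, and call leave() when the work is finished.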
Figure 9 shows a visual example of the impact of the quality parameter on JPEG compression and decompression. The figure shows the face of the remote operator, acquired with a smartphone, compressed as JPEG, submitted to the network, received by the tablet of the APR, decompressed, and shown on the screen of the APR for visual interaction. In this example with small images, the difference between the 100% and 60% qualities is almost imperceptible to the human eye, while the image size is 12 times smaller. There are also only small differences between the 60% and 30% qualities in terms of compressed image size, processing time and expected frames per second. The APR uses a video transmission procedure configured to operate with a starting default resolution of 320 × 240 pixels and a JPEG quality factor of 60%. With this configuration, the expected average image submitted to the network is 56 times smaller than the original YUV-NV21 images acquired by the device, and 84 times smaller than the RGB version of the original image. This image resolution can be changed manually during the videoconference or adapted dynamically to the network bandwidth capabilities.
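The effect of the quality factor on the compressed size can be reproduced on a desktop JVM with the standard javax.imageio API. This is only a stand-in for the Android compression path, which is not shown in this section; the class name, the synthetic test image and the assumption that a JPEG writer is registered (as in standard JDKs) are illustrative choices.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

/** Desktop-Java illustration of JPEG quality versus size (not the Android code path). */
final class JpegQualityDemo {
    private JpegQualityDemo() {}

    /** Compresses an image to JPEG with a quality between 0.0 and 1.0. */
    static byte[] compress(BufferedImage image, float quality) {
        ImageIO.setUseCache(false); // keep intermediate data in memory, no temp files
        // Assumes a JPEG writer is available, as in standard JDK distributions.
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /** Builds a synthetic RGB image with gradients and fine texture. */
    static BufferedImage testImage(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = x * 255 / width;
                int g = y * 255 / height;
                int b = (x ^ y) & 0xFF;
                img.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return img;
    }
}
```

Calling compress(testImage(320, 240), 0.6f) and comparing its length with the result for 1.0f gives a quick feel for the size reduction; the exact ratios depend on the image content and will differ from those in Table 5.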
The compressed image is sent to the UdpTxImages thread, which creates a new UDP packet containing the image. Once an image has been received at the UdpTxImages thread, a flag is activated to notify the CameraSurfaceView class that a transmission is in progress; until that transmission is completed, newly acquired images are neither compressed nor sent to the UdpTxImages thread. The main goal of this procedure is to avoid wasting CPU cycles by compressing images that will not be sent. However, this method slightly increases the video delay, since the next image to be transmitted is not compressed until the last transmission ends. In general, a video transmission of compressed JPEG images of 320 × 240 pixels obtained at a frame rate of 15 fps and compressed with a quality factor of 60%, with an average size of 2720 bytes, requires a network bandwidth of 0.318 Mb/s per video stream (0.881 Mb/s for a color image of 640 × 480 pixels). This image-based video streaming gives shorter communication delays but may require higher network bandwidth than alternative buffer-based streaming systems. In general, typical domestic networks have enough upload and download bandwidth for bidirectional video streaming with 640 × 480-pixel color images processed at 15 fps.
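The bandwidth estimate follows directly from the average image size and the frame rate. The sketch below reproduces the arithmetic for the 320 × 240 case, ignoring UDP/IP header overhead; StreamBandwidth is a hypothetical helper name.

```java
/** Hypothetical helper reproducing the per-stream bandwidth arithmetic (payload only). */
final class StreamBandwidth {
    private StreamBandwidth() {}

    /** Payload bit rate for a stream of frames with the given average size. */
    static long bitsPerSecond(int averageFrameBytes, int framesPerSecond) {
        return (long) averageFrameBytes * framesPerSecond * 8L;
    }

    /** Same value expressed in Mb/s, taking 1 Mb as 10^6 bits. */
    static double megabitsPerSecond(int averageFrameBytes, int framesPerSecond) {
        return bitsPerSecond(averageFrameBytes, framesPerSecond) / 1_000_000.0;
    }
}
```

With 2720 bytes per image at 15 fps the payload rate is 326,400 bit/s, about 0.33 Mb/s when 1 Mb is taken as 10^6 bits. This is of the same order as the 0.318 Mb/s quoted above; the small difference may come from the unit convention used.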
Video reception is carried out by the UdpRxImages execution thread. Once initialized, this thread starts receiving UDP packets containing the frames transmitted by the remote device. Each time an image packet is received, the data is decoded into a bitmap using the Android method BitmapFactory.decodeByteArray(). The new bitmap is then sent to the main activity of the application using the OnImageReceived() listener, and the interface view is updated by the User Interface (UI) thread. In order to reduce memory usage on the reception side of the application, the inMutable flag of the BitmapFactory.Options class is set to true, so that the decoder reuses the same memory space for each newly decoded image instead of creating a new bitmap every time an image is received. This reuse requires the bitmap used for decoding to have the same resolution as the received image. If an image with a different resolution is received, the bitmap memory is reallocated using the new image dimensions, so a new bitmap is created only when the image resolution changes.
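The reuse policy described above (keep one decode buffer and reallocate it only when the resolution changes) can be sketched in plain Java without any Android dependency. The ReusableFrameBuffer class is hypothetical, and an int array stands in for the bitmap memory.

```java
/** Hypothetical sketch of the decode-buffer reuse policy; an int[] stands in for bitmap memory. */
final class ReusableFrameBuffer {
    private int width;
    private int height;
    private int[] pixels;
    private int allocations;

    /** Returns a buffer for a frame of the given size, reallocating only on a resolution change. */
    int[] acquire(int frameWidth, int frameHeight) {
        if (pixels == null || frameWidth != width || frameHeight != height) {
            pixels = new int[frameWidth * frameHeight];
            width = frameWidth;
            height = frameHeight;
            allocations++;
        }
        return pixels;
    }

    /** Number of times the buffer has been (re)allocated so far. */
    int allocationCount() {
        return allocations;
    }
}
```

Receiving a stream of 320 × 240 frames triggers a single allocation; a switch to 640 × 480 triggers exactly one more.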
There is no single optimum resolution suitable for all situations and networks, so the resolution of the local and remote images used in the video communication can be changed dynamically. If required, the video communication sends a TCP packet to the remote device to request a list of the resolutions at which the remote camera can operate. In order to obtain this information, the getSupportedRes() function from the CameraSurfaceView class is called. This function uses the getParameters() method of the Camera instance that is being used. The getSupportedPreviewSizes() method of the camera parameters retrieves a list of the supported resolutions for live video preview. This list is then sent back to the device that initially made the query, and all the resolutions are shown on the screen. This system allows a fast change or selection of the best image resolution for each specific situation.
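The wire format used to return the list of supported resolutions is not specified here. As an illustration only, the sketch below assumes a simple comma-separated text payload such as "640x480,320x240"; the ResolutionList class and the format itself are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical text encoding of a resolution list, e.g. "640x480,320x240". */
final class ResolutionList {
    private ResolutionList() {}

    /** Encodes {width, height} pairs into a comma-separated string. */
    static String encode(List<int[]> sizes) {
        StringBuilder sb = new StringBuilder();
        for (int[] size : sizes) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(size[0]).append('x').append(size[1]);
        }
        return sb.toString();
    }

    /** Decodes the string produced by encode() back into {width, height} pairs. */
    static List<int[]> decode(String payload) {
        List<int[]> sizes = new ArrayList<>();
        if (payload.isEmpty()) {
            return sizes;
        }
        for (String item : payload.split(",")) {
            String[] parts = item.split("x");
            sizes.add(new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) });
        }
        return sizes;
    }
}
```

The queried device would encode the sizes reported by its camera and write the string to the TCP connection; the requesting device decodes it and displays the options.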
Once the image resolution has been changed, a new TCP packet containing the new image width and height is sent to the remote device. When this packet is received, the remote device calls the changeResolution() method from the CameraSurfaceView class. This method stores the new resolution and sets a flag that notifies the onPreviewFrame() callback to change the configuration of the camera. In this case, the camera is stopped and the setNewConfiguration() method is called in order to change the video configuration and restart the video transmission.
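The store-then-apply pattern described above, in which the network thread records the requested resolution and the frame callback applies it on the next frame, can be sketched as follows. PendingResolution and its methods are hypothetical names used only for illustration; the actual changeResolution() and setNewConfiguration() code is not shown in this section.

```java
/** Hypothetical holder for a resolution change requested by one thread and applied by another. */
final class PendingResolution {
    private int width;
    private int height;
    private boolean pending;

    /** Called when a resolution-change packet arrives: store the new size and raise the flag. */
    synchronized void request(int newWidth, int newHeight) {
        width = newWidth;
        height = newHeight;
        pending = true;
    }

    /**
     * Called from the frame callback: returns {width, height} once per request and clears
     * the flag, or returns null when no change is pending.
     */
    synchronized int[] takeIfPending() {
        if (!pending) {
            return null;
        }
        pending = false;
        return new int[] { width, height };
    }
}
```

When takeIfPending() returns a size, the callback would stop the camera, apply the new configuration and restart the transmission.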
4.2.4. Audio Communications
The transmission of audio during a remote interaction with the APR provides a natural, direct and efficient communication channel between the person in charge of the telecontrol of the robot and the person assisted by the robot. The main problem that arises when establishing bidirectional audio communication is the delay between audio transmission and reproduction at the remote device when using buffer-based libraries such as VoIP/SIP. This delay arises because the Android VoIP requirements force the application to ensure that a minimum amount of data is received (and buffered) before starting the audio playback. The audio transmission implemented in the APR is instead similar to the P2P procedure used for video transmission. This implementation uses the Android AudioRecord and AudioTrack classes to record and reproduce audio data, respectively, in both devices (telecontrol device and APR head device). Although AudioRecord is not designed to perform audio streaming, the functionality of this class has been modified to generate small audio chunks that can be sent through the network immediately and reproduced at the remote device.
First, an instance of the AudioRecord class is initialized with the following specifications: sample rate of 8000 Hz, PCM 16-bit encoding, and 1 channel (mono). Once initialized, the AudioRecord instance stores all the data obtained by the device microphone in an internal buffer. When the internal buffer is filled with 3840 bytes of voice data, the buffer is sent to the UdpAudioTx thread, wrapped inside a UDP packet, and transmitted to the remote device. The UDP packet is received at the UdpAudioRx execution thread of the remote device and sent to the AudioTrack instance running inside the Playback thread. This AudioTrack instance has been initialized with the same audio parameters as the AudioRecord instance and queues the received audio data for reproduction. In order to avoid an increasing delay on the reception side of the application, packets received with a time difference greater than 50 ms, and empty audio streams, are dropped, thus ensuring that the delay of a single packet does not accumulate into an overall communication delay.
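The audio parameters above determine how much sound each packet carries, and the drop rule can be written as a simple predicate. In the sketch below, AudioChunkPolicy is a hypothetical name, and the delayMs argument is one possible interpretation of the "time difference" mentioned in the text, since the way it is measured is not detailed here.

```java
/** Hypothetical helper for the audio chunk duration and the packet drop rule. */
final class AudioChunkPolicy {
    static final int SAMPLE_RATE_HZ = 8000;
    static final int BYTES_PER_SAMPLE = 2;  // PCM 16 bit
    static final int CHANNELS = 1;          // mono
    static final int CHUNK_BYTES = 3840;
    static final long MAX_DELAY_MS = 50;

    private AudioChunkPolicy() {}

    /** Duration of audio, in milliseconds, contained in a chunk of the given size. */
    static long chunkDurationMs(int chunkBytes) {
        int bytesPerSecond = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS;
        return 1000L * chunkBytes / bytesPerSecond;
    }

    /** Drops empty packets and packets whose measured delay exceeds the threshold (assumed metric). */
    static boolean shouldDrop(int payloadBytes, long delayMs) {
        return payloadBytes == 0 || delayMs > MAX_DELAY_MS;
    }
}
```

With these parameters, each 3840-byte packet holds 240 ms of audio (3840 bytes at 16,000 bytes per second).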