1. Introduction
Visual attributes of images have been widely studied for years. Most previous works have focused on recognizing "explicit attributes" of images, such as an object's texture and color distribution [1] and semantic categories [2]. With the advance of computer vision and machine learning technologies, more and more works have been proposed to study "implicit attributes" of images. These implicit attributes may not be represented in explicit forms, but they are usually recognizable by human beings. For example, Lu et al. [3] proposed a method to recognize whether an image was captured on a sunny day or a cloudy day. Hays and Efros [4] proposed estimating geographic information from a single image (a.k.a. IM2GPS). Recent research has demonstrated that deep learning approaches are effective for recognizing painting styles [5,6].
Among various implicit attributes, the weather properties of images have attracted increasing attention. The earliest investigation of the relationship between vision and weather conditions dates back to the early 2000s [7]. Thanks to the development of more advanced visual analysis and deep learning methods, a new wave of works studying the correlation between visual appearance and ambient temperature or other weather properties has recently emerged [8,9,10,11].
The main motivation for estimating weather properties solely from images is that we can unveil characteristics of the real world from images available in cyberspace [11]. Images can be viewed as weather sensors [11], and by coupling estimated weather information with temporal/geographical information, explicit or implicit human behaviors can be discovered. Weather information can also provide important priors for many computer vision applications.
Figure 1 shows that the Eiffel Tower has drastically different visual appearances under different weather conditions, which brings significant challenges to object/landmark recognition. Once weather properties can be estimated, an object detector/recognizer can adapt to different weather conditions, so that the influence of visual variations can be reduced. The work in [12] shows that a better understanding of weather properties facilitates robust robotic vision. Models adaptive to weather conditions have been studied for lane detection and vehicle detection [13], and for flying target detection [14]. The work in [15] also mentions that weather context may give clues for modeling the appearance of objects.
Admittedly, weather property estimation can already be done by inexpensive sensors. Please notice that the proposed weather property estimation neither replaces nor improves existing weather sensors. Rather, we argue that analyzing images in cyberspace from the perspective of weather enables us to discover implicit human behaviors or to improve computer vision technologies to some extent.
Given outdoor images, in our previous work [10] we estimated ambient temperature based on visual information extracted from these images. Two application scenarios were proposed. The first estimates the temperature of a given image regardless of temporal changes in the weather. The second, when several images of the same location over time are available, "forecasts" the temperature in the near future. In the first scenario, we extracted visual features using a convolutional neural network (CNN) and then used a regression layer to output the estimated temperature. In the second scenario, features were also extracted using CNNs, but the temporal evolution was considered by recruiting a long short-term memory (LSTM) network, which output the estimated temperature of the last image in the given image sequence. That work represents the state of the art in temperature prediction from images.
On the basis of our previous work, in this study we make two significant improvements. First, we jointly estimate four weather properties, i.e., temperature, humidity, visibility, and wind speed, with a single network constructed based on the multi-task learning approach. These four properties could be estimated separately by four different models; however, the foundation of the different estimations is the same, i.e., the visual information extracted from images. Motivated by the success of deep multi-task learning [16], we construct a single network based on the multi-task learning approach that jointly handles the four tasks. When the network is trained jointly for multiple tasks, different tasks may contribute complementary information that makes the network more powerful.
When estimating temperature, previous works either take a single image as the input [9] or a sequence of images captured on different days [8,10]. For example, given the images captured on day i and day i+1, previous works estimated the temperature of the image captured on day i+1. In this work, we advocate that temporal evolution can be taken into account at different temporal scales. We can conduct day-wise estimation as mentioned above, or hour-wise estimation as well. That is, given the images captured at hour j and hour j+1 of the same day i, we can estimate the properties of the image captured at hour j+1. In addition, by considering different temporal scales together, we can mix day-wise and hour-wise estimation. For example, given the images captured at hour j and hour j+1, both captured on day i, we can estimate the properties of the image captured at hour j+1 of day i+1. A two-dimensional RNN is thus proposed to implement this idea, which is the second improvement over [10].
Our contributions are summarized as follows.
We adopt the multi-task learning approach to build a network that jointly estimates four weather properties based on features extracted by CNNs. We show that, with multi-task learning, the proposed model outperforms single-task methods.
We introduce a two-dimensional RNN to estimate weather properties at two temporal scales. To the best of our knowledge, this is the first deep learning model that considers the evolution of visual appearance from two different temporal perspectives.
The rest of this paper is organized as follows. Section 2 provides the literature survey. Section 3 presents data collection and preprocessing. Section 4 briefly reviews our previous work on single-task learning, Section 5 provides details of the newly proposed multi-task learning approach, and Section 6 describes the proposed two-dimensional RNN. Various evaluation results and performance comparisons are given in Section 7, followed by the conclusion in Section 8.
2. Related Works
As a pioneering work studying visual manifestations of different weather conditions, Narasimhan and Nayar [7] discussed the relationships between visual appearance and weather conditions. Since then, several works have addressed weather type classification. Roser and Moosmann [17] focused on images captured by cameras mounted on vehicles. They extracted features such as brightness, contrast, sharpness, and hue from sub-regions of an image and concatenated them into an integrated vector. Based on these features, a support vector machine (SVM) classifier was constructed to categorize images into clear, light rain, or heavy rain conditions. In [3], five types of weather features were designed, i.e., sky, shadow, reflection, contrast, and haze features. These features are not always present simultaneously; therefore, a collaborative learning framework was proposed to dynamically weight the influences of different features and classify images as sunny or cloudy. Weather-specific features can be extracted from different perspectives, and conceptually they may be heterogeneous. In [11], a random forest classifier was proposed to integrate various types of visual features and classify images into one of five weather types, i.e., sunny, cloudy, snowy, rainy, or foggy. The merit of the random forest classifier is that it can handle heterogeneous types of features, and the characteristics of the automatically determined decision trees imply the importance of different features. Kang et al. [18] used deep learning methods to recognize weather types, studying the performance of GoogLeNet and AlexNet when classifying images as hazy, rainy, or snowy. In this study, we also use deep learning models for visual weather analysis, but focus on weather property estimation.
In addition to weather type classification, other weather properties have also been investigated. Jacobs and his colleagues [19] initiated a project to collect outdoor scene images captured by static webcams over a long period of time. The collected images formed the Archive of Many Outdoor Scenes (AMOS) dataset [19]. Based on the AMOS dataset, they proposed that webcams installed across the earth can be viewed as image sensors that enable us to understand weather patterns and variations over time [20]. More specifically, they adopted principal component analysis and canonical correlation analysis to predict wind velocity and vapor pressure from a sequence of images. Recently, Palvanov and Cho [21] focused on visibility estimation. They proposed a three-stream convolutional neural network to jointly consider different types of visibility features and handle images captured in different visibility ranges. Ibrahim et al. [22] developed the so-called WeatherNet, which consists of four networks based on residual learning. These four networks are dedicated to recognizing day/night, glare, precipitation, and fog, respectively. Similarly to [21], multiple separate networks were constructed to conduct dedicated tasks, and then results or intermediate information were fused to estimate weather properties.
Laffont et al. [23] estimated scene attributes such as lighting, weather conditions, and seasons for images captured by webcams based on a set of regressors. Glasner et al. [8] studied the correlation between pixel intensity/camera motion and temperature and found a moderate correlation. With this observation, a regression model considering pixel intensity was constructed to predict temperature. Following the discussion in [8], Volokitin et al. [24] showed that, with appropriate fine-tuning, deep features can be promising for temperature prediction. Zhou et al. [9] proposed a selective comparison learning scheme, in which temperature prediction was conducted with a CNN-based approach. Salman et al. [25] explored the correlations between different weather properties. In addition to considering temporal evolution, they proposed that, for example, the visibility of an image can be better predicted if temperature and dew point are given in advance. Zhao et al. [26] argued that an image may be associated with multiple weather conditions; e.g., an image may present sunny but moist conditions. They thus proposed a CNN-RNN architecture to recognize weather conditions and formulated the task as a multi-label problem. In this paper, we develop a multi-task learning approach that considers temporal evolution to estimate weather properties, and we propose that weather properties can be estimated from different temporal perspectives.
Aside from weather property estimation, Fedorov et al. [27] worked on an interesting application concerning snow. They extracted a set of visual features from images captured by webcams monitoring a target mountain, and then estimated the degree of snow cover with classifiers based on SVMs, random forests, or logistic regression. In addition to visual analysis, some studies have been conducted from the perspective of text-based data analysis. Qiu et al. [28] proposed a deep learning-based method to predict rainfall based on weather features such as wind speed, air pressure, and temperature collected from multiple surrounding observation sites.
In our previous work [10], we aimed at predicting temperature from a single image and forecasting the temperature of the last image in a given image sequence. Deep learning approaches were developed to consider the temporal evolution of visual appearance. Unlike [8,9,24], we particularly advocate the importance of modeling temporal evolution via deep neural networks. Partially motivated by [21,22,25], in this paper we jointly consider the estimation of multiple weather properties. Instead of separately training multiple dedicated networks, we develop a unified network for multiple tasks based on the multi-task learning scheme. Furthermore, we propose considering visual evolution at different temporal scales. To the best of our knowledge, this is the first time this idea has been proposed for weather property estimation from visual appearance.
5. Multi-Task Learning
In [10], we focused on ambient temperature prediction from a single image or a sequence of images. In this work, we extend the idea to four different weather properties, i.e., temperature, humidity, visibility, and wind speed. These four properties are commonly provided in existing weather image databases. Predicting temperature and humidity helps to estimate perceived temperature, and predicting visibility and wind speed is important for estimating air quality.
In addition to increasing the number of estimated weather properties, motivated by the multi-task learning scheme [16], we developed a unified network to jointly estimate the four properties. The idea came from two observations. (1) All four properties are estimated from the same visual appearance; if we treat the estimations as four independent tasks, four similar feature extraction sub-networks are obviously inefficient. (2) As mentioned in [25], some properties are correlated. Jointly training a network for the four tasks enables information exchange, making it possible to construct a better network for feature extraction and estimation.
Figure 6 illustrates the network for estimating four weather properties based on multi-task learning. The CNN pre-trained for temperature prediction, as mentioned in Section 4.1, is taken as the baseline feature extractor. Given a sequence of images I_1, I_2, and I_3, visual features are separately extracted from each image by the CNN. These features are then sequentially fed to the LSTM. As illustrated in Figure 6, information from I_1 and I_2 is processed and propagated, and the last LSTM cell outputs the predicted weather properties for the image I_3. Four LSTM streams are constructed to predict temperature, humidity, visibility, and wind speed, respectively.
Instead of treating the four tasks separately, the network shown in Figure 6 was trained in an end-to-end manner. We calculated the categorical cross entropies between the predicted temperature (humidity, visibility, and wind speed) and the true temperature (humidity, visibility, and wind speed) as L_t, L_h, L_v, and L_w, respectively. They were then combined as L = λ_t L_t + λ_h L_h + λ_v L_v + λ_w L_w, where the weights λ_t, λ_h, λ_v, and λ_w were empirically set as 0.15, 0.10, 0.90, and 0.25, respectively. Based on this loss, we adopted the Adam optimizer with learning rate 0.001 and mini-batch size 128 to find the best network parameters.
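To make this setup concrete, below is a minimal sketch of such a multi-task CNN-LSTM in TensorFlow/Keras. The backbone layout, LSTM width, input resolution, and the class counts for humidity, visibility, and wind speed are illustrative assumptions; only the loss weights (0.15, 0.10, 0.90, 0.25), the optimizer (Adam, learning rate 0.001), and the mini-batch size (128) come from the text above.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    SEQ_LEN = 3  # three images per sequence, as in Section 6
    # Only the 70 temperature classes are stated in the text;
    # the other class counts are placeholders.
    NUM_CLASSES = {"temperature": 70, "humidity": 100,
                   "visibility": 50, "wind_speed": 40}

    def backbone():
        # Stand-in for the CNN feature extractor pre-trained for temperature
        # prediction (Section 4.1); the real configuration is in Table 1.
        return tf.keras.Sequential([
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(),
            layers.Conv2D(64, 3, activation="relu"),
            layers.GlobalAveragePooling2D(),
        ])

    inputs = layers.Input(shape=(SEQ_LEN, 128, 128, 3))    # resolution assumed
    features = layers.TimeDistributed(backbone())(inputs)  # per-image features

    outputs = {}
    for name, n_classes in NUM_CLASSES.items():
        stream = layers.LSTM(256)(features)  # one LSTM stream per property
        outputs[name] = layers.Dense(n_classes, activation="softmax",
                                     name=name)(stream)

    model = Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss={name: "sparse_categorical_crossentropy" for name in NUM_CLASSES},
        loss_weights={"temperature": 0.15, "humidity": 0.10,
                      "visibility": 0.90, "wind_speed": 0.25},
    )
    # Training: model.fit(x, y, batch_size=128), where y maps each property
    # name to its class labels for the last image of every sequence.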
6. Two-Dimensional RNN
The temporal evolution considered by previous works such as [8,10] involves only day-wise prediction. That is, given day i and day i+1, they predict the weather properties on day i+1. This idea stems from the fact that weather changes gradually across neighboring days. In this work, we further point out that weather properties usually change gradually within the same day, so we can also make hour-wise predictions.
Figure 7 illustrates the idea of predicting weather properties at different temporal scales. The red arrow indicates the common day-wise perspective, and the yellow arrow indicates the hour-wise perspective, which has not been proposed or implemented before. Furthermore, with the designed 2D RNN, we can provide more flexible predictions, such as the ones shown by the green arrow and the blue arrow. As shown by the green arrow, given the images captured at 9 a.m. on day i and day i+1, we can predict the weather properties of the image captured at 10 a.m. on day i+1. On the other hand, given the images captured at 9 a.m. and 10 a.m. on day i-1, we can predict the weather properties of the image captured at 10 a.m. on day i (the blue arrow).
Figure 8 shows the architecture of the proposed two-dimensional RNN, where we take sequences of three images captured at three consecutive time instants as the example. The black arrows in this figure denote information propagation, and the red arrows denote the estimation outputs. Let I_{i,j} denote the image captured at hour j on day i. For day-wise prediction, given the image sequence of I_{i,j}, I_{i+1,j}, and I_{i+2,j} captured on days i, i+1, and i+2, the model predicts the weather properties of I_{i+2,j} as t̂_{i+2,j}. For hour-wise prediction, given the image sequence of I_{i,j}, I_{i,j+1}, and I_{i,j+2} captured at hours j, j+1, and j+2, the model predicts the weather properties of I_{i,j+2} as t̂_{i,j+2}. We also propose that day-wise and hour-wise prediction can be mixed together. Given the image sequence of I_{i,j}, I_{i+1,j}, and I_{i+1,j+1}, the model predicts the weather properties of I_{i+1,j+1} as t̂_{i+1,j+1}. Notice that, given the image sequence of I_{i,j}, I_{i,j+1}, and I_{i+1,j+1}, the model can also predict the weather properties of I_{i+1,j+1} as t̂_{i+1,j+1}. For example, t̂_{i+1,j+1} can be predicted by giving the image sequence I_{i,j}, I_{i+1,j}, and I_{i+1,j+1}, or by giving the image sequence I_{i,j}, I_{i,j+1}, and I_{i+1,j+1}. Overall, this model can be trained and tested based on horizontal (day-wise) sequences, vertical (hour-wise) sequences, and L-shaped (mixed) sequences.
Notice that Figure 8 is a simplified representation, where only the predicted vectors t̂ are shown. In fact, with multi-task learning, we jointly predict temperature, humidity, visibility, and wind speed, and we should denote the different types of estimation results as t̂, ĥ, v̂, and ŵ, respectively. To simplify notation, we take temperature prediction as the main instance and just denote the ground truth and the corresponding prediction result as t and t̂, respectively. Please also notice that the numbers of LSTM layers and input/output channels are the same as those mentioned in Section 4.2. The major difference between the model in Section 4.2 and the model here is the perspective of the training sequences, as described below.
To train this model, we randomly selected horizontal sequences, vertical sequences, and L-shaped sequences from the training data. According to our previous work [10], the length of each image sequence (the number of images) was set as 3, as illustrated in Figure 8. The loss for each prediction is measured by categorical cross entropy. Therefore, taking temperature prediction as the main instance, the loss considering the three types of sequences is:

L_t = (1/N_D) Σ_D CE(t, t̂) + (1/N_H) Σ_H CE(t, t̂) + (1/N_L) Σ_L CE(t, t̂),

where CE(t, t̂) is the categorical cross entropy between the ground truth t and the predicted value t̂, the three sums run over the day-wise (D), hour-wise (H), and L-shaped (L) training sequences, respectively, and N_D, N_H, and N_L are the numbers of sequences of each specific type used for training.
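As a sketch of how this three-part loss could be computed for one training step, assuming batched (labels, predictions) pairs for each sequence type and mean reduction within each type (the function and argument names are hypothetical):

    import tensorflow as tf

    cce = tf.keras.losses.SparseCategoricalCrossentropy()  # batch mean by default

    def temperature_loss(day_batch, hour_batch, l_batch):
        # Each argument is a (labels, predictions) pair for one sequence type;
        # the per-type means correspond to the 1/N_D, 1/N_H, and 1/N_L factors.
        return cce(*day_batch) + cce(*hour_batch) + cce(*l_batch)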
Overall, the losses derived from the four weather properties are integrated as L = L_t + L_h + L_v + L_w, where L_h, L_v, and L_w are the losses calculated from humidity prediction, visibility prediction, and wind speed prediction, respectively. Similarly to the training settings mentioned in Section 5, based on this loss we adopted the Adam optimizer with learning rate 0.001 and mini-batch size 128 to find the best network parameters.
7. Evaluation
7.1. Experimental Settings
To train the CNN for feature extraction, as mentioned in Section 4.1, and the proposed 2D RNN model described in Section 6, 90% of the images in each scene of the Glasner-Exp dataset were taken as the training pool, and the remaining 10% as the testing pool. Based on the training data, we constructed a CNN consisting of six convolutional layers followed by two fully-connected layers. Table 1 shows the detailed configuration of the CNN architecture. The term Conv2D(32, 3) denotes that the convolutional kernel is 3 × 3 and the number of output channels is 32. This CNN's output is a 70D vector indicating the probabilities of different temperatures (i.e., classes). To train the model, the activation function of each layer is ReLU, the loss function is cross entropy, the optimization algorithm is Adam, and the learning rate is 0.001. This CNN was first trained for temperature prediction specifically and served as the base model for feature extraction. When it was used in multi-task learning, as illustrated in Figure 6, the parameters of this CNN were fine-tuned according to the given training data and loss function.
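A sketch of a CNN consistent with this description is shown below. The six convolutional layers, two fully-connected layers, the Conv2D(32, 3) first layer, the 70-way softmax output, ReLU activations, cross-entropy loss, and Adam with learning rate 0.001 follow the text; the remaining channel counts, pooling placement, dense width, and input resolution are assumptions (Table 1 holds the actual configuration).

    import tensorflow as tf
    from tensorflow.keras import layers

    cnn = tf.keras.Sequential([
        layers.Input(shape=(128, 128, 3)),        # input resolution assumed
        layers.Conv2D(32, 3, activation="relu"),  # Conv2D(32, 3): 3 x 3 kernel
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.Conv2D(128, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),     # first FC layer (width assumed)
        layers.Dense(70, activation="softmax"),   # 70 temperature classes
    ])
    cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                loss="sparse_categorical_crossentropy")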
To train the proposed 2D RNN model, we enumerated, from the training pool, all image sequences consisting of three temporally consecutive images as the training data; these sequences could be day-wise, hour-wise, or L-shaped. A day-wise sequence could contain images I_{i,j}, I_{i+1,j}, and I_{i+2,j} captured at hour j on days i, i+1, and i+2, respectively. An L-shaped sequence could contain images I_{i,j}, I_{i+1,j}, and I_{i+1,j+1} captured at hour j on day i, at hour j on day i+1, and at hour j+1 on day i+1, respectively. From the training pool, which consisted of images captured from 8 a.m. to 5 p.m. each day over two consecutive years at nine different places, we finally enumerated 39,546 image sequences in total as the training data. A sketch of this enumeration is given below.
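The following sketch generates (day, hour) index triples for the three sequence types; the function and variable names are hypothetical, and the exact set of L-shape orientations enumerated in our implementation may differ.

    def enumerate_sequences(num_days, hours):
        # hours: ordered capture hours, e.g., list(range(8, 18)) for 8 a.m.-5 p.m.
        seqs = []
        for d in range(num_days):
            for k, h in enumerate(hours):
                if d + 2 < num_days:                         # day-wise (horizontal)
                    seqs.append([(d, h), (d + 1, h), (d + 2, h)])
                if k + 2 < len(hours):                       # hour-wise (vertical)
                    seqs.append([(d, h), (d, hours[k + 1]), (d, hours[k + 2])])
                if d + 1 < num_days and k + 1 < len(hours):  # L-shaped (mixed)
                    seqs.append([(d, h), (d + 1, h), (d + 1, hours[k + 1])])
        return seqs

    # Example: two years of images, captured hourly from 8 a.m. to 5 p.m.
    training_sequences = enumerate_sequences(730, list(range(8, 18)))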
Length of the image sequence. We first evaluate the prediction performance when different lengths of image sequences were used for training and testing, and show the average root mean square errors (RMSEs) between the estimated values and the ground truth in Table 2. These values were all obtained with the full model, i.e., the one containing day-wise, hour-wise, and L-shaped predictions; we denote this model as DHL-LSTM in the following. As can be seen, the best estimation performance was obtained when the sequence length n was set as 3. That is, using the day-wise perspective, we estimated the weather properties of day t based on the images captured on days t-2, t-1, and t. This result is not surprising, because this setting appropriately considers the information of previous days and prevents blunt updates when too many days are considered. Therefore, we set n = 3 in the subsequent experiments.
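For reference, the RMSE used throughout the evaluation can be computed as in the sketch below; mapping a 70-way softmax output back to a scalar value via arg-max is our assumption about the post-processing.

    import numpy as np

    def rmse(ground_truth, predicted):
        # Root mean square error between scalar property values.
        gt = np.asarray(ground_truth, dtype=float)
        pr = np.asarray(predicted, dtype=float)
        return float(np.sqrt(np.mean((gt - pr) ** 2)))

    # Example: take the arg-max class of the softmax output as the prediction.
    # predicted_values = probs.argmax(axis=1)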
7.2. Single-Task Learning vs. Multi-Task Learning
Table 3 shows a performance comparison between the single-task LSTM [10] and the newly proposed multi-task LSTM for day-wise weather property prediction, in terms of RMSE. For the row of the single-task LSTM, we separately implemented four single-task LSTMs to predict the four properties. Our previous work [10] only focused on temperature prediction; here, we trained dedicated models for humidity, visibility, and wind speed prediction to obtain the values marked with asterisks. As can be seen, the multi-task model obtained better performance for humidity, visibility, and wind speed predictions, and slightly worse performance for temperature prediction. As pointed out in [34,35], the multi-task learning approach does not always guarantee a performance improvement. Even so, overall we see very encouraging results in Table 3. Both the single-task LSTM and the multi-task LSTM significantly outperformed [8].
7.3. Performance of 2D-RNN
Table 4 shows the performance of different 2D-RNN variants, i.e., the day-wise-only prediction model; the hour-wise-only prediction model; the day-wise plus hour-wise prediction model; and the full model containing day-wise, hour-wise, and L-shaped predictions (DHL-LSTM). Comparing the third row with the first two rows, we can see that temperature prediction and humidity prediction are clearly improved by combining day-wise and hour-wise predictions into a single model. In this setting, a single model is trained on both day-wise and hour-wise training sequences; conceptually, the amount of training data increases, and this may be one reason for the performance improvement. Comparing DHL-LSTM with the third row, we see that further considering L-shaped sequences improves temperature prediction and humidity prediction even more.
Table 5 details the performance of temperature prediction for the nine evaluated scenes. Scenes (a) to (i) correspond to the subfigures shown in Figure 2, from left to right and top to bottom. Two observations can be made. First, on average, the proposed DHL-LSTM achieved the best performance. Second, the performance varied across scenes: for some scenes the day-wise LSTM performed better, and for others the hour-wise LSTM performed better. This reflects the complex changes of weather conditions from the day-wise and hour-wise perspectives; no single perspective guarantees easier prediction.
Variations over different daytime hours. During a day, the strength and direction of sunlight vary from dawn to dusk, so it is interesting to examine the performance variations over different daytime hours. To show this, we used the image sequences captured at hour h as the testing data and the remaining sequences for training, with h ranging from 8 a.m. to 5 p.m.
Figure 9 shows the variations of the average RMSEs of temperature prediction for image sequences captured at different daytime hours, based on the day-wise LSTM [10] and the DHL-LSTM. Two observations can clearly be made. First, for each model, there are clear performance variations for images captured at different daytime hours. The best performance was obtained for images captured at 11 a.m., which conforms to the selection made in [8,9]. This may be because sunlight is strongest around noon, so more robust visual information can be extracted. Second, the DHL-LSTM model significantly outperformed the day-wise LSTM model. For the day-wise LSTM, the prediction errors at 8 a.m. and 5 p.m. are much larger than at other hours, whereas for the DHL-LSTM the performance gap is relatively small, especially for images captured at 8 a.m.
Figure 10 and Figure 11 show two sample images; the pairs of prediction results and ground truths are given in the captions. Figure 10 was captured at 2 p.m. on 18 May 2014, around the University of Notre Dame, Indiana, USA. Figure 11 was captured at 1 p.m. on 15 November 2013, in St. Louis, Missouri, USA. These two examples demonstrate the effectiveness of predicting weather properties from visual appearance.
8. Conclusions and Discussion
In this work, we presented deep models to estimate four weather properties of the last image in an image sequence. We jointly considered the four properties in a unified CNN-RNN model based on the multi-task learning approach. Furthermore, we proposed conducting property prediction from different temporal perspectives, i.e., day-wise, hour-wise, and a mixture of the two scales. In the evaluation, we showed the effectiveness of multi-task learning and of multi-temporal-scale prediction. This is the first time a 2D-RNN has been proposed to predict weather properties from visual appearance, and we showed that state-of-the-art performance can be obtained.
In the future, we will investigate which regions of a scene provide more clues for property estimation, and adopt the currently emerging attention networks to improve performance. We also believe that exploring the relationship between weather properties and vision will be interesting for a wide range of future applications.