Comparison of different artificial neural network (ANN) training algorithms to predict the atmospheric temperature in Tabuk, Saudi Arabia

The use of artificial neural network (ANN) models to predict weather parameters has become important over the years. ANN models often give more accurate results in weather and climate forecasting than many other methods. However, different models require different data, and these data have to be handled carefully. In addition, most of these data come from non-linear processes and, therefore, the prediction models are usually complex. Nevertheless, neural networks perform well for non-linear data and produce acceptable results. Therefore, this study was carried out to compare different ANN models for predicting the minimum and maximum atmospheric temperatures in Tabuk, Saudi Arabia. The ANN models were trained using eight different training algorithms: BFGS Quasi Newton (BFG), Conjugate gradient with Powell-Beale restarts (CGB), Levenberg-Marquardt (LM), Scaled Conjugate Gradient (SCG), Fletcher-Reeves update Conjugate Gradient (CGF), One Step Secant (OSS), Polak-Ribiere update Conjugate Gradient (CGP) and Resilient Back-Propagation (RP). These algorithms were applied to the climatic data of Tabuk, Saudi Arabia. The performance of the training algorithms was evaluated using the Mean Squared Error (MSE) and the correlation coefficient (R). The evaluation shows that the BFG, LM and SCG training algorithms outperformed the others, while the OSS training algorithm had the lowest performance of the algorithms used.


Introduction
Interest in the use of Artificial Neural Networks (ANNs) for developing climate change prediction models has increased in recent years due to ever-changing climate patterns around the world (Yadav and Chandel, 2013; Acharya et al., 2014; Belayneh et al., 2014; Hashim et al., 2017; Moghim and Bras, 2017; Mishra et al., 2018). ANNs are computing systems inspired by biological neural networks that model relationships between independent and dependent variables. They are capable of modelling complex non-linear relationships among variables directly from raw data sets (Emeko et al., 2015; Ebrahimi and Rajaee, 2017; Ravansalar et al., 2017). Unlike traditional statistical techniques, ANNs do not require the transformation of raw data prior to model generation. Furthermore, no prior assumption about the nature of the relationship between input and output variables is required in ANNs (Agatonovic-Kustrin and Beresford, 2000; Ibrić et al., 2003). The literature presents different ANN training algorithms that have been applied to many real-world problems to predict future scenarios. In addition, these training algorithms have been tested and compared with each other for their performance (Kisi, 2004; Nayak et al., 2004; Wang et al., 2018). Climate change is one such real-world problem for which ANNs have been heavily used by many researchers.
Climate change, which is heavily influenced by human activities, has become a vital topic of discussion in the present world (Field et al., 2015). Emissions from burning fossil fuels are one of the major factors in climate change and climate variability, as gases like carbon dioxide (CO₂), methane (CH₄), ozone (O₃), etc., are emitted through these fossil fuels (Karl, 2003; Quadrelli and Peterson, 2007). Not only human life, but also other living species such as plants and animals have been badly affected by climate change and climate variability (Vorbsmarty et al., 2000; Hughes et al., 2003; Barnett and Adger, 2007; Harvell et al., 2014; Patz et al., 2016; Azamathulla et al., 2018; Friedrich et al., 2018).
Among the climatic factors, atmospheric temperature is a key factor in defining climate change, and even the slightest change in temperature can trigger changes in people's daily routines (Kalkstein and Smoyer, 1993). Thus, predictions of atmospheric temperature using various research methods are increasingly found in the literature (Rotstayn et al., 2014; Simmons et al., 2014; Mears and Wentz, 2017). Kisi and Uncuoglu (2005) studied the use of three back-propagation training algorithms, Levenberg-Marquardt (LM), Resilient Back-Propagation (RP) and conjugate gradient, for two case studies: stream-flow forecasting and determination of the lateral stress in cohesionless soils. Based on their comparisons of convergence velocities in training and of testing performance, they found that the LM algorithm is faster and performs better than the other algorithms in training. However, their results showed that the Resilient Back-Propagation algorithm has the best accuracy during testing. Ghaffari et al. (2006) tested five training algorithms, Incremental Back-Propagation (IBP) and Batch Back-Propagation (BBP) under gradient descent, Levenberg-Marquardt, Quick Propagation (QP) and Genetic Algorithm (GA), for biomedical drug delivery. The algorithms were assessed on their ability to predict the effect of two factors, coating weight gain and the amount of pectin-chitosan in the coating solution, on the in-vitro release profile of theophylline. No significant difference was observed between the performances of IBP and BBP, although the convergence speed of BBP was found to be three-to-four-fold higher than that of IBP. In addition, they found that the predictive ability, based on the performance, was in the order IBP, BBP > LM > QP > GA.
However, Pham and Sagiroglu (2001) showed that Back-Propagation (BP) was the best of the training algorithms they tested: BP, QP, Delta-Bar-Delta (DBD) and Extended Delta-Bar-Delta (EDBD). These training algorithms were used to train ANNs to recognize control chart patterns and classify wood veneer defects. Kişi (2007) studied the use of four different ANN training algorithms, BP, conjugate gradient (CG), cascade correlation (CC) and LM, to forecast streamflow in the North Platte River in the United States. He found that LM gave the best flow forecasts and faster results compared to the other training algorithms. In addition, his results showed that the CG and CC models produced more satisfactory predictions than BP. Thus, different studies identify different algorithms as giving the best prediction.
Further studies have used different training algorithms in ANNs to forecast various climatic parameters such as rainfall (Lee et al., 1998; Hall et al., 1999), evaporation (Shiri et al., 2014a), dew point temperature (Shiri et al., 2014b), solar radiation (Landeras et al., 2012), daily reference evapotranspiration (Guven and Gunal, 2008; Izadifar and Elshorbagy, 2010), etc. Azamathulla et al. (2018) presented a study to predict the atmospheric temperature in Tabuk, Saudi Arabia using gene expression techniques and compared the results to those of an ANN model. However, no study has been carried out to predict the minimum and maximum atmospheric temperatures in Tabuk using different ANN training algorithms and to compare their performance. To address this research gap, we developed eight different ANN models to predict the atmospheric temperature of Tabuk, Saudi Arabia using other climatic factors. The results from the eight training algorithms, namely Levenberg-Marquardt (LM), BFGS Quasi Newton, Resilient Back-Propagation, Scaled conjugate gradient, Conjugate gradient with Powell-Beale restarts, Fletcher-Reeves conjugate gradient, Polak-Ribiere conjugate gradient and One step secant, are promising. The prediction of minimum and maximum atmospheric temperatures, which are two of the most important climatic parameters for the inhabitants of Tabuk, is the major novelty of this paper.

Artificial Neural Networks (ANNs) training algorithms
As stated in the introduction, artificial neural networks are popular among researchers and planners for predicting real-world scenarios. ANN training algorithms use local or global non-linear optimization methods to optimize the weights of feed-forward neural networks. Local searches are limited to local solutions, whereas global searches avoid this limitation (Ilonen and Kamarainen, 2003). The training performance varies with the objective function of the optimization process, the underlying error surface for a given problem and the network configuration. The most popular optimization methods are variants of gradient-based back-propagation algorithms, because the gradient information of the error surface is available in these algorithms. The widely used methods are Levenberg-Marquardt (LM), BFGS Quasi Newton, Resilient Back-Propagation, Scaled conjugate gradient, Conjugate gradient with Powell-Beale restarts, Fletcher-Reeves conjugate gradient, Polak-Ribiere conjugate gradient and One step secant (Hagan and Menhaj, 1994; Liang et al., 1994; Martin, 1997; Japkowicz and Hanson, 1999).
The objective of the training process in an ANN is to reduce the global error (E), which is defined in Equation (1) (Kişi, 2007).
$$E = \frac{1}{P} \sum_{p=1}^{P} E_p \qquad (1)$$
where P is the total number of training patterns and $E_p$ is the error for training pattern p. $E_p$ is calculated using Equation (2).
$$E_p = \frac{1}{2} \sum_{k=1}^{N} (O_k - t_k)^2 \qquad (2)$$
where N is the total number of output nodes, $O_k$ is the network output at the k-th output node and $t_k$ is the target output at the k-th output node. The global error is reduced by adjusting the weights and biases in the training algorithm (Kişi, 2007). The following subsections present the widely used algorithms for training neural networks.
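The error definitions can be illustrated with a minimal Python sketch. It assumes the common forms $E = \frac{1}{P}\sum_p E_p$ and $E_p = \frac{1}{2}\sum_k (O_k - t_k)^2$; the function names and the sample outputs and targets are hypothetical and are not taken from the Tabuk data set.

```python
def pattern_error(outputs, targets):
    """E_p = 1/2 * sum over output nodes k of (O_k - t_k)^2 for one pattern."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))


def global_error(all_outputs, all_targets):
    """E = (1/P) * sum over the P training patterns of E_p."""
    errors = [pattern_error(o, t) for o, t in zip(all_outputs, all_targets)]
    return sum(errors) / len(errors)


# Hypothetical values: two training patterns, two output nodes each
outputs = [[10.0, 25.0], [12.0, 27.0]]
targets = [[11.0, 24.0], [12.0, 29.0]]

E = global_error(outputs, targets)
print(E)  # 1.5
```

Here the two pattern errors are 1.0 and 2.0, so the global error is their mean, 1.5. Training adjusts the weights and biases so that this quantity decreases.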

Levenberg-Marquardt (LM) algorithm
The Levenberg-Marquardt (LM) optimization algorithm is considered more powerful than conventional gradient descent techniques (Lera and Pinzolas, 2002; Lourakis, 2005; Kişi, 2007). It is one of the most widely used optimization algorithms and is designed to approach second-order training speed without computing the Hessian matrix (Moré, 1978). When the performance function has the form of a sum of squares, the Hessian matrix can be approximated as given in Equation (3).
$$H = J^{T} J \qquad (3)$$
where J is the Jacobian matrix, containing the first derivatives of the network errors with respect to the weights and biases.
The standard back-propagation technique is used to compute the Jacobian matrix, which is less complex than computing the Hessian matrix. Equation (4) gives the Newton-like update used in the LM algorithm. When µ = 0, this is Newton's method using the approximate Hessian matrix.
$$x_{k+1} = x_k - \left[ J^{T} J + \mu I \right]^{-1} J^{T} e \qquad (4)$$
where e is a vector of network errors, µ is a scalar quantity, I is the identity matrix, $x_{k+1}$ is the predicted minimizer and $x_k$ is the current point.
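As an illustration, the following Python sketch applies one update of the form $x_{k+1} = x_k - [J^T J + \mu I]^{-1} J^T e$ to a toy two-parameter model y = a·x + b instead of a neural network. The model, the data and the function names are assumptions made for this example only.

```python
def lm_step(params, xs, ys, mu):
    """One Levenberg-Marquardt update for the toy model y = a*x + b."""
    a, b = params
    # Errors e_i = model_i - target_i; Jacobian rows are [de_i/da, de_i/db]
    e = [a * x + b - y for x, y in zip(xs, ys)]
    J = [[x, 1.0] for x in xs]
    # Approximate Hessian J^T J with damping mu added on the diagonal
    h11 = sum(r[0] * r[0] for r in J) + mu
    h12 = sum(r[0] * r[1] for r in J)
    h22 = sum(r[1] * r[1] for r in J) + mu
    # Gradient J^T e
    g1 = sum(r[0] * ei for r, ei in zip(J, e))
    g2 = sum(r[1] * ei for r, ei in zip(J, e))
    # Solve the 2x2 system (J^T J + mu*I) d = J^T e in closed form
    det = h11 * h22 - h12 * h12
    d1 = (h22 * g1 - h12 * g2) / det
    d2 = (h11 * g2 - h12 * g1) / det
    return (a - d1, b - d2)


def sum_squared_error(params, xs, ys):
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
start = (0.0, 0.0)

undamped = lm_step(start, xs, ys, mu=0.0)   # Gauss-Newton step
damped = lm_step(start, xs, ys, mu=10.0)    # shorter, more cautious step
```

Because the toy model is linear in its parameters, the undamped step (µ = 0) lands on the least-squares solution (2, 1) in a single update, whereas a larger µ shortens the step and moves it towards the gradient descent direction.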

BFGS Quasi Newton Back-Propagation (BFG) algorithm
The BFGS Quasi Newton algorithm was developed independently by C. G. Broyden, D. Goldfarb, R. Fletcher and D. F. Shanno (Nocedal and Wright, 1999). The basic step of Newton's method is given in Equation (5).
$$x_{k+1} = x_k - H_k^{-1} g_k \qquad (5)$$
where $H_k$ is the Hessian matrix of the performance index at the current values of the weights and biases, $H_k^{-1}$ is its inverse and $g_k$ is the current gradient (usual notations are used here).
The BFGS method does not calculate the second derivatives. Instead, an approximate Hessian matrix is updated in each iteration of the algorithm, and this update is calculated as a function of the gradient. A superlinear convergence rate is observed for the BFGS method on most practical problems, even though the algorithm requires more computation and storage in each iteration than the conjugate gradient methods (Nocedal, 1980; Nocedal and Wright, 1999; Schraudolph et al., 2018).
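The idea of updating a Hessian approximation from gradient differences can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it maintains an approximation B of the inverse Hessian (a common variant that avoids solving a linear system), it uses an exact line search that is valid only for the hypothetical quadratic test function chosen here, and all names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def bfgs_inverse_update(B, s, y):
    """BFGS update of the inverse-Hessian approximation B.

    s = x_new - x_old, y = g_new - g_old. Only gradients are used;
    no second derivatives are evaluated.
    """
    n = len(s)
    rho = 1.0 / dot(y, s)
    By = matvec(B, y)
    yBy = dot(y, By)
    return [[B[i][j] - rho * (s[i] * By[j] + By[i] * s[j])
             + (rho * rho * yBy + rho) * s[i] * s[j]
             for j in range(n)] for i in range(n)]


# Hypothetical quadratic f(x) = 0.5 x^T A x - b^T x, with gradient A x - b
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def gradient(x):
    return [ax - bi for ax, bi in zip(matvec(A, x), b)]


x = [0.0, 0.0]
B = [[1.0, 0.0], [0.0, 1.0]]  # start from the identity matrix
for _ in range(10):
    g = gradient(x)
    if dot(g, g) < 1e-20:      # stop once the gradient vanishes
        break
    p = [-v for v in matvec(B, g)]                # quasi-Newton direction
    alpha = -dot(g, p) / dot(p, matvec(A, p))     # exact step for a quadratic
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    s = [a_ - c_ for a_, c_ in zip(x_new, x)]
    y = [a_ - c_ for a_, c_ in zip(gradient(x_new), g)]
    B = bfgs_inverse_update(B, s, y)
    x = x_new
```

On this two-dimensional quadratic the iteration reaches the minimizer (0.2, 0.4) after two steps. The n × n matrix B that must be stored and updated is the source of the extra storage and computation per iteration mentioned in the text.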

Resilient Back-Propagation (RP) algorithm
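Resilient Back-Propagation is an algorithm that adapts the weights directly based on local gradient information. In this learning scheme, the adaptation is not blurred by the magnitude of the gradient, because only the sign of the gradient is used. To achieve this, an individual update value $\Delta_{ij}$ is assigned to each weight to determine the size of the weight update.
$\Delta_{ij}$ evolves during the learning process depending on its local view of the error function F, governed by the learning rule given in Equation (6):
$$\Delta_{ij}^{(t)} = \begin{cases} \eta^{+} \, \Delta_{ij}^{(t-1)}, & \text{if } \dfrac{\partial F^{(t-1)}}{\partial \omega_{ij}} \cdot \dfrac{\partial F^{(t)}}{\partial \omega_{ij}} > 0 \\ \eta^{-} \, \Delta_{ij}^{(t-1)}, & \text{if } \dfrac{\partial F^{(t-1)}}{\partial \omega_{ij}} \cdot \dfrac{\partial F^{(t)}}{\partial \omega_{ij}} < 0 \\ \Delta_{ij}^{(t-1)}, & \text{otherwise} \end{cases} \qquad (6)$$
where $0 < \eta^{-} < 1 < \eta^{+}$; $\eta^{-}$ and $\eta^{+}$ are the update value factors and $\omega_{ij}$ is the corresponding weight. More details on this algorithm can be found in Saini (2008).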
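The sign-based learning rule can be sketched in Python for a single weight. This is a simplified variant without weight backtracking; the factors 0.5 and 1.2, the bounds on the update value and the one-dimensional error function F(w) = (w - 3)² are assumptions chosen for illustration, not settings reported in this study.

```python
def rprop_update(delta, grad_prev, grad_curr,
                 eta_minus=0.5, eta_plus=1.2,
                 delta_min=1e-6, delta_max=50.0):
    """Return the new update value for one weight.

    The update value grows while the gradient keeps its sign and shrinks
    when the sign flips; the gradient magnitude is never used.
    """
    product = grad_prev * grad_curr
    if product > 0:
        return min(delta * eta_plus, delta_max)
    if product < 0:
        return max(delta * eta_minus, delta_min)
    return delta


def sign(value):
    return (value > 0) - (value < 0)


# Hypothetical error function F(w) = (w - 3)^2 with gradient 2 * (w - 3)
w, delta, grad_prev = 0.0, 0.1, 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    delta = rprop_update(delta, grad_prev, grad)
    w -= sign(grad) * delta      # step against the sign of the gradient
    grad_prev = grad
```

The weight moves towards 3 in growing steps, overshoots, and then settles as each change of gradient sign halves the update value.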

Scaled Conjugate Gradient Back-Propagation (SCG) algorithm
Conjugate gradient algorithms usually require a line search at each iteration. However, the scaled conjugate gradient algorithm developed by Moller (1993) avoids the time-consuming line search. This algorithm combines the model trust region approach used in the LM algorithm with the conjugate gradient approach. The Scaled Conjugate Gradient Back-Propagation algorithm can require more iterations to converge than the other algorithms, but the computation in each iteration is significantly lower because no line search is performed (Moller, 1993; Andrei, 2007; Cetisli and Barkana, 2010).

Conjugate Gradient with Powell-Beale Restarts (CGB) algorithm
In conjugate gradient algorithms, the search direction is periodically reset to the negative of the gradient. The standard reset point occurs when the number of iterations equals the number of network parameters. Powell (1977) proposed a reset method based on the earlier method of Beale (1967) to improve the efficiency of the training. According to this technique, a restart is made if very little orthogonality is left between the current and previous gradients, which is tested using the inequality given in Equation (7) (Colaco and Orlande, 1999; Saini and Soni, 2002).
$$\left| g_{k-1}^{T} g_k \right| \geq 0.2 \, \| g_k \|^{2} \qquad (7)$$
where $g_{k-1}$ is the previous gradient, $g_k$ is the current gradient and $\| g_k \|^{2}$ is the norm squared of the current gradient.
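In the Fletcher-Reeves form of the conjugate gradient update, the scalar $\beta_k = \dfrac{g_k^{T} g_k}{g_{k-1}^{T} g_{k-1}}$ is the ratio of the norm squared of the current gradient to the norm squared of the previous gradient, and it is a positive scalar.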
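Both quantities are cheap to compute from two successive gradients, as the following Python sketch shows. It assumes the restart test $|g_{k-1}^T g_k| \geq 0.2\,\|g_k\|^2$ and the Fletcher-Reeves ratio $g_k^T g_k / g_{k-1}^T g_{k-1}$; the function names and the example gradients are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def needs_restart(g_prev, g_curr, threshold=0.2):
    """Powell-Beale test: restart when successive gradients are far from orthogonal."""
    return abs(dot(g_prev, g_curr)) >= threshold * dot(g_curr, g_curr)


def fletcher_reeves_beta(g_prev, g_curr):
    """Ratio of the squared norm of the current gradient to that of the previous one."""
    return dot(g_curr, g_curr) / dot(g_prev, g_prev)


# Orthogonal gradients: no restart is needed
print(needs_restart([1.0, 0.0], [0.0, 1.0]))         # False
# Gradients that overlap strongly: reset to the negative gradient
print(needs_restart([1.0, 0.0], [0.5, 1.0]))         # True
print(fletcher_reeves_beta([1.0, 0.0], [0.5, 1.0]))  # 1.25
```

When the test returns True, the search direction would be reset to the negative of the current gradient rather than built from the previous direction.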

One step secant (OSS) algorithm
The one-step-secant (OSS) algorithm attempts to bridge the gap between the quasi-Newton approach and the conjugate gradient algorithms. This approach does not store the complete Hessian matrix; instead, it assumes at each iteration that the previous Hessian was the identity matrix. OSS therefore has the advantage of calculating the new search direction without computing a matrix inverse. However, OSS requires more computation in each iteration and more storage than the conjugate gradient methods (Constantinescu et al., 2008; Upadhyay, 2013).

MATLAB installed on a personal computer (Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz, 16 GB RAM) was used to apply the above algorithms to predict the monthly minimum and maximum atmospheric temperatures in Tabuk, Saudi Arabia. The input variables were monthly rainfall, minimum monthly relative humidity, maximum monthly relative humidity, minimum monthly air pressure, maximum monthly air pressure and monthly average wind speed. In addition, the minimum and maximum monthly atmospheric temperatures were fed to the algorithms to calibrate and test the ANN models. The developed ANN models predict y(t), the dependent variable (the minimum or maximum atmospheric temperature), from d past values of y(t) and x(t), where x(t) denotes the independent variables at each time step (the monthly rainfall, relative humidity, air pressure and wind speed inputs listed above). The ANN was trained with 70% of the target time steps, while 15% each of the target time steps were used for validation and testing. In addition, 10 hidden neurons and 2 delays were used in the network. The details are shown in Fig. 1. The performance of each training algorithm in predicting the atmospheric temperatures was evaluated using the Mean Squared Error (MSE) and the correlation coefficient (R). The simulation times for all algorithms were around 1-3 seconds on the above personal computer.
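The data arrangement and the two performance measures can be sketched in Python as follows. This is an illustrative sketch only: the series are synthetic stand-ins for the 360 monthly records, the contiguous 70/15/15 split is an assumption (the text does not state how the time steps were assigned to the subsets), and the function names are hypothetical rather than MATLAB functions.

```python
import math


def make_lagged_samples(x, y, d=2):
    """Pair each target y[t] with the d previous values of x and y."""
    inputs, targets = [], []
    for t in range(d, len(y)):
        row = []
        for lag in range(1, d + 1):
            row.extend(x[t - lag])   # independent variables at time t - lag
            row.append(y[t - lag])   # past temperature at time t - lag
        inputs.append(row)
        targets.append(y[t])
    return inputs, targets


def split_counts(n, train=0.70, val=0.15):
    """Sizes of the training, validation and test subsets."""
    n_train = round(n * train)
    n_val = round(n * val)
    return n_train, n_val, n - n_train - n_val


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def correlation(pred, target):
    """Pearson correlation coefficient R between predictions and targets."""
    mp = sum(pred) / len(pred)
    mt = sum(target) / len(target)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    return cov / (sp * st)


# Synthetic stand-in: 360 months, 6 independent variables per month
months = 360
x = [[math.sin(0.5 * m + j) for j in range(6)] for m in range(months)]
y = [20.0 + 10.0 * math.sin(2 * math.pi * m / 12) for m in range(months)]

inputs, targets = make_lagged_samples(x, y, d=2)
n_train, n_val, n_test = split_counts(len(inputs))
```

With 2 delays, the first two months have no complete history, so 358 samples remain, each with 14 inputs (six independent variables plus the past temperature, at two lags).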

Case study application - Tabuk, Saudi Arabia
Tabuk city is located in the north-western part of Saudi Arabia, as shown in Fig. 2, and it has an area of 139,000 km². Tabuk province is bounded by the Saudi-Jordan border to the north, the Red Sea to the south and west and the Hufa depression to the east. The city is located at an average altitude of 770 m above mean sea level. Tabuk province is classified as a hyper-arid catchment and experiences a shorter winter season and a longer summer season (Abushandi and Alatawi, 2015). Monthly weather data for 30 years, from 1986 to 2015, were collected from the Saudi General Authority of Meteorology and Environment Protection and fed to the developed ANN models to predict the minimum and maximum atmospheric temperatures in Tabuk.

Results and discussion
Figs. 3(a-h) show the correlation coefficients (R values) for the performance of the neural networks used for predicting the minimum atmospheric temperature. R values are shown separately for training, testing and validation of the neural network models produced using [...] well in the validation process. It reached convergence at 20 epochs while reaching a considerably low MSE (~4). However, the OSS algorithm took the greatest number of epochs (77) and had the highest MSE (~7) at convergence. Therefore, OSS was outperformed by the other algorithms. Overall, the results show that the BFG algorithm performs better than the others in predicting the minimum atmospheric temperature in Tabuk, Saudi Arabia. Nevertheless, the CGB and SCG algorithms can also be considered good alternatives.

Maximum atmospheric temperature
Correlation coefficients for the performance of the ANN models in predicting the maximum temperature in Tabuk are shown in Figs. 6(a-h). Slight reductions can be seen in the correlation coefficients of Figs. 6(a-h) compared to those of Figs. 3(a-h) (for minimum atmospheric temperature). The maximum R value in the minimum temperature prediction is around 0.98; however, it is lower in the maximum temperature prediction. A similar observation can be made for the minimum R values: the minimum atmospheric temperature prediction algorithms have a minimum R value of 0.935, while it is 0.865 for the maximum atmospheric temperature prediction algorithms. This indicates that the regression plots of the ANN models for predicting minimum atmospheric temperature are less scattered than those for maximum atmospheric temperature. Figs. 6(a-h) give similar R values for all algorithms except OSS, which shows lower R values. Among the others, the LM and CGF algorithms perform well.
Figs. 7(a-h) present the MSE values for the validation process of the different algorithms in maximum atmospheric temperature prediction. Similar to Fig. 5(a), the LM algorithm shows the lowest number of epochs to convergence (9 epochs) in the maximum atmospheric temperature prediction. It also has one of the lower MSE values (~4) among the algorithms. In addition, the SCG (~3) and CGF (~4) algorithms have low MSE values; however, they required a greater number of epochs to converge (38 and 61, respectively). The OSS algorithm also converged quickly compared to the other algorithms; however, it has the highest MSE value. Therefore, the same conclusion can be drawn for the OSS algorithm in maximum atmospheric temperature prediction as in minimum atmospheric temperature prediction: it was outperformed by the other algorithms. In general, the LM and CGF algorithms performed well in predicting the maximum atmospheric temperature for Tabuk, Saudi Arabia.

Conclusions
Eight different ANN training algorithms were applied to predict the atmospheric temperature in Tabuk, Saudi Arabia. The models used several weather parameters, including rainfall, relative humidity, wind speed, air pressure and atmospheric temperature. The results revealed that, in general, two different algorithms performed best: BFG for minimum temperature and LM for maximum temperature. In addition, one algorithm, SCG, performed well in both minimum and maximum temperature prediction. Nevertheless, the results show that all eight training algorithms performed to an acceptable level, because all the correlation coefficients are greater than 0.85 and the MSE values for the validation processes are acceptable. Therefore, it can be concluded that atmospheric temperature forecasting in Tabuk can be reliably done using artificial neural networks. In addition, weather forecasters have a choice of different algorithms to use in prediction. Of these, the LM, BFG and SCG algorithms are proposed as the preferred algorithms for the prediction.
Furthermore, the models are based on real measured data and are therefore applicable to real-world cases. They can also be used to predict future minimum and maximum atmospheric temperatures from independent variables derived from various climate models.