Abstract
Missing values are common in most real-world datasets. To produce efficient and accurate analytical results, such datasets must first be processed with imputation and cleaning techniques. Deep learning has recently come to be regarded as one of the most powerful machine learning techniques; it is used to uncover hidden knowledge in very large datasets and so make predictions more accurate. In this work, an efficient deep learning imputation model is proposed for imputing the missing values in the weather data of an individual weather station on a temporal basis. The model is evaluated on several stations from the National Climatic Data Center (NCDC) datasets, where the missing data of a station are predicted from the geographically nearest stations that have complete data. Five optimizers [RMSprop, Adam, Nadam, stochastic gradient descent (SGD), and Adagrad] are compared on three evaluation criteria: mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE). Among these, the SGD optimizer is found to be the most accurate in predicting the missing values. The proposed technique imputes missing values with higher accuracy and a lower error than previous models.
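The three evaluation criteria named above can be illustrated with a minimal sketch. It assumes the standard definitions of MAE, MSE, and RMSE over paired observed and imputed values; the function name `imputation_errors` and the sample temperatures are hypothetical and are not taken from the paper or from the NCDC data.

```python
import math


def imputation_errors(observed, imputed):
    """Return MAE, MSE and RMSE between observed and imputed values.

    Assumes both sequences have the same non-zero length and contain
    only the positions that were masked out and then imputed.
    """
    if len(observed) != len(imputed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [o - p for o, p in zip(observed, imputed)]
    n = len(residuals)

    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean square error
    rmse = math.sqrt(mse)                      # root mean square error
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}


# Hypothetical daily temperatures (illustrative values only).
observed = [20.0, 22.0, 19.0, 25.0]
imputed = [21.0, 22.0, 17.0, 25.0]

scores = imputation_errors(observed, imputed)
print(scores)
```

Because RMSE is the square root of MSE, the two always rank models in the same order; MAE can rank them differently because it penalizes large residuals less heavily.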
Acknowledgements
We are indebted to the National Oceanic and Atmospheric Administration for making the NCDC data publicly available; without it, this work would not have been possible.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Gad, I., Hosahalli, D., Manjunatha, B.R. et al. A robust deep learning model for missing value imputation in big NCDC dataset. Iran J Comput Sci 4, 67–84 (2021). https://doi.org/10.1007/s42044-020-00065-z
DOI: https://doi.org/10.1007/s42044-020-00065-z