Cleaning Noises from Time Series Data with Memory Effects

  • Cho, Jae-Han (Dept. of Computer Engineering, Kumoh National Institute of Technology)
  • Lee, Lee-Sub (Dept. of Computer Engineering, Kumoh National Institute of Technology)
  • Received : 2020.03.24
  • Accepted : 2020.04.13
  • Published : 2020.04.29

Abstract

Deep learning development is an iterative process that requires a great deal of manual work. Within that process, pre-processing the training data is very costly and significantly affects the learning results. Early research on AI algorithms relied mainly on training data from public databases that data scientists had already curated and cleaned. Training data collected in real environments, by contrast, consists mostly of operational sensor data and inevitably contains various kinds of noise. Accordingly, a variety of data cleaning frameworks and methods for removing noise have been studied. In this paper, we propose a method for detecting and removing noise from time-series data, such as the sensor data generated in IoT environments. The method uses linear regression so that the system repeatedly finds noisy values and supplies replacement values for them, thereby cleaning the training data. To verify the effectiveness of the proposed method, we also propose a simulation method and use it to confirm how to determine the factors that yield optimal cleaning results.
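
The detect-and-replace loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the regression window, the noise criterion, or the stopping rule, so the names `fit_line`, `clean_series`, `half_window`, and `threshold` are all hypothetical. The sketch assumes that each value follows a local linear trend set by its neighbors, fits a least-squares line to those neighbors, and in each pass replaces only the single worst outlier with the fitted value.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def clean_series(values, half_window=3, threshold=5.0, max_passes=None):
    """Hypothetical cleaner: repeatedly replace the worst local outlier.

    Assumption (not from the paper): a point is noise when it deviates by
    more than `threshold` from a line fitted to its neighbors.
    """
    cleaned = [float(v) for v in values]
    n = len(cleaned)
    flagged = set()
    if max_passes is None:
        max_passes = n
    for _ in range(max_passes):
        worst_index, worst_residual, worst_prediction = None, threshold, None
        for i in range(n):
            # Neighbors on both sides of i, excluding i itself.
            lo, hi = max(0, i - half_window), min(n - 1, i + half_window)
            xs = [j for j in range(lo, hi + 1) if j != i]
            if len(xs) < 2:
                continue
            a, b = fit_line(xs, [cleaned[j] for j in xs])
            prediction = a * i + b
            residual = abs(cleaned[i] - prediction)
            if residual > worst_residual:
                worst_index, worst_residual = i, residual
                worst_prediction = prediction
        if worst_index is None:
            break  # no remaining point exceeds the threshold
        # Replace only the worst point, then re-fit on the next pass, so a
        # spike does not cause its clean neighbors to be flagged as well.
        cleaned[worst_index] = worst_prediction
        flagged.add(worst_index)
    return cleaned, sorted(flagged)


# Simulation-style check: inject one spike into a clean linear series.
series = [2 * t + 1 for t in range(20)]
series[10] += 50
cleaned, flagged = clean_series(series, half_window=3, threshold=5.0)
print(flagged)
```

Injecting noise at known positions, as in the last lines, is one simple way to compare the flagged indices against the injected ones while varying `half_window` and `threshold`. Whether this matches the simulation procedure in the paper is not stated in the abstract.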
