Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data

  • Kai Cheng (Dept. of Information Science, Kyushu Sangyo University);
  • Keisuke Abe (Dept. of Information Science, Kyushu Sangyo University)
  • Received: 2021.01.15
  • Reviewed: 2021.12.26
  • Published: 2023.02.28

Abstract

Synthetic data generation is widely used for performance evaluation and functional testing of data-intensive applications, as well as in areas of data analytics such as privacy-preserving data publishing (PPDP) and statistical disclosure limitation/control. Considerable research has been devoted to tools and languages for data generation. However, existing tools and languages were developed for specific purposes and are unsuitable for other domains. In this article, we propose a regular expression-based data generation language (DGL) for flexible big data generation. To make the DGL general-purpose and powerful, we enhance standard regular expressions to support data domains, type/format inference, sequence and random generation, probability distributions, and resource references. To implement the proposed language efficiently, we propose caching techniques for both intermediate results and database queries. We evaluated the proposed improvements experimentally.
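
To give a rough sense of what using a regular expression as a generator (rather than a matcher) can look like, the following is a minimal Python sketch. It is not the DGL proposed in the article: the supported subset (literals, character classes such as `[A-Z]`, `\d`, and `{n}` / `{m,n}` repetition) and the function names `expand_class`, `parse`, and `generate` are assumptions made purely for illustration, and none of the enhancements listed above (data domains, distributions, resource references, caching) are modeled.

```python
import random
import re


def expand_class(body):
    """Expand a character-class body such as 'A-Z0-9' into a list of characters."""
    chars = []
    i = 0
    while i < len(body):
        # A range like 'A-Z' needs a character on both sides of the hyphen.
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def parse(pattern):
    """Parse a tiny regex subset into (choices, min_repeat, max_repeat) tuples."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            end = pattern.index("]", i)
            choices = expand_class(pattern[i + 1:end])
            i = end + 1
        elif ch == "\\" and pattern[i + 1] == "d":
            choices = list("0123456789")
            i += 2
        else:
            # Anything else is treated as a literal character (illustrative simplification).
            choices = [ch]
            i += 1
        lo = hi = 1
        if i < len(pattern) and pattern[i] == "{":
            end = pattern.index("}", i)
            parts = pattern[i + 1:end].split(",")
            lo, hi = int(parts[0]), int(parts[-1])
            i = end + 1
        tokens.append((choices, lo, hi))
    return tokens


def generate(pattern, rng):
    """Draw one random string from the language described by the pattern."""
    out = []
    for choices, lo, hi in parse(pattern):
        count = rng.randint(lo, hi)  # uniform choice of repetition count
        out.extend(rng.choice(choices) for _ in range(count))
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so runs are repeatable
    pattern = r"[A-Z]{2}-\d{4}"
    for _ in range(3):
        value = generate(pattern, rng)
        # Every generated value should match the pattern it came from.
        assert re.fullmatch(pattern, value)
        print(value)
```

Each call to `generate` samples characters and repetition counts uniformly; a language with probability distributions, as described in the abstract, would replace those uniform draws with weighted or parameterized ones.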

Funding

This study was supported by a grant from the JSPS Grants-in-Aid for Scientific Research (No. 20K11836) to Kai Cheng.
