Pattern mining for large distributed dataset: A parallel approach (PMLDD)

  • Pal, Amrit (Department of Information Technology, Indian Institute of Information Technology)
  • Kumar, Manish (Department of Information Technology, Indian Institute of Information Technology)
  • Received : 2017.12.08
  • Accepted : 2018.06.05
  • Published : 2018.11.30

Abstract

Handling the vast amount of data found in large transactional datasets is a clear challenge for conventional data mining algorithms. To address this challenge, this paper proposes a parallel approach that decomposes the mining problem into sub-problems in order to find frequent patterns in such datasets. The proposed approach, Pattern Mining for Large Distributed Dataset (PMLDD), ensures minimal dependencies and minimal communication among the sub-problems. It establishes a linear aggregation of the intermediate results, so that it can be adapted to large-scale programming models such as MapReduce. In this context, an algorithmic structure for the MapReduce programming model is presented. PMLDD guarantees efficient load balancing among the sub-problems by means of a specific selection criterion. Further, it optimizes the number of iterations over the dataset required for mining frequent patterns, relative to existing approaches. Finally, we believe that our approach is scalable enough to handle larger datasets, and the performance evaluation and result analysis support these claims.
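
The idea of linear aggregation in a MapReduce setting can be illustrated with a minimal sketch. This is not the PMLDD algorithm itself, whose decomposition and selection criterion are not described in this abstract; it is a generic partitioned support count, assuming that each partition is processed independently and that partial counts are merged by plain addition. The function names, the toy partitions, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_partition(transactions, k):
    """Map step (illustrative): count every k-item combination in one partition."""
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts


def reduce_counts(partials):
    """Reduce step (illustrative): partial counts are merged by simple addition."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


def frequent_itemsets(partitions, k, min_support):
    """Keep the k-itemsets whose summed count reaches min_support."""
    partials = [map_partition(p, k) for p in partitions]  # no cross-partition dependency
    total = reduce_counts(partials)
    return {s: c for s, c in total.items() if c >= min_support}


# Hypothetical toy data: two partitions of a transactional dataset.
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "c"], ["a", "b", "c"]],
]
result = frequent_itemsets(partitions, k=2, min_support=3)
print(result)
```

Because the merge is a sum, partitions can be counted in any order and on any worker, which is the property that makes this kind of computation fit the MapReduce model.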
