Labeling Big Spatial Data: A Case Study of New York Taxi Limousine Dataset

  • AlBatati, Fawaz (Umm Al-Qura University, College of Computer and Information Systems, Department of Computer Science)
  • Alarabi, Louai (Umm Al-Qura University, College of Computer and Information Systems, Department of Computer Science)
  • Received : 2021.06.05
  • Published : 2021.06.30

Abstract

Converting unlabeled spatial datasets into labeled ones through clustering is a challenging task, especially for geographical information systems. In this study, we investigated the NYC Taxi and Limousine Commission dataset and found that all of its spatio-temporal trajectories are unlabeled, which makes the dataset unsuitable for data mining tasks such as classification and regression. It is therefore necessary to convert the unlabeled spatial datasets into labeled ones, and we use a clustering technique to do so for all of the trajectory datasets. A key difficulty in applying machine learning classification algorithms in many applications is that they require large amounts of labeled data, and labeling big data is in many cases a costly process. In this paper, we show the effectiveness of a clustering technique for labeling spatial data, which leads to a high-accuracy classifier.
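The cluster-then-classify idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the classifier, so the choice of k-means and k-nearest neighbours here is an assumption made purely for illustration, as are the function names `kmeans` and `knn_predict`, the two hot-spot coordinates, and the synthetic pickup points. Real trip records would replace the generated data.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's k-means on (lon, lat) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point takes the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)


# Synthetic, unlabeled pickup locations around two illustrative hot spots.
# Plain Euclidean distance on degrees is a simplification for short ranges.
data_rng = random.Random(42)
HOT_SPOTS = [(-73.98, 40.75), (-73.78, 40.64)]  # hypothetical (lon, lat)
points = [
    (data_rng.gauss(lon, 0.005), data_rng.gauss(lat, 0.005))
    for lon, lat in HOT_SPOTS
    for _ in range(100)
]

# Step 1: clustering turns the unlabeled points into labeled ones.
centroids, labels = kmeans(points, k=2)

# Step 2: the cluster ids serve as class labels for a supervised classifier.
query = (-73.979, 40.751)
print("predicted label:", knn_predict(points, labels, query))
```

The cluster ids carry no meaning by themselves; in practice each cluster would still need to be inspected and given a meaningful name before the resulting classifier is put to use.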
