1. Introduction
With the spread of big data environments, big data activities are increasing among private corporations as well as public organizations. The government recently presented Government 3.0 as a new paradigm of administration and announced plans to implement it. Corporations are also using big data analytics for various management activities, including the creation of results, the optimization of operations, and crisis management, prevention, and resolution. Big data is also applicable to cadastral information, but services and development for the utilization, analysis, and provision of cadastral information remain insufficient.
Cadastral information is a core national information source that forms the basis of spatial and geographical information in South Korea, and it directly and indirectly influences people's daily lives, including the production and consumption of information related to real estate. The cadastral information contained in spatial information is closely related to various social areas, and its importance will stand out if it is merged with data from other fields and used effectively for communication between the government and the public. As an indication of how much spatial information is actually used in communication with and services to the public, 88% of the government's public services in 2011 were related to spatial information; within spatial information, cadastral information, including land registration records and real estate data, accounted for the highest share, and its use is continuously increasing (Yoo and Yu, 2013). Kim (2013) reviewed the preconditions for realizing the use of real estate price announcement data under the Government 3.0 plans and suggested ways to utilize the data through network analysis, which is a type of big data analysis. Yoo and Yu (2013) used the cadastral resurvey project as a case study to show the effects expected when cadastral information is built and utilized from a big data perspective, and visualized the analysis results in a scenario. These studies laid a foundation for big data cadastral information by focusing on whether it can be applied. However, they made limited use of actual data and empirical analytic methods, and considerable work remains to be done.
Therefore, this study approaches cadastral information from the perspective of the fusion and utilization of spatial data and presents big data analytics methods for news reports related to the cadastral resurvey project. As the specific research method, Chapter 2 defines the categories of cadastral information to be analyzed with big data and designs a big data analytics framework for analyzing them. Chapter 3 establishes the analysis procedure and analysis package for searching news on cadastral information. For this purpose, the TM (Text Mining) package of R was used to read news reports in various formats as text, and nouns were extracted using the KoNLP (Korean Natural Language Processing) package, which removes stop words and symbols and recognizes Hangul in morpheme units.
In Chapter 4, for the data mining analysis of cadastral information, news reports related to the cadastral resurvey published between 2012 and 2014 were searched in Chosun Ilbo, Maeil Business Newspaper, Munhwa Ilbo, JoongAng Ilbo, and other local newspapers, and nouns were extracted from the retrieved reports. A word cloud analysis was performed by calculating word frequencies. Furthermore, the support, confidence, and lift of the association rules were presented through correlation analyses among the extracted compound nouns. Lastly, major keywords were extracted from public opinion about the cadastral resurvey and the public services following it, the correlations among the keywords were analyzed, and suggestions were presented. The Conclusions summarize the results and significance of this study, which can analyze and visualize news and various opinions regarding the cadastral resurvey, and suggest subjects for future study. The overall research procedure and methods are shown in Fig. 1.
Fig. 1. Overview of study procedure
2. Design of Big Data Analytics Method for Cadastral Information
2.1 Concept of cadastral information
Cadastral information refers to information produced in data form by computer processing of the data managed by the agency in charge of cadastral records, in accordance with the Act on Land Survey, Waterway Survey and Cadastral Records, and it consists of drawing information and properties information (Lee, 2011). In a broad sense, therefore, cadastral information can include the information developed for cadastral administration, registration information, and cadastral survey information. As explained above, cadastral information is divided into drawing information and properties information, and each is subdivided into cadastral study data, administration data, measurement data, and processing data. Properties data are divided into 11 categories, including the land register, and drawings are divided into 11 categories, including the land registration map.
Table 1. Categories of cadastral information
2.2 Big data analytics method
There is no single definition of big data. Gartner described big data as the growth and change of data along the three dimensions of volume, velocity, and variety, and defined it as high-volume, high-velocity, and high-variety information assets that call for cost-effective, innovative processing to provide improved insight and better decision making (Crampton et al., 2013; Kim et al., 2013). McKinsey focused on the size of the database and defined big data as data whose size exceeds what general database software can store, manage, and analyze (James et al., 2011; Bae et al., 2013). Furthermore, the techniques and tools related to large-scale data (collection, storage, search, sharing, analysis, visualization, etc.) are also included in the category of big data; in this sense, big data refers to the techniques and architecture for extracting value from a large volume of structured or unstructured data sets and analyzing the results.
New value must be found through the analysis of big data, and one of the analysis techniques is mining. Because mining techniques are applied to decision making, marketing, customer management, finance, medicine, education, and energy, data mining in a broad sense is a generic term for mining techniques based on all types of data (Song, 2012; Yoo and Yu, 2013). Representative big data analytics techniques, classified by the data on which they are based, include data mining, association rule mining, text mining, Web mining, social mining, and reality mining, and the details vary by technique (Table 2).
Table 2. Mining techniques based on data
In this study, text mining and association rule mining were used. Text mining applies statistical and machine learning algorithms and methods to documents in order to find useful patterns in a large volume of documents. Data mining finds patterns in information extracted from a large volume of data, whereas text mining is characterized by its use of unstructured data (Oh, 2012).
Various indices used in text mining are based on mathematical algorithms, and the most widely known is TF-IDF (Term Frequency-Inverse Document Frequency). This method assigns importance through vector calculation and was created as a way of representing documents for information retrieval and text mining and of evaluating the relative importance of words in documents. TF-IDF is an index obtained by multiplying TF by the reciprocal of DF, and it can be used to extract the importance of a word. In other words, rather than using simple frequency, the frequency of each word is weighted once more according to how widely the word appears across documents. Under the premise that words appearing in many documents at once simply have a higher probability of appearance, the inverse document frequency is calculated, so the larger the DF is, the lower the importance becomes (Lim, 2013). The TF value is the frequency of a specific word in a document. It is normalized by dividing the appearance frequency of the word in the document by the appearance count of all words. The IDF is determined by dividing the number of documents in the document set by the number of documents in which the word appears. This can be expressed as Eq. (1).
In Eq. (1), TF is the frequency of a given word in the document; DF is the frequency of a specific word across multiple documents; IDF is the inverse of DF; $f_{i,j}$ is the occurrence count of word $t_i$ in document $d_j$; $\max f_{i,j}$ is the number of occurrences of all words in document $d_j$; and $N$ is the number of documents in the document set.
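Written out from these symbol definitions, Eq. (1) takes roughly the following form. One assumption is made here: the ratio $N/DF_i$ is scaled logarithmically, as is common for TF-IDF, although the text describes only the ratio itself. $DF_i$ denotes the document frequency of word $t_i$.

```latex
\begin{equation}
\begin{aligned}
TF_{i,j} &= \frac{f_{i,j}}{\max f_{i,j}}, \\
IDF_i &= \log \frac{N}{DF_i}, \\
TF\text{-}IDF_{i,j} &= TF_{i,j} \times IDF_i
\end{aligned}
\tag{1}
\end{equation}
```

As a purely illustrative calculation with hypothetical numbers, a word with $f_{i,j} = 3$ in a document whose normalizing count is 100, and which appears in 10 of 160 documents, would have $TF = 0.03$, $IDF = \ln 16 \approx 2.77$, and $TF\text{-}IDF \approx 0.083$.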
The basic concept of association rule mining is to derive correlations among data items based on how frequently they appear together. For example, the association rule “A, B → C” means that people who buy products A and B mostly also buy product C (Joo et al., 2011). In this study, the Apriori algorithm, a representative association rule mining algorithm, was used to extract representative keywords from unstructured reports and analyze the association rules among the keywords. The data analyzed by these mining techniques can be expressed as graphs, which present the information in a form that is easy to read and conveys useful meaning. In this study, the analysis results obtained with text mining and association rule mining were visualized as graphs.
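The idea behind Apriori and the three usual rule measures (support, confidence, and lift) can be illustrated with a minimal Python sketch. It is not the implementation used in this study, which relied on an R package; the transactions, thresholds, and function names below are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical transactions: each set holds the keywords found in one report.
transactions = [
    {"resurvey", "dispute", "boundary"},
    {"resurvey", "dispute", "lawsuit"},
    {"resurvey", "boundary"},
    {"resurvey", "dispute", "boundary", "lawsuit"},
    {"parcel", "boundary"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets meeting min_support."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if support([i], transactions) >= min_support]
    frequent = {}
    while current:
        for s in current:
            frequent[s] = support(s, transactions)
        # Join step: build candidates that are one item larger.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # Prune step: every subset one item smaller must already be frequent.
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent
                          for sub in combinations(c, len(c) - 1))
                   and support(c, transactions) >= min_support]
    return frequent

def rules(frequent, min_confidence):
    """Derive LHS -> RHS rules with their support, confidence, and lift."""
    out = []
    for itemset, sup in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, k)):
                rhs = itemset - lhs
                conf = sup / frequent[lhs]      # P(RHS | LHS)
                lift = conf / frequent[rhs]     # confidence relative to P(RHS)
                if conf >= min_confidence:
                    out.append((set(lhs), set(rhs), sup, conf, lift))
    return out

freq = frequent_itemsets(transactions, min_support=0.4)
for lhs, rhs, sup, conf, lift in rules(freq, min_confidence=0.9):
    print(sorted(lhs), "->", sorted(rhs),
          round(sup, 2), round(conf, 2), round(lift, 2))
```

A lift above 1 indicates that the keywords on both sides of a rule co-occur more often than would be expected if they were independent.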
2.3 Big data analytics program R
R was developed to support statistical analysis and graphics. It derives from the S language and is both a software environment and a programming language. R can be freely used and distributed under the GNU General Public License, and it provides many functions for statistics and for presenting data as graphs. The images produced by these graphics functions can be converted to various formats, and the statistical functions can visualize analysis results and save them as objects or files. The programming language features of R allow users to analyze multiple data sets consecutively with loops, or to combine programs written for different analyses into a more complex analysis.
3. Implementation of Cadastral Information Search and Analysis
3.1 Establishment of report search and analysis procedure
The report search process and the points requiring caution during the search are as follows. In the first step, keywords that can define and represent the objective and perspective of the survey are extracted based on an understanding of the cadastral field. In the second step, the keywords are set; the number of keywords used in the report search determines the volume of search results. It should be noted that if keywords are set without knowledge of the field, unwanted or unimportant results may be retrieved. In the third step, the search results are analyzed: the data are grouped by similar reports or keywords to derive candidates, keywords are extracted, and compound nouns are generated by extracting associated words. In the last step, the analysis results are used to obtain the final search results through a second search.
3.2 Implementation of R analysis package
Because the objective is for the user to obtain report analysis results in the shortest possible time, the reports are analyzed through the second step of the report search process. The search result analysis program is implemented in the following procedure: generate a corpus for each report related to the cadastral resurvey project; remove stop words and symbols from each corpus; extract only nouns of two characters or longer; group the reports by measuring the similarity between reports and documents; select valid candidate groups from the grouped reports; and finally, extract the keywords and compound nouns from the reports included in the valid groups.
Fig. 2. Analysis process of the article
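The procedure above can be sketched in a few lines of Python. This is a simplified illustration rather than the program developed in this study: whitespace tokens stand in for nouns extracted by a Korean morphological analyzer, Jaccard similarity and the "two or more reports" validity criterion are assumptions, and the stop-word list and sample reports are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop words and sample reports used only for illustration.
STOP_WORDS = {"the", "of", "and", "in", "for", "is", "a", "to", "over"}
reports = [
    "Cadastral resurvey project settles boundary dispute in the county.",
    "Boundary dispute over cadastral resurvey project: settlement reached!",
    "Local festival opens in the city park.",
]

def tokens(text):
    """Strip symbols and digits, then keep tokens of two or more characters."""
    cleaned = re.sub(r"[^\w\s]|\d", " ", text.lower())
    return [w for w in cleaned.split() if len(w) >= 2 and w not in STOP_WORDS]

def jaccard(a, b):
    """Similarity of two token lists, measured on their sets of distinct tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_reports(corpus, threshold):
    """Greedy grouping: a report joins the first group whose seed is similar enough."""
    groups = []
    for idx, toks in enumerate(corpus):
        for g in groups:
            if jaccard(corpus[g[0]], toks) >= threshold:
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups

corpus = [tokens(r) for r in reports]               # one token list per report
groups = group_reports(corpus, threshold=0.5)       # similarity-based grouping
valid = [g for g in groups if len(g) >= 2]          # assumed validity criterion
keywords = Counter(w for g in valid for i in g for w in corpus[i])
print(groups, keywords.most_common(5))
```

The greedy grouping keeps the sketch short; any clustering method based on document similarity could take its place.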
This program was developed using R, which makes it easy to analyze big data and visualize the analysis results. R is a free software environment that supports statistics and graphics for data analysis. It is a computer language combined with various analysis packages and libraries that can be used for data analysis, and it provides packages for statistics, machine learning, finance, and graphics free of charge (Seo, 2013). In this study, the TM package of R, which provides the corpus generation feature, was used. The TM package can read documents in various formats, including PDF (Portable Document Format), XML (eXtensible Markup Language), and HTML (HyperText Markup Language). In addition, the KoNLP package, which can recognize Hangul in morpheme units, was used to remove stop words and symbols and to extract nouns. Using these two packages, meaningless English letters and symbols are removed and texts and nouns are extracted (Fig. 3).
Fig. 3. Extracted nouns and Hangul morphemes separated using the TM and KoNLP packages
4. Analysis of Data Mining Results of Cadastral Information
4.1 Word cloud analysis
The reports related to the cadastral resurvey were drawn from Chosun Ilbo, Maeil Business Newspaper, Munhwa Ilbo, JoongAng Ilbo, and other local newspapers, and were published between January 2012 and April 2014. The keywords for the first search were “cadastral” and “cadastral resurvey,” and the first search found about 300 reports. The second search was then performed on these reports with the keywords “cadastral discrepancy” and “digital land registration,” which left 160 reports in the end. The full texts of the retrieved reports were converted to text files and used for analysis. Most of the reports excluded by the second search contained homonyms of the Korean search term, which can also mean “point out,” “intellectual,” or “intellectual disability.” For the keyword extraction step, as shown in Fig. 3, the word cloud feature of the KoNLP package, which recognizes Hangul in morpheme units, was used to extract nouns of two or more characters, and the frequencies of the extracted nouns were analyzed using the word count feature (Table 3).
Table 3. Noun extraction and frequency analysis
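The two-stage search and the frequency count can be illustrated with a small Python sketch. The mini-corpus below is hypothetical, English phrases stand in for the Korean search keywords, and whitespace tokens stand in for extracted nouns; the function names are likewise illustrative.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the retrieved news reports.
reports = [
    "cadastral resurvey resolves cadastral discrepancy in boundary dispute",
    "city council approves new budget",
    "cadastral resurvey brings digital land registration to the county",
    "cadastral map exhibition opens",
]

def search(docs, keywords):
    """Keep the documents that contain at least one of the keywords."""
    return [d for d in docs if any(k in d for k in keywords)]

def word_frequencies(docs, min_len=2):
    """Count tokens of at least min_len characters across the documents."""
    counts = Counter()
    for d in docs:
        counts.update(w for w in d.split() if len(w) >= min_len)
    return counts

first = search(reports, ["cadastral", "cadastral resurvey"])
second = search(first, ["cadastral discrepancy", "digital land registration"])
freq = word_frequencies(second)
print(len(first), len(second), freq.most_common(3))
```

A frequency table of this kind is the usual input to a word cloud, in which the size of each word reflects its count.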
Furthermore, the analysis results were visualized using the Arules Viz (Visualizing) package. As shown in Fig. 4(a), various keywords were extracted, such as “cadastral,” “boundary,” “resurvey,” “cadastral resurvey,” “cadastral survey,” “cadastral discrepant land,” “dispute,” and “settlement.”
Fig. 4. Visualization of data mining analysis for cadastral big data
4.2 Association rule generation and result
In the next step, the Arules package, which provides the Apriori feature, was used to extract compound nouns and analyze the correlations among nouns. The nouns were converted to a data format suitable for analyzing association rules, and the data were divided by the first factor (vector data) and the second factor (each factor of the given vector) to generate lists. As a result, five groups of data consisting of 133 keywords were generated. The most frequent words were “cadastral resurvey,” “civil complaint,” “dispute,” “cadastral surveying,” “lawsuit,” “settlement,” “mediation,” “discrepant land,” and “parcel.” Furthermore, the correlation analysis among the major high-frequency nouns found 42 association rules. There were 18 association rules related to 2 keywords, 19 association rules related to 3 keywords, and 4 association rules related to 4 keywords. The association rules between the LHS (Left-Hand Side) and RHS (Right-Hand Side) were analyzed, and the support, confidence, and lift of each association rule are shown in Table 4.
Table 4. Results of association rules
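The conversion of extracted nouns into per-report lists, and the tally of rules by the number of keywords they involve, can be sketched as follows. This is an illustrative Python analogue and not the R code used in this study; the (report, noun) pairs and the rules are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (report id, noun) pairs such as a noun extraction step might yield.
pairs = [
    (1, "cadastral resurvey"), (1, "dispute"), (1, "dispute"),
    (2, "cadastral resurvey"), (2, "lawsuit"),
    (3, "parcel"), (3, "cadastral resurvey"), (3, "settlement"),
]

def to_transactions(pairs):
    """Split the nouns by report id; each transaction lists an item only once."""
    grouped = defaultdict(list)
    for report_id, noun in pairs:
        if noun not in grouped[report_id]:
            grouped[report_id].append(noun)
    return dict(grouped)

transactions = to_transactions(pairs)

# Hypothetical rules as (LHS, RHS) pairs, tallied by the number of keywords involved.
rules = [
    ({"dispute"}, {"cadastral resurvey"}),
    ({"lawsuit"}, {"cadastral resurvey"}),
    ({"parcel", "settlement"}, {"cadastral resurvey"}),
]
rules_by_size = Counter(len(lhs | rhs) for lhs, rhs in rules)
print(transactions, dict(rules_by_size))
```

Removing duplicate nouns within a report matters because association rule mining treats each transaction as a set of items.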
When the above association rule results were visualized with the Arules Viz package, “cadastral resurvey” appeared at the center and was associated with all the other nine nouns. The other nouns also showed specific correlations with one another. The visualization not only showed the correlations but also aided understanding by drawing them as arrows, with the thickness and color depth of each arrow representing the support and lift, respectively (Fig. 4(b)).
5. Conclusions
In this study, big data analytics techniques and tools were used to search reports related to the cadastral resurvey project. Major keywords and compound nouns were extracted, their associations were analyzed, and the analysis results were visualized. To search reports related to the cadastral resurvey, the gist of the search area was identified and the keywords were set. Then the first search, the analysis of the search results, and the second search based on the analysis results were carried out. For the analysis of the search results, a total of 160 reports were analyzed using R. After a corpus was created for each of the collected reports, stop words were processed, nouns were extracted, and their frequencies were calculated and visualized. The correlation analysis among the most frequent of the extracted nouns generated five groups of data consisting of 133 keywords. The most frequent words were “cadastral resurvey,” “civil complaint,” “dispute,” “cadastral survey,” “lawsuit,” “settlement,” “mediation,” “discrepant land,” and “parcel.” The correlation analysis results were then visualized. The following conclusions can be drawn from the big data analytics on news articles about the cadastral resurvey: 1) the cadastral resurvey carried out by some local governments has been proceeding smoothly with positive results; 2) on the other hand, disputes raised by landowners over parcel surveying for the cadastral resurvey have been generating a stream of complaints. Through such keyword analysis, the various public opinions and types of civil complaints related to the cadastral resurvey project can be identified, so that they can be prevented through pre-emptive responses via direct call centers for cadastral surveying, electronic civil services, and customer counseling, and high-quality cadastral information services can be provided.
If “openness, sharing, and cooperation” can be achieved for this information, it will coincide with the objective of Government 3.0 that the current government is pursuing. In this study, only news reports were analyzed, but in the future, more diverse analyses will be possible by adding social network data and the information accumulated in public portal services (Onnara real estate information, the spatial information open platform, National Land and Ocean Statistics Nuri, etc.), and by applying further data mining analysis techniques.
References
- Bae, D., Park, H., and Oh, G. (2013), Current trends and policy implications of big data, Journal of International Telecommunications Policy Review, Vol. 25, No. 10, pp. 37-74. (in Korean with English abstract)
- Crampton, J., Graham, M., and Zook, M. (2013), Beyond the geotag: situating 'Big Data' and leveraging the potential of the Geoweb, Cartography and Geographic Information Science, Vol. 40, No. 2, pp. 130-139. https://doi.org/10.1080/15230406.2013.777137
- Joo, K., Shin, E., and Lee, W. (2011), Hierarchical automatic classification of news articles based on association rules, Journal of the Korean Multimedia Society, Vol. 14, No. 6, pp. 730-741. (in Korean with English abstract) https://doi.org/10.9717/kmms.2011.14.6.730
- James, M., Michael, C., Brad, B., Jacques, B., Richard, D., Charles, R., and Angela, H. (2011), Big Data: The Next Frontier for Innovation, Competition, and Productivity, Report 2011, McKinsey Global Institute, United States.
- Kim, B. (2013), A study on the utilization of real estate price announcement data based on government 3.0, Journal of the Korean Society of Cadastre, Vol. 29, No. 2, pp. 209-223. (in Korean with English abstract)
- Kim, M., Kim, D., and Lee, Y. (2013), Spatial Big Data Utilization for the National Land Policy, 2013-01, Korea Research Institute for Human Settlements, Anyang, Korea, pp. 13-14.
- Lee, T. (2011), Survey Result's Data Set Establishment Base on Cadastral Information, Ph.D. dissertation, Myongji University, Seoul, Korea, 139p. (in Korean with English abstract)
- Lim, H. (2013), Chungnam Policy Propagation Analysis Using Big Data, 2013-00, Chungnam Development Institute, Gongju, Korea, 31p. (in Korean)
- Oh, S. (2012), A Study on the Practical Applications of Text Mining, Master's thesis, Korea University, Seoul, Korea, 40p. (in Korean)
- Song, M. (2012), Creation of the Business Future Map is Big Data, Hans Media, Seoul, Korea.
- Seo, M. (2013), R for Practical Data Analysis, R-Project, Albacete, Spain, http://r4pda.co.kr/ (last date accessed: 25 July 2014).
- Yoo, K. and Yu, C. (2013), A study on the application method of cadastral information big data, Journal of the Korean Association of Cadastre Information, Vol. 15, No. 2, pp. 31-51. (in Korean with English abstract)