• Title/Summary/Keyword: data mining technique


A Study on Differences of Contents and Tones of Arguments among Newspapers Using Text Mining Analysis (텍스트 마이닝을 활용한 신문사에 따른 내용 및 논조 차이점 분석)

  • Kam, Miah;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.18 no.3
    • /
    • pp.53-77
    • /
    • 2012
  • This study analyses the differences in contents and tones of arguments among three major Korean newspapers: the Kyunghyang Shinmun, the HanKyoreh, and the Dong-A Ilbo. It is commonly accepted that newspapers in Korea explicitly deliver their own tone of argument when they cover sensitive issues and topics. This can be problematic when readers consume the news without being aware of a paper's tone of argument, because the contents and tone of argument can easily influence readers. It is therefore desirable to have a tool that can inform readers of the tone of argument a newspaper takes. This study presents the results of clustering and classification techniques applied as part of a text mining analysis. We focus on six main subjects in the newspapers, Culture, Politics, International, Editorial-opinion, Eco-business and National issues, and attempt to identify differences and similarities among the papers. The basic unit of the text mining analysis is a paragraph of a news article. The study uses a keyword-network analysis tool and visualizes relationships among keywords to make the differences easier to see. Newspaper articles were gathered from KINDS, the Korean integrated news database system, which preserves news articles of the Kyunghyang Shinmun, the HanKyoreh and the Dong-A Ilbo and is open to the public. About 3,030 articles from 2008 to 2012 were used. The International, National issues and Politics sections were collected around specific issues: the International section with the keyword 'Nuclear weapon of North Korea,' the National issues section with the keyword '4-major-river,' and the Politics section with the keyword 'Tonghap-Jinbo Dang.' All articles from April 2012 to May 2012 in the Eco-business, Culture and Editorial-opinion sections were also collected. All of the collected data were edited into paragraphs. We removed stop-words using the Lucene Korean Module. We calculated keyword co-occurrence counts from the paired co-occurrence list of keywords in a paragraph and built a co-occurrence matrix from the list. Once the co-occurrence matrix was built, we used the cosine coefficient matrix as input to PFNet (Pathfinder Network). To analyze the three newspapers and find the significant keywords in each paper, we examined the lists of the 10 highest-frequency keywords and the keyword networks of the 20 highest-frequency keywords, closely examining the relationships and drawing a detailed network map of the keywords. We used NodeXL to visualize the PFNet. After drawing all the networks, we compared the results with the classification results. Classification was performed to identify how the tone of argument of a newspaper differs from the others. To analyze tones of arguments, all paragraphs were divided into two types of tone, positive and negative. To classify the tones of the collected paragraphs and articles, a supervised learning technique was used: the Naïve Bayesian classifier provided in the MALLET package was applied to classify all paragraphs in the articles. After classification, precision, recall and F-value were used to evaluate the results.
Based on the results of this study, three subjects, Culture, Eco-business and Politics, showed differences in contents and tones of arguments among the three newspapers. In addition, for National issues, the tones of arguments on the 4-major-river project differed from each other. The three newspapers appear to have their own specific tones of argument in those sections. The keyword networks also showed different shapes for the same section in the same period, meaning that the keywords appearing frequently in the articles differ and the contents are composed of different keywords. The positive-negative classification demonstrated that a newspaper's tone of argument can be distinguished from the others'. These results indicate that the approach of this study is promising as a new tool for identifying the different tones of arguments of newspapers.
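
As a rough sketch of the tone-classification step, the following example trains a multinomial Naïve Bayes classifier on labeled paragraphs and reports precision, recall and F-value. It substitutes scikit-learn for the MALLET package the authors used, and the paragraph texts and labels are hypothetical placeholders, not the study's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

# Toy labeled paragraphs standing in for the hand-labeled news paragraphs.
train_texts = ["great policy success praised", "project failure criticized harshly",
               "welcome agreement reached", "scandal condemned by critics"]
train_labels = ["positive", "negative", "positive", "negative"]
test_texts = ["policy praised as success", "failure condemned"]
test_labels = ["positive", "negative"]

vectorizer = CountVectorizer()  # bag-of-words term counts per paragraph
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)
pred = clf.predict(vectorizer.transform(test_texts))

p, r, f, _ = precision_recall_fscore_support(
    test_labels, pred, average="binary", pos_label="positive")
print(f"Precision={p:.2f} Recall={r:.2f} F-value={f:.2f}")
```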

A Classification Model for Illegal Debt Collection Using Rule and Machine Learning Based Methods

  • Kim, Tae-Ho;Lim, Jong-In
    • Journal of the Korea Society of Computer and Information
    • /
    • v.26 no.4
    • /
    • pp.93-103
    • /
    • 2021
  • Despite the efforts of financial authorities in directly managing and supervising collection agents and in issuing bond-collecting guidelines, illegal and unfair debt collection still exists. To effectively prevent such activities, we need a method for strengthening the monitoring of illegal collection even with little manpower, using technologies such as machine learning on unstructured data. In this study, we propose a classification model for illegal debt collection that combines machine learning, such as the Support Vector Machine (SVM), with a rule-based technique that obtains the collection transcripts of loan companies and converts them into text data to identify illegal activities. The study also compares identification accuracy across machine learning algorithms. It shows that combining rule-based illegality rules with machine learning for classification achieves higher accuracy than the classification model of a previous study that applied machine learning alone. This study is the first attempt to classify illegalities by combining rule-based detection rules with machine learning. If further research improves the model's completeness, it will contribute greatly to preventing consumer damage from illegal debt collection activities.
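
A minimal sketch of how a rule-based filter can be combined with an SVM text classifier in the way the abstract describes. The rule patterns, transcripts and labels below are invented placeholders, not the authors' actual rule set or data.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical illegality rules; the paper's actual rule set is not public here.
ILLEGAL_PATTERNS = [r"threat", r"midnight call", r"contact your employer"]

def rule_flag(text: str) -> bool:
    """True if any rule-based illegality pattern fires on the transcript."""
    return any(re.search(p, text) for p in ILLEGAL_PATTERNS)

# Toy collection-call transcripts with labels (1 = illegal, 0 = legal).
transcripts = [
    "repay soon or we will contact your employer",
    "please repay the overdue amount by friday",
    "we will threat you at home with a midnight call",
    "your payment plan has been updated as requested",
]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
svm = LinearSVC().fit(vec.fit_transform(transcripts), labels)

def classify(text: str) -> int:
    # A rule hit classifies immediately; otherwise fall back to the SVM.
    if rule_flag(text):
        return 1
    return int(svm.predict(vec.transform([text]))[0])

print(classify("we may contact your employer about this debt"))  # rule hit -> 1
```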

The Analysis of Spectral Characteristics of Water Quality Factors Using Airborne MSS Data (Airborne MSS 자료를 이용한 수질인자의 분광특성 분석)

  • Dong-Ho Jang;Gi-Ho Jo;Kwang-Hoon Chi
    • Korean Journal of Remote Sensing
    • /
    • v.14 no.3
    • /
    • pp.296-306
    • /
    • 1998
  • Airborne MSS data are regarded as a potentially effective data source for measuring water quality and monitoring environmental change in water bodies. In this study, we measured radiance reflectance using multi-spectral images from the low-resolution camera (LRC) to be carried on the multi-purpose satellite (KOMPSAT), in order to use the data in analyzing water pollution. We also investigated the possibility of extracting water quality factors in water bodies using high-resolution remote sensing data such as Airborne MSS. In particular, we tried to extract environmental factors related to eutrophication, such as chlorophyll-a, suspended sediments and turbidity, and to develop the processing technique and characterize the radiance reflectance related to eutrophication. Although it was difficult to explicitly correlate the Airborne MSS data with water quality factors due to the insufficient number of ground truth data, the results can be summarized as follows. First, the solar spectrum reaching the earth's surface was concentrated in the visible bands of 0.4 μm~0.7 μm, which contained about 50% of the total quantity of radiation; the spectrum peaked around the 0.5 μm green band. Second, the radiance reflectance of chlorophyll-a was high mainly around the 0.52 μm green band, while suspended sediments and turbidity peaked at 0.8 μm and 0.57 μm, respectively. Finally, in the water quality analysis using Airborne MSS, a chlorophyll-a distribution image could be obtained by ratioing bands B3 and B5 to B7. Band 7 was useful for making the distribution image of suspended sediments. When we carried out PCA, suspended sediments and turbidity had distributions at PC 1 and PC 4 similar to the ground data. These results can change with season and time. Therefore, to analyze the environmental factors of water quality using LRC data more exactly, we need to investigate the ground data and the radiance reflectance of water bodies continuously. For further studies, we will continue to analyze the radiance characteristics of the water surface by measuring on-the-spot radiance reflectance and using low-resolution satellite images (SeaWiFS). We will also gather water quality measurements in water bodies and analyze the patterns of water pollution.
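
The band-ratio and PCA steps can be illustrated with NumPy and scikit-learn as below. The image cube is synthetic, and the exact ratio formula is an assumption paraphrased from the abstract ("ratio of B3 and B5 to B7"), not the authors' published processing chain.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic Airborne MSS cube: (rows, cols, bands), bands numbered 1..7.
cube = np.random.rand(100, 100, 7).astype(np.float32)
b3, b5, b7 = cube[..., 2], cube[..., 4], cube[..., 6]

eps = 1e-6  # guard against division by zero
chl_ratio = (b3 / (b7 + eps) + b5 / (b7 + eps)) / 2.0  # chlorophyll-a ratio image
ss_image = b7                                          # band 7 ~ suspended sediments

# PCA over pixels; the abstract reports PC1 and PC4 tracking sediments/turbidity.
pixels = cube.reshape(-1, 7)
pcs = PCA(n_components=4).fit_transform(pixels)
pc1 = pcs[:, 0].reshape(100, 100)
pc4 = pcs[:, 3].reshape(100, 100)
print(chl_ratio.shape, ss_image.shape, pc1.shape, pc4.shape)
```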

An Expert System for the Estimation of the Growth Curve Parameters of New Markets (신규시장 성장모형의 모수 추정을 위한 전문가 시스템)

  • Lee, Dongwon;Jung, Yeojin;Jung, Jaekwon;Park, Dohyung
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.17-35
    • /
    • 2015
  • Demand forecasting is the activity of estimating the quantity of a product or service that consumers will purchase over a certain period of time. Developing precise forecasting models is considered important, since corporations can make strategic decisions on new markets based on the future demand the models estimate. Many studies have developed market growth curve models, such as the Bass, Logistic and Gompertz models, which estimate future demand when a market is in its early stage. Among these, the Bass model, which explains demand through two types of adopters, innovators and imitators, has been widely used in forecasting. Such models require sufficient demand observations to ensure qualified results. In the beginning of a new market, however, observations are insufficient for the models to precisely estimate the market's future demand. For this reason, demands inferred from the most adjacent markets are often used as references in such cases. Reference markets can be those whose products are developed with the same categorical technologies. A market's demand may be expected to follow a pattern similar to that of a reference market when the adoption pattern of the market's product is determined mainly by the related technology. However, this process may not always ensure satisfactory results, because the similarity between markets is judged by intuition and/or experience. There are two major drawbacks that human experts cannot effectively handle in this approach: the abundance of candidate reference markets to consider, and the difficulty of calculating the similarity between markets. First, there can be too many markets to consider when selecting reference markets. Mostly, markets in the same category of an industrial hierarchy can serve as reference markets because they are usually based on similar technologies. However, markets can be classified into different categories even if they are based on the same generic technologies, so markets in other categories also need to be considered as potential candidates. Next, even domain experts cannot consistently calculate the similarity between markets with their own qualitative standards. The inconsistency implies missing adjacent reference markets, which may lead to imprecise estimation of future demand. Even when no reference markets are missing, the new market's parameters can hardly be estimated from the reference markets without quantitative standards. For this reason, this study proposes a case-based expert system that helps experts overcome these drawbacks in discovering reference markets. First, the study proposes the use of the Euclidean distance measure to calculate the similarity between markets. Based on their similarities, markets are grouped into clusters, and then missing markets with the characteristics of the cluster are searched for. Potential candidate reference markets are extracted and recommended to users. After iterating these steps, definite reference markets are determined according to the user's selection among the candidates, and finally the new market's parameters are estimated from the reference markets. Two techniques are used in this procedure: the clustering technique of data mining, and the content-based filtering technique of recommender systems. The proposed system, implemented with these techniques, can determine the most adjacent markets based on whether a user accepts the candidate markets.
Experiments involving five ICT experts were conducted to validate the usefulness of the system. The experts were given a list of 16 ICT markets whose parameters were to be estimated. For each market, the experts estimated the parameters of its growth curve model first by intuition and then with the system. A comparison of the results shows that the parameters estimated with the system are closer to the actual parameters than those the experts guessed without it.
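
A minimal sketch of the similarity and clustering core of such a system: market feature vectors are clustered with k-means, and candidate reference markets are ranked by Euclidean distance to the new market. The market names and feature values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical markets with normalized feature vectors (e.g. price level,
# adoption speed, technology-category indicator).
markets = ["smartphone", "tablet", "smartwatch", "e-reader"]
features = np.array([
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.1],
    [0.5, 0.4, 0.2],
    [0.3, 0.5, 0.3],
])

# Group similar markets into clusters, then rank candidates for a new market
# by Euclidean distance, as the proposed system does before user confirmation.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

new_market = np.array([0.8, 0.7, 0.1])  # feature vector of the new market
dists = np.linalg.norm(features - new_market, axis=1)
ranked = [markets[i] for i in np.argsort(dists)]
print("clusters:", dict(zip(markets, clusters)))
print("candidate reference markets, nearest first:", ranked)
```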

KNU Korean Sentiment Lexicon: Bi-LSTM-based Method for Building a Korean Sentiment Lexicon (Bi-LSTM 기반의 한국어 감성사전 구축 방안)

  • Park, Sang-Min;Na, Chul-Won;Choi, Min-Seong;Lee, Da-Hee;On, Byung-Won
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.4
    • /
    • pp.219-240
    • /
    • 2018
  • Sentiment analysis, one of the text mining techniques, is a method for extracting subjective content embedded in text documents. Sentiment analysis methods have recently been widely used in many fields: for example, data-driven surveys are based on analyzing the subjectivity of text posted by users, and market research is conducted by analyzing users' review posts to quantify their assessment of a target product. The basic method of sentiment analysis is to use a sentiment dictionary (or lexicon), a list of sentiment vocabularies with positive, neutral, or negative semantics. In general, the meaning of many sentiment words is likely to differ across domains. For example, the sentiment word 'sad' carries a negative meaning in most domains, but not necessarily in the movie domain. To perform accurate sentiment analysis, we need to build a sentiment dictionary for the given domain. However, building such a lexicon is time-consuming, and many sentiment vocabularies are left out unless a general-purpose sentiment lexicon is used as a basis. To address this problem, several studies have constructed sentiment lexicons for specific domains based on 'OPEN HANGUL' and 'SentiWordNet', which are general-purpose sentiment lexicons. However, OPEN HANGUL is no longer in service, and SentiWordNet does not work well because of language differences in converting Korean words into English words. These restrictions limit the use of such general-purpose sentiment lexicons as seed data for building the sentiment lexicon of a specific domain. In this article, we construct the 'KNU Korean Sentiment Lexicon (KNU-KSL)', a new general-purpose Korean sentiment dictionary that is more advanced than existing general-purpose lexicons. The proposed dictionary, a list of domain-independent sentiment words such as 'thank you', 'worthy', and 'impressed', is built to quickly construct the sentiment dictionary for a target domain. In particular, it constructs sentiment vocabularies by analyzing the glosses contained in the Standard Korean Language Dictionary (SKLD) through the following procedure. First, we propose a sentiment classification model based on Bidirectional Long Short-Term Memory (Bi-LSTM). Second, the proposed deep learning model automatically classifies each gloss as having either a positive or a negative meaning. Third, positive words and phrases are extracted from the glosses classified as positive, while negative words and phrases are extracted from the glosses classified as negative. Our experimental results show that the average accuracy of the proposed sentiment classification model is up to 89.45%. In addition, the sentiment dictionary is further extended using various external sources, including SentiWordNet, SenticNet, Emotional Verbs, and Sentiment Lexicon 0603. Furthermore, we add sentiment information about frequently used coined words and emoticons that appear mainly on the Web. The KNU-KSL contains a total of 14,843 sentiment vocabularies, each of which is a 1-gram, 2-gram, phrase, or sentence pattern. Unlike existing sentiment dictionaries, it is composed of words that are not affected by particular domains. The recent trend in sentiment analysis is to use deep learning techniques without sentiment dictionaries, and the importance of developing sentiment dictionaries has gradually declined.
However, one recent study shows that the words in a sentiment dictionary can be used as features of deep learning models, resulting in sentiment analysis with higher accuracy (Teng, Z., 2016). This result indicates that a sentiment dictionary serves not only for sentiment analysis itself but also as a source of features that improve the accuracy of deep learning models. The proposed dictionary can be used as basic data for constructing the sentiment lexicon of a particular domain and as a source of features for deep learning models. It is also useful for automatically and quickly building large training sets for deep learning models.
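
A compact Keras sketch of a Bi-LSTM gloss classifier of the kind described above. The vocabulary size, layer dimensions and toy data are assumptions; the authors' exact architecture and hyperparameters are not reproduced here.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, max_len = 5000, 30  # assumed sizes, not the paper's settings

model = Sequential([
    Embedding(vocab_size, 64),      # token ids -> dense vectors
    Bidirectional(LSTM(32)),        # read each gloss in both directions
    Dense(1, activation="sigmoid")  # P(gloss has positive meaning)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data: integer-encoded glosses; 1 = positive meaning, 0 = negative.
X = np.random.randint(1, vocab_size, size=(100, max_len))
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=1, verbose=0)
```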

The Analysis of the Road Freight Transportation Using the Simultaneous Demand-Supply Model (수요-공급의 동시모형을 통한 공로 화물운송특성분석)

  • 장수은;이용택;지준호
    • Journal of Korean Society of Transportation
    • /
    • v.19 no.4
    • /
    • pp.7-18
    • /
    • 2001
  • This study represents a first attempt in Korea to develop a simultaneous freight supply-demand model that considers the relationship between freight supply and demand. Whereas existing studies were limited to one area, or treated supply and demand separately under the assumption that they do not affect each other, this study takes into consideration the fact that demand affects supply and, simultaneously, vice versa. This approach allows us to diagnose current policy and helps us to formulate reasonable alternatives for an effective freight transportation system. To find the relationship between supply and demand, we use econometric methods, structural equation theory and the two-stage least-squares (2SLS) estimation technique, which involves two successive applications of OLS, to get rid of bias. Based on domestic freight data, this study considers as explanatory variables population (P), the number of industries (IN), the production of the mining and manufacturing industries (MMI), the effectiveness rate of freight capacity (LE) and the empty-carriage operating distance (VC). The model describes the simultaneous process of the freight supply-demand system well, in that an increase in VC raises cargo capacity and cargo capacity in turn augments VC. Moreover, the analysis shows that the increment of VC due to the increase in cargo capacity is larger than the reduction of VC owing to the increase in the quantity of goods. Therefore, alternative policies are needed from both short-run and long-run points of view: to promote the effectiveness of the freight transportation system, short-term supply control and long-run logistics infrastructure are urgently required, based on the restoration of a market economy through successive deregulation. We conclude that gradual deregulation is more desirable for building an effective freight market.
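
Two-stage least squares as described (two successive applications of OLS) can be sketched in a few lines of NumPy. The toy demand and supply equations and coefficients below are placeholders, not the paper's estimated model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Exogenous variables (toy stand-ins for P, IN, MMI, LE in the abstract).
P, IN, MMI, LE = (rng.normal(size=n) for _ in range(4))
demand = 0.5 * P + 0.3 * MMI + rng.normal(size=n)  # endogenous freight demand
supply = 0.4 * demand + 0.2 * LE + rng.normal(size=n)  # supply depends on demand

def ols(X, y):
    """OLS with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: regress the endogenous regressor on all exogenous instruments.
Z = np.column_stack([P, IN, MMI, LE])
demand_hat = np.column_stack([np.ones(n), Z]) @ ols(Z, demand)

# Stage 2: replace demand with its fitted values to remove simultaneity bias.
coefs = ols(np.column_stack([demand_hat, LE]), supply)
print("supply equation (const, demand, LE):", coefs)
```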

A Technique to Recommend Appropriate Developers for Reported Bugs Based on Term Similarity and Bug Resolution History (개발자 별 버그 해결 유형을 고려한 자동적 개발자 추천 접근법)

  • Park, Seong Hun;Kim, Jung Il;Lee, Eun Joo
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.3 no.12
    • /
    • pp.511-522
    • /
    • 2014
  • During software development, a variety of bugs are reported. Bug tracking systems such as Bugzilla, MantisBT, Trac, and JIRA are used to manage reported bug information in many open source development projects. Bug reports in a bug tracking system are triaged to manage the bugs and to determine the developer responsible for resolving each report. As software grows in size and bug reports tend to be duplicated, bug triage becomes more and more complex and difficult. In this paper, we present an approach to assigning bug reports to appropriate developers, which is a main part of the bug triage task. First, the words included in resolved bug reports are grouped by the developer who resolved them. Second, the words in newly filed bug reports are selected. After these two steps, vectors whose items are the selected words are generated. Third, the TF-IDF (term frequency-inverse document frequency) of each selected word is computed and used as the weight of the corresponding vector item. Finally, developers are recommended based on the similarity between each developer's word vector and the vector of the new bug report. We conducted an experiment on the Eclipse JDT and CDT projects to show the applicability of the proposed approach, and compared it with an existing study based on machine learning. The experimental results show that the proposed approach is superior to the existing method.
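
A minimal sketch of the recommendation step: each developer's resolved-report text is turned into a TF-IDF vector, and a new report is matched to developers by cosine similarity. Developer names and report texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

resolved = {  # concatenated text of bug reports each developer has resolved
    "alice": "null pointer exception in parser crash on startup",
    "bob": "ui layout broken dark theme rendering glitch",
}
new_report = "parser crashes with null pointer on malformed input"

vec = TfidfVectorizer()
dev_matrix = vec.fit_transform(resolved.values())  # one row per developer
query = vec.transform([new_report])

scores = cosine_similarity(query, dev_matrix).ravel()
ranking = sorted(zip(resolved, scores), key=lambda t: -t[1])
print("recommended developers, best match first:", ranking)
```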

Technology Planning through Technology Roadmap: Application of Patent Citation Network (기술로드맵을 통한 기술기획: 특허인용네트워크의 활용)

  • Jeong, Yu-Jin;Yoon, Byung-Un
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.12 no.11
    • /
    • pp.5227-5237
    • /
    • 2011
  • A technology roadmap is a powerful tool that captures the relationships among technology, product and market, and is widely used to support technology strategy and planning. Numerous studies have attempted to develop technology roadmaps and case studies for specific technology areas. However, many of these studies have depended on qualitative methods, such as expert-group brainstorming and discussion or the Delphi technique, rather than on systematic, quantitative analysis. To overcome this limitation, this paper employs patent analysis, a distinctly quantitative approach. It proposes a new technology roadmapping method based on the patent citation network that considers the technology life cycle, and suggests planning for technologies that are undeveloped but considered promising. First, patent data and citation information are collected, and the patent citation network is built from the collected patent information. Second, we determine the stage of a technology in its life cycle by considering patent application years and the technology life cycle, and estimate the duration of technology development. In addition, subsequent technologies are grouped as nodes of a super-level technology to show the evolution of the technology over the period. Finally, a technology roadmap is drawn by linking these technology nodes in a technology layer and estimating the duration of development. Based on the technology roadmap, technology planning is conducted to identify undeveloped technologies through text mining, and this paper suggests the characteristics of technologies that need to be developed in the future. To illustrate the proposed approach, hydrogen storage technology is selected as the case.
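
A small networkx sketch of building a patent citation network and reading off two roadmap ingredients: heavily cited patents (candidates for super-level technology nodes) and a time ordering by application year. All patent IDs, years and citations are invented.

```python
import networkx as nx

citations = [  # (citing patent, cited patent)
    ("P3", "P1"), ("P3", "P2"), ("P4", "P3"), ("P5", "P3"),
]
app_year = {"P1": 2001, "P2": 2002, "P3": 2005, "P4": 2008, "P5": 2009}

G = nx.DiGraph()
G.add_edges_from(citations)
nx.set_node_attributes(G, app_year, "year")

# Heavily cited patents suggest nodes to group into a super-level technology.
indegree = sorted(G.in_degree(), key=lambda t: -t[1])
print("most-cited patents:", indegree)

# Order nodes by application year to lay them out along the roadmap time axis.
timeline = sorted(G.nodes, key=lambda p: G.nodes[p]["year"])
print("roadmap order:", timeline)
```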

Analysis of Research Trends in Tax Compliance using Topic Modeling (토픽모델링을 활용한 조세순응 연구 동향 분석)

  • Kang, Min-Jo;Baek, Pyoung-Gu
    • The Journal of the Korea Contents Association
    • /
    • v.22 no.1
    • /
    • pp.99-115
    • /
    • 2022
  • In this study, domestic academic journal papers on tax compliance, tax consciousness, and faithful tax payment (hereinafter 'tax compliance') were comprehensively analyzed from an interdisciplinary perspective, as tax compliance is a representative research topic in the field of tax science. To achieve the research purpose, a topic modeling technique was applied as part of text mining. Following the flow of data collection, keyword preprocessing, and topic model analysis, potential research topics were derived from the tax-compliance-related keywords registered by the researchers of a total of 347 papers. The results of this study can be summarized as follows. First, in the keyword analysis, keywords such as tax investigation, tax avoidance, and honest tax reporting system were among the top five keywords both by simple term frequency and by TF-IDF value, which considers the relative importance of keywords. On the other hand, the keyword tax evasion ranked among the top keywords by TF-IDF value but was not highlighted by simple term frequency. Second, eight potential research topics were derived through topic modeling: (1) tax fairness and suppression of tax offenses, (2) the ideology of the tax law and the validity of tax policies, (3) the principle of substance over form and the guarantee of tax receivables, (4) tax compliance costs and tax administration services, (5) the tax return self-assessment system and tax experts, (6) tax climate and strategic tax behavior, (7) multifaceted tax behavior and differential compliance intentions, and (8) tax information systems and tax resource management. By examining the various perspectives on tax compliance from an interdisciplinary standpoint, this research comprehensively grasps past research trends on tax compliance and suggests directions for future research.
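
A minimal gensim sketch of the keyword-based topic modeling pipeline: author keywords per paper form the documents, and LDA extracts latent topics. The keyword lists are placeholders for the registered keywords of the 347 papers.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [  # each paper reduced to its preprocessed keyword list (hypothetical)
    ["tax", "investigation", "avoidance"],
    ["honest", "tax", "reporting", "system"],
    ["tax", "evasion", "penalty"],
    ["compliance", "cost", "administration", "service"],
]

dictionary = Dictionary(docs)                      # keyword -> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words per paper

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)  # top keywords weighting each latent topic
```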

A Topic Modeling Approach to the Analysis of Seniors' Happiness and Unhappiness in Korea (토픽 모델링 기반 한국 노인의 행복과 불행 이슈 분석)

  • Dong ji Moon;Dine Yon;Hee-Woong Kim
    • Information Systems Review
    • /
    • v.20 no.2
    • /
    • pp.139-161
    • /
    • 2018
  • As Korea became one of the most aged countries in the world, successful aging emerged as an important issue for individuals as well as for society. This study aims to determine not only the factors of Korean seniors' happiness and unhappiness but also the means to enhance their happiness and deal with their unhappiness. We collected news articles related to the happiness and unhappiness of seniors using nine keywords based on Alderfer's ERG theory. We then applied a topic modeling technique, Latent Dirichlet Allocation, to examine the main issues underlying seniors' happiness and unhappiness. Based on the analysis, we investigated the conditions of happiness and unhappiness by inspecting the topics derived for each keyword, and conducted a detailed analysis of the main factors from the topic modeling. We propose specific ways to increase seniors' happiness and to overcome their unhappiness from the perspectives of government, corporations, families, and social welfare organizations. This study identifies the major factors that affect the happiness and unhappiness of seniors, and the additional analysis suggests specific methods to boost happiness and relieve unhappiness.
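
A sketch of the per-keyword topic inspection with scikit-learn's LDA, mirroring the study's keyword-grouped analysis. The keywords and article snippets are toy placeholders loosely inspired by ERG categories, not the collected news data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles_by_keyword = {  # hypothetical articles grouped by search keyword
    "health": ["senior health exercise program", "chronic illness care cost"],
    "loneliness": ["living alone social isolation", "community visit volunteer"],
}

for keyword, docs in articles_by_keyword.items():
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for k, comp in enumerate(lda.components_):
        top = [terms[i] for i in comp.argsort()[-3:][::-1]]  # top topic terms
        print(keyword, "topic", k, top)
```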