• Title/Summary/Keyword: news data

Search Results: 888

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.21-44 / 2018
  • In recent years, the rapid development of Internet technology and the popularization of smart devices have produced massive amounts of text data, distributed through media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. However, this enormous amount of easily obtained information lacks organization, a problem that has drawn the interest of many researchers and created demand for professionals capable of classifying relevant information; hence, text classification was introduced. Text classification is a challenging task in modern data analysis, in which a text document must be assigned to one or more predefined categories or classes. Various techniques are available for text classification, such as K-Nearest Neighbor, the Naïve Bayes algorithm, Support Vector Machines, Decision Trees, and Artificial Neural Networks. However, when dealing with huge amounts of text data, model performance and accuracy become a challenge: depending on the type of words used in the corpus and the type of features created for classification, the performance of a text classification model can vary. Most previous attempts have proposed a new algorithm or modified an existing one, a line of research that can be said to have reached its limits for further improvement. In this study, rather than proposing or modifying an algorithm, we focus on modifying how the data are used. It is widely known that classifier performance is influenced by the quality of the training data on which the classifier is built, and real-world datasets usually contain noise that can affect the decisions made by classifiers built from them.
In this study, we consider that data from different domains, i.e., heterogeneous data, may have noise-like characteristics that can be utilized in the classification process. Classifiers are usually built on the assumption that the characteristics of the training data and the target data are the same or very similar. However, for unstructured data such as text, the features are determined by the vocabulary of the documents; if the viewpoints of the training data and target data differ, their features may differ as well. We therefore attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into its construction. Because data from various sources are likely to be formatted differently, traditional machine learning algorithms struggle: they are not designed to recognize multiple data representations at once and combine them into a single generalization. To utilize heterogeneous data in the learning process of a document classifier, we therefore apply semi-supervised learning. However, unlabeled data may degrade classifier performance, so we further propose a method called the Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA), which selects only the documents that contribute to improving the classifier's accuracy. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data; the most confident classification rules are selected and applied for the final decision.
In this paper, three types of real-world data sources were used: news, Twitter, and blogs.
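The confidence-based selection step that such an ensemble performs can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the toy "views", their keyword scoring, and the threshold are all invented for the example.

```python
# Toy sketch of confidence-based pseudo-labeling across multiple "views".
# Each view is a classifier returning (label, confidence) for a document.

def select_confident(views, unlabeled_docs, threshold=0.9):
    """Keep only documents on which every view agrees with high confidence."""
    selected = []
    for doc in unlabeled_docs:
        preds = [view(doc) for view in views]            # [(label, conf), ...]
        labels = {label for label, _ in preds}
        min_conf = min(conf for _, conf in preds)
        if len(labels) == 1 and min_conf >= threshold:   # unanimous and confident
            selected.append((doc, labels.pop()))
    return selected

# Two hypothetical views: keyword scorers over different vocabularies,
# standing in for classifiers trained on heterogeneous sources.
def news_view(doc):
    score = sum(w in doc for w in ("election", "policy")) / 2
    return ("politics", score) if score > 0 else ("other", 1.0)

def blog_view(doc):
    score = sum(w in doc for w in ("vote", "election")) / 2
    return ("politics", score) if score > 0 else ("other", 1.0)

docs = ["election policy vote", "cooking recipe", "election news"]
pseudo_labeled = select_confident([news_view, blog_view], docs, threshold=0.9)
```

With the 0.9 threshold, the third document is rejected because both views label it "politics" with only 0.5 confidence; only confidently and unanimously labeled documents enter the training set.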

Application of a Topic Model on the Korea Expressway Corporation's VOC Data (한국도로공사 VOC 데이터를 이용한 토픽 모형 적용 방안)

  • Kim, Ji Won;Park, Sang Min;Park, Sungho;Jeong, Harim;Yun, Ilsoo
    • Journal of Information Technology Services / v.19 no.6 / pp.1-13 / 2020
  • Recently, 80% of big data consists of unstructured text data. In particular, various types of documents are stored as large-scale unstructured documents through social network services (SNS), blogs, news, and the like, highlighting the importance of unstructured data. As the possibilities for using unstructured data grow, various analysis techniques such as text mining have recently appeared. In this study, a topic modeling technique was applied to the Korea Expressway Corporation's voice of customer (VOC) data, which includes customer opinions and complaints. Currently, VOC data are divided according to the Corporation's business areas; however, the assigned categories are often inaccurate, and ambiguous cases are classified as "other". A more systematic and efficient classification method for VOC data is therefore required if they are to be used for efficient service improvement. To this end, this study proposes two approaches: a method using only latent Dirichlet allocation (LDA), the most representative topic modeling technique, and a new method combining LDA with the word embedding technique Word2vec. The results confirm that the categories of VOC data are classified relatively well by the new method. These results suggest that implications can be derived for the Korea Expressway Corporation and utilized for service improvement.
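One way such an LDA-plus-embedding combination can work is to assign a document to the category whose topic keywords lie nearest in embedding space. The sketch below is purely illustrative: the two-dimensional word vectors, topic keyword sets, and category names are invented stand-ins, not the paper's data or method.

```python
import math

# Toy word vectors standing in for Word2vec output (illustrative values only).
VEC = {
    "toll":    [0.9, 0.1], "fee":  [0.85, 0.2], "charge": [0.8, 0.15],
    "pothole": [0.1, 0.9], "road": [0.2, 0.8],  "repair": [0.15, 0.85],
}

# Topic keyword sets as LDA might produce for two hypothetical VOC categories.
TOPICS = {"billing": ["toll", "fee"], "maintenance": ["pothole", "repair"]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def classify(doc_words):
    """Score each topic by mean embedding similarity between doc and topic keywords."""
    def score(topic_words):
        pairs = [(w, t) for w in doc_words if w in VEC for t in topic_words]
        return sum(cosine(VEC[w], VEC[t]) for w, t in pairs) / len(pairs)
    return max(TOPICS, key=lambda name: score(TOPICS[name]))

category = classify(["pothole", "road"])
```

Embedding similarity lets a complaint mentioning "charge" match the billing topic even when it shares no exact keyword with it, which is the kind of gain over keyword-only LDA assignment the abstract reports.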

Dynamic Video Object Data Model (DVID) (동적 비디오 객체 데이터 모델(DVID))

  • Song, Yong-Jun;Kim, Hyeong-Ju
    • Journal of KIISE:Software and Applications / v.26 no.9 / pp.1052-1060 / 1999
  • A lot of research has been done on modeling video databases, but all existing models can be considered static video data models in that, absent user interaction, their video data are always presented in predefined sequences. Video database applications that provide up-to-date video information services, such as news-on-demand, video-on-demand, digital libraries, and Internet shopping, require frequent video editing, preferably in real time. To support this, the contents of existing video data must be changed or new video data must be created, but in traditional video data models such editing work had to be done manually. To reduce the effort of video editing, this paper proposes a dynamic video object data model named DVID, based on the object-oriented data model. DVID provides not only the conventional static video object but also a dynamic video object whose contents are dynamically determined from the video database in real time, even without user intervention.
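The static/dynamic distinction the model draws can be sketched in a few lines. The class and variable names below are illustrative, not from the paper: a static object fixes its clip sequence at authoring time, while a dynamic object stores a query that is resolved against the video database at presentation time.

```python
# Sketch of the static vs. dynamic video object distinction (names illustrative).

class StaticVideoObject:
    """Presentation order fixed at authoring time."""
    def __init__(self, clips):
        self.clips = list(clips)
    def play(self, database):
        return self.clips

class DynamicVideoObject:
    """Content resolved against the video database at presentation time."""
    def __init__(self, query):
        self.query = query  # callable: database -> list of clips
    def play(self, database):
        return self.query(database)

db = {"news": ["clip_a", "clip_b"], "sports": ["clip_c"]}
headline = DynamicVideoObject(lambda d: d["news"][-1:])  # always the newest news clip
first = headline.play(db)
db["news"].append("clip_d")        # database updated; no manual re-editing needed
second = headline.play(db)         # now resolves to the newly added clip
```

The point of the design is that updating the database alone changes what the dynamic object presents, which is the manual editing step the static models required.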

Current Issues with the Big Data Utilization from a Humanities Perspective (인문학적 관점으로 본 빅데이터 활용을 위한 당면 문제)

  • Park, Eun-ha;Jeon, Jin-woo
    • The Journal of the Korea Contents Association / v.22 no.6 / pp.125-134 / 2022
  • This study aims to critically discuss the problems that must be solved, from a humanities perspective, in order to utilize big data. It identifies and discusses three research problems that may arise in collecting, processing, and using big data. First, regarding problems with the data itself, it examines fake information in circulation, specifically article-type advertisements and politically related fake news. Second, algorithmic discrimination is cited as a problem with big data processing and its results, illustrated by portal-site searches for engineers. Finally, problems related to the invasion of personal information are examined in three categories: the right to privacy, the right to informational self-determination, and the right to be forgotten. This study is meaningful in that it points out the problems facing big data utilization from a humanities perspective in the era of big data and discusses possible problems in the collection, processing, and use of big data.

A Case Study on the Collective Activities of Spectators at Professional Soccer Games (프로축구 관중의 집합행동 사례연구)

  • Joo, Il-Yeob
    • Korean Security Journal / no.6 / pp.195-213 / 2003
  • This is a case study of the violent collective activities of spectators at professional soccer games. The purpose of this study is to examine the causes and processes of such activities. To this end, two cases of violent collective activity by spectators at professional soccer games held on June 24 and July 28, 2001, were analyzed. The data were collected from 14 daily newspapers published from June 10 to August 20, 2001, to ensure the objective validity of the outline, causes, and processes of the violent collective activities in the two cases. The data were retrieved from the Korea Integrated News Database System (KINDS) of the Korea Press Foundation and from Chollian, the online service of DACOM, for efficiency, accuracy, and promptness. On the basis of this method and the results of the data analysis, the following conclusions were reached. The causes of the violent collective activities of spectators at professional soccer games included the mass effect, the actions of players and umpires, and the results of the games. Such activities require preconditions and are related to specific events that develop in a regular sequence; in other words, one collective activity affects another directly or indirectly. This study therefore shows that the damage caused by violent collective spectator activity at sporting events can be reduced or prevented by analyzing the processes of collective activities and preparing countermeasures in advance.


R&D Perspective Social Issue Packaging using Text Analysis

  • Wong, William Xiu Shun;Kim, Namgyu
    • Journal of Information Technology Services / v.15 no.3 / pp.71-95 / 2016
  • In recent years, text mining has been used to extract meaningful insights from the large volume of unstructured text data sets of various domains. As one of the most representative text mining applications, topic modeling has been widely used to extract main topics in the form of a set of keywords extracted from a large collection of documents. In general, topic modeling is performed according to the weighted frequency of words in a document corpus. However, general topic modeling cannot discover the relation between documents if the documents share only a few terms, although the documents are in fact strongly related from a particular perspective. For instance, a document about "sexual offense" and another document about "silver industry for aged persons" might not be classified into the same topic because they may not share many key terms. However, these two documents can be strongly related from the R&D perspective because some technologies, such as "RF Tag," "CCTV," and "Heart Rate Sensor," are core components of both "sexual offense" and "silver industry." Thus, in this study, we attempted to discover the differences between the results of general topic modeling and R&D perspective topic modeling. Furthermore, we package social issues from the R&D perspective and present a prototype system, which provides a package of news articles for each R&D issue. Finally, we analyze the quality of R&D perspective topic modeling and provide the results of inter- and intra-topic analysis.
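The linking idea in the abstract, that two documents sharing few general terms can still overlap strongly on technology terms, can be sketched with set similarity. The word sets and R&D lexicon below are toy stand-ins, not the paper's actual vocabulary or method.

```python
# Sketch: similarity restricted to an R&D vocabulary (toy word sets) can link
# documents that share few general terms, per the "sexual offense" vs.
# "silver industry" example in the abstract.

RND_TERMS = {"rf tag", "cctv", "heart rate sensor"}

def jaccard(a, b):
    """Set overlap: |intersection| / |union| (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc_offense = {"sexual", "offense", "victim", "cctv", "rf tag"}
doc_silver  = {"silver", "industry", "aged", "cctv", "rf tag", "heart rate sensor"}

overall  = jaccard(doc_offense, doc_silver)                           # weak: 2/9
rnd_only = jaccard(doc_offense & RND_TERMS, doc_silver & RND_TERMS)   # strong: 2/3
```

Restricting the comparison to the R&D lexicon triples the similarity here, which is the kind of relation general topic modeling over the full vocabulary would miss.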

The Effect of Corporate Integrity on Stock Price Crash Risk

  • YIN, Hong;ZHANG, Ruonan
    • Asian Journal of Business Environment / v.10 no.1 / pp.19-28 / 2020
  • Purpose: This research aims to investigate the impact of corporate integrity on stock price crash risk. Research design, data, and methodology: Taking 1,419 firms listed on the Shenzhen Stock Exchange in China as a sample, this paper empirically analyzes the relationship between corporate integrity and stock price crash risk. The main integrity data were hand-collected from the Shenzhen Stock Exchange website; other financial data were collected from the CSMAR database. Results: The findings show that corporate integrity significantly decreases stock price crash risk, and the conclusion remains valid after changing the sample selection, the model estimation methods, and the proxy variable for stock price crash risk. Further analysis shows that the relationship holds only in firms with weak internal control and firms in areas with poor legal systems. Conclusions: The results suggest that corporate integrity significantly influences managers' behavior. Business ethics reduces the likelihood that managers overstate financial performance and hide bad news, which lowers the likelihood of future stock price crashes. Meanwhile, corporate integrity can supplement internal control and the legal system in decreasing stock price crash risk.
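The abstract does not name its crash-risk proxy. A standard proxy in this literature is NCSKEW, the negative coefficient of skewness of firm-specific weekly returns, so the sketch below uses it as an illustrative stand-in rather than as the paper's actual measure.

```python
# Sketch of NCSKEW, a common crash-risk proxy (illustrative stand-in only).

def ncskew(weekly_returns):
    """Negative coefficient of skewness of demeaned weekly returns.
    Higher values indicate greater crash risk."""
    n = len(weekly_returns)
    mean = sum(weekly_returns) / n
    w = [r - mean for r in weekly_returns]        # demeaned returns
    s2 = sum(x * x for x in w)
    s3 = sum(x ** 3 for x in w)
    return -(n * (n - 1) ** 1.5 * s3) / ((n - 1) * (n - 2) * s2 ** 1.5)

calm  = [0.01, -0.01, 0.02, -0.02, 0.01, -0.01]   # symmetric weekly returns
crash = [0.01, 0.02, 0.01, 0.02, 0.01, -0.15]     # one large negative week
```

The symmetric series scores zero, while the series with a single large negative week scores positive, matching the intuition that hidden bad news released at once produces left-skewed returns.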

Trend Analysis of the Agricultural Industry Based on Text Analytics

  • Choi, Solsaem;Kim, Junhwan;Nam, Seungju
    • Agribusiness and Information Management / v.11 no.1 / pp.1-9 / 2019
  • This research proposes a methodology for analyzing current trends in agriculture, an industry directly connected to national survival, and uses it to identify the agricultural trends of Korea. Based on the relationships among three types of data (policy reports, academic articles, and news articles), the research derives the major issues contained in each data source through LDA, the representative topic modeling method. By comparing and analyzing the LDA results derived from each data source, this study identifies implications regarding the current agricultural trends of Korea. The methodology can also be applied to industrial trends other than agriculture, and can serve as a basic resource for considering potential future areas through insight into the current situation. The research also builds a database of the profitability of a total of 180 crop types by analyzing the Rural Development Administration's survey of agricultural product income for 115 crop types, its small-land profitability index survey of 53 crop types, and Statistics Korea's survey of production costs for 12 crop types. Furthermore, this research presents the result and development process of a web-based crop introduction decision support system that provides overseas cases of new crop introduction support programs, as well as databases of outstanding business success cases for each crop type researched by agricultural institutions.

Expiration-Day Effects on Index Futures: Evidence from Indian Market

  • SAMINENI, Ravi Kumar;PUPPALA, Raja Babu;MUTHANGI, Ramesh;KULAPATHI, Syamsundar
    • The Journal of Asian Finance, Economics and Business / v.7 no.11 / pp.95-100 / 2020
  • The Nifty Bank Index began trading in the futures and options (F&O) segment of the National Stock Exchange on 13 June 2005. The purpose of this study is to add to the literature by examining the expiration effect on the price volatility and price reversal of the underlying index in India. The historical data used for the current study primarily comprise daily closing prices of Nifty Bank, the only equity sectoral index in India traded in the derivatives market, whose futures contract value is derived from the underlying CNX Bank Index, over the period 1 January 2010 to 31 March 2020. The augmented Dickey-Fuller test was used to check the stationarity of the data, and an ARMA-EGARCH model was employed for the analysis. The empirical results reveal no effect on the mean returns of the underlying index; furthermore, the EGARCH(1,1) model shows a leverage effect in the Bank Index, i.e., negative shocks cause larger fluctuations in the index than positive news of similar magnitude. The outcome of the study indicates that expiration days have no effect on the volatility of the underlying sectoral index, and no price reversal effect is observed once the expiration days are over.
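The leverage effect an EGARCH(1,1) model captures can be illustrated with a single step of its log-variance recursion. The parameter values below are made up for illustration, not the paper's estimates; a negative asymmetry coefficient (gamma) is what makes bad news raise next-period volatility more than good news of equal size.

```python
import math

# One step of the EGARCH(1,1) log-variance recursion (illustrative parameters):
#   ln(sigma_t^2) = omega + beta * ln(sigma_{t-1}^2) + alpha * |z| + gamma * z
def next_log_var(log_var, z, omega=-0.1, beta=0.9, alpha=0.2, gamma=-0.15):
    """Advance the log conditional variance given standardized shock z.
    gamma < 0 encodes the leverage effect: negative shocks raise volatility more."""
    return omega + beta * log_var + alpha * abs(z) + gamma * z

lv = math.log(0.01)                    # current conditional variance of 1%
after_bad  = next_log_var(lv, -2.0)    # large negative standardized shock
after_good = next_log_var(lv, +2.0)    # positive shock of the same magnitude
```

With these parameters the bad-news shock yields a strictly higher next-period variance than the equally sized good-news shock, which is the asymmetry the abstract reports for the Bank Index.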

Dependence of spacecraft anomalies at different orbits on energetic electron and proton fluxes

  • Yi, Kangwoo;Moon, Yong-Jae;Lee, Ensang;Lee, Jae-Ok
    • The Bulletin of The Korean Astronomical Society / v.41 no.1 / pp.45.2-45.2 / 2016
  • In this study we investigate 195 spacecraft anomalies from 1998 to 2010 recorded in the Satellite News Digest (SND). We classify these data according to the type of anomaly: control, power, telemetry, etc. We examine the association between these anomaly data and daily peak particle (electron and proton) flux data from GOES, as well as their occurrence rates. To determine the association, we use two criteria: >10,000 pfu for electron flux and >100 pfu for proton flux. The main results from this study are as follows. First, the number of days satisfying the electron-flux criterion peaks about a week before the anomaly day and decreases from the peak day to the anomaly day, whereas the count for proton flux peaks near the anomaly day. Second, we find a similar pattern for the mean daily peak particle (electron and proton) flux as a function of days before the anomaly day. Third, an examination of multiple-spacecraft anomaly events, which are likely caused by severe space weather effects, shows that anomalies mostly occur either when electron fluxes are in the declining stage or when daily proton peak fluxes are strongly enhanced, consistent with the statistical results above. Our results will be discussed in view of the origins of spacecraft anomalies.
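The threshold-count analysis described in the abstract can be sketched as a simple windowed count. The criteria (>10,000 pfu for electrons, >100 pfu for protons) come from the abstract; the daily flux series below is synthetic, invented only to exercise the counting logic.

```python
# Sketch of counting days that exceed a flux criterion before an anomaly day.
ELECTRON_PFU, PROTON_PFU = 10_000, 100   # criteria stated in the abstract

def days_over_threshold(flux_by_day, anomaly_day, window, threshold):
    """Count days in [anomaly_day - window, anomaly_day) exceeding threshold."""
    start = max(0, anomaly_day - window)
    return sum(flux_by_day[d] > threshold for d in range(start, anomaly_day))

# Synthetic daily peak electron flux: an enhancement about a week before
# a hypothetical anomaly on day 10, then a decline toward the anomaly day.
electron_flux = [500, 800, 12_000, 15_000, 11_000, 9_000,
                 4_000, 2_000, 1_500, 900, 700]
hits = days_over_threshold(electron_flux, anomaly_day=10, window=10,
                           threshold=ELECTRON_PFU)
```

Applied per anomaly and averaged over events, counts like this as a function of days before the anomaly give the peak-a-week-before pattern the study reports for electron flux.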
