• Title/Summary/Keyword: news data

Search results: 888 (processing time: 0.029 seconds)

A Methodology for Automatic Multi-Categorization of Single-Categorized Documents (단일 카테고리 문서의 다중 카테고리 자동확장 방법론)

  • Hong, Jin-Sung;Kim, Namgyu;Lee, Sangwon
    • Journal of Intelligence and Information Systems / v.20 no.3 / pp.77-92 / 2014
  • Recently, the rapid growth of social media and Internet usage has produced enormous numbers of documents containing unstructured data and text. Each document is usually assigned a specific category for users' convenience. In the past, this categorization was performed manually; however, manual categorization not only fails to guarantee accuracy but also requires a great deal of time and expense. Many studies have pursued automatic categorization to overcome these limitations. Unfortunately, most such methods cannot be applied to complex documents with multiple topics, because they assume that each document belongs to exactly one category. To overcome this limitation, some studies have attempted to assign each document to multiple categories; these are limited in turn because their learning process requires training on a multi-categorized document set, so they cannot be applied unless such a training set is available. To remove this requirement of traditional multi-categorization algorithms, we propose a new methodology that extends the category of a single-categorized document to multiple categories by analyzing the relationships among categories, topics, and documents. First, we find the relationship between documents and topics using topic analysis of single-categorized documents. Second, we construct a correspondence table between topics and categories by investigating their relationship. Finally, we calculate a matching score for each document against multiple categories. A document is then classified into a category if and only if its matching score meets a predefined threshold; for example, a document can be classified into the three categories whose matching scores exceed the threshold. The main contribution of our study is that the methodology improves the applicability of traditional multi-category classifiers by generating multi-categorized documents from single-categorized ones. We additionally propose a module for verifying the accuracy of the methodology. For performance evaluation, we conducted intensive experiments with news articles, which are clearly categorized by theme and contain less vulgar language and slang than other typical text documents. We collected news articles published from July 2012 to June 2013. The number of articles varies greatly across categories, both because readers' interest differs by category and because events occur with different frequencies in each category. To minimize distortion from these differences, we extracted 3,000 articles from each of eight categories, for a total of 24,000 articles. The eight categories were "IT Science," "Economy," "Society," "Life and Culture," "World," "Sports," "Entertainment," and "Politics." Using the collected articles, we calculated document/category correspondence scores from the topic/category and document/topic correspondence scores; the document/category score indicates the degree to which a document corresponds to a certain category. As a result, we could present two additional categories for each of the 23,089 documents. Precision, recall, and F-score were 0.605, 0.629, and 0.617, respectively, when only the top predicted category was evaluated, and 0.838, 0.290, and 0.431 when the top three predicted categories were considered. Interestingly, precision, recall, and F-score varied widely across the eight categories.
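The scoring step in the abstract above can be sketched in a few lines of Python; the topic weights, topic/category correspondence table, and threshold below are illustrative assumptions, not values from the paper.

```python
# Sketch of the matching-score idea: a document's topic weights are
# combined with a topic-to-category correspondence table, and every
# category whose score meets a threshold is attached to the document.
# All numbers and labels here are illustrative, not the paper's data.

def category_scores(doc_topics, topic_to_category):
    """Aggregate per-topic weights into per-category matching scores."""
    scores = {}
    for topic, weight in doc_topics.items():
        for category, corr in topic_to_category.get(topic, {}).items():
            scores[category] = scores.get(category, 0.0) + weight * corr
    return scores

def assign_categories(doc_topics, topic_to_category, threshold):
    """Return every category whose matching score meets the threshold."""
    scores = category_scores(doc_topics, topic_to_category)
    return sorted(c for c, s in scores.items() if s >= threshold)

# A single-categorized article whose topics also point elsewhere:
doc = {"smartphone": 0.6, "stock market": 0.4}
table = {
    "smartphone":   {"IT Science": 0.9, "Economy": 0.3},
    "stock market": {"Economy": 0.8},
}
print(assign_categories(doc, table, 0.5))  # → ['Economy', 'IT Science']
```

Because scores from several topics accumulate, a document whose dominant topic belongs to one category can still cross the threshold in a second category, which is exactly how a single-categorized document gains extra categories.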

Mapping Categories of Heterogeneous Sources Using Text Analytics (텍스트 분석을 통한 이종 매체 카테고리 다중 매핑 방법론)

  • Kim, Dasom;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.22 no.4 / pp.193-215 / 2016
  • In recent years, the proliferation of diverse social networking services has led users to employ many mediums simultaneously, depending on their individual purposes and tastes. Moreover, when collecting information on a particular theme, they typically draw on various mediums such as social networking services, Internet news, and blogs. In terms of management, however, each document circulated through these mediums is placed in a different category according to each source's policies and standards, hindering any attempt to conduct research on a specific category across different kinds of sources. For example, documents about applying for foreign travel may be classified under "Information Technology," "Travel," or "Life and Culture" according to each source's particular standard. Likewise, because sources differ in their definitions and levels of specification, similar categories can be named and structured differently from source to source. To overcome these limitations, this study proposes a method for mapping categories between sources with different mediums while leaving each medium's existing category system intact. Specifically, by re-classifying individual documents from the viewpoint of the diverse sources and storing the results as extra attributes, the study proposes a logical layer through which users can search for a document across multiple heterogeneous sources with different category names as if they all belonged to the same source. Experiments on 6,000 news articles collected from two Internet news portals compared accuracy across sources, between supervised and semi-supervised learning, and between homogeneous and heterogeneous learning data. It is particularly interesting that, in some categories, the classification accuracy of semi-supervised learning using heterogeneous learning data proved higher than that of supervised and semi-supervised learning using homogeneous learning data. This study is significant in several respects. First, it proposes a logical plan for integrating and managing heterogeneous mediums with different classification systems while keeping the existing physical classification systems as they are. The results exhibit very different classification accuracies depending on the heterogeneity of the learning data, which should spur further studies on improving the methodology through category-by-category analysis. In addition, demand for searching, collecting, and analyzing documents from diverse mediums is growing, and Internet search is no longer restricted to a single medium; yet because each medium has a different category structure and naming, searching a specific category across heterogeneous mediums is very difficult. The proposed methodology is thus also significant in letting users query all documents according to the category standards of whichever site they select, while preserving each site's own characteristics and structure. The methodology needs further work in the following respects. First, its performance was evaluated only indirectly; future studies should test its accuracy more directly. That is, after documents from a target source are re-classified according to an existing source's category system, the accuracy of that classification should be verified through evaluation by actual users. In addition, classification accuracy should be raised by making the methodology more sophisticated. Finally, the characteristics of categories in which heterogeneous semi-supervised learning outperformed supervised learning deserve study, as understanding them may help in obtaining heterogeneous documents from diverse mediums and in devising ways to improve classification accuracy.
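The "extra attributes" logical layer described above can be sketched as follows; the source names, category schemes, and keyword-lookup classifier are invented stand-ins for the paper's trained text classifiers.

```python
# Minimal sketch of the logical layer: each document keeps its source's
# own category, plus extra attributes giving its re-classified category
# under every other source's scheme, so one query spans all sources.
# Sources, categories, and the classifier are illustrative assumptions.

def reclassify(doc, scheme):
    """Stand-in for the paper's text classifier: here, a keyword lookup."""
    for category, keywords in scheme.items():
        if any(k in doc["text"] for k in keywords):
            return category
    return None

def build_logical_layer(docs, schemes):
    """Attach, per document, its category under every source's scheme."""
    for doc in docs:
        doc["mapped"] = {src: reclassify(doc, scheme)
                         for src, scheme in schemes.items()}
    return docs

def search(docs, source, category):
    """Query documents from all sources as if they used `source`'s scheme."""
    return [d["id"] for d in docs if d["mapped"].get(source) == category]

schemes = {
    "portal_a": {"Travel": ["visa", "flight"], "IT": ["app"]},
    "portal_b": {"Life and Culture": ["visa", "app"]},
}
docs = build_logical_layer(
    [{"id": 1, "text": "visa application app for foreign travel"},
     {"id": 2, "text": "new stock app release"}],
    schemes,
)
print(search(docs, "portal_b", "Life and Culture"))  # → [1, 2]
```

Note that the physical categories are never changed: the mapping lives entirely in the added `mapped` attribute, mirroring the paper's point about preserving each source's existing classification system.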

Historical Review of Sensory Integration in Korea (감각통합치료의 역사적 고찰)

  • Kim, Kyeong-Mi
    • The Journal of Korean Academy of Sensory Integration / v.1 no.1 / pp.73-79 / 2003
  • The significant historical events and developments in the sensory integration approach within the Korean Society of Sensory Integration Therapists (KSSIT) are reviewed to stimulate future planning and clinical application in professional development and health promotion. The paper reviews the purposes of the society's establishment, the roles of its members, and the progress of regular meetings, member education, and seminars in KSSIT. The review covers the formative period of KSSIT's history, 1997-2003. Primary sources include records of events on KSSIT's homepage, official transcripts, and professional articles; secondary sources include newsletters for occupational therapists and oral information from founding members. The review examines current actions and thinking, and suggests further directions for professional education programs, research, clinical applications, and public relations in the sensory integration approach within KSSIT.


A Study on the Enhancement of Information & Communication Technology Literacy Capacity through Web News-data in Education (웹 신문학습을 통한 정보통신기술 소양 능력 신장에 관한 연구)

  • Bang, Ju-Hye;Lee, Yong-Bae
    • 한국정보교육학회:학술대회논문집 / 2006.01a / pp.77-82 / 2006
  • Through literacy education, learners should acquire basic knowledge of and skills in information and communication technology (ICT), and on that foundation be able to apply ICT in each school subject. ICT competence grows most effectively when these two forms of education, literacy and application, are linked, and web-based newspaper learning is precisely an education that connects them. However, no study had verified whether web-based newspaper learning actually improves ICT literacy, leaving its educational effect uncertain. Accordingly, after conducting web-based newspaper learning, a teaching and learning method that can efficiently achieve the goals of computer education, we measured students' ICT literacy and verified the educational effect with objective figures.


Characteristics of Heat Waves From a Disaster Perspective

  • Kim, Do-Woo;Kwon, Chaeyoung;Kim, Jineun;Lee, Jong-Seol
    • Journal of Preventive Medicine and Public Health / v.53 no.1 / pp.26-28 / 2020
  • In September 2018, heat waves were designated a type of natural disaster under the Framework Act on the Management of Disasters and Safety. The present study examined the characteristics of heat waves from the perspectives of meteorological phenomena and health damage. The government's efforts to minimize the damage caused by heat waves are summarized chronologically. Furthermore, various heat wave issues that remain in our society despite these efforts are summarized by analyzing big data derived from news reports and academic articles.

An Anchor-frame Detection Algorithm in MPEG News Data using DC component extraction and Color Clustering (MPEG으로 압축된 뉴스 데이터에서의 DC성분 추출과 컬러 클러스터링을 이용한 앵커 프레임 검색 기법)

  • 정정훈;이근섭;오화종;최병욱
    • Proceedings of the IEEK Conference / 2000.09a / pp.729-732 / 2000
  • Effective retrieval from large-scale video data requires an indexing process. Effective indexing of video data, which is composed of semantic units called scenes, requires detecting cuts (shot boundaries), the physical boundaries between shots, and extracting a key frame from each shot. This paper uses DC-component extraction, a binary search technique, and color clustering to effectively retrieve anchor frames, the key frames of news video data. To validate the proposed method, we applied it to 47 minutes and 10 seconds of MPEG-2 compressed news video and obtained a precision of 91.3% and a recall of 84.0%, demonstrating the effectiveness of the method.
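As a rough illustration of the anchor-frame matching step (comparing candidate frames against a color model of the anchor shot), the sketch below reduces each frame to a coarse color histogram standing in for the DC-component image; the pixel data and distance threshold are made up.

```python
# Sketch of anchor-frame matching via color clustering: each frame is
# reduced to a coarse RGB histogram (a stand-in for the DC-component
# image), and frames close enough to an anchor-shot model are flagged.
# Pixel values, bin count, and the threshold are illustrative.

def histogram(pixels, bins=4):
    """Coarse, normalized color histogram of (r, g, b) pixels."""
    counts = [0] * (bins ** 3)
    for r, g, b in pixels:
        idx = ((r * bins // 256) * bins * bins
               + (g * bins // 256) * bins
               + (b * bins // 256))
        counts[idx] += 1
    return [c / len(pixels) for c in counts]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def is_anchor_frame(frame_pixels, anchor_model, threshold=0.5):
    """Flag a frame whose histogram is close to the anchor-shot model."""
    return l1_distance(histogram(frame_pixels), anchor_model) < threshold

# Toy frames: a studio shot (mostly one background color) vs. a field shot.
studio = [(30, 60, 120)] * 90 + [(200, 180, 160)] * 10
anchor_model = histogram(studio)
field = [(90, 140, 60)] * 100
```

In the paper's setting the histograms would come from DC images decoded directly from the MPEG stream, which avoids full decompression; the matching logic itself is unchanged.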


A Study on the Cost Analysis for the Container Terminal Services based on ABC Approach

  • Ryu, Dong-Ha;Ahn, Ki-Myung;Yoon, Yeo-Sang
    • Journal of Navigation and Port Research / v.35 no.7 / pp.589-596 / 2011
  • The terminal market has contracted rapidly and market rates have taken a sharp plunge. The substantial decrease in throughput resulting from the global economic downturn has been a finishing blow to terminal operators in Busan. Every terminal operator now takes cost saving as its first priority and is accelerating structural reform and downsizing. In this difficult situation, effective cost analysis is essential both to control operating costs and to develop new services that satisfy customers' differing needs. Through operating-cost analysis, terminal operators can also eliminate unnecessary activities and concentrate their resources on more cost-effective processes. To suggest a new framework for the cost control of container terminals, this paper analyzes terminal costs using an activity-based costing (ABC) approach applied to actual data.
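The core ABC computation can be sketched as follows; the activities, cost pools, and driver volumes are illustrative figures, not data from the terminal study.

```python
# Minimal activity-based costing (ABC) sketch: pooled activity costs are
# allocated to terminal services through cost drivers. The activities,
# driver counts, and all figures are illustrative, not the paper's data.

def abc_allocate(activity_costs, driver_totals, service_drivers):
    """Cost per service = sum over activities of (rate * driver usage)."""
    rates = {a: activity_costs[a] / driver_totals[a] for a in activity_costs}
    return {
        service: sum(rates[a] * usage for a, usage in drivers.items())
        for service, drivers in service_drivers.items()
    }

activity_costs = {"crane_lift": 800_000, "yard_move": 200_000}  # cost pools
driver_totals  = {"crane_lift": 40_000, "yard_move": 20_000}    # total driver volume
service_drivers = {                                             # drivers per box
    "import_box":    {"crane_lift": 1, "yard_move": 2},
    "transship_box": {"crane_lift": 2, "yard_move": 1},
}
costs = abc_allocate(activity_costs, driver_totals, service_drivers)
print(costs)  # {'import_box': 40.0, 'transship_box': 50.0}
```

The point of the method is visible even at this scale: two services consuming the same total number of driver units end up with different costs because they consume differently priced activities.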

Text Mining and Visualization of Papers Reviews Using R Language

  • Li, Jiapei;Shin, Seong Yoon;Lee, Hyun Chang
    • Journal of information and communication convergence engineering / v.15 no.3 / pp.170-174 / 2017
  • Nowadays, people share and discuss scientific papers on social media such as Web 2.0 services, big data platforms, online forums, blogs, Twitter, Facebook, and scholarly communities. In addition to metrics such as the numbers of citations, downloads, and recommendations, paper review text is an effective resource for studying scientific impact. Social media tools improve the research process by recording a series of online scholarly behaviors. This paper analyzes the large number of paper reviews generated on social media platforms to explore implicit information about research papers. We implemented text mining on review texts using the R language and visualize the results. We found that the Zika virus was the research hotspot and that association research methods were widely used in 2016. We also mined news reviews about one paper and derived public opinion from them.
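The frequency-counting step behind such hotspot findings can be illustrated with a dependency-free stand-in (the paper uses R; plain Python is shown here); the review texts and stopword list are invented.

```python
# Sketch of the review-mining step: tokenize review texts, drop
# stopwords, and count term frequencies to surface research hotspots.
# The reviews and the stopword list are illustrative, not real data.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "and", "this", "on", "in"}

def term_frequencies(reviews):
    """Lowercase, tokenize, remove stopwords, and count terms."""
    counter = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z]+", text.lower())
        counter.update(t for t in tokens if t not in STOPWORDS)
    return counter

reviews = [
    "This paper on Zika virus is timely.",
    "Association analysis of Zika outbreaks.",
    "Zika spread model, association rules used.",
]
top = term_frequencies(reviews).most_common(2)
print(top)  # [('zika', 3), ('association', 2)]
```

In practice one would feed the resulting counts into a word cloud or bar chart, which is the visualization role R's text-mining tooling plays in the paper.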

News Article Big Data Analysis based on Machine Learning in Distributed Processing Environments (분산 처리 환경에서의 기계학습 기반의 뉴스 기사 빅 데이터 분석)

  • Oh, Hee-bin;Lee, Jeong-cheol;Kim, Kyungsup
    • Proceedings of the Korea Information Processing Society Conference / 2017.11a / pp.59-62 / 2017
  • This paper addresses a system that analyzes text-form big data using machine learning in a distributed processing environment to produce meaningful results. We designed and implemented a distributed processing system that uses machine learning (Word2Vec) within a distributed system environment (Spark) to analyze the degree of association between keywords in news articles, a type of big data, and we designed a visualization system that makes it easy to grasp at a glance the keywords related to a user's search query.
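The paper trains Word2Vec on Spark; as a tiny dependency-free stand-in for keyword association, the sketch below scores pairs of keywords by cosine similarity of their sentence co-occurrence vectors over an invented corpus.

```python
# Stand-in for Word2Vec-style keyword association: build co-occurrence
# vectors over sentences, then rank keywords related to a query by
# cosine similarity. The corpus is invented; real Word2Vec would learn
# dense embeddings instead of these sparse count vectors.
import math
from collections import Counter

def cooccurrence(sentences):
    """Count, for each word, how often every other word co-occurs with it."""
    vecs = {}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j, c in enumerate(tokens):
                if i != j:
                    vecs.setdefault(w, Counter())[c] += 1
    return vecs

def cosine(v1, v2):
    dot = sum(v1[k] * v2[k] for k in v1 if k in v2)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def related_keywords(vecs, query, top_n=3):
    """Rank other keywords by similarity to the query keyword."""
    return sorted((w for w in vecs if w != query),
                  key=lambda w: cosine(vecs[query], vecs[w]),
                  reverse=True)[:top_n]

corpus = [
    ["election", "candidate", "poll"],
    ["election", "candidate", "debate"],
    ["stock", "market", "index"],
]
vecs = cooccurrence(corpus)
```

On Spark, the co-occurrence and similarity steps would be distributed across partitions of the article corpus, which is what makes the approach viable at news big-data scale.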

Detecting Method of Video Caption Frame on News Data (뉴스 데이터에서 자막프레임 검출방법)

  • Nam, Yun-Seong;Bae, Jong-Sik;Choi, Hyung-Jin
    • Proceedings of the Korea Information Processing Society Conference / 2003.11a / pp.505-508 / 2003
  • As digital video data has become widespread, indexing video data is essential for effectively using and searching vast amounts of material. In news data, caption frames are important information that allows the content of a news item to be grasped at a glance. This paper therefore proposes a method for detecting caption frames as a first step toward indexing news data. To detect frames containing captions, key frames are first extracted using a variable-length frame-skipping method. We then propose a BC (Brightness & Contrast) filtering technique as a preprocessing step for image correction, and a method that detects caption frames by applying an IT (Inverse & Threshold) technique to the caption region.
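The BC-filter and inverse-threshold steps can be sketched on a one-dimensional grayscale caption region as follows; all parameter values and pixel data are assumptions, not the paper's settings.

```python
# Sketch of the caption-detection idea: apply a brightness/contrast
# correction to the grayscale caption region, invert and threshold it,
# and flag the frame as a caption frame when enough text-like pixels
# survive. Every parameter value here is an illustrative assumption.

def bc_filter(pixels, brightness=10, contrast=1.2):
    """Simple brightness & contrast correction, clamped to 0..255."""
    return [min(255, max(0, int(p * contrast + brightness))) for p in pixels]

def invert_threshold(pixels, threshold=128):
    """Invert, then binarize: dark caption text becomes foreground (1)."""
    return [1 if (255 - p) >= threshold else 0 for p in pixels]

def has_caption(region_pixels, min_ratio=0.2):
    """Flag a frame whose caption region has enough foreground pixels."""
    binary = invert_threshold(bc_filter(region_pixels))
    return sum(binary) / len(binary) >= min_ratio

caption_region = [20] * 30 + [230] * 70  # dark text on a bright band
plain_region = [220] * 100               # no caption, uniformly bright
```

A real implementation would operate on a 2-D region of each key frame and would typically add connected-component checks so isolated dark pixels are not mistaken for caption text.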
