• Title/Summary/Keyword: Text data

HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research

  • Kim, Jin-Suk;Choe, Ho-Seop;You, Beom-Jong;Seo, Jeong-Hyun;Lee, Suk-Hoon;Ra, Dong-Yul
    • Journal of Computing Science and Engineering
    • /
    • v.3 no.3
    • /
    • pp.165-180
    • /
    • 2009
  • The HKIB (Hankookilbo) test collections are two archives of Korean newswire stories manually categorized under semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. First, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories using a semi-hierarchical, balanced three-level classification scheme in which each news story has only one level-3 category (single-labeling). We refer to this original data set as the HKIB-40075 test collection. Then Yonsei University and KISTI collaborated to select 20,000 newswire stories from the HKIB-40075 test collection, rearrange the classification scheme to be fully hierarchical but unbalanced, and assign one or more categories to each news story (multi-labeling). We refer to this modified data set as the HKIB-20000 test collection. We benchmark a k-NN categorization algorithm on both HKIB-20000 and HKIB-40075, illustrating properties of the collections, providing baseline results for future studies, and suggesting new directions for further research on the Korean text categorization problem.
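
The k-NN baseline mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the authors' features, similarity measure, or value of k, so everything below is an assumption: whitespace tokens, term-frequency vectors, cosine similarity, similarity-weighted voting, and a tiny hypothetical English training set (`TRAINING`) standing in for the HKIB stories.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_categorize(doc, training, k=3):
    """Single-label k-NN: sum the similarities of the k nearest
    training documents per category and return the best category."""
    query = vectorize(doc)
    scored = sorted(
        ((cosine(query, vectorize(text)), label) for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical toy data, not taken from the HKIB collections.
TRAINING = [
    ("stock market shares fall", "economy"),
    ("central bank raises interest rates", "economy"),
    ("team wins football final", "sports"),
    ("striker scores in football match", "sports"),
]

print(knn_categorize("football team scores", TRAINING))
```

For a multi-labeled collection such as HKIB-20000, one possible variation is to return every category whose summed similarity exceeds a threshold instead of only the top one.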

Sums-of-Products Models for Korean Segment Duration Prediction

  • Chung, Hyun-Song
    • Speech Sciences
    • /
    • v.10 no.4
    • /
    • pp.7-21
    • /
    • 2003
  • Sums-of-Products models were built for segment duration prediction of spoken Korean. An experiment for the modelling was carried out to apply the results to Korean text-to-speech synthesis systems. 670 read sentences were analyzed, trained and tested for the construction of the duration models. Traditional sequential rule systems were extended to simple additive, multiplicative and additive-multiplicative models based on Sums-of-Products modelling. The parameters used in the modelling include the properties of the target segment and its neighbors and the target segment's position in the prosodic structure. Two optimisation strategies were used: the downhill simplex method and the simulated annealing method. The performance of the models was measured by the correlation coefficient and the root mean squared prediction error (RMSE) between actual and predicted duration in the test data. The best performance was obtained with the additive-multiplicative models. The correlation for the vowel duration prediction was 0.69 and the RMSE was 31.80 ms, while the correlation for the consonant duration prediction was 0.54 and the RMSE was 29.02 ms. The results were not good enough to be applied to real-time text-to-speech systems. Further investigation of feature interactions is required to improve the performance of the Sums-of-Products models.
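
As a rough illustration of the model family and the two evaluation measures named in this abstract, the sketch below predicts a duration as a sum of product terms and computes RMSE and the correlation coefficient. The factor names (`segment`, `position`, `next_voiced`) and all parameter values in `TERMS` are hypothetical; the abstract does not list the actual factors or fitted parameters, and the optimisation step (downhill simplex or simulated annealing) is not shown.

```python
import math


def sop_predict(features, terms):
    """Sums-of-products prediction: each term is a product of
    per-factor scales, and the prediction is the sum of the terms.
    One term with all factors gives a multiplicative model; one
    factor per term gives an additive model."""
    total = 0.0
    for term in terms:
        product = 1.0
        for factor, scales in term.items():
            product *= scales[features[factor]]
        total += product
    return total


def rmse(actual, predicted):
    """Root mean squared prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(actual, predicted):
    """Pearson correlation coefficient."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical additive-multiplicative model (durations in ms):
# (segment scale x position scale) + neighbour offset.
TERMS = [
    {"segment": {"a": 80.0, "i": 60.0},
     "position": {"final": 1.5, "medial": 1.0}},
    {"next_voiced": {"yes": 10.0, "no": 0.0}},
]

example = {"segment": "a", "position": "final", "next_voiced": "yes"}
print(sop_predict(example, TERMS))  # 80.0 * 1.5 + 10.0
```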

Gradation Image Processing for Text Recognition in Road Signs Using Image Division and Merging

  • Chong, Kyusoo
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.13 no.2
    • /
    • pp.27-33
    • /
    • 2014
  • This paper proposes a gradation image processing method for the development of a Road Sign Recognition Platform (RReP), which aims to facilitate the rapid and accurate management and surveying of approximately 160,000 road signs installed along the highways, national roadways, and local roads in the cities, districts (gun), and provinces (do) of Korea. RReP is based on GPS (Global Positioning System), IMU (Inertial Measurement Unit), INS (Inertial Navigation System), DMI (Distance Measurement Instrument), and lasers. It uses an imagery information collection/classification module to allow the automatic recognition of signs, the collection of shape, pole location, and sign-type data, and the creation of road sign registers, by extracting basic data related to the shape and sign content and by automated database design. Image division and merging, which were applied in this study, produce superior results compared with the local binarization method in terms of speed. As a result, larger text areas were found in the images, and the accuracy of text recognition improved when the images had been gradated. Multi-threshold values of natural scene images are used to improve the extraction rate of texts and figures based on pattern recognition.
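
The abstract does not spell out how the gradation or the division and merging steps work. One plausible reading of "multi-threshold values" is that grayscale pixels are quantized into several intensity bands rather than binarized with a single threshold; the sketch below shows only that idea. The function name `gradate`, the threshold values, and the list-of-lists image representation are all assumptions.

```python
from bisect import bisect_right


def gradate(image, thresholds):
    """Map each grayscale pixel (0-255) to the index of the intensity
    band it falls into. `thresholds` must be sorted in ascending order;
    n thresholds give n + 1 bands."""
    return [[bisect_right(thresholds, pixel) for pixel in row] for row in image]


# Hypothetical 2 x 3 grayscale patch and three evenly spaced thresholds.
patch = [
    [10, 100, 200],
    [64, 130, 255],
]
print(gradate(patch, [64, 128, 192]))
```

With a single threshold this reduces to ordinary global binarization, which is why the banded output can retain text strokes that a single cut-off would merge into the background.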

Factors Influencing Cell Phone Addiction in Adolescents (청소년의 휴대전화 중독에 영향을 미치는 요인)

  • Koo, Hyun-Young;Park, Hyun-Sook
    • Child Health Nursing Research
    • /
    • v.16 no.1
    • /
    • pp.56-65
    • /
    • 2010
  • Purpose: This study was done to identify factors influencing cell phone addiction in adolescents. Methods: The participants were 548 adolescents in two middle schools and four high schools. Data were collected through self-report questionnaires which were constructed to include a cell phone addiction scale, an impulsiveness scale, media specific factors, and cell phone use. The data were analyzed using the SPSS program. Results: Of the adolescents, 88.7% reported being average users, 8.4%, heavy users, and 2.9%, cell phone addicted. Cell phone addiction was significantly correlated with impulsiveness and media specific factors. Significant factors influencing cell phone addiction were gender, sending and receiving text messages on weekends, monthly call charges, impulsiveness, recreational reasons, and cultural reasons. Conclusion: The above findings indicate that cell phone addiction in adolescents is influenced by gender, text message use, call charges, impulsiveness and media specific factors. Therefore the development of prevention and management programs for cell phone addiction in adolescents should be based on these factors which influence cell phone addiction.

A Study of an Efficient Retrieval System Algorithm using a Text Mining (텍스트마이닝 기술을 이용한 효율적인 검색시스템 알고리즘에 대한 연구)

  • Kim, Je-Seok;Kim, Jang-Hyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • v.9 no.2
    • /
    • pp.531-534
    • /
    • 2005
  • Expanding network coverage and upgrading hardware to address network traffic and server processing speed currently present problems of their own, as do network resource demands and the growth of online information, which is exceeding the operating limits of existing information systems. This study proposes an architecture, an organically unified system of content optimized for retrieval, that adapts to users' varying points of view and to changes in the content of the document collection, based on the study of an algorithm that makes it easy to locate documents within a multitude of online data.

Corpus-Based Literary Analysis (코퍼스에 기반한 문학텍스트 분석)

  • Ha, Myung-Jeong
    • The Journal of the Korea Contents Association
    • /
    • v.13 no.9
    • /
    • pp.440-447
    • /
    • 2013
  • Corpus linguistic analyses have recently enabled researchers to examine meanings and structural features of data that cannot be detected intuitively. While the potential of corpus linguistic techniques has been established and demonstrated for non-literary data, corpus stylistic analyses have rarely been applied to the analysis of literature. Specifically, this paper explores keywords and their role in text analysis, which is a primary part of corpus linguistic analysis. This paper focuses on the application of techniques from corpus linguistics and the interpretation of results. It addresses the question of what is to be gained from keyword analysis by scrutinizing keywords in Shakespeare's Romeo and Juliet.
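
Keyword analysis of this kind is commonly done by comparing word frequencies in a target text against a reference corpus. The abstract does not say which keyness statistic the paper uses; the sketch below assumes the log-likelihood statistic that is widely used in corpus linguistics, and the token lists are hypothetical toy data rather than the play's text.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of a word occurring a times in a target
    corpus of c tokens and b times in a reference corpus of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_target)
    if b > 0:
        total += b * math.log(b / expected_reference)
    return 2 * total


def keywords(target_tokens, reference_tokens, top_n=5):
    """Rank words that are relatively more frequent in the target
    corpus than in the reference corpus. Both corpora must be non-empty."""
    target_freq = Counter(target_tokens)
    reference_freq = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target_freq.items():
        b = reference_freq[word]
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy corpora.
target = "love love love night death love".split()
reference = "the king the crown love war the".split()
print(keywords(target, reference))
```

A higher score means the frequency difference is less likely to be due to chance; interpreting why a word is key still requires reading it in context.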

Feature based Text Watermarking for Binary Document Image (이진 문서 영상을 위한 특징 기반 텍스트 워터마킹)

  • Choo Hyon-Gon;Kim Whoi-yul
    • The KIPS Transactions:PartB
    • /
    • v.12B no.2 s.98
    • /
    • pp.151-156
    • /
    • 2005
  • In this paper, we propose feature-based character watermarking methods based on geometrical features specific to the characters of text in a document image. The proposed methods can satisfy both data capacity and robustness simultaneously, while none of the conventional methods can. According to the characteristics of the characters, a watermark can be embedded or detected through changes in the connectivity of the characters, differences in the characteristics of edge pixels, or changes in the area of holes. Experimental results show that our identification techniques are very robust to distortion and have high data capacity.

Effect of Organizational Culture on Corporate Social Welfare Activities

  • JEONG, Young Joo;CHOI, Moon Kyung
    • East Asian Journal of Business Economics (EAJBE)
    • /
    • v.9 no.4
    • /
    • pp.43-54
    • /
    • 2021
  • Purpose - Stakeholders play a vital part in a company's CSR activities; they contribute to the company's achievements and affect its business objectives. This study aims to add insight to the existing knowledge of how organizational culture can promote corporate social welfare activities. Research design, Data, and methodology - The authors obtained text data for possible practical suggestions that might be used to create a coding method. That is, the authors investigated only trustworthy textual sources, such as peer-reviewed sources and published books, to identify possible solutions. Result - Research results indicated that organizational culture promotes corporate social welfare activities by making people know their values and understand how they come about. Not every community knows what its members want and how to achieve its needs. Sometimes, a community can obtain the values and principles of an organization and incorporate them into community values. Conclusion - Executive leadership and customers are part of society. Any strategy that influences their operation and work ethic influences the contact of the community. This research found methods vital in setting up an excellent culture that enhances profitability and corporate social welfare activities through motivation and communication.

An Efficient Implementation of Lightweight Block Cipher Algorithm HIGHT for IoT Security (사물인터넷 보안용 경량 블록암호 알고리듬 HIGHT의 효율적인 하드웨어 구현)

  • Bae, Gi-Chur;Shin, Kyung-Wook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2014.10a
    • /
    • pp.285-287
    • /
    • 2014
  • This paper describes the design of an area-efficient, low-power cryptographic processor for the lightweight block cipher algorithm HIGHT, which was approved as a cryptographic standard by KATS and ISO/IEC. The HIGHT algorithm, which is suitable for the security of IoT (Internet of Things), encrypts a 64-bit plaintext with a 128-bit cipher key to produce a 64-bit ciphertext, and vice versa. For an area-efficient and low-power implementation, we adopt a 32-bit data path and optimize the round transform block and key scheduler to share hardware resources for encryption and decryption.

Minimally Supervised Relation Identification from Wikipedia Articles

  • Oh, Heung-Seon;Jung, Yuchul
    • Journal of Information Science Theory and Practice
    • /
    • v.6 no.4
    • /
    • pp.28-38
    • /
    • 2018
  • Wikipedia is composed of millions of articles, each of which explains a particular real-world entity in various languages. Since the articles are contributed and edited by a large population of diverse experts with no specific authority, Wikipedia can be seen as a naturally occurring body of human knowledge. In this paper, we propose a method to automatically identify key entities and relations in Wikipedia articles, which can be used for automatic ontology construction. Compared to previous approaches to entity and relation extraction and/or identification from text, our goal is to capture naturally occurring entities and relations from Wikipedia while minimizing the artificiality often introduced when constructing training and testing data. The titles of the articles and anchored phrases in their text are regarded as entities, and their types are automatically classified with minimal training. We attempt to automatically detect and identify possible relations among the entities based on clustering without training data, as opposed to the relation extraction approach, which focuses on improving accuracy in selecting one of several target relations for a given pair of entities. While the relation extraction approach with supervised learning requires a significant amount of annotation effort for a predefined set of relations, our approach attempts to discover relations as they occur naturally. Unlike other unsupervised relation identification work, where automatically identified relations are evaluated against correct relations determined a priori by human judges, we attempted to evaluate the appropriateness of the naturally occurring clusters of relations involving person-artifact and person-organization entities and their relation names.