• Title/Summary/Keyword: Text features


Improving Hypertext Classification Systems through WordNet-based Feature Abstraction (워드넷 기반 특징 추상화를 통한 웹문서 자동분류시스템의 성능향상)

  • Roh, Jun-Ho;Kim, Han-Joon;Chang, Jae-Young
    • The Journal of Society for e-Business Studies / v.18 no.2 / pp.95-110 / 2013
  • This paper presents a novel feature engineering technique that can improve conventional machine learning-based text classification systems. The proposed method extends the initial feature set by using hyperlink relationships in order to categorize hypertext web documents effectively. Web documents are connected to each other through hyperlinks, and in many cases hyperlinks exist among highly related documents. Such hyperlink relationships can be used to enhance the quality of the features that make up classification models. The basic idea of the proposed method is to generate a kind of abstracted concept feature that consists of a few raw feature words; for this, the method computes the semantic similarity between a target document and its neighbor documents by utilizing hierarchical relationships in the WordNet ontology. In developing classification models, the abstracted concept features are treated the same as the other raw features, and they can play a significant role in developing more accurate classification models. Through extensive experiments with the Web-KB test collection, we show that the proposed methods outperform the conventional ones.

Korean Web Content Extraction using Tag Rank Position and Gradient Boosting (태그 서열 위치와 경사 부스팅을 활용한 한국어 웹 본문 추출)

  • Mo, Jonghoon;Yu, Jae-Myung
    • Journal of KIISE / v.44 no.6 / pp.581-586 / 2017
  • For automatic web scraping, unnecessary components such as menus and advertisements need to be removed from web pages and main contents should be extracted automatically. A content block tends to be located in the middle of a web page. In particular, Korean web documents rarely include metadata and have a complex design; a suitable method of content extraction is therefore needed. Existing content extraction algorithms use the textual and structural features of content blocks because processing visual features requires heavy computation for rendering and image processing. In this paper, we propose a new content extraction method using the tag positions in HTML as a quasi-visual feature. In addition, we develop a tag rank position, a type of tag position not affected by text length, and show that gradient boosting with the tag rank position is a very accurate content extraction method. The result of this paper shows that the content extraction method can be used to collect high-quality text data automatically from various web pages.
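The abstract above does not define the tag rank position precisely, so the following is only a minimal sketch under an assumed definition: the rank of an opening tag among all opening tags in document order, scaled to [0, 1]. It is contrasted with a plain character-offset position, which shifts when the amount of text changes. The names `tag_positions` and `_TagCollector` are hypothetical, and the gradient boosting step is not shown.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Hypothetical helper: records (tag, character offset) for each opening tag."""

    def __init__(self, html):
        super().__init__()
        # Absolute offset of the start of each line, so that getpos()
        # (line number, column) can be converted to a character offset.
        self._line_starts = []
        offset = 0
        for line in html.split("\n"):
            self._line_starts.append(offset)
            offset += len(line) + 1
        self.tags = []
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        lineno, col = self.getpos()
        self.tags.append((tag, self._line_starts[lineno - 1] + col))


def tag_positions(html):
    """Return (tag, char_position, rank_position) triples, both scaled to [0, 1].

    char_position depends on how much text precedes the tag;
    rank_position (an assumed reading of "tag rank position") depends
    only on the order of the tags.
    """
    tags = _TagCollector(html).tags
    n = len(tags)
    total = max(len(html), 1)
    return [
        (tag, offset / total, rank / (n - 1) if n > 1 else 0.0)
        for rank, (tag, offset) in enumerate(tags)
    ]
```

Under this assumed definition, two pages with the same tag structure but different amounts of body text give identical rank positions, while their character-offset positions differ. Such per-tag values could then serve as input features to a gradient boosting classifier.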

Intelligent Spam-mail Filtering Based on Textual Information and Hyperlinks (텍스트정보와 하이퍼링크에 기반한 지능형 스팸 메일 필터링)

  • Kang, Sin-Jae;Kim, Jong-Wan
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.7 / pp.895-901 / 2004
  • This paper describes a two-phase intelligent method for filtering spam mail based on textual information and hyperlinks. Since the body of spam mail contains little text, it provides insufficient hints to distinguish spam from legitimate mail. To resolve this problem, we follow the hyperlinks contained in the email body, fetch the contents of the remote webpages, and extract hints (i.e., features) from both the original email body and the fetched webpages. We divide the hints into two kinds of information: definite information (the sender's information and definite spam keyword lists) and less definite textual information (words or phrases, and particular features of the email). In filtering spam mail, definite information is used first, and then less definite textual information is applied. In our experiment, the method of fetching web pages improved the F-measure by 9.4% over the method that uses only the original email header and body.

Studies of the Gruel as Medicated Diet for the RegimenYangSaeng of the Elderly - In Yang-lo-bong-chin-seo(養老奉親書) - (노인 식이양생(食餌養生)을 위한 약선죽(藥膳粥)에 관한 연구 - "양노봉친서(養老奉親書)"를 중심으로 -)

  • Kim, Jung-Eun;Ji, Myoung-Soon
    • Journal of Korean Medical Classics / v.26 no.1 / pp.99-129 / 2013
  • Objective: Most diseases of the aged are chronic illnesses, hence diet is important. Yet studies on dietary methods for treating the ailments of the aged are scarce. A diet for the aged must be easy to digest in view of their physiological features and pleasant to chew, while meeting expectations and guaranteeing nutrition supply and remedial efficacy. Material and Method: Using the Yang-lo-bong-chin-seo, a text on maintaining and nurturing the health of the aged, this study aims to (1) classify the foods recorded in the text by cooking method, (2) sort the main ingredients of remedial herbal rice porridge (Yak-sun-jook) from a food material science perspective, (3) evaluate the cooking methods of the porridge for each of the various symptoms, and (4) assess the features of each ingredient of the porridge and its value from the perspectives of both oriental medicine and nutrition. Results: 1) Among the 64 main dishes recorded in Yang-lo-bong-chin-seo, rice porridge makes up the majority, at 64%. Stew and soup account for 60% of side dishes. 2) In 15 food cures, 43 remedial herbal rice porridges (Yak-sun-jook) were recorded. 3) Yak-sun-jook mostly uses Chinese herbs as its food material. 4) Yak-sun-jook is made more with vegetable ingredients than with animal ingredients and consists largely of Chinese herbs. 5) The main ingredients of the porridges are effective in curing disease in addition to providing sufficient, well-balanced nutrition. 6) The porridge is cooked by grinding Chinese herbs into powder or boiling them for a long time. Conclusion: The findings above provide an informational foundation that can be used to develop and devise practical and feasible remedial herbal rice porridge (Yak-sun-jook).

Topic Analysis of the National Petition Site and Prediction of Answerable Petitions Based on Deep Learning (국민청원 주제 분석 및 딥러닝 기반 답변 가능 청원 예측)

  • Woo, Yun Hui;Kim, Hyon Hee
    • KIPS Transactions on Software and Data Engineering / v.9 no.2 / pp.45-52 / 2020
  • The national petition site has attracted much attention since its opening. In this paper, we perform topic analysis of the national petition site and propose a deep learning-based prediction model for answerable petitions. First, 1,500 petitions are collected, and topics are extracted based on the petitions' contents. Main subjects are defined using the K-means clustering algorithm, and detailed subjects are defined using topic modeling of the petitions belonging to each main subject. Long short-term memory (LSTM) is then used to predict answerable petitions. The proposed model uses not only the title and contents but also the categories, the length of the text, and the ratios of parts of speech such as nouns, adjectives, adverbs, and verbs. Our experimental results show that the type 2 model, which uses additional features such as the ratio of parts of speech, length of text, and categories, outperforms the type 1 model, which does not.

Terminology Recognition System based on Machine Learning for Scientific Document Analysis (과학 기술 문헌 분석을 위한 기계학습 기반 범용 전문용어 인식 시스템)

  • Choi, Yun-Soo;Song, Sa-Kwang;Chun, Hong-Woo;Jeong, Chang-Hoo;Choi, Sung-Pil
    • The KIPS Transactions:PartD / v.18D no.5 / pp.329-338 / 2011
  • Terminology recognition, a prerequisite for text mining, information extraction, information retrieval, the semantic web, and question answering, has been intensively studied in a limited range of domains, especially the biomedical domain. Because previous approaches relied on domain-specific resources, they were difficult to apply to general domains. We therefore propose a domain-independent terminology recognition system based on machine learning that uses a dictionary, syntactic features, and Web search results. The proposed approach achieved an F-score of 80.8, a 6.5% improvement over the related approach, C-value, which has been widely used and is based on local domain frequencies. In the second experiment with various combinations of unithood features, the method combined with NGD (Normalized Google Distance) showed the best performance, an F-score of 81.8. We applied three machine learning methods, logistic regression, C4.5, and SVMs, and obtained the best score from the decision tree method, C4.5.
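For readers unfamiliar with NGD, the standard Normalized Google Distance can be computed from search hit counts as sketched below. This shows the general formula only; how the paper turns it into a unithood feature is not stated in the abstract. The function name and the handling of zero counts are assumptions made for illustration.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from hit counts.

    fx, fy: number of pages containing each term
    fxy:    number of pages containing both terms
    n:      total number of indexed pages (assumed larger than fx and fy)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Assumption: treat terms that never occur together as infinitely distant.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

A value near 0 means the two terms almost always occur together, which is the kind of signal that could indicate that a multi-word candidate behaves as a single unit.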

Metaverse App Market and Leisure: Analysis on Oculus Apps (메타버스 앱 시장과 여가: 오큘러스 앱 분석)

  • Kim, Taekyung;Kim, Seongsu
    • Knowledge Management Research / v.23 no.2 / pp.37-60 / 2022
  • The growth of virtual reality games and the popularization of blockchain technology are bringing significant changes to the formation of the metaverse industry ecosystem. In particular, since Meta acquired Oculus, a VR device and application company, the growth of VR-based metaverse services has been accelerating. This study explores the concept that supports leisure activities in the metaverse environment in relation to the game-like features of VR apps, which differentiate them from traditional smartphone-based mobile apps. Using exploratory text mining methods and network analysis approaches, 241 apps registered in the Oculus Quest 2 App Store were analyzed. Analysis results from a quasi-network show that the leisure concept is closely related to various genre features, including games and tourism. Additionally, the analysis results of the G & F model indicate that the leisure concept is distinctive in terms of its gateway brokerage role. These results were also confirmed by LDA topic modeling analysis.

A study on the xylographica of 「Classified Collection of Medical Prescriptions」 ("의방류취(醫方類聚)"에 대한 판본(版本) 연구)

  • Shin, Soon-Shik;Choi, Hwan-Soo
    • Korean Journal of Oriental Medicine / v.3 no.1 / pp.1-15 / 1997
  • 「Classified Collection of Medical Prescriptions」 (1445) is a book that compiled the medical achievements of China and Choseon of its time, and it is a source of pride for this country to have it. It also deserves careful investigation, since it can provide clues to the features of books now missing in China and Korea. The accuracy of the xylographica of old books determines the possibility of further in-depth study. The authors therefore attempted to investigate the xylographica of 「Classified Collection of Medical Prescriptions」, one of the three main books in Korea. The previous investigations by Miki Sakae and Kim Doo Jong are noteworthy. On the basis of their respective works, we analyzed the 'Annals of the Choseon Dynasty' to find records related to 「Classified Collection of Medical Prescriptions」 and estimated the circumstances of its publication. We tried to work out the situation at that time in China, Japan, and Korea (including North Korea) and to estimate the book's original xylographica as far as we could. By King Sejong's command, the first draft of 「Classified Collection of Medical Prescriptions」, consisting of 365 books, was made through the collaboration of civil officials and medical officers during the period from 1443 to 1445. Then, from 1451 (the first year of Moonjong's reign) to 1464 (the 10th year of Sejo's reign), a great deal of manpower was employed, and through a process of countless erasures, proofreading, arrangement, and rearrangement, the revised version of 「Classified Collection of Medical Prescriptions」, called the Sejo text, was completed. After three years of wood engraving work, the first printed form of 「Classified Collection of Medical Prescriptions」 (alternatively called the Seongjong text), in folding cases and consisting of 266 chapters in 264 volumes, came into the world in 1477 (the 8th year of Seongjong's reign).
This was 32 years after the initial completion of the edition. Thus 「Classified Collection of Medical Prescriptions」 exists in three forms: the Sejong text, the Sejo text, and the Seongjong text. Since those texts were plundered during the Japanese invasion of Korea in 1592, no original copy remains in Korea. The texts were moved successively to kadeungcheongieong, to Kongdeungpyeongio, to Jesookoan of Edo, to the department of classic books of East University, to the Cheoncho archives, and to the Imperial Museum, and are now kept in the royal palace (Doseoryo text, Eulhae printing type). Reduced-size republications of 「Classified Collection of Medical Prescriptions」 in wooden type were imported at the time of the 'Byeongja Korea-Japan Treaty' of 1876. Of those two books, one copy was treasured in the Royal Household of the Yi Dynasty and then lost during the Korean War, circa 1950. The other copy passed successively, by Kojong's imperial grant, to the royal doctor Hong Cheol Bo, Hong Taek Joo, and the book agent Hong Ik Pyo; it is now kept in Yonsei University Library and is the only existing copy in Korea at present. In 1965, Dongyang Medical College published a transcription version of 「Classified Collection of Medical Prescriptions」 consisting of 11 books, and in 1981, after editing and arrangement by the Choonghoa(中華) publishing company, a photoprint copy of 「Classified Collection of Medical Prescriptions」 was published by the Keumgang(金剛) publishing company. In October 1991, the Yeokang(驛江) publishing company produced photocopies of 「Classified Collection of Medical Prescriptions」, which had previously been translated into Korean by the North Korea Institute of Oriental Medicine and issued by a medical publishing company.
In China, two institutes, the Zhejiang Institute of Traditional Chinese Medicine and Huzhou Traditional Chinese Medical Hospital, cooperated to publish a revised and marked text consisting of 11 books by adding marking points to the Japanese Edohakhoondang text, which was used as a reference. Both the Korean and Chinese texts issued were based on the 「Classified Collection of Medical Prescriptions」 kept in the royal palace. Any further study of 「Classified Collection of Medical Prescriptions」 can attain accuracy and objectivity when the Japanese text kept in the royal palace is taken as the original copy.


A Study on the Hyun-Mu Sutra(玄武經) of Jeungsan (증산계 『현무경』 연구)

  • Koo, Jung-hoe
    • Journal of the Daesoon Academy of Sciences / v.25_1 / pp.25-85 / 2015
  • This study carries out source criticism (the establishment of an authentic text) of the Hyun-Mu Sutra(玄武經) across its different editions and attempts a new interpretation based on that text. The Hyun-Mu Sutra, a scripture written in 1909, became known to the world through the religions of Jeungsanism. In particular, it is remarkable that the Hyun-Mu Sutra, once handed down in isolation and secrecy, was absorbed into the Jeonkyung(典經), the canonical scripture of Daesoonjinrihoe, The Fellowship of Daesoon Truth(大巡眞理). However, although more than 100 years have passed since it was written in 1909, no authentic text of the scripture has been established. The Hyun-Mu Sutra is a scripture of 25 pages from the religions of Jeungsanism [Gang Il-sun 姜一淳 (1871~1909)]. Until now, a 33-page version has circulated as the authentic text of the Hyun-Mu Sutra. However, examination showed that the diagnostic scripture(病勢文) was added by descendants. After reviewing the authentic text, this study concludes that the original Hyun-Mu Sutra contains no diagnostic scripture. Although the Hyun-Mu Sutra is a short booklet, its notation and expressions are so unusual that its contents have remained difficult to read. Interpretations of the Hyun-Mu Sutra to date fall into two groups: 1) approaches based on the I-ching, and 2) approaches based on the ten celestial stems and twelve earthly branches(10干12支). The I-ching approaches were sometimes supplemented with Buddhist classification methods. Nevertheless, these studies are limited because they fail to secure an authentic text of the Hyun-Mu Sutra. In this study, the contents of the Hyun-Mu Sutra were examined item by item, focusing on the following four points. 1) The icon of the Hyun-Mu Sutra(玄武經符) resembles an ordinary talisman(符籍) but has other features.
2) 'Reverse fonts'(反書體) [mirror-image forms of the standard fonts(正書體)] are used, and the size and location of characters in the text are not uniform. 3) Letters in the scripture are dotted, with points stamped to the left of, above, and below characters. 4) The "spiritual poem" (詠歌, the Korean traditional music with a view of elegance as an origin of eco) is related to the music of the Five Sounds (宮 Gung, 商 Sang, 角 Gak, 徵 Chi, 羽 Wu). The content analysis of the Hyun-Mu Sutra yields the following four results. 1) The icon of the Hyun-Mu Sutra(玄武經符) was primarily developed by Jeungsan. 2) 'Reverse fonts'(反書體) and reversed placement such as '宙宇' [the reverse of '宇宙'] represent a new world based on the forward and reverse I-ching(正易). 3) The dots and neighboring points form a symbolic map that indicates the position of the later new world(後天) and era(人尊). 4) The spiritual poem is the entrance to achieving the Realization of Do(道通). These are the results of this study.

Complex Color Model for Efficient Representation of Color-Shape in Content-based Image Retrieval (내용 기반 이미지 검색에서 효율적인 색상-모양 표현을 위한 복소 색상 모델)

  • Choi, Min-Seok
    • Journal of Digital Convergence / v.15 no.4 / pp.267-273 / 2017
  • With the development of various devices and communication technologies, the production and distribution of various multimedia contents are increasing exponentially. In order to retrieve multimedia data such as images and videos, an approach different from conventional text-based retrieval is needed. Color and shape are key features used in content-based image retrieval, which quantifies and analyzes various physical features of images and compares them to search for similar images. Color and shape have been used as independent features, but the two features are closely related in terms of cognition. In this paper, a method of describing the spatial distribution of color using a complex color model that projects three-dimensional color information onto two-dimensional complex form is proposed. Experimental results show that the proposed method can efficiently represent the shape of spatial distribution of colors by frequency transforming the complex image and reconstructing it with only a few coefficients in the low frequency.