• Title/Summary/Keyword: Morphological Analyzers

Search Result 11, Processing Time 0.03 seconds

Development of Text Mining-Based Accounting Terminology Analyzer for Financial Information Utilization (재정정보 활용을 위한 텍스트 마이닝 기반 회계용어 형태소 분석기 구축)

  • Jung, Geon-Yong;Yoon, Seung-Sik;Kang, Ju-Young
    • The Journal of Information Systems
    • /
    • v.28 no.4
    • /
    • pp.155-174
    • /
    • 2019
  • Purpose Social interest in financial statement notes has recently increased. However, contrary to the keen interest in financial statement notes, there is no morphological analyzer for accounting terms, which is why researchers are having considerable difficulty in carrying out research. In this study, we build a morphological analyzer for accounting related text mining techniques. This morphological analyzer can handle accounting terms like financial statements and we expect it to serve as a springboard for growth in the text mining research field. Design/methodology/approach In this study, we build customized korean morphological analyzer to extract proper accounting terms. First, we collect Company's Financial Statement notes, financial information data published by KPFIS(Korea Public Finance Information Service), K-IFRS accounting terms data. Second, we cleaning and tokeninzing and removing stopwords. Third, we customize morphological analyzer using n-gram methodology. Findings Existing morphological analyzer cannot extract accounting terms because it split accounting terms to many nouns. In this study, the new customized morphological analyzer can detect more appropriate accounting terms comparing to the existing morphological analyzer. We found that accounting words that were not detected by existing morphological analyzers were detected in new customized morphological analyzers.

Cloning of Korean Morphological Analyzers using Pre-analyzed Eojeol Dictionary and Syllable-based Probabilistic Model (기분석 어절 사전과 음절 단위의 확률 모델을 이용한 한국어 형태소 분석기 복제)

  • Shim, Kwangseob
    • KIISE Transactions on Computing Practices
    • /
    • v.22 no.3
    • /
    • pp.119-126
    • /
    • 2016
  • In this study, we verified the feasibility of a Korean morphological analyzer that uses a pre-analyzed Eojeol dictionary and syllable-based probabilistic model. For the verification, MACH and KLT2000, Korean morphological analyzers, were cloned with a pre-analyzed eojeol dictionary and syllable-based probabilistic model. The analysis results were compared between the cloned morphological analyzer, MACH, and KLT2000. The 10 million Eojeol Sejong corpus was segmented into 10 sets for cross-validation. The 10-fold cross-validated precision and recall for cloned MACH and KLT2000 were 97.16%, 98.31% and 96.80%, 99.03%, respectively. Analysis speed of a cloned MACH was 308,000 Eojeols per second, and the speed of a cloned KLT2000 was 436,000 Eojeols per second. The experimental results indicated that a Korean morphological analyzer that uses a pre-analyzed eojeol dictionary and syllable-based probabilistic model could be used in practical applications.

A Rule-Based Analysis from Raw Korean Text to Morphologically Annotated Corpora

  • Lee, Ki-Yong;Markus Schulze
    • Language and Information
    • /
    • v.6 no.2
    • /
    • pp.105-128
    • /
    • 2002
  • Morphologically annotated corpora are the basis for many tasks of computational linguistics. Most current approaches use statistically driven methods of morphological analysis, that provide just POS-tags. While this is sufficient for some applications, a rule-based full morphological analysis also yielding lemmatization and segmentation is needed for many others. This work thus aims at 〔1〕 introducing a rule-based Korean morphological analyzer called Kormoran based on the principle of linearity that prohibits any combination of left-to-right or right-to-left analysis or backtracking and then at 〔2〕 showing how it on be used as a POS-tagger by adopting an ordinary technique of preprocessing and also by filtering out irrelevant morpho-syntactic information in analyzed feature structures. It is shown that, besides providing a basis for subsequent syntactic or semantic processing, full morphological analyzers like Kormoran have the greater power of resolving ambiguities than simple POS-taggers. The focus of our present analysis is on Korean text.

  • PDF

Grain Size Analysis Using Morphological Properties of Grains (입자의 형태적 특성을 활용한 퇴적물 입도분석)

  • Choi, Kwang Hee
    • Journal of The Geomorphological Association of Korea
    • /
    • v.27 no.2
    • /
    • pp.19-28
    • /
    • 2020
  • Grain size analysis is the most basic procedure for identifying the origin, transport and sedimentation processes of sediments, and is widely used in geomorphology and sedimentology. Traditionally, grain size was determined by a sieve-pippette method, but the use of automated analyzers is increasing in recent years. These analyzers have many advantages over traditional techniques, but the measurement results are not always the same. It is still difficult to solve the pretreatment problem such as incomplete diffusion and residual organic matter, and inappropriate results may be obtained. This study compared image-based grain size analysis and sieve analysis to verify its statistical reliability, and conducted experiments to enhance the measurement accuracy using shape parameters. The results showed that the image-based analysis overestimated the grain size of sand dunes by about 7% compared to the sieve analysis, but the two measurements were not statistically different. In addition, by using shape parameters, such as aspect ratio, sphericity, and convexity, improved statistics were obtained compared to the original data. Using the morphological properties of the individual grains is a complementary method to the incomplete pretreatment of the grain size analysis process, and at the same time, it will contribute to improving the accuracy and reliability of the results.

Information Retrieval Systems: Between Morphological Analyzers and Systemming Algorithms

  • Mohamed, Afaf Abdel Rhman;Ouni, Chafika;Eljack, Sarah Mustafa;Alfayez, Fayez
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.3
    • /
    • pp.375-381
    • /
    • 2022
  • The main objective of an Information Retrieval System (IRS) is to obtain suitable information within a reasonable time to satisfy a user need. To achieve this purpose, an IRS should have a good indexing system that is based on natural language processing.In this context, we focus on the available Arabic language processing techniques for an IRS with the goal of contributing to an improvement in the performance. Our contribution consists of integrating morphological analysis into an IRS in order to compare the impact of morphological analysis with that of stemming algorithms.

Efficient Keyword Extraction from Social Big Data Based on Cohesion Scoring

  • Kim, Hyeon Gyu
    • Journal of the Korea Society of Computer and Information
    • /
    • v.25 no.10
    • /
    • pp.87-94
    • /
    • 2020
  • Social reviews such as SNS feeds and blog articles have been widely used to extract keywords reflecting opinions and complaints from users' perspective, and often include proper nouns or new words reflecting recent trends. In general, these words are not included in a dictionary, so conventional morphological analyzers may not detect and extract those words from the reviews properly. In addition, due to their high processing time, it is inadequate to provide analysis results in a timely manner. This paper presents a method for efficient keyword extraction from social reviews based on the notion of cohesion scoring. Cohesion scores can be calculated based on word frequencies, so keyword extraction can be performed without a dictionary when using it. On the other hand, their accuracy can be degraded when input data with poor spacing is given. Regarding this, an algorithm is presented which improves the existing cohesion scoring mechanism using the structure of a word tree. Our experiment results show that it took only 0.008 seconds to extract keywords from 1,000 reviews in the proposed method while resulting in 15.5% error ratio which is better than the existing morphological analyzers.

A Robust Pattern-based Feature Extraction Method for Sentiment Categorization of Korean Customer Reviews (강건한 한국어 상품평의 감정 분류를 위한 패턴 기반 자질 추출 방법)

  • Shin, Jun-Soo;Kim, Hark-Soo
    • Journal of KIISE:Software and Applications
    • /
    • v.37 no.12
    • /
    • pp.946-950
    • /
    • 2010
  • Many sentiment categorization systems based on machine learning methods use morphological analyzers in order to extract linguistic features from sentences. However, the morphological analyzers do not generally perform well in a customer review domain because online customer reviews include many spacing errors and spelling errors. These low performances of the underlying systems lead to performance decreases of the sentiment categorization systems. To resolve this problem, we propose a feature extraction method based on simple longest matching of Eojeol (a Korean spacing unit) and phoneme patterns. The two kinds of patterns are automatically constructed from a large amount of POS (part-of-speech) tagged corpus. Eojeol patterns consist of Eojeols including content words such as nouns and verbs. Phoneme patterns consist of leading consonant and vowel pairs of predicate words such as verbs and adjectives because spelling errors seldom occur in leading consonants and vowels. To evaluate the proposed method, we implemented a sentiment categorization system using a SVM (Support Vector Machine) as a machine learner. In the experiment with Korean customer reviews, the sentiment categorization system using the proposed method outperformed that using a morphological analyzer as a feature extractor.

Crawlers and Morphological Analyzers Utilize to Identify Personal Information Leaks on the Web System (크롤러와 형태소 분석기를 활용한 웹상 개인정보 유출 판별 시스템)

  • Lee, Hyeongseon;Park, Jaehee;Na, Cheolhun;Jung, Hoekyung
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2017.10a
    • /
    • pp.559-560
    • /
    • 2017
  • Recently, as the problem of personal information leakage has emerged, studies on data collection and web document classification have been made. The existing system judges only the existence of personal information, and there is a problem in that unnecessary data is not filtered because classification of documents published by the same name or user is not performed. In this paper, we propose a system that can identify the types of data or homonyms using the crawler and morphological analyzer for solve the problem. The user collects personal information on the web through the crawler. The collected data can be classified through the morpheme analyzer, and then the leaked data can be confirmed. Also, if the system is reused, more accurate results can be obtained. It is expected that users will be provided with customized data.

  • PDF

An Automated Draft Report Generator for Peripheral Blood Smear Examinations Based on Complete Blood Count Parameters

  • Kim, Young-gon;Kwon, Jung Ah;Moon, Yeonsook;Park, Seong Jun;Kim, Sangwook;Lee, Hyun-A;Ko, Sun-Young;Chang, Eun-Ah;Nam, Myung-Hyun;Lim, Chae Seung;Yoon, Soo-Young
    • Annals of Laboratory Medicine
    • /
    • v.38 no.6
    • /
    • pp.512-517
    • /
    • 2018
  • Background: Complete blood count (CBC) results play an important role in peripheral blood smear (PBS) examinations. Many descriptions in PBS reports may simply be translated from CBC parameters. We developed a computer program that automatically generates a PBS draft report based on CBC parameters and age- and sex-matched reference ranges. Methods: The Java programming language was used to develop a computer program that supports a graphical user interface. Four hematology analyzers from three different laboratories were tested: Sysmex XE-5000 (Sysmex, Kobe, Japan), Sysmex XN-9000 (Sysmex), DxH800 (Beckman Coulter, Brea, CA, USA), and ADVIA 2120i (Siemens Healthcare Diagnostics, Eschborn, Germany). Input data files containing 862 CBC results were generated from hematology analyzers, middlewares, or laboratory information systems. The draft reports were compared with the content of input data files. Results: We developed a computer program that reads CBC results from a data file and automatically writes a draft PBS report. Age- and sex-matched reference ranges can be automatically applied. After examining PBS, users can modify the draft report based on microscopic findings. Recommendations such as suggestions for further evaluations are also provided based on morphological findings, and they can be modified by users. The program was compatible with all four hematology analyzers tested. Conclusions: Our program is expected to reduce the time required to manually incorporate CBC results into PBS reports. Systematic inclusion of CBC results could help improve the reliability and sensitivity of PBS examinations.

Construction of an Efficient Pre-analyzed Dictionary for Korean Morphological Analysis (한국어 형태소 분석을 위한 효율적 기분석 사전의 구성 방법)

  • Kwak, Sujeong;Kim, Bogyum;Lee, Jae Sung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.2 no.12
    • /
    • pp.881-888
    • /
    • 2013
  • A pre-analyzed dictionary is used to increase the speed and the accuracy of morphological analyzers and to decrease the over-generation. However, if the dictionary includes 'Insufficiently-analyzed word-phrases', which do not include all the possible analysis of the word-phrase, it may cause the decrease of the analysis accuracy. In this paper, we measure the accuracy changes according to the number of word-phrase frequency and the size changes of corpus by Sejong corpus. And performance of integrate system(SMA with pre-dictionary) is highest when sufficient analysis rate of pre-dictionary is more than 99.82%. Also pre-dictionary is constructed with word-phrase that frequency more than 32(64) when size of corpus is 1,600,000(6,300,000) word-phrase.