• Title/Summary/Keyword: N-gram analysis


Comparative Analysis of 4-gram Word Clusters in South vs. North Korean High School English Textbooks (남북한 고등학교 영어교과서 4-gram 연어 비교 분석)

  • Kim, Jeong-ryeol
    • The Journal of the Korea Contents Association / v.20 no.7 / pp.274-281 / 2020
  • N-gram analysis offers a new view of n-word clusters in actual use, which differ from previously known idioms. It mechanically extracts frequently occurring sequences of n consecutive words from a corpus of English textbooks using concordance software. This paper extracts and compares 4-gram word clusters in South Korean high school English textbooks and their North Korean counterparts. The comparison criteria include the number of tokens and types in the spoken and written language of each set of textbooks. The 4-gram word clusters are also classified and compared by grammatical and functional category. The grammatical categories are noun phrases, verb phrases, prepositional phrases, partial clauses, and others. The functional categories are deictic function, text organizers, stance, and others. The findings are as follows. The South Korean high school English textbooks contain more tokens and types in both spoken and written language. Verb-phrase and partial-clause 4-grams are the most frequent grammatical categories in both the South and North Korean textbooks. Stance is the dominant functional category in both.
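
The token and type counts mentioned in this abstract can be illustrated with a minimal sketch. It assumes simple whitespace tokenization and a made-up sample sentence; the paper itself used concordance software, so `word_ngrams` and `sample` are hypothetical names for illustration only.

```python
from collections import Counter

def word_ngrams(text, n=4):
    """Return every run of n consecutive lowercased words in text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical sample sentence, not taken from any textbook corpus.
sample = "i would like to go and i would like to stay"

counts = Counter(word_ngrams(sample, n=4))
tokens = sum(counts.values())  # total number of 4-gram occurrences
types = len(counts)            # number of distinct 4-grams

print(tokens, types)
print(counts.most_common(1))
```

Here a "token" is one occurrence of a 4-gram and a "type" is one distinct 4-gram, so a cluster that repeats raises the token count without raising the type count.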

Out of Vocabulary Word Extractor based on a Syllable n-gram (음절 n-gram 기반의 미등록 어휘 추정기 구현)

  • Shin, Junsoo;Hong, Chohee
    • Annual Conference on Human and Language Technology / 2013.10a / pp.139-141 / 2013
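  • As increasingly diverse content is produced, neologisms and out-of-vocabulary (OOV) words also appear in diverse forms. Such words are misanalyzed during text processing and degrade performance. To address this problem, this paper proposes a method for estimating neologisms and OOV words from a large collection of documents. The method extracts syllable n-grams from the documents and then shrinks and extends each n-gram by one syllable to additionally extract (n+1)-grams and (n-1)-grams. Taking each extracted syllable n-gram as the reference, it computes the frequency difference from the (n+1)-grams and (n-1)-grams and estimates the spans where the frequency changes sharply to be neologisms or OOV words. Experiments confirmed that the method extracts not only neologisms but also domain-dependent OOV words, such as those from Twitter and me2day.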
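
The frequency-difference idea in this abstract can be sketched roughly as follows. This is not the authors' implementation: the thresholds `min_freq`, `drop_ratio`, and `keep_ratio`, the function names, and the toy sentences are all assumptions made for illustration, and n-grams are taken within whitespace-separated words only.

```python
from collections import Counter

def char_ngrams(docs, n):
    """Count syllable (character) n-grams inside each whitespace-separated word."""
    counts = Counter()
    for doc in docs:
        for word in doc.split():
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts

def oov_candidates(docs, n, min_freq=3, drop_ratio=0.5, keep_ratio=0.8):
    """Flag n-grams whose frequency holds up from (n-1) to n but drops sharply at (n+1).

    All three thresholds are illustrative assumptions; n must be at least 2.
    """
    shorter, grams, longer = (char_ngrams(docs, k) for k in (n - 1, n, n + 1))
    candidates = []
    for gram, freq in grams.items():
        if freq < min_freq:
            continue
        # Highest frequency among the (n+1)-grams that contain this n-gram.
        extended = max((c for g, c in longer.items() if gram in g), default=0)
        # Frequency of the (n-1)-gram obtained by dropping the last syllable.
        prefix = shorter[gram[:-1]]
        if extended <= drop_ratio * freq and freq >= keep_ratio * prefix:
            candidates.append(gram)
    return candidates

# Toy sentences in which "미투데이" (me2day) is followed by different particles.
docs = ["미투데이에 글을 썼다", "미투데이를 자주 쓴다", "미투데이가 인기다", "미투데이 친구"]
print(oov_candidates(docs, n=4))
```

The intuition is that a real word keeps roughly the same count while it is being completed syllable by syllable, then loses frequency once varying particles or following text are appended.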


A Study on Applying Novel Reverse N-Gram for Construction of Natural Language Processing Dictionary for Healthcare Big Data Analysis (헬스케어 분야 빅데이터 분석을 위한 개체명 사전구축에 새로운 역 N-Gram 적용 연구)

  • KyungHyun Lee;RackJune Baek;WooSu Kim
    • The Journal of the Convergence on Culture Technology / v.10 no.3 / pp.391-396 / 2024
  • This study proposes a novel reverse N-Gram approach to overcome the limitations of traditional N-Gram methods and enhance performance in building an entity dictionary specialized for the healthcare sector. The proposed reverse N-Gram technique allows for more precise analysis and processing of the complex linguistic features of healthcare-related big data. To verify the efficiency of the proposed method, big data on healthcare and digital health announced during the Consumer Electronics Show (CES) held each January was collected. Using the Python programming language, 2,185 news titles and summaries mentioned from January 1 to 31 in 2010 and from January 1 to 31 in 2024 were preprocessed with the new reverse N-Gram method. This resulted in the stable construction of a dictionary for natural language processing in the healthcare field.

A Detection Method of Similar Sentences Considering Plagiarism Patterns of Korean Sentence (한국어 문장 표절 유형을 고려한 유사 문장 판별)

  • Ji, Hye-Sung;Joh, Joon-Hee;Lim, Heui-Seok
    • The Journal of Korean Association of Computer Education / v.13 no.6 / pp.79-89 / 2010
  • In this paper, we propose a method for finding similar sentences in documents in order to detect plagiarized documents. The proposed model uses LSA and N-gram techniques to detect every type of plagiarized Korean sentence. To evaluate the model, we constructed experimental data from students' essays on the same theme, which the students wrote by intentionally plagiarizing reference documents. The experimental results show that the proposed model outperforms the conventional N-gram, vector, and LSA models in precision, recall, and F-measure.
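
As a point of reference for the conventional N-gram baseline named in this abstract, a minimal sketch of n-gram sentence similarity is given below. It assumes word bigrams and Jaccard overlap, which is one common choice and not necessarily the measure used in the paper; the LSA component is not shown, and the function names and sentences are hypothetical.

```python
def ngram_set(sentence, n=2):
    """Return the set of word n-grams in a sentence (lowercased, whitespace-split)."""
    words = sentence.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two sentences' n-gram sets, in the range 0.0 to 1.0."""
    x, y = ngram_set(a, n), ngram_set(b, n)
    if not x and not y:
        return 0.0  # both sentences are shorter than n words
    return len(x & y) / len(x | y)

# Hypothetical sentence pair: the second differs from the first by one word.
original = "the cat sat on the mat"
suspect = "the cat sat on a mat"
print(ngram_similarity(original, suspect))
```

A single substituted word removes two bigrams from the overlap, which is why surface n-gram matching alone tends to miss paraphrased plagiarism.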


Text Mining Analysis Technique on ECDIS Accident Report (텍스트 마이닝 기법을 활용한 ECDIS 사고보고서 분석)

  • Lee, Jeong-Seok;Lee, Bo-Kyeong;Cho, Ik-Soon
    • Journal of the Korean Society of Marine Environment & Safety / v.25 no.4 / pp.405-412 / 2019
  • SOLAS requires that ECDIS be installed on ships of more than 500 gross tonnage engaged in international navigation no later than the first inspection after July 1, 2018. Several accidents related to the use of ECDIS have occurred since its installation as a new major navigation instrument. Twelve incident reports issued by MAIB, BSU, BEAmer, DMAIB, and DSB were analyzed, and the causes of the accidents were found to relate to the navigator's operation and to the ECDIS system. The text was analyzed with R to quantify the words related to the causes of the accidents. We used text mining techniques such as Wordcloud, Wordnetwork, and Wordweight to represent the importance of words according to their frequency. Wordcloud uses the N-gram model to display word frequencies in cloud form. In the uni-gram analysis, the word "ECDIS" occurred most often, and in the bi-gram analysis the term "Safety Contour" was the most frequent. Based on the bi-gram analysis, the causative words were classified into those related to the officer and those related to the ECDIS system, and the related words were represented with Wordnetwork. Finally, the words related to the officer and to the ECDIS system were compiled into word corpora, and Wordweight was applied to analyze the change in corpus frequency by year. Trend-line analysis of the corpus variation showed that the officer corpus has recently decreased, while the ECDIS system corpus is gradually increasing.

A Study on Machine Learning Based Anti-Analysis Technique Detection Using N-gram Opcode (N-gram Opcode를 활용한 머신러닝 기반의 분석 방지 보호 기법 탐지 방안 연구)

  • Kim, Hee Yeon;Lee, Dong Hoon
    • Journal of the Korea Institute of Information Security & Cryptology / v.32 no.2 / pp.181-192 / 2022
  • The emergence of new malware is incapacitating existing signature-based malware detection techniques, and the application of various anti-analysis techniques makes such malware difficult to analyze. Recent studies on signature-based malware detection are limited in that malware creators can easily bypass them. Therefore, in this study, we build a machine learning model that detects and classifies the anti-analysis techniques of the packers applied to malware, rather than using the characteristics of the malware itself. N-gram opcodes are extracted from malicious binaries to which various anti-analysis techniques of commercial packers have been applied, features are extracted using TF-IDF, and each anti-analysis technique is then detected and classified. Real-world malware samples packed with Themida and VMProtect using multiple anti-analysis techniques were used to train and test six machine learning models, and the best model achieved 81.25% accuracy for Themida and 95.65% accuracy for VMProtect.
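
The feature-extraction step described in this abstract (opcode n-grams weighted by TF-IDF) can be sketched as below. This is an illustrative sketch only: the opcode sequences are invented, the IDF is the plain `log(N / df)` form (libraries often use smoothed variants), and the disassembly and classifier stages of the study are not shown.

```python
import math
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """Join each run of n consecutive opcodes into one feature string."""
    return [" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

def tfidf(samples, n=2):
    """Return one {n-gram: weight} dict per sample, using tf * log(N / df)."""
    docs = [Counter(opcode_ngrams(s, n)) for s in samples]
    # Document frequency: in how many samples each n-gram appears at least once.
    df = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        length = sum(doc.values())
        vectors.append({
            gram: (count / length) * math.log(total / df[gram])
            for gram, count in doc.items()
        })
    return vectors

# Hypothetical opcode sequences standing in for two disassembled samples.
samples = [
    ["push", "mov", "call", "pop"],
    ["push", "mov", "xor", "jmp"],
]
vectors = tfidf(samples, n=2)
print(vectors[0])
```

N-grams shared by every sample receive weight zero under this unsmoothed IDF, so the resulting vectors emphasize opcode sequences that distinguish one packer or protection technique from another.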

Performance Analysis of n-Gram Indexing Methods for Korean text Retrieval (한글 문서 검색에서 n-Gram 색인방법의 성능 분석)

  • 이준규;심수정;박혁로
    • Proceedings of the IEEK Conference / 2003.11b / pp.145-148 / 2003
  • The agglutinative nature of the Korean language makes automatic indexing of Korean very different from that of Indo-European languages. In particular, indexing compound nouns in Korean is problematic because of the exponential number of possible analyses and the existence of unknown words. To deal with this compound noun indexing problem, we propose a new indexing method that combines the merits of morpheme-based and n-gram-based indexing methods. Through experiments, we also find that the best performance of n-gram indexing methods is achieved with 1.75-grams, which have not been considered in previous research.


Etiology of Bacteremia in Children With Hemato-Oncologic Diseases From 2013 to 2023: A Single Center Study

  • Sun Woo Park;Ji Young Park;Hyoung Soo Choi;Hyunju Lee
    • Pediatric Infection and Vaccine / v.31 no.1 / pp.46-54 / 2024
  • Purpose: This study aimed to identify the pathogens of bloodstream infection in children with underlying hemato-oncologic diseases, analyze susceptibility patterns, compare temporal trends with those of previous studies, and assess empirical antimicrobial therapy. Methods: A retrospective review of bacteremia in children with hemato-oncologic diseases was conducted at Seoul National University Bundang Hospital from January 2013 to July 2023. Results: Overall, 98 episodes of bacteremia were observed in 74 patients. Among pathogens isolated, 57.1% (n=56) were Gram-positive bacteria, 38.8% (n=38) were Gram-negative bacteria, and 4.1% (n=4) were Candida spp. The most common Gram-positive bacteria were coagulase-negative staphylococci (n=21, 21.4%) and Staphylococcus aureus (n=14, 14.3%), whereas the most common Gram-negative bacteria were Klebsiella pneumoniae (n=16, 16.3%) and Escherichia coli (n=10, 10.2%). The susceptibility of Gram-positive bacteria to penicillin, oxacillin, and vancomycin was 11.5%, 32.7%, and 94.2%, respectively, and the susceptibility of Gram-negative bacteria to cefotaxime, piperacillin/tazobactam, imipenem, gentamicin, and amikacin was 68.6%, 80%, 97.1%, 82.9%, and 91.4%, respectively. Methicillin-resistant S. aureus was detected in 1 strain, and among Gram-negative strains, extended spectrum β-lactamase accounted for 28.9% (12/38). When antibiotic susceptibility was compared with the empirical antibiotics used, the mismatch rate was 25.5% (n=25). The mortality rate of children within 30 days of bacteremia was 7.1% (n=7). Conclusions: Empirical antibiotic therapy for bacteremia in children with hemato-oncologic diseases should be based on the local antibiogram of each institution, and continuous monitoring is necessary.

A Study on Negation Handling and Term Weighting Schemes and Their Effects on Mood-based Text Classification (감정 기반 블로그 문서 분류를 위한 부정어 처리 및 단어 가중치 적용 기법의 효과에 대한 연구)

  • Jung, Yu-Chul;Choi, Yoon-Jung;Myaeng, Sung-Hyon
    • Korean Journal of Cognitive Science / v.19 no.4 / pp.477-497 / 2008
  • Mood classification of blog text is an interesting problem with potential for a variety of Web services. This paper introduces an approach to improving mood classification through normalized negation n-grams, which contain mood clues, and corpus-specific term weighting (CSTW). We conducted experiments on blog texts with two classification methods: Enhanced Mood Flow Analysis (EMFA) and Support Vector Machine based Mood Classification (SVMMC). The results show that the normalized negation n-gram method is quite effective in dealing with negations and gave gradual improvements in mood classification with EMFA. From the selection of CSTW, we found that an appropriate weighting scheme is important for supporting adequate mood classification performance, because CSTW outperforms TF*IDF and TF.


Emotion Prediction of Paragraph using Big Data Analysis (빅데이터 분석을 이용한 문단 내의 감정 예측)

  • Kim, Jin-su
    • Journal of Digital Convergence / v.14 no.11 / pp.267-273 / 2016
  • The creation and sharing of information, including structured data as well as various unstructured data, is actively progressing with the spread of mobile devices. Recently, big data analysis has been used to extract semantic information from SNS, and data mining is one such big data technique. In particular, general emotion analysis, which expresses the collective intelligence of the masses, makes use of large and varied materials. In this paper, we propose an emotion prediction system architecture that extracts significant keywords from social network paragraphs using n-grams and a Korean morphological analyzer, and predicts emotion using an SVM and the extracted emotion features. The proposed system showed an average recall rate 82.25% higher than previous systems, and it will help extract semantic keywords using morphological analysis.