• Title/Summary/Keyword: n-gram

Sentence Similarity Measurement Method Using a Set-based POI Data Search (집합 기반 POI 검색을 이용한 문장 유사도 측정 기법)

  • Ko, EunByul;Lee, JongWoo
    • KIISE Transactions on Computing Practices / v.20 no.12 / pp.711-716 / 2014
  • With growing interest in plagiarism detection and intelligent file content search, the demand for measuring the similarity between two sentences is increasing. Sentence similarity measurement has been studied from various directions, such as n-gram, edit distance, and LSA. However, each of these methods has its own advantages and disadvantages. In this paper, we propose a new sentence similarity measurement method that approaches the problem from another direction. The proposed method uses the set-based POI data search, which improves search performance over the existing hard matching method when the data includes inversion, omission, insertion, and revision of characters. Using this method, we are able to measure the similarity between two sentences more accurately and more quickly. We modified the data loading and text search algorithms of the set-based POI data search. We also added a word operation algorithm and a measure of the similarity between two sentences expressed as a percentage. The experimental results show that our sentence similarity measurement method performs better than n-gram and the set-based POI data search.
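
For orientation, the n-gram baseline named in this abstract can be sketched as follows. This is a minimal illustration and not the paper's proposed set-based method: the choice of character bigrams, the Dice coefficient, and the function names `char_ngrams` and `ngram_similarity` are all assumptions made here.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace and case."""
    s = "".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets, as a percentage (0-100).

    Assumption: Dice is only one of several common overlap measures;
    the abstract does not say which one its n-gram baseline used.
    """
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 100.0 * 2 * shared / total


print(ngram_similarity("night", "nacht"))  # one shared bigram ("ht") out of 4 + 4
```

Because the score depends only on shared substrings, a single inserted or omitted character changes at most a few n-grams, which is why n-gram measures tolerate small edits better than exact matching.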

Classification Protein Subcellular Locations Using n-Gram Features (단백질 서열의 n-Gram 자질을 이용한 세포내 위치 예측)

  • Kim, Jinsuk
    • Proceedings of the Korea Contents Association Conference / 2007.11a / pp.12-16 / 2007
  • The function of a protein is closely correlated with its subcellular location(s). Given a protein sequence, therefore, determining its subcellular location is a vitally important problem. We have developed a new prediction method for protein subcellular location(s), which is based on n-gram feature extraction and the k-nearest neighbor (kNN) classification algorithm. It classifies a protein sequence into one or more subcellular compartments based on the locations of the top k sequences that show the highest similarity weights against the input sequence. The similarity weight is a similarity measure determined by comparing n-gram features between two sequences. Currently our method extracts penta-grams as features of protein sequences, computes scores of the potential localization site(s) using the kNN algorithm, and finally presents the locations and their associated scores. We constructed a large-scale data set of protein sequences with known subcellular locations from the SWISS-PROT database. This data set contains 51,885 entries with one or more known subcellular locations. Our method shows a very high prediction precision of about 93% for this data set, and compared with another method, it also showed a comparable prediction improvement on a test collection used in previous work.
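
The pipeline described in this abstract (penta-gram features, a similarity weight, top-k voting) might look roughly like the sketch below. The abstract does not define the similarity weight or the scoring rule, so both are assumptions here: the weight is taken to be the number of distinct shared penta-grams, and each location's score is the sum of the weights of the neighbors carrying it. The function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def ngram_set(seq: str, n: int = 5) -> set:
    """Distinct overlapping n-grams of a sequence (penta-grams by default)."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def similarity_weight(a: str, b: str, n: int = 5) -> int:
    """Assumed weight: how many distinct n-grams the two sequences share."""
    return len(ngram_set(a, n) & ngram_set(b, n))


def predict_locations(query, labeled, k=3, n=5):
    """Score locations from the k most similar labeled sequences.

    labeled: list of (sequence, [locations]) pairs.
    Returns {location: score}, ordered from highest to lowest score.
    """
    neighbors = sorted(
        ((similarity_weight(query, seq, n), locs) for seq, locs in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(int)
    for weight, locs in neighbors:
        for loc in locs:  # a sequence may have several locations
            scores[loc] += weight
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Toy data, invented for illustration only.
training = [
    ("MKTAYIAKQR", ["nucleus"]),
    ("MKTAYIAGGG", ["cytoplasm"]),
    ("GGGGGGGGGG", ["membrane"]),
]
print(predict_locations("MKTAYIAKQR", training, k=2))
```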

Enhancement of a language model using two separate corpora of distinct characteristics

  • Cho, Sehyeong;Chung, Tae-Sun
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.3 / pp.357-362 / 2004
  • Language models are essential in predicting the next word in a spoken sentence, thereby enhancing speech recognition accuracy, among other things. However, spoken language domains are too numerous, and developers therefore suffer from a lack of corpora of sufficient size. This paper proposes a method of combining two n-gram language models, one constructed from a very small corpus of the right domain of interest and the other constructed from a large but less adequate corpus, resulting in a significantly enhanced language model. This method is based on the observation that a small corpus from the right domain has high-quality n-grams but suffers from a serious sparseness problem, while a large corpus from a different domain has more n-gram statistics but is incorrectly biased. With our approach, the two sets of n-gram statistics are combined by extending the idea of Katz's backoff, and the approach is therefore called dual-source backoff. We ran experiments with 3-gram language models constructed from newspaper corpora of several million to tens of millions of words, together with models from smaller broadcast news corpora. The target domain was broadcast news. We obtained a significant improvement (30%) by incorporating a small corpus about one thirtieth the size of the newspaper corpus.
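
To make the idea of combining a small in-domain model with a large out-of-domain model concrete, here is a minimal sketch using plain linear interpolation of bigram estimates. Note that this is a simpler, different technique from the dual-source backoff the abstract describes: it has no discounting or backoff weights. The weight `lam`, the function names, and the toy corpora are assumptions for illustration.

```python
from collections import Counter


def bigram_model(tokens):
    """Return (bigram counts, history counts) for maximum-likelihood estimates."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])
    return bigrams, histories


def mle(model, prev, word):
    """P(word | prev) by relative frequency; 0.0 if prev was never seen."""
    bigrams, histories = model
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]


def interpolated(in_domain, out_domain, prev, word, lam=0.7):
    """Weighted mix of the two estimates; lam favors the in-domain model."""
    return lam * mle(in_domain, prev, word) + (1 - lam) * mle(out_domain, prev, word)


# Toy corpora, invented for illustration only.
small = bigram_model("the news tonight the news".split())
large = bigram_model("the paper said the news is the paper".split())

print(interpolated(small, large, "the", "news"))   # seen in both corpora
print(interpolated(small, large, "the", "paper"))  # seen only in the large corpus
```

The second call shows the benefit the abstract is after: a bigram missing from the sparse in-domain corpus still receives probability mass from the larger corpus.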

Key Expressions in Editorial Texts: Determining the Unithood and Termhood of Word Sequences based on a 2009 Newspaper Corpus (신문 사설의 특징적 표현들에 대한 연구)

  • Kim, Hye-Young;Kang, Beom-Mo
    • Annual Conference on Human and Language Technology / 2012.10a / pp.185-190 / 2012
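  • This paper discusses the n-grams that appear in the titles and body text of 2009 editorials from the Dong-A, Chosun, JoongAng, and Hankyoreh newspapers. Specifically, we extract frequently occurring sequences of 3 to 6 consecutive morphemes and examine the high-frequency morpheme sequences found in newspaper editorials. We also compare them with patterns extracted from news articles using the log odds ratio, to identify the expressions that are used more characteristically in editorials. As a result, in editorial body text, the 3-gram '아야 한다' ("must"), the 4-gram 'ㄹ 것이다' ("will"), the 5-gram 'ㄹ 수밖에 없다' ("cannot but"), and the 6-gram '아야 할 것이다' ("will have to"), among others, were used like single terms, and in editorial titles the same was true of '것인가' ("is it ...?") and '안 된다' ("must not"). Examining these morpheme sequences sheds light on the textual characteristics and formulaic expressions of newspaper editorials.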
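
The log odds ratio comparison mentioned in this abstract can be illustrated with a short sketch. The abstract does not give the exact formula or any counts, so the additive smoothing of 0.5 (which keeps the ratio defined when a count is zero), the function name, and the numbers below are all assumptions.

```python
import math


def log_odds_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Log odds ratio of an n-gram in corpus A relative to corpus B.

    Positive values mean the n-gram is more characteristic of corpus A.
    The smoothing constant is an assumption, added to avoid division by zero.
    """
    odds_a = (count_a + smoothing) / (total_a - count_a + smoothing)
    odds_b = (count_b + smoothing) / (total_b - count_b + smoothing)
    return math.log(odds_a / odds_b)


# Hypothetical counts: an n-gram seen 50 times in 1,000 editorial n-grams
# and 5 times in 1,000 news-article n-grams.
print(log_odds_ratio(50, 1000, 5, 1000))
```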

An n-gram-based Indexing Method for Effective Retrieval of Hangul Texts (한글 문서의 효과적인 검색을 위한 n-gram 기반의 색인 방법)

  • 이준호;안정수;박현주;김명호
    • Journal of the Korean Society for Information Management / v.13 no.1 / pp.47-63 / 1996
  • Conventional automatic indexing methods for Hangul texts can be classified into two groups: one extracts index terms by removing non-indexable segments from word-phrases, and the other generates index terms from the morphemes of word-phrases. The former suffers from the word boundary problem when documents contain many compound nouns. The latter can overcome the word boundary problem by extracting simple nouns, but incurs considerable overhead in developing the linguistic knowledge needed in the indexing procedure. In this paper we propose a new indexing method based on n-grams. This method alleviates the problems of previous indexing methods related to word boundaries and linguistic knowledge. We also compare the effectiveness of the n-gram-based indexing method with that of the previous ones.
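
A rough sketch of n-gram-based indexing, as opposed to word- or morpheme-based indexing, is given below. It is not the paper's method: the choice of character bigrams within each space-delimited word-phrase, the conjunctive matching rule, and the names `char_ngrams`, `build_index`, and `search` are assumptions made for illustration.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Distinct character n-grams taken within each space-delimited chunk."""
    grams = set()
    for chunk in text.split():
        if len(chunk) < n:
            grams.add(chunk)  # keep chunks shorter than n as they are
        for i in range(len(chunk) - n + 1):
            grams.add(chunk[i:i + n])
    return grams


def build_index(docs, n=2):
    """Inverted index: n-gram -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=2):
    """Candidate documents that contain every n-gram of the query."""
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))


# Toy documents, invented for illustration only.
docs = {1: "정보검색 시스템", 2: "검색 엔진", 3: "정보 관리"}
index = build_index(docs)
print(search(index, "검색"))  # also finds document 1, where 검색 is inside a compound noun
```

No morphological analysis or dictionary is needed: the query '검색' matches the compound '정보검색' simply because they share a bigram, which is how an n-gram index sidesteps the word boundary problem.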

Biosynthesis of L-Ascorbic Acid by Microorganisms in Kimchi Fermentation Process

  • Cheigh, Hong-Sik;Rina Yu;Park, Hyun-Jeong;Jun, Hong-Ki
    • Preventive Nutrition and Food Science / v.1 no.1 / pp.37-40 / 1996
  • Kimchi is an important source of various vitamins, minerals, dietary fiber, organic acids, and other nutrients. To obtain basic information for developing vitamin-rich functional kimchi, we investigated microorganisms capable of synthesizing vitamin C in the kimchi system. Microorganisms isolated from aliquots of kimchi were screened and cultured using MRS or nutrient agar medium. L-Ascorbic acid produced by the microorganisms in the medium was measured by high performance liquid chromatography. As a result, we isolated two bacterial strains, N7 and N5202, that produce L-ascorbic acid from the kimchi system. Morphological and Gram staining experiments showed that N7 was a Gram-positive bacillus, while N5202 was Gram-negative. There were also several bacteria considered to synthesize erythorbic acid, an analog of ascorbic acid. These results suggest that vitamin C-rich functional food could be developed using kimchi microorganisms.

Clinical Significance and Incidence of Gram-positive Uropathogens in Pediatric Patients Younger than 1 Year of Age with Febrile Urinary Tract Infection (1세 이하의 발열성 소아 요로감염에서 Gram-Positive Uropathogens의 발생 빈도 및 임상적 의의)

  • Yang, Tae Hwan;Yim, Hyung Eun;Yoo, Kee Hwan
    • Childhood Kidney Diseases / v.17 no.2 / pp.65-72 / 2013
  • Purpose: Urinary tract infection (UTI) caused by gram-positive uropathogens is usually hospital-acquired and associated with predisposing conditions. However, the incidence of gram-positive bacteria in community-acquired UTIs has recently increased worldwide. We aimed to investigate the clinical significance of UTI and associated genitourinary malformations in young children with febrile UTIs caused by gram-positive bacteria. Methods: We retrospectively reviewed the medical records of 566 patients (age, <1 year) who visited the Korea University Medical Center for febrile UTIs between January 2008 and May 2013. We classified the patients into the following two groups: gram-positive (P group) and gram-negative (N group), according to the results of urine culture. The fever duration; white blood cell (WBC) counts and C-reactive protein (CRP) levels in peripheral blood; and the presence of hydronephrosis, cortical defects, vesicoureteral reflux (VUR), and renal scarring were compared between the two groups. Results: The number of patients with gram-positive bacteria was 23 (4.1%) and with gram-negative bacteria was 543 (95.9%). The most common pathogen was Escherichia coli, and Enterococcus faecalis showed the highest incidence among gram-positive uropathogens. Patients with gram-positive bacteria showed longer fever duration compared to that in patients with gram-negative bacteria (P vs. N, $3.4{\pm}1.2$ vs. $2.9{\pm}1.6$ days, P <0.05). The incidence of VUR was increased in the gram-positive group compared to that in the gram-negative group (P vs. N, 55.6 vs. 17.8%, P<0.05). However, there were no significant differences in other laboratory and radiologic findings. Conclusion: The findings of our study show that community-acquired UTIs in patients younger than 1 year of age, caused by gram-positive uropathogens, can be associated with prolonged fever duration and the presence of VUR.

Etiological Agents in Bacteremia of Children with Hemato-oncologic Diseases (2006-2010): A Single Center Study (최근 5년(2006-2010)간 소아 혈액 종양 환자에서 발생한 균혈증의 원인균 및 임상 양상: 단일기관 연구)

  • Kang, Ji Eun;Seok, Joon Young;Yun, Ki Wook;Kang, Hyoung Jin;Choi, Eun Hwa;Park, Kyung Duk;Shin, Hee Young;Lee, Hoan Jong;Ahn, Hyo Seop
    • Pediatric Infection and Vaccine / v.19 no.3 / pp.131-140 / 2012
  • Purpose: This study was performed to identify the etiologic agents and antimicrobial susceptibility patterns of organisms responsible for bloodstream infections in pediatric cancer patients, for guidance in empiric antimicrobial therapy. Methods: A 5-year retrospective study of pediatric hemato-oncologic patients with bacteremia at Seoul National University Children's Hospital from 2006 to 2010 was conducted. Results: A total of 246 pathogens were isolated, of which 63.4% (n=156) were gram-negative bacteria, 34.6% (n=85) were gram-positive bacteria, and 2.0% (n=5) were fungi. The most common pathogens were Klebsiella spp. (n=61, 24.8%), followed by Escherichia coli (n=31, 12.6%), coagulase-negative staphylococci (n=23, 9.3%), and Staphylococcus aureus (n=22, 8.9%). Resistance rates of gram-positive bacteria to penicillin, oxacillin, and vancomycin were 85.7%, 65.9%, and 9.5%, respectively. Resistance rates of gram-negative bacteria to cefotaxime, piperacillin/tazobactam, imipenem, gentamicin, and amikacin were 37.2%, 17.1%, 6.2%, 32.2%, and 13.7%, respectively. The overall fatality rate was 12.7%. Gram-negative bacteremia was more often associated with shock (48.4% vs. 11.9%, P<0.01) and had a higher fatality rate than gram-positive bacteremia (12.1% vs. 3.0%, P=0.03). Neutropenic patients were more often associated with shock than non-neutropenic patients (39.6% vs. 22.0%, P=0.04). Conclusion: This study revealed that gram-negative bacteria were still the dominant organisms of bloodstream infections in children with hemato-oncologic diseases, and that patients with gram-negative bacteremia had a fatal course more frequently than those with gram-positive bacteremia.

Comments Classification System using Support Vector Machines and Topic Signature (지지 벡터 기계와 토픽 시그너처를 이용한 댓글 분류 시스템 언어에 독립적인 댓글 분류 시스템)

  • Bae, Min-Young;En, Ji-Hyun;Jang, Du-Sung;Cha, Jeong-Won
    • 한국HCI학회:학술대회논문집 / 2009.02a / pp.263-266 / 2009
  • Comments are short and use word spacing and commas less than general documents do. We convert 7-grams into 3-grams and select key features using topic signatures. Topic signatures are widely used for selecting features in document classification and summarization. We use an SVM (support vector machine) as the classifier. The experimental results show that the proposed method outperforms the previous methods. The proposed system can also be applied to other languages.

Development and Evaluation of Information Extraction Module for Postal Address Information (우편주소정보 추출모듈 개발 및 평가)

  • Shin, Hyunkyung;Kim, Hyunseok
    • Journal of Creative Information Culture / v.5 no.2 / pp.145-156 / 2019
  • In this study, we developed and evaluated an information extraction module based on the named entity recognition technique. The module was designed for the problem of extracting postal address information from arbitrary documents without any prior knowledge of the document layout. From the perspective of information technology practice, our approach can be described as a probabilistic n-gram (bi- or tri-gram) method, which is a generalization of uni-gram-based keyword matching. The main difference between our approach and the conventional methods adopted in natural language processing is that sentence detection, tokenization, and POS tagging are applied recursively rather than sequentially. Test results on approximately two thousand documents are presented in this paper.