• Title/Summary/Keyword: Corpus-based


Deletion-Based Sentence Compression Using Sentence Scoring Reflecting Linguistic Information (언어 정보가 반영된 문장 점수를 활용하는 삭제 기반 문장 압축)

  • Lee, Jun-Beom;Kim, So-Eon;Park, Seong-Bae
    • KIPS Transactions on Software and Data Engineering / v.11 no.3 / pp.125-132 / 2022
  • Sentence compression is a natural language processing task that generates a concise sentence that preserves the important meaning of the original sentence. To produce grammatically appropriate compressions, early studies used human-defined linguistic rules. In addition, because sequence-to-sequence models perform well on various natural language processing tasks, such as machine translation, some studies have applied them to sentence compression. However, the rule-based approaches require every rule to be defined by humans, and the sequence-to-sequence approaches require a large amount of parallel data for model training. To address these challenges, Deleter, a sentence compression model that leverages the pre-trained language model BERT, was proposed. Because Deleter compresses sentences using a perplexity-based score computed with BERT, it requires neither linguistic rules nor a parallel dataset. However, because Deleter considers only perplexity, it does not reflect the linguistic information of the words in a sentence. Furthermore, since the data used for pre-training BERT are far from compressed sentences, the resulting perplexity can lead to incorrect compressions. To address these problems, this paper proposes a method that quantifies the importance of linguistic information and reflects it in perplexity-based sentence scoring. In addition, by fine-tuning BERT on a corpus of news articles, which often contain proper nouns and often omit unnecessary modifiers, the method allows BERT to measure perplexity in a way that is appropriate for sentence compression. Evaluations on English and Korean datasets confirm that the proposed method improves the sentence compression performance of sentence-scoring-based models.
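
The idea of deletion-based compression driven by a perplexity score can be illustrated with a small sketch. This is not the Deleter model or the paper's scoring function: BERT is replaced here by an add-one-smoothed unigram model so the example stays self-contained, the linguistic weighting is omitted, and all names (`unigram_perplexity`, `delete_one`, the toy corpus) are hypothetical.

```python
import math
from collections import Counter

def unigram_perplexity(tokens, counts, total, vocab_size):
    """Perplexity of a token list under an add-one-smoothed unigram model.

    Assumption: this toy model stands in for a BERT-based score.
    """
    log_prob = 0.0
    for tok in tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

def delete_one(tokens, score):
    """Delete the single token whose removal yields the lowest score."""
    candidates = [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]
    return min(candidates, key=score)

# Toy training corpus (illustrative only).
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
total = len(corpus)
vocab_size = len(counts) + 1  # one extra slot for unseen words

def score(tokens):
    return unigram_perplexity(tokens, counts, total, vocab_size)

sentence = "the fluffy cat sat".split()
compressed = delete_one(sentence, score)
print(compressed)
```

Repeating `delete_one` until a target length is reached gives a simple greedy compressor; a method like the one described would additionally adjust the score with per-word linguistic importance.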

A Korean Grammar Checker based on the Trees Resulted from a Full Parser (전체 문장 분석에 기반한 한국어 문법 검사기)

  • 이공주;황선영;김지은
    • Journal of KIISE:Software and Applications / v.30 no.10 / pp.992-999 / 2003
  • The purpose of a grammar checker is to find grammatically erroneous expressions in a sentence and to provide appropriate suggestions for them. To find those errors, a grammar checker should parse the whole input sentence, which is a highly time-consuming job. For this reason, most Korean grammar checkers adopt a partial parser that can analyze a fragment of a sentence without ambiguity. This paper presents a Korean grammar checker that uses a full parser to find grammatical errors. This approach allows the grammar checker to critique errors between two words in a long-distance relationship within a sentence. As a result, it improves the accuracy of error correction, but it may come at the expense of a decrease in performance. The Korean grammar checker described in this paper is implemented with 65 rules for checking and correcting grammatical errors. It achieves a checking accuracy of 96.49% on a test corpus of 7 million words.

Two-week Oral Toxicity Study of 1-(4-methylpiperazinyl)-3-phenylisoquinoline (CWJ-a-5) in Sprague-Dawley (SD) Rats (1-(4-methylpiperazinyl)-3-phenylisoquinoline (CWJ-a-5)의 Sprague-Dawley(SD) 랫드를 이용한 2주간 반복 경구투여 독성시험)

  • 강부현;조원제;김대덕;김용범;차신우;장순재
    • Toxicological Research / v.18 no.1 / pp.47-57 / 2002
  • The subacute oral toxicity of 1-(4-methylpiperazinyl)-3-phenylisoquinoline (CWJ-a-5) was investigated in Sprague-Dawley (SD) rats. Five groups of 5 males and 5 females were orally administered CWJ-a-5 at doses of 0, 37.5, 75, 150 and 200 mg/kg for 2 weeks. Among clinical signs, salivation was observed in the 75, 150 and 500 mg/kg male and female groups. Loss of fur was observed in the 500 mg/kg male and female groups. Body weights were significantly decreased in the 150 and 500 mg/kg male groups and in the 500 mg/kg female group. Food consumption was significantly decreased in the 300 mg/kg male group. In serum biochemistry, total cholesterol and phospholipid were significantly increased in the 500 mg/kg male and female groups. Aspartate aminotransferase was significantly increased in the 500 mg/kg female group. In histopathological examination, vacuolar degeneration of renal tubules in the kidney, vacuolar degeneration of hepatocytes in the liver, vacuolar degeneration of myocytes in the heart, vacuolar degeneration of histiocytes in the spleen and thymus, atrophy of seminiferous tubules and degeneration of germinal epithelium in the testis, and vacuolar degeneration of the corpus luteum, granulosa cells and theca cells in the ovary were observed in the 150 and 500 mg/kg male and female groups. Based on these results, the no observed adverse effect level (NOAEL) of CWJ-a-5 was considered to be 75 mg/kg and the absolute toxic dose was considered to be 150 mg/kg in this study.

Development and Evaluation of a Document Summarization System using Features and a Text Component Identification Method (텍스트 구성요소 판별 기법과 자질을 이용한 문서 요약 시스템의 개발 및 평가)

  • Jang, Dong-Hyun;Myaeng, Sung-Hyon
    • Journal of KIISE:Software and Applications / v.27 no.6 / pp.678-689 / 2000
  • This paper describes an automatic summarization approach that constructs a summary by extracting sentences that are likely to represent the main theme of a document. As a way of selecting summary sentences, the system uses a model that takes into account lexical and statistical information obtained from a document corpus. As such, the system consists of two parts: the training part and the summarization part. The former processes sentences that have been manually tagged for summary sentences and extracts necessary statistical information of various kinds, and the latter uses the information to calculate the likelihood that a given sentence is to be included in the summary. There are at least three unique aspects of this research. First of all, the system uses a text component identification model to categorize sentences into one of the text components. This allows us to eliminate parts of text that are not likely to contain summary sentences. Second, although our statistically-based model stems from an existing one developed for English texts, it applies the framework to individual features separately and computes the final score for each sentence by combining the pieces of evidence using the Dempster-Shafer combination rule. Third, not only were new features introduced but also all the features were tested for their effectiveness in the summarization framework.

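
As an illustration of the Dempster-Shafer combination rule mentioned in the abstract above, the sketch below combines the evidence from two features about whether a sentence belongs in the summary. The feature names and mass values are invented for illustration, and the abstract does not say how the system derives masses from its statistics, so this shows only the rule itself.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset focal elements."""
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence: mass goes to the intersection.
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            # Contradictory evidence: accumulate as conflict.
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

S = frozenset({"summary"})   # sentence is a summary sentence
N = frozenset({"other"})     # sentence is not a summary sentence
U = S | N                    # uncommitted belief (either)

# Hypothetical masses from two features of one sentence.
m_position = {S: 0.6, U: 0.4}
m_cue_word = {S: 0.5, N: 0.2, U: 0.3}

m = combine(m_position, m_cue_word)
print(m[S], m[N], m[U])
```

Because the rule is associative, further features can be folded in by calling `combine` repeatedly, and the mass on `S` can then serve as a sentence score.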

Decision Tree based Disambiguation of Semantic Roles for Korean Adverbial Postpositions in Korean-English Machine Translation (한영 기계번역에서 결정 트리 학습에 의한 한국어 부사격 조사의 의미 중의성 해소)

  • Park, Seong-Bae;Zhang, Byoung-Tak;Kim, Yung-Taek
    • Journal of KIISE:Software and Applications / v.27 no.6 / pp.668-677 / 2000
  • In Korean, case postpositions determine the syntactic roles of phrases, and a single postposition may have more than one meaning. In particular, adverbial postpositions make translation from Korean to English difficult, because they can have various meanings. In this paper, we describe a method for resolving such semantic ambiguities of Korean adverbial postpositions using decision trees. The training examples for decision tree induction are extracted from a corpus of 0.5 million words, and the semantic roles of adverbial postpositions are classified into 25 classes. The lack of training examples in decision tree induction is overcome by clustering words into classes using a greedy clustering algorithm. The cross-validation results show that the presented method achieved an average precision of 76.2%, a 26.0% improvement over a baseline that assigns each adverbial postposition its most frequently appearing semantic role.


Effects of Ovarian Morphology and Culture Vessel on In vitro Development and Cell Number in Embryos of Korean Native Cows

  • Park, Yong-Soo;Kim, Jae-Myeoung
    • Asian-Australasian Journal of Animal Sciences / v.20 no.1 / pp.31-35 / 2007
  • The main purpose of this study was to improve the efficiency and quality of in vitro embryo production in Korean Native Cows (KNC). We examined the effects of ovarian morphologies (Experiment 1) and the culture vessel (Experiment 2) on in vitro maturation (IVM). We measured the subsequent development rates and cell numbers of blastocysts. In Experiment 1, the ovaries of KNC were divided into six groups, based on follicle and corpus luteum (CL) morphology. The development rates to the 2- and 8-cell stages were similar among the six groups. The development rates to blastocyst stages were significantly higher in the group without a CL or follicle (WOCL/F) than in the groups with follicular cysts (FCs), regressive CLs (RCLs) or cystic CLs (CCLs) (p<0.05). The cell number of the inner cell mass (ICM) of blastocysts in the FCs and RCLs groups, and the number of cells in the trophectoderm (TE) in the WOCL/F group, FCs, growing CLs (GCLs) and RCLs were significantly higher than in other groups (p<0.05). The total cell number (TCN) in the WOCL/F, FC and RCL groups was also significantly higher than in other groups (p<0.05). The ICM cell number/TCN ratio was significantly higher in the FC and RCL groups than in the GCL and DF groups (p<0.05). In Experiment 2, oocyte IVM was carried out in culture dishes, in 0.25- or 0.5-ml straws used for freezing sperm. The development rate to the 2-cell stage was significantly higher in the 0.5-ml straw group than in the 0.25-ml straw group. The development rates to the blastocyst stage were similar in the dish and the two straw groups. There were no differences in the cell numbers of ICM, TE or TCN or ICM cell number/TCN ratios between groups.

Trend of Pharmacopuncture Therapy for Treating Cervical Disease in Korea

  • Kim, Seok-Hee;Jung, Da-Jung;Choi, Yoo-Min;Kim, Jong-Uk;Yook, Tae-Han
    • Journal of Pharmacopuncture / v.17 no.4 / pp.7-14 / 2014
  • Objectives: The purpose of this study is to analyze trends in domestic studies on pharmacopuncture therapy for treating cervical disease. Methods: This study was carried out on original copies and abstracts of theses listed in databases or published until July 2014. The search was made on the Oriental medicine Advanced Searching Integrated System (OASIS), the National Digital Science Library (NDSL), and the Korean traditional knowledge portal. Search words were 'pain on cervical spine', 'cervical pain', 'ruptured cervical disk', 'cervical disc disorder', 'stiffness of the neck', 'cervical disk', 'whiplash injury', 'cervicalgia', 'posterior cervical pain', 'neck disability', 'Herniated Nucleus Pulposus (HNP)', and 'Herniated Intervertebral Disc (HIVD)'. Results: Twenty-five clinical theses related to pharmacopuncture were selected and were analyzed by year according to the type of pharmacopuncture used, the academic journal in which the publication appeared, and the effect of pharmacopuncture therapy. Conclusion: The significant conclusions are as follows: (1) Pharmacopunctures used for cervical pain were Bee venom pharmacopuncture, Carthami-flos pharmacopuncture, Scolopendra pharmacopuncture, Ouhyul pharmacopuncture, Hwangryun pharmacopuncture, Corpus pharmacopuncture, Soyeom pharmacopuncture, Hwangryunhaedoktang pharmacopuncture, and Shinbaro pharmacopuncture. (2) Randomized controlled trials showed that pharmacopuncture therapy combined with other methods was more effective. (3) In the past, studies oriented toward Bee venom pharmacopuncture were actively pursued, but the number of studies on various other types of pharmacopuncture gradually began to increase. (4) For treating a patient with cervical pain, the type of pharmacopuncture to be used should be selected based on the cause of the disease and the patient's condition.

A Study on the Construction of the Automatic Extracts and Summaries - On the Basis of Scientific Journal Articles - (자동 발췌문/요약 시스템 구축에 관한 연구 - 학술지 논문기사를 중심으로 -)

  • Lee Tae-Young
    • Journal of the Korean Society for Library and Information Science / v.39 no.3 / pp.139-163 / 2005
  • Various corpus-based approaches, the rhetorical roles of discourse structure, and the unification of similar sentences were applied to construct automatic Ext/Sums (extracts and summaries). To make elastic Ext/Sums, rhetorical roles of sentences such as objective, method, background, result, and conclusion were established, and an extraction engine was prepared for each role. A success rate of 90% was achieved in extracting the important sentences of sample articles. To rearrange the selected sentences, the system used unification of similar sentences based on the cosine coefficient, deletion of unnecessary modifying and inserted clauses, joining of short sentences, and connection of sentences that can be linked. The study suggests that methods applying the rhetorical roles of sentences, the meaning and signature of nouns and verbs in clauses, and cue words and location should be researched further to construct more effective Ext/Sums.
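
The cosine coefficient used above to detect similar sentences can be sketched as follows. The whitespace tokenization, raw term-frequency weighting, and the function name `cosine` are assumptions made for illustration; the abstract does not specify how the study builds its sentence vectors.

```python
import math
from collections import Counter

def cosine(sentence_a, sentence_b):
    """Cosine coefficient between two sentences using term-frequency vectors."""
    va = Counter(sentence_a.lower().split())
    vb = Counter(sentence_b.lower().split())
    # Only terms shared by both sentences contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sentence is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

print(cosine("the method extracts sentences", "the method selects sentences"))
```

Sentence pairs whose coefficient exceeds a chosen threshold could then be treated as candidates for unification; the threshold itself would have to be tuned on sample articles.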

Ontology Construction and Its Application to Disambiguate Word Senses (온톨로지 구축 및 단어 의미 중의성 해소에의 활용)

  • Kang, Sin-Jae
    • The KIPS Transactions:PartB / v.11B no.4 / pp.491-500 / 2004
  • This paper presents an ontology construction method using various computational language resources, and an ontology-based word sense disambiguation method. In order to acquire a reasonably practical ontology, the Kadokawa thesaurus is extended by inserting additional semantic relations into its hierarchy, which are classified as case relations and other semantic relations. To disambiguate word senses with the ontology, we first apply previously secured dictionary information to select the correct senses of some ambiguous words with high precision, and then use the ontology to disambiguate the remaining ambiguous words. The mutual information between concepts in the ontology was calculated before using the ontology as knowledge for disambiguating word senses. If mutual information is regarded as a weight between ontology concepts, the ontology can be treated as a graph with weighted edges, in which we locate the weighted path from one concept to another. In our practical machine translation system, our word sense disambiguation method achieved a 9% improvement for Korean translation over methods that do not use an ontology.

Segmentation of Continuous Speech based on PCA of Feature Vectors (주요고유성분분석을 이용한 연속음성의 세그멘테이션)

  • 신옥근
    • The Journal of the Acoustical Society of Korea / v.19 no.2 / pp.40-45 / 2000
  • In speech corpus generation and speech recognition, it is sometimes necessary to segment the input speech data without any prior knowledge. One method for this kind of segmentation, often called blind segmentation or acoustic segmentation, is to find boundaries that minimize the Euclidean distances among the feature vectors of each segment. However, the use of this metric alone is prone to errors because of the fluctuations or variations of the feature vectors within a segment. In this paper, we introduce principal component analysis to take the trend of the feature vectors into consideration, so that the proposed distance measure is the distance between the feature vectors and their projected points on the principal components. The proposed distance measure is applied in the LBDP (level building dynamic programming) algorithm in a continuous speech segmentation experiment. The result was rather promising, showing a 3-6% reduction in deletion rate compared to the pure Euclidean measure.

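
The distance measure described in the abstract above, the distance between feature vectors and their projections on the principal components, can be sketched in plain Python. This is an illustration under stated assumptions rather than the paper's formulation: only the first principal component is used, it is estimated by power iteration, and the function names and toy data are hypothetical.

```python
import math

def first_principal_axis(vectors, iters=200):
    """Return (mean, unit axis) of the first principal component via power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Scatter matrix (unnormalized covariance; scaling does not change the axis).
    cov = [[sum(c[i] * c[j] for c in centered) for j in range(d)] for i in range(d)]
    # Note: a start vector orthogonal to the true axis would not converge.
    axis = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        nxt = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        axis = [x / norm for x in nxt]
    return mean, axis

def pca_cost(vectors):
    """Sum of distances from each vector to its projection on the first principal axis."""
    mean, axis = first_principal_axis(vectors)
    cost = 0.0
    for v in vectors:
        c = [v[j] - mean[j] for j in range(len(v))]
        t = sum(c[j] * axis[j] for j in range(len(v)))
        residual = [c[j] - t * axis[j] for j in range(len(v))]
        cost += math.sqrt(sum(r * r for r in residual))
    return cost

def euclidean_cost(vectors):
    """Sum of Euclidean distances from each vector to the segment mean."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    return sum(math.sqrt(sum((v[j] - mean[j]) ** 2 for j in range(d))) for v in vectors)

# A segment whose feature vectors drift steadily along one direction.
segment = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(pca_cost(segment), euclidean_cost(segment))
```

For a segment with a steady trend, the projection-based cost stays near zero while the Euclidean cost is large, which is the kind of within-segment variation the abstract says the pure Euclidean measure penalizes. A segmentation algorithm such as LBDP would use a cost of this form for each candidate segment.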