Search | Korea Science

Applying Token Tagging to Augment Dataset for Automatic Program Repair

Hu, Huimin;Lee, Byungjeong
- Journal of Information Processing Systems
- /
- v.18 no.5
- /
- pp.628-636
- /
- 2022
Automatic program repair (APR) techniques focus on automatically repairing bugs in programs and providing correct patches for developers, which have been investigated for decades. However, most studies have limitations in repairing complex bugs. To overcome these limitations, we developed an approach that augments datasets by utilizing token tagging and applying machine learning techniques for APR. First, to alleviate the data insufficiency problem, we augmented datasets by extracting all the methods (buggy and non-buggy methods) in the program source code and conducting token tagging on non-buggy methods. Second, we fed the preprocessed code into the model as an input for training. Finally, we evaluated the performance of the proposed approach by comparing it with the baselines. The results show that the proposed approach is efficient for augmenting datasets using token tagging and is promising for APR.
https://doi.org/10.3745/JIPS.04.0251 인용 PDF KSCI

Korean Head-Tail Tokenization and Part-of-Speech Tagging by using Deep Learning (딥러닝을 이용한 한국어 Head-Tail 토큰화 기법과 품사 태깅)

Kim, Jungmin;Kang, Seungshik;Kim, Hyeokman
- IEMEK Journal of Embedded Systems and Applications
- /
- v.17 no.4
- /
- pp.199-208
- /
- 2022
Korean is an agglutinative language, and one or more morphemes are combined to form a single word. Part-of-speech tagging method separates each morpheme from a word and attaches a part-of-speech tag. In this study, we propose a new Korean part-of-speech tagging method based on the Head-Tail tokenization technique that divides a word into a lexical morpheme part and a grammatical morpheme part without decomposing compound words. In this method, the Head-Tail is divided by the syllable boundary without restoring irregular deformation or abbreviated syllables. Korean part-of-speech tagger was implemented using the Head-Tail tokenization and deep learning technique. In order to solve the problem that a large number of complex tags are generated due to the segmented tags and the tagging accuracy is low, we reduced the number of tags to a complex tag composed of large classification tags, and as a result, we improved the tagging accuracy. The performance of the Head-Tail part-of-speech tagger was experimented by using BERT, syllable bigram, and subword bigram embedding, and both syllable bigram and subword bigram embedding showed improvement in performance compared to general BERT. Part-of-speech tagging was performed by integrating the Head-Tail tokenization model and the simplified part-of-speech tagging model, achieving 98.99% word unit accuracy and 99.08% token unit accuracy. As a result of the experiment, it was found that the performance of part-of-speech tagging improved when the maximum token length was limited to twice the number of words.
https://doi.org/10.14372/IEMEK.2022.17.4.199 인용 PDF KSCI

A Survey of Machine Translation and Parts of Speech Tagging for Indian Languages

Khedkar, Vijayshri;Shah, Pritesh
- International Journal of Computer Science & Network Security
- /
- v.22 no.4
- /
- pp.245-253
- /
- 2022
Commenced in 1954 by IBM, machine translation has expanded immensely, particularly in this period. Machine translation can be broken into seven main steps namely- token generation, analyzing morphology, lexeme, tagging Part of Speech, chunking, parsing, and disambiguation in words. Morphological analysis plays a major role when translating Indian languages to develop accurate parts of speech taggers and word sense. The paper presents various machine translation methods used by different researchers for Indian languages along with their performance and drawbacks. Further, the paper concentrates on parts of speech (POS) tagging in Marathi dialect using various methods such as rule-based tagging, unigram, bigram, and more. After careful study, it is concluded that for machine translation, parts of speech tagging is a major step. Also, for the Marathi language, the Hidden Markov Model gives the best results for parts of speech tagging with an accuracy of 93% which can be further improved according to the dataset.
https://doi.org/10.22937/IJCSNS.2022.22.4.31 인용 PDF KSCI

Reference String Recognition based on Word Sequence Tagging and Post-processing: Evaluation with English and German Datasets

Kang, In-Su
- Journal of the Korea Society of Computer and Information
- /
- v.23 no.5
- /
- pp.1-7
- /
- 2018
Reference string recognition is to extract individual reference strings from a reference section of an academic article, which consists of a sequence of reference lines. This task has been attacked by heuristic-based, clustering-based, classification-based approaches, exploiting lexical and layout characteristics of reference lines. Most classification-based methods have used sequence labeling to assign labels to either a sequence of tokens within reference lines, or a sequence of reference lines. Unlike the previous token-level sequence labeling approach, this study attempts to assign different labels to the beginning, intermediate and terminating tokens of a reference string. After that, post-processing is applied to identify reference strings by predicting their beginning and/or terminating tokens. Experimental evaluation using English and German reference string recognition datasets shows that the proposed method obtains above 94% in the macro-averaged F1.
https://doi.org/10.9708/jksci.2018.23.05.001 인용 PDF KSCI

Proper Noun Embedding Model for the Korean Dependency Parsing

Nam, Gyu-Hyeon;Lee, Hyun-Young;Kang, Seung-Shik
- Journal of Multimedia Information System
- /
- v.9 no.2
- /
- pp.93-102
- /
- 2022
Dependency parsing is a decision problem of the syntactic relation between words in a sentence. Recently, deep learning models are used for dependency parsing based on the word representations in a continuous vector space. However, it causes a mislabeled tagging problem for the proper nouns that rarely appear in the training corpus because it is difficult to express out-of-vocabulary (OOV) words in a continuous vector space. To solve the OOV problem in dependency parsing, we explored the proper noun embedding method according to the embedding unit. Before representing words in a continuous vector space, we replace the proper nouns with a special token and train them for the contextual features by using the multi-layer bidirectional LSTM. Two models of the syllable-based and morpheme-based unit are proposed for proper noun embedding and the performance of the dependency parsing is more improved in the ensemble model than each syllable and morpheme embedding model. The experimental results showed that our ensemble model improved 1.69%p in UAS and 2.17%p in LAS than the same arc-eager approach-based Malt parser.
https://doi.org/10.33851/JMIS.2022.9.2.93 인용 PDF KSCI HTML

Korean Lexical Disambiguation Based on Statistical Information (통계정보에 기반을 둔 한국어 어휘중의성해소)

박하규;김영택
- The Journal of Korean Institute of Communications and Information Sciences
- /
- v.19 no.2
- /
- pp.265-275
- /
- 1994
Lexical disambiguation is one of the most basic areas in natural language processing such as speech recognition/synthesis, information retrieval, corpus tagging/ etc. This paper describes a Korean lexical disambiguation mechanism where the disambigution is perfoemed on the basis of the statistical information collected from corpora. In this mechanism, the token tags corresponding to the results of the morphological analysis are used instead of part of speech tags for the purpose of detail disambiguation. The lexical selection function proposed shows considerably high accuracy, since the lexical characteristics of Korean such as concordance of endings or postpositions are well reflected in it. Two disambiguation methods, a unique selection method and a multiple selection method, are provided so that they can be properly according to the application areas.
PDF

Search Result 6, Processing Time 0.02 seconds

Applying Token Tagging to Augment Dataset for Automatic Program Repair

Korean Head-Tail Tokenization and Part-of-Speech Tagging by using Deep Learning (딥러닝을 이용한 한국어 Head-Tail 토큰화 기법과 품사 태깅)

A Survey of Machine Translation and Parts of Speech Tagging for Indian Languages

Reference String Recognition based on Word Sequence Tagging and Post-processing: Evaluation with English and German Datasets

Proper Noun Embedding Model for the Korean Dependency Parsing

Korean Lexical Disambiguation Based on Statistical Information (통계정보에 기반을 둔 한국어 어휘중의성해소)

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)