• Title/Abstract/Keywords: analysis of corpus


Identifying Key Grammatical Errors of Japanese English as a Foreign Language Learners in a Learner Corpus: Toward Focused Grammar Instruction with Data-Driven Learning

  • Atsushi Mizumoto;Yoichi Watari
    • Asia Pacific Journal of Corpus Research / v.4 no.1 / pp.25-42 / 2023
  • The number of studies on data-driven learning (DDL) has increased in recent years, and DDL's overall effectiveness as an L2 (second language) teaching methodology has been reported to be high. However, the degree of its effectiveness in grammar instruction, particularly for the goal of correcting errors in L2 writing, is still unclear. To provide guidelines for focused grammar instruction with DDL in the Japanese classroom setting, we aimed to identify the typical grammatical errors made by Japanese learners in the Cambridge Learner Corpus First Certificate in English (CLC FCE) dataset. The results revealed that three error types (nouns, articles, and prepositions) should be addressed in DDL grammar instruction for Japanese English as a foreign language (EFL) learners. In light of the findings, pedagogical implications and suggestions for future DDL research and practice are discussed.

In My Opinion: Modality in Japanese EFL Learners' Argumentative Essays

  • Pemberton, Christine
    • Asia Pacific Journal of Corpus Research / v.1 no.2 / pp.57-72 / 2020
  • This study seeks to add to the current understanding of learners' use of modality in argumentative writing. A learner corpus of argumentative essays on four topics was created and compared to native English speaker data from the International Corpus Network of Asian Learners of English (ICNALE). The relationship between learners' use of modal devices (MDs) and the devices' appearance in the school's curriculum was also examined. The results showed that learners relied on a very narrow range of MDs compared to those in previous studies. The frequency of use of MDs varied based on the topic and did not seem to be driven by cultural factors as has been previously suggested. Learners used more hedges than boosters on all topics, contradicting most previous studies. Curriculum was determined to have a direct correlation with MD use, and other important factors may include perception of topic and overreliance on certain MDs over others (the One-to-One principle). This research implies that learners' perception of topic should be explored further as a variable affecting MD use. Curricula should be designed based on frequency of MD use by English native speakers, and learners should receive instruction that teaches the norms of MD use in academic writing. The methodology used in the study to determine correlations between MD use and the curriculum has a wide range of potential applications in the field of Contrastive Interlanguage Analysis.

A Corpus-Based Analysis of Crosslinguistic Influence on the Acquisition of Concessive Conditionals in L2 English

  • Newbery-Payton, Laurence
    • Asia Pacific Journal of Corpus Research / v.3 no.1 / pp.35-49 / 2022
  • This study examines crosslinguistic influence on the use of concessive conditionals by Japanese EFL learners. Contrastive analysis suggests that Japanese native speakers may overuse the concessive conditional 'even if' due to partial similarities to Japanese concessive conditionals, whose formal and semantic restrictions are fewer than those of English concessive conditionals. This hypothesis is tested using data from the written module of the International Corpus Network of Asian Learners of English (ICNALE). Comparison of Japanese native speakers with English native speakers and Chinese native speakers reveals the following trends. First, Japanese native speakers tend to overuse concessive conditionals compared to English native speakers, while similar overuse is not observed in Chinese native speaker data. Second, non-nativelike uses of 'even if' appear in contexts allowing the use of concessive conditionals in Japanese. Third, while overuse and infelicitous use of 'even if' are observed at all proficiency levels, formal errors are restricted to learners at lower proficiency levels. These findings suggest that crosslinguistic influence does occur in the use of concessive conditionals, and that its particular realization is affected by L2 proficiency, with formal crosslinguistic influence mediated at an earlier stage than semantic crosslinguistic influence.

A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data

  • Lee, Won-Jo
    • The Journal of the Convergence on Culture Technology / v.8 no.6 / pp.891-897 / 2022
  • In big data analysis, raw text data mostly exists in unstructured forms, so it can be analyzed only after it has been converted into a structured form through heuristic pre-processing and computer-based post-processing cleansing. In this study, unnecessary elements are removed from the collected raw data in pre-processing, and stopwords are removed in post-processing, so that the wordcloud technique of the R program, one of the text data analysis techniques, can be applied. A case study of wordcloud analysis was then conducted, which calculates the frequency of occurrence of words and presents high-frequency words as key issues. To address the problems of the "nested stopword source code" method, the existing stopword processing method used with R's word cloud technique, we propose the use of a "general stopword corpus" and a "user-defined stopword corpus" and conduct a case analysis. The advantages and disadvantages of the proposed "unstructured data cleansing process model" are comparatively verified and presented, and the practical application of word cloud visualization analysis using the proposed "external corpus cleansing technique" is presented.
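
The cleansing idea in this abstract can be sketched in a few lines. The following is a minimal Python illustration, not the authors' R implementation: the two stopword sets, the helper names `tokenize` and `word_frequencies`, and the sample sentence are all hypothetical stand-ins for a general stopword corpus, a user-defined stopword corpus, and collected raw text.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the two external stopword corpora.
GENERAL_STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}
USER_STOPWORDS = {"data"}  # domain-specific words the analyst chooses to drop


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (pre-processing)."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, *stopword_sets):
    """Count tokens after removing every word found in any stopword set."""
    stopwords = set().union(*stopword_sets)
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


# Invented sample text, used only to exercise the functions.
raw = "The analysis of text data is the cleansing of data and the counting of words."
freqs = word_frequencies(raw, GENERAL_STOPWORDS, USER_STOPWORDS)
print(freqs.most_common(3))
```

Keeping the stopword lists outside the counting function is the point of the sketch: the lists can be swapped or extended without touching the analysis code, which is presumably the advantage over stopwords nested in source code.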

A Comparative Study of a New Approach to Keyword Analysis: Focusing on NBC

  • Ha, Myoungho
    • Journal of Digital Convergence / v.19 no.7 / pp.33-39 / 2021
  • This paper aims to analyze lexical properties of keyword lists extracted from the NLT Old Testament Corpus (NOTC), the NLT New Testament Corpus (NNTC), and the NLT Bible Corpus (NBC) and to show that text dispersion keyness is more effective than corpus frequency keyness. For this purpose, NOTC, comprising around 570,000 running words, and NNTC, comprising about 200,000, were compiled after downloading the files from the NLT website of Bible Hub. Scott's (2020) WordSmith 8.0 was used to extract keyword lists by comparing a target corpus with a reference corpus. The results demonstrated that text dispersion keyness showed the lexical properties of keyword lists better than corpus frequency keyness and that the former was a superior measure for generating optimal keyword lists that fully meet content generalizability and content distinctiveness.
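
To make the contrast between the two keyness measures concrete, here is a minimal Python sketch. It assumes the commonly used two-term log-likelihood statistic and assumes that text dispersion keyness applies it to the number of texts containing a word rather than to token counts. It is not a reproduction of the WordSmith 8.0 procedure, and the function names and toy texts are invented for illustration.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood: a hits out of c units (target), b out of d (reference)."""
    total = a + b
    expected_target = c * total / (c + d)
    expected_reference = d * total / (c + d)
    score = 0.0
    for observed, expected in ((a, expected_target), (b, expected_reference)):
        if observed > 0:  # a zero observation contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def frequency_keyness(word, target_texts, reference_texts):
    """Keyness from token frequencies; each text is a list of tokens."""
    a = sum(text.count(word) for text in target_texts)
    b = sum(text.count(word) for text in reference_texts)
    c = sum(len(text) for text in target_texts)
    d = sum(len(text) for text in reference_texts)
    return log_likelihood(a, b, c, d)


def dispersion_keyness(word, target_texts, reference_texts):
    """Keyness from text range: how many texts contain the word at least once."""
    a = sum(1 for text in target_texts if word in text)
    b = sum(1 for text in reference_texts if word in text)
    return log_likelihood(a, b, len(target_texts), len(reference_texts))


# Invented toy corpora: two short "texts" each.
target = [["lord", "said"], ["lord", "lord", "came"]]
reference = [["he", "said"], ["she", "came"]]
print(frequency_keyness("lord", target, reference))
print(dispersion_keyness("lord", target, reference))
```

Under these assumptions, a word repeated many times in a single text inflates the frequency-based score but not the dispersion-based one, which is the intuition behind preferring the latter for content generalizability.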

Corpus-based analysis of the usage of Korean markers -(n)un and -i/ka in editorial texts

  • Kim, Kyoung-Young
    • Language and Information / v.19 no.2 / pp.19-36 / 2015
  • The aim of this paper is to investigate the usage of the Korean markers -(n)un and -i/ka in editorial texts, focusing on information structure. Noun phrases ending with the markers -(n)un and -i/ka were annotated semi-automatically using a corpus obtained from an online newspaper. Two important factors determining the choice of marker were examined with the annotated data: referential givenness/newness and position in a sentence. Referential givenness and newness were adopted as indicators of information structure, topic and focus respectively. In addition to quantitative analysis, qualitative analysis was conducted on the selected data. The results suggest that both markers, -(n)un and -i/ka, could carry a topic and a focus reading. Sentence position also played a crucial role in determining the marker, and the marker -i/ka was used more frequently in a later position of a sentence than the marker -(n)un.


Phonological processes of consonants from orthographic to pronounced words in the Buckeye Corpus

  • Yang, Byunggon
    • Phonetics and Speech Sciences / v.11 no.4 / pp.55-62 / 2019
  • This paper investigates the phonological processes of consonants in pronounced words in the Buckeye Corpus and compares the frequency distribution of these processes to provide a clearer understanding of conversational English for linguists and teachers. Both orthographic and pronounced words were extracted from the transcribed label scripts of the Buckeye Corpus. Next, the phonological processes of consonants in the orthographic and pronounced labels were tabulated separately by onsets and codas, and a frequency distribution by consonant process types was examined. The results showed that the majority of the onset clusters were pronounced as the same sounds in the Buckeye Corpus. The participants in the corpus were presumed to speak semiformally. In addition, the onsets have fewer deletions than the codas, which might be related to the information weight of the syllable components. Moreover, there is a significant association and strong positive correlation between the phonological processes of the onsets and codas in men and women. This paper concludes that an analysis of phonological processes in spontaneous speech corpora can contribute to a practical understanding of spoken English. Further studies comparing the current phonological process data with those of other languages would be desirable to establish universal patterns in phonological processes.

Clustering Keywords to Define Cybersecurity: An Analysis of Malaysian and ASEAN Countries' Cyber Laws

  • Joharry, Siti Aeisha;Turiman, Syamimi;Nor, Nor Fariza Mohd
    • Asia Pacific Journal of Corpus Research / v.3 no.2 / pp.17-33 / 2022
  • While the term is nothing new, 'cybersecurity' still seems to be defined quite loosely and subjectively depending on context. This is problematic, especially for legal writers, when prosecuting cybercrimes that do not fit a particular clause/act. What makes this more difficult is that Malaysia has no single 'cybersecurity law', relying instead on the current implementation of 10 related cybersecurity acts. In this paper, the 10 acts are compiled into a corpus to analyse the language used in these acts via a corpus linguistics approach. A list of frequent words is first investigated to see whether the so-called related laws do talk about cybersecurity, followed by close inspection of the concordance lines and habitually associated phrases (clusters) to explore the use of these words in context. The 'compare 2 wordlist' feature is used to identify similarities or differences between the 10 Malaysian cybersecurity-related laws and a corpus of cyber laws from other ASEAN countries. Findings revealed that ASEAN cyber laws refer mostly to three dominant cybersecurity themes identified in the literature: technological solutions; events; and strategies, processes, and methods, whereas Malaysian cybersecurity-related laws revolved around themes like human engagement and referent objects (of security). Although these so-called cyber-related policies and laws in Malaysia are highlighted by the National Cyber Security Agency (NACSA), their practical applications to combat cybercrimes remain uncertain.
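
The concordance and cluster steps mentioned in this abstract can be illustrated with a short Python sketch. It is a simplified stand-in for what a concordancing tool does, not the tool the authors used. The function names `concordance` and `clusters` and the sample sentence are invented for illustration.

```python
from collections import Counter


def concordance(tokens, node, width=3):
    """Return KWIC tuples: up to `width` tokens of left and right context around `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def clusters(tokens, node, n=2):
    """Count every n-token sequence that contains `node`."""
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(gram for gram in grams if node in gram)


# Invented sample text in the style of a statute, used only as toy input.
tokens = "any person who commits an offence under this act commits an offence".split()

for left, node, right in concordance(tokens, "offence", width=2):
    print(f"{left:>12} [{node}] {right}")
print(clusters(tokens, "offence").most_common(2))
```

Reading the aligned concordance lines shows each occurrence in context, while the cluster counts surface the phrases that habitually accompany the search word.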

A Corpus-Based Study of the Use of HEART and HEAD in English

  • Oh, Sang-suk
    • Language and Information / v.18 no.2 / pp.81-102 / 2014
  • The purpose of this paper is to provide corpus-based quantitative analyses of HEART and HEAD in order to examine their actual usage and to consider some cognitive linguistic aspects associated with their use. The two corpora COCA and COHA are used for analysis in this study. The analysis of COCA reveals that the total frequency of HEAD is much higher than that of HEART, and that the figurative use of HEART (60%) is about twice as frequent as its literal use (32%); by contrast, the figurative use of HEAD (41%) is only slightly more frequent than its literal use (38%). Among all four genres, both lexemes occur most frequently in fiction and then in magazines. Over the past two centuries, the use of HEART has been steadily decreasing; by contrast, the use of HEAD has been steadily increasing. It is assumed that the decreasing use of HEART is partially due to the decrease in its figurative use and that the increasing use of HEAD is attributable to its diverse meanings, the increase in its lexical use, and the partial increase in its figurative use. The analysis of the collocation of verbs and adjectives preceding HEART and HEAD, as well as the modifying and predicating forms of HEART and HEAD, also provides some relevant information on the usage of the two lexemes. This paper shows that quantitative information helps in understanding not only the actual usage of the two lexemes but also the cognitive forces working behind it.


On the Analysis of Natural Language Processing Morphology for the Specialized Corpus in the Railway Domain

  • Won, Jong Un;Jeon, Hong Kyu;Kim, Min Joong;Kim, Beak Hyun;Kim, Young Min
    • International Journal of Internet, Broadcasting and Communication / v.14 no.4 / pp.189-197 / 2022
  • Today, we are exposed to various text-based media such as newspapers, Internet articles, and social networking services (SNS), and the amount of text data we encounter has increased exponentially with the recent availability of Internet access on mobile devices such as smartphones. Collecting useful information from large amounts of text is called text analysis, and with the recent development of artificial intelligence, it is performed using technologies such as Natural Language Processing (NLP). For this purpose, morpheme analyzers based on everyday language have been released and are in use. Pre-trained language models, which acquire natural language knowledge through unsupervised learning on large corpora, have recently become very common in natural language processing, but conventional morpheme analyzers are limited in their use in specialized fields. In this paper, as preliminary work toward developing a natural language analysis model specialized in the railway field, the procedure for constructing a corpus specialized in the railway field is presented.