• Title/Summary/Keyword: Corpus Analysis


An Attempt to Measure the Familiarity of Specialized Japanese in the Nursing Care Field

  • Haihong Huang; Hiroyuki Muto; Toshiyuki Kanamaru
    • Asia Pacific Journal of Corpus Research / v.4 no.2 / pp.57-74 / 2023
  • Having a firm grasp of technical terms is essential for learners of Japanese for Specific Purposes (JSP). This research aims to analyze Japanese nursing care vocabulary based on objective corpus-based frequency and subjectively rated word familiarity. For this purpose, we constructed a text corpus centered on the National Examination for Certified Care Workers to extract nursing care keywords. The Log-Likelihood Ratio (LLR) was used as the statistical criterion for keyword identification, giving a list of 300 keywords as target words for a further word recognition survey. The survey involved 115 participants of whom 51 were certified care workers (CW group) and 64 were individuals from the general public (GP group). These participants rated the familiarity of the target keywords through crowdsourcing. Given the limited sample size, Bayesian linear mixed models were utilized to determine word familiarity rates. Our study conducted a comparative analysis of word familiarity between the CW group and the GP group, revealing key terms that are crucial for professionals but potentially unfamiliar to the general public. By focusing on these terms, instructors can bridge the knowledge gap more efficiently.
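The keyword-identification step can be illustrated with a minimal sketch. The abstract does not say which formulation of the Log-Likelihood Ratio was used, so the code below assumes the common two-term version that compares a word's observed frequency in a target and a reference corpus with its expected frequency. The function names, the `top_n` parameter, and the toy counts are hypothetical and are not taken from the study.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-term log-likelihood keyness of one word (assumed formulation)."""
    total = freq_target + freq_ref
    score = 0.0
    for observed, size in ((freq_target, size_target), (freq_ref, size_ref)):
        # Expected frequency if the word were equally common in both corpora.
        expected = size * total / (size_target + size_ref)
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score

def rank_keywords(target_counts, ref_counts, top_n=300):
    """Rank words that are relatively more frequent in the target corpus."""
    size_target = sum(target_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq in target_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only positive keywords (overused in the target corpus).
        if freq / size_target > ref_freq / size_ref:
            scored.append(
                (word, log_likelihood(freq, size_target, ref_freq, size_ref))
            )
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy counts, for illustration only.
target = {"care": 50, "the": 100}
reference = {"care": 5, "the": 100, "dog": 45}
keywords = rank_keywords(target, reference)
```

With these toy counts only "care" is returned, because "the" has the same relative frequency in both corpora.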

In My Opinion: Modality in Japanese EFL Learners' Argumentative Essays

  • Pemberton, Christine
    • Asia Pacific Journal of Corpus Research / v.1 no.2 / pp.57-72 / 2020
  • This study seeks to add to the current understanding of learners' use of modality in argumentative writing. A learner corpus of argumentative essays on four topics was created and compared to native English speaker data from the International Corpus Network of Asian Learners of English (ICNALE). The relationship between learners' use of modal devices (MDs) and the devices' appearance in the school's curriculum was also examined. The results showed that learners relied on a very narrow range of MDs compared to those in previous studies. The frequency of use of MDs varied based on the topic and did not seem to be driven by cultural factors, as has been previously suggested. Learners used more hedges than boosters on all topics, contradicting most previous studies. Curriculum was determined to have a direct correlation with MD use, and other important factors may include perception of topic and overreliance on certain MDs over others (the One-to-One principle). This research implies that learners' perception of topic should be explored further as a variable affecting MD use. Curricula should be designed based on frequency of MD use by English native speakers, and learners should receive instruction that teaches the norms of MD use in academic writing. The methodology used in the study to determine correlations between MD use and the curriculum has a wide range of potential applications in the field of Contrastive Interlanguage Analysis.

A Corpus-Based Analysis of Crosslinguistic Influence on the Acquisition of Concessive Conditionals in L2 English

  • Newbery-Payton, Laurence
    • Asia Pacific Journal of Corpus Research / v.3 no.1 / pp.35-49 / 2022
  • This study examines crosslinguistic influence on the use of concessive conditionals by Japanese EFL learners. Contrastive analysis suggests that Japanese native speakers may overuse the concessive conditional "even if" because of its partial similarity to Japanese concessive conditionals, which have fewer formal and semantic restrictions than English concessive conditionals. This hypothesis is tested using data from the written module of the International Corpus Network of Asian Learners of English (ICNALE). Comparison of Japanese native speakers with English native speakers and Chinese native speakers reveals the following trends. First, Japanese native speakers tend to overuse concessive conditionals compared to English native speakers, while similar overuse is not observed in Chinese native speaker data. Second, non-nativelike uses of "even if" appear in contexts allowing the use of concessive conditionals in Japanese. Third, while overuse and infelicitous use of "even if" are observed at all proficiency levels, formal errors are restricted to learners at lower proficiency levels. These findings suggest that crosslinguistic influence does occur in the use of concessive conditionals, and that its particular realization is affected by L2 proficiency, with formal crosslinguistic influence mediated at an earlier stage than semantic crosslinguistic influence.

A Comparative Study of a New Approach to Keyword Analysis: Focusing on NBC (키워드 분석에 대한 최신 접근법 비교 연구: 성경 코퍼스를 중심으로)

  • Ha, Myoungho
    • Journal of Digital Convergence / v.19 no.7 / pp.33-39 / 2021
  • This paper analyzes the lexical properties of keyword lists extracted from the NLT Old Testament Corpus (NOTC), the NLT New Testament Corpus (NNTC), and the NLT Bible Corpus (NBC), and aims to show that text dispersion keyness is more effective than corpus frequency keyness. For this purpose, NOTC (around 570,000 running words) and NNTC (about 200,000 running words) were compiled from files downloaded from the NLT website of Bible Hub. Scott's (2020) WordSmith 8.0 was used to extract keyword lists by comparing a target corpus with a reference corpus. The results demonstrated that text dispersion keyness reflected the lexical properties of the keyword lists better than corpus frequency keyness, and that it was the superior measure for generating keyword lists that satisfy both content generalizability and content distinctiveness.
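The contrast between the two keyness measures can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of what WordSmith 8.0 computes. It assumes that corpus frequency keyness applies a log-likelihood test to token counts, and that text dispersion keyness applies the same test to the number of texts containing the word. All function names and the toy texts are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, n1, b, n2):
    """Two-term log-likelihood for counts a (of n1) versus b (of n2)."""
    total = a + b
    score = 0.0
    for observed, size in ((a, n1), (b, n2)):
        expected = size * total / (n1 + n2)
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score

def token_frequency(texts):
    """Total number of occurrences of each word across all texts."""
    counts = Counter()
    for tokens in texts:
        counts.update(tokens)
    return counts

def text_frequency(texts):
    """Number of texts in which each word occurs at least once."""
    counts = Counter()
    for tokens in texts:
        counts.update(set(tokens))
    return counts

def keyness(word, target_texts, ref_texts, by_dispersion):
    if by_dispersion:
        target, ref = text_frequency(target_texts), text_frequency(ref_texts)
        n1, n2 = len(target_texts), len(ref_texts)
    else:
        target, ref = token_frequency(target_texts), token_frequency(ref_texts)
        n1, n2 = sum(target.values()), sum(ref.values())
    return log_likelihood(target[word], n1, ref[word], n2)

# Hypothetical toy corpora: "ark" is frequent but confined to one text,
# while "lord" is infrequent but spread across three of the four texts.
target_texts = [
    ["ark"] * 20 + ["and"] * 5,
    ["lord", "and", "and"],
    ["lord", "and"],
    ["lord", "and", "and"],
]
ref_texts = [["and", "and", "we"], ["and", "we"], ["and", "we", "we"], ["and", "we"]]

freq_ark = keyness("ark", target_texts, ref_texts, by_dispersion=False)
freq_lord = keyness("lord", target_texts, ref_texts, by_dispersion=False)
disp_ark = keyness("ark", target_texts, ref_texts, by_dispersion=True)
disp_lord = keyness("lord", target_texts, ref_texts, by_dispersion=True)
```

In this toy setup the frequency-based score ranks "ark" above "lord", while the dispersion-based score reverses the order. That is the kind of effect that makes dispersion-based lists less sensitive to words concentrated in a single text.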

A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data (비정형 텍스트 데이터 정제를 위한 불용어 코퍼스의 활용에 관한 연구)

  • Lee, Won-Jo
    • The Journal of the Convergence on Culture Technology / v.8 no.6 / pp.891-897 / 2022
  • In big data analysis, raw text data mostly exists in unstructured forms and can be analyzed only after it has been converted to a structured form through heuristic pre-processing and computer-based post-processing cleansing. In this study, unnecessary elements are removed from the collected raw data during pre-processing so that the wordcloud of the R program, one of the text data analysis techniques, can be applied, and stopwords are removed during post-processing. A wordcloud case study was then conducted, which counts word occurrences and presents high-frequency words as key issues. To address the problems of the "nested stopword source code" method, the existing stopword processing method in R's word cloud technique, we propose using a "general stopword corpus" and a "user-defined stopword corpus" and conduct a case analysis. The advantages and disadvantages of the proposed "unstructured data cleansing process model" are compared and verified, and a practical application of word cloud visualization analysis using the proposed "external corpus cleansing technique" is presented.
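The idea of keeping stopwords in external lists, rather than nesting them in the analysis code, can be sketched briefly. The study works in R; the sketch below uses Python only for illustration and does not reproduce the authors' code. The two stopword sets stand in for a "general stopword corpus" and a "user-defined stopword corpus"; their contents and the function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-ins for externally maintained stopword lists.
GENERAL_STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}
USER_STOPWORDS = {"study", "paper"}

def word_frequencies(text, *stopword_lists):
    """Count words after removing every word found in any stopword list."""
    stopwords = set().union(*stopword_lists)
    # Simple pre-processing: lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)

sample = "The study of the corpus and the corpus data"
frequencies = word_frequencies(sample, GENERAL_STOPWORDS, USER_STOPWORDS)
```

Because the lists are passed in as data, a domain-specific list can be swapped or extended without touching the counting code. The resulting frequencies are what a word cloud would then visualize.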

Corpus-based analysis of the usage of Korean markers -(n)un and -i/ka in editorial texts

  • Kim, Kyoung-Young
    • Language and Information / v.19 no.2 / pp.19-36 / 2015
  • The aim of this paper is to investigate the usage of the Korean markers -(n)un and -i/ka in editorial texts, focusing on information structure. Noun phrases ending with the markers -(n)un and -i/ka were annotated semi-automatically using a corpus obtained from an online newspaper. Two important factors in the choice of markers were examined with the annotated data: referential givenness/newness and position in a sentence. Referential givenness and newness were adopted as indicators of the information-structure categories topic and focus, respectively. In addition to quantitative analysis, qualitative analysis was conducted on the selected data. The results suggest that both -(n)un and -i/ka could carry a topic reading and a focus reading. Sentence position also played a crucial role in determining the marker, and the marker -i/ka was used more frequently in a later position of a sentence than the marker -(n)un.


Phonological processes of consonants from orthographic to pronounced words in the Buckeye Corpus

  • Yang, Byunggon
    • Phonetics and Speech Sciences / v.11 no.4 / pp.55-62 / 2019
  • This paper investigates the phonological processes of consonants in pronounced words in the Buckeye Corpus and compares the frequency distribution of these processes to provide a clearer understanding of conversational English for linguists and teachers. Both orthographic and pronounced words were extracted from the transcribed label scripts of the Buckeye Corpus. Next, the phonological processes of consonants in the orthographic and pronounced labels were tabulated separately by onsets and codas, and a frequency distribution by consonant process types was examined. The results showed that the majority of the onset clusters were pronounced as the same sounds in the Buckeye Corpus. The participants in the corpus were presumed to speak semiformally. In addition, the onsets have fewer deletions than the codas, which might be related to the information weight of the syllable components. Moreover, there is a significant association and strong positive correlation between the phonological processes of the onsets and codas in men and women. This paper concludes that an analysis of phonological processes in spontaneous speech corpora can contribute to a practical understanding of spoken English. Further studies comparing the current phonological process data with those of other languages would be desirable to establish universal patterns in phonological processes.

Clustering Keywords to Define Cybersecurity: An Analysis of Malaysian and ASEAN Countries' Cyber Laws

  • Joharry, Siti Aeisha; Turiman, Syamimi; Nor, Nor Fariza Mohd
    • Asia Pacific Journal of Corpus Research / v.3 no.2 / pp.17-33 / 2022
  • While the term is nothing new, 'cybersecurity' still seems to be defined quite loosely and subjectively depending on context. This is especially problematic for legal writers when prosecuting cybercrimes that do not fit a particular clause or act. Compounding the difficulty, Malaysia has no single 'cybersecurity law'; instead, 10 related cybersecurity acts are currently in force. In this paper, the 10 acts are compiled into a corpus and their language is analysed using a corpus linguistics approach. A list of frequent words is first investigated to see whether the so-called related laws do talk about cybersecurity, followed by close inspection of the concordance lines and habitually associated phrases (clusters) to explore the use of these words in context. The 'compare 2 wordlist' feature is used to identify similarities and differences between the 10 Malaysian cybersecurity-related laws and a corpus of cyber laws from other ASEAN countries. Findings revealed that ASEAN cyber laws refer mostly to three dominant cybersecurity themes identified in the literature (technological solutions; events; and strategies, processes, and methods), whereas Malaysian cybersecurity-related laws revolved around themes such as human engagement and referent objects (of security). Although these so-called cyber-related policies and laws in Malaysia are highlighted by the National Cyber Security Agency (NACSA), their practical application in combating cybercrime remains uncertain.
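The cluster step, which looks at phrases that habitually occur around a frequent word, can be sketched as counting fixed-length word sequences that contain a chosen node word. This is a simplified illustration and not the concordancing tool used in the paper. The function name, the default cluster size, and the sample sentence are hypothetical.

```python
from collections import Counter

def clusters(tokens, node, size=3):
    """Count every run of `size` consecutive words that contains `node`."""
    counts = Counter()
    for start in range(len(tokens) - size + 1):
        gram = tuple(tokens[start:start + size])
        if node in gram:
            counts[gram] += 1
    return counts

# Hypothetical sample text, for illustration only.
tokens = "the minister may by order the minister may appoint".split()
minister_clusters = clusters(tokens, "minister")
top_cluster = minister_clusters.most_common(1)
```

Sorting the counts with `most_common` surfaces the recurrent phrases around the node word, which can then be read in context alongside the concordance lines.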

Topic Analysis of Science and Technology Articles using CiteSeer Corpus (CiteSeer 말뭉치를 이용한 과학기술 문헌의 주제 분석)

  • Jung, Han-Min; Kang, In-Su; Sung, Won-Kyung
    • Journal of KIISE:Computing Practices and Letters / v.14 no.5 / pp.507-511 / 2008
  • The science and technology domain has seen enormous technological advances and frequent convergence among its sub-domains. Topic analysis of a science and technology corpus is a key process for grasping topic trends and the relations between topics. The main objective of this research is to show various analytic approaches using topics extracted from the CiteSeer corpus, which is widely used in the information technology domain. The paper also presents a case study of Onto-Frame, an R&D support system developed by KISTI, to show the role topics play in the system.

A Corpus-Based Study of the Use of HEART and HEAD in English

  • Oh, Sang-suk
    • Language and Information / v.18 no.2 / pp.81-102 / 2014
  • The purpose of this paper is to provide corpus-based quantitative analyses of HEART and HEAD in order to examine their actual usage and to consider some cognitive linguistic aspects associated with their use. Two corpora, COCA and COHA, are used for the analysis. The analysis of COCA reveals that the total frequency of HEAD is much higher than that of HEART, and that the figurative use of HEART (60%) is about twice as frequent as its literal use (32%); by contrast, the figurative use of HEAD (41%) is only slightly more frequent than its literal use (38%). Among all four genres, both lexemes occur most frequently in fiction and then in magazines. Over the past two centuries, the use of HEART has been steadily decreasing, whereas the use of HEAD has been steadily increasing. It is assumed that the decreasing use of HEART is partly due to the decrease in its figurative use, and that the increasing use of HEAD is attributable to its diverse meanings, the increase in its lexical use, and the partial increase in its figurative use. The analysis of the verbs and adjectives that collocate before HEART and HEAD, as well as the modifying and predicating forms of HEART and HEAD, also provides relevant information on the usage of the two lexemes. This paper shows that quantitative information aids understanding not only of the actual usage of the two lexemes but also of the cognitive forces at work behind it.
