• Title/Summary/Keyword: analysis of corpus

Topic Analysis of Science and Technology Articles using CiteSeer Corpus (CiteSeer 말뭉치를 이용한 과학기술 문헌의 주제 분석)

  • Jung, Han-Min; Kang, In-Su; Sung, Won-Kyung
    • Journal of KIISE:Computing Practices and Letters / v.14 no.5 / pp.507-511 / 2008
  • There have been enormous technological advances in the science and technology domain and frequent convergence among its sub-domains. Topic analysis of a science and technology corpus is a key process for grasping topic trends and the relations between topics. The main objective of this research is to show various analytic approaches using topics extracted from the CiteSeer corpus, which is widely used in the information technology domain. The paper also presents a case study of Onto-Frame, an R&D support system developed by KISTI, to reveal the role of topics in the system.

An Analysis of the Vowel Formants of the Young Males in the Buckeye Corpus (벅아이 코퍼스에서의 젊은 성인 남성의 모음 포먼트 분석)

  • Yoon, Kyu-Chul; Noh, Hye-Uk
    • Phonetics and Speech Sciences / v.4 no.2 / pp.41-49 / 2012
  • The purpose of this paper is to extract the vowel formants of the ten young male speakers from the Buckeye Corpus of Conversational Speech [1] and to analyze them in comparison to earlier works in terms of various phonetic factors that are expected to affect the realization of the formant distribution. The first two formant frequency values were automatically extracted with a Praat script along with such factors as the place of articulation, the content versus function word information, the syllabic stress information, the location in a word, the location in an utterance, the speech rate of the three consecutive words, and the word frequency in the corpus. The results indicated that the formant patterns from the corpus were very different from those of earlier works although the overall pattern was similar and that the factors were strongly responsible for the realization of the two formants.

A Study on the Use of Genitive Particle '의': Focusing on the analysis of Korean Learners Corpus (한국어 학습자의 관형격 조사 '의' 사용 양상 연구: 학습자 말뭉치 분석을 중심으로)

  • Ji-Young Sim; Soo-Hyun Lee
    • Journal of the Korean Society of Industry Convergence / v.26 no.3 / pp.433-442 / 2023
  • The purpose of this study is to reveal how Korean learners use the genitive particle '의' according to its semantic classification, so that the findings can inform decisions about the content and methods of related education. The study adopts a quantitative analysis using the learner corpus established by the National Institute of Korean Language. The analysis shows that as proficiency increases, the overall frequency of '의' increases and the number of senses used increases. However, the frequency of errors also increases. As for the usage pattern of each sense, 'ownership, belonging' is the most frequent, followed by 'acting entity', 'kinship, social relations', and 'relationship (area)'. In conclusion, the senses 'acting entity' and 'relationship (area)' need to be supplemented with explicit education. The other senses need further discussion, and decisions should be made in consideration of learning purpose and proficiency.

The meaning of 'Educational Philosophy': by the usage of the Mulgyeol 21 Corpus ('교육철학' 용어의 의미 분석: <물결21 코퍼스>를 중심으로)

  • Chang, Chi Won
    • Philosophy of Education / no.66 / pp.77-103 / 2018
  • This study examined the meaning of 'educational philosophy' by means of corpus analysis. Professional researchers and the general public differ in what they mean by educational philosophy. This semantic phenomenon implies that the acoustic image of 'educational philosophy' is not matched between the two groups. Drawing on Saussure's linguistic theory, the study examined the semantics of educational philosophy in the Mulgyeol 21 Corpus. Unlike the philosophical inquiry into education that defines educational philosophy, the general public uses 'educational philosophy' with the connotation of a secret to successful learning and child rearing. Given the power of the media and the masses, this tendency could affect the meaning and definition of educational philosophy. Professional researchers should investigate this acoustic image from linguistic and educational perspectives. Such research could help clarify the descriptive and normative meanings of educational philosophy.

A corpus-based study on the effects of voicing and gender on American English Fricatives (성대진동 및 성별이 미국영어 마찰음에 미치는 효과에 관한 코퍼스 기반 연구)

  • Yoon, Tae-Jin
    • Phonetics and Speech Sciences / v.10 no.2 / pp.7-14 / 2018
  • The paper investigates the acoustic characteristics of English fricatives in the TIMIT corpus, with a special focus on the role of voicing in rendering fricatives in American English. The TIMIT database includes 630 talkers and 2,342 different sentences, and comprises more than five hours of speech. Acoustic analyses are conducted in the domain of spectral and temporal properties by treating gender, voicing, and place of articulation as independent factors. The results of the acoustic analyses revealed that acoustic signals interact in a complex way to signal the gender, place, and voicing of fricatives. Classification experiments using a multiclass support vector machine (SVM) revealed that 78.7% of fricatives are correctly classified. The majority of errors stem from the misclassification of /θ/ as [f] and /ʒ/ as [z]. The average accuracy of gender classification is 78.7%. Most errors result from the classification of female speakers as male speakers. The paper contributes to the understanding of the effects of voicing and gender on fricatives in a large-scale speech corpus.

An Analysis of the Vowel Formants of the Young Females in the Buckeye Corpus (벅아이 코퍼스에서의 젊은 성인 여성의 모음 포먼트 분석)

  • Yoon, Kyuchul
    • Phonetics and Speech Sciences / v.4 no.4 / pp.45-52 / 2012
  • The purpose of this paper is to measure the first two vowel formants of the ten young female speakers from the Buckeye Corpus of Conversational Speech [1] automatically and then to analyze various potential factors that may affect the formant distribution of the eight peripheral vowels of English. The factors that were analyzed included the place of articulation, the content versus function word information, the syllabic stress information, the location in a word, the location in an utterance, the speech rate of the three consecutive words, and the word frequency in the corpus. The results indicate that the overall formant patterns of the female speakers were similar to those of earlier works. The effects of the factors on the realization of the two formants were also similar to those from the male speakers with minor differences.

A study on the release burst spectra of the voiceless plosives from the English and Korean spontaneous speech corpus (영어와 한국어 자연발화 코퍼스에서의 무성 폐쇄음 개방 파열 스펙트럼 연구)

  • Hwang, Sunmi; Yoon, Kyuchul
    • Phonetics and Speech Sciences / v.9 no.4 / pp.27-34 / 2017
  • The purpose of this work is to examine the English and Korean voiceless plosives from the Buckeye [15] and Seoul [16] corpora in terms of their static spectral characteristics. The plosives were automatically extracted with a Praat script. To estimate the percentage of correctly classified plosives, discriminant analyses were performed with training based on four spectral moments, i.e. the center of gravity, variance, skewness, and kurtosis, as suggested in [6]. Another set of discriminant analyses was performed based on the spectral tilts. In the last set of analyses, the spectral moments and tilts were both used in the training. Results showed that the correct classification rate did not exceed about 65% in the best case, which suggests that phonetic cues other than the release burst, including dynamic spectral aspects and vowel-onset cues, would be necessary.
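
The four spectral moments named in the abstract above can be illustrated with a minimal sketch. It is not the authors' Praat script: the function name `spectral_moments`, the toy spectrum, and the choice to report excess kurtosis (fourth standardized moment minus 3) are assumptions made for illustration only.

```python
def spectral_moments(freqs, power):
    """Return (center of gravity, variance, skewness, excess kurtosis).

    The power spectrum is treated as a probability distribution over
    frequency. This is an illustrative convention, not taken from the paper.
    """
    if not freqs or len(freqs) != len(power):
        raise ValueError("freqs and power must be non-empty and equal in length")
    total = sum(power)
    if total <= 0:
        raise ValueError("total power must be positive")

    # Normalize the power values so that they sum to 1.
    weights = [p / total for p in power]

    # First moment: power-weighted mean frequency (center of gravity).
    cog = sum(f * w for f, w in zip(freqs, weights))

    def central(k):
        # k-th central moment around the center of gravity.
        return sum(((f - cog) ** k) * w for f, w in zip(freqs, weights))

    variance = central(2)
    if variance == 0:
        raise ValueError("spectrum has zero spread; skewness is undefined")

    skewness = central(3) / variance ** 1.5
    kurtosis = central(4) / variance ** 2 - 3.0  # excess kurtosis
    return cog, variance, skewness, kurtosis


# Hypothetical three-bin spectrum, symmetric around 2000 Hz.
freqs = [1000.0, 2000.0, 3000.0]
power = [1.0, 2.0, 1.0]
cog, variance, skewness, kurtosis = spectral_moments(freqs, power)
print(cog, variance, skewness, kurtosis)
```

For a symmetric spectrum such as this toy one, the skewness is zero; a release burst with energy concentrated at high frequencies would instead give a higher center of gravity and negative skewness.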

Development of Online Fashion Thesaurus and Taxonomy for Text Mining (텍스트마이닝을 위한 패션 속성 분류체계 및 말뭉치 웹사전 구축)

  • Seyoon Jang; Ha Youn Kim; Songmee Kim; Woojin Choi; Jin Jeong; Yuri Lee
    • Journal of the Korean Society of Clothing and Textiles / v.46 no.6 / pp.1142-1160 / 2022
  • Text data plays a significant role in understanding and analyzing trends in consumer, business, and social sectors. Text analysis requires a corpus that reflects specific domain knowledge. However, the field of fashion lacks such a professional corpus. This study aims to develop a taxonomy and thesaurus that take the specialized nature of fashion products into account. To this end, about 100,000 fashion vocabulary terms were collected by crawling text data from WGSN, Pantone, and online platforms; the text was subsequently extracted through preprocessing with Python. The taxonomy was composed of items, silhouettes, details, styles, colors, textiles, and patterns/prints, which are seven attributes of clothes. The corpus was completed by processing synonyms of terms from fashion books such as dictionaries. Finally, 10,294 vocabulary words, including 1,956 standard Korean words, were classified in the taxonomy. All data was then developed into a web dictionary system. Quantitative and qualitative performance tests of the results were conducted through expert reviews. The performance of the thesaurus was also verified by comparing the results of text mining analysis with those from a previously developed corpus. This study contributes to achieving a text data standard and enables meaningful results of text mining analysis in the fashion field.

The pattern of use by gender and age of the discourse markers 'a', 'eo', and 'eum' (담화표지 '아', '어', '음'의 성별과 연령별 사용 양상)

  • Song, Youngsook; Shim, Jisu; Oh, Jeahyuk
    • Phonetics and Speech Sciences / v.12 no.4 / pp.37-45 / 2020
  • This paper quantitatively calculated the frequency and the duration of the discourse markers 'a', 'eo', and 'eum' using the Seoul Corpus, a spontaneous speech corpus. The sound durations were confirmed with Praat, the Seoul Corpus was analyzed with EmEditor, and the results were presented through statistical analysis with R. Based on the corpus analysis, the study investigated whether a particular factor is preferred by speakers of particular categories. The most prominent finding is that the sound durations of female speakers were longer than those of male speakers when the discourse marker 'eum' was used in final position. Among the age-related variables, teenagers uttered 'a' more than 'eo' in initial position when compared to people in their 40s. This study is significant because it quantitatively analyzed the discourse markers 'a', 'eo', and 'eum' by gender and age. To continue the discussion, more precise research that considers the context should be conducted. In addition, similarities can be found with 'e' and 'ma' in Japanese (Watanabe & Ishi, 2000) and 'uh' and 'um' in English (Gries, 2013). A follow-up study could identify commonalities and differences through cross-linguistic analysis of discourse.

A Corpus-based English Syntax Academic Word List Building and its Lexical Profile Analysis (코퍼스 기반 영어 통사론 학술 어휘목록 구축 및 어휘 분포 분석)

  • Lee, Hye-Jin; Lee, Je-Young
    • The Journal of the Korea Contents Association / v.21 no.12 / pp.132-139 / 2021
  • This corpus-driven study describes the compilation of the most frequently occurring academic words in the domain of syntax and compares the extracted word list with the Academic Word List (AWL) of Coxhead (2000) and the General Service List (GSL) of West (1953) to examine their distribution and coverage within the syntax corpus. A specialized corpus of 546,074 tokens, composed of widely used, must-read syntax textbooks for English education majors, was loaded into and analyzed with AntWordProfiler 1.4.1. In terms of lexical frequency, the analysis identified 288 (50.5%) AWL word forms that appeared 16 times or more and 218 (38.2%) AWL items that occurred 15 times or fewer. The analysis also indicated that the AWL and GSL covered 9.19% and 78.92% of all tokens, respectively, and that the two lists combined covered 88.11%. Given that the AWL can be instrumental in serving broad disciplinary needs, this study highlights the need to compile a domain-specific AWL as a lexical repertoire to promote academic literacy and competence.
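
The coverage figures in the abstract above are shares of all corpus tokens that belong to a word list, and the combined figure is the sum of the two shares (9.19% + 78.92% = 88.11%), which holds when the lists do not overlap. The sketch below illustrates that arithmetic only. It does not reproduce AntWordProfiler; the function name `coverage`, the toy sentence, and the tiny stand-in word lists are hypothetical.

```python
def coverage(tokens, word_list):
    """Return the percentage of tokens that appear in word_list."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    vocab = {w.lower() for w in word_list}
    hits = sum(1 for t in tokens if t.lower() in vocab)
    return 100.0 * hits / len(tokens)


# Hypothetical ten-token "corpus" and stand-in lists (not the real GSL/AWL).
tokens = "the analysis of the clause requires a theory of structure".split()
gsl_like = {"the", "of", "a"}
awl_like = {"analysis", "requires", "theory", "structure"}

gsl_cov = coverage(tokens, gsl_like)             # 5 of 10 tokens
awl_cov = coverage(tokens, awl_like)             # 4 of 10 tokens
combined = coverage(tokens, gsl_like | awl_like)  # 9 of 10 tokens

# With disjoint lists, combined coverage equals the sum of the two shares.
print(gsl_cov, awl_cov, combined)
```

Tokens covered by neither list (here, "clause") are the candidates for a domain-specific list of the kind the study argues for.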