Search | Korea Science

Sentence Boundary Detection Using Machine Learning Techniques (기계학습 기법을 이용한 문장경계인식)

Park, Su-Hyuk;Rim, Hae-Chang
- Proceedings of the Korea Information Processing Society Conference / 2008.05a / pp.69-72 / 2008
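This paper proposes a general-purpose sentence boundary detector that exploits statistical properties of language. Taking the sentence boundaries used in a large corpus as the reference, the proposed method extracts statistical features from syllables and eojeols (space-delimited word units) and applies a range of machine learning techniques to detect sentence boundaries. We also tried to use only general-purpose features that are not tied to a particular language or domain. We experimented with a variety of features so that the method also applies when sentence divisions are inherently ambiguous or punctuation is misused, and by applying the same features to Korean and English corpora we confirmed that the proposed features are general enough to carry over to Korean and to other languages. For Korean, training and evaluation used the Sejong Project corpus, with precision and recall as the performance measures; the proposed method achieved 99% precision and 99.2% recall. For English, we used the Wall Street Journal corpus; with the same features, the method achieved 98.9% precision and 94.6% recall.
https://doi.org/10.3745/PKIPS.y2008m05a.69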
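As a rough illustration of the approach summarized above, the sketch below extracts language-neutral context features around a candidate boundary and scores predictions with precision and recall. The feature names, the window size, and both function names are assumptions made for this example; the abstract does not list the paper's actual feature set.

```python
def candidate_features(text, i, window=2):
    """Context features for the candidate boundary at character index i.

    Hypothetical feature set: the candidate character, the characters
    (syllables) within `window` positions on each side, and the
    whitespace-delimited word that ends at the candidate.
    """
    feats = {"cand": text[i]}
    for k in range(1, window + 1):
        feats[f"prev{k}"] = text[i - k] if i - k >= 0 else "<s>"
        feats[f"next{k}"] = text[i + k] if i + k < len(text) else "</s>"
    # Word (eojeol) ending at the candidate, delimited by the previous space.
    start = text.rfind(" ", 0, i) + 1
    feats["word"] = text[start:i + 1]
    return feats


def precision_recall(predicted, gold):
    """Precision and recall of predicted boundary positions against gold ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


text = "Mr. Kim left. He ran."
print(candidate_features(text, 2)["word"])         # Mr.
print(precision_recall([2, 12, 20], [12, 20])[1])  # 1.0
```

A classifier trained on such feature dictionaries would decide, per candidate, whether it is a true sentence boundary; here the period after "Mr" counts as a false positive, which lowers precision but not recall.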

Language Games between Donald Trump and Gloria Anzaldúa (도널드 트럼프와 글로리아 안살두아의 '언어' 게임)

Park, Jungwon
- Cross-Cultural Studies / v.46 / pp.85-112 / 2017
Donald Trump, the 45th president of the United States, has revived the 'English Only' policy since the beginning of his presidential campaign. The monolingualism not only underscores his extremely conservative ideas, but it also reflects the nativist tendency that prevents the demographic and cultural transformation of the US, which is accelerated by globalization and transnational migration. In particular, Donald Trump tries to reconfirm the mainstream American culture that is now thought to have been threatened by Hispanization and the growing number of Spanish speakers. This paper examines the effects of "code-switching" and the possibility of a bilingual community by contrasting Donald Trump with Gloria Anzaldúa, one of the representative Latina writers who created a "border language." Borderlands/La Frontera (1987) includes Spanish glossaries and expressions to represent her bilingual realities, while attempting to translate from English to Spanish, and vice versa. However, the text occasionally demonstrates the impossibility of translation. In doing so, Anzaldúa indirectly states that it is indispensable to present both languages at the stage; she also invites monolingual readers to make more efforts to learn and better understand the Other's language. A "border language" she attempts to embody throughout the text is created in the process of encounters, conflicts, and negotiations among languages of different ethnicities, classes and generations. It does not signify an established form: rather it appears as a constantly transforming language, which can provide us with new perspectives and an alternative way of communication beyond monolingualism.

Trans-boundary Characteristics of the Post-dramatic Play as a Cultural Content (문화콘텐츠로서 포스트드라마 연극의 탈경계적 성격)

Song, Eun-A
- Journal of Korea Entertainment Industry Association / v.13 no.4 / pp.157-164 / 2019
Whereas the dramatic play is concerned with recreating the play text on stage, the post-dramatic play aims at a theater liberated from the play text. In this process, the boundaries created by dramatic theater are dismantled. Typical examples are the boundaries between actors and audiences, fiction and reality, theater and non-theater, works and events, and language and non-language. Dismantling these boundaries is an opportunity to restore the festival character of ancient Greek theater, which dramatic theater had forgotten. It has dismantled the language-centrism and play-centrism that have dominated theater since Aristotle, and has led to a new kind of play. If language-centrism and play-centrism brought about the crisis of drama, the post-dramatic play sets them aside and finds ways to communicate with the audience as new cultural content. The method is found above all in the restoration of dramaturgy, because the post-dramatic play depends more on theatricality than on literature. The trans-boundary nature of the post-dramatic play, with its enhanced theatricality, will be a stepping stone to popularization, and this shows the potential of the post-dramatic play as cultural content.
https://doi.org/10.21184/jkeia.2019.6.13.4.157

Using CRF (Conditional Random Fields) to Predict Phrase Breaks in Korean (CRF를 이용한 한국어 운율 경계 추정)

Kim, Seung-Won;Kim, Byeong-Chang;Jeong, Min-Woo;Lee, Gary Geun-Bae
- Annual Conference on Human and Language Technology / 2005.10a / pp.134-138 / 2005
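This paper treats prosodic boundary (phrase break) prediction in Korean TTS (Text-To-Speech) as a classification problem and applies CRFs (Conditional Random Fields) to it. Using a corpus annotated with part-of-speech tags and prosodic boundaries, we extracted linguistic features of various kinds, such as part of speech, lexical item, word length, and word position in the sentence, to train the CRF, and we combined the features to select the feature set with the best performance. We also applied Gaussian smoothing to reduce the data sparseness problem. The experimental results show that the method not only outperforms existing methods but is also highly compatible with other systems, because the features for predicting prosodic boundaries are kept independent.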
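The feature types named in the abstract above (part of speech, lexical item, word length, position in the sentence, and feature combinations) could be laid out as one dictionary per token, a layout that CRF toolkits commonly accept as input. This is only a sketch: the function names, the feature keys, the example sentence, and its POS tags are assumptions for illustration, and no CRF is trained here.

```python
def token_features(sentence, i):
    """Features for token i of `sentence`, a list of (word, pos) pairs."""
    word, pos = sentence[i]
    n = len(sentence)
    feats = {
        "word": word,                                # lexical item
        "pos": pos,                                  # part-of-speech tag
        "length": len(word),                         # word length in syllables
        "position": i / (n - 1) if n > 1 else 0.0,   # relative position, 0.0 to 1.0
    }
    # Example of a combined feature: current POS paired with the next POS.
    next_pos = sentence[i + 1][1] if i + 1 < n else "EOS"
    feats["pos+next_pos"] = pos + "|" + next_pos
    return feats


def sentence_features(sentence):
    """One feature dictionary per token, in sentence order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Illustrative sentence with made-up tags (not taken from the paper's corpus).
example = [("오늘", "NNG"), ("날씨가", "NNG+JKS"), ("좋다", "VA+EF")]
for feats in sentence_features(example):
    print(feats)
```

Each token's label (break or no break after the word) would be supplied as a parallel sequence when training a sequence model on these dictionaries.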

Prediction of Prosodic Break Using Syntactic Relations and Prosodic Features (구문 관계와 운율 특성을 이용한 한국어 운율구 경계 예측)

Jung, Youngim;Cho, SunHo;Yoon, Aesun;Kwon, Hyuk-Chul
- Annual Conference on Human and Language Technology / 2007.10a / pp.7-14 / 2007
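To predict natural Korean prosodic phrase boundaries, this paper (1) subcategorizes sentence constituents, (2) extracts syntactic phrases using the dependency relations among the subcategorized constituents, and (3) defines prosodic phrase boundary prediction rules according to the type of each extracted syntactic phrase. In addition, (4) beyond syntactic information, the system determines variable prosodic phrase boundaries from the lengths of the syntactic phrase and the sentence, the position of the syntactic phrase in the sentence, and the semantic information of the context, yielding a more natural Korean prosodic phrase boundary prediction system. As a result, recall and accuracy exceeded 90% both for predicting strong prosodic phrase boundaries, which correlate highly with syntactic phrase boundaries, and for predicting non-boundaries inside prosodic phrases; overall prosodic phrase boundary prediction also reached a performance of over 87%.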

Self-Disclosure and Boundary Impermeability among Languages of Twitter Users (트위터 이용자의 언어권별 자기노출 및 경계 불투과성)

Jang, Phil-Sik
- The Journal of the Korea Contents Association / v.16 no.4 / pp.434-441 / 2016
Using big data analysis procedures, the present study reviews and explores various aspects of self-disclosure and boundary impermeability among Twitter users worldwide. A total of 415 million tweets issued by 54 million users were collected over six months, and the users of the top 10 languages were investigated. The study analyzed the effect of users' language on boundary impermeability and on the disclosure rates of user profile, profile image, geographical information, URL in profile, and user description. The results showed that boundary impermeability and all the self-disclosure rates of Twitter users (profile, profile image, geographical information, URL in profile, user description) differed significantly (p<0.001) among language groups. The self-disclosure rates and the average points of Portuguese, Indonesian and Spanish users were higher than those of Arabic, Japanese, Turkish and Korean users. The results also showed a positive relationship between boundary impermeability and the number of tweets (including retweets) issued by each user.
https://doi.org/10.5392/JKCA.2016.16.04.434

Boundary Testing research using Dynamic Symbolic Execution (동적 심볼릭 수행을 이용한 경계 테스팅 연구)

Koo, Keun-Hoi;Choi, Seok-Won;Choi, Jin-Young
- Proceedings of the Korea Information Processing Society Conference / 2016.10a / pp.429-432 / 2016
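Because building test cases takes considerable time and cost, traditional testing struggles with products such as mobile phones and TVs that must ship on a tight schedule. To address this problem, academia and industry are studying the automatic generation of high-coverage test cases using dynamic symbolic execution. In particular, PEX, a dynamic symbolic execution tool developed by Microsoft Research, automatically generates test cases for web or Windows programs written in C#. However, PEX users reported that the generated test cases were insufficient, and follow-up research that added boundary value test cases found more errors than without them. This paper modifies CREST, a dynamic symbolic execution tool that supports C, a language widely used in embedded software, where software errors can cause enormous loss of property and life, so that it generates boundary value test cases.
https://doi.org/10.3745/PKIPS.y2016m10a.429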
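To show why boundary value test cases catch errors that ordinary path coverage can miss, here is a minimal sketch. It is not how PEX or CREST work internally: the threshold is supplied by hand, whereas a dynamic symbolic execution tool would derive it from the path constraints it collects. The names `boundary_values` and `classify`, and the off-by-one bug, are made up for this example.

```python
def boundary_values(threshold, lo=-2**31, hi=2**31 - 1):
    """Inputs just below, at, and just above a branch threshold,
    plus the ends of the input range, clipped to [lo, hi]."""
    candidates = {threshold - 1, threshold, threshold + 1, lo, hi}
    return sorted(v for v in candidates if lo <= v <= hi)


def classify(x):
    """Hypothetical function under test with an off-by-one bug.

    Intended behavior: values of 100 or more are "high".
    """
    return "high" if x > 100 else "low"   # bug: should be x >= 100


def expected(x):
    """Specification used as the test oracle."""
    return "high" if x >= 100 else "low"


for value in boundary_values(100, lo=0, hi=255):
    status = "ok" if classify(value) == expected(value) else "MISMATCH"
    print(value, classify(value), status)
```

Inputs such as 50 and 200 cover both branches of `classify` without revealing the bug; only the boundary input 100 produces a mismatch.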

Advanced detection of sentence boundaries based on hybrid method (하이브리드 방법을 이용한 개선된 문장경계인식)

Lee, Chung-Hee;Jang, Myung-Gil;Seo, Young-Hoon
- Annual Conference on Human and Language Technology / 2009.10a / pp.61-66 / 2009
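To handle web documents of various forms, this paper proposes an improved sentence boundary detection technique based on statistical language information and post-processing rules. To cope with web documents in which punctuation is frequently omitted and spacing errors are common, the proposed method is trained on every syllable that can serve as a sentence boundary. To maximize detection performance, we selected the optimal features and training data through a range of experiments, compared several machine-learning-based classification models to choose the best one, and used rules to correct the errors of the statistical model, which depends on its training data. To measure performance by document type, we evaluated on newspaper articles and blog posts that mix written and colloquial styles (evaluation set 1), the Sejong corpus and encyclopedia text, which are mostly written style (evaluation set 2), and web bulletin board posts with frequent punctuation omission and spacing errors (evaluation set 3). Using F-measure as the performance measure and evaluating sentence boundary detection on punctuation marks only, the system scored 96.5% on evaluation set 1 and 99.4% on evaluation set 2, showing that sentence boundaries in colloquial text are harder to detect. On evaluation set 1, rule-based post-processing raised precision from 92.1% to 99.4%, which shows the need for post-processing rules. In the final evaluation, we compared a baseline engine trained on punctuation marks only with the improved engine, which detects all sentence boundary candidates, on evaluation set 3; the improved engine showed a 32.0% performance improvement over the baseline engine (61.1%), demonstrating that the proposed method is effective on web documents.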
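The hybrid idea described above, a statistical model whose errors are then corrected by rules, can be sketched as a two-stage pipeline. Everything here is an assumption for illustration: `model_boundaries` is a deliberately naive stand-in for the trained classifier, and the two rules are invented examples of post-processing that removes false positives, not the rules used in the paper.

```python
import re


def model_boundaries(text):
    """Stand-in for the statistical model: flags every '.', '?' or '!'."""
    return [m.start() for m in re.finditer(r"[.?!]", text)]


def apply_rules(text, boundaries):
    """Rule-based post-processing that vetoes over-predicted boundaries."""
    kept = []
    for i in boundaries:
        prev_c = text[i - 1] if i > 0 else ""
        next_c = text[i + 1] if i + 1 < len(text) else ""
        # Rule 1: a period between two digits is a decimal point.
        if text[i] == "." and prev_c.isdigit() and next_c.isdigit():
            continue
        # Rule 2: in a run of marks ("...", "?!"), keep only the last one.
        if next_c and next_c in ".?!":
            continue
        kept.append(i)
    return kept


text = "It rose 3.5 points... Really?!"
raw = model_boundaries(text)
print(raw)                     # [9, 18, 19, 20, 28, 29]
print(apply_rules(text, raw))  # [20, 29]
```

Rules of this kind only remove candidates, so they can raise precision without touching the model, which matches the role the abstract assigns to post-processing.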

Multi-class Classification System Based on Multi-loss Linear Combination for Word Spacing and Sentence Boundary Detection (띄어쓰기 및 문장 경계 인식을 위한 다중 손실 선형 결합 기반의 다중 클래스 분류 시스템)

Kim, GiHwan;Seo, Jisu;Lee, Kyungyeol;Ko, Youngjoong
- Annual Conference on Human and Language Technology / 2018.10a / pp.185-188 / 2018
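Word spacing and sentence boundary detection are regarded as very important problems because their errors propagate heavily into later stages of natural language analysis, but because the two tasks use different features, they have been solved sequentially with separate models. However, word spacing and sentence boundary detection are not entirely different problems, and running two models in sequence not only propagates the errors of the first model to the next but also increases time complexity. This paper treats word spacing and sentence boundary detection as a single problem: a multi-class classification system that handles both at once addresses the time complexity problem, and a linear combination of multiple losses addresses the problem that the two tasks use different features. The final model improved on the word spacing and sentence boundary detection baseline models by 3.98%p and 0.34%p, respectively. Its running time was also 38.7% lower than that of running the single models sequentially.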
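One way a linear combination of two task losses over a single multi-class output might look is sketched below. The three-class label scheme (0 = no space, 1 = space, 2 = sentence end), the way each task's probability is read off the class distribution, and the weight `alpha` are all assumptions for this example; the abstract does not specify the paper's actual formulation.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy of predicted probability p against a 0/1 target y."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def combined_loss(probs, label, alpha=0.5):
    """Weighted sum of a spacing loss and a sentence boundary loss.

    probs: (p_none, p_space, p_sentence_end) for one character.
    label: 0 (no space), 1 (space), or 2 (sentence end), hypothetical scheme.
    """
    _p_none, p_space, p_sent = probs
    # Spacing task: a sentence end also implies a space after the character.
    loss_spacing = binary_cross_entropy(p_space + p_sent, 1 if label >= 1 else 0)
    # Boundary task: only the sentence-end class counts as positive.
    loss_boundary = binary_cross_entropy(p_sent, 1 if label == 2 else 0)
    return alpha * loss_spacing + (1 - alpha) * loss_boundary


# A confident, correct sentence-end prediction gives a small loss ...
print(combined_loss((0.1, 0.2, 0.7), label=2))
# ... while the same prediction for a no-space character is penalized by both terms.
print(combined_loss((0.1, 0.2, 0.7), label=0))
```

Varying `alpha` shifts how much each task drives the shared classifier, which is the practical point of combining the losses linearly rather than training two separate models.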

Robust Sentence Boundary Detection for Korean SNS Documents (한국어 SNS 문서에 적합한 문장 경계 인식)

Yeom, Haram;Kim, Jae-Hoon
- Annual Conference on Human and Language Technology / 2021.10a / pp.532-535 / 2021
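As various SNS platforms appear and the number of users surges, the information available online is becoming more valuable. Because the sentence is the basic unit of natural language processing systems, detecting sentence boundaries in a given document is essential. Publicly available sentence boundary detectors do not perform well on SNS documents. This paper proposes a sentence boundary detector that can be used not only on general documents in written style but also on SNS documents. To apply it to SNS documents, we make two improvements: 1) we extend the training corpus to cover two domains, general documents and SNS documents, and 2) we add as a feature the eojeol (word) type, which reflects the characteristic use of emoticons in SNS documents. Through experiments we analyze the contribution of the added feature and compare the proposed model with existing Korean sentence boundary detectors. The improved model achieved 99.1% recall on general documents and 88.4% recall on SNS documents, confirming that sentence boundary detection works well in both domains.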
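A word-type feature of the kind described above could be computed with a few regular expressions. The type inventory and the patterns below are assumptions for illustration; the abstract does not give the paper's actual categories.

```python
import re

# Hypothetical type inventory, checked in this order.
JAMO_RUN = re.compile(r"[ㄱ-ㅎㅏ-ㅣ]+")                   # e.g. ㅋㅋㅋ, ㅠㅠ
ASCII_EMOTICON = re.compile(r"[\^:;=][\^_\-o]?[\^)(DP]")  # e.g. ^^, :), ;-)
HANGUL_WORD = re.compile(r"[가-힣]+[.?!]*")                # Hangul syllables, optional final marks


def word_type(word):
    """Coarse type of a whitespace-delimited word (eojeol)."""
    if JAMO_RUN.fullmatch(word):
        return "jamo_emoticon"
    if ASCII_EMOTICON.fullmatch(word):
        return "ascii_emoticon"
    if HANGUL_WORD.fullmatch(word):
        return "hangul"
    return "other"


post = "오늘 너무 좋았어요 ㅎㅎ 내일 봐요 :)"
print([(w, word_type(w)) for w in post.split()])
```

In a post like this one, the emoticon types mark likely sentence ends even though no punctuation is present, which is the kind of signal such a feature is meant to give the classifier.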

Search Result 169, Processing Time 0.034 seconds
