• Title/Summary/Keyword: Chunking


Resolving Prepositional Phrase Attachment Using a Maximum Entropy Boosting Model (최대 엔트로피 부스팅 모델을 이용한 전치사 접속 모호성 해소)

  • 박성배;장병탁
    • Proceedings of the Korean Information Science Society Conference / 2002.10d / pp.670-672 / 2002
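  • Park and Zhang proposed the maximum entropy boosting model to address several problems that can arise when applying maximum entropy models to real-world natural language processing, and applied it successfully to text chunking. The maximum entropy boosting model has the advantages of easy modeling and high performance. In this paper, we apply the maximum entropy boosting model to English prepositional phrase attachment disambiguation. In experiments on the Wall Street Journal corpus, the model reaches 84.3% with very little effort, which is comparable to the best performance reported to date.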


Expanded Korean Chunking by $k$-NN ($k$-NN으로 확장된 한국어 단위화)

  • 박성배;장병탁;김영택
    • Proceedings of the Korean Information Science Society Conference / 2000.10b / pp.182-184 / 2000
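  • In most natural language processing, chunking is a very basic processing step that precedes syntactic parsing: it divides a sentence into units that are grammatically related to one another. Chunking can therefore reduce the memory and time needed for syntactic or semantic analysis. Rules derived from intuition generally give fairly high chunking performance, but this paper uses k-NN, a machine learning technique, to achieve more accurate chunking. In experiments that trained on 1,273 sentences collected from Internet home pages, extending chunking with k-NN improved accuracy by 2.3% over chunking without the extension.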


Storage System Performance Enhancement Using Duplicated Data Management Scheme (중복 데이터 관리 기법을 통한 저장 시스템 성능 개선)

  • Jung, Ho-Min;Ko, Young-Woong
    • Journal of KIISE:Computer Systems and Theory / v.37 no.1 / pp.8-18 / 2010
  • Traditional storage servers suffer from duplicated data blocks, which waste storage space and network bandwidth. To address this problem, various de-duplication mechanisms have been proposed. In particular, many works are limited to backup servers that exploit Content-Defined Chunking (CDC). In a backup server, duplicated blocks can be traced easily by using anchors, so the CDC scheme is widely used for backup servers. In this paper, we propose a new de-duplication mechanism for improving a storage system. We focus on an efficient algorithm that supports general-purpose de-duplication servers, including backup, P2P, and FTP servers. The key idea is to apply a stride scheme to the traditional fixed-block duplication checking mechanism. Experimental results show that the proposed mechanism can minimize the computation time for detecting duplicated regions of blocks and can manage storage systems efficiently.
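
For orientation, the traditional fixed-block duplication check that this abstract takes as its starting point can be sketched as follows. This is a minimal illustration only: the abstract does not describe the stride scheme, so it is not implemented here, and the function names, the 4096-byte block size, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def dedup_fixed_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns (store, recipe): store maps a digest to the block bytes, and
    recipe is the ordered list of digests needed to rebuild the input.
    Block size and hash function are illustrative assumptions.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of a block
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the block store and recipe."""
    return b"".join(store[digest] for digest in recipe)
```

A fixed-block check of this kind only finds duplicates that are aligned to block boundaries; inserting a single byte near the start of a file shifts every later block and hides the duplication.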

Parallel Rabin Fingerprinting on GPGPU for Efficient Data Deduplication (효율적인 데이터 중복제거를 위한 GPGPU 병렬 라빈 핑거프린팅)

  • Ma, Jeonghyeon;Park, Sejin;Park, Chanik
    • Journal of KIISE / v.41 no.9 / pp.611-616 / 2014
  • Rabin fingerprinting, which is used for chunking, requires the largest amount of computation time in data deduplication. In this paper, we therefore propose parallel Rabin fingerprinting on GPGPU for efficient data deduplication. To parallelize Rabin fingerprinting efficiently, we consider four issues. First, when dividing the input data stream into data sections, we take into account the data located near the boundaries between sections so that the Rabin fingerprint can be calculated continuously. Second, we exploit the characteristics of Rabin fingerprinting for efficient operation. Third, we consider the chunk boundaries, which can differ from those produced by sequential Rabin fingerprinting when parallel Rabin fingerprinting is applied. Finally, we optimize GPGPU memory access. Parallel Rabin fingerprinting on GPGPU performs 16 times better than sequential Rabin fingerprinting on CPU and 5.3 times better than parallel Rabin fingerprinting on CPU. This throughput improvement in Rabin fingerprinting can improve the overall performance of data deduplication.
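
The way a sliding-window fingerprint yields chunk boundaries can be sketched sequentially as below. This is not the paper's GPGPU method and not a true Rabin fingerprint over GF(2): it substitutes a simple polynomial rolling hash (Rabin-Karp style), and `WINDOW`, `BASE`, `MOD`, `MASK`, and the function names are assumed values chosen for illustration.

```python
WINDOW = 16          # bytes covered by the sliding window (assumed)
BASE = 257           # polynomial base (assumed)
MOD = (1 << 31) - 1  # modulus for the rolling hash (assumed)
MASK = (1 << 6) - 1  # declare a boundary when the low 6 hash bits are zero


def chunk_boundaries(data: bytes) -> list[int]:
    """Return the end offset of every chunk; the last entry is len(data)."""
    boundaries = []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so h always covers the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Only test full windows, so a boundary depends on content alone.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries


def split_chunks(data: bytes) -> list[bytes]:
    """Cut data at the content-defined boundaries."""
    out, start = [], 0
    for end in chunk_boundaries(data):
        out.append(data[start:end])
        start = end
    return out
```

Because each boundary decision depends only on the last `WINDOW` bytes, a worker that starts in the middle of the stream needs the `WINDOW - 1` bytes before its section to reproduce the sequential result, which is one way to read the abstract's remark about data near section boundaries.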

Near Realtime Packet Classification & Handling Mechanism for Visualized Security Management in Cloud Environments (클라우드 환경에서 보안 가시성 확보를 위한 자동화된 패킷 분류 및 처리기법)

  • Ahn, Myong-ho;Ryoo, Mi-hyeon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2014.10a / pp.331-337 / 2014
  • The paradigm shift to cloud computing has increased the importance of security. Although public cloud providers such as Amazon already offer security-related services such as firewalls and identity management, these are not well suited to protecting data in cloud environments, because public cloud environments do not allow clients to use their own security solutions or equipment. In such environments, users have to enhance security themselves, so the need for visualized security management arises. To implement visualized security management, developing near-realtime data handling and packet classification mechanisms is crucial. The key technical challenge in packet classification is how to classify packets in an unsupervised manner, without human interaction. To achieve this goal, this paper presents an automated packet classification mechanism based on naive Bayes and packet chunking techniques, which can identify signatures and learn by itself without human intervention.


File Deduplication using Logical Partition of Storage System (저장 시스템의 논리 파티션을 이용한 파일 중복 제거)

  • Kong, Jin-San;Yoo, Chuck;Ko, Young-Woong
    • IEMEK Journal of Embedded Systems and Applications / v.7 no.6 / pp.345-351 / 2012
  • In a traditional target-based data deduplication system, all files must be chunked and compared in order to reduce duplicated data blocks. A critical problem with this approach arises as the number of files increases: the system suffers from computational delay in calculating hash values and processing metadata for each file. To overcome this problem, we propose a novel data deduplication system that uses the logical partitions of a storage system. The system applies the deduplication scheme to each logical partition rather than to each file. Experimental results show that, in terms of deduplication capacity and processing time, the proposed system is more efficient than a traditional deduplication scheme where the logical partition is full of files by 50%.

Korean BaseNP Chunking Using Head-word of Word Phrase (어절의 중심어 정보를 이용한 한국어 기반 명사구 인식)

  • Seo, Chung-Won;Oh, Jong-Hoon;Choi, Key-Sun
    • Annual Conference on Human and Language Technology / 2003.10d / pp.145-151 / 2003
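  • A base noun phrase (baseNP) is defined as a noun phrase that does not contain another noun phrase inside it. BaseNP chunking has been widely used as a way to improve parsing performance. Effective baseNP chunking depends on choosing the right learning features and setting an appropriate context range. From this perspective, previous studies recognized baseNPs using various learning features and context ranges. However, those studies used only simple lexical, part-of-speech, and word-spacing information as learning features, and therefore drew on only a narrow range of context. This paper proposes an HMM model that uses the head word of each word phrase (eojeol) as a learning feature for Korean baseNP chunking. The proposed method achieves a precision of 94.3% and a recall of 93.2%.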


Linguistic Modeling for Multilingual Machine Translation based on Common Transfer (공통변환 기반 다국어 자동번역을 위한 언어학적 모델링)

  • Choi, Sungkwon;Kim, Younggil
    • Language and Information / v.18 no.1 / pp.77-97 / 2014
  • Multilingual machine translation is machine translation that covers more than two languages. Common transfer is a form of transfer in which transfer rules can be reused among languages that are similar according to linguistic typology. Multilingual machine translation based on common transfer is therefore multilingual machine translation that can share transfer rules among languages with similar linguistic typology. This paper describes the linguistic modeling for a multilingual machine translation system based on common transfer that is currently under development. The linguistic modeling consists of linguistic devices such as 1) a multilingual common Part-of-Speech set, 2) a multilingual common transfer format, 3) multilingual common transfer chunking, and 4) multilingual common transfer rules based on linguistic typology. The validity of this linguistic modeling for multilingual machine translation is shown in a simulation. The multilingual machine translation system based on common transfer, covering Korean, English, Chinese, Spanish, and French, will be developed by 2018.


Enhancement of User Understanding and Service Value Using Online Reviews (온라인 리뷰를 활용한 사용자 이해 및 서비스 가치 증대)

  • Kim, Jin-Hwa;Byeon, Hyeon-Su;Lee, Seung-Hun
    • The Journal of Information Systems / v.20 no.2 / pp.21-36 / 2011
  • The Web has become an excellent source for gathering consumer opinions. Numerous Web sites now contain such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. This paper focuses on online customer reviews of products. Its main contribution is a minimalism and chunking framework for analyzing and comparing consumer opinions of competing products. Users can clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. This comparison is useful to both potential customers and product manufacturers. For a product manufacturer, the comparison makes it easy to gather marketing intelligence and product benchmarking information. In this paper, we focus only on mining the opinion/product features that reviewers have commented on. Five types of online review presentation are introduced to mine such features. Our experimental results show that these techniques are useful for identifying customers' opinions and trends.

Scalable Path Testing Method Adopting Slicing and Chunking Concept (슬라이싱과 청킹 개념을 도입한 확장 가능한 경로 테스팅 방안)

  • Choi, Eun-Man;Choi, Hee-Sung
    • Proceedings of the Korean Information Science Society Conference / 2012.06b / pp.164-166 / 2012
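  • Path testing, the method most commonly used for white-box testing, is strongly affected by the size of the program under test. To overcome this drawback, this paper introduces the concepts of slicing and chunking. Chunking turns chunks of logic flow into frames so that they can be expanded or abstracted as needed. Slicing extracts a subset of the program's behavior, which reduces complexity and focuses attention on specific variables. By adopting these two concepts, we propose a scalable path testing method and show that it can make white-box testing more practical.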