Odysseus/Parallel-OOSQL: A Parallel Search Engine using the Odysseus DBMS Tightly-Coupled with IR Capability

Ryu, Jae-Joon; Whang, Kyu-Young; Lee, Jae-Gil; Kwon, Hyuk-Yoon; Kim, Yi-Reun; Heo, Jun-Suk; Lee, Ki-Hoon

Journal of KIISE: Computing Practices and Letters

Vol. 14, No. 4 / pp. 412-429 / 2008 / pISSN 1229-7712

Korean Institute of Information Scientists and Engineers


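Ryu, Jae-Joon (Department of Computer Science, KAIST) ;
Whang, Kyu-Young (Department of Computer Science, KAIST) ;
Lee, Jae-Gil (Department of Computer Science, KAIST) ;
Kwon, Hyuk-Yoon (Department of Computer Science, KAIST) ;
Kim, Yi-Reun (Department of Computer Science, KAIST) ;
Heo, Jun-Suk (Department of Computer Science, KAIST) ;
Lee, Ki-Hoon (Department of Computer Science, KAIST)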

Published: 2008.06.15


Abstract

As the number of electronic documents grows rapidly with the Internet, parallel search engines that can search large document collections quickly are becoming ever more important. Implementing a parallel search engine requires partitioning the inverted index and searching the partitions in parallel. There are two existing methods of partitioning the inverted index: 1) document-identifier based partitioning and 2) keyword-identifier based partitioning. Each method alone has drawbacks. The former makes inserting documents easy and has high throughput, but performs poorly for top-k query processing. The latter performs well for top-k query processing, but makes inserting documents difficult and has low throughput. In this paper, we propose a hybrid partitioning method to compensate for the drawbacks of each method. We design and implement a parallel search engine that supports the hybrid partitioning method using Odysseus, a DBMS tightly coupled with information retrieval capability. We first introduce the architecture of the proposed parallel search engine, Odysseus/Parallel-OOSQL. We then show the effectiveness of the proposed system through systematic experiments. The experimental results show that the query processing time of the document-identifier based partitioning method is approximately inversely proportional to the number of blocks in the partition of the inverted index. The results also show that the keyword-identifier based partitioning method performs well in top-k query processing. Because the proposed parallel search engine supports all three partitioning methods, its performance can be tuned by choosing the partitioning method that suits the application environment. The Odysseus/Parallel-OOSQL parallel search engine is capable of indexing, storing, and querying 100 million web documents per slave node, or tens of billions of web documents for the entire system.
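The two existing partitioning methods named in the abstract can be illustrated with a toy sketch. This is not the Odysseus/Parallel-OOSQL implementation or its API: all function names are hypothetical, ranking by raw term frequency is an assumption made only to have something to rank, and the hybrid method is left out because the abstract does not describe how it works.

```python
from collections import defaultdict
import heapq


def build_index(docs):
    """Map each keyword to a posting list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = defaultdict(int)
        for word in docs[doc_id].split():
            counts[word] += 1
        for word, tf in counts.items():
            index[word].append((doc_id, tf))
    return dict(index)


def keyword_node(word, n_nodes):
    """Deterministic toy assignment of a keyword to a node (assumption)."""
    return sum(word.encode("utf-8")) % n_nodes


def partition_by_document(docs, n_nodes):
    """Document-identifier partitioning: each node indexes its own documents."""
    shards = [{} for _ in range(n_nodes)]
    for doc_id, text in docs.items():
        shards[doc_id % n_nodes][doc_id] = text
    return [build_index(shard) for shard in shards]


def partition_by_keyword(docs, n_nodes):
    """Keyword-identifier partitioning: each node holds whole posting lists."""
    nodes = [{} for _ in range(n_nodes)]
    for word, postings in build_index(docs).items():
        nodes[keyword_node(word, n_nodes)][word] = postings
    return nodes


def top_k(postings, k):
    """Highest term frequency first; ties broken by smaller doc_id."""
    return heapq.nlargest(k, postings, key=lambda p: (p[1], -p[0]))


def query_document_partitioned(nodes, word, k):
    """Every node computes a local top-k; the results are merged afterwards."""
    candidates = []
    for index in nodes:
        candidates.extend(top_k(index.get(word, []), k))
    return top_k(candidates, k)


def query_keyword_partitioned(nodes, word, k):
    """Only the node that owns the keyword is consulted."""
    index = nodes[keyword_node(word, len(nodes))]
    return top_k(index.get(word, []), k)


docs = {
    0: "parallel search engine",
    1: "parallel parallel index",
    2: "search index index index",
    3: "parallel engine parallel parallel",
}
by_doc = partition_by_document(docs, 2)
by_keyword = partition_by_keyword(docs, 2)
print(query_document_partitioned(by_doc, "parallel", 2))
print(query_keyword_partitioned(by_keyword, "parallel", 2))
```

In this sketch a single-keyword top-k query touches every node under document-identifier partitioning but only one node under keyword-identifier partitioning. Inserting a document is the reverse: it updates one node's index in the first scheme and potentially many nodes in the second. This is one plausible reading of the trade-off the abstract describes.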


