• Title/Summary/Keyword: Text Retrieval


Embedded-type Search Function with Feedback for Smartphone Applications (스마트폰 애플리케이션을 위한 임베디드형 피드백 지원 검색체)

  • Kang, Moonjoong;Hwang, Mintae
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.5 / pp.974-983 / 2017
  • This paper presents a search function that can be embedded in Android-based applications. It uses BM25 to suppress insignificant, overly frequent words such as postpositions, pivoted length normalization to resolve the ranking bias caused by differences in item length, and Rocchio's method to support implicit feedback by pulling items inferred to be related to the query closer to the query vector in the vector space model. Indexing is offered in two modes: a simple index that supports offline operation and a complex index for online operation. The implementation also provides query inference, which guesses the user's upcoming input by collating the current input against the indexed data and can thereby handle and correct input errors. The implementation can therefore be readily adopted by smartphone applications to improve their search functions.
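The two ranking ingredients named in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `k1`, `b`, `alpha`, and `beta` defaults, and the IDF variant with `+ 1.0` inside the logarithm are all choices made here. Pivoted length normalization is not shown separately; BM25's `b` term plays a comparable length-correcting role.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            # Terms that occur in most documents get a small IDF.
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            # The denominator dampens term frequency and penalizes long documents.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores


def rocchio(query_vec, relevant, alpha=1.0, beta=0.75):
    """Move the query vector toward the centroid of documents judged relevant."""
    new = {t: alpha * w for t, w in query_vec.items()}
    for doc_vec in relevant:
        for t, w in doc_vec.items():
            new[t] = new.get(t, 0.0) + beta * w / len(relevant)
    return new
```

With implicit feedback, the `relevant` list would hold the vectors of items the user opened after searching; the updated query then ranks similar items higher on the next search.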

Phonetic Similarity Measure for the Korean Transliterations of Foreign Words (외국어 음차 표기의 음성적 유사도 비교 알고리즘)

  • Gang, Byeong-Ju;Lee, Jae-Seong;Choe, Gi-Seon
    • Journal of KIISE:Software and Applications / v.26 no.10 / pp.1237-1246 / 1999
  • With the advent of digital communication technologies, exchanges with foreign countries are increasing in every field, and Korean documents contain more foreign word transliterations than ever before. Transliterations of the same foreign word vary widely among individuals, which makes text retrieval over documents containing them difficult. One solution is to group the transliterations that derive from the same foreign word into equivalence classes at indexing time and to expand queries with those classes at query time. This paper proposes Kodex, a method of measuring the phonetic similarity among foreign word transliterations that is used to build such equivalence classes. Compared with non-phonetic methods based on string similarity, Kodex clusters transliterations into equivalence classes more effectively, giving higher precision at a similar recall level, and it is simple to compute and therefore much more efficient to implement.
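The index-then-expand scheme described in this abstract can be illustrated with a small Python sketch. The phonetic key below is not Kodex, whose rules are not given here; it is a hypothetical Soundex-style stand-in that decomposes each Hangul syllable into its consonants, drops the vowels, and merges plain, tense, and aspirated consonants into one class. The names `phonetic_key`, `build_classes`, and `expand` are assumptions for illustration.

```python
from collections import defaultdict

# Jamo order follows the Unicode layout of precomposed Hangul syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Hypothetical consonant classes; compound finals are ignored in this sketch.
CLASS = {
    "ㄱ": "K", "ㄲ": "K", "ㅋ": "K",
    "ㄷ": "T", "ㄸ": "T", "ㅌ": "T",
    "ㅂ": "P", "ㅃ": "P", "ㅍ": "P",
    "ㅈ": "C", "ㅉ": "C", "ㅊ": "C",
    "ㅅ": "S", "ㅆ": "S",
    "ㄴ": "N", "ㅁ": "M", "ㄹ": "L", "ㅎ": "H",
}


def phonetic_key(word):
    """Return a consonant-class key for a Hangul word, ignoring vowels."""
    codes = []
    for ch in word:
        idx = ord(ch) - 0xAC00
        if not 0 <= idx < 11172:
            continue  # skip anything that is not a precomposed Hangul syllable
        initial = INITIALS[idx // 588]
        final = FINALS[idx % 28]
        codes.append(CLASS.get(initial, ""))  # initial ㅇ is silent
        codes.append("G" if final == "ㅇ" else CLASS.get(final, ""))
    return "".join(codes)


def build_classes(words):
    """Indexing step: group words that share a phonetic key."""
    classes = defaultdict(set)
    for w in words:
        classes[phonetic_key(w)].add(w)
    return classes


def expand(term, classes):
    """Query step: replace a term with every member of its equivalence class."""
    return sorted(classes.get(phonetic_key(term), {term}))
```

For instance, 데이터 and 데이타 (both "data") share a key under these classes, so a query for either spelling would be expanded to both.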

A Study on the CD ROM Network(LAN) (CD-ROM 네트워크(LAN)에 관한 소고(小考))

  • Kil, Hyung-Do
    • Journal of Information Management / v.21 no.2 / pp.9-23 / 1990
  • Less than ten years after its development, CD-ROM technology has grown rapidly and has been put to use in a number of practical applications. Bibliographic data as well as numeric, sound, image, and picture data are recorded as abstracts or full text and supplied to industry and to information services, including libraries, where they are used by library staff and for information retrieval. A single disc no longer requires its own drive and computer: a system can now be organized in which several CD-ROMs are searched and several users access several information sources at once, so networking through a LAN is possible. This article examines the functions, types, characteristics, system, structure, data blocks, production procedure, and standardization of CD-ROM LANs.


Principal Components Self-Organizing Map PC-SOM (주성분 자기조직화 지도 PC-SOM)

  • 허명회
    • The Korean Journal of Applied Statistics / v.16 no.2 / pp.321-333 / 2003
  • The self-organizing map (SOM), an unsupervised learning neural network, has been developed by T. Kohonen since the 1980s. Its main application areas have been pattern recognition and text retrieval, and for that reason it did not spread among statisticians until recently. SOMs are now frequently used in data mining. Kohonen's SOM, however, needs improvements to become a standard tool for statisticians. First, there should be a good guideline for the size of the map. Second, an enhanced visualization mode is wanted. In this study, the principal components self-organizing map (PC-SOM), a modification of Kohonen's SOM, is proposed to meet these needs. In the first stage, PC-SOM performs a one-dimensional SOM to decompose input units into node weights and residuals. In the second stage, another one-dimensional SOM is applied to the residuals of the first stage. Finally, putting the two stages together yields a two-dimensional SOM. The procedure can easily be extended to construct maps of three or more dimensions. The number of grid lines along the second axis is determined automatically once that of the first axis is given by the data analyst. Furthermore, PC-SOM provides easily interpretable map axes. These merits of PC-SOM are demonstrated with the well-known Fisher's iris data and a simulated data set.
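The two-stage construction described above can be sketched in plain Python. This is only an illustration of the idea as the abstract states it (a one-dimensional SOM, then a second one-dimensional SOM on the residuals), not the author's algorithm: the training schedule, the function names, and the fact that the second axis size `n2` is passed in by hand rather than determined automatically are all assumptions.

```python
import math
import random


def bmu(weights, x):
    """Index of the best-matching unit (the node closest to x)."""
    return min(range(len(weights)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(weights[j], x)))


def som_1d(data, n_nodes, epochs=50, lr0=0.5, seed=0):
    """Train a one-dimensional SOM and return its node weight vectors."""
    rng = random.Random(seed)
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    sigma0 = max(n_nodes / 2.0, 1.0)
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays from 1 toward 0
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 0.5)
        for x in rng.sample(data, len(data)):
            c = bmu(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighborhood along the one-dimensional chain of nodes.
                h = math.exp(-((j - c) ** 2) / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


def pc_som(data, n1, n2):
    """Return a (first-axis, second-axis) grid coordinate for each input."""
    w1 = som_1d(data, n1)
    first = [bmu(w1, x) for x in data]
    # Residual: what the first-stage node weight does not explain.
    residuals = [tuple(a - b for a, b in zip(x, w1[i]))
                 for x, i in zip(data, first)]
    w2 = som_1d(residuals, n2, seed=1)
    second = [bmu(w2, r) for r in residuals]
    return list(zip(first, second))
```

Calling `pc_som(data, 3, 2)` on a list of numeric tuples places every observation on a 3 by 2 grid, with the first coordinate following the dominant structure in the data and the second following what remains.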

Latent Keyphrase Extraction Using LDA Model (LDA 모델을 이용한 잠재 키워드 추출)

  • Cho, Taemin;Lee, Jee-Hyong
    • Journal of the Korean Institute of Intelligent Systems / v.25 no.2 / pp.180-185 / 2015
  • As the number of document resources continues to grow, automatically extracting keyphrases from a document has become one of the main research issues. Most previous work, however, has tried to extract keyphrases from the words in a document and has therefore overlooked latent keyphrases, which do not appear in the document. Although latent keyphrases do not appear in documents, they can play an important role in text summarization and information retrieval because they imply meaningful concepts or contents of documents. They also account for more than one fourth of all keyphrases in real-world datasets, and they can be used for short texts such as SNS posts, which rarely contain explicit keyphrases. This paper proposes a new approach that selects candidate keyphrases from the keyphrases of neighbor documents similar to the given document and evaluates the importance of each candidate from the individual words it contains. Experimental results show that latent keyphrases can be extracted at a reasonable level.
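The candidate-selection idea in this abstract can be sketched as follows. The paper uses an LDA model; this sketch swaps in plain bag-of-words cosine similarity to find neighbor documents, so it shows only the overall flow (find neighbors, borrow their keyphrases, score each candidate by its individual words). The function names and the scoring formula are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def latent_keyphrases(doc, corpus, k=2, top=3):
    """Suggest keyphrases that do not appear in doc, taken from its neighbors.

    corpus is a list of (text, keyphrases) pairs for already-labeled documents.
    """
    doc_lower = doc.lower()
    doc_tf = Counter(doc_lower.split())
    neighbors = sorted(
        corpus,
        key=lambda e: cosine(doc_tf, Counter(e[0].lower().split())),
        reverse=True,
    )[:k]
    scores = {}
    for _text, phrases in neighbors:
        for phrase in phrases:
            if phrase.lower() in doc_lower:
                continue  # the phrase is explicit in doc, so it is not latent
            words = phrase.lower().split()
            # Score a candidate by how often its individual words occur in doc.
            scores[phrase] = sum(doc_tf[w] for w in words) / len(words)
    return sorted(scores, key=lambda p: (-scores[p], p))[:top]
```

A document about training a neural network would, under this sketch, inherit "deep learning" from a similar labeled document even if that exact phrase never occurs in its text.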

The Application of Geography Markup Language(GML) to the Maritime Information

  • Oh, Se-Woong;Park, Jong-Min;Suh, Sang-Hyun
    • Proceedings of the Korean Institute of Navigation and Port Research Conference / v.1 / pp.519-524 / 2006
  • This paper describes an application of map-based information presentation to maritime information, including navigation information. The work is motivated by the need to prepare the representation and distribution of maritime information for future-generation Web technology. It consists of map generation using GML and the application of that map to maritime information. GML 3.0 became an adopted specification of the Open Geospatial Consortium (OGC) in January 2003 and is rapidly emerging as the world standard for the encoding, transport, and storage of all forms of geographic information. This paper looks at the application of GML to one of the more challenging areas of maritime information. Specific features of GML of interest to maritime information providers are discussed and then illustrated through a series of maritime information case studies. The first phase of the work consists of constructing a GML application schema to serve as a base map for maritime information. Maritime information is acquired from multiple sources, including standards documents, database schemas, lexicons, and collections of symbol definitions. The sources of GML ontological knowledge and the contribution of each source to the overall ontology are described. In the second phase, the prepared GML is used to create a prototype of mixed maritime information on a base map, for tagging documents within the maritime domain. An overview of this prototype is included. One application area for the information elements described here is the integrated retrieval of maritime information from diverse sources, ranging from Web sites to nautical chart databases and text documents.
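To make the idea of encoding a maritime feature in GML concrete, the following Python sketch builds one point feature with the standard library. The `gml:Point`, `gml:pos`, and `gml:name` elements and the GML 3 namespace are part of GML; the `mar` namespace, the `Lighthouse` feature type, its `position` property, and the latitude-first axis order are hypothetical choices for illustration and do not come from the paper's application schema.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"
APP_NS = "http://example.org/maritime"  # hypothetical application schema namespace


def lighthouse_feature(name, lat, lon):
    """Serialize a hypothetical Lighthouse feature with a GML point geometry."""
    ET.register_namespace("gml", GML_NS)
    ET.register_namespace("mar", APP_NS)
    feature = ET.Element(f"{{{APP_NS}}}Lighthouse")
    ET.SubElement(feature, f"{{{GML_NS}}}name").text = name
    position = ET.SubElement(feature, f"{{{APP_NS}}}position")
    point = ET.SubElement(position, f"{{{GML_NS}}}Point", srsName="EPSG:4326")
    # Coordinates are written latitude first here, which is an assumption.
    ET.SubElement(point, f"{{{GML_NS}}}pos").text = f"{lat} {lon}"
    return ET.tostring(feature, encoding="unicode")
```

Because the geometry lives in the shared `gml` namespace while the feature type lives in the application namespace, generic GML tools can read the position without knowing anything about lighthouses.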


A New Similarity Measure for Improving Ranking in QA Systems (질의응답시스템 응답순위 개선을 위한 새로운 유사도 계산방법)

  • Kim Myung-Gwan;Park Young-Tack
    • Journal of KIISE:Computing Practices and Letters / v.10 no.6 / pp.529-536 / 2004
  • The main idea of this paper is to combine in-sentence position information with query type classification to improve the ranking of documents for a query. First, the use of conceptual graphs for representing document contents in information retrieval is discussed. The method is based on well-known text comparison strategies, such as the Dice coefficient, with position-based term weighting. Second, a method for learning query type classification is introduced that improves the ability of a question answering system to retrieve answers to questions. The proposed methods employ naive Bayes classification from the field of machine learning. A collection of approximately 30,000 question-answer pairs, obtained from Frequently Asked Question (FAQ) files on various subjects, was used for training. An evaluation on a set of queries from the international TREC-9 question answering track shows that the method with machine learning outperforms the other systems in TREC-9 (0.29 for mean reciprocal rank and 55.1% for precision).
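A Dice coefficient with position-based term weights can be sketched as below. The abstract does not give the weighting formula, so the `1 / (1 + position)` weight, which favors terms near the start of a sentence, is an assumption chosen only to make the example concrete.

```python
def position_weights(tokens):
    """Weight each distinct term by its first position (earlier is heavier)."""
    weights = {}
    for pos, tok in enumerate(tokens):
        weights.setdefault(tok, 1.0 / (1 + pos))
    return weights


def weighted_dice(a_tokens, b_tokens):
    """Dice coefficient in which shared terms count by their position weight."""
    wa = position_weights(a_tokens)
    wb = position_weights(b_tokens)
    shared = sum(min(wa[t], wb[t]) for t in wa if t in wb)
    total = sum(wa.values()) + sum(wb.values())
    return 2 * shared / total if total else 0.0
```

With uniform weights this reduces to the ordinary Dice coefficient, 2|A ∩ B| / (|A| + |B|); the weights let a match on an early term count for more than a match on a trailing one.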

Multi-view learning review: understanding methods and their application (멀티 뷰 기법 리뷰: 이해와 응용)

  • Bae, Kang Il;Lee, Yung Seop;Lim, Changwon
    • The Korean Journal of Applied Statistics / v.32 no.1 / pp.41-68 / 2019
  • Multi-view learning considers data from various viewpoints and attempts to integrate the various information in the data. It has been studied actively in recent years and has shown superior performance to models learned from only a single view. With the introduction of deep learning techniques into multi-view learning, it has shown good results in various fields such as image, text, voice, and video. In this study, we introduce how multi-view learning methods solve various problems faced in human behavior recognition, medical areas, information retrieval, and facial expression recognition. In addition, we review the data integration principles of multi-view learning by classifying traditional multi-view learning methods into data integration, classifier integration, and representation integration. Finally, we examine how CNN, RNN, RBM, autoencoder, and GAN, which are commonly used among deep learning methods, are applied to multi-view learning algorithms. We categorize CNN- and RNN-based learning methods as supervised learning, and RBM-, autoencoder-, and GAN-based learning methods as unsupervised learning.

Using Roots and Patterns to Detect Arabic Verbs without Affixes Removal

  • Abdulmonem Ahmed;Aybaba Hancrliogullari;Ali Riza Tosun
    • International Journal of Computer Science & Network Security / v.23 no.4 / pp.1-6 / 2023
  • Morphological analysis, a branch of natural language processing, is now a rapidly growing field. Its fundamental tenet is that it can establish the roots or stems of words and enable comparison with the original term. Arabic is a highly inflected and derivational language with a strong structure. Because of the non-concatenative nature of Arabic morphology, each root or stem can have a large number of affixes attached to it, which increases the number of possible inflected words. Accurate verb recognition and extraction are necessary for nearly all problems in well-known research areas, including Web search, information retrieval, machine translation, and question answering. In this work we designed and implemented an algorithm to detect and recognize Arabic verbs in Arabic text. The suggested technique was created with Python and the pyqt5 visual package, allowing quick modification and easy addition of new patterns. We employed 17 alternative patterns to represent all verbs in terms of singular, plural, masculine, and feminine pronouns as well as past, present, and imperative verb tenses. All of the verbs that matched these patterns were used when a verb has a root, and the outcomes were reliable. The approach is able to recognize all verbs with the same structure without requiring any alterations to the code or design. The verbs that are not recognized by our method have no antecedents in the Arabic roots. According to our work, the strategy can rapidly and precisely identify verbs with roots, but it cannot be used to identify verbs that are not in the Arabic language. We therefore advise employing a hybrid approach that combines several principles.
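Root-and-pattern matching without affix stripping can be illustrated with a short Python sketch. It follows the traditional convention of writing patterns with the letters ف ع ل as placeholders for the three root consonants. The five patterns and two roots below are a tiny hypothetical sample, not the paper's 17 patterns or its root list, and the input is assumed to be undiacritized.

```python
SLOTS = "فعل"  # placeholder letters that stand for the three root consonants

# Hypothetical sample: past, past feminine, past plural, present, imperative.
PATTERNS = ["فعل", "فعلت", "فعلوا", "يفعل", "افعل"]
ROOTS = {"كتب", "درس"}


def match_pattern(word, pattern):
    """Return the root letters if word fits pattern, otherwise None."""
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in SLOTS:
            root.append(w)   # a slot captures one root consonant
        elif w != p:
            return None      # a literal affix letter must match exactly
    return "".join(root)


def detect_verb(word):
    """Return (root, pattern) if word matches a pattern with a known root."""
    for pattern in PATTERNS:
        root = match_pattern(word, pattern)
        if root in ROOTS:
            return root, pattern
    return None
```

The word is never modified: the pattern itself says which positions are affix letters and which are root letters, so adding coverage means adding a pattern string rather than changing code.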

A Study on Constructing a Digital Archive System of the Modern Korean Christian Collections (근대 한국기독교 자료의 디지털 아카이브 시스템 구축에 관한 연구)

  • Yang, Ji-Ann
    • The Journal of the Korea Contents Association / v.22 no.8 / pp.681-691 / 2022
  • The purpose of this study is to construct a digital archive system by analyzing the collections of the Korean Christian Museum at S University, which holds a large number of materials related to Korean Christianity published in the modern period, from Korea's enlightenment until liberation. To construct the system, indexes and metadata for the collection are compiled according to a pre-defined format. After the selected collection is digitized, a database is built from the metadata, and the actual system is divided into a web standard-based management system and a user service system. A content-based search system is also constructed, which reports the matching value of retrieval results in units of one character and offers automatic search term completion to enhance user convenience. Collections in the museum whose original texts are difficult to access are thus digitized and made easy to use, laying the foundation for the long-term development of humanities content and improving the accessibility and availability of the collections for both researchers and the public.
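The two search conveniences mentioned in this abstract, search term completion and a per-character matching value, could look roughly like the following Python sketch. The abstract does not say how either is computed, so both functions are assumptions: completion is a prefix lookup over a sorted term list, and the matching value is the share of query characters found in the result text.

```python
from bisect import bisect_left
from collections import Counter


def autocomplete(prefix, sorted_terms, limit=5):
    """Return up to limit indexed terms that start with prefix."""
    start = bisect_left(sorted_terms, prefix)
    out = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(out) == limit:
            break
        out.append(term)
    return out


def char_match(query, text):
    """Fraction of query characters (spaces ignored) that also occur in text."""
    q = Counter(query.replace(" ", ""))
    overlap = sum((q & Counter(text)).values())
    total = sum(q.values())
    return overlap / total if total else 0.0
```

Keeping the term list sorted lets the prefix lookup start with a binary search, so completion stays fast even for a large index.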