• Title/Summary/Keyword: automatic indexing

Statistical Techniques for Automatic Indexing and Some Experiments with Korean Documents (자동색인의 통계적기법과 한국어 문헌의 실험)

  • Chung Young Mee;Lee Tae Young
    • Journal of the Korean Society for Library and Information Science / v.9 / pp.99-118 / 1982
  • This paper first reviews techniques proposed for automatic indexing, with special emphasis on statistical techniques. Frequency-based statistical techniques are grouped, by index term selection criterion, into three approaches for further investigation: the term frequency approach, the document frequency approach, and the probabilistic approach. In the experimental part of the study, two techniques were tested: Pao's technique, based on Goffman's transition region formula, and Harter's 2-Poisson distribution model with a measure of the potential effectiveness of an index term. The experimental collection consists of 30 agriculture-related documents written in Korean. Pao's technique did not yield good results, presumably because of differences in word usage between Korean and English. Harter's model, however, holds some promise for Korean document indexing, because the evaluation results from this experiment were similar to those Harter reported.
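
The transition-region idea mentioned in this abstract can be illustrated with a small sketch. It assumes the form of Goffman's transition point as it is commonly stated in connection with Pao's work, T = (-1 + sqrt(1 + 8 * I1)) / 2, where I1 is the number of distinct words that occur exactly once. The function names and the `width` parameter (how far a frequency may lie from T) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def goffman_transition_point(tokens):
    """Return (T, frequencies) for a tokenized document.

    T = (-1 + sqrt(1 + 8 * I1)) / 2, with I1 the number of distinct
    words occurring exactly once (assumed form of the formula).
    """
    freq = Counter(tokens)
    i1 = sum(1 for count in freq.values() if count == 1)
    t = (-1 + math.sqrt(1 + 8 * i1)) / 2
    return t, freq


def transition_region_terms(tokens, width=1):
    """Candidate index terms: words whose frequency lies within `width` of T.

    The symmetric window is an illustrative assumption, not the paper's rule.
    """
    t, freq = goffman_transition_point(tokens)
    return sorted(word for word, count in freq.items() if abs(count - t) <= width)
```

For example, a document with three words that occur once gives I1 = 3 and therefore T = 2, so with `width=0` only words that occur exactly twice are selected.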

A PROPOSAL OF SEMI-AUTOMATIC INDEXING ALGORITHM FOR MULTI-MEDIA DATABASE WITH USERS' SENSIBILITY

  • Mitsuishi, Takashi;Sasaki, Jun;Funyu, Yutaka
    • Proceedings of the Korean Society for Emotion and Sensibility Conference / 2000.04a / pp.120-125 / 2000
  • We propose a semi-automatic, dynamic indexing algorithm for multimedia databases (e.g. movie files, audio files), for which it is difficult to create indexes that express emotional or abstract content. The algorithm adapts to each user's sensitivity by using the user's database access history. We first categorize the data in a simple way, then create a vector space for each user's interests (the user model) from the history of which categories the accessed data belong to, and a vector space for each data item (the title model) from the history of which users accessed it. By repeating this process, suitable indexes that reflect the emotional content of each data item can be created. In this paper, we define the recurrence formulas on which the proposed algorithm is based and show its effectiveness through simulation results.

Issues and Empirical Results for Improving Text Classification

  • Ko, Young-Joong;Seo, Jung-Yun
    • Journal of Computing Science and Engineering / v.5 no.2 / pp.150-160 / 2011
  • Automatic text classification has a long history, and many machine learning algorithms and information retrieval techniques have been applied to it. Despite much technical progress, there is still room for improvement. This paper discusses three remaining issues in improving text classification: automatic training data generation, noisy data treatment, and term weighting and indexing. It introduces four studies and their empirical results for those issues. First, a semi-supervised learning technique is applied to text classification to create training data efficiently. Second, for effective noisy data treatment, a noisy data reduction method and a text classifier that is robust to noisy data are developed. Finally, the term weighting and indexing technique is revised to reflect the importance of sentences in the term weight calculation, using summarization techniques.
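
The last idea in this abstract, letting sentence importance influence term weights, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how sentence importance is scored, so the sketch uses overlap with the document title as a stand-in heuristic, and the function names and the `1 + importance` boost are hypothetical.

```python
from collections import Counter


def sentence_importance(sentence_tokens, title_tokens):
    """Fraction of title words that appear in the sentence (assumed stand-in score)."""
    title = set(title_tokens)
    if not title:
        return 0.0
    return len(title & set(sentence_tokens)) / len(title)


def weighted_term_frequencies(sentences, title_tokens):
    """Term frequency in which each occurrence is scaled by its sentence's importance.

    `sentences` is a list of token lists. An occurrence in a sentence with
    importance s counts as (1 + s) instead of 1 (illustrative choice).
    """
    weights = Counter()
    for tokens in sentences:
        boost = 1.0 + sentence_importance(tokens, title_tokens)
        for term, tf in Counter(tokens).items():
            weights[term] += boost * tf
    return weights
```

With a plain term frequency every occurrence counts equally; here, a term that appears in a sentence sharing all title words counts twice as much as the same term in a sentence sharing none.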

Semantic Indexing for Soccer Videos Using Web-Extracted Information (웹에서 축출된 정보를 이용한 축구 경기의 시맨틱 인덱싱)

  • Hirata, Issao;Kim, Myeong-Hoon;Sull, Sang-Hoon
    • Proceedings of the Korean Information Science Society Conference / 2007.10c / pp.41-45 / 2007
  • The rapid growth of video content production makes it necessary to develop more complex indexing systems that allow efficient searching, retrieval, and presentation of desired video segments. This paper presents a method for indexing soccer video through automatic extraction of information from the internet. The paper defines a metadata structure to formally represent knowledge about soccer matches and provides an automatic method for extracting semantic information from websites. This approach makes it possible to extract more reliable and richer semantic information for soccer videos. Experimental results demonstrate that the proposed method performs efficiently.

A Study on Christian Website Indexing (기독교 관련 웹 사이트 내 색인에 관한 연구)

  • Yoo, Yeong-Jun
    • Journal of Korean Library and Information Science Society / v.38 no.4 / pp.257-276 / 2007
  • Back-of-book-style indexes serve a function similar to that of back-of-book indexes. Their greatest advantage for information access on the web is that they give direct access to specific subjects of interest. Like back-of-book indexes, back-of-book-style indexes are arranged alphabetically, but their entries are linked to content on the site through HTML anchor tags. In this research, back-of-book-style indexes were created in two separate ways: by hand and by semi-automatic indexing. These indexes, which resemble the back-of-book index of traditional information organization in library and information science, were then applied in a library setting.

Retrieving Information from Korean OCR Text Database (문자 인식에 의해 구축된 한글 문서 데이터베이스에 대한 정보 검색)

  • Lee, Jun-Ho;Lee, Chung-Sik;Han, Seon-Hwa;Kim, Jin-Hyeong
    • The Transactions of the Korea Information Processing Society / v.6 no.4 / pp.833-841 / 1999
  • Texts produced by Optical Character Recognition (OCR) contain more errors than texts entered by keyboard. To retrieve useful information from OCR texts, an effective automatic indexing method is therefore needed. In this paper, we investigate automatic indexing methods that can retrieve information effectively from a Korean OCR text database with a character-level recognition rate of 90%. Experimental results show that 2-gram indexing provides retrieval effectiveness similar to that of morpheme-based indexing for the Korean OCR text database.
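
A minimal sketch of character 2-gram indexing shows why it can tolerate OCR errors: a single misrecognized character damages only the 2-grams that contain it, so the remaining 2-grams still match. The sketch assumes overlapping 2-grams taken within whitespace-delimited tokens and a simple count of shared 2-grams as the ranking score; the paper's actual indexing and ranking details are not given in the abstract, and all names below are hypothetical.

```python
from collections import defaultdict


def char_bigrams(text):
    """Overlapping character 2-grams for each whitespace-delimited token."""
    grams = []
    for token in text.split():
        if len(token) == 1:
            grams.append(token)  # keep single-character tokens as they are
        else:
            grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def build_index(docs):
    """Inverted index mapping each 2-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of distinct query 2-grams they contain."""
    scores = defaultdict(int)
    for gram in set(char_bigrams(query)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For instance, if a document contains 정보검색 and the query is the misrecognized form 정보검섹, the 2-grams 정보 and 보검 still match even though 검섹 does not.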

A Study of Designing the Han-Guel Thesaurus Browser for Automatic Information Retrieval (자동정보검색을 위한 한글 시소러스 브라우저 구축에 관한 연구)

  • Seo, Whee
    • Journal of Korean Library and Information Science Society / v.31 no.2 / pp.279-302 / 2000
  • This study develops a new automatic system for a Korean thesaurus browser that automatically controls the whole query search process, including the representation, generation, and extension of queries, the construction of search strategies, and feedback searching. The system is programmed in Delphi 4.0 (Pascal) and consists of a database system, automatic indexing, a clustering technique, thesaurus establishment and expression, and an automatic information retrieval technique. The results are as follows: 1) With the automatic thesaurus browser built on the new algorithm, we can perform information retrieval, automatic indexing, clustering, thesaurus establishment and expression, and retrieval feedback. As a result, even a beginner can easily access specialized terms in a specific subject field. 2) The thesaurus browser has merits such as ease of establishment, convenience of use, and good information retrieval results in terms of the rate of speed, degree, and regeneration. Thus, it turns out to be very pragmatic.

A Study of Designing the Intelligent Information Retrieval System by Automatic Classification Algorithm (자동분류 알고리즘을 이용한 지능형 정보검색시스템 구축에 관한 연구)

  • Seo, Whee
    • Journal of Korean Library and Information Science Society / v.39 no.4 / pp.283-304 / 2008
  • This study develops an Intelligent Retrieval System that, through a learning function, automatically presents the category terms of an initial query (association terms connected with the knowledge structure of the relevant terminology), automatically changes the search form, and runs the search with those association terms. To this end, this theoretical study of an Intelligent Automatic Indexing System extracts experts' index terms through learning and clustering algorithms for automatic classification, text mining (categorization), and document category representation. The system also performs well in terms of cost, time, recall ratio, and precision ratio.

An Optimized e-Lecture Video Search and Indexing framework

  • Medida, Lakshmi Haritha;Ramani, Kasarapu
    • International Journal of Computer Science & Network Security / v.21 no.8 / pp.87-96 / 2021
  • The demand for e-learning through video lectures is rapidly increasing because of its diverse advantages over traditional learning methods. This has led to massive volumes of web-based lecture videos, and indexing and retrieving a lecture video or a lecture video topic has proved to be an exceptionally challenging problem. Many techniques reported in the literature were either visual or audio based, but not both. Since the visual and audio components are equally important for content-based indexing and retrieval, the current work addresses both. A framework for automatic topic-based indexing and search that depends on the innate content of the lecture videos is presented. The text from the slides is extracted using the proposed Merged Bounding Box (MBB) text detector. Text is extracted from the audio component using Google Speech Recognition (GSR) technology. This hybrid approach generates the indexing keywords from the merged transcripts of the video and audio component extractors. The search within the indexed documents is optimized using Naïve Bayes (NB) classification and K-Means clustering models. The optimized search retrieves results by searching only the relevant document cluster in the predefined categories rather than the whole lecture video corpus. The work is carried out on a dataset generated by assigning categories to lecture video transcripts gathered from e-learning portals. Search performance is assessed in terms of accuracy and time taken. Further, the accuracy of the proposed indexing technique is compared with that of the accepted chain indexing technique.

Automatic Korean to English Cross Language Keyword Assignment Using MeSH Thesaurus (MeSH 시소러스를 이용한 한영 교차언어 키워드 자동 부여)

  • Lee Jae-Sung;Kim Mi-Suk;Oh Yong-Soon;Lee Young-Sung
    • The KIPS Transactions:PartB / v.13B no.2 s.105 / pp.155-162 / 2006
  • The medical thesaurus MeSH (Medical Subject Headings) has long been used as a controlled vocabulary for indexing English medical papers. In this paper, we propose an automatic cross-language keyword assignment method that assigns English MeSH index terms to the abstract of a Korean medical paper, and we compare its performance with the indexing performance of human indexers and of the authors. The assignment procedure first extracts Korean MeSH terms from the text, then converts them into the corresponding English MeSH terms, and finally calculates the importance of the terms to select the highest-ranked terms as keywords. For this process, an effective method for solving the spacing variant problem is proposed. Experiments showed that the method solved the spacing variant problem and reduced the thesaurus space by about 42%. They also showed that the performance of automatic keyword assignment is much lower than that of human indexers but as good as that of the authors.
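
The three-step procedure described in this abstract (extract Korean terms, map them to English MeSH terms, rank by importance) can be sketched roughly as below. Several parts are assumptions made for illustration only: the tiny `KO_TO_EN` dictionary stands in for a real Korean MeSH thesaurus, spacing variants are handled by simply deleting all whitespace before matching, and importance is approximated by occurrence count. The abstract does not describe the paper's actual spacing method or importance measure.

```python
from collections import Counter

# Hypothetical toy mapping; a real system would load a Korean MeSH thesaurus.
KO_TO_EN = {
    "당뇨병": "Diabetes Mellitus",
    "고혈압": "Hypertension",
    "심근 경색": "Myocardial Infarction",
}


def strip_spaces(text):
    """Remove all whitespace so that spacing variants compare equal (assumed approach)."""
    return "".join(text.split())


def assign_keywords(abstract, ko_to_en, top_k=2):
    """Return up to top_k English terms, ranked by how often their Korean form occurs."""
    text = strip_spaces(abstract)
    counts = Counter()
    for korean, english in ko_to_en.items():
        occurrences = text.count(strip_spaces(korean))
        if occurrences:
            counts[english] += occurrences
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [english for english, _ in ranked[:top_k]]
```

Deleting whitespace lets "심근 경색" in the thesaurus match "심근경색" in the text, but it can also create false matches across word boundaries, so a production system would need a more careful matching step.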