• Title/Summary/Keyword: word Weighting

Performance Improvement of Web Document Classification through Incorporation of Feature Selection and Weighting (특징선택과 특징가중의 융합을 통한 웹문서분류 성능의 개선)

  • Lee, Ah-Ram;Kim, Han-Joon;Man, Xuan
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.13 no.4
    • /
    • pp.141-148
    • /
    • 2013
  • Automated classification systems that use machine learning build classification models through a learning process and then classify unknown data into a predefined set of categories according to the model. The performance of machine learning-based classification systems depends heavily on the quality of the features composing the classification models. For textual data, word terms and structural information can be used to generate the feature set. In particular, extracting features from Web documents requires analyzing tag and hyperlink information. Recent studies on Web document classification focus on feature engineering rather than on the machine learning algorithms themselves. This paper therefore proposes a novel method of incorporating feature selection and weighting that can improve classification models effectively. In extensive experiments using the Web-KB document collections, the proposed method outperforms conventional ones.

A Statistical Word Sense Disambiguation Using Combinations of Syntactic Indicators (구문 지시자를 통합한 통계적 어의애매성 해결)

  • Kim, Kweonyang;Choi, Jaehuk
    • The Journal of Korean Association of Computer Education
    • /
    • v.5 no.2
    • /
    • pp.11-19
    • /
    • 2002
  • In this paper, we present a simple statistical method for word sense disambiguation (WSD), especially for Korean transitive verbs, based on a supervised learning algorithm. This approach combines a set of indicators based on syntactic relations between an ambiguous verb and its surrounding words. Experiments with 10 Korean verbs show that the accuracy of our WSD method using indicators based on syntactic relations is 27% higher than the baseline. Moreover, our method using a weighting mechanism based on each indicator type achieves accuracy 12% higher than a method that uses only an unordered set of surrounding words in the context.

Future and Directions for Research in Full Text Databases (본문 데이타베이스 연구에 관한 고찰과 그 전망)

  • Ro Jung Soon
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.17
    • /
    • pp.49-83
    • /
    • 1989
  • A full text retrieval system is a natural language document retrieval system in which the full text of all documents in a collection is stored on a computer so that every word in every sentence of every document can be located by the machine. This kind of IR system is rapidly becoming available online in the fields of legal, newspaper, journal, and reference book indexing, and research interest in the field has increased. In this paper, research on full text databases and retrieval systems is reviewed, directions for research in this field are considered, questions that need answering are identified, and the variables affecting online full text retrieval, along with the roles those variables play in a research study, are described. Two obvious research questions in full text retrieval have been how full text retrieval performs and how to improve the retrieval performance of full text databases. Research to improve retrieval performance has incorporated ranking or weighting algorithms based on word occurrences, combined menu-driven and query-driven systems, and improvements in computer architectures and record structures for databases. The recent increase in the number of full text databases of various sizes, forms, and subject matters, together with recent developments in computer architecture, artificial intelligence, and videodisc technology, promises new directions for research and scholarly growth. Studies of the interrelationships among all elements of the full text retrieval situation, and of the relationship between each element and retrieval performance, may give a professional view of the theory and practice of full text retrieval.

Design and Implementation of Web Crawler utilizing Unstructured data

  • Tanvir, Ahmed Md.;Chung, Mokdong
    • Journal of Korea Multimedia Society
    • /
    • v.22 no.3
    • /
    • pp.374-385
    • /
    • 2019
  • A Web crawler is a program commonly used by search engines to find new content on the internet. The use of crawlers has made the web easier for users. In this paper, we structuralize unstructured data in order to collect data from web pages. Our system is able to choose the words near our keyword in more than one document in an unstructured way. Neighboring data for the keyword were collected through word2vec. The system is designed to filter data at the acquisition level and to support a large taxonomy. The main problem in text taxonomy is how to improve the classification accuracy. To improve the accuracy, we propose a new TF-IDF weighting method. In this paper, we modified the TF algorithm to calculate the accuracy of unstructured data. Finally, our system proposes a competent web page search crawling algorithm, derived from TF-IDF and the RL Web search algorithm, to enhance the efficiency of searching for relevant information. This paper also examines the nature of crawlers and crawling algorithms in search engines for efficient information retrieval.

Chatbot Design Method Using Hybrid Word Vector Expression Model Based on Real Telemarketing Data

  • Zhang, Jie;Zhang, Jianing;Ma, Shuhao;Yang, Jie;Gui, Guan
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.4
    • /
    • pp.1400-1418
    • /
    • 2020
  • In commercial promotion, the chatbot is known as one of the significant applications of natural language processing (NLP). Conventional design methods use the bag-of-words (BOW) model alone, based on the Google database and other online corpora. For one thing, in the bag-of-words model the vectors are unrelated to one another. Although this method is friendly to discrete features, it does not help the machine understand continuous statements, because the connection between words is lost in the encoded word vector. For another, existing methods are tested on state-of-the-art online corpora but are hard to apply to real applications such as telemarketing data. In this paper, we propose an improved chatbot design method using a hybrid of the bag-of-words model and the skip-gram model, based on real telemarketing data. Specifically, we first collect real data in the telemarketing field and perform data cleaning and data classification on the constructed corpus. Second, the word representation adopts the hybrid bag-of-words and skip-gram model. The skip-gram model maps synonyms to nearby points in the vector space. The correlation between words is thereby expressed, so the amount of information contained in the word vector increases, making up for the shortcomings of using the bag-of-words model alone. Third, we use the term frequency-inverse document frequency (TF-IDF) weighting method to increase the weight of key words, and then output the final word expression. Finally, the answer is produced using a hybrid of a retrieval model and a generative model. The retrieval model can accurately answer questions in the field. The generative model supplements it by answering open-domain questions, with the final reply produced by long short-term memory (LSTM) training and prediction. Experimental results show that the hybrid word vector expression model can improve the accuracy of the response and that the whole system can communicate with humans.
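The TF-IDF weighting step in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the TF-IDF weights are combined with the word vectors, so the TF-IDF-weighted average used here is an assumption, and the function names (`idf`, `weighted_sentence_vector`), the smoothing term, and the toy vectors are all hypothetical.

```python
import math
from collections import Counter

def idf(corpus):
    """Inverse document frequency per word; corpus is a list of token lists.
    The +1.0 smoothing is an illustrative choice, not taken from the paper."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def weighted_sentence_vector(tokens, vectors, idf_table):
    """TF-IDF-weighted average of word vectors (assumed combination rule).
    Words missing from either table are skipped."""
    tf = Counter(tokens)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        if word not in vectors or word not in idf_table:
            continue
        weight = (count / len(tokens)) * idf_table[word]
        for i, x in enumerate(vectors[word]):
            total[i] += weight * x
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total

# Toy data: "hello" appears in every document, "price" in only one,
# so "price" receives the larger weight.
corpus = [["hello", "price"], ["hello", "refund"]]
vectors = {"hello": [1.0, 0.0], "price": [0.0, 1.0]}  # stand-ins for skip-gram vectors
vec = weighted_sentence_vector(["hello", "price"], vectors, idf(corpus))
```

In this toy case the rarer word dominates the resulting vector, which is the effect the abstract attributes to TF-IDF weighting of key words.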

Calculation of similarity by weighting title and summary in word co-occurrence of research reports (연구 보고서의 공기관계 정보에 제목 및 요약의 가중치를 적용한 유사도 계산)

  • Kim, Nam-Hun;Joo, Jong-Min;Park, Hyuk-Ro;Yang, Hyung-Jeong
    • Proceedings of The KACE
    • /
    • 2017.08a
    • /
    • pp.37-40
    • /
    • 2017
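  • This paper proposes a similarity calculation method that applies weights to the co-occurrence information, title, and summary of national research reports. To this end, we extracted text from national R&D reports, split each document into sentences, removed both general stop words and stop words characteristic of such reports, performed morphological analysis, and then extracted co-occurrence relations. In addition, to increase accuracy when calculating document similarity, we assigned weights to the title and summary. As a result, the proposed method improved retrieval performance by 2.5% over a method using the document search library Lucene and by 1.1% over the Knn-heuristic method. These results show that a document's summary, title, and co-occurrence information affect the calculation of similarity between research reports.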
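The idea of weighting the title and summary when computing similarity can be sketched as follows. This is a simplified illustration, not the authors' method: it uses cosine similarity over weighted term counts, and the field weights in `FIELD_WEIGHTS` and the function names are hypothetical. The tokens could equally be co-occurrence pairs rather than single terms.

```python
import math
from collections import Counter

# Hypothetical field weights; the abstract does not report the values used.
FIELD_WEIGHTS = {"title": 3.0, "summary": 2.0, "body": 1.0}

def weighted_terms(doc):
    """Build a weighted term vector; doc maps field name -> list of tokens."""
    vec = Counter()
    for field, tokens in doc.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for token in tokens:
            vec[token] += weight
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = {"title": ["solar"], "body": ["cell", "cost"]}
d2 = {"title": ["solar"], "body": ["panel"]}
d3 = {"title": ["wind"], "body": ["cell", "cost"]}

same_title = cosine(weighted_terms(d1), weighted_terms(d2))
same_body = cosine(weighted_terms(d1), weighted_terms(d3))
```

With these weights, two reports sharing a title term score as more similar than two reports sharing only body terms.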

Classifying Temporal Topics with Similar Patterns on Twitter

  • Yun, Hong-Won
    • Journal of information and communication convergence engineering
    • /
    • v.9 no.3
    • /
    • pp.295-300
    • /
    • 2011
  • Twitter is a popular microblogging service that enables users to send and read short text messages. These messages are becoming a source for analyzing topic trends and identifying relations among temporal topics. In this paper, we propose a method that treats the classification of temporal topics on Twitter as a problem of grouping similar patterns. To provide a starting point for classification under the same topics, we identify a content word weighting scheme based on Latent Dirichlet Allocation (LDA). We also formulate how the temporal topics in a time window can be classified as peaky topics, constant topics, or periodic topics. We provide several real case studies that show the validity of the proposed method. Evaluations show that the proposed method is useful as a classification model in the analysis of temporal topics.

Performance Improvement of Word Spotting Using State Weighting of HMM (HMM의 상태별 가중치를 이용한 핵심어 검출의 성능 향상)

  • 최동진
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • 1998.06e
    • /
    • pp.305-308
    • /
    • 1998
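  • This paper proposes a new post-processing method to improve the performance of keyword spotting. In general, the likelihoods of the top n candidate words detected by a keyword spotting system are often similar. Consequently, for a given speech segment, acoustically similar keywords are more likely to be misrecognized as one another. However, the post-processing methods used in existing keyword spotting evaluate the likelihood with equal weight on all segments of the speech, so they are not well suited to comparing keywords with similar acoustic features. To solve this problem, this paper proposes an algorithm that improves discrimination by reflecting, in the likelihood calculation, weights based on partial differences in the acoustic features of the candidate words. In experiments, the proposed method improved discrimination between similar candidate words and, at a recognition rate of 93%, reduced the false alarm rate by 19.6% compared with the likelihood ratio test method.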

An Implementation of Best Match Algorithm for Korean Text Retrieval in the Client/Server Environment (클라이언트 서버 환경에서 한글텍스트 검색을 위한 베스티매치 알고리즘의 구현)

    • Journal of Korean Library and Information Science Society
    • /
    • v.32 no.1
    • /
    • pp.249-260
    • /
    • 2001
  • This paper presents the application of a best match search algorithm in a client/server system for natural language access to a Web-based database. For this purpose, procedures to process Korean word variants and to execute a probabilistic weighting scheme have been implemented in the client/server system. The experimental runs used a Korean test set that included documents, queries, and relevance judgements. The experimental results demonstrate that best match retrieval with relevance information performs better than retrieval without it.

HF-IFF: Applying TF-IDF to Measure Symptom-Medicinal Herb Relevancy and Visualize Medicinal Herb Characteristics - Studying Formulations in Cheongkangeuigam - (HF-IFF: TF-IDF를 응용한 병증-본초 연관성(relevancy) 측정과 본초 특성의 시각화 -청강의감 방제를 대상으로-)

  • Oh, Junho
    • The Korea Journal of Herbology
    • /
    • v.30 no.3
    • /
    • pp.63-68
    • /
    • 2015
  • Objectives : We applied the term weighting method used in the field of data search to quantify relevancy between symptoms and medicinal herbs and, based on this, we aim to introduce a method of visualizing the characteristics of medicinal herbs. Methods : We proposed HF-IFF, an adaptation of TF-IDF, a term weighting method used in the field of data search. Using this method, we deduced the relevancy between symptoms and medicinal herbs in Cheongkangeuigam, which was published in 1984 by organizing the medical theory of Cheongkang, Kim Younghoon, and visualized this as a graph in order to compare the characteristics of medicinal herbs used for different symptoms. Results : HF-IFF is the product of HF and IFF, where HF is the frequency of the relevant medicinal herb for a set of symptoms, and IFF is the inverse of the number of formulations (FF) containing that herb. A total of 251 types of medicinal herb are used in Cheongkangeuigam, and 1538 formulations are classified according to 67 types of symptom. The overall mean for HF-IFF was 0.491, with a maximum of 4.566 and a minimum of 0.013. Conclusions : In spite of several limitations, we were able to use HF-IFF to measure relevancy between symptoms and medicinal herbs, with formulations as an intermediate. We were able to use the quantified results to visually express the characteristics of the herbs used for symptoms through bubble charts and word clouds built from HF-IFF.
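The HF-IFF definition in the abstract above (the product of HF and IFF) can be sketched in a few lines. This is a literal reading of the abstract only: it does not say whether HF is normalized or whether IFF is log-scaled as in common TF-IDF variants, so this sketch uses raw counts and a plain reciprocal. The function name `hf_iff`, the input layout, and the sample symptoms and herbs are hypothetical.

```python
from collections import Counter, defaultdict

def hf_iff(formulations):
    """Compute HF-IFF per (symptom, herb).

    formulations: list of (symptom, herbs) pairs, one pair per formulation
    (a hypothetical layout). HF is the number of a symptom's formulations
    that contain the herb; FF is the number of all formulations containing
    the herb; IFF is taken as 1 / FF (an assumption, see the note above).
    """
    ff = Counter()
    hf = defaultdict(Counter)
    for symptom, herbs in formulations:
        for herb in set(herbs):
            ff[herb] += 1
            hf[symptom][herb] += 1
    return {
        symptom: {herb: count / ff[herb] for herb, count in herbs.items()}
        for symptom, herbs in hf.items()
    }

# Made-up sample data for illustration only.
sample = [
    ("cough", {"ginseng", "licorice"}),
    ("cough", {"licorice"}),
    ("fever", {"licorice"}),
]
scores = hf_iff(sample)
```

In the sample, the herb that appears only in cough formulations scores higher for cough than the herb that appears in every formulation, mirroring how TF-IDF discounts terms common to all documents.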