• Title/Abstract/Keywords: text information


전문데이터베이스의 특성과 정보검색성능 (On the Characteristics and Information Retrieval Performance of Full-Text Databases)

  • 조명희
    • 한국문헌정보학회지 / Vol. 17 / pp. 339-366 / 1989
  • The appearance of full text online is the most encouraging development in the history of databases. Today's full-text databases are by-products of the electronic publication of printed materials. In Korea, too, there is some movement toward the electronic production of documents, although it is not yet strong. The present study examines the characteristics of, and effective retrieval methods for, the full-text databases now commercially available through various vendors. The paper is organized as follows. First, the background and present state of existing full-text database services, both national and worldwide, are examined. Second, free-text searching of full-text databases is compared with controlled-vocabulary searching. The factors influencing free-text retrieval performance, the searching thesaurus, and hybrid (compromise) systems, which use a limited controlled vocabulary in conjunction with natural language to provide the enrichment needed for practical operation, are examined. Third, user demands are identified through an analysis of preceding studies on various types of full-text databases. Fourth, the application of CD-ROM full-text databases in libraries and information centers is examined as a prospective resource for them. Finally, some problems and prospects of full-text databases are presented.


원문정보공개 서비스에서의 개인정보 보호 실태 (The Status of Personal Information Protection for Original Text Information Disclosure Service)

  • 안혜미
    • 한국기록관리학회지 / Vol. 19, No. 2 / pp. 147-172 / 2019
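  • With the introduction of the original text information disclosure service, the time spent deciding whether to disclose original text information has become very short, and the number of disclosures has increased greatly. The risk of personal information exposure at public institutions has also risen. This study examines the status of personal information protection in the original text information disclosure service, analyzes the causes of personal information exposure, and proposes improvements. The findings of the survey are as follows. First, 13% of the collected original text information contained personal information that is subject to non-disclosure. Second, among the original text information containing such personal information, records containing the personal information of public officials accounted for the largest share; records concerning vacation and sick leave were especially common. Third, at institutions that mainly handle contract work, there were many cases in which information about individual representatives was exposed. Fourth, much personal information went undetected by personal information filtering. The improvements proposed from the analysis of the causes of exposure are as follows. First, the personal information protection guidelines should be redesigned. Second, training should be strengthened for the staff who decide whether original text information is disclosed. Third, the government's excessive information disclosure policy, which is driven by quantitative performance, should be relaxed. Fourth, the personal information filtering function of the original text information disclosure system should be improved.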

Neural Text Categorizer for Exclusive Text Categorization

  • Jo, Tae-Ho
    • Journal of Information Processing Systems / Vol. 4, No. 2 / pp. 77-86 / 2008
  • This research proposes a new neural network for text categorization that uses a representation of documents other than numerical vectors. Because the proposed neural network is intended only for text categorization, it is called NTC (Neural Text Categorizer) in this research. Numerical vectors representing documents for text mining tasks have two inherent problems: huge dimensionality and sparse distribution. Although many feature selection methods have been developed to address the first problem, the reduced dimension still remains large. If the dimension is reduced excessively by a feature selection method, the robustness of text categorization is degraded. Even though SVM (Support Vector Machine) is tolerant of huge dimensionality, it is not tolerant of the second problem. The goal of this research is to address the two problems at the same time by proposing a new representation of documents and a new neural network that takes this representation as its input.

Improving Elasticsearch for Chinese, Japanese, and Korean Text Search through Language Detector

  • Kim, Ki-Ju;Cho, Young-Bok
    • Journal of information and communication convergence engineering / Vol. 18, No. 1 / pp. 33-38 / 2020
  • Elasticsearch is an open source search and analytics engine that can search petabytes of data in near real time. It is designed as a horizontally scalable, highly available distributed system. It provides RESTful APIs, which makes it programming-language agnostic. Full-text search of multilingual text requires language-specific analyzers and field mappings appropriate for indexing and searching multilingual text. Additionally, a language detector can be used in conjunction with the analyzers to improve multilingual text search. Elasticsearch provides more than 40 language analysis plugins that can process text and extract language-specific tokens, as well as language detector plugins that can determine the language of a given text. This study investigates three different approaches to indexing and searching Chinese, Japanese, and Korean (CJK) text (single analyzer, multi-fields, and language detector-based), and identifies the advantages of the language detector-based approach compared to the other two.
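
The multi-fields and language detector-based approaches named in the abstract can be illustrated with a minimal sketch. It is not the configuration used in the paper: the field name `content`, the sub-field names `zh`, `ja`, `ko`, and both helper functions are hypothetical, and language detection is assumed to happen elsewhere. The analyzer names `smartcn`, `kuromoji`, and `nori` are the ones supplied by the analysis-smartcn, analysis-kuromoji, and analysis-nori plugins, which must be installed for such a mapping to be accepted.

```python
import json

# Analyzer names provided by the analysis-smartcn, analysis-kuromoji,
# and analysis-nori plugins. The language keys are illustrative assumptions.
CJK_ANALYZERS = {"zh": "smartcn", "ja": "kuromoji", "ko": "nori"}


def build_multi_field_mapping(field="content"):
    """Build an index mapping with one sub-field per language analyzer."""
    sub_fields = {
        lang: {"type": "text", "analyzer": analyzer}
        for lang, analyzer in CJK_ANALYZERS.items()
    }
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "standard",
                    "fields": sub_fields,
                }
            }
        }
    }


def build_query(text, detected_lang=None, field="content"):
    """Build a search body.

    With a detected language, only the matching sub-field is searched.
    Without one, every sub-field is searched (multi-fields style).
    """
    if detected_lang in CJK_ANALYZERS:
        return {"query": {"match": {f"{field}.{detected_lang}": text}}}
    fields = [field] + [f"{field}.{lang}" for lang in CJK_ANALYZERS]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


if __name__ == "__main__":
    print(json.dumps(build_multi_field_mapping(), indent=2))
    print(json.dumps(build_query("검색", detected_lang="ko"), ensure_ascii=False))
```

The dictionaries are plain JSON bodies, so they could be sent to the index-creation and search endpoints with any HTTP client. Restricting the query to one sub-field is one plausible way a detected language could narrow the search.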

A Text Similarity Measurement Method Based on Singular Value Decomposition and Semantic Relevance

  • Li, Xu;Yao, Chunlong;Fan, Fenglong;Yu, Xiaoqiang
    • Journal of Information Processing Systems / Vol. 13, No. 4 / pp. 863-875 / 2017
  • Traditional text similarity measures based on word-frequency vectors ignore the semantic relationships between words, which, together with the high dimensionality and sparsity of document vectors, has become an obstacle to text similarity calculation. To address these problems, an improved singular value decomposition is used to reduce the dimensionality and remove noise from the text representation model. The optimal number of singular values is analyzed, and the semantic relevance between words is calculated in the constructed semantic space. An inverted index construction algorithm and definitions of similarity between vectors are proposed to calculate the similarity between two documents at the semantic level. Experimental results on a benchmark corpus demonstrate that the proposed method improves the F-measure.
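
The limitation that motivates this paper can be seen in a small sketch of the word-frequency baseline. It shows only the traditional cosine measure, not the paper's SVD-based method, and the helper names and example sentences are illustrative assumptions.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Represent a document as a sparse word-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = term_frequencies("the car is fast")
doc2 = term_frequencies("an automobile moves quickly")  # same meaning, no shared words
doc3 = term_frequencies("the car is red")               # different meaning, shared words

if __name__ == "__main__":
    print(cosine_similarity(doc1, doc2))  # no overlapping terms
    print(cosine_similarity(doc1, doc3))  # three overlapping terms
```

Because the first two sentences share no words, their similarity is zero even though they are near paraphrases, while the pair that merely shares surface words scores higher. Closing that gap is the job of a semantic space such as the one the abstract describes.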

CNN-based Skip-Gram Method for Improving Classification Accuracy of Chinese Text

  • Xu, Wenhua;Huang, Hao;Zhang, Jie;Gu, Hao;Yang, Jie;Gui, Guan
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 13, No. 12 / pp. 6080-6096 / 2019
  • Text classification is one of the fundamental techniques in natural language processing. Numerous studies are based on text classification, such as news subject classification, question answering system classification, and movie review classification. Traditional text classification methods first extract features and then classify them. However, traditional methods are too complex to operate, and their accuracy is not sufficiently high. Recently, a convolutional neural network (CNN) based one-hot method has been proposed for text classification to solve this problem. In this paper, we propose an improved method for Chinese text classification that uses a CNN-based skip-gram method, and we conduct experiments on the Sogou news corpus. Experimental results indicate that the CNN with the skip-gram model performs more efficiently than the CNN-based one-hot method.
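
The two input representations contrasted in the abstract can be sketched in their generic textbook form. This is not the paper's pipeline: Chinese word segmentation, embedding training, and the CNN itself are omitted, and the token list and function names are illustrative assumptions.

```python
def one_hot(word, vocab):
    """One-hot vector: as long as the vocabulary, with a single 1."""
    return [1 if w == word else 0 for w in vocab]


def skip_gram_pairs(tokens, window=2):
    """(target, context) training pairs used to learn skip-gram embeddings."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["text", "classification", "with", "cnn"]
vocab = sorted(set(tokens))

if __name__ == "__main__":
    print(one_hot("text", vocab))
    for pair in skip_gram_pairs(tokens, window=1):
        print(pair)
```

A one-hot vector grows with the vocabulary and carries no information about neighbouring words. Skip-gram pairs instead train dense, low-dimensional embeddings from local context, which is the kind of input the proposed method feeds to the CNN.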

An Exploratory Approach to Discovering Salary-Related Wording in Job Postings in Korea

  • Ha, Taehyun;Coh, Byoung-Youl;Lee, Mingook;Yun, Bitnari;Chun, Hong-Woo
    • Journal of Information Science Theory and Practice / Vol. 10, Special Issue / pp. 86-95 / 2022
  • Online recruitment websites discuss job demands in various fields, and job postings contain detailed job specifications. Analyzing this text can elucidate the features that determine job salaries. Text embedding models can learn the contextual information in a text, and explainable artificial intelligence frameworks can be used to examine in detail how text features contribute to the models' outputs. We collected 733,625 job postings using the WORKNET API and classified them into low, mid, and high-range salary groups. A text embedding model that predicts job salaries based on the text in job postings was trained with the collected data. Then, we applied the SHapley Additive exPlanations (SHAP) framework to the trained model and discovered the significant words that determine each salary class. Several limitations and remaining words are also discussed.

Text Classification on Social Network Platforms Based on Deep Learning Models

  • YA, Chen;Tan, Juan;Hoekyung, Jung
    • Journal of information and communication convergence engineering / Vol. 21, No. 1 / pp. 9-16 / 2023
  • The natural language on social network platforms has a certain front-to-back dependency in structure, and the direct conversion of Chinese text into a vector makes the dimensionality very high, thereby resulting in the low accuracy of existing text classification methods. To this end, this study establishes a deep learning model that combines a big data ultra-deep convolutional neural network (UDCNN) and long short-term memory network (LSTM). The deep structure of UDCNN is used to extract the features of text vector classification. The LSTM stores historical information to extract the context dependency of long texts, and word embedding is introduced to convert the text into low-dimensional vectors. Experiments are conducted on the social network platforms Sogou corpus and the University HowNet Chinese corpus. The research results show that compared with CNN + rand, LSTM, and other models, the neural network deep learning hybrid model can effectively improve the accuracy of text classification.

가변적 클러스터 개수에 대한 문서군집화 평가방법 (The Evaluation Measure of Text Clustering for the Variable Number of Clusters)

  • 조태호
    • 한국정보과학회:학술대회논문집 / 한국정보과학회 2006년도 가을 학술발표논문집 Vol.33 No.2 (B) / pp. 233-237 / 2006
  • This study proposes an innovative measure for evaluating the performance of text clustering. When the K-means algorithm or Kohonen Networks are used for text clustering, the number of clusters is fixed in advance as a parameter, whereas with the single pass algorithm the number of clusters is not predictable. Given labeled documents, the result of text clustering with the K-means algorithm or a Kohonen Network can be evaluated by setting the number of clusters to the number of given target categories, mapping each cluster to a target category, and using the evaluation measures of text. But with the single pass algorithm, if the number of clusters differs from the number of target categories, such measures are useless for evaluating the result of text clustering. This study proposes an evaluation measure of text clustering based on intra-cluster similarity and inter-cluster similarity, called CI (Clustering Index) in this article.


다중 영상 및 텍스트 동기화를 고려한 Music Player MAF 의 확장 포맷 연구 (A study on Extensions to Music Player MAF for Multiple JPEG images and Text data with Synchronization)

  • 양찬석;임정연;김문철
    • 대한전자공학회:학술대회논문집 / 대한전자공학회 2005년도 추계종합학술대회 / pp. 967-970 / 2005
  • The Music Player MAF format of ISO/IEC 23000-2 FDIS consists of MP3 data, MPEG-7 metadata, and one optional JPEG image, based on the MPEG-4 File Format. However, the current Music Player MAF format does not allow multiple JPEG images or timed text data. Timed text data and multiple JPEG images are helpful in various multimedia applications. For example, listening material for foreign language study needs an additional book containing text and images; audio content that carries image and text data can help the listener understand the whole story and its situations. In this paper, we propose a detailed file structure, in conjunction with the MPEG-4 File Format, that improves functionality by carrying multiple image data and text data with synchronization information between the MP3 data and the other resources.
