• Title/Summary/Keyword: WordRank

Search Result 48, Processing Time 0.028 seconds

Topic Model Augmentation and Extension Method using LDA and BERTopic (LDA와 BERTopic을 이용한 토픽모델링의 증강과 확장 기법 연구)

  • Kim, SeonWook;Yang, Kiduk
    • Journal of the Korean Society for information Management / v.39 no.3 / pp.99-132 / 2022
  • This study proposes AET (Augmented and Extended Topics), a novel method that synthesizes LDA and BERTopic results, and applies it experimentally to recently published LIS articles. To this end, 55,442 abstracts from 85 LIS journals in the WoS database, published between January 2001 and October 2021, were analyzed. AET first constructs a Word2Vec-based cosine similarity matrix between the LDA and BERTopic results. It then extracts AT (Augmented Topics) by repeating matrix reordering and segmentation for as long as the semantic relations remain valid. Finally, it determines ET (Extended Topics) by removing any LDA-related residual subtopics from the matrix and ordering the remaining ones by F1 (BERTopic topic size rank, inverse cosine similarity rank). Compared with the baseline LDA result, AT effectively concretizes the original LDA topic model, and ET discovers new meaningful topics that LDA did not. In the qualitative performance evaluation, AT performs better than LDA, while ET performs similarly except in a few cases.
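
The first AET step described above, a cosine similarity matrix between two sets of topics, can be sketched as follows. This is a minimal illustration, not the authors' implementation: representing a topic as the mean of its word vectors is an assumption, the two-dimensional `embeddings` table is toy data standing in for a trained Word2Vec model, and the names `topic_vector` and `similarity_matrix` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_vector(words, embeddings):
    """Assumption: a topic is represented by the mean of its word vectors."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def similarity_matrix(lda_topics, bert_topics, embeddings):
    """Rows are LDA topics, columns are BERTopic topics."""
    lda_vecs = [topic_vector(t, embeddings) for t in lda_topics]
    bert_vecs = [topic_vector(t, embeddings) for t in bert_topics]
    return [[cosine(l, b) for b in bert_vecs] for l in lda_vecs]

# Toy 2-d embeddings standing in for a trained Word2Vec model.
embeddings = {
    "library": [1.0, 0.0],
    "catalog": [0.9, 0.1],
    "citation": [0.0, 1.0],
    "impact": [0.1, 0.9],
}
lda_topics = [["library", "catalog"], ["citation", "impact"]]
bert_topics = [["catalog"], ["impact", "citation"]]
matrix = similarity_matrix(lda_topics, bert_topics, embeddings)
```

In this toy setup each LDA topic is most similar to the BERTopic topic that shares its vocabulary; the reordering and segmentation steps that follow in AET are not covered here.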

Concept-based Compound Keyword Extraction (개념기반 복합키워드 추출방법)

  • Lee, Sangkon;Lee, Taehun
    • The Journal of Korean Association of Computer Education / v.6 no.2 / pp.23-31 / 2003
  • In general, people use a keyword or phrase as the name of a field or as a subject term for a document. This paper focuses on keyword extraction. We first examine cases in which authors assign keywords that do not occur as content words in the text, and present generation rules that combine words into compound keywords based on conceptual lexical information. We also present a new importance measure that filters out keywords unrelated to the document's content. To verify the extraction results, we collected titles and abstracts from research papers on natural language and/or speech processing, and obtained 96% precision at the top rank of the extraction results.


Text Summarization on Large-scale Vietnamese Datasets

  • Ti-Hon, Nguyen;Thanh-Nghi, Do
    • Journal of information and communication convergence engineering / v.20 no.4 / pp.309-316 / 2022
  • This investigation addresses automatic text summarization on large-scale Vietnamese datasets. Vietnamese articles were collected from newspaper websites, and plain text was extracted to build a dataset of 1,101,101 documents. A new single-document extractive text summarization model was then proposed and evaluated on this dataset. In this model, the k-means algorithm clusters the sentences of the input document using different text representations, such as BoW (bag-of-words), TF-IDF (term frequency - inverse document frequency), Word2Vec (word-to-vector), GloVe, and FastText. The summarization algorithm then uses the trained k-means model to rank the candidate sentences and builds a summary from the highest-ranked sentences. In terms of F1-score, the model achieved 51.91% ROUGE-1, 18.77% ROUGE-2, and 29.72% ROUGE-L, compared with 52.33% ROUGE-1, 16.17% ROUGE-2, and 33.09% ROUGE-L for a competitive abstractive model. An advantage of the proposed model is its time complexity of O(n,k,p) = O(n(k+2/p)) + O(n log2 n) + O(np) + O(nk^2) + O(k).
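
The clustering-then-ranking idea in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' model: it uses a bag-of-words representation, a tiny hand-written k-means with deterministic initialization from the first `k` sentences, and it ranks candidates by distance to their cluster centroid, which is one plausible reading of "rank the candidate sentences". The function names are hypothetical, and `k` is assumed not to exceed the number of sentences.

```python
import math
from collections import Counter

def bow_vectors(sentences):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vectors = []
    for s in sentences:
        counts = Counter(s.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def dist(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(vecs, k, iters=20):
    """Plain k-means; centroids start at the first k vectors (an assumption)."""
    centroids = [list(v) for v in vecs[:k]]
    assign = [0] * len(vecs)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(v, centroids[c])) for v in vecs]
        for c in range(k):
            members = [v for v, a in zip(vecs, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def summarize(sentences, k):
    """Keep, per cluster, the sentence closest to the centroid, in document order."""
    vecs = bow_vectors(sentences)
    centroids, assign = kmeans(vecs, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: dist(vecs[i], centroids[c])))
    return [sentences[i] for i in sorted(chosen)]

sentences = [
    "the cat sat on the mat",
    "stocks fell sharply on monday",
    "the cat chased the mouse",
    "stocks rose again on tuesday",
]
summary = summarize(sentences, k=2)
```

With the toy input, the two animal sentences and the two market sentences form separate clusters, so the summary contains one sentence from each.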

Candidate Word List and Probability Score Guided for Korean Scene Text Recognition (후보 단어 리스트와 확률 점수에 기반한 한국어 문자 인식 모델)

  • Lee, Yoonji;Lee, Jong-Min
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference / 2022.05a / pp.73-75 / 2022
  • Scene text recognition is an artificial intelligence technology needed for unmanned robots, autonomous vehicles, and human-computer interaction. However, scene text images are often distorted by factors such as illumination, low resolution, and blurring. Unlike previous studies that recognized only English, this paper achieves strong recognition accuracy across a variety of characters, including English, Korean, special characters, and numbers. Instead of selecting only the class with the highest probability, the method also considers the second-ranked probability to generate candidate words, which allows it to correct existing misrecognition problems.

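
The candidate-generation idea in this abstract (keeping the second-ranked class as well as the first) can be sketched as below. This is a hedged illustration, not the paper's model: the per-position probability tables are made-up toy values, scoring a candidate by the product of its character probabilities is an assumption, and the final lookup against a small `lexicon` is a hypothetical way of using the candidate list.

```python
from itertools import product

def candidate_words(char_probs, top_n=2):
    """Build candidate words from the top_n classes at each character position.

    char_probs: one dict per position mapping character -> probability.
    Returns (word, score) pairs sorted by descending score, where the score is
    the product of the chosen characters' probabilities (an assumption).
    """
    per_position = []
    for probs in char_probs:
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        per_position.append(ranked[:top_n])
    candidates = []
    for combo in product(*per_position):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, p in combo:
            score *= p
        candidates.append((word, score))
    candidates.sort(key=lambda ws: ws[1], reverse=True)
    return candidates

# Toy recognizer output for a three-character word.
char_probs = [
    {"c": 0.9, "e": 0.1},
    {"o": 0.55, "a": 0.45},
    {"t": 0.8, "f": 0.2},
]
candidates = candidate_words(char_probs)

# Hypothetical lexicon: take the best-scoring candidate that is a known word,
# falling back to the top-1 string if none matches.
lexicon = {"cat", "eat"}
best = next((w for w, _ in candidates if w in lexicon), candidates[0][0])
```

Here the greedy top-1 string is "cot", but because the second-ranked character at position two is kept, the lexicon lookup can recover "cat".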

A Study of the Classification of Nursing Diagnoses (간호진단 분석 일 연구)

  • Shon Young-Hee
    • Journal of Korean Academy of Fundamentals of Nursing / v.4 no.1 / pp.119-131 / 1997
  • This study analyzed the nursing diagnoses that nursing students applied in case studies during their clinical practice, in order to provide an educational basis for nursing diagnoses. The data were collected over two years (1995 and 1996) from 70 case studies reported by second- and third-year junior college nursing students. The students made 259 nursing diagnoses, of which 230 conformed to the NANDA classification and were included in the analysis. The results were as follows: 1. The number of diagnoses indicating response patterns was 35 (35.7%), compared with 98 diagnoses in the NANDA table. Among the 35 diagnoses, the exchange pattern was the most frequent, followed by feeling, moving, and knowing, in rank order. 2. The diagnoses were analyzed by response pattern category. For instance, in the exchange pattern, 'Altered in Nutrition' was the most frequent, followed by 'Risk for Infection' and 'Ineffective Airway Clearance'. 3. Among the 230 diagnoses, 'Knowledge Deficit' was mentioned most frequently, followed by 'Activity Intolerance', 'Anxiety', 'Pain', 'Altered in Nutrition', 'Risk for Infection', and 'Ineffective Airway Clearance', in rank order. 4. The wording of each diagnosis varied. 'Activity Intolerance' was expressed in 6 different ways. 5. The related factors applied to each diagnosis were analyzed. For instance, the related factors of 'Knowledge Deficit' included illness and therapeutic process, lack of motivation, occurrence of complications, limited experience, and surgery. Based on this study, the researcher recommends the following: 1) The content validity of the current diagnoses needs to be verified when they are applied to our culture. 2) The most effective educational method for applying nursing diagnoses should be explored. 3) Further study could focus not only on 'related factors' but also on 'signs and symptoms'.


An Improved Automatic Text Summarization Based on Lexical Chaining Using Semantical Word Relatedness (단어 간 의미적 연관성을 고려한 어휘 체인 기반의 개선된 자동 문서요약 방법)

  • Cha, Jun Seok;Kim, Jeong In;Kim, Jung Min
    • Smart Media Journal / v.6 no.1 / pp.22-29 / 2017
  • With the recent rapid advancement and spread of smart devices, the volume of document data on the Internet is increasing sharply. The growth of information on the Web, including a massive number of documents, makes it increasingly difficult for users to understand the data. Various studies on automated summarization are under way to summarize documents efficiently. This study uses the TextRank algorithm for this purpose. TextRank represents sentences or keywords as a graph and uses its vertices and edges to capture the semantic relations between words and sentences and to estimate sentence importance. It extracts high-ranking keywords and, based on those keywords, extracts important sentences. To extract important sentences, the algorithm first groups the vocabulary, using a specific weighting scale. The program selects the sentences with higher scores on this scale and, based on the selected sentences, extracts the important sentences that summarize the document. The results show that this process performs better than the summarization methods of previous studies and that the algorithm can summarize documents more efficiently.
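
The graph-based ranking that TextRank relies on can be sketched as follows. This shows only the generic sentence-level TextRank (word-overlap similarity normalized by log sentence lengths, followed by a damped PageRank-style iteration); the lexical chaining and vocabulary grouping that this paper adds are not reproduced. The function names and the toy sentences are illustrative assumptions.

```python
import math

def overlap(a, b):
    """Word-overlap similarity normalized by the log lengths of both sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank(sentences, damping=0.85, iters=50):
    """Score sentences by iterating a weighted PageRank over the similarity graph."""
    n = len(sentences)
    w = [[overlap(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total edge weight leaving each vertex
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return scores

sentences = [
    "graph ranking scores sentences",
    "graph ranking finds important sentences",
    "the weather is nice today",
]
scores = textrank(sentences)
```

The third sentence shares no words with the others, so it receives only the baseline `1 - damping` score, while the two connected sentences reinforce each other.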

Increasing Returns to Information and Its Application to the Korean Movie Market

  • Kim, Sang-Hoon;Lee, Youseok
    • Asia Marketing Journal / v.15 no.1 / pp.43-55 / 2013
  • Since movies are experience goods, consumers are easily influenced by other consumers' behavior. For moviegoers, box office rank is the most credible and easily accessible information. Many studies have found that the relationship between a movie's box office rank and its revenue departs from the Pareto distribution, and this phenomenon has been named "increasing returns to information." The primary objective of the current research is to apply the empirical model proposed by De Vany and Walls (1996) to the Korean movie market in order to examine whether the same phenomenon prevails there. The other purpose of the present study is to provide managers with useful implications about the release timing of a movie by finding different curvatures that depend upon seasonality. The empirical test on the Korean movie market shows results similar to those of prior studies conducted on the U.S., Hong Kong, and U.K. movie markets. The phenomenon of increasing returns is generated by information transmission among consumers, which makes some movies become blockbusters and others bombs. The proposed model can also be interpreted in such a way that a change in the rank has a nonlinear effect on the movie's performance. If a movie climbs up the chart, it is rewarded more than proportionally. On the other hand, if a movie falls down in the ranks, its performance drops rapidly. The research result also indicates that the phenomenon of increasing returns occurs differently depending on when the movies are released. Since the tendency of the increasing returns to information is stronger during the peak seasons, movie marketers should decide upon the release timing of a movie based on its competitiveness. If a movie has substantial potential to generate positive word-of-mouth, it would be more reasonable to release the movie during the peak season to enjoy increasing returns. Otherwise, a movie should be released during the low season to minimize the risk of being dropped from the chart.


Analysis of the different of Interest words between Korea and Vietnam using network theory - Focusing on smart city (네트워크 이론을 이용한 한국과 베트남의 관심어 차이 분석 - 스마트시티를 중심으로)

  • Jeong, Seong Yun;Kim, Nam Gon
    • Smart Media Journal / v.11 no.8 / pp.73-83 / 2022
  • To help new construction engineering companies with limited access to information advance successfully into the overseas construction market, this study analyzed which keywords attract interest in the overseas construction market and how they differ from those in Korea. For this purpose, we collected 2,473 recent news article titles and major articles on smart cities, a topic of high interest in both Korea and Vietnam. Through network construction and topic modeling, we examined the connections among the words of interest. In addition, the influence of each word of interest in the network was measured using PageRank centrality. The analysis found that both countries show high interest in smart city-related construction, cities, and digital topics, and the differences in words of interest between Korea and Vietnam were inferred. Finally, the limitations of this study and directions for further research to address them are presented.

Performance Analysis on Hadoop with SSD for Interative Process (SSD 타입 저장장치를 포함하는 Hadoop 시스템의 Iterative Processing 처리 성능 분석)

  • Oh, Sangyoon;Kwon, Seong-Min;Lee, Sookyung
    • Proceedings of the Korean Society of Computer Information Conference / 2016.07a / pp.191-193 / 2016
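  • This paper presents a performance analysis of iterative processing on Hadoop with SSD storage. Hadoop is a framework whose structure is specialized for batch processing through the MapReduce parallel programming model. This design delivers large performance gains in parallel and distributed environments, but performance degrades for iterative processing, which performs repeated tasks. We therefore ran experiments to determine whether SSDs, whose falling prices make them candidates for use in Hadoop systems, can resolve the performance issue of iterative jobs, and whether SSDs offer any source of performance improvement. In the experiments, word count (batch processing) and the PageRank algorithm (iterative processing) were implemented in MapReduce, and the performance improvement was measured for different data sizes. The results lead to the conclusion that large efficiency gains for Hadoop's iterative jobs are hard to expect from hardware improvements such as adding SSDs.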

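
The contrast between the two workloads in this abstract can be made concrete with a toy, in-memory imitation of the MapReduce pattern. This is a sketch only: `run_job` is a hypothetical stand-in for a full Hadoop job (no HDFS, no shuffle to disk), the link graph is made-up data, and every page is assumed to have at least one outgoing link. Word count finishes in one job, while PageRank must launch a new job for every iteration.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Batch processing: word count needs a single job.
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, ones):
    return sum(ones)

counts = run_job(["a b a", "b c"], wc_map, wc_reduce)

# Iterative processing: PageRank re-runs the same job once per iteration.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # toy graph, no dangling pages
DAMPING = 0.85

def pr_map(record):
    page, (rank, outlinks) = record
    yield page, 0.0  # keeps pages that receive no contributions
    for target in outlinks:
        yield target, rank / len(outlinks)

def pr_reduce(page, contributions):
    return (1 - DAMPING) / len(links) + DAMPING * sum(contributions)

ranks = {page: 1.0 / len(links) for page in links}
for _ in range(20):  # each pass corresponds to a separate job on a real cluster
    records = [(page, (ranks[page], links[page])) for page in links]
    ranks = run_job(records, pr_map, pr_reduce)
```

On a real cluster each pass of the loop would pay job start-up and intermediate I/O costs again, which is the overhead the experiments in this paper examine.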

An Experimental Study on Ranking Output of Title Word Searching in the Boolean OPAC System (OPAC에서 서명단어탐색의 문헌순위화에 관한 연구)

  • 노정순
    • Journal of the Korean Society for information Management / v.18 no.2 / pp.7-30 / 2001
  • Because OPAC systems have short document representations and short queries, they call for ranking algorithms different from those of IR systems. This study tested and analyzed the effectiveness of four sorting schemes, four ranking algorithms, and six effectiveness measures for ranked Boolean OPAC systems. Sorting by publication year performed better, but the difference was not significant. Cover density ranking was significantly better than the frequency-based ranking of the Fuzzy or DNF models. The simple effectiveness measure based on the average rank of the relevant documents retrieved was as good as the other measures and better than precision P.

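
The difference between the average-rank measure and precision mentioned in this abstract can be illustrated with a small sketch. The definitions below are the common textbook ones and are assumed here, since the paper's exact formulas are not given in the abstract; the document IDs are toy data.

```python
def average_relevant_rank(ranked_ids, relevant_ids):
    """Mean 1-based position of the relevant documents in a ranked list.

    Lower is better. Returns None if no relevant document was retrieved.
    """
    positions = [pos for pos, doc in enumerate(ranked_ids, start=1)
                 if doc in relevant_ids]
    return sum(positions) / len(positions) if positions else None

def precision(ranked_ids, relevant_ids):
    """Fraction of retrieved documents that are relevant (ignores order)."""
    if not ranked_ids:
        return 0.0
    return sum(doc in relevant_ids for doc in ranked_ids) / len(ranked_ids)

relevant = {"d1", "d4"}
ranking_a = ["d1", "d4", "d2", "d3"]  # relevant documents first
ranking_b = ["d2", "d3", "d1", "d4"]  # relevant documents last

avg_a = average_relevant_rank(ranking_a, relevant)
avg_b = average_relevant_rank(ranking_b, relevant)
prec_a = precision(ranking_a, relevant)
prec_b = precision(ranking_b, relevant)
```

Both orderings retrieve the same set, so set-based precision cannot tell them apart, whereas the average rank of the relevant documents does reflect the ordering.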