• Title/Summary/Keyword: topic relevance-based classification (주제 연관성 기반 분류)


An Automated Topic Specific Web Crawler Calculating Degree of Relevance (연관도를 계산하는 자동화된 주제 기반 웹 수집기)

  • Seo Hae-Sung;Choi Young-Soo;Choi Kyung-Hee;Jung Gi-Hyun;Noh Sang-Uk
    • Journal of Internet Computing and Services / v.7 no.3 / pp.155-167 / 2006
  • It is desirable for users surfing the Internet to find Web pages that match their interests as closely as possible. Toward this end, this paper presents a topic-specific Web crawler that computes the degree of relevance, collects a cluster of pages for a given topic, and refines the preliminary set of related Web pages using term frequency/document frequency, entropy, and compiled rules. In the experiments, we tested the crawler in terms of classification accuracy, crawling efficiency, and crawling consistency. First, the set of rules compiled by CN2 gave the best classification accuracy, ahead of the C4.5 and back-propagation learning algorithms. Second, we measured classification efficiency to determine the best threshold value for the degree of relevance. Third, we measured the consistency of the crawler as the number of resulting URLs that overlapped across different starting URLs. The experimental results imply that the crawler was fairly consistent regardless of the randomly chosen starting URLs.
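The relevance-threshold idea in the abstract above can be illustrated with a minimal sketch. The scoring function here (the share of a page's tokens that are topic terms) and the names `relevance`, `is_relevant`, and `threshold` are assumptions made for illustration; the paper's own score combines term frequency/document frequency, entropy, and compiled rules, which this sketch does not reproduce.

```python
import re
from collections import Counter


def relevance(page_text, topic_terms):
    """Toy relevance score: fraction of the page's tokens that are topic terms."""
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[term] for term in topic_terms)
    return hits / len(tokens)


def is_relevant(page_text, topic_terms, threshold=0.1):
    """Keep a page only if its score reaches the (hypothetical) threshold."""
    return relevance(page_text, topic_terms) >= threshold


topic = {"crawler", "web", "topic"}
page_a = "A topic specific web crawler collects web pages"
page_b = "Recipes for a quick weeknight dinner"

print(relevance(page_a, topic), is_relevant(page_a, topic))
print(relevance(page_b, topic), is_relevant(page_b, topic))
```

Raising `threshold` trades recall for precision, which is the kind of trade-off the second experiment in the abstract tunes.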


A Development Method of Framework for Collecting, Extracting, and Classifying Social Contents

  • Cho, Eun-Sook
    • Journal of the Korea Society of Computer and Information / v.26 no.1 / pp.163-170 / 2021
  • As big data is used in more industries, the big data market is expanding from hardware to infrastructure software to service software. In particular, it is expanding into a huge platform market that provides applications for holistic and intuitive visualization of big data, covering interpretation of meaning, understandability, and analysis results. Demand for big data extraction and analysis using social media such as SNS is very active among individuals as well as companies. However, despite this high demand for collecting and analyzing social media data for user trend analysis and marketing, little research addresses the difficulty of dynamic interlocking and the complexity of building and operating software platforms caused by the heterogeneity of social media service interfaces. In this paper, we propose a method for developing a framework that runs the process from collection through extraction to classification of social media data. The proposed framework solves the problem of heterogeneous social media data collection channels through the adapter pattern, and improves the accuracy of social topic extraction and classification through semantic association-based extraction and topic association-based classification techniques.
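The adapter-pattern approach mentioned in the abstract above can be sketched as follows. All class and method names (`SocialSource`, `ServiceAClient`, `ServiceBClient`, `fetch_posts`, and so on) are hypothetical stand-ins, not the framework's actual interfaces; the point is only that each adapter maps a service-specific call onto one common collection interface.

```python
from abc import ABC, abstractmethod


class SocialSource(ABC):
    """Common interface the collector depends on (hypothetical)."""

    @abstractmethod
    def fetch_posts(self, keyword):
        """Return a list of dicts with 'author' and 'text' keys."""


class ServiceAClient:
    """Stand-in for one service's SDK, with its own method and field names."""

    def search(self, q):
        return [{"user": "kim", "body": f"{q} is trending"}]


class ServiceBClient:
    """Stand-in for a second service that returns tuples instead of dicts."""

    def query_feed(self, term):
        return [("lee", f"thoughts on {term}")]


class ServiceAAdapter(SocialSource):
    def __init__(self, client):
        self._client = client

    def fetch_posts(self, keyword):
        return [{"author": p["user"], "text": p["body"]}
                for p in self._client.search(keyword)]


class ServiceBAdapter(SocialSource):
    def __init__(self, client):
        self._client = client

    def fetch_posts(self, keyword):
        return [{"author": author, "text": text}
                for author, text in self._client.query_feed(keyword)]


def collect(sources, keyword):
    """Collect posts from every source through the shared interface."""
    posts = []
    for source in sources:
        posts.extend(source.fetch_posts(keyword))
    return posts


posts = collect(
    [ServiceAAdapter(ServiceAClient()), ServiceBAdapter(ServiceBClient())],
    "big data",
)
print(posts)
```

Supporting a new service then means writing one more adapter, while `collect` and everything downstream stay unchanged.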

Issue summarization scheme based on real-time SNS trend analysis (실시간 SNS 트렌드 분석에 기반한 이슈 요약 기법)

  • Kim, Daeyong;Kim, Daehoon;Hwang, Eenjun
    • Proceedings of the Korea Information Processing Society Conference / 2013.11a / pp.1096-1097 / 2013
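  • With the rapid spread of social network services such as Twitter, a large number of SNS messages are being generated in real time. Reading every post on these services is practically impossible, and the real-time search keyword rankings offered by portal sites do not by themselves make the details easy to grasp. If SNS posts could be analyzed in real time to find the latest trends and to classify and summarize the related content, useful up-to-date information could be generated and provided to users. In this paper, we propose a scheme that classifies related tweets by topic, based on trend keywords obtained by analyzing tweets, and then summarizes the details of each topic. The proposed scheme extracts recently popular trend keywords and their associated keywords from tweets generated in real time. It then finds core keywords in the tweets where those keywords appear, classifies the tweets by topic on that basis, and defines each topic as an 'issue'. Finally, it analyzes the tweets belonging to each issue and generates, for each issue, a keyword list and a storyline summarized in short sentences. We implement a prototype system based on the proposed scheme and evaluate its performance, in terms of the usefulness of the issue detection scheme, through various experiments.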

Topic Model Analysis of Research Themes and Trends in the Journal of Economic and Environmental Geology (기계학습 기반 토픽모델링을 이용한 학술지 "자원환경지질"의 연구주제 분류 및 연구동향 분석)

  • Kim, Taeyong;Park, Hyemin;Heo, Junyong;Yang, Minjune
    • Economic and Environmental Geology / v.54 no.3 / pp.353-364 / 2021
  • Since the mid-twentieth century, geology in South Korea has gradually evolved into an interdisciplinary field. The journal Economic and Environmental Geology (EEG) has a history of over 52 years and has published interdisciplinary articles based on geology. In this study, we performed a literature review using topic modeling based on Latent Dirichlet Allocation (LDA), an unsupervised machine learning model, to identify geological topics, historical trends (classic and emerging topics), and associations by analyzing the titles, keywords, and abstracts of 2,571 publications in EEG from 1968 to 2020. The results identified 8 topics in EEG ('petrology and geochemistry', 'hydrology and hydrogeology', 'economic geology', 'volcanology', 'soil contaminant and remediation', 'general and structural geology', 'geophysics and geophysical exploration', and 'clay mineral'). Before 1994, the classic topics ('economic geology', 'volcanology', and 'general and structural geology') dominated research. After 1994, emerging topics ('hydrology and hydrogeology', 'soil contaminant and remediation', and 'clay mineral') arose, and their share has gradually increased. The association analysis showed that EEG tends to be more comprehensive, with 'economic geology' as its base. Our results show how geological research topics branch out and merge with other fields, and demonstrate a useful literature review tool for geological research in South Korea.

WV-BTM: A Technique on Improving Accuracy of Topic Model for Short Texts in SNS (WV-BTM: SNS 단문의 주제 분석을 위한 토픽 모델 정확도 개선 기법)

  • Song, Ae-Rin;Park, Young-Ho
    • Journal of Digital Contents Society / v.19 no.1 / pp.51-58 / 2018
  • As the number of SNS users and the volume of SNS data have increased explosively, research based on SNS big data has become active. In social mining, Latent Dirichlet Allocation (LDA), a typical topic model technique, is used to identify the similarity of texts in large volumes of unclassified SNS text and to extract trends from them. However, LDA has difficulty deducing high-level topics from short-sentence data because infrequent word occurrences make the data semantically sparse. The BTM study addressed this limitation of LDA by using combinations of two words. However, BTM is influenced more by the high-frequency word in each combination, so it cannot compute weights that reflect each word's relation to each topic. In this paper, we propose a technique that improves the accuracy of the existing BTM by reflecting the semantic relation between words.
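The "combination of two words" that BTM builds on can be sketched as biterm extraction: every unordered pair of words co-occurring in one short text. The function name `biterms` and the toy documents are assumptions for illustration. The sketch covers only the pair-counting step, not BTM's topic inference and not the word-vector weighting that WV-BTM adds.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """All unordered word pairs from one short text, with each pair sorted."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Two toy "short texts", already tokenized.
docs = [
    ["apple", "phone", "launch"],
    ["phone", "launch", "event"],
]

# Corpus-level biterm counts; BTM models topics over these pairs
# rather than over per-document word counts.
biterm_counts = Counter(b for doc in docs for b in biterms(doc))

for pair, count in sorted(biterm_counts.items()):
    print(pair, count)
```

Because pairs are pooled across the whole corpus, a pair such as `("launch", "phone")` accumulates evidence even though each individual text is too short to support a document-level topic estimate.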

A Language Model based Knowledge Network for Analyzing Disaster Safety related Social Interest (재난안전 사회관심 분석을 위한 언어모델 활용 정보 네트워크 구축)

  • Choi, Dong-Jin;Han, So-Hee;Kim, Kyung-Jun;Bae, Eun-Sol
    • Proceedings of the Korean Society of Disaster Information Conference / 2022.10a / pp.145-147 / 2022
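  • This paper points out the limitations of existing methods for building information networks or knowledge graphs used to discover issues in large-scale text data, and proposes a new method that builds an information network at the sentence level. First, we measured the distribution of the number of words and characters per sentence and set threshold values to remove noise such as onomatopoeia. Next, we vectorized every sentence with a BERT-based language model and measured the similarity between pairs of sentence vectors using cosine similarity. To minimize misclassified similarity results, we developed an algorithm that compares the semantic relatedness of noun-type words. A review of the results of the proposed similar-sentence comparison algorithm confirmed that paired sentences dealt with the same topic and content even though they were phrased differently. The proposed method is a new approach that can overcome the difficulty of interpreting word-level knowledge graphs. Applied to futures research areas such as issue and trend analysis, it could help gather data-driven evidence of social interest in specific topics and derive policy recommendations that reflect demand.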
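The step of comparing sentence vectors by cosine similarity, described in the abstract above, can be sketched with plain Python. The three-dimensional vectors, the sentence ids, and the `threshold` of 0.8 are made-up placeholders; in the paper the vectors come from a BERT-based language model, which this sketch does not call.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, threshold=0.8):
    """Link sentence ids whose vectors reach the (hypothetical) similarity threshold."""
    ids = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(vectors[a], vectors[b]) >= threshold
    ]


# Placeholder sentence embeddings keyed by sentence id.
vectors = {
    "s1": [1.0, 0.0, 1.0],
    "s2": [2.0, 0.0, 2.0],
    "s3": [0.0, 1.0, 0.0],
}

edges = similar_pairs(vectors)
print(edges)
```

Each returned pair would become an edge in a sentence-level network; the noun-based relatedness check from the abstract would then act as a second filter on these edges.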


Design and Implementation of the Graphical Relational Searching for Folksonomy Tags in the Participational Architecture of Web 2.0 (웹2.0의 참여형 아키텍쳐 환경에서 그래픽 기반 포크소노미 태그 연관 검색의 설계 및 구현)

  • Kim, Woon-Yong;Park, Seok-Gyu
    • Journal of Internet Computing and Services / v.8 no.5 / pp.1-10 / 2007
  • Web 2.0 services, which have emerged with the exponential growth of the Internet, can be characterized by qualitative changes in structure and a quantitative increase in users. Their structural base is an architecture of user participation. Web 2.0 services and concepts such as blogs, UCC, SNS (Social Networking Service), mash-ups, and the long tail play an important role in organizing the web, and folksonomy is widely used for grouping and searching user-generated data in Web 2.0. Folksonomy is a new form of categorization that uses tags created through user participation rather than a classic taxonomy. Tag-based searching is currently done with simple text or a tag cloud. However, searching that considers and expresses the relations among tags is still immature. This paper therefore provides relational searching based on tags using a relational graph of tags. It should improve the reliability of searching and make searching more convenient.
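One plausible way to build a relational graph of tags is to link tags that are applied to the same item and weight each link by how often that happens. This co-occurrence rule, and the names `build_tag_graph` and `related_tags`, are assumptions made for illustration; the abstract does not say how the paper derives its tag relations.

```python
from collections import defaultdict
from itertools import combinations


def build_tag_graph(tagged_items):
    """Undirected weighted graph: weight = number of items carrying both tags."""
    graph = defaultdict(lambda: defaultdict(int))
    for tags in tagged_items:
        for a, b in combinations(sorted(set(tags)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def related_tags(graph, tag):
    """Neighbouring tags, strongest relation first, ties broken alphabetically."""
    neighbours = graph.get(tag, {})
    return sorted(neighbours, key=lambda t: (-neighbours[t], t))


# Toy folksonomy: each inner list is the tag set of one item.
items = [
    ["web2.0", "blog", "tag"],
    ["blog", "tag"],
    ["web2.0", "mashup"],
]

graph = build_tag_graph(items)
print(related_tags(graph, "blog"))
```

A graphical search interface could draw the queried tag at the center and its `related_tags` around it, with edge thickness following the weights.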


Exploring the Research Topic Networks in the Technology Management Field Using Association Rule-based Co-word Analysis (연관규칙 기반 동시출현단어 분석을 활용한 기술경영 연구 주제 네트워크 분석)

  • Jeon, Ikjin;Lee, Hakyeon
    • Journal of Technology Innovation / v.24 no.4 / pp.101-126 / 2016
  • This paper identifies core research topics and their relationships by deriving research topic networks in the technology management field using co-word analysis. In contrast to the conventional approach, in which undirected networks are constructed from normalized co-occurrence frequency, this study analyzes directed networks of keywords by applying the confidence index of association rule mining to pairs of keywords. Author keywords from 2,456 articles published in nine international journals of technology management in 2011~2014 are extracted and categorized into three types: THEME, METHOD, and FIELD. One-mode networks for each type of keyword are constructed to identify core research keywords and their interrelationships within each type. We then derive two-mode networks composed of two different types of keywords, THEME-METHOD and THEME-FIELD, to explore which methods or fields are frequently employed or studied for each theme. The findings are expected to serve as a useful reference for researchers in technology management who want to grasp research trends and set future research directions.
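The confidence index of association rule mining, conf(A → B) = count(A and B) / count(A), is asymmetric, which is what makes the resulting keyword network directed. The sketch below computes it for keyword pairs; the function name `confidence_edges`, the `min_conf` cut-off, and the toy keyword sets are assumptions for illustration, not the study's data or parameters.

```python
from collections import Counter
from itertools import permutations


def confidence_edges(keyword_sets, min_conf=0.5):
    """Directed edges (a, b) weighted by conf(a -> b) = n(a and b) / n(a)."""
    single = Counter()
    pair = Counter()
    for keywords in keyword_sets:
        unique = set(keywords)
        single.update(unique)
        # Ordered pairs, so (a, b) and (b, a) are counted separately.
        pair.update(permutations(sorted(unique), 2))
    edges = {}
    for (a, b), n_ab in pair.items():
        conf = n_ab / single[a]
        if conf >= min_conf:
            edges[(a, b)] = conf
    return edges


# Toy author-keyword sets, one per article.
articles = [
    {"innovation", "patent"},
    {"innovation", "patent"},
    {"innovation", "R&D"},
    {"innovation"},
]

rule_edges = confidence_edges(articles, min_conf=0.6)
print(rule_edges)
```

Here "patent" points to "innovation" with confidence 1.0, while the reverse direction has confidence 0.5 and is dropped, so the specific keyword points to the general one. A symmetric co-occurrence frequency would not show that direction.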

Development of Historical Contents Based on Relational Structure of Minutes of State Council and Records of Ministries in the Period of Rhee Regime (이승만시기 국무회의록과 정부부처 기록의 연관구조 분석에 기반한 역사 컨텐츠 설계 방안)

  • Seol, Moon-Won;Kim, Ik-Han
    • Journal of the Korean BIBLIA Society for library and Information Science / v.17 no.2 / pp.115-136 / 2006
  • Minutes of the state council are the highest-level records and can show the overall decision-making process at the state level. The purpose of this study is to suggest a methodology for designing historical contents based on the relational structure of the minutes of the state council and the records of ministries in the period of the Rhee regime. The methodology has three steps. First, it suggests directions for a DB design that represents the basic information and agenda of the state councils throughout the period. Second, it proposes a subject classification scheme for major policy matters in the period, to which each agenda item will be assigned and related ministries' records will be linked. Third, it suggests the basic structure and procedures for developing historical contents on each subject matter based on the minutes and the related records of ministries.

A Study on the Influencing Factors of Continuous Usage Intention for a Scenario based FAQ Service regarding on Private Information Protection (개인정보보호에 관한 시나리오 기반 질의응답서비스 품질이 이용의도에 미치는 요인에 관한 연구)

  • Kang, Sang-Ug;Lee, Dae-Chul
    • Journal of Digital Convergence / v.12 no.2 / pp.223-236 / 2014
  • This paper studies the factors that influence continuous usage intention for a scenario-based cognitive FAQ service on private information protection. The results show that three major factors have a significant positive effect on continuous usage intention for the service. First, ease of search is an essential factor, and it can be improved through sophisticated categorization. Second, the scenario-based FAQ service is effective for understanding and resolving the questioner's situation. Third, related information is helpful for problem solving. The research shows that this new approach to the private information protection area can lead to a more acceptable and reasonable problem-solving tool.