• Title/Summary/Keyword: Web contents mining

72 search results

Text Extraction Algorithm using the HTML Logical Structure Analysis (HTML 논리적 구조분석을 통한 본문추출 알고리즘)

  • Jeon, Hyun-Gee;Koh, Chan
    • Journal of Digital Contents Society
    • /
    • v.16 no.3
    • /
    • pp.445-455
    • /
    • 2015
  • As Internet and computer technology have developed, the amount of online information has increased exponentially: a variety of web authoring tools, new web standards, and improved web accessibility allow a wide range of web content to be produced very quickly. However, web documents are typically divided into blocks on different topics, where the blocks are often unrelated to one another and many of them contain non-body content such as navigation menus, simple decorations, advertisements, and copyright notices. To solve this problem, meet user requirements, and support effective information use, this study extracts only the exact main-text area of a web document. We further propose, as a reconstruction method, a web search system that can manage documents systematically and in an optimized way.
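The block-level idea in the abstract above can be sketched with a simple text-to-link-ratio heuristic: blocks dominated by anchor text (navigation bars, copyright footers) are discarded, and the densest remaining block is kept as the body. This is a minimal stand-in for the paper's HTML logical-structure analysis, not its actual algorithm; the tag set and the 0.5 ratio threshold are illustrative assumptions.

```python
from html.parser import HTMLParser

class BlockScorer(HTMLParser):
    """Accumulate text per block-level element and track how much of it is link text."""
    BLOCK_TAGS = {"div", "td", "p", "article", "section"}

    def __init__(self):
        super().__init__()
        self.blocks = []   # each entry: [total_chars, link_chars, text]
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.blocks.append([0, 0, ""])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.blocks:
            return
        block = self.blocks[-1]
        block[0] += len(text)
        if self.in_link:
            block[1] += len(text)
        block[2] += text + " "

def extract_main_text(html, max_link_ratio=0.5):
    """Discard blocks dominated by anchor text (menus, footers) and return
    the longest remaining block as the page body."""
    scorer = BlockScorer()
    scorer.feed(html)
    candidates = [b for b in scorer.blocks
                  if b[0] > 0 and b[1] / b[0] < max_link_ratio]
    return max(candidates, key=lambda b: b[0])[2].strip() if candidates else ""
```

Nested blocks are attributed to the innermost element here; a full implementation would aggregate scores up the DOM tree.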

Research Trends Investigation Using Text Mining Techniques: Focusing on Social Network Services (텍스트마이닝을 활용한 연구동향 분석: 소셜네트워크서비스를 중심으로)

  • Yoon, Hyejin;Kim, Chang-Sik;Kwahk, Kee-Young
    • Journal of Digital Contents Society
    • /
    • v.19 no.3
    • /
    • pp.513-519
    • /
    • 2018
  • The objective of this study was to examine research trends on social network services. The abstracts of 308 articles published between 1994 and 2016 were extracted from the Web of Science database. Time series analysis and topic modeling, both text mining techniques, were applied. The topic modeling results showed that the research fell mainly into 20 topics: trust, support, satisfaction model, organization governance, mobile system, internet marketing, college student effect, opinion diffusion, customer, information privacy, health care, web collaboration, method, learning effectiveness, knowledge, individual theory, child support, algorithm, media participation, and context system. The time series regression results indicated that trust, support, and satisfaction model, along with the remaining topics, were hot topics. This study also provided suggestions for future research.
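The hot-topic step of the study above can be illustrated without reproducing the full topic model: count, per year, the abstracts that mention a topic keyword, then check the sign of a least-squares trend line. This is a deliberately simplified sketch (keyword matching stands in for topic modeling); the function names and sample data are invented for illustration.

```python
from collections import Counter

def topic_counts_by_year(docs, keyword):
    """docs: iterable of (year, abstract) pairs.
    Count the abstracts per year that mention the keyword."""
    counts = Counter()
    for year, text in docs:
        if keyword.lower() in text.lower():
            counts[year] += 1
    return counts

def trend_slope(yearly_counts):
    """Least-squares slope of counts over (at least two) sorted years;
    a positive slope marks a 'hot' topic, a negative one a cold topic."""
    years = sorted(yearly_counts)
    xs = list(range(len(years)))
    ys = [yearly_counts[y] for y in years]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```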

Ontology Construction of Technological Knowledge for R&D Trend Analysis (연구 개발 트렌드 분석을 위한 기술 지식 온톨로지 구축)

  • Hwang, Mi-Nyeong;Lee, Seungwoo;Cho, Minhee;Kim, Soon Young;Choi, Sung-Pil;Jung, Hanmin
    • The Journal of the Korea Contents Association
    • /
    • v.12 no.12
    • /
    • pp.35-45
    • /
    • 2012
  • Researchers and scientists spend a huge amount of time analyzing previous studies and their results. In order to take an advantageous position in a timely manner, they analyze various resources such as papers, patents, and Web documents on recent research issues so as to preoccupy newly emerging technologies. However, it is difficult to select investment-worthy research fields out of a huge corpus using traditional information search based on keywords and bibliographic information. In this paper, we propose a method for the efficient creation, storage, and utilization of semantically relevant information among technologies, products, and research agents extracted from 'big data' by text mining. To implement the proposed method, we designed an ontology that represents technological knowledge for the semantic web environment based on the relationships extracted by text mining techniques. The ontology was utilized in InSciTe Adaptive, an R&D trend analysis and forecast service which supports the search for relevant technological knowledge.
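Relations mined from text, as described above, are commonly stored as subject-predicate-object triples for semantic-web use. The following sketch serializes such triples as N-Triples; the namespace and relation names are illustrative placeholders, not the paper's actual ontology vocabulary.

```python
def to_ntriples(triples, base="http://example.org/"):
    """Serialize (subject, predicate, object) relations -- e.g. mined from
    papers and patents by text mining -- as N-Triples lines."""
    def uri(term):
        # Map a plain term to a placeholder URI under the illustrative namespace.
        return "<%s%s>" % (base, term.replace(" ", "_"))
    return "\n".join("%s %s %s ." % (uri(s), uri(p), uri(o)) for s, p, o in triples)
```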

A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Sun, Bok-Keun
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.7
    • /
    • pp.1-7
    • /
    • 2015
  • Although data on the Internet can be used in various fields, such as a source of data for information retrieval (IR), data mining, and knowledge information services, it also contains a great deal of unnecessary information. Removing the unnecessary data is a problem that must be solved before studying knowledge-based information services built on web page data; in this paper, we solve this problem through the implementation of XTractor (XPath Extractor). Since XPath is used to navigate the elements and attribute data in an XML document, the XPath analysis is carried out by XTractor. XTractor extracts the main text by parsing the HTML, grouping XPaths, and detecting the XPath that contains the main data. As a result, the recall and precision rates were 97.9% and 93.9%, respectively, except for a few cases in a large amount of experimental data, confirming that the main text of news pages can be extracted properly.
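The XPath-grouping idea above can be sketched with the standard library: record the tag path of every element, group the text under each path, and keep the path that accumulates the most characters. This is a rough stand-in for XTractor, which works on parsed HTML with full XPath; here well-formed XHTML and simple tag paths are assumed, and element tail text is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def main_text_by_path(xhtml):
    """Group element text by its tag path and return the text under the path
    with the largest accumulated character count."""
    root = ET.fromstring(xhtml)
    by_path = defaultdict(list)

    def walk(elem, path):
        path = "%s/%s" % (path, elem.tag)
        if elem.text and elem.text.strip():
            by_path[path].append(elem.text.strip())
        for child in elem:
            walk(child, path)

    walk(root, "")
    # The path holding the most text is taken as the main-content path.
    best = max(by_path, key=lambda p: sum(len(t) for t in by_path[p]))
    return " ".join(by_path[best])
```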

Topic-Specific Mobile Web Contents Adaptation (주제기반 모바일 웹 콘텐츠 적응화)

  • Lee, Eun-Shil;Kang, Jin-Beom;Choi, Joong-Min
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.6
    • /
    • pp.539-548
    • /
    • 2007
  • Mobile content adaptation is a technology for effectively representing content originally built for desktop PCs on wireless mobile devices. Previous approaches to Web content adaptation are mostly device-dependent, and the content transformation to suit a smaller device is done manually. Furthermore, the same contents are provided to different users regardless of their individual preferences. As a result, the user has difficulty selecting relevant information from a heavy volume of contents, since the context information related to the content is not provided. To resolve these problems, this paper proposes an enhanced method of Web content adaptation for mobile devices. In our system, the process of Web content adaptation consists of four stages: block filtering, block title extraction, block content summarization, and personalization through learning. Learning is initiated when the user selects the full-content menu from the content summary page. As a result of learning, personalization is realized by showing the information for the relevant block at the top of the content list. A series of experiments was performed to evaluate the content adaptation for a number of Web sites, including online newspapers. The results of the evaluation are satisfactory, both in block filtering accuracy and in user satisfaction with personalization.
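The personalization-through-learning stage described above can be sketched as simple click counting: each time a user opens a block's full content, that block's score rises, and the summary page reorders blocks by score. The class and method names below are assumptions for illustration, not the paper's implementation.

```python
from collections import Counter

class BlockPersonalizer:
    """Reorder content blocks so that blocks the user opens most often via the
    full-content menu float to the top of the summary page."""

    def __init__(self):
        self.clicks = Counter()

    def record_selection(self, block_title):
        # Called when the user opens a block's full content.
        self.clicks[block_title] += 1

    def ordered_blocks(self, block_titles):
        # Most-selected blocks first; unseen blocks keep their relative order.
        return sorted(block_titles, key=lambda t: -self.clicks[t])
```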

Visualizing the Results of Opinion Mining from Social Media Contents: Case Study of a Noodle Company (소셜미디어 콘텐츠의 오피니언 마이닝결과 시각화: N라면 사례 분석 연구)

  • Kim, Yoosin;Kwon, Do Young;Jeong, Seung Ryul
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.4
    • /
    • pp.89-105
    • /
    • 2014
  • After the emergence of the Internet, social media with highly interactive Web 2.0 applications has provided very user-friendly means for consumers and companies to communicate with each other. Users routinely publish contents involving their opinions and interests in social media such as blogs, forums, chat rooms, and discussion boards, and the contents are released in real time on the Internet. For that reason, many researchers and marketers regard social media contents as a source of information for business analytics to develop business insights, and many studies have reported results on mining business intelligence from social media content. In particular, opinion mining and sentiment analysis, as techniques to extract, classify, understand, and assess the opinions implicit in text contents, are frequently applied to social media content analysis because they emphasize determining sentiment polarity and extracting authors' opinions. A number of frameworks, methods, techniques, and tools have been presented by these researchers. However, we have found some weaknesses in their methods, which are often technically complicated and not sufficiently user-friendly for supporting business decisions and planning. In this study, we attempted to formulate a more comprehensive and practical approach to conducting opinion mining with visual deliverables. First, we describe the entire cycle of practical opinion mining using social media content, from the initial data-gathering stage to the final presentation session. Our proposed approach consists of four phases: collecting, qualifying, analyzing, and visualizing. In the first phase, analysts have to choose the target social media; each target medium requires a different means of access, such as an open API, search tools, a DB-to-DB interface, or purchasing contents. The second phase is pre-processing to generate useful materials for meaningful analysis.
If we do not remove garbage data, the results of social media analysis will not provide meaningful and useful business insights. To clean social media data, natural language processing techniques should be applied. The next step is the opinion mining phase, where the cleansed social media content set is analyzed. The qualified data set includes not only user-generated contents but also content identification information such as creation date, author name, user id, content id, hit counts, reviews or replies, favorites, etc. Depending on the purpose of the analysis, researchers or data analysts can select a suitable mining tool: topic extraction and buzz analysis are usually related to market trend analysis, while sentiment analysis is utilized to conduct reputation analysis. There are also various applications, such as stock prediction, product recommendation, and sales forecasting. The last phase is visualization and presentation of the analysis results. The major focus of this phase is to explain the results of the analysis and help users comprehend their meaning; therefore, to the extent possible, deliverables from this phase should be simple, clear, and easy to understand rather than complex and flashy. To illustrate our approach, we conducted a case study on a leading Korean instant noodle company. We targeted the market leader, NS Food, which holds a 66.5% market share and has kept the No. 1 position in the Korean "Ramen" business for several decades. We collected a total of 11,869 pieces of content, including blogs, forum contents, and news articles. After collecting the social media content data, we generated instant-noodle-business-specific language resources for data manipulation and analysis using natural language processing. In addition, we classified contents into more detailed categories such as marketing features, environment, and reputation.
In this phase, we used freeware programs such as the TM, KoNLP, ggplot2, and plyr packages of the R project. As a result, we presented several useful visualization outputs, such as domain-specific lexicons, volume and sentiment graphs, topic word clouds, heat maps, valence tree maps, and other visualized images, providing vivid, full-color examples built with the open-source library packages of the R project. Business actors can detect at a glance which areas are weak, strong, positive, negative, quiet, or loud. The heat map explains the movement of sentiment or volume in a category-by-time matrix, showing density of color across time periods. The valence tree map, one of the most comprehensive and holistic visualization models, should be very helpful for analysts and decision makers to quickly understand the "big picture" business situation through a hierarchical structure, since a tree map can present buzz volume and sentiment in a single visualized result for a given period. This case study offers real-world business insights from market sensing, demonstrating to practically minded business users how they can use these types of results for timely decision making in response to ongoing changes in the market. We believe our approach can provide a practical and reliable guide to opinion mining with visualized results that are immediately useful, not just in the food industry but in other industries as well.
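The paper's analysis phase runs in R (TM, KoNLP, ggplot2), but the counting step that feeds its heat maps and valence tree maps can be sketched in a few lines: tally positive and negative lexicon hits per (period, category) cell. The toy lexicon and sample posts below are invented; a real system uses domain-specific language resources like those the authors built for the instant noodle business.

```python
from collections import Counter

# Toy sentiment lexicon for illustration only; real systems use
# domain-specific resources built with NLP.
POSITIVE = {"delicious", "tasty", "love", "good"}
NEGATIVE = {"salty", "bad", "expensive", "bland"}

def sentiment_counts(posts):
    """posts: iterable of (month, category, text). Tally lexicon hits per
    (month, category, polarity) -- the matrix behind heat maps and tree maps."""
    counts = Counter()
    for month, category, text in posts:
        words = text.lower().split()
        counts[(month, category, "pos")] += sum(w in POSITIVE for w in words)
        counts[(month, category, "neg")] += sum(w in NEGATIVE for w in words)
    return counts
```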

A Study on Extracting News Contents from News Web Pages (뉴스 웹 페이지에서 기사 본문 추출에 관한 연구)

  • Lee, Yong-Gu
    • Journal of the Korean Society for information Management
    • /
    • v.26 no.1
    • /
    • pp.305-320
    • /
    • 2009
  • The news pages provided through the web contain unnecessary information, which causes low performance and inefficiency in news processing systems. In this study, news content extraction methods based on sentence identification and on block-level tags in news web pages are suggested. To obtain optimal performance, combinations of these methods were applied. The results showed good performance for an extraction method that applied sentence identification and eliminated hyperlink text from web pages; moreover, this method produced even better results when combined with the extraction method based on block-level tags. Extraction methods using sentence identification were effective in raising the extraction recall ratio.
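The combination the abstract describes can be sketched with regular expressions: drop hyperlink text, split the page at block-level tags, and keep only chunks that contain actual sentences (ending punctuation). This is a simplified illustration of the sentence-identification idea, not the paper's evaluated method; the tag list and sentence test are assumptions.

```python
import re

def extract_news_body(html):
    """Remove hyperlink text, split on block-level tags, and keep chunks
    that contain sentence-ending punctuation."""
    # Eliminate anchor elements together with their text (related-story links etc.).
    html = re.sub(r"<a\b[^>]*>.*?</a>", " ", html, flags=re.S | re.I)
    # Split the page into chunks at block-level tag boundaries.
    blocks = re.split(r"</?(?:div|p|td|table)[^>]*>", html, flags=re.I)
    body = []
    for block in blocks:
        text = re.sub(r"<[^>]+>", " ", block)   # strip remaining inline tags
        text = " ".join(text.split())
        # Sentence identification: keep chunks that end a sentence somewhere.
        if re.search(r"[.!?](\s|$)", text):
            body.append(text)
    return " ".join(body)
```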

A Research on User′s Query Processing in Search Engine for Ocean using the Association Rules (연관 규칙 탐사 기법을 이용한 해양 전문 검색 엔진에서의 질의어 처리에 관한 연구)

  • 하창승;윤병수;류길수
    • Proceedings of the Korea Intelligent Information System Society Conference
    • /
    • 2002.11a
    • /
    • pp.266-272
    • /
    • 2002
  • Recently, various information suppliers have come to provide information via the WWW, so the need for search engines has grown. However, the efficiency of most search engines is comparatively low because they use simple pattern matching between the user's query and web documents, and this is even worse for queries with specific meanings in specialized expert fields. A specialized search engine returns specialized information depending on each user's search goal, and it is a trend in many countries to develop such engines. In America, for example, there are sites that search only recently updated headline news, federal law, government information, and so on. However, most such engines do not satisfy users' needs. This paper proposes a specialized search engine for ocean information that applies association rules from web data mining to users' ocean-related queries. By raising recall for users' queries, the specialized search engine provides more ocean-related information.
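Association rules for query processing, as proposed above, can be sketched as mining co-occurring terms from past query sessions: for a rule "term → B", confidence is the co-occurrence count divided by the count of the term, and high-confidence terms expand the query to raise recall. The support/confidence mechanics below follow the standard association-rule definitions; the sample sessions and threshold are invented for illustration.

```python
from itertools import combinations
from collections import Counter

def expansion_terms(query_logs, term, min_conf=0.5):
    """query_logs: list of query sessions (lists of terms).
    Return terms B where conf(term -> B) = count(term, B) / count(term)
    meets the threshold, for use in query expansion."""
    term_count = Counter()
    pair_count = Counter()
    for session in query_logs:
        terms = set(session)
        for t in terms:
            term_count[t] += 1
        for a, b in combinations(sorted(terms), 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1
    expansions = []
    for (a, b), n in pair_count.items():
        if a == term and n / term_count[term] >= min_conf:
            expansions.append(b)
    return sorted(expansions)
```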


Design and Implementation of an Employment Information Service based on the Social Web Mining for Human-FTA (휴먼 FTA를 위한 소셜 웹 마이닝 기반 고용정보 서비스의 설계 및 구현)

  • Song, Jeo;Park, Yong-goo;Yoo, Jaesoo
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2015.05a
    • /
    • pp.419-420
    • /
    • 2015
  • Based on the government's three-year plan for economic innovation, Korea put a "Human FTA" into effect in 2015 to attract foreign workers in response to the decline in the domestic working-age population. Beyond a simple quantitative increase in foreign production workers, it also covers attracting master's- and doctoral-level professionals and investors in order to induce the return of domestic companies that have moved their production bases overseas. In this paper, to support the activation and smooth operation of the Human FTA, a new institution in the labor market, we propose a service platform that matches domestic companies with foreign workers by utilizing social web data from widely used services such as Twitter, Facebook, and Google.


Web based Text-mining and Biological Network Analysis System (웹기반 문헌분석 및 생물학적 네트워크 분석시스템 개발)

  • Seo, Dongmin;Cho, Sung-Hoon;Ahn, Kwang-Sung;Yu, Seok Jong;Park, Dong-Il
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2017.05a
    • /
    • pp.27-28
    • /
    • 2017
  • Network analysis, which examines various topological relations, is a technique for discovering hidden characteristics and facts in complex data, and it has recently emerged as a core data-analysis technology in the big data field. In this study, we developed a system that generates the biological networks essential to disease research, together with a user-friendly network analysis system. The system automatically collects the abstracts of papers related to a specific disease from PubMed and, through text mining, extracts disease-related compounds, genes, and interaction information to build a biological network. It also provides functions that allow researchers to easily search and perform multidimensional analysis on the generated network. Finally, to demonstrate the system's effectiveness, we present an application case for Crohn's disease.
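The network-construction step described above can be sketched as entity co-occurrence: entities mentioned in the same abstract are linked, with edge weight equal to the number of co-mentions. The entity lexicon and sample abstracts below are toy placeholders; a real system would run named-entity recognition over PubMed abstracts rather than substring matching.

```python
from itertools import combinations
from collections import defaultdict

# Toy gene/compound lexicon for illustration; a real pipeline uses NER.
ENTITIES = {"TNF", "IL-6", "NOD2", "infliximab"}

def build_network(abstracts):
    """Link entities that co-occur in the same abstract;
    edge weight = number of co-mentioning abstracts."""
    edges = defaultdict(int)
    for text in abstracts:
        found = sorted({e for e in ENTITIES if e in text})
        for a, b in combinations(found, 2):
            edges[(a, b)] += 1
    return dict(edges)
```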
