Authorship Attribution of Web Texts with Korean Language Applying Deep Learning Method (딥러닝을 활용한 웹 텍스트 저자의 남녀 구분 및 연령 판별 : SNS 사용자를 중심으로)

  • Park, Chan Yub;Jang, In Ho;Lee, Zoon Ky
    • Journal of Information Technology Services / v.15 no.3 / pp.147-155 / 2016
  • With the rapid development of technology, web text is growing explosively and attracting attention in many fields as a substitute for surveys. Facebook reaches up to 113 million users per month, and Twitter is used by various institutions and companies as a behavioral analysis tool. However, most research has focused on the meaning of the text itself, and there is a lack of studies on the author who created the text. This research therefore classifies text by the writer's sex and age, using posts from 20,187 Facebook users whose sex and age are disclosed. It applies Convolutional Neural Networks, a type of deep learning algorithm that has recently come into the spotlight as an image classifier, to web text analysis. The results, with 92% accuracy, confirm the potential of the approach as a text classifier. In addition, the research minimizes Korean morpheme analysis and applies authorship attribution to Korean web text. Based on these features, the study can serve multiple needs, such as a web text management information resource for practitioners and a non-grammatical analysis system for researchers. Thus, this study proposes a new method for web text analysis.

HTML Text Extraction Using Frequency Analysis (빈도 분석을 이용한 HTML 텍스트 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.9 / pp.1135-1143 / 2021
  • Recently, text has frequently been collected with web crawlers for big data analysis. However, to collect only the necessary text from a web page that is a complex mix of numerous tags and texts, the crawler must be told which HTML tags and style attributes contain the required text, which is cumbersome. In this paper, we propose a method of extracting text using the frequency with which text appears in web pages, without specifying HTML tags and style attributes. In the proposed method, text is extracted from the DOM tree of every collected web page, the frequency of appearance of each text is analyzed, and the main text is extracted by excluding text with a high frequency of appearance. This study verified the superiority of the proposed method.
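
The frequency idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `TextNodeCollector`, `text_nodes`, and `extract_main_text`, and the `max_ratio` threshold, are assumptions introduced here, and Python's `html.parser` stands in for whatever DOM library the paper used.

```python
from collections import Counter
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect stripped text nodes, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.texts.append(text)


def text_nodes(html):
    """Return the non-empty text nodes of one HTML document, in order."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def extract_main_text(pages, max_ratio=0.5):
    """For each page, keep text that appears on at most max_ratio of all pages.

    max_ratio is a hypothetical cutoff chosen for illustration.
    """
    per_page = [text_nodes(html) for html in pages]
    # Document frequency: count each distinct text once per page.
    doc_freq = Counter(t for texts in per_page for t in set(texts))
    limit = max_ratio * len(pages)
    return [[t for t in texts if doc_freq[t] <= limit] for texts in per_page]
```

Text that repeats across most pages (menus, footers) exceeds the cutoff and is dropped, while text unique to a page survives. The cutoff would need tuning on real data.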

Design and Implementation of Server-Based Web Reader kWebAnywhere (서버 기반 웹 리더 kWebAnywhere의 설계 및 구현)

  • Yun, Young-Sun
    • Phonetics and Speech Sciences / v.5 no.4 / pp.217-225 / 2013
  • This paper describes the design and implementation of the kWebAnywhere system, based on WebAnywhere, which helps blind people and people with severely diminished eyesight access Internet information through Web interfaces. WebAnywhere is a server-based web reader that reads web content aloud using TTS (text-to-speech) technology over the Internet, without installing any software on the client's system. The system can be used in general web browsers with a built-in audio function, both by blind users who cannot afford a screen reader and by web developers designing for web accessibility. However, WebAnywhere supports only a single language and cannot be applied directly to Korean web content. In this paper, we therefore modified WebAnywhere to serve multilingual content written in both English and Korean. The modified system is called kWebAnywhere to distinguish it from the original. kWebAnywhere supports the Korean TTS system VoiceText™ and includes a user interface for controlling the TTS parameters. Because VoiceText™ does not support the Festival API used in WebAnywhere, we developed the Festival Wrapper, which translates VoiceText™'s private APIs into Festival APIs so that it can communicate with the WebAnywhere engine. We expect the developed system to help blind people and people with severely diminished eyesight access Internet content easily.

Web Image Clustering with Text Features and Measuring its Efficiency

  • Cho, Soo-Sun
    • Journal of Korea Multimedia Society / v.10 no.6 / pp.699-706 / 2007
  • This article presents an approach to improving the clustering of Web images by using high-level semantic features from text information relevant to the images as well as low-level visual features of the image itself. These high-level text features can be obtained from image URLs and file names, page titles, hyperlinks, and surrounding text. The clustering algorithm is the self-organizing map (SOM) proposed by Kohonen. To evaluate the clustering efficiency of SOMs, we propose a simple but effective measure indicating the accumulativeness of same-class images and the perplexities of class distributions. Our approach advances existing measures by defining and using two new ones, accumulativeness on the most superior clustering node and concentricity, to evaluate the clustering efficiency of SOMs. The experimental results show that the high-level text features are more useful in SOM-based Web image clustering.
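
The abstract does not define its measures, so the sketch below shows only the standard information-theoretic perplexity of the class labels that land on one SOM node, as a point of reference. The function name `class_perplexity` is an assumption, and this is not claimed to be the paper's accumulativeness or concentricity measure.

```python
import math
from collections import Counter


def class_perplexity(labels):
    """Perplexity (2 ** entropy) of the class labels assigned to one node.

    1.0 means the node holds a single class; k means k equally likely classes.
    """
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2 ** entropy
```

A node that collects images of only one class scores 1.0, and the value rises as classes mix, so lower values indicate purer clusters.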

The Informative Support and Emotional Support Classification Model for Medical Web Forums using Text Analysis (의료 웹포럼에서의 텍스트 분석을 통한 정보적 지지 및 감성적 지지 유형의 글 분류 모델)

  • Woo, Jiyoung;Lee, Min-Jung;Ku, Yungchang
    • Journal of Information Technology Services / v.11 no.sup / pp.139-152 / 2012
  • In medical web forums, patients and patients' families share medical experience and information. Some people search for medical information written in non-expert language, and some offer words of comfort to those who are suffering from diseases. Medical web forums thus provide both informative support and emotional support. We propose a model that automatically classifies articles in a medical web forum into informative support and emotional support. We extract text features from forum articles using text mining techniques from a linguistic perspective and then perform supervised learning to classify the texts into the informative support and emotional support types. We adopt the Support Vector Machine (SVM), Naive Bayes, and decision tree classifiers for automatic classification. We apply the proposed model to the HealthBoards forum, one of the largest and most dynamic medical web forums.
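
Of the three classifiers named in this abstract, Naive Bayes is the simplest to sketch. The example below is a minimal multinomial Naive Bayes with add-one smoothing over plain word counts. The bag-of-words features, the `tokenize` helper, and the `NaiveBayes` class are assumptions for illustration, since the abstract does not list the linguistic features the authors actually used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training posts, invented for this sketch (not HealthBoards data).
TRAIN_TEXTS = [
    "the usual dose is two tablets and side effects include nausea",
    "ask your doctor about the treatment and medication options",
    "so sorry you are going through this, sending hugs and prayers",
    "stay strong, we are all here for you and hoping you feel better",
]
TRAIN_LABELS = ["informative", "informative", "emotional", "emotional"]

model = NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
```

With this toy model, `model.predict("what dose of medication did your doctor suggest")` selects the informative class, because only that class has seen the words it recognizes. A real system would need far more training data and richer features.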

Design and Implementation of Web-based Text Summarization System for Mobile Device (이동 단말을 위한 웹 기반 텍스트 요약 시스템의 설계 및 구현)

  • Cha, Ji-Eun;Chun, Seung-Man;Park, Jong-Tae
    • The KIPS Transactions:PartC / v.16C no.6 / pp.725-730 / 2009
  • Recently, interest in web access from mobile hosts has been increasing due to the explosive growth of Internet-capable mobile terminals such as smartphones. However, the small displays of mobile hosts make it difficult to browse the full content of a web page at once. To overcome this limitation, we have designed and implemented a Web-based text summarization system. The proposed system summarizes the text of Web pages that contain abundant text. This can reduce the amount of data transmitted and minimize unnecessary output while browsing on a mobile host. Through implementation, we have confirmed the functions of the proposed system.
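
The abstract does not say which summarization technique the system uses. Purely as an illustration of how page text can be shortened before it is sent to a small screen, the sketch below applies a simple word-frequency extractive summarizer, a swapped-in technique. The function name `summarize` and the `max_sentences` parameter are assumptions.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Keep the max_sentences sentences with the highest average word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore the original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:max_sentences]))
```

Sentences built from words that recur across the page score higher, so off-topic sentences tend to drop out first. The selected sentences keep their original order so the summary still reads naturally.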

Empirical Analysis on the Effect of Design Pattern of Web Page, Perceived Risk and Media Richness to Customer Satisfaction (콘텐츠 제작방식, 지각된 위험, 미디어 풍부성이 고객만족에 미치는 영향 분석)

  • Park, Bong-Won;Lee, Jung-Mann;Lee, Jong-Won
    • The Journal of the Korea Contents Association / v.11 no.6 / pp.385-396 / 2011
  • Internet web pages can be classified into three major types: text only, images with text, and videos with text. The purpose of this paper is to analyze how customers perceive and respond to the design patterns of Internet web pages in terms of perceived risk and media richness. Additionally, we examine the extent to which these factors affect customer satisfaction. The analysis of perceived risk revealed that customers feel less personal risk, including performance, psychological, and time/convenience risk, when using text-image and text-video web pages than when using text-only web pages. In terms of media richness, customers rate image-text and video-text web pages higher in symbolism and social presence than text-only web pages. Finally, we showed that personal risk and text-based web pages negatively affect customer satisfaction, whereas symbolism and social presence positively affect it. This study thus offers a clue as to why video-based Web content did not grow as many people expected.

HTML Text Extraction Using Tag Path and Text Appearance Frequency (태그 경로 및 텍스트 출현 빈도를 이용한 HTML 본문 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.12 / pp.1709-1715 / 2021
  • The method of specifying to the web crawler the tags and style attributes that contain the main content, in order to accurately extract the necessary text from a web page, has the problem that the logic for extracting the main content must be modified whenever the web page configuration changes. To solve this problem, a previous study proposed extracting text by analyzing the frequency of appearance of the text, but that method was limited in that its performance varied widely depending on the collection channel of the web page. In this paper, we therefore propose a method that extracts text with high accuracy from various collection channels by analyzing not only the frequency of appearance of text but also the parent tag paths of text nodes extracted from the DOM tree of web pages.
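
One plausible way to combine the two signals named in this abstract is sketched below. The scoring rule (pick the tag path that holds the most low-frequency text) is an assumption made for illustration, not the authors' algorithm, and the names `PathTextCollector`, `path_text_nodes`, `extract_by_path`, and `max_ratio` are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Record (parent tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and self.stack[-1] in ("script", "style"):
            return
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def path_text_nodes(html):
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_by_path(pages, max_ratio=0.5):
    """Keep low-frequency text found under the tag path that carries most of it."""
    per_page = [path_text_nodes(html) for html in pages]
    doc_freq = Counter(node for nodes in per_page for node in set(nodes))
    limit = max_ratio * len(pages)
    # Score each tag path by the characters of low-frequency text it holds.
    path_score = defaultdict(int)
    for nodes in per_page:
        for path, text in nodes:
            if doc_freq[(path, text)] <= limit:
                path_score[path] += len(text)
    if not path_score:
        return [[] for _ in pages]
    best_path = max(path_score, key=path_score.get)
    return [
        [t for p, t in nodes if p == best_path and doc_freq[(p, t)] <= limit]
        for nodes in per_page
    ]
```

Compared with frequency alone, the tag path also filters out text that is unique to each page but sits outside the main content area, such as dates or view counters.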

WCTT: Web Crawling System based on HTML Document Formalization (WCTT: HTML 문서 정형화 기반 웹 크롤링 시스템)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.4 / pp.495-502 / 2022
  • Web crawlers, which are mainly used to collect text on the web today, are difficult to maintain and extend because researchers must implement different collection logic for each collection channel after analyzing the tags and styles of its HTML documents. To solve this problem, a web crawler should be able to collect text by formalizing HTML documents into the same structure. In this paper, we designed and implemented WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that collects text with a single collection logic by formalizing HTML documents based on tag path and text appearance frequency. Because WCTT collects text with the same logic for all collection channels, it is easy to maintain and to extend with new collection channels. In addition, it provides a preprocessing function that removes stopwords and extracts only nouns, for uses such as keyword network analysis.

Implementation of a Web-Based Electronic Text for High School's Probability and Statistics Education

  • Choi, Sook-Hee
    • Communications for Statistical Applications and Methods / v.11 no.2 / pp.329-343 / 2004
  • With the advancement of computers and networks, the World Wide Web (WWW) has become a common medium of information communication in many fields. In education, the WWW is increasingly used as an alternative medium to classroom teaching and printed materials. In this article, we present a web-based electronic text on 'probability and statistics', one of the six fields of mathematics in the 7th curriculum. The text emphasizes comprehension of the concepts of probability and statistics as an applied science.