• Title/Summary/Keyword: Web page classification

An Automatic Web Page Classification System Using Meta-Tag (메타 태그를 이용한 자동 웹페이지 분류 시스템)

  • Kim, Sang-Il;Kim, Hwa-Sung
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.38B no.4
    • /
    • pp.291-297
    • /
    • 2013
  • Recently, the number of web pages containing diverse information has increased drastically with the explosive growth of WWW usage. This has created a need for web page classification, which makes pages easier to access and enables searching through grouping. Web page classification means organizing the pages scattered across the web according to document similarity or the keywords the documents contain, and it can be applied to areas such as web search, group search, and e-mail filtering. However, manual classification cannot handle the tremendous number of pages on the web, while automatic classification suffers from an accuracy problem: it cannot reliably distinguish pages that express similar content in different forms. In this paper, we propose an automatic web page classification system that uses the meta-tags obtainable from web pages in order to solve this inaccurate retrieval problem.
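
The abstract does not specify how meta-tag contents are mapped to categories. Below is a minimal sketch, assuming a simple keyword-overlap rule over the extracted meta tags; the category names and keyword lists are hypothetical stand-ins for the paper's actual taxonomy.

```python
# Sketch: classify a page from its <meta name="keywords"/"description"> content.
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collects the content of keywords/description meta tags."""
    def __init__(self):
        super().__init__()
        self.meta_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() in ("keywords", "description"):
                self.meta_text.append(attrs.get("content", ""))

# Hypothetical category keyword lists, for illustration only.
CATEGORIES = {
    "news":     {"news", "breaking", "headline"},
    "shopping": {"shop", "price", "cart", "sale"},
    "sports":   {"score", "league", "match", "team"},
}

def classify_by_meta(html: str) -> str:
    parser = MetaTagExtractor()
    parser.feed(html)
    tokens = set(" ".join(parser.meta_text).lower().replace(",", " ").split())
    # Pick the category whose keyword set overlaps the meta text the most.
    best = max(CATEGORIES, key=lambda c: len(CATEGORIES[c] & tokens))
    return best if CATEGORIES[best] & tokens else "unknown"

page = '<html><head><meta name="keywords" content="news, breaking, politics"></head></html>'
print(classify_by_meta(page))  # -> "news"
```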

Web Page Classification System based upon Ontology (온톨로지 기반의 웹 페이지 분류 시스템)

  • Choi Jaehyuk;Seo Haesung;Noh Sanguk;Choi Kyunghee;Jung Gihyun
    • The KIPS Transactions:PartB
    • /
    • v.11B no.6
    • /
    • pp.723-734
    • /
    • 2004
  • In this paper, we present an automated Web page classification system based upon ontology. As a first step, to identify the representative terms for a given set of classes, we compute the product of term frequency and document frequency. Secondly, each term is prioritized by its information gain, i.e., how well it discriminates among the classes. We then compile the selected terms, paired with their web page classes, into rules using machine learning algorithms. The compiled rules classify any Web page into the categories defined on a domain ontology. In the experiments, 78 out of 240 terms were identified as representative features for the given set of Web pages, and the resulting classification accuracy was 83.52% on average.
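
As an illustration of the two term-selection steps named above, the sketch below computes the term frequency × document frequency product and then ranks the surviving terms by information gain; the toy corpus, class labels, and cutoff are assumptions, not the paper's data.

```python
# Sketch: TF x DF scoring followed by information-gain ranking of terms.
import math
from collections import Counter

docs = [
    ("buy cheap laptop deal",     "shopping"),
    ("laptop review benchmark",   "review"),
    ("cheap flight deal booking", "shopping"),
    ("movie review rating",       "review"),
]

def tf_df_scores(docs):
    tf, df = Counter(), Counter()
    for text, _ in docs:
        words = text.split()
        tf.update(words)          # total occurrences across the corpus
        df.update(set(words))     # number of documents containing the term
    return {w: tf[w] * df[w] for w in tf}

def information_gain(term, docs):
    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    labels = [lab for _, lab in docs]
    with_t = [lab for text, lab in docs if term in text.split()]
    without = [lab for text, lab in docs if term not in text.split()]
    cond = sum(len(p) / len(docs) * entropy(p) for p in (with_t, without) if p)
    return entropy(labels) - cond

scores = tf_df_scores(docs)
top = sorted(scores, key=scores.get, reverse=True)[:5]   # assumed cutoff of 5
ranked = sorted(top, key=lambda t: information_gain(t, docs), reverse=True)
print(ranked)  # terms ordered by how well they separate the classes
```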

Research on Training and Implementation of Deep Learning Models for Web Page Analysis (웹페이지 분석을 위한 딥러닝 모델 학습과 구현에 관한 연구)

  • Jung Hwan Kim;Jae Won Cho;Jin San Kim;Han Jin Lee
    • The Journal of the Convergence on Culture Technology
    • /
    • v.10 no.2
    • /
    • pp.517-524
    • /
    • 2024
  • This study aims to train and implement a deep learning model for the fusion of website creation and artificial intelligence, in the era of the AI revolution that followed the launch of the ChatGPT service. The model was trained on 3,000 collected web page images, processed according to a classification system of components and layouts. The work was divided into three stages. First, prior research on AI models was reviewed to select the algorithm most appropriate for the model we intended to implement. Second, suitable web page and paragraph images were collected, categorized, and processed. Third, the deep learning model was trained, and a serving interface was integrated to verify its actual outputs. The implemented model will be used to detect multiple paragraphs on a web page, analyze the number of lines, elements, and features in each paragraph, and derive meaningful data based on the classification system. This process is expected to evolve toward more precise analysis of web pages. Furthermore, the development of precise analysis techniques is anticipated to lay the groundwork for research into AI's capability to automatically generate perfect web pages.
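
As a hedged illustration of the training stage described above, the sketch below fine-tunes a pretrained CNN on web page screenshots sorted into layout classes. The directory layout, model choice (ResNet-18), and hyperparameters are assumptions; the abstract does not name the actual architecture.

```python
# Sketch: fine-tune a pretrained CNN on labeled web page screenshots.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Assumes screenshots sorted into data/webpages/<layout_class>/ subfolders.
train_set = datasets.ImageFolder("data/webpages", transform=tfm)
loader = DataLoader(train_set, batch_size=16, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # new head
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:          # one epoch shown for brevity
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```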

Optimal dwelling time prediction for package tour using K-nearest neighbor classification algorithm

  • Aria Bisma Wahyutama;Mintae Hwang
    • ETRI Journal
    • /
    • v.46 no.3
    • /
    • pp.473-484
    • /
    • 2024
  • We introduce a machine learning-based web application that helps travel agents plan a package tour schedule. K-nearest neighbor (KNN) classification predicts tourists' optimal dwelling time from a variety of information in order to automatically generate a convenient tour schedule. A database collected in collaboration with an established travel agency is fed into the KNN algorithm, implemented in Python, and the predicted dwelling times are sent to the web application via a RESTful application programming interface provided by the Flask framework. The web application displays a page where agents can configure the initial data, predict the optimal dwelling time, and automatically update the tour schedule. In a performance evaluation simulating a scenario on a computer running the Windows operating system, the average response time was 1.762 s, and the prediction consistency was 100% over 100 iterations.
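
The abstract names the concrete stack (KNN in Python, served over a Flask RESTful API), so a minimal sketch of that pipeline follows; the feature columns and CSV schema are illustrative assumptions, not the agency's actual database.

```python
# Sketch: a KNN model trained on tour records serves dwelling-time
# predictions through a Flask endpoint.
import pandas as pd
from flask import Flask, jsonify, request
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical schema: numeric features plus a dwelling-time label.
df = pd.read_csv("tour_records.csv")           # assumed file
X = df[["spot_type", "group_size", "season"]]  # assumed feature columns
y = df["dwelling_time_class"]

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    row = request.get_json()  # e.g. {"spot_type": 2, "group_size": 30, "season": 1}
    features = [[row["spot_type"], row["group_size"], row["season"]]]
    return jsonify({"dwelling_time": str(knn.predict(features)[0])})

if __name__ == "__main__":
    app.run()  # the web front end calls POST /predict to update the schedule
```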

Research on the Design of a Deep Learning-Based Automatic Web Page Generation System

  • Jung-Hwan Kim;Young-beom Ko;Jihoon Choi;Hanjin Lee
    • Journal of the Korea Society of Computer and Information
    • /
    • v.29 no.2
    • /
    • pp.21-30
    • /
    • 2024
  • This research aims to design, in three stages, a system capable of generating real web pages based on deep learning and big data. First, a classification scheme was established based on the industry type and functionality of e-commerce websites. Second, the types of web page components were systematically categorized. Third, the entire web page auto-generation system, suitable for deep learning, was designed. By re-engineering the deep learning model, trained on actual industrial data, to analyze and automatically generate existing websites, we propose a solution directly usable in the field. This research is expected to make technical and policy contributions to the field of generative AI-based complete website creation and to related industrial sectors.

Research on a Web-based Superior Technology Classification System for Prospective Information and Communications Entrepreneurs (정보통신 예비창업자를 위한 Web 기반 우위기술 도출 시스템 구축에 관한 연구)

  • 정민하;최문기
    • Proceedings of the Korea Intelligent Information Systems Society Conference
    • /
    • 2000.04a
    • /
    • pp.175-184
    • /
    • 2000
  • Recently, venture businesses in the information and communications industry have been booming. Although a technology classification chart helps potential entrepreneurs through survey papers and web pages, the existing service does not meet customer demand. The technology classification system proposed in this paper solves this problem by building a virtual network among ventures, technology experts, and potential entrepreneurs. The system supports potential entrepreneurs' decisions on which venture business items to pursue through a dual-client design, and it provides better service than existing systems by linking the expert client and the customer client.

Context-based Web Application Design (컨텍스트 기반의 웹 애플리케이션 설계 방법론)

  • Park, Jin-Soo
    • The Journal of Society for e-Business Studies
    • /
    • v.12 no.2
    • /
    • pp.111-132
    • /
    • 2007
  • Developing and managing Web applications are more complex than ever because of their growing functionalities, advancing Web technologies, increasing demands for integration with legacy applications, and changing content and structure. All these factors call for a more inclusive and comprehensive Web application design method. In response, we propose a context-based Web application design methodology that is based on several classification schemes, including a Web page classification, which is useful for identifying the information delivery mechanism and its relevant Web technology; a link classification, which reflects the semantics of various associations between pages; and a software component classification, which is helpful for pinpointing the roles of various components in the course of design. The proposed methodology also incorporates a unique Web application model comprised of a set of information clusters called compendia, each of which consists of a theme, its contextual pages, links, and components. This view is useful for modular design as well as for management of the ever-changing content and structure of a Web application. The proposed methodology brings together all three classification schemes and the Web application model to arrive at a set of design artifacts that are both semantically cohesive and syntactically loosely coupled.

Classifying Malicious Web Pages by Using an Adaptive Support Vector Machine

  • Hwang, Young Sup;Kwon, Jin Baek;Moon, Jae Chan;Cho, Seong Je
    • Journal of Information Processing Systems
    • /
    • v.9 no.3
    • /
    • pp.395-404
    • /
    • 2013
  • In order to classify a web page as being benign or malicious, we designed 14 basic and 16 extended features. The basic features were selected to represent the essential characteristics of a web page. The system heuristically combines two basic features into one extended feature in order to effectively distinguish benign and malicious pages. A support vector machine can be trained to successfully classify pages using these features. Because more and more malicious web pages are appearing and they change rapidly, classifiers trained on old data may misclassify some new pages. To overcome this problem, we selected an adaptive support vector machine (aSVM) as the classifier. The aSVM can learn training data and can quickly learn additional training data based on the support vectors obtained during its previous learning session. Experimental results verified that the aSVM can classify malicious web pages adaptively.
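
scikit-learn does not ship the aSVM, so the sketch below approximates the adaptive step with a standard SVC: it refits on the previous model's support vectors plus the newly labeled pages. The toy feature matrix stands in for the paper's 14 basic and 16 extended features.

```python
# Sketch: incremental retraining from retained support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_old = rng.normal(size=(200, 30))                   # 30 stand-in page features
y_old = (X_old[:, 0] + X_old[:, 1] > 0).astype(int)  # 1 = malicious (toy rule)

model = SVC(kernel="linear").fit(X_old, y_old)

def adapt(model, y_train, X_new, y_new):
    """Refit on the previous model's support vectors plus the new batch."""
    X = np.vstack([model.support_vectors_, X_new])
    y = np.concatenate([y_train[model.support_], y_new])
    refit = SVC(kernel="linear").fit(X, y)
    return refit, y  # return labels so the next adapt() call can index them

X_new = rng.normal(size=(40, 30))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
model, y_cur = adapt(model, y_old, X_new, y_new)     # fast incremental update
```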

A Research for Web Documents Genre Classification using STW (STW를 이용한 웹 문서 장르 분류에 관한 연구)

  • Ko, Byeong-Kyu;Oh, Kun-Seok;Kim, Pan-Koo
    • Journal of Information Technology and Architecture
    • /
    • v.9 no.4
    • /
    • pp.413-422
    • /
    • 2012
  • Many researchers have studied how to let machines understand the meaning of human natural language using text-based, page-rank-based, and other approaches. In particular, URL and HTML tag information in web documents has been attracting attention again as a means of analyzing huge amounts of web documents automatically. In this paper, we propose an STW (Semantic Term Weight) approach based on the syntactic and linguistic structure of web documents in order to classify their genres. For the evaluation, we analyzed more than 1,000 documents from the 20-Genre-collection corpus to train on the documents with an SVM algorithm, and then tested on the KI-04 corpus to evaluate the performance of the proposed method. We measured accuracy by classifying with and without STW. As a result, the proposed STW-based approach showed approximately 10.2% higher accuracy than the one without STW.
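
The abstract does not give the STW formula, so the sketch below substitutes a simple stand-in: terms occurring in a structurally salient position (the page title) are up-weighted before SVM training, and accuracy with and without the boost is compared, mirroring the paper's experimental design. The corpus, labels, and boost factor are toy assumptions.

```python
# Sketch: compare plain TF-IDF against title-boosted term weights for
# genre classification with a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy corpus: (body text, genre label), with each page's <title> text kept separately.
docs = [
    ("how do i reset my password account settings help", "help"),
    ("election results and government policy announced tonight", "news"),
    ("shipping returns and refund policy questions answered", "help"),
    ("markets rally as stocks climb on trade talks", "news"),
]
titles = ["faq", "breaking news", "faq", "business news"]
texts, labels = zip(*docs)

BOOST = 3  # assumed multiplier for structurally salient (title) terms

# Crude weight boost: repeat title terms so TF-IDF assigns them more mass.
weighted = [t + (" " + ttl) * BOOST for t, ttl in zip(texts, titles)]

for name, corpus in [("plain tf-idf", texts), ("title-boosted", weighted)]:
    X = TfidfVectorizer().fit_transform(corpus)
    acc = cross_val_score(LinearSVC(), X, list(labels), cv=2).mean()
    print(f"{name}: accuracy {acc:.2f}")
```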

An Automated Topic Specific Web Crawler Calculating Degree of Relevance (연관도를 계산하는 자동화된 주제 기반 웹 수집기)

  • Seo Hae-Sung;Choi Young-Soo;Choi Kyung-Hee;Jung Gi-Hyun;Noh Sang-Uk
    • Journal of Internet Computing and Services
    • /
    • v.7 no.3
    • /
    • pp.155-167
    • /
    • 2006
  • It is desirable for users surfing the Internet to find Web pages as closely related to their interests as possible. Toward this end, this paper presents a topic-specific Web crawler that computes the degree of relevance, collects a cluster of pages given a specific topic, and refines the preliminary set of related pages using term frequency/document frequency, entropy, and compiled rules. In the experiments, we tested our topic-specific crawler in terms of classification accuracy, crawling efficiency, and crawling consistency. First, the classification accuracy using the set of rules compiled by CN2 was the best among those of the C4.5 and backpropagation learning algorithms. Second, we measured the classification efficiency to determine the best threshold value affecting the degree of relevance. Third, the consistency of the crawler was measured as the number of resulting URLs that overlapped across different starting URLs. The experimental results imply that our topic-specific crawler was fairly consistent, regardless of the starting URLs randomly chosen.
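
As a hedged illustration of the pruning idea described above, the sketch below implements a breadth-first crawler that expands links only from pages whose relevance score clears a threshold. The single keyword-frequency score is a stand-in for the paper's TF/DF, entropy, and CN2-rule refinements; the topic terms and threshold are assumptions.

```python
# Sketch: a topic-specific crawler that prunes off-topic pages.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

TOPIC_TERMS = {"classification", "crawler", "relevance"}  # assumed topic
THRESHOLD = 0.01                                          # assumed cutoff

def relevance(text: str) -> float:
    """Fraction of words on the page that are topic terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in TOPIC_TERMS for w in words) / max(len(words), 1)

def crawl(seed: str, max_pages: int = 20):
    seen, queue, related = {seed}, deque([seed]), []
    while queue and len(related) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue
        if relevance(html) < THRESHOLD:
            continue                 # prune: do not expand off-topic pages
        related.append(url)
        for link in re.findall(r'href="([^"]+)"', html):
            nxt = urljoin(url, link)
            if nxt.startswith("http") and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return related
```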
