• Title/Summary/Keyword: Distributed Web Crawler (분산 웹 크롤러)

Design and Implementation of Distributed Web Crawler Using Globus Environment (글로버스를 이용한 분산 웹 크롤러의 설계 및 구현)

  • 이지선;김양우;이필우
    • Proceedings of the Korean Information Science Society Conference / 2004.04a / pp.712-714 / 2004
  • Most web search engines and many specialized search tools depend on web crawlers to collect large numbers of web pages as a preprocessing step for indexing and analysis. A typical web crawler gathers web page information by interacting with millions of hosts over a period of weeks or months. This paper proposes a distributed crawler that uses the Globus Toolkit, a grid middleware, to improve crawler performance and run it efficiently. Execution of the crawler is divided into three stages: connecting the host servers that will share the work through Globus, authenticating them, and allocating jobs; running the crawler program to collect data; and finally returning the collected web page information to the system that issued the original request. As a result, the collection work could be distributed further, the performance of an expensive high-end server could be obtained from several low-cost machines, and an easily extensible, robust crawler program and system environment could be built.
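The three-stage flow described in this abstract (allocate work to connected hosts, crawl, return results to the originating system) can be illustrated with a minimal sketch. The code below is not Globus Toolkit code; local thread workers stand in for the Globus-connected hosts, and the seed list, worker count, and function names are assumptions made purely for illustration.

```python
# Conceptual sketch of the three-stage flow from the abstract:
# (1) allocate work to hosts, (2) crawl, (3) return results to the coordinator.
# Globus job submission and authentication are replaced by local threads.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

SEED_URLS = ["https://example.org/", "https://example.com/"]  # hypothetical seeds
NUM_WORKERS = 2  # stands in for the number of connected host servers

def crawl_partition(urls):
    """Stage 2: each 'host' fetches its assigned partition of URLs."""
    pages = {}
    for url in urls:
        try:
            with urlopen(url, timeout=10) as resp:
                pages[url] = resp.read()
        except OSError:
            pages[url] = None  # record the failure instead of aborting the job
    return pages

def dispatch(seeds, num_workers):
    """Stage 1: partition the seed list; Stage 3: gather results centrally."""
    partitions = [seeds[i::num_workers] for i in range(num_workers)]
    collected = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for result in pool.map(crawl_partition, partitions):
            collected.update(result)  # results come back to the coordinator
    return collected

if __name__ == "__main__":
    pages = dispatch(SEED_URLS, NUM_WORKERS)
    print({url: (len(body) if body else 0) for url, body in pages.items()})
```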

Design and Implementation of a Search Engine based on Apache Spark (아파치 스파크 기반 검색엔진의 설계 및 구현)

  • Park, Ki-Sung;Choi, Jae-Hyun;Kim, Jong-Bae;Park, Jae-Won
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.1 / pp.17-28 / 2017
  • Research on data has recently become active as the value of data has grown. Web crawlers, programs that collect data, have drawn attention because they can be applied in many fields. A web crawler can be defined as a tool that traverses web servers in an automated manner, analyzing web pages and collecting URLs. For big data processing, distributed web crawlers based on Hadoop MapReduce are widely used, but MapReduce is difficult to work with and constrains performance. Apache Spark, an in-memory computing platform, is an alternative to MapReduce. A search engine, one of the main uses of a web crawler, displays the information retrieved by keyword from the pages the crawler has gathered. If a search engine is built on a Spark-based web crawler instead of a conventional MapReduce-based one, data collection can be considerably faster.
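As a rough illustration of the Spark-based collection idea above, the following PySpark sketch distributes page fetches across executors and collects the documents back to the driver. The seed URLs and the fetch function are assumptions made for illustration, not the paper's implementation.

```python
# Minimal PySpark sketch of distributing page fetches across a cluster,
# as an in-memory alternative to a MapReduce-based crawl.
from pyspark.sql import SparkSession
from urllib.request import urlopen

def fetch(url):
    """Fetch one page; return (url, text) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=10) as resp:
            return url, resp.read().decode("utf-8", errors="replace")
    except OSError:
        return url, None

if __name__ == "__main__":
    spark = SparkSession.builder.appName("spark-crawler-sketch").getOrCreate()
    seeds = ["https://example.org/", "https://example.com/"]  # hypothetical seeds
    # Parallelize the URL frontier, fetch on the executors, collect for indexing.
    pages = spark.sparkContext.parallelize(seeds, numSlices=2).map(fetch).collect()
    for url, body in pages:
        print(url, 0 if body is None else len(body))
    spark.stop()
```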

Intelligent Web Crawler for Supporting Big Data Analysis Services (빅데이터 분석 서비스 지원을 위한 지능형 웹 크롤러)

  • Seo, Dongmin;Jung, Hanmin
    • The Journal of the Korea Contents Association / v.13 no.12 / pp.575-584 / 2013
  • The data types used for big data analysis are very diverse: news, blogs, SNS, papers, patents, sensor data, and so on. In particular, the use of web documents, which provide reliable data in real time, is gradually increasing, and web crawlers that collect web documents automatically have grown in importance because big data is used in many fields and web data grow exponentially every year. However, existing web crawlers cannot collect all the web documents on a site, because they follow only the URLs contained in documents already collected from certain sites. In addition, existing crawlers may re-collect documents that other crawlers have already gathered, because information about what each crawler has collected is not managed efficiently between crawlers. This paper therefore proposes a distributed web crawler. To resolve these problems, the proposed crawler collects web documents through each site's RSS feed and the Google search API, and it achieves fast crawling performance with a client-server model based on RMI and NIO that minimizes network traffic. Furthermore, it extracts the core content of a web document by comparing keyword similarity across the tags contained in the document. Finally, to verify the superiority of the proposed crawler, we compare it with existing web crawlers in a range of experiments.
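Two of the ideas above, collecting document URLs from a site's RSS feed and selecting core content by keyword similarity over tag text, can be sketched as follows. The feed URL is a placeholder and the similarity measure is a deliberately simple stand-in, not the paper's actual method.

```python
# Sketch: gather article URLs from an RSS 2.0 feed, and score a tag's text
# by a toy keyword-overlap measure to pick out core content blocks.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def rss_links(feed_url):
    """Return the <link> of every <item> in a simple RSS 2.0 feed."""
    with urlopen(feed_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    return [item.findtext("link") for item in root.iter("item")]

def core_score(block_text, keywords):
    """Toy 'keyword similarity': fraction of keywords appearing in the text."""
    text = block_text.lower()
    return sum(1 for kw in keywords if kw.lower() in text) / max(len(keywords), 1)

if __name__ == "__main__":
    # urls = rss_links("https://example.org/rss.xml")  # point at a real feed URL
    print(core_score("Distributed crawler for big data analysis", ["crawler", "big data"]))
```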

Issue Analysis on Gas Safety Based on a Distributed Web Crawler Using Amazon Web Services (AWS를 활용한 분산 웹 크롤러 기반 가스 안전 이슈 분석)

  • Kim, Yong-Young;Kim, Yong-Ki;Kim, Dae-Sik;Kim, Mi-Hye
    • Journal of Digital Convergence / v.16 no.12 / pp.317-325 / 2018
  • With the aim of creating new economic value and strengthening national competitiveness, governments and major private companies around the world maintain a strong interest in big data and are making bold investments. To collect objective data such as news, securing data integrity and quality is a prerequisite. For researchers or practitioners who want to make decisions or run trend analyses on objective, large-scale data such as portal news, the problem with existing crawling methods is that the data collection itself gets blocked. In this study, we implemented a method of collecting web data that addresses these problems by using the cloud service platform provided by Amazon Web Services (AWS). We then collected articles on 'gas safety' and analyzed the related issues. The study confirmed that, to ensure gas safety, strategies should be established and operated systematically around five categories: accident/occurrence, prevention, maintenance/management, government/policy, and target.
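One common AWS pattern for this kind of distributed collection, sketched below as an assumption rather than the paper's actual architecture, is to have a coordinator push target URLs to an SQS queue while worker instances pull and fetch them, so requests originate from many machines. The queue URL and region are placeholders.

```python
# Hedged sketch: distribute crawl targets through an SQS work queue so that
# fetches are spread across worker instances. Requires AWS credentials.
import boto3
from urllib.request import urlopen

QUEUE_URL = "https://sqs.ap-northeast-2.amazonaws.com/123456789012/crawl-queue"  # placeholder
sqs = boto3.client("sqs", region_name="ap-northeast-2")

def enqueue(urls):
    """Coordinator side: push each target URL onto the queue."""
    for url in urls:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=url)

def work_once():
    """Worker side: pull one URL, fetch it, then delete the message."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        url = msg["Body"]
        try:
            with urlopen(url, timeout=10) as page:
                print(url, len(page.read()))
        finally:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```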

Design and Implementation of a Globus-based Distributed Web Crawler Manager on Grid Environment (글로버스 기반 그리드 환경에서의 분산 웹 크롤러 매니저 설계 및 구현)

  • Kim, Hyuk-Ho;Lee, Seung-Ha;Park, Chan-Ho;Kim, Yang-Woo;Lee, Phil-Woo
    • Proceedings of the Korea Information Processing Society Conference / 2005.05a / pp.945-948 / 2005
  • A grid information retrieval system addresses the problems and limitations of conventional information retrieval systems by building the retrieval system on a grid, a distributed processing environment, and thereby provides a more efficient and flexibly scalable retrieval service. In this paper, a virtual organization (VO) for information retrieval was set up with the Globus Toolkit, one of the grid middlewares, to fit the grid system environment. As a preliminary step toward grid information retrieval, a crawler manager that manages P2P-based distributed crawlers collecting information from the web was then designed and implemented as a grid service, so that it can be used alongside the other services in the grid information retrieval system.
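The bookkeeping role of such a crawler manager can be sketched minimally: register crawlers, hand each one a share of the pending URLs, and accept their results. The Globus/VO and grid-service plumbing is omitted, and the class and method names are illustrative assumptions, not the paper's interfaces.

```python
# Minimal sketch of a crawler manager's bookkeeping for distributed crawlers.
from collections import deque

class CrawlerManager:
    def __init__(self):
        self.crawlers = set()   # registered crawler identifiers
        self.pending = deque()  # URLs waiting to be assigned
        self.results = {}       # url -> fetched document (or None)

    def register(self, crawler_id):
        self.crawlers.add(crawler_id)

    def add_urls(self, urls):
        self.pending.extend(urls)

    def assign(self, crawler_id, batch_size=10):
        """Give a registered crawler its next batch of URLs."""
        if crawler_id not in self.crawlers:
            raise KeyError(f"unknown crawler: {crawler_id}")
        return [self.pending.popleft() for _ in range(min(batch_size, len(self.pending)))]

    def report(self, url, document):
        self.results[url] = document

if __name__ == "__main__":
    mgr = CrawlerManager()
    mgr.register("crawler-1")
    mgr.add_urls(["https://example.org/a", "https://example.org/b"])
    print(mgr.assign("crawler-1"))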

Distributed Parallel Crawler Design and Implementation (분산형 병렬 크롤러 설계 및 구현)

  • Jang, Hyun Ho;Jeon, Kyung-Sik;Lee, HooKi
    • Convergence Security Journal / v.19 no.3 / pp.21-28 / 2019
  • As the number of websites managed by an organization grows, so does the number of web application servers and containers. Checking the status of the web services on those servers and containers is very difficult when a person must access each physical server at a remote site through a terminal or other access software. Prior crawler research rarely discusses how the crawled data is processed, and data loss can occur when the crawler writes its data to the database. In this paper, we propose a method for storing the inspection data produced by crawler-based web application server management without losing any of it.
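One way to avoid the database-related data loss described above, sketched here as an assumption rather than the paper's own design, is to keep crawler threads away from the database entirely: they push results onto a queue, and a single writer thread commits them. The table and database names are illustrative.

```python
# Sketch: crawler threads enqueue results; one writer thread owns the DB.
import queue
import sqlite3
import threading

results = queue.Queue()
STOP = object()  # sentinel telling the writer to finish

def db_writer(db_path="inspection.db"):
    conn = sqlite3.connect(db_path)  # connection created and used in this thread only
    conn.execute("CREATE TABLE IF NOT EXISTS status (url TEXT, http_code INTEGER)")
    while True:
        item = results.get()
        if item is STOP:
            break
        conn.execute("INSERT INTO status VALUES (?, ?)", item)
        conn.commit()  # commit per item so a crash loses at most one row
        results.task_done()
    conn.close()

def crawler(url):
    # ...fetch url and determine its status; hard-coded here for brevity...
    results.put((url, 200))

if __name__ == "__main__":
    writer = threading.Thread(target=db_writer)
    writer.start()
    for u in ["https://example.org/app1", "https://example.org/app2"]:
        crawler(u)
    results.put(STOP)
    writer.join()
```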

Multi-threaded Web Crawling Design using Queues (큐를 이용한 다중스레드 방식의 웹 크롤링 설계)

  • Kim, Hyo-Jong;Lee, Jun-Yun;Shin, Seung-Soo
    • Journal of Convergence for Information Technology / v.7 no.2 / pp.43-51 / 2017
  • Background/Objectives : The purpose of this study is to propose a multi-threaded web crawl using queues that can solve the problem of time delay of single processing method, cost increase of parallel processing method, and waste of manpower by utilizing multiple bots connected by wide area network Design and implement. Methods/Statistical analysis : This study designs and analyzes applications that run on independent systems based on multi-threaded system configuration using queues. Findings : We propose a multi-threaded web crawler design using queues. In addition, the throughput of web documents can be analyzed by dividing by client and thread according to the formula, and the efficiency and the number of optimal clients can be confirmed by checking efficiency of each thread. The proposed system is based on distributed processing. Clients in each independent environment provide fast and reliable web documents using queues and threads. Application/Improvements : There is a need for a system that quickly and efficiently navigates and collects various web sites by applying queues and multiple threads to a general purpose web crawler, rather than a web crawler design that targets a particular site.