• Title/Summary/Keyword: 웹수집 주기 (web collection cycle)

Acquisition Methods for Disaster Archives Based on the Issue Life Cycle Model (이슈 생존 주기 모형 기반 재난 아카이브 수집 방안)

  • Yoo, Ho-Suon;Oh, Hyo-Jung
    • Journal of the Korean Society for Information Management / v.35 no.2 / pp.115-139 / 2018
  • Because of the value of disaster web records and the importance of preserving them, building disaster archives is becoming a national challenge around the world. This study proposes acquisition methods based on the issue life cycle model for collecting disaster web records. We first analyzed the web-record acquisition status, methods, and cycles of domestic and foreign disaster archives. We then derived an issue life cycle model by collecting and analyzing the disaster issues of the last 10 years. The analysis divided the model into a sudden type and a periodic type according to the characteristics of the disaster. Finally, this study proposes a method for collecting web records according to each model and verifies its applicability.
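
The abstract distinguishes a sudden type and a periodic type of disaster issue. Below is a minimal sketch, in Python, of how collection intervals might follow such an issue life cycle; the phase boundaries and intervals are illustrative assumptions, not parameters proposed in the paper:

    from datetime import date, timedelta

    # Illustrative crawl-interval policy for the two issue types named in the
    # abstract; the phase lengths and intervals are assumptions, not values
    # taken from the paper.
    def crawl_interval(issue_type: str, days_since_onset: int) -> timedelta:
        if issue_type == "sudden":            # e.g. an earthquake: sharp peak, then decay
            if days_since_onset <= 7:
                return timedelta(hours=6)     # peak phase: several snapshots per day
            if days_since_onset <= 30:
                return timedelta(days=1)      # decline phase: daily snapshots
            return timedelta(weeks=1)         # residual phase: weekly snapshots
        if issue_type == "periodic":          # e.g. seasonal typhoons: recurring attention
            return timedelta(days=3)          # steady collection across the recurring season
        raise ValueError(f"unknown issue type: {issue_type}")

    # Example: next crawl date for a sudden-type issue, 10 days after onset
    next_crawl = date.today() + crawl_interval("sudden", 10)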

Development of Web Crawler for Archiving Web Resources (웹 자원 아카이빙을 위한 웹 크롤러 연구 개발)

  • Kim, Kwang-Young;Lee, Won-Goo;Lee, Min-Ho;Yoon, Hwa-Mook;Shin, Sung-Ho
    • The Journal of the Korea Contents Association / v.11 no.9 / pp.9-16 / 2011
  • Once a web service is terminated and gone, there is no way to collect, preserve, or use its resources, and web resources are updated periodically or irregularly, or destroyed, regardless of their importance. Web archiving is therefore being emphasized as a way to collect and preserve web resources, and a dedicated crawler that harvests them periodically is required. In this study, we analyze the strengths and weaknesses of existing web crawlers with respect to collecting web resources for archiving, and we develop a web archiving system for effective collection of web resources.

A study on the enhanced filtering method of the deduplication for bulk harvest of web records (대규모 웹 기록물의 원격수집을 위한 콘텐츠 중복 필터링 개선 연구)

  • Lee, Yeon-Soo;Nam, Sung-un;Yoon, Dai-hyun
    • The Korean Journal of Archival Studies / no.35 / pp.133-160 / 2013
  • As networks and electronic devices have developed rapidly, the influence the web exerts on our daily lives has kept increasing. Information created on the web plays an ever more essential role as an important record that reflects each era, so there is strong demand to archive it in a standardized way. One such method is the snapshot strategy, which crawls web content periodically with automated software. This strategy has two problems: it can harvest identical, duplicated content, and it can collect meaningless or useless content because of the complex web technologies used on today's sites. In this paper, we categorize the problems that emerge when crawling web content with the snapshot strategy and, based on crawls of public-institution web sites, present technical solutions to them.
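
A minimal sketch of content-level duplicate filtering during snapshot crawling is shown below; hashing a normalized page body is one common approach and is given only as an illustration, not as the filtering method proposed in the paper:

    import hashlib

    seen_hashes: set[str] = set()

    def normalize(html: str) -> str:
        # Crude normalization so trivial differences (whitespace, letter case)
        # do not defeat duplicate detection; a real filter would also strip
        # timestamps, ads, and session identifiers.
        return " ".join(html.split()).lower()

    def is_duplicate(html: str) -> bool:
        # True if an identical (normalized) page body was already archived.
        digest = hashlib.sha256(normalize(html).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False

    # Usage inside a snapshot crawl loop (crawl() and archive() are placeholders):
    # for url, html in crawl(seed_urls):
    #     if not is_duplicate(html):
    #         archive(url, html)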

Refresh Cycle Optimization for Web Crawlers (웹크롤러의 수집주기 최적화)

  • Cho, Wan-Sup;Lee, Jeong-Eun;Choi, Chi-Hwan
    • The Journal of the Korea Contents Association / v.13 no.6 / pp.30-39 / 2013
  • A web crawler should keep large amounts of data from web sites fresh with minimum server overhead, and that overhead grows rapidly as the volume of data explodes in the big-data era. The amount of web information is increasing quickly with advanced wireless networks and the emergence of diverse smart devices, and information is continuously produced and updated anywhere, at any time, through easy-to-use web platforms and smart devices. How frequently updated web data should be refreshed during collection and integration has therefore become a pressing issue. In this paper, we propose dynamic web-data crawling methods that include sensitive detection of web site changes and dynamic retrieval of pages from target sites based on their historical update patterns. We also implemented a Java-based web crawling application and compared the efficiency of conventional static approaches with our dynamic one. The experimental results showed a 46.2% reduction in overhead, with fresher data, compared to static crawling methods.
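
A minimal sketch of the underlying idea, adjusting each page's revisit interval from its observed change history, is shown below; the linear rule and the bounds are illustrative assumptions, not the method implemented in the paper:

    from datetime import timedelta

    def refresh_interval(change_history: list[bool],
                         base: timedelta = timedelta(days=1),
                         shortest: timedelta = timedelta(hours=6),
                         longest: timedelta = timedelta(days=30)) -> timedelta:
        # change_history holds one flag per past crawl: True if the page had
        # changed since the previous visit. Pages that change often are
        # revisited sooner; pages that rarely change are revisited later.
        if not change_history:
            return base
        change_rate = sum(change_history) / len(change_history)   # 0.0 .. 1.0
        interval = base / max(change_rate, 0.05)                   # avoid division by zero
        return max(shortest, min(interval, longest))

    # Example: a page that changed on 8 of its last 10 crawls is revisited
    # roughly every 30 hours (1 day / 0.8).
    print(refresh_interval([True] * 8 + [False] * 2))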

Design and Implementation of Distributed Web Crawler Using Globus Environment (글로버스를 이용한 분산 웹 크롤러의 설계 및 구현)

  • 이지선;김양우;이필우
    • Proceedings of the Korean Information Science Society Conference / 2004.04a / pp.712-714 / 2004
  • Most web search engines and many specialized search tools rely on web crawlers to gather large numbers of web pages as a preprocessing step for indexing and analysis. A typical web crawler collects web page information by interacting with millions of hosts over a cycle of weeks or months. In this paper, we propose a distributed crawler that uses the Globus Toolkit grid middleware to improve crawler performance and execution efficiency. Its execution is divided into three stages: the host servers that share the work are connected through Globus, authenticated, and assigned tasks; the crawler program runs and collects data; and finally the collected web page information is returned to the system that issued the original request. As a result, the collection work could be distributed more widely, the performance of an expensive high-end server could be obtained from several low-cost machines, and an easily extensible and robust crawler program and system environment could be built.

A Study on Web Archiving System Development (웹 아카이빙 시스템에 관한 연구)

  • Kim, KwangYoung;Lee, SeokHyoung;Choie, HoSeop;Han, HeeJun;Kim, Jinsuk
    • Proceedings of the Korea Information Processing Society Conference / 2011.11a / pp.1383-1385 / 2011
  • Today digital information is growing exponentially, yet it is also being rapidly discarded and lost. Web resources in particular have no established means of collection, preservation, and use, so they disappear once their service period ends. A web archiving system that collects and preserves web resources is therefore required, one that harvests them periodically and supports their permanent preservation and access. In this study, we developed a web archiving system for collecting, preserving, and providing permanent access to web resources.

Wrapper-based Economy Data Collection System Design And Implementation (래퍼 기반 경제 데이터 수집 시스템 설계 및 구현)

  • Piao, Zhegao;Gu, Yeong Hyeon;Yoo, Seong Joon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2015.05a / pp.227-230 / 2015
  • Analyzing and predicting economic trends requires collecting specific economic news and stock data. A typical web crawler analyzes page content, collects documents, and extracts URLs automatically, whereas focused crawlers collect only documents on a particular topic. To collect economic news from a particular web site, we need a crawler that directly analyzes the site's structure and gathers data from it, that is, a wrapper-based web crawler. In this paper, we design and implement a wrapper-based crawler for a big-data economic news analysis system. With it, we collected stock data and U.S. auto-market sales data since 2000, as well as economic news from the U.S. and South Korea. The crawler determines each site's update frequency and re-collects the data periodically; we then remove duplicates and noise data, such as advertising and public-relations articles, and build a structured data set for subsequent analysis.
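
A minimal sketch of the wrapper idea, a per-site configuration of selectors that maps a known page layout onto structured fields, is shown below; the site key, URL, and CSS selectors are hypothetical placeholders, not those used by the authors (requires the requests and beautifulsoup4 packages):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    # One wrapper per target site; the selectors below are hypothetical placeholders.
    WRAPPERS = {
        "example-economy-news": {
            "list_url": "https://news.example.com/economy",
            "item": "ul.article-list li a",        # links to individual articles
            "title": "h1.headline",
            "body": "div.article-body",
            "date": "span.published-at",
        },
    }

    def collect(site_key: str) -> list[dict]:
        # Fetch the listing page, follow each article link, and map the page
        # structure onto structured fields using the site's wrapper.
        w = WRAPPERS[site_key]
        listing = BeautifulSoup(requests.get(w["list_url"], timeout=10).text, "html.parser")
        records = []
        for link in listing.select(w["item"]):
            url = urljoin(w["list_url"], link["href"])
            page = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            records.append({
                "url": url,
                "title": page.select_one(w["title"]).get_text(strip=True),
                "body": page.select_one(w["body"]).get_text(" ", strip=True),
                "date": page.select_one(w["date"]).get_text(strip=True),
            })
        return records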

A Study of the Workflow and the Metadata for Web Records Archiving (웹 기록물 아카이빙을 위한 워크플로우 및 메타데이터 연구)

  • Seung-Jun Cha;Dong-Suk Chun;Kyu-Chul Lee
    • Proceedings of the Korea Information Processing Society Conference / 2008.11a / pp.1379-1382 / 2008
  • The web has become a major channel of communication between governments and citizens in a rapidly changing modern society. As the amount of information circulating on the web surges, dependence on the web as an information source has grown sharply, and resources that exist only on the web are also increasing. Yet web sites of significant value are disappearing because of their short life cycles and the lack of methods for collecting, preserving, and using them. To solve this problem, a workflow and metadata must be defined as base technology for web records archiving. In this paper, we therefore define a workflow for archiving web records, consisting of selection, acquisition, quality control and cataloging, preservation, and storage, together with the metadata essential for long-term preservation and retrieval. Through such research, development, and application, web records, an important resource now disappearing, can be stored and managed as significant records for future generations.
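
An illustrative example of the kind of record such a workflow (selection, acquisition, quality control and cataloging, preservation, storage) might keep for one archived web snapshot; the field names and values are assumptions for illustration, not the metadata element set defined in the paper:

    # Hypothetical metadata record for one archived web snapshot.
    web_record_metadata = {
        "identifier": "web-archive-2008-000123",
        "seed_url": "https://www.example.go.kr/",
        "title": "Example agency home page",
        "selection": {"reason": "agency of record", "appraised_by": "archivist-01"},
        "acquisition": {"harvest_date": "2008-11-14", "crawler": "example-crawler", "depth": 3},
        "quality_control": {"completeness_checked": True, "broken_links": 2},
        "preservation": {"format": "WARC", "checksum_sha256": "<digest>", "copies": 2},
        "access": {"restrictions": "none", "keywords": ["web records", "archiving"]},
    }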

Implementation of a Parallel Web Crawler for the Odysseus Large-Scale Search Engine (오디세우스 대용량 검색 엔진을 위한 병렬 웹 크롤러의 구현)

  • Shin, Eun-Jeong;Kim, Yi-Reun;Heo, Jun-Seok;Whang, Kyu-Young
    • Journal of KIISE: Computing Practices and Letters / v.14 no.6 / pp.567-581 / 2008
  • As the size of the web grows explosively, search engines are becoming increasingly important as the primary means of retrieving information from the Internet. A search engine periodically downloads web pages and stores them in a database to provide users with up-to-date search results, and the web crawler is the program that downloads and stores web pages for this purpose. A large-scale search engine uses a parallel web crawler to maximize the download rate when retrieving its collection of web pages. However, the architecture and experimental analysis of parallel web crawlers have not been fully discussed in the literature. In this paper, we propose an architecture for a parallel web crawler and discuss implementation issues in detail. The proposed crawler is based on a coordinator/agent model that uses multiple machines to download web pages in parallel: multiple agent machines collect web pages, and a single coordinator machine manages them. The crawler consists of three components: a crawling module for collecting web pages, a converting module for transforming the pages into a database-friendly format, and a ranking module for rating pages by their relative importance. We explain each component and its implementation methods in detail. Finally, we conduct extensive experiments to analyze the effectiveness of the crawler. The results show the merit of our architecture: the proposed parallel web crawler scales with the number of web pages to crawl and the number of machines used.
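
A minimal sketch of the coordinator/agent partitioning idea, in which the coordinator routes every URL of the same host to the same agent (here by hashing the host name), is shown below; it illustrates the general model only, not the Odysseus crawler's actual implementation:

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlparse

    def assign_agent(url: str, num_agents: int) -> int:
        # Route all URLs of one host to the same agent so per-host politeness
        # limits can be enforced locally on that agent.
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_agents

    def partition(urls: list[str], num_agents: int) -> dict[int, list[str]]:
        # Coordinator-side step: split the crawl frontier into one work list per agent.
        work = defaultdict(list)
        for url in urls:
            work[assign_agent(url, num_agents)].append(url)
        return dict(work)

    # Example: three agents share a small frontier
    frontier = ["https://a.example.com/1", "https://a.example.com/2", "https://b.example.org/"]
    print(partition(frontier, 3))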

A Study on Improving the OASIS Selection Guidelines (OASIS의 선정지침 개선(안)에 관한 연구)

  • Noh, Young-Hee;Go, Young-Sun
    • Journal of the Korean BIBLIA Society for Library and Information Science / v.23 no.3 / pp.105-137 / 2012
  • The historical, social, and cultural value of Internet resources is indisputable, so many national institutions have created web archiving projects to hand this heritage down to future generations. Selection guidelines are the most crucial aspect of these projects because they help distinguish which of the vast number of available web resources are worth collecting and preserving. The purpose of this study was to suggest improvements to the OASIS Selection Guidelines by analyzing the selection guidelines of other domestic and international web archiving projects. First, based on the results of web archiving projects abroad, we proposed improvements to the definition of web data and other terms, the basic principles of collection, collection methods, and the collection cycle. Second, we proposed substantial improvements to the target resources for archiving and stated which kinds of web resources must be excluded from archiving. Finally, we discussed the relationship between data collection methods and the legal deposit of online resources, the necessity of building a database of selected target materials, and the necessity of cooperative archiving policies.