• Title/Summary/Keyword: Web Data Collection

Numerical Formula and Verification of Web Robot for Collection Speedup of Web Documents

  • Kim Weon;Kim Young-Ki;Chin Yong-Ok
    • Journal of Internet Computing and Services / v.5 no.6 / pp.1-10 / 2004
  • A web robot is software that tracks and collects web documents on the Internet. The performance scalability of recent web robots has reached its limit as the number of web documents has increased sharply with the rapid growth of the Internet. Accordingly, research on performance scalability in searching and collecting web documents is strongly demanded. This paper presents the design of a Multi-Agent based web robot to speed up document collection, rather than a sequentially executing web robot based on the existing Fork-Join method, together with an analysis of its performance scalability. For collection speedup, the Multi-Agent based web robot processes inactive ('dead-link') URLs, which are caused by overloaded web documents or temporary network or web-server disturbances, independently after dividing them among the agents. Each agent consists of four components: Loader, Extractor, Active URL Scanner, and Inactive URL Scanner. The paper models the Multi-Agent based web robot on Amdahl's Law, introduces a numerical formula for collection speedup, and verifies the performance improvement by comparing values predicted by the formula with experimental results. Moreover, a Dynamic URL Partition algorithm is introduced and implemented to minimize the workload of web servers by maximizing the interval between visits to each web server targeted for collection.

  • PDF
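
The speedup model above is built on Amdahl's Law. As a point of reference, a minimal sketch of the classic formulation (the paper's exact collection-speedup formula is not reproduced in the abstract, so the parallel fraction used here is purely an illustrative assumption):

```python
def amdahl_speedup(parallel_fraction: float, n_agents: int) -> float:
    """Classic Amdahl's Law: overall speedup when a fraction p of the work
    (here, URL fetching spread across collection agents) runs on n units."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_agents)

# If, say, 90% of collection time is parallelizable fetching, speedup is
# bounded by 1/0.1 = 10x no matter how many agents are added:
print(amdahl_speedup(0.9, 4))   # ≈ 3.08x with 4 agents
print(amdahl_speedup(0.9, 16))  # ≈ 6.4x with 16 agents
```

The bound explains why the paper also attacks the serial part (dead-link handling) rather than only adding agents.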

A Study on the Construction of Contents for Collection Management Web Sites of University Libraries (대학도서관 장서관리 웹사이트 컨텐츠구성에 관한 연구)

  • Yoon, Hye-Young
    • Journal of the Korean Society for Library and Information Science / v.36 no.1 / pp.165-186 / 2002
  • The collection management paradigm has changed drastically since collection management activities emerged on the Internet. With the technological development of the web, the web has become a newly opened field and a useful tool for facilitating collection management transactions. The purpose of this study is to construct the contents of a collection management web site by evaluating the collection management web sites of university libraries in the United States. To fulfill this purpose, six categories for the contents of a collection management web site are suggested. The results of this study can be used as basic data for research on collection management web sites. Such a web site plays a role in stepping up collection management activities and in providing useful information on selection tools and collection development policy.

A Design of SNS and Web Data Analysis System for Company Marketing Strategy (기업 마케팅 전략을 위한 SNS 및 Web 데이터 분석 시스템 설계)

  • Lee, ByungKwan;Jeong, EunHee;Jung, YiNa
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.6 no.4 / pp.195-200 / 2013
  • This paper proposes an SNS and Web Data Analysis System that can support a business marketing strategy by analyzing negative SNS and Web data that can do great damage to a business's image. It consists of a Data Collection Module that collects SNS and Web data, an Hbase Module that stores the collected data, a Data Analysis Module that estimates and classifies the meaning of the data after a semantic analysis, and a PHS Module that performs an optimized MapReduce over the SNS and Web data related to a business. With these modules, the analysis results can be used for a business marketing strategy by efficiently managing SNS and Web data.
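
The classify-and-count step of such a pipeline can be sketched in MapReduce style. This is a toy illustration only: the tiny keyword lexicon and the label names are assumptions, not the paper's semantic analysis, and a real deployment would run mapper/reducer as Hadoop jobs over Hbase-stored data:

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical negative-term lexicon; the paper's semantic analysis is far richer.
NEGATIVE_TERMS = {"defect", "refund", "broken", "scam"}

def map_phase(posts: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit (label, 1) for each collected SNS/Web post."""
    for post in posts:
        label = "negative" if any(t in post.lower() for t in NEGATIVE_TERMS) else "other"
        yield (label, 1)

def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce: sum the counts per label."""
    totals = Counter()
    for label, n in pairs:
        totals[label] += n
    return totals

posts = ["The product arrived broken", "Great service!", "I want a refund"]
print(reduce_phase(map_phase(posts)))  # Counter({'negative': 2, 'other': 1})
```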

Implementation of Search Engine to Minimize Traffic Using Blockchain-Based Web Usage History Management System

  • Yu, Sunghyun;Yeom, Cheolmin;Won, Yoojae
    • Journal of Information Processing Systems / v.17 no.5 / pp.989-1003 / 2021
  • With the recent increase in the types of services provided by Internet companies, collection of various types of data has become a necessity. Data collectors behind web services profit by collecting users' data indiscriminately and providing it to associated services, yet the data provider remains unaware of how the data are collected and used. Furthermore, the data collector of a web service consumes web resources by generating a large amount of web traffic, which can damage servers by causing service outages. In this study, we propose a website search engine built on a system that controls user information using blockchains and builds its database from the recorded information. The system is divided into three parts: a collection section that uses a proxy, a management section that uses blockchains, and a search engine that uses the built database. This structure allows data sovereigns to manage their data more transparently. Search engines that use blockchains do not use Internet bots, relying instead on the data generated by user behavior. This avoids the traffic generated by Internet bots and can thereby contribute to a better web ecosystem.
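
The management section's core idea, an append-only, verifiable record of usage history, can be sketched as a minimal hash-chained log. Field names here are illustrative assumptions, not the paper's actual schema, and a real blockchain adds consensus and distribution on top of this:

```python
import hashlib
import json

def make_block(prev_hash: str, usage_record: dict) -> dict:
    """A minimal hash-chained block holding one web-usage record."""
    body = {"prev": prev_hash, "record": usage_record}
    body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return body

def verify_chain(chain) -> bool:
    """Recompute each block's hash and check that the prev-links are intact."""
    prev = "0" * 64  # genesis predecessor
    for block in chain:
        if block["prev"] != prev:
            return False
        body = {"prev": block["prev"], "record": block["record"]}
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != block["hash"]:
            return False
        prev = block["hash"]
    return True

chain = [make_block("0" * 64, {"url": "https://example.com", "action": "visit"})]
chain.append(make_block(chain[-1]["hash"], {"url": "https://example.com/a", "action": "visit"}))
print(verify_chain(chain))  # True; any edit to a stored record breaks verification
```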

Data Collection and Management on the World Wide Web : Evaluating System for Lecture (웹을 이용한 데이터 수집 및 관리에 관한 연구 : 강의평가 시스템 구현)

  • 안정용;최승현;한경수
    • The Korean Journal of Applied Statistics / v.13 no.2 / pp.287-296 / 2000
  • Data collection, management, and analysis to furnish information are very important these days. In this paper, we discuss methods of data collection and management on the World Wide Web and introduce a lecture evaluation system.

  • PDF

A Study on Improving the OASIS Selection Guidelines (OASIS의 선정지침 개선(안)에 관한 연구)

  • Noh, Young-Hee;Go, Young-Sun
    • Journal of the Korean BIBLIA Society for Library and Information Science / v.23 no.3 / pp.105-137 / 2012
  • The historical, social, and cultural value of Internet resources is indisputable; therefore, many national institutions have created web archiving projects to hand down this heritage to future generations. Selection guidelines are the most crucial aspect of these projects because they help differentiate which resources are worth collecting and preserving from the large number of web resources available. The purpose of this study was to suggest improvements to the OASIS Selection Guidelines by analyzing the selection guidelines of other domestic and international web archiving projects. First, based on the results of web archiving projects abroad, we proposed improvements to the definition of web data and other terms, the basic principles of collection, collection methods, and the collection cycle. Second, we proposed substantial improvements to the target resources for archiving, and also stated which kinds of web resources must be excluded from web archiving. Finally, we discussed the relationship between data collection methods and the legal deposit of online resources, the necessity of constructing a database for selected target materials, and the necessity of cooperative archiving policies.

Refresh Cycle Optimization for Web Crawlers (웹크롤러의 수집주기 최적화)

  • Cho, Wan-Sup;Lee, Jeong-Eun;Choi, Chi-Hwan
    • The Journal of the Korea Contents Association / v.13 no.6 / pp.30-39 / 2013
  • A web crawler should keep its data fresh with minimum server overhead across the large amounts of data in target web sites. Server overhead increases rapidly as data volumes explode in the big data era. The amount of web information is growing rapidly with advanced wireless networks and the emergence of diverse smart devices, and information is continuously produced and updated anywhere, anytime through easy-to-use web platforms and smart devices. How frequently updated web data should be refreshed during collection and integration is therefore becoming a pressing issue. In this paper, we propose dynamic web-data crawling methods that include sensitive checking of web site changes and dynamic retrieval of web pages from target sites based on historical update patterns. We also implemented a Java-based web crawling application and compared the efficiency of conventional static approaches with our dynamic one. Our experimental results showed a 46.2% reduction in overhead, with fresher data, compared to static crawling methods.
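
The idea of deriving a per-page refresh cycle from its historical update pattern can be sketched as a simple heuristic. This is an illustrative assumption, not the paper's actual algorithm: revisit roughly once per expected change, with clamping so hot pages are not hammered and static pages are not forgotten:

```python
def refresh_interval(change_timestamps, horizon_days: float,
                     base_interval: float = 24.0) -> float:
    """Derive a per-page refresh interval (in hours) from its update history:
    pages that changed often over the observation horizon are revisited sooner."""
    changes_per_day = len(change_timestamps) / horizon_days
    if changes_per_day == 0:
        return base_interval * 4  # no observed changes: back off
    interval = 24.0 / changes_per_day  # roughly one visit per expected change
    return max(1.0, min(interval, base_interval * 4))  # clamp to [1h, 96h]

print(refresh_interval([1, 5, 9, 13], horizon_days=14))  # 4 changes in 2 weeks
print(refresh_interval([], horizon_days=14))             # static page: 96.0 h
```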

Design and Implementation of a Real-time Web Crawling Distributed Monitoring System (실시간 웹 크롤링 분산 모니터링 시스템 설계 및 구현)

  • Kim, Yeong-A;Kim, Gea-Hee;Kim, Hyun-Ju;Kim, Chang-Geun
    • Journal of Convergence for Information Technology / v.9 no.1 / pp.45-53 / 2019
  • We face problems from the excessive information served by websites in this rapidly changing information era: little of it is useful, much is useless, and we spend a lot of time selecting the information we need. Many websites, including search engines, use web crawling to keep their data updated. Web crawling is usually used to generate copies of all the pages of visited sites, which search engines then index for faster searching. For collecting wholesale and order information that changes in real time, keyword-oriented web data collection is not adequate, and no alternative for selectively collecting web information in real time has been suggested. In this paper, we propose a method that collects information from restricted web sites using a web crawling distributed monitoring system (R-WCMS), estimates collection time through detailed analysis of the data, and stores the data in a parallel system. Experimental results show that applying web site information retrieval to the proposed model reduces collection time by 15-17%.
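
Once per-site collection times are estimated, distributing sites across crawl workers is a scheduling problem. A minimal sketch using the greedy longest-processing-time heuristic (assign each site, largest first, to the least-loaded worker); this is an illustration, not R-WCMS's actual partitioning logic, and the time estimates are made-up numbers:

```python
import heapq

def partition_sites(sites: dict, n_workers: int):
    """Split sites (name -> estimated collection seconds) across workers
    by always giving the next-largest site to the least-loaded worker."""
    heap = [(0.0, w, []) for w in range(n_workers)]  # (load, worker_id, assigned)
    heapq.heapify(heap)
    for name, secs in sorted(sites.items(), key=lambda kv: -kv[1]):
        load, w, assigned = heapq.heappop(heap)  # least-loaded worker
        assigned.append(name)
        heapq.heappush(heap, (load + secs, w, assigned))
    return sorted(heap)  # per-worker (total_load, worker_id, site list)

sites = {"siteA": 120, "siteB": 90, "siteC": 60, "siteD": 45, "siteE": 30}
for load, w, assigned in partition_sites(sites, 2):
    print(w, load, assigned)  # loads end up balanced: 165 s vs 180 s
```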

A Method for Analyzing Web Logs on the Hadoop System for Effective Analysis of Web User Patterns (효과적인 웹 사용자의 패턴 분석을 위한 하둡 시스템의 웹 로그 분석 방안)

  • Lee, Byungju;Kwon, Jungsook;Go, Gicheol;Choi, Yonglak
    • Journal of Information Technology Services / v.13 no.4 / pp.231-243 / 2014
  • Of the various data that corporations can access, web log data are important for implementing customer relationship management strategies. As the volume of accessible data has increased exponentially with the Internet and the popularization of smartphones, web log data have also grown substantially. As a result, it has become difficult to expand storage flexibly enough to process large amounts of web log data, and extremely hard to implement a system capable of categorizing, analyzing, and processing web log data accumulated over a long period of time. This study therefore applies Hadoop, a distributed processing system that has recently come into the spotlight for its capacity to process large volumes of data, and proposes an efficient analysis plan for large amounts of web log data. The study examined the forms of web logs gathered by effective collection methods and the web log levels handled by Hadoop, and proposed corresponding analysis techniques and Hadoop system designs. The study resolves the difficulty of processing large amounts of web log data and derives the activity patterns of users through web log analysis, demonstrating its advantages as a new means of marketing.
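
The kind of job such a system runs can be illustrated with a tiny map/reduce pass over common-log-format lines, counting requests per hour to surface user activity patterns. This is a pure-Python sketch of the job logic only; in the paper's setting the equivalent mapper and reducer would run as Hadoop tasks over HDFS-stored logs, and the sample lines are fabricated:

```python
import re
from collections import Counter

# host ident user [dd/Mon/yyyy:HH:MM:SS zone] "METHOD path ..."
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[(\d{2})/(\w{3})/\d{4}:(\d{2}):\d{2}:\d{2} [^\]]+\] "(\w+) (\S+)'
)

def mapper(lines):
    """Map: emit (hour, 1) for every parseable request line."""
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            yield (m.group(3), 1)

def reducer(pairs):
    """Reduce: total hits per hour of day."""
    hits = Counter()
    for hour, n in pairs:
        hits[hour] += n
    return hits

logs = [
    '1.2.3.4 - - [10/Oct/2014:13:55:36 +0900] "GET /index.html HTTP/1.1" 200 2326',
    '1.2.3.5 - - [10/Oct/2014:13:58:02 +0900] "GET /cart HTTP/1.1" 200 512',
    '1.2.3.4 - - [10/Oct/2014:21:01:10 +0900] "POST /order HTTP/1.1" 200 128',
]
print(reducer(mapper(logs)))  # Counter({'13': 2, '21': 1})
```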

Design of Facial Image Data Collection System for Heart Rate Measurement (심박수 측정을 위한 안면 얼굴 영상 데이터 수집 시스템 설계)

  • Jang, Seung-Ju
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.7 / pp.971-976 / 2021
  • In this paper, we design a facial image data collection system for heart rate measurement using a web camera. The system collects the user's facial image information with a web camera and measures the heart rate from that information. Non-contact heart rate measurement with a web camera may introduce errors, so the collected data are classified into error and normal cases and used to correct errors in the heart rate program; the error-case data can be used to reduce the error. We conducted experiments on the ideas proposed and designed in this paper, and the results confirmed that the system operates normally.
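
Camera-based heart rate measurement typically extracts a pulse signal from the mean green-channel intensity of the face region over time. A minimal sketch of the rate-estimation step only, run here on a synthetic signal; the peak-counting heuristic and the parameters are illustrative assumptions, not the paper's method, and error correction from the classified data is not shown:

```python
import math

def estimate_bpm(green_means, fps: float) -> float:
    """Estimate heart rate from per-frame mean green-channel intensity of a
    face region: count local maxima and convert peaks/second to beats/minute."""
    peaks = sum(
        1 for i in range(1, len(green_means) - 1)
        if green_means[i - 1] < green_means[i] >= green_means[i + 1]
    )
    duration_s = len(green_means) / fps
    return 60.0 * peaks / duration_s

# Synthetic 10 s trace at 30 fps containing a 1.2 Hz (72 bpm) pulse component:
fps = 30.0
signal = [math.sin(2 * math.pi * 1.2 * (i / fps)) for i in range(300)]
print(estimate_bpm(signal, fps))  # 72.0
```

Real webcam signals are noisy, so practical pipelines band-pass filter the trace or locate the dominant frequency with an FFT rather than counting raw maxima.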