• Title/Summary/Keyword: 심층웹 (Deep Web)


Crawling Algorithm Design for Deep Web Document Collection (심층 웹 문서 수집을 위한 크롤링 알고리즘 설계)

  • Won, Dong-Hyun;Kang, Yun-Jeong;Park, Hyuk-Gyu
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.10a / pp.367-369 / 2022
  • With the development of web technology, the web provides customized information tailored to users' needs. Information is returned according to an input form and the user's query, and a web service that provides information that is difficult to reach with a search engine is called the deep web. The deep web contains far more information than the surface web, but it is difficult to collect with general crawling, which only captures a page as it appears at the time of the visit. Deep web sites deliver server-side information to users by running script languages such as JavaScript in the browser. In this paper, we propose an algorithm that can explore dynamically changing websites and collect their information by analyzing scripts. For the experiments, we analyzed the scripts of the bulletin board of the Korea Centers for Disease Control and Prevention.
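
The approach depends on letting a browser engine execute the page's scripts before harvesting the DOM. Below is a minimal sketch of that idea, not the authors' implementation, assuming Selenium with a headless Chrome driver; the URL and CSS selectors are hypothetical placeholders.

```python
# Let a headless browser execute the page's JavaScript first,
# then harvest the rendered DOM that a static crawler would miss.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")           # render without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.org/board")      # hypothetical dynamic bulletin board
    # Block until the script-rendered list actually exists in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "ul.board-list li"))
    )
    # Only now does the page contain the script-generated content.
    for item in driver.find_elements(By.CSS_SELECTOR, "ul.board-list li a"):
        print(item.text, item.get_attribute("href"))
finally:
    driver.quit()
```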

Data Mapping between Korea Deep Web Archiving Format and Reference Model for OAIS (국가 심층 웹기록물 보존 포맷과 OAIS 참조모델 간의 데이터 맵핑)

  • Park, Boung-Joo;Cha, Seung-Jun;Lee, Kyu-Chul
    • Proceedings of the Korean Information Science Society Conference / 2010.06c / pp.197-200 / 2010
  • As web technology has developed, public institution websites have moved beyond simple promotion of administrative agencies and now serve both as evidence of communication between the government and the public and as records of official business. Public institution websites should therefore be recognized and protected as public records. However, deep web records, one kind of public web record, compose different pages dynamically in real time, so they require collection, preservation, and utilization techniques that differ from existing preservation methods. To preserve such deep web records over the long term, the National Archives of Korea researched and developed KoDeWeb, a long-term preservation format for deep web records. Because KoDeWeb packages are electronic records, their authenticity and integrity must be guaranteed. In this study, to demonstrate the authenticity and integrity of KoDeWeb as an electronic record format, we mapped KoDeWeb onto the OAIS reference model, the international standard for electronic records. Furthermore, by using KoDeWeb in long-term electronic records preservation systems that follow the OAIS standard, the creation and collection of deep web records by government and public institutions can be systematized, and the format can also be applied to the long-term preservation of deep web records from privately operated websites.
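
The core of such a mapping is aligning each preservation-format element with the OAIS information-package component that guarantees the corresponding property, such as fixity for integrity and provenance for authenticity. The sketch below is purely illustrative: the KoDeWeb element names are hypothetical, since the format's actual schema is not reproduced here.

```python
# Illustrative crosswalk from hypothetical KoDeWeb elements to the
# OAIS archival information package (AIP) components they would map to.
KODEWEB_TO_OAIS = {
    "capturedContent":      "Content Information / Data Object",
    "renderingEnvironment": "Content Information / Representation Information",
    "recordIdentifier":     "PDI / Reference Information",
    "captureHistory":       "PDI / Provenance Information",
    "relatedRecords":       "PDI / Context Information",
    "checksum":             "PDI / Fixity Information",
    "containerManifest":    "Packaging Information",
    "catalogMetadata":      "Descriptive Information",
}

def oais_component(kodeweb_element: str) -> str:
    """Return the OAIS component an element maps to, or 'unmapped'."""
    return KODEWEB_TO_OAIS.get(kodeweb_element, "unmapped")

print(oais_component("checksum"))   # -> PDI / Fixity Information
```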

Metadata Design for Archiving Public Deep Web Records (공공기관 심층 웹기록물 아카이빙을 위한 메타데이터 설계)

  • Cha, Seung-Jun;Choi, Yun-Jeong;Lee, Kyu-Chul
    • The Journal of Society for e-Business Studies / v.14 no.4 / pp.181-193 / 2009
  • As website technologies have developed, public institutions use websites to carry out their business and as a pathway between the government and the people. Public web records are the results of business processes conducted over the websites of public institutions. Although they carry much valuable information, they vanish easily because there are not yet proper methods and tools for their preservation. The purpose of this paper is to design the metadata elements required when archiving deep web records, one kind of web record. To that end, we first analyze related overseas research to define what public deep web records are. We then define metadata elements for them and explain their relationship to the Korean archival information package and to Dublin Core metadata in order to support interoperability. The defined metadata can serve as a basis technology for archiving domestic public web records.
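
Interoperability of this kind is typically achieved with a crosswalk that records, for each local metadata element, its Dublin Core equivalent. The following is a hedged sketch of that idea; the element names and sample values are illustrations, not the paper's actual schema.

```python
# Build a deep-web-record metadata document whose local elements
# each carry their Dublin Core equivalent for interoperability.
import xml.etree.ElementTree as ET

record = ET.Element("deepWebRecord")
# Hypothetical crosswalk: (local element, Dublin Core equivalent, sample value)
crosswalk = [
    ("recordTitle",  "dc:title",      "Civil petition board snapshot"),
    ("producer",     "dc:creator",    "Example Public Institution"),
    ("captureDate",  "dc:date",       "2009-11-30"),
    ("sourceUrl",    "dc:identifier", "https://example.go.kr/board?id=42"),
    ("recordFormat", "dc:format",     "application/xhtml+xml"),
]
for local, dc, value in crosswalk:
    elem = ET.SubElement(record, local)
    elem.text = value
    elem.set("dcMapping", dc)   # record the Dublin Core equivalent inline

print(ET.tostring(record, encoding="unicode"))
```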

Crawling algorithm design and experiment for automatic deep web document collection (심층 웹 문서 자동 수집을 위한 크롤링 알고리즘 설계 및 실험)

  • Kang, Yun-Jeong;Lee, Min-Hye;Won, Dong-Hyun
    • Journal of the Korea Institute of Information and Communication Engineering / v.27 no.1 / pp.1-7 / 2023
  • Deep web collection means entering a query into a search form and collecting the response results. The deep web is estimated to hold about 450 to 550 times more information than the statically constructed surface web. A static page does not show changed information until it is refreshed, whereas a dynamic page updates the necessary information in real time without reloading; a crawler, however, has difficulty accessing that updated information. A way to collect information from the deep web automatically with a crawler is therefore needed. This paper proposes a method of treating client scripts as ordinary links, and designs and tests an algorithm that can follow client scripts the way it follows regular URLs. The proposed algorithm focuses on collecting web information through menu navigation and script execution instead of the usual method of entering data into search forms.
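
The key move is to harvest each element's client script and "follow" it the way a crawler follows an href. Here is a minimal sketch of that idea, again assuming Selenium; the URL and the reset-and-execute strategy are illustrative assumptions, not the paper's implementation.

```python
# Treat client scripts as links: harvest onclick handlers, execute
# each one like following a URL, and collect the resulting page state.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
pages = []
try:
    start = "https://example.org/menu"      # hypothetical script-driven menu page
    driver.get(start)
    # Collect every onclick handler the way a crawler collects hrefs.
    scripts = [el.get_attribute("onclick")
               for el in driver.find_elements(By.CSS_SELECTOR, "[onclick]")]
    for script in scripts:
        driver.get(start)                   # reset to the starting state
        driver.execute_script(script)       # "follow" the script as if it were a link
        pages.append(driver.page_source)    # harvest the state it produced
finally:
    driver.quit()
print(f"collected {len(pages)} script-reachable page states")
```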

The Development of Automatic Collection Method to Collect Information Resources for Web Archiving: With Focus on Disaster Safety Information (웹 아카이빙을 위한 정보자원의 자동수집방법 개발 - 재난안전정보를 중심으로 -)

  • Lee, Su Jin;Han, Hui Lyeong;Sim, Min Jeong;Won, Dong Hyun;Kim, Yong
    • Journal of Korean Society of Archives and Records Management / v.17 no.4 / pp.1-26 / 2017
  • This study aims to provide an efficient method for sharing and utilizing disaster information scattered across institutions, and to develop an automated collection algorithm that uses a web crawler to gather disaster information held in the deep web. To achieve these goals, the study analyzes the logical structure of the deep web and develops an algorithm to collect the information. The proposed automatic collection algorithm is expected to support disaster management by enabling the sharing and utilization of disaster safety information.
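
Collection of this kind usually reduces to reproducing the query that makes a deep web page exist at all: submitting the search form programmatically and parsing the response. Below is a hedged sketch assuming the requests and BeautifulSoup libraries; the endpoint, form field names, and result selector are hypothetical placeholders.

```python
# Query a search form directly and parse the response rows:
# the deep web page only exists as a response to this query.
import requests
from bs4 import BeautifulSoup   # assumes beautifulsoup4 is installed

FORM_URL = "https://example.go.kr/disaster/search"   # hypothetical endpoint

def collect(keyword: str) -> list[str]:
    resp = requests.post(FORM_URL, data={"query": keyword}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Return the titles of result rows (selector is illustrative).
    return [row.get_text(strip=True) for row in soup.select("td.title")]

for title in collect("지진"):   # "earthquake"
    print(title)
```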

Web Archiving: What We Have Done and What We Should Do (웹 아카이빙의 성과와 과제)

  • Suh, Hye-Ran
    • Journal of the Korean BIBLIA Society for Library and Information Science / v.15 no.1 / pp.5-22 / 2004
  • The purpose of this study is to review what we have done and to identify what we must still do to succeed at Web archiving, which is important for preserving our cultural heritage for the next generation. Some characteristics of Web resources as information sources were identified, and some difficulties with Web archiving were discussed. The outcomes of national and international Web archiving projects, including Kulturarw3, PANDORA, and the Internet Archive, were reviewed. The policy issues and technological problems of Web archiving that remain to be solved were listed.

Comparison and Analysis of Algorithms for Web Site Navigation (웹 사이트 탐색 알고리즘 비교분석)

  • 김덕수;권영직
    • Journal of Korea Society of Industrial Information Systems / v.8 no.3 / pp.91-98 / 2003
  • Visitors who browse the web from wireless PDAs and cell phones are frequently frustrated by the interfaces. Simply replacing graphics with text and reformatting tables does not solve this problem, because deep link structures can still demand considerable navigation time. To solve this problem, this paper proposes the Minimal Path Algorithm, which automatically improves wireless web navigation by suggesting useful shortcut links in real time. As a result, the Minimal Path Algorithm offers web users shortcuts built from the smallest number of links.
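
At its core, suggesting a shortcut is a shortest-path search over the site's link graph. The following plain breadth-first search is our own reconstruction of that idea, not the paper's Minimal Path implementation.

```python
# BFS over a site link graph: the first path that reaches the goal
# uses the fewest clicks, so it can be suggested as a shortcut chain.
from collections import deque

def minimal_path(links: dict[str, list[str]], start: str, goal: str) -> list[str]:
    """Return the fewest-click page sequence from start to goal, or []."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []

# Toy site graph: suggest home -> products -> phones as the shortcut chain.
site = {"home": ["news", "products"], "products": ["phones"], "news": []}
print(minimal_path(site, "home", "phones"))   # ['home', 'products', 'phones']
```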

Deep Analysis of Question for Question Answering System (질의 응답 시스템을 위한 질의문 심층 분석)

  • Shin Seung-Eun;Seo Young-Hoon
    • The Journal of the Korea Contents Association / v.6 no.3 / pp.12-19 / 2006
  • In this paper, we describe a deep analysis of questions for a question answering system. It is difficult to offer the correct answer because general question answering systems do not analyze the semantics of the user's natural language question. We analyze the user's question semantically and extract semantic features using a semantic feature extraction grammar and the characteristics of natural language questions. The features are represented as semantic features and grammatical morphemes that reflect the semantic and syntactic structure of the question. We evaluated our approach on 100 web questions whose answer type is a person, and showed that deep analysis of questions that are comparatively short yet semantically sufficient can identify the user's intention and extract semantic features.
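
Rule-based question analysis of this kind can be pictured as patterns that fire on a question and contribute semantic features. The toy English-language sketch below is our own illustration; the paper's grammar targets Korean morphemes and is not reproduced here.

```python
# Toy semantic feature extraction: each rule that matches the question
# contributes features such as the expected answer type or a relation.
import re

RULES = [
    # (pattern over the question, semantic features it signals)
    (re.compile(r"^who\b", re.I),                 {"answer_type": "PERSON"}),
    (re.compile(r"\b(invented|founded)\b", re.I), {"relation": "creator"}),
    (re.compile(r"\bpresident of\b", re.I),       {"relation": "leader_of"}),
]

def analyze(question: str) -> dict:
    features = {}
    for pattern, feats in RULES:
        if pattern.search(question):
            features.update(feats)
    return features

print(analyze("Who invented the telephone?"))
# -> {'answer_type': 'PERSON', 'relation': 'creator'}
```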

Extension of the Long-term Archival Information Package for Electronic Records to Accommodate Web Records (웹기록물 보존을 위한 전자기록물 장기보존포맷 확장 설계)

  • Park, Boung-Joo;Cha, Seung-Jun;Lee, Kyu-Chul
    • The Journal of Society for e-Business Studies / v.15 no.4 / pp.33-47 / 2010
  • Web records are valuable information worth preserving because they can serve as legal evidence of a public institution's business or e-commerce, but their volatile nature means they disappear easily. An archival information package should therefore be defined for their long-term preservation. Web records are a kind of electronic record, so in principle they can be stored in the archival information package for electronic records. However, the NEO (NARS Encapsulation Object), Korea's archival information package for electronic records, cannot store web records because it was developed without considering their characteristics. In this paper, we define an extended NEO based on an analysis of KoSurWeb and KoDeWeb, the archival information packages for surface and deep web documents, together with the NEO itself. Using the extended NEO, web records can be preserved and utilized alongside electronic records, and the archived web records of public institutions' e-commerce can serve as records and legal evidence.
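
The extension amounts to adding a web-record slot to the existing encapsulation structure, so that a deep web capture travels inside the same package as ordinary electronic documents. The sketch below is purely illustrative: the element names are hypothetical and do not reproduce the actual NEO or KoDeWeb schemas.

```python
# Illustrative encapsulation: an extended package that holds a deep
# web record alongside an ordinary electronic document.
import xml.etree.ElementTree as ET

package = ET.Element("archivalPackage")   # stands in for an extended NEO
header = ET.SubElement(package, "header")
ET.SubElement(header, "recordId").text = "2010-000001"
ET.SubElement(header, "fixity", algorithm="SHA-256").text = "0" * 64  # placeholder digest

content = ET.SubElement(package, "content")
ET.SubElement(content, "electronicDocument", path="report.pdf")  # ordinary record
web = ET.SubElement(content, "webRecord", kind="deep")           # the extension point
ET.SubElement(web, "sourceUrl").text = "https://example.go.kr/board?id=42"
ET.SubElement(web, "capturedAt").text = "2010-06-15T09:00:00"

print(ET.tostring(package, encoding="unicode"))
```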

Self-disclosure and Privacy in the Age of Web 2.0: A Case Study (웹 2.0 시대의 프라이버시: 청년 UCC 이용자들의 인식과 실천을 중심으로)

  • Lee, Dong-Hoo
    • Korean Journal of Communication and Information / v.46 / pp.556-589 / 2009
  • With the advent of the so-called Web 2.0 age, user-participatory services, from blogs, web-based communities, picture sharing sites, and social networking sites to sites for collective knowledge production, have been further vitalized, along with the interconnection of various contents on the web. As User Generated Contents (UGCs) flourish on the web, they have channeled users' desires for self-expression and social acknowledgement, and yet have also created new kinds of privacy invasion. This study looks at how networked individuals' everyday perceptions of privacy have been reconstructed in the age of Web 2.0. By investigating how users have employed UGCs for sociality on the web and how they have set the boundaries between the private and the public in these public or semi-public self-disclosures, it traces the changing perception of privacy in everyday communication practices. For this study, Korean youngsters in their 10s and 20s, who have grown up with the Internet and treat self-expression and social communication on the web as everyday activities, were interviewed. Based on the interviews, the study inquires into the current notion of privacy and discusses its cultural implications.
