Search | Korea Science

Seo Hae-Sung;Choi Young-Soo;Choi Kyung-Hee;Jung Gi-Hyun;Noh Sang-Uk
- Journal of Internet Computing and Services
- /
- v.7 no.3
- /
- pp.155-167
- /
- 2006
It is desirable if users surfing on the Internet could find Web pages related to their interests as closely as possible. Toward this ends, this paper presents a topic specific Web crawler computing the degree of relevance. collecting a cluster of pages given a specific topic, and refining the preliminary set of related web pages using term frequency/document frequency, entropy, and compiled rules. In the experiments, we tested our topic specific crawler in terms of the accuracy of its classification, crawling efficiency, and crawling consistency. First, the classification accuracy using the set of rules compiled by CN2 was the best, among those of C4.5 and back propagation learning algorithms. Second, we measured the classification efficiency to determine the best threshold value affecting the degree of relevance. In the third experiment, the consistency of our topic specific crawler was measured in terms of the number of the resulting URLs overlapped with different starting URLs. The experimental results imply that our topic specific crawler was fairly consistent, regardless of the starting URLs randomly chosen.
PDF

Cho, Chang-Hee;Lee, Nam-Yong;Kang, Jin-Bum;Yang, Jae-Young;Choi, Joong-Min
- The KIPS Transactions:PartB
- /
- v.12B no.6 s.102
- /
- pp.697-702
- /
- 2005
The focused crawler is a topic-driven document-collecting crawler that was suggested as a promising alternative of maintaining up-to-date web document Indices in search engines. A major problem inherent in previous focused crawlers is the liability of missing highly relevant documents that are linked from off-topic documents. This problem mainly originated from the lack of consideration of structural information in a document. Traditional weighting method such as TFIDF employed in document classification can lead to this problem. In order to improve the performance of focused crawlers, this paper proposes a scheme of locality-based document segmentation to determine the relevance of a document to a specific topic. We segment a document into a set of sub-documents using contextual features around the hyperlinks. This information is used to determine whether the crawler would fetch the documents that are linked from hyperlinks in an off-topic document.
https://doi.org/10.3745/KIPSTB.2005.12B.6.697 인용 PDF KSCI

Chang, Moon-Soo
- Journal of the Korean Institute of Intelligent Systems
- /
- v.18 no.1
- /
- pp.86-93
- /
- 2008
The construction of ontology defines as complicated semantic relations needs precise and expert skills. For the well defined ontology in real applications, plenty of information of instances for ontology classes is very critical. In this study, crawling algorithm which extracts the fittest topic from the Web overflowing over by a great number of documents has been focused and developed. Proposed crawling algorithm made a progress to gather documents at high speed by extracting topic-specific Link using URL patterns. And topic fitness of Link block text has been represented by fuzzy sets which will improve a precision of the focused crawler.
https://doi.org/10.5391/JKIIS.2008.18.1.086 인용 PDF KSCI