• Title/Summary/Keyword: Web text mining


Performance analysis of volleyball games using the social network and text mining techniques (사회네트워크분석과 텍스트마이닝을 이용한 배구 경기력 분석)

  • Kang, Byounguk;Huh, Mankyu;Choi, Seungbae
    • Journal of the Korean Data and Information Science Society / v.26 no.3 / pp.619-630 / 2015
  • The purpose of this study is to provide basic information for developing a team's future game strategy by identifying the attack and pass patterns of national men's professional volleyball teams and extracting core keywords related to volleyball game performance, using 'social network analysis' and 'text mining'. 'Social network analysis' of the whole data set partitioned the players into group '0' (6 players) and group '1' (11 players). In terms of degree centrality and betweenness centrality, group '1' showed more active game performance than group '0'. The 'text mining' results for the two outcomes (win and loss) differed significantly between the two groups ('0' and '1') identified by 'social network analysis' (p-value: 0.001). As for clustering of each network, group '0' tended to score points through set players D and E. In group '1', player K tended to fail when he attacked through a 'dig', while players C and D performed well through 'set' plays.
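
Illustrative sketch (not from the paper): the abstract compares groups by degree centrality. Assuming passes are recorded as undirected player pairs, normalized degree centrality can be computed as below. The function name and the player labels are made up for the example.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph given as edge pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Degree divided by the maximum possible degree (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical pass network between four players.
passes = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P4", "P2")]
centrality = degree_centrality(passes)
```

Here `P2` exchanges passes with every other player, so it has the highest centrality in this toy network.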

A Study on Extracting News Contents from News Web Pages (뉴스 웹 페이지에서 기사 본문 추출에 관한 연구)

  • Lee, Yong-Gu
    • Journal of the Korean Society for information Management / v.26 no.1 / pp.305-320 / 2009
  • The news pages provided through the web contain unnecessary information, which lowers the performance and efficiency of news processing systems. This study proposes news content extraction methods based on sentence identification and on the block-level tags of news web pages. To obtain optimal performance, combinations of these methods were applied. The results showed good performance for an extraction method that applied sentence identification and eliminated hyperlink text from web pages. Moreover, this method showed better results when combined with the extraction method that used block-level tags. Extraction methods that used sentence identification were effective for raising extraction recall.
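
Illustrative sketch (not the paper's implementation): one way to combine the three ideas named in the abstract, using only the Python standard library. Text is grouped by block-level tags, hyperlink text is dropped, and a crude sentence test stands in for sentence identification. The tag set, the sentence heuristic, and all names are assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical choice of block-level tags; the abstract does not list the tags used.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}


class NewsTextExtractor(HTMLParser):
    """Collect text per block-level element, skipping hyperlink text."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished text blocks
        self._buffer = []       # text pieces of the block being read
        self._link_depth = 0    # > 0 while inside an <a> element

    def _flush(self):
        # Join buffered pieces, normalize whitespace, and store non-empty text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._link_depth == 0:  # keep only text outside hyperlinks
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def looks_like_sentence(text):
    # Crude stand-in for sentence identification: several words plus end punctuation.
    return len(text.split()) >= 4 and text.rstrip().endswith((".", "!", "?"))


def extract_news_text(html):
    parser = NewsTextExtractor()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if looks_like_sentence(block)]
```

Navigation bars and "read more" links consist mostly of hyperlink text, so they tend to be filtered out, while paragraphs of running text survive.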

Analysis of Social Media Utilization based on Big Data-Focusing on the Chinese Government Weibo

  • Li, Xiang;Guo, Xiaoqin;Kim, Soo Kyun;Lee, Hyukku
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.8 / pp.2571-2586 / 2022
  • The rapid popularity of government social media has generated huge amounts of text data, and the analysis of these data has gradually become the focus of digital government research. This study uses the Python language to analyze big data from the Weibo accounts of Chinese provincial governments. First, this study uses a web crawler to collect and statistically describe over 360,000 records from 31 provincial government microblogs in China, covering the period from January 2018 to April 2022. Second, a word segmentation engine is constructed, and the text data are analyzed using word cloud word frequencies as well as semantic relationships. Finally, the text data are analyzed for sentiment using natural language processing methods, and the text topics are studied using the LDA algorithm. The results of this study show that, first, the number and scale of posts on the Chinese government Weibo have grown rapidly. Second, government Weibo has certain social attributes, and epidemics, people's livelihood, and services have become the focus of government Weibo. Third, the contents of government Weibo account for more than 30% of negative sentiments. The classified topics show that epidemics and epidemic prevention and control overshadowed the other topics, which inhibits the diversification of government Weibo.

Analysis of Consumer Awareness of Cycling Wear Using Web Mining (웹마이닝을 활용한 사이클웨어 소비자 인식 분석)

  • Kim, Chungjeong;Yi, Eunjou
    • Journal of the Korea Academia-Industrial cooperation Society / v.19 no.5 / pp.640-649 / 2018
  • This study analyzed consumer awareness of cycling wear using web mining, one of the big data analysis methods. For this, the texts of postings and comments related to cycling wear from 2006 to 2017 in the Naver cafe 'people who commute by bicycle' were collected and analyzed using R packages. A total of 15,321 documents were used for data analysis. The keywords of cycling wear were extracted using a Korean morphological analyzer (KoNLP) and converted to a TDM (Term Document Matrix) and a co-occurrence matrix to calculate the frequency of the keywords. The most frequent keyword in cycling wear was 'tights', including the opinion that wearers feel embarrassed because they are too tight. When purchasing cycling wear, consumers appeared to consider 'price', 'size', and 'brand'. Since 2016, 'low price' and 'cost effectiveness' have become more frequent than before, which indicates that consumers tend to prefer practical products. Moreover, the findings showed that it is necessary to improve not only the design and wearability, but also the material functionality, such as sweat absorbance and quick drying, and the function of the pad. These results were similar to those of previous studies using a questionnaire. Therefore, web mining is expected to serve as an objective indicator that can be reflected in product development through real-time analysis of the opinions and requirements of consumers.
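
Illustrative sketch (not from the paper, which used R packages and KoNLP): how a term-document matrix and a co-occurrence count can be derived from keyword lists, here in plain Python. The keyword lists are toy data invented for the example, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def cooccurrence(docs):
    """Count, for each unordered keyword pair, the documents containing both."""
    pairs = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy, made-up keyword lists standing in for morphologically analyzed posts.
docs = [
    ["tights", "price", "size"],
    ["tights", "brand", "price"],
    ["pad", "tights"],
]
terms, tdm = term_document_matrix(docs)
# Summing each row of the matrix gives the overall keyword frequency.
term_frequency = {t: sum(row) for t, row in zip(terms, tdm)}
pairs = cooccurrence(docs)
```

Row sums of the matrix give keyword frequencies, and the pair counts show which keywords tend to appear in the same post.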

Terminology Recognition System based on Machine Learning for Scientific Document Analysis (과학 기술 문헌 분석을 위한 기계학습 기반 범용 전문용어 인식 시스템)

  • Choi, Yun-Soo;Song, Sa-Kwang;Chun, Hong-Woo;Jeong, Chang-Hoo;Choi, Sung-Pil
    • The KIPS Transactions:PartD / v.18D no.5 / pp.329-338 / 2011
  • Terminology recognition, a prerequisite for text mining, information extraction, information retrieval, the semantic web, and question answering, has been intensively studied in a limited range of domains, especially the biomedical domain. Previous approaches were difficult to apply to general domains because their resources were domain specific. We therefore propose a domain-independent terminology recognition system based on machine learning that uses a dictionary, syntactic features, and Web search results. Compared with the related approach C-value, which has been widely used and is based on local domain frequencies, the proposed approach achieved an F-score of 80.8, a 6.5% improvement. In the second experiment with various combinations of unithood features, the method combined with NGD (Normalized Google Distance) showed the best performance, an F-score of 81.8. We applied three machine learning methods, logistic regression, C4.5, and SVMs, and obtained the best score with the decision tree method, C4.5.
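
Illustrative sketch (not from the paper): the abstract names NGD as a unithood feature but does not say how the counts were obtained or how the feature was combined with others. The function below implements the standard NGD definition from search hit counts; the function and parameter names are assumptions, and `n` is assumed to be larger than either single-term count.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts: fx and fy for each term, fxy for both, n = index size."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # terms never observed (together): treat as unrelated
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

Values near 0 mean the two words almost always occur together, which suggests they form one unit. Larger values mean they co-occur rarely.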

Web crawler Improvement and Dynamic process Design and Implementation for Effective Data Collection (효과적인 데이터 수집을 위한 웹 크롤러 개선 및 동적 프로세스 설계 및 구현)

  • Wang, Tae-su;Song, JaeBaek;Son, Dayeon;Kim, Minyoung;Choi, Donggyu;Jang, Jongwook
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.11 / pp.1729-1740 / 2022
  • Recently, the diversity and utilization of information have generated large amounts of data, the importance of big data analysis to collect, store, process, and predict data has increased, and the ability to collect only the necessary information is required. More than half of the web space consists of text, and a lot of data is generated through the organic interaction of users. Crawling is a representative technique for collecting text data, but many crawlers are developed without consideration for web servers or administrators because they focus only on obtaining data. In this paper, we design and implement an improved dynamic web crawler that fetches data efficiently, after examining the problems that may occur during crawling and the precautions to be taken. The crawler, which addresses the problems of existing crawlers, was designed as a multi-process program, and the working time was reduced by a factor of four on average.

Protein Named Entity Identification Based on Probabilistic Features Derived from GENIA Corpus and Medical Text on the Web

  • Sumathipala, Sagara;Yamada, Koichi;Unehara, Muneyuki;Suzuki, Izumi
    • International Journal of Fuzzy Logic and Intelligent Systems / v.15 no.2 / pp.111-120 / 2015
  • Protein named entity identification is one of the most essential and fundamental prerequisites for extracting information about protein-protein interactions from biomedical literature. In this paper, we explore the use of abstracts of biomedical literature in MEDLINE for protein name identification and present the results of the conducted experiments. We present a robust and effective approach to classify biomedical named entities into protein and non-protein classes, based on a rich set of features: orthographic, keyword, morphological, and newly introduced Protein-Score features. Our procedure shows significant performance in the experiments on the GENIA corpus using Random Forest, achieving the highest values of precision 92.7%, recall 91.7%, and F-measure 92.2% for protein identification, while reducing the training and testing time significantly.
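
Illustrative sketch (not from the paper): assuming the reported F-measure is the balanced F1 score, it can be recomputed from the reported precision and recall as their harmonic mean. The function name and the `beta` parameter are additions for the example.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f_measure(92.7, 91.7)  # about 92.2, consistent with the reported F-measure
```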

A Web Text Mining Technique using Semantic Relations based on WordNet and Text Corpus (WordNet과 텍스트 코퍼스에 기반한 의미 관계를 활용한 웹 텍스트 조사 기법)

  • Lee, Ho-Suk;Kim, Yung-Taek
    • Proceedings of the Korean Information Science Society Conference / 2007.06c / pp.181-184 / 2007
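  • This paper discusses a meaning-centered web text retrieval technique that generates semantic relations through sentence analysis and takes similar semantic relations into account through a semantic network. Existing web text retrieval can be said to have considered only words or only semantic relations. However, generating semantic relations through sentence analysis and considering similar semantic relations through a semantic network is expected to overcome the limits of existing word-centered or semantic-relation-centered retrieval and to enable more comprehensive and hierarchical retrieval that takes similar semantic relations into account.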


Development of e-Mail Classifiers for e-Mail Response Management Systems (전자메일 자동관리 시스템을 위한 전자메일 분류기의 개발)

  • Kim, Kuk-Pyo;Kwon, Young-S.
    • Journal of Information Technology Services / v.2 no.2 / pp.87-95 / 2003
  • With the increasing proliferation of the World Wide Web, electronic mail systems have become very widely used communication tools. Research on e-mail classification is important because an e-mail classification system is a major engine of e-mail response management systems, which mine unstructured e-mail messages and automatically categorize them. In this research, we develop e-mail classifiers for e-mail Response Management Systems (ERMS) using naive Bayesian learning and centroid-based classification. We analyze which method performs better under which conditions by comparing classification accuracies, which may depend on the structure and size of the training data set and the number of classes, using different data sets from an online shopping mall and a credit card company. The developed e-mail classifiers have been successfully implemented in practice. The experimental results show that naive Bayesian learning performs better, while centroid-based classification is more robust in terms of classification accuracy.

Pilot Experiment for Named Entity Recognition of Construction-related Organizations from Unstructured Text Data

  • Baek, Seungwon;Han, Seung H.;Jung, Wooyong;Kim, Yuri
    • International conference on construction engineering and project management / 2022.06a / pp.847-854 / 2022
  • The aim of this study is to develop a Named Entity Recognition (NER) model to automatically identify construction-related organizations in news articles. This study collected news articles using a web crawling technique, and construction-related organizations were labeled in a total of 1,000 news articles. The Bidirectional Encoder Representations from Transformers (BERT) model was used to recognize clients, constructors, consultants, engineers, and others. In this pilot experiment, the best average F1 score of NER was 0.692. The result of this study is expected to contribute to the establishment of international business strategies by collecting timely information and analyzing it automatically.
