• Title/Summary/Keyword: Doc2vec


Development of an Intelligent Illegal Gambling Site Detection Model Based on Tag2Vec (Tag2vec 기반의 지능형 불법 도박 사이트 탐지 모형 개발)

  • Song, ChanWoo;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.28 no.4 / pp.211-227 / 2022
  • Illegal gambling through online gambling sites has become a significant social problem. The development of Internet technology and the spread of smartphones have led to the proliferation of illegal gambling sites, making illegal online gambling accessible to anyone. To mitigate its negative effects, the Korean government is trying to detect illegal gambling sites by using self-monitoring agents or reporting systems such as 'Nuricops.' However, it is difficult to detect all illegal sites because of limitations such as a lack of staffing. Accordingly, several scholars have proposed intelligent illegal gambling site detection techniques. Xu et al. (2019) found that fake or illegal websites generally have unique features in their HTML tag structure, which implies that the HTML tag structure can be important for detecting illegal sites. However, few prior studies have used the HTML tag structure to improve the performance of illegal site detection models. Against this background, our study aims to improve model performance by utilizing the HTML tag structure and proposes Tag2Vec, a modified version of Doc2Vec, as a methodology for properly vectorizing the HTML tag structure. To validate the proposed model, we perform an empirical analysis using a data set consisting of harmful sites listed by 'The Cheat' and normal sites collected through Google search. The results confirm that the proposed Tag2Vec-based detection model achieves better classification accuracy, recall, and F1 score than the URL-based detection model used for comparison. The proposed model is expected to be effectively utilized to improve the health of our society through intelligent technology.
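
The abstract above treats a page's HTML tag structure as the input to a Doc2Vec-style model. The abstract does not describe the actual Tag2Vec preprocessing, so the following is only a minimal sketch of one plausible first step: flattening a page into a sequence of tag names that could then be fed to an embedding model as a "document". The class name `TagSequenceParser`, the function `tag_sequence`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.tags.append(tag)


def tag_sequence(html: str) -> list[str]:
    """Return the page's opening tags as a flat token sequence."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


# Toy page, not taken from the study's data set.
sample = "<html><body><div><a href='#'>bet</a><img src='x.png'></div></body></html>"
print(tag_sequence(sample))  # ['html', 'body', 'div', 'a', 'img']
```

Each such tag sequence would play the role that a word sequence plays in ordinary Doc2Vec training; how the study handles nesting depth, attributes, or closing tags is not stated in the abstract.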

Improvement of a Product Recommendation Model using Customers' Search Patterns and Product Details

  • Lee, Yunju;Lee, Jaejun;Ahn, Hyunchul
    • Journal of the Korea Society of Computer and Information / v.26 no.1 / pp.265-274 / 2021
  • In this paper, we propose a novel recommendation model based on Doc2vec that uses search keywords and product details. Many prior studies on recommender systems have proposed collaborative filtering (CF) as the main recommendation algorithm, using only structured input data such as customers' purchase history or ratings. However, using unstructured data such as online customer reviews in CF may lead to better recommendations. Against this background, we propose using search keyword data and product detail information, which have seldom been used in previous studies, for product recommendation. The proposed model makes recommendations with CF that simultaneously considers ratings, search keywords, and detailed information about the products purchased by customers. Doc2vec is applied to extract quantitative patterns from these unstructured data. In the experiment, the proposed model outperformed the conventional recommendation model, and search keywords and product details were confirmed to have a significant effect on recommendation. This study has academic significance in that it applies customers' online behavior information to the recommendation system and mitigates the cold start problem, one of the critical limitations of CF.

Multiple Fusion-based Deep Cross-domain Recommendation (다중 융합 기반 심층 교차 도메인 추천)

  • Hong, Minsung;Lee, WonJin
    • Journal of Korea Multimedia Society / v.25 no.6 / pp.819-832 / 2022
  • Cross-domain recommender systems transfer knowledge across different domains to improve recommendation performance in a target domain that has a relatively sparse model. However, they suffer from "negative transfer," in which transferred knowledge operates as noise. This paper proposes a novel Multiple Fusion-based Deep Cross-Domain Recommendation model named MFDCR. We exploit Doc2Vec, one of the well-known word embedding techniques, to fuse data user-wise and transfer knowledge across multiple domains, which alleviates the "negative transfer" problem. Additionally, we introduce a simple multi-layer perceptron to learn user-item interactions and predict the probability that users will prefer items. Extensive experiments with three domain datasets from Amazon, one of the most famous services, demonstrate that MFDCR outperforms recent single-domain and cross-domain recommendation algorithms. Furthermore, experimental results show that MFDCR can address the problem of "negative transfer" and improve recommendation performance for multiple domains simultaneously. In addition, we show that our approach extends efficiently to more domains.

A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning (토픽모델링과 딥 러닝을 활용한 생의학 문헌 자동 분류 기법 연구)

  • Yuk, JeeHee;Song, Min
    • Journal of the Korean Society for Information Management / v.35 no.2 / pp.63-88 / 2018
  • This research evaluated differences in classification performance across feature selection methods (the LDA topic model and Doc2Vec, which is based on word embedding using deep learning), feature corpus sizes, and classification algorithms. In addition, to find the feature corpus with the highest classification performance, experiments were conducted with feature corpora composed differently according to location within the document and with varying feature corpus sizes. The deep learning experiments evaluated training frequency and the information specifically considered for context inference. This study constructed a biomedical document dataset, Disease-35083, consisting of biomedical scholarly documents provided by PMC and categorized by disease category. The study verifies which type and size of feature corpus produce the highest performance and also suggests feature corpora that are extensible to specific features by demonstrating efficiency in training time. Additionally, this research compares deep learning with existing methods and suggests an appropriate method for each classification environment.

Identify the Failure Mode of Weapon System (or equipment) using Machine Learning (Machine Learning을 이용한 무기 체계(or 구성품) 고장 유형 식별)

  • Park, Yun-Kyung;Lee, Hye-Won;Kim, Sang-Moon
    • Journal of the Korea Academia-Industrial cooperation Society / v.19 no.8 / pp.64-70 / 2018
  • During the development of weapon systems (or components), the number of tests is constrained by the limited development period and cost, which reduces the amount of accumulated failure data. Nevertheless, because a large amount of failure data and maintenance details from the operational period are managed as computerized data, the causes of failure of weapon systems (or components) can be analyzed using those data. On the other hand, analyzing the failure and maintenance details of various weapon systems is difficult because they vary among groups and companies and because the causes of failure are described as unstructured text. Fortunately, recent developments in big data processing technology, machine learning algorithms, and hardware computing power have supported substantial research into methods for processing such unstructured data. In this paper, doc2vec, a machine learning technique, is applied to unstructured data related to the failure and maintenance of defense weapon systems (or components) to analyze failure cases.

Related Documents Classification System by Similarity between Documents (문서 유사도를 통한 관련 문서 분류 시스템 연구)

  • Jeong, Jisoo;Jee, Minkyu;Go, Myunghyun;Kim, Hakdong;Lim, Heonyeong;Lee, Yurim;Kim, Wonil
    • Journal of Broadcast Engineering / v.24 no.1 / pp.77-86 / 2019
  • This paper proposes using machine learning to analyze previously collected documents and classify related documents based on them. Data are collected using keywords associated with a specific domain, and non-conceptual elements such as special characters are removed. Each word of the collected documents is then tagged using a Korean morphological analyzer (nouns, verbs, and sentences). The documents are embedded using a Doc2Vec model, which converts documents into vectors. The similarity between documents is measured through the embedding model, and a document classifier is trained using machine learning algorithms. In a comparison of the trained classification models, the best-performing one, a support vector machine, achieved an F1-score of 0.83.
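
The abstract above measures similarity between embedded documents but does not name the similarity measure. Cosine similarity is a common choice for Doc2Vec vectors, so the sketch below assumes it. The function names `cosine_similarity` and `most_similar` and the toy vectors are hypothetical and stand in for real Doc2Vec output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query, doc_vectors):
    """Return the key of the stored vector closest to the query."""
    return max(doc_vectors, key=lambda name: cosine_similarity(query, doc_vectors[name]))


# Toy 2-dimensional "document vectors"; real embeddings have far more dimensions.
doc_vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}
print(most_similar([0.9, 0.1], doc_vectors))
```

A classifier such as the support vector machine mentioned in the abstract would typically be trained on the document vectors themselves; the similarity function is useful for retrieving related documents.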

Application of Text-Classification Based Machine Learning in Predicting Psychiatric Diagnosis (텍스트 분류 기반 기계학습의 정신과 진단 예측 적용)

  • Pak, Doohyun;Hwang, Mingyu;Lee, Minji;Woo, Sung-Il;Hahn, Sang-Woo;Lee, Yeon Jung;Hwang, Jaeuk
    • Korean Journal of Biological Psychiatry / v.27 no.1 / pp.18-26 / 2020
  • Objectives: The aim was to find effective vectorization and classification models to predict a psychiatric diagnosis from text-based medical records. Methods: Electronic medical records (n = 494) of present illness were collected retrospectively from inpatient admission notes with three diagnoses: major depressive disorder, type 1 bipolar disorder, and schizophrenia. Data were split into 400 training records and 94 independent validation records. Data were vectorized by two different models, term frequency-inverse document frequency (TF-IDF) and Doc2vec. Machine learning models for classification, including stochastic gradient descent, logistic regression, support vector classification, and deep learning (DL), were applied to predict the three psychiatric diagnoses. Five-fold cross-validation was used to find an effective model. Metrics such as accuracy, precision, recall, and F1-score were measured to compare the models. Results: Five-fold cross-validation on the training data showed that the DL model with Doc2vec was the most effective model for predicting the diagnosis (accuracy = 0.87, F1-score = 0.87). However, these metrics decreased on the independent test data set with the final working DL models (accuracy = 0.79, F1-score = 0.79), while logistic regression and the support vector machine with Doc2vec showed slightly better performance (accuracy = 0.80, F1-score = 0.80) than the DL models with Doc2vec and the other models with TF-IDF. Conclusions: The current results suggest that vectorization may have more impact on classification performance than the choice of machine learning model. However, the data set had a number of limitations, including small sample size, imbalance among the categories, and limited generalizability. In this regard, research with multiple sites and large samples is needed to improve the machine learning models.
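
TF-IDF, one of the two vectorization methods compared in the abstract above, can be illustrated in a few lines. The abstract does not say which TF-IDF variant or library was used, so this sketch assumes one common textbook form: term frequency as a share of the document's tokens, multiplied by log(N / document frequency). The function name `tf_idf` and the toy tokens are hypothetical and unrelated to the study's records.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return vectors


# Toy tokenized documents for illustration only.
docs = [["alpha", "beta", "gamma"], ["beta", "delta"], ["epsilon", "zeta"]]
vectors = tf_idf(docs)
print(vectors[0])
```

Under this variant, a term that appears in every document receives weight zero, and rarer terms such as "alpha" outweigh shared ones such as "beta". Unlike Doc2vec, the resulting vectors are sparse and carry no information about word order or context.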

Detecting Improper Sentences in a News Article Using Text Mining (텍스트 마이닝을 이용한 기사 내 부적합 문단 검출 시스템)

  • Kim, Kyu-Wan;Sin, Hyun-Ju;Kim, Seon-Jin;Lee, Hyun Ah
    • Annual Conference on Human and Language Technology / 2017.10a / pp.294-297 / 2017
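  • With the development of social networking services and smart devices, distributing news online has become easy, and maliciously manipulated news is being generated and spread rapidly. News manipulation takes various forms. Among them, inserting advertisements or clickbait content into an otherwise normal article exposes readers to information they did not intend to see, and readers easily accept that content as real news. This paper proposes a method for determining whether the paragraphs of a news article include an improper paragraph. The proposed method detects improper paragraphs in news by jointly using Word2Vec, which the paper describes as one of the Convolutional Neural Network (CNN) models useful for natural language processing, the tf-idf algorithm, and logistic regression. The system uses logistic regression to classify the category of each paragraph and compute the category distribution of the article body, uses Word2Vec to compute the similarity between paragraphs, and detects improper paragraphs by applying weights to these results.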


Implementation of Recipe Recommendation System Using Ingredients Combination Analysis based on Recipe Data (레시피 데이터 기반의 식재료 궁합 분석을 이용한 레시피 추천 시스템 구현)

  • Min, Seonghee;Oh, Yoosoo
    • Journal of Korea Multimedia Society / v.24 no.8 / pp.1114-1121 / 2021
  • In this paper, we implement a recipe recommendation system using ingredient harmonization analysis based on recipe data. The proposed system receives an image of a food ingredient purchase receipt and recommends ingredients and recipes to the user. It preprocesses the receipt image and extracts text using an OCR algorithm. The proposed system can recommend recipes based on the combined data of ingredients. It collects recipe data to calculate the combination for each food ingredient and extracts the food ingredients of the collected recipes as training data. It then acquires vector data by learning with a natural language processing algorithm and can recommend recipes based on ingredients with high similarity. The proposed system can also recommend recipes using replaceable ingredients, improving the accuracy of the result through preprocessing and postprocessing. For our evaluation, we created a random input dataset to evaluate the performance of the proposed recipe recommendation system and calculated the accuracy for each algorithm. In the performance evaluation, the Word2Vec algorithm achieved the highest accuracy.