• Title/Summary/Keyword: Data Collection and Preprocessing

Search Result 59, Processing Time 0.026 seconds

Artificial Intelligence Algorithms, Model-Based Social Data Collection and Content Exploration (소셜데이터 분석 및 인공지능 알고리즘 기반 범죄 수사 기법 연구)

  • An, Dong-Uk;Leem, Choon Seong
    • The Journal of Bigdata
    • /
    • v.4 no.2
    • /
    • pp.23-34
    • /
    • 2019
  • Recently, the crime that utilizes the digital platform is continuously increasing. About 140,000 cases occurred in 2015 and about 150,000 cases occurred in 2016. Therefore, it is considered that there is a limit handling those online crimes by old-fashioned investigation techniques. Investigators' manual online search and cognitive investigation methods those are broadly used today are not enough to proactively cope with rapid changing civil crimes. In addition, the characteristics of the content that is posted to unspecified users of social media makes investigations more difficult. This study suggests the site-based collection and the Open API among the content web collection methods considering the characteristics of the online media where the infringement crimes occur. Since illegal content is published and deleted quickly, and new words and alterations are generated quickly and variously, it is difficult to recognize them quickly by dictionary-based morphological analysis registered manually. In order to solve this problem, we propose a tokenizing method in the existing dictionary-based morphological analysis through WPM (Word Piece Model), which is a data preprocessing method for quick recognizing and responding to illegal contents posting online infringement crimes. In the analysis of data, the optimal precision is verified through the Vote-based ensemble method by utilizing a classification learning model based on supervised learning for the investigation of illegal contents. This study utilizes a sorting algorithm model centering on illegal multilevel business cases to proactively recognize crimes invading the public economy, and presents an empirical study to effectively deal with social data collection and content investigation.

  • PDF

The Frequency Analysis of Teacher's Emotional Response in Mathematics Class (수학 담화에서 나타나는 교사의 감성적 언어 빈도 분석)

  • Son, Bok Eun;Ko, Ho Kyoung
    • Communications of Mathematical Education
    • /
    • v.32 no.4
    • /
    • pp.555-573
    • /
    • 2018
  • The purpose of this study is to identify the emotional language of math teachers in math class using text mining techniques. For this purpose, we collected the discourse data of the teachers in the class by using the excellent class video. The analysis of the extracted unstructured data proceeded to three stages: data collection, data preprocessing, and text mining analysis. According to text mining analysis, there was few emotional language in teacher's response in mathematics class. This result can infer the characteristics of mathematics class in the aspect of affective domain.

Financial Fraud Detection using Text Mining Analysis against Municipal Cybercriminality (지자체 사이버 공간 안전을 위한 금융사기 탐지 텍스트 마이닝 방법)

  • Choi, Sukjae;Lee, Jungwon;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.3
    • /
    • pp.119-138
    • /
    • 2017
  • Recently, SNS has become an important channel for marketing as well as personal communication. However, cybercrime has also evolved with the development of information and communication technology, and illegal advertising is distributed to SNS in large quantity. As a result, personal information is lost and even monetary damages occur more frequently. In this study, we propose a method to analyze which sentences and documents, which have been sent to the SNS, are related to financial fraud. First of all, as a conceptual framework, we developed a matrix of conceptual characteristics of cybercriminality on SNS and emergency management. We also suggested emergency management process which consists of Pre-Cybercriminality (e.g. risk identification) and Post-Cybercriminality steps. Among those we focused on risk identification in this paper. The main process consists of data collection, preprocessing and analysis. First, we selected two words 'daechul(loan)' and 'sachae(private loan)' as seed words and collected data with this word from SNS such as twitter. The collected data are given to the two researchers to decide whether they are related to the cybercriminality, particularly financial fraud, or not. Then we selected some of them as keywords if the vocabularies are related to the nominals and symbols. With the selected keywords, we searched and collected data from web materials such as twitter, news, blog, and more than 820,000 articles collected. The collected articles were refined through preprocessing and made into learning data. The preprocessing process is divided into performing morphological analysis step, removing stop words step, and selecting valid part-of-speech step. In the morphological analysis step, a complex sentence is transformed into some morpheme units to enable mechanical analysis. In the removing stop words step, non-lexical elements such as numbers, punctuation marks, and double spaces are removed from the text. In the step of selecting valid part-of-speech, only two kinds of nouns and symbols are considered. Since nouns could refer to things, the intent of message is expressed better than the other part-of-speech. Moreover, the more illegal the text is, the more frequently symbols are used. The selected data is given 'legal' or 'illegal'. To make the selected data as learning data through the preprocessing process, it is necessary to classify whether each data is legitimate or not. The processed data is then converted into Corpus type and Document-Term Matrix. Finally, the two types of 'legal' and 'illegal' files were mixed and randomly divided into learning data set and test data set. In this study, we set the learning data as 70% and the test data as 30%. SVM was used as the discrimination algorithm. Since SVM requires gamma and cost values as the main parameters, we set gamma as 0.5 and cost as 10, based on the optimal value function. The cost is set higher than general cases. To show the feasibility of the idea proposed in this paper, we compared the proposed method with MLE (Maximum Likelihood Estimation), Term Frequency, and Collective Intelligence method. Overall accuracy and was used as the metric. As a result, the overall accuracy of the proposed method was 92.41% of illegal loan advertisement and 77.75% of illegal visit sales, which is apparently superior to that of the Term Frequency, MLE, etc. Hence, the result suggests that the proposed method is valid and usable practically. In this paper, we propose a framework for crisis management caused by abnormalities of unstructured data sources such as SNS. We hope this study will contribute to the academia by identifying what to consider when applying the SVM-like discrimination algorithm to text analysis. Moreover, the study will also contribute to the practitioners in the field of brand management and opinion mining.

Deep Learning for Herbal Medicine Image Recognition: Case Study on Four-herb Product

  • Shin, Kyungseop;Lee, Taegyeom;Kim, Jinseong;Jun, Jaesung;Kim, Kyeong-Geun;Kim, Dongyeon;Kim, Dongwoo;Kim, Se Hee;Lee, Eun Jun;Hyun, Okpyung;Leem, Kang-Hyun;Kim, Wonnam
    • Proceedings of the Plant Resources Society of Korea Conference
    • /
    • 2019.10a
    • /
    • pp.87-87
    • /
    • 2019
  • The consumption of herbal medicine and related products (herbal products) have increased in South Korea. At the same time the quality, safety, and efficacy of herbal products is being raised. Currently, the herbal products are standardized and controlled according to the requirements of the Korean Pharmacopoeia, the National Institute of Health and the Ministry of Public Health and Social Affairs. The validation of herbal products and their medicinal component is important, since many of these herbal products are composed of two or more medicinal plants. However, there are no tools to support the validation process. Interest in deep learning has exploded over the past decade, for herbal medicine using algorithms to achieve herb recognition, symptom related target prediction, and drug repositioning have been reported. In this study, individual images of four herbs (Panax ginseng C.A. Meyer, Atractylodes macrocephala Koidz, Poria cocos Wolf, Glycyrrhiza uralensis Fischer), actually sold in the market, were achieved. Certain image preprocessing steps such as noise reduction and resize were formatted. After the features are optimized, we applied GoogLeNet_Inception v4 model for herb image recognition. Experimental results show that our method achieved test accuracy of 95%. However, there are two limitations in the current study. Firstly, due to the relatively small data collection (100 images), the training loss is much lower than validation loss which possess overfitting problem. Secondly, herbal products are mostly in a mixture, the applied method cannot be reliable to detect a single herb from a mixture. Thus, further large data collection and improved object detection is needed for better classification.

  • PDF

Creating a Smartphone User Recommendation System Using Clustering (클러스터링을 이용한 스마트폰 사용자 추천 시스템 만들기)

  • Jin Hyoung AN
    • Journal of Korea Artificial Intelligence Association
    • /
    • v.2 no.1
    • /
    • pp.1-6
    • /
    • 2024
  • In this paper, we develop an AI-based recommendation system that matches the specifications of smartphones from company 'S'. The system aims to simplify the complex decision-making process of consumers and guide them to choose the smartphone that best suits their daily needs. The recommendation system analyzes five specifications of smartphones (price, battery capacity, weight, camera quality, capacity) to help users make informed decisions without searching for extensive information. This approach not only saves time but also improves user satisfaction by ensuring that the selected smartphone closely matches the user's lifestyle and needs. The system utilizes unsupervised learning, i.e. clustering (K-MEANS, DBSCAN, Hierarchical Clustering), and provides personalized recommendations by evaluating them with silhouette scores, ensuring accurate and reliable grouping of similar smartphone models. By leveraging advanced data analysis techniques, the system can identify subtle patterns and preferences that might not be immediately apparent to consumers, enhancing the overall user experience. The ultimate goal of this AI recommendation system is to simplify the smartphone selection process, making it more accessible and user-friendly for all consumers. This paper discusses the data collection, preprocessing, development, implementation, and potential impact of the system using Pandas, crawling, scikit-learn, etc., and highlights the benefits of helping consumers explore the various options available and confidently choose the smartphone that best suits their daily lives.

Development of a Machine-Learning based Human Activity Recognition System including Eastern-Asian Specific Activities

  • Jeong, Seungmin;Choi, Cheolwoo;Oh, Dongik
    • Journal of Internet Computing and Services
    • /
    • v.21 no.4
    • /
    • pp.127-135
    • /
    • 2020
  • The purpose of this study is to develop a human activity recognition (HAR) system, which distinguishes 13 activities, including five activities commonly dealt with in conventional HAR researches and eight activities from the Eastern-Asian culture. The eight special activities include floor-sitting/standing, chair-sitting/standing, floor-lying/up, and bed-lying/up. We used a 3-axis accelerometer sensor on the wrist for data collection and designed a machine learning model for the activity classification. Data clustering through preprocessing and feature extraction/reduction is performed. We then tested six machine learning algorithms for recognition accuracy comparison. As a result, we have achieved an average accuracy of 99.7% for the 13 activities. This result is far better than the average accuracy of current HAR researches based on a smartwatch (89.4%). The superiority of the HAR system developed in this study is proven because we have achieved 98.7% accuracy with publically available 'pamap2' dataset of 12 activities, whose conventionally met the best accuracy is 96.6%.

Fishery R&D Big Data Platform and Metadata Management Strategy (수산과학 빅데이터 플랫폼 구축과 메타 데이터 관리방안)

  • Kim, Jae-Sung;Choi, Youngjin;Han, Myeong-Soo;Hwang, Jae-Dong;Cho, Wan-Sup
    • The Journal of Bigdata
    • /
    • v.4 no.2
    • /
    • pp.93-103
    • /
    • 2019
  • In this paper, we introduce a big data platform and a metadata management technique for fishery science R & D information. The big data platform collects and integrates various types of fisheries science R & D information and suggests how to build it in the form of a data lake. In addition to existing data collected and accumulated in the field of fisheries science, we also propose to build a big data platform that supports diverse analysis by collecting unstructured big data such as satellite image data, research reports, and research data. Next, by collecting and managing metadata during data extraction, preprocessing and storage, systematic management of fisheries science big data is possible. By establishing metadata in a standard form along with the construction of a big data platform, it is meaningful to suggest a systematic and continuous big data management method throughout the data lifecycle such as data collection, storage, utilization and distribution.

  • PDF

Exploration of Types and Context of Errors in the Weather Data Analysis Process (기상 데이터 분석 과정에서 나타나는 오류의 유형과 맥락 탐색)

  • Seok-Young Hong
    • Journal of the Korean Society of Earth Science Education
    • /
    • v.17 no.2
    • /
    • pp.153-167
    • /
    • 2024
  • This study explored the errors and context occurred during high school students' data analysis processes. For the study, 222 data inquiry reports produced by 74 students from 'A' High School were collected and explored the detailed error types in the data analysis processes such as data collection and preprocessing, data representation, and data interpretation. The results of study found that in the data interpretation process, students had a somewhat insufficient understanding of seasonal variations and periodic patterns about weather elements. And, various types of errors were identified in the data representation process, such as basic unit in graphs, legend settings, trend lines. The causes of these errors are the feature of authoring tools, misconceptions related to weather elements, and cognitive biases, etc. Based on the study's results, educational implications for big data education, a significant topic in future science education, were derived. And related follow-up studies were suggested.

Development of Machine Learning Model Use Cases for Intelligent Internet of Things Technology Education (지능형 사물인터넷 기술 교육을 위한 머신러닝 모델 활용 사례 개발)

  • Kyeong Hur
    • Journal of Practical Engineering Education
    • /
    • v.16 no.4
    • /
    • pp.449-457
    • /
    • 2024
  • AIoT, the intelligent Internet of Things, refers to a technology that collects data measured by IoT devices and applies machine learning technology to create and utilize predictive models. Existing research on AIoT technology education focused on building an educational AIoT platform and teaching how to use it. However, there was a lack of case studies that taught the process of automatically creating and utilizing machine learning models from data measured by IoT devices. In this paper, we developed a case study using a machine learning model for AIoT technology education. The case developed in this paper consists of the following steps: data collection from AIoT devices, data preprocessing, automatic creation of machine learning models, calculation of accuracy for each model, determination of valid models, and data prediction using the valid models. In this paper, we considered that sensors in AIoT devices measure different ranges of values, and presented an example of data preprocessing accordingly. In addition, we developed a case where AIoT devices automatically determine what information they can predict by automatically generating several machine learning models and determining effective models with high accuracy among these models. By applying the developed cases, a variety of educational contents using AIoT, such as prediction-based object control using AIoT, can be developed.

Development of big data based Skin Care Information System SCIS for skin condition diagnosis and management

  • Kim, Hyung-Hoon;Cho, Jeong-Ran
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.3
    • /
    • pp.137-147
    • /
    • 2022
  • Diagnosis and management of skin condition is a very basic and important function in performing its role for workers in the beauty industry and cosmetics industry. For accurate skin condition diagnosis and management, it is necessary to understand the skin condition and needs of customers. In this paper, we developed SCIS, a big data-based skin care information system that supports skin condition diagnosis and management using social media big data for skin condition diagnosis and management. By using the developed system, it is possible to analyze and extract core information for skin condition diagnosis and management based on text information. The skin care information system SCIS developed in this paper consists of big data collection stage, text preprocessing stage, image preprocessing stage, and text word analysis stage. SCIS collected big data necessary for skin diagnosis and management, and extracted key words and topics from text information through simple frequency analysis, relative frequency analysis, co-occurrence analysis, and correlation analysis of key words. In addition, by analyzing the extracted key words and information and performing various visualization processes such as scatter plot, NetworkX, t-SNE, and clustering, it can be used efficiently in diagnosing and managing skin conditions.