• Title/Summary/Keyword: Keyword Filtering


A Distinction Technology for Harmful Web Documents by Rates (등급에 따른 웹 유해 문서 분류 기술)

  • Kim, Yong-Soo; Nam, Taek-Yong; Won, Dong-Ho
    • The KIPS Transactions: Part C / v.13C no.7 s.110 / pp.859-864 / 2006
  • The openness of the Web allows any user to access almost any type of information easily, at any time and from anywhere. However, the same ease of access that makes useful information available also exposes users indiscriminately to harmful content. Some information, such as adult content, is not appropriate for all users, notably children. Even for adults, content on extreme pornographic sites can harm mental health. Meanwhile, because the Internet is a worldwide open network, regulating providers of harmful content through each country's national laws or systems has limits. Developing classification technology tied to one specific system is also undesirable, because users can encounter harmful content through diverse channels, such as porn sites, harmful spam, or peer-to-peer networks. Research and development of context-based core technologies for classifying harmful content is therefore being emphasized. In this paper, we propose an efficient text filter that blocks harmful text in web documents using context-based technologies.

Analysis of Patents on the Recycling Technologies for Waste Batteries (폐전지 재활용 관련 기술의 특허 동향분석)

  • Kang Tae-Won; Jeong Jinki; Lee Jae-Chun; Sohn Jeong-Soo; Kang Kyung-Seok
    • Resources Recycling / v.14 no.6 s.68 / pp.44-59 / 2005
  • In this paper, worldwide patents on the recycling of used batteries were surveyed, and the trends and directions of ongoing and future technologies in this area were analyzed. The search was limited to open patents in the databases of the U.S.A. (USPTO, DELPHION), Japan (PAJ), Europe (EPO), and Korea (KIPRIS). The search conditions combined the keywords battery, batteries, electric cell, patent, and recycling with IPC classification. A total of 2,490 cases were found in the first search stage; after two filtering steps, 871 cases were selected for the final analysis. These 871 cases, dated between 1971 and 2000, were classified by country, company, and technology.

Bibliometric Analysis of the Effect of Acupuncture on Cancer Pain in the Last 20 Years (최근 20년간 침의 암성통증에 대한 효과 연구의 계량서지학적 분석)

  • Park, Han-song; Lee, Do-eun; Ha, Ji-su; Seo, Ho-seok; Kim, Jin-won
    • The Journal of Internal Korean Medicine / v.42 no.3 / pp.279-292 / 2021
  • Objectives: To analyze papers on the effect of acupuncture on cancer pain from a macroscopic point of view, and to suggest global trends and future research directions that promote acupuncture treatment for cancer pain. Methods: Papers retrieved from the Web of Science database with the query (acupuncture) AND (cancer pain) were filtered; 351 papers were selected and analyzed by year, field, journal, institution, author, and keyword. Results: More papers were published in 2020 than in any other year, and research was most active in the field of complementary and alternative medicine. Research on the effects of acupuncture on cancer pain has been active in cancer centers and university hospitals, and in various countries. The most frequently mentioned keywords in the titles and abstracts were acupuncture, pain, and quality of life. The latest top 5 keywords were inhibitor-induced arthralgia, acupuncture therapy, risk factors, opioids, and recovery. Conclusions: Acupuncture treatment has the potential to reduce pain and improve quality of life in cancer patients, and it should be actively studied in the future.

A Design of Similar Video Recommendation System using Extracted Words in Big Data Cluster (빅데이터 클러스터에서의 추출된 형태소를 이용한 유사 동영상 추천 시스템 설계)

  • Lee, Hyun-Sup; Kim, Jindeog
    • Journal of the Korea Institute of Information and Communication Engineering / v.24 no.2 / pp.172-178 / 2020
  • To recommend content, companies generally use collaborative filtering, which takes into account both user preferences and video (item) similarities. Such services are primarily intended to improve user convenience by leveraging personal preference signals such as search keywords and viewing time. Videos are also ranked around the keywords specified for them. However, analyzing video similarity from a limited set of keywords has limits, and the problem becomes serious when the specified keywords do not properly reflect the item. This paper proposes a system that identifies the characteristics of a video directly, without human intervention, and analyzes and recommends videos based on the similarities between them. The proposed system analyzes similarities by taking into account all words (keywords) with distinct meanings extracted from the training videos; because of the large scale of data and operations, the processing is carried out on a big data cluster.

Keyword Extraction through Text Mining and Open Source Software Category Classification based on Machine Learning Algorithms (텍스트 마이닝을 통한 키워드 추출과 머신러닝 기반의 오픈소스 소프트웨어 주제 분류)

  • Lee, Ye-Seul; Back, Seung-Chan; Joe, Yong-Joon; Shin, Dong-Myung
    • Journal of Software Assessment and Valuation / v.14 no.2 / pp.1-9 / 2018
  • The proportion of users and companies using open source continues to grow, and the open source software market is growing rapidly not only abroad but also in Korea. However, despite the continuous development of open source software, there is little research on classifying open source software by subject, and no software classification system has been specified. At present, users enter or tag the subject directly, which causes misclassification and inconvenience. Research on open source software classification can also serve as a basis for open source software evaluation, recommendation, and filtering. Therefore, in this study, we propose a method for classifying open source software using machine learning models and compare the performance of those models.

A Multimodal Profile Ensemble Approach to Development of Recommender Systems Using Big Data (빅데이터 기반 추천시스템 구현을 위한 다중 프로파일 앙상블 기법)

  • Kim, Minjeong; Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.21 no.4 / pp.93-110 / 2015
  • A recommender system recommends products to customers who are likely to be interested in them. Various recommender systems have been developed on the basis of automated information filtering technology. Collaborative filtering (CF), one of the most successful recommendation algorithms, has been applied in many domains, such as recommending Web pages, books, movies, music, and products. However, CF has a known critical shortcoming. CF finds neighbors whose preferences are similar to those of the target customer and recommends the products those neighbors have liked most. CF therefore works properly only when customers have provided a sufficient number of ratings on common products. When customer ratings are scarce, CF forms inaccurate neighborhoods, which results in poor recommendations. To improve the performance of CF-based recommender systems, most related studies have focused on developing novel algorithms under the assumption that a single profile is used, created from users' item ratings, purchase transactions, or Web access logs. With the advent of big data, companies have come to collect more data and to use a wide variety of large-scale information. Many companies consider the use of big data very important because it lets them improve their competitiveness and create new value. In particular, the use of personal big data in recommender systems is a rising issue, because personal big data enable more accurate identification of users' preferences and behaviors. The proposed recommendation methodology is as follows. First, multimodal user profiles are created from personal big data in order to grasp users' preferences and behavior from various viewpoints. We derive five user profiles based on personal information: rating, site preference, demographic, Internet usage, and topic in text.
Next, the similarity between users is calculated from the profiles, and each user's neighbors are found from the results. One of three ensemble approaches is applied to calculate the similarity: the similarity of the combined profile, the average similarity of the individual profiles, or the weighted average similarity of the individual profiles. Finally, the products most preferred by the users in the neighborhood are recommended to the target user. For the experiments, we used demographic data and a very large volume of Web log transactions for 5,000 panel users of a company that specializes in analyzing Web site rankings. R and SAS E-miner were used to implement the proposed recommender system and to conduct the topic analysis using keyword search, respectively. To evaluate recommendation performance, we used 60% of the data for training and 40% for testing. Five-fold cross validation was also conducted to enhance the reliability of our experiments. For evaluation we employed the F1 metric, a widely used combined metric that gives equal weight to recall and precision. In the evaluation, the proposed methodology achieved a significant improvement over the single-profile CF algorithm. In particular, the ensemble approach using weighted average similarity showed the highest performance: the improvement in F1 was 16.9 percent for the ensemble approach using weighted average similarity and 8.1 percent for the ensemble approach using the average similarity of each profile. From these results, we conclude that the multimodal profile ensemble approach is a viable solution to the problems encountered when customer ratings are scarce. This study is significant in suggesting what kinds of information can be used to create profiles in a big data environment and how they can be combined and used effectively.
However, our methodology needs further study before real-world application. We need to compare recommendation accuracy when the proposed method is applied to different recommendation algorithms, and then identify which combination performs best.
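
The weighted-average ensemble described in this abstract can be illustrated with a minimal Python sketch. The abstract does not state which similarity measure or weights were used, so cosine similarity, the profile names, and the weights below are assumptions chosen only for illustration; the F1 helper follows the standard definition of the metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ensemble_similarity(profiles_a, profiles_b, weights):
    """Weighted average of per-profile similarities between two users.

    profiles_a / profiles_b map a profile name to a sparse vector.
    weights maps a profile name to its (hypothetical) weight.
    """
    total = sum(weights.values())
    weighted = sum(
        w * cosine(profiles_a[name], profiles_b[name])
        for name, w in weights.items()
    )
    return weighted / total


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical users with two of the five profile types.
user_a = {"rating": {"item1": 5, "item2": 3}, "topic": {"sports": 1.0}}
user_b = {"rating": {"item1": 5, "item2": 3}, "topic": {"music": 1.0}}
weights = {"rating": 0.75, "topic": 0.25}  # assumed weights
similarity = ensemble_similarity(user_a, user_b, weights)
```

Setting all weights equal reduces this to the plain average-similarity ensemble that the abstract lists as the second approach.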

A Proposal of a Keyword Extraction System for Detecting Social Issues (사회문제 해결형 기술수요 발굴을 위한 키워드 추출 시스템 제안)

  • Jeong, Dami; Kim, Jaeseok; Kim, Gi-Nam; Heo, Jong-Uk; On, Byung-Won; Kang, Mijung
    • Journal of Intelligence and Information Systems / v.19 no.3 / pp.1-23 / 2013
  • To discover significant social issues that modern society urgently needs to solve, such as unemployment, economic crisis, and social welfare, researchers have usually collected opinions from professional experts and scholars through online or offline surveys. However, this method is not always effective. Because of the expense, it is rare to gather a large number of survey replies. In some cases, it is also hard to find professionals who deal with a specific social issue. Thus, the sample set is often small and may be biased. Furthermore, several experts may reach totally different conclusions about a social issue because each expert has a subjective point of view and a different background. In this case, it is considerably hard to figure out what the current social issues are and which of them are really important. To overcome the shortcomings of the current approach, in this paper we develop a prototype system that semi-automatically detects keywords representing social issues and problems from about 1.3 million news articles published by about 10 major domestic news outlets in Korea from June 2009 until July 2012. Our proposed system consists of (1) collecting news articles and extracting their text, (2) identifying only the news articles related to social issues, (3) analyzing the lexical items of Korean sentences, (4) finding a set of topics regarding social keywords over time based on probabilistic topic modeling, (5) matching relevant paragraphs to a given topic, and (6) visualizing social keywords for easy understanding. In particular, we propose a novel matching algorithm relying on generative models. The goal of our proposed matching algorithm is to best match paragraphs to each topic.
Technically, using a topic model such as Latent Dirichlet Allocation (LDA), we can obtain a set of topics, each of which has relevant terms and their probability values. In our problem, given a set of text documents (e.g., news articles), LDA produces a set of topic clusters, and each topic cluster is then labeled by human annotators, where each topic label stands for a social keyword. For example, suppose there is a topic (e.g., Topic1 = {(unemployment, 0.4), (layoff, 0.3), (business, 0.3)}) and a human annotator labels Topic1 as "Unemployment Problem". In this example, it is non-trivial to understand what happened regarding the unemployment problem in our society. In other words, by looking only at social keywords, we have no idea of the detailed events occurring in our society. To tackle this matter, we develop a matching algorithm that computes the probability value of a paragraph given a topic, relying on (i) the topic terms and (ii) their probability values. Given a set of text documents, we segment each document into paragraphs. In the meantime, using LDA, we extract a set of topics from the text documents. Based on our matching process, each paragraph is assigned to the topic it best matches, so that each topic ends up with several best-matched paragraphs. For instance, suppose there are a topic (e.g., Unemployment Problem) and its best-matched paragraph (e.g., "Up to 300 workers lost their jobs in XXX company at Seoul"). In this case, we can grasp the detailed information behind the social keyword, such as "300 workers", "unemployment", "XXX company", and "Seoul". In addition, our system visualizes social keywords over time. Therefore, through our matching process and keyword visualization, researchers should be able to detect social issues easily and quickly.
Through this prototype system, we have detected various social issues appearing in our society, and our experimental results show the effectiveness of the proposed methods. Our proof-of-concept system is available at http://dslab.snu.ac.kr/demo.html.
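
The idea of matching paragraphs to topics from topic terms and their probability values can be sketched as follows. This is a deliberately simplified scoring rule (the average topic-term probability over a paragraph's tokens), not the authors' generative matching algorithm, whose details the abstract does not give. The function names, the whitespace tokenizer, and the second topic are hypothetical; Topic1 reuses the example values from the abstract.

```python
def topic_score(paragraph, topic_terms):
    """Average probability of the topic's terms over the paragraph's tokens.

    topic_terms maps a term to its probability within the topic.
    Tokens that are not topic terms contribute zero.
    """
    tokens = paragraph.lower().split()
    if not tokens:
        return 0.0
    return sum(topic_terms.get(token, 0.0) for token in tokens) / len(tokens)


def best_topic(paragraph, topics):
    """Return the label of the topic with the highest score for the paragraph."""
    return max(topics, key=lambda label: topic_score(paragraph, topics[label]))


topics = {
    # Term probabilities taken from the Topic1 example in the abstract.
    "Unemployment Problem": {"unemployment": 0.4, "layoff": 0.3, "business": 0.3},
    # Hypothetical second topic, added only so there is something to compare.
    "Social Welfare": {"welfare": 0.5, "pension": 0.5},
}

paragraph = "layoff at the business raised unemployment"
label = best_topic(paragraph, topics)
```

In practice the term probabilities would come from a fitted LDA model and the tokens from a Korean morphological analyzer rather than from whitespace splitting.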

Development of Intelligent Job Classification System based on Job Posting on Job Sites (구인구직사이트의 구인정보 기반 지능형 직무분류체계의 구축)

  • Lee, Jung Seung
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.123-139 / 2019
  • The job classification systems of major job sites differ from site to site and also differ from the job classification system of the SQF (Sectoral Qualifications Framework) proposed for the SW field. Therefore, a new job classification system is needed that SW companies, SW job seekers, and job sites can all understand. The purpose of this study is to establish a standard job classification system that reflects market demand by analyzing the SQF based on the job offer information of major job sites and the NCS (National Competency Standards). For this purpose, association analysis is conducted between the occupations of major job sites, and between the SQF and those occupations, to derive association rules between occupations. Using these association rules, we propose an intelligent, data-based job classification system that maps the job classification systems of major job sites to the SQF job classification system. First, major job sites are selected to obtain information on the job classification systems used in the SW market. Then we identify ways to collect job information from each site and collect the data through open APIs. Focusing on the relationships in the data, we keep only the job postings that were posted on multiple job sites at the same time and delete the rest. Next, we map the job classification systems between job sites using the association rules derived from the association analysis. We complete the mapping between these market segments, discuss it with experts, further map the SQF, and finally propose a new job classification system. As a result, more than 30,000 job listings were collected in XML format using the open APIs of 'WORKNET,' 'JOBKOREA,' and 'saramin', which are the main job sites in Korea. After filtering to retain about 900 job postings that were posted simultaneously on multiple job sites, 800 association rules were derived by applying the Apriori algorithm, a frequent pattern mining method.
Based on the 800 association rules, the job classification systems of WORKNET, JOBKOREA, and saramin and the SQF job classification system were mapped and classified into first through fourth levels. In the new job taxonomy, the first primary class, covering jobs related to IT consulting, computer systems, networks, and security, consists of three secondary, five tertiary, and five fourth-level classifications. The second primary class, covering jobs related to databases and system operation, consists of three secondary, three tertiary, and four fourth-level classifications. The third primary class, covering Web Planning, Web Programming, Web Design, and Game, consists of four secondary, nine tertiary, and two fourth-level classifications. The last primary class, covering jobs related to ICT management and computer and communication engineering technology, consists of three secondary and six tertiary classifications. In particular, the new job classification system has a relatively flexible classification depth, unlike other existing classification systems. WORKNET divides jobs down to a third level. JOBKOREA divides jobs down to a second level and then subdivides them by keyword, and saramin does the same. The newly proposed standard job classification system accepts some keyword-based jobs and treats some product names as jobs. In the classification system, some jobs stop at the second level while others are subdivided down to the fourth level. This reflects the idea that not all jobs can be broken down into the same number of levels. We also proposed combining the rules derived from association analysis of the collected market data with experts' opinions.
Therefore, the newly proposed job classification system can be regarded as a data-based intelligent job classification system that reflects market demand, unlike existing job classification systems. This study is meaningful in that it suggests a new job classification system reflecting market demand by mapping occupations based on data, through association analysis between occupations, rather than on the intuition of a few experts. However, this study is limited in that it cannot fully reflect market demand that changes over time, because the data were collected at a single point in time. As market demand changes over time, including seasonal factors and the timing of major corporate public recruitment, continuous data monitoring and repeated experiments are needed to achieve more accurate matching. The results of this study can be used to suggest directions for improving the SQF in the SW industry, and the approach is expected to transfer to other industries building on its success in the SW industry.
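
To make the association-rule step concrete, here is a small Python sketch that derives rules between category labels from postings that appear on more than one site. It only handles item pairs, which corresponds to the first two levels of Apriori rather than the full algorithm, and the site labels, thresholds, and function name are hypothetical; the abstract does not report the support or confidence settings used in the study.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence):
    """Derive rules lhs -> rhs between pairs of items.

    Each transaction is the collection of category labels that one job
    posting received across sites. Returns a list of tuples
    (lhs, rhs, support, confidence).
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # infrequent pairs are pruned, as in Apriori
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical postings that were listed on two sites at the same time.
postings = [
    ["siteA:web", "siteB:web-programming"],
    ["siteA:web", "siteB:web-programming"],
    ["siteA:web", "siteB:design"],
    ["siteA:db", "siteB:dba"],
]
rules = pair_rules(postings, min_support=0.5, min_confidence=0.6)
```

A rule such as `siteB:web-programming -> siteA:web` with high confidence suggests that the two sites' categories can be mapped to the same class in a merged taxonomy.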

A Two-Phase On-Device Analysis for Gender Prediction of Mobile Users Using Discriminative and Popular Wordsets (모바일 사용자의 성별 예측을 위한 식별 및 인기 단어 집합 기반 2단계 기기 내 분석)

  • Choi, Yerim; Park, Kyuyon; Kim, Solee; Park, Jonghun
    • The Journal of Society for e-Business Studies / v.21 no.1 / pp.65-77 / 2016
  • As respecting privacy becomes an important issue in mobile device data analysis, on-device analysis, in which data are analyzed inside the mobile device without being sent outside, is attracting attention. One possible application of on-device analysis is gender prediction using highly private text data stored on mobile devices, such as text messages, search keywords, website bookmarks, and contacts. The limited computing power of mobile devices can be addressed with a word comparison method, in which words are selected beforehand and delivered to a user's mobile device, and the user's gender is determined by matching the mobile text data against the selected words. Moreover, performing prediction after filtering instances using definite evidence is known to increase accuracy and reduce computational complexity. In this regard, we propose a two-phase approach to on-device gender prediction in which the discriminability and the popularity of a word are considered in sequence. The proposed method first makes predictions for all instances using a few highly discriminative words, and then uses popular words for the instances left unclassified by the first phase. In experiments conducted on a real-world dataset, the proposed method outperformed the compared methods.
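
The two-phase word comparison described above can be sketched in Python as follows. The abstract does not publish the word lists or the exact decision rule, so the majority vote, the placeholder labels `group_a` and `group_b`, and the wordsets are all assumptions made for illustration.

```python
def vote(tokens, wordsets):
    """Return the label whose wordset matches the most tokens.

    Returns None when nothing matches or when the top count is tied.
    """
    counts = {label: len(tokens & words) for label, words in wordsets.items()}
    best = max(counts.values())
    winners = [label for label, count in counts.items() if count == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


def predict_two_phase(text, discriminative, popular):
    """Phase 1 uses a few discriminative words; phase 2 falls back to popular words.

    Returns (label, phase), where label may be None if neither phase decides.
    """
    tokens = set(text.lower().split())
    label = vote(tokens, discriminative)
    if label is not None:
        return label, 1
    return vote(tokens, popular), 2


# Placeholder wordsets; real lists would be selected offline and sent to the device.
discriminative = {"group_a": {"alpha"}, "group_b": {"beta"}}
popular = {"group_a": {"coffee", "movie"}, "group_b": {"game", "news"}}

result = predict_two_phase("game news coffee", discriminative, popular)
```

Only set intersections over short word lists run on the device, which is how this kind of method stays within limited computing power.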

Geographical Name Denoising by Machine Learning of Event Detection Based on Twitter (트위터 기반 이벤트 탐지에서의 기계학습을 통한 지명 노이즈제거)

  • Woo, Seungmin; Hwang, Byung-Yeon
    • KIPS Transactions on Software and Data Engineering / v.4 no.10 / pp.447-454 / 2015
  • This paper proposes geographical name denoising through machine learning for Twitter-based event detection. Recently, the growing number of smartphone users has led to a growing number of SNS users. In particular, the short message format (up to 140 characters) and the follow service enable Twitter to convey and diffuse information quickly. These characteristics, together with its mobile-optimized design, give Twitter a fast information delivery speed, so it can play a role in reporting disasters or events. Related research has used individual Twitter users as sensors to detect events that occur in reality. That research employed geographical names as keywords, exploiting the fact that an event occurs in a specific place. However, it did not remove the noise caused by homographs of geographical names, which became an important factor lowering the accuracy of event detection. In this paper, we apply denoising using two methods, removal and forecasting. First, after a filtering step that uses a purpose-built database of noise-related terms, we determine whether a word is a geographical name using Naive Bayesian classification. Finally, using experimental data, we obtained the probability values from machine learning. With the forecasting technique proposed in this paper, the reliability of the denoising technique was found to be 89.6%.
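
As an illustration of the Naive Bayesian step, the sketch below trains a small multinomial Naive Bayes classifier with Laplace smoothing to decide whether a tweet's context suggests a real place name or a homograph (noise). The training tweets, the labels `place` and `noise`, and the class name are hypothetical; the abstract does not describe the features or training data actually used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            logp = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                numerator = self.word_counts[label][token] + self.alpha
                denominator = total_words + self.alpha * vocab_size
                logp += math.log(numerator / denominator)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical context words surrounding an ambiguous name in four tweets.
docs = [
    ["fire", "near", "station"],
    ["traffic", "jam", "downtown"],
    ["love", "this", "song"],
    ["my", "friend", "said"],
]
labels = ["place", "place", "noise", "noise"]
model = NaiveBayes().fit(docs, labels)
prediction = model.predict(["fire", "downtown"])
```

Tweets classified as noise would be dropped before event detection, so that only mentions judged to be real place names count toward an event.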