• Title/Summary/Keyword: similarity based clustering


Water resources potential assessment of ungauged catchments in Lake Tana Basin, Ethiopia

  • Damtew, Getachew Tegegne; Kim, Young-Oh
    • Proceedings of the Korea Water Resources Association Conference / 2015.05a / pp.217-217 / 2015
  • The objective of this study was mainly to evaluate the water resources potential of Lake Tana Basin (LTB) by using the Soil and Water Assessment Tool (SWAT). From the SWAT simulation of LTB, about 5236 km2 of the basin is gauged watershed and the remaining 9878 km2 is ungauged watershed. For calibration of model parameters, four gauged stations were considered, namely Gilgel Abay, Gummera, Rib, and Megech. The SWAT-CUP built-in techniques, particle swarm optimization (PSO) and the generalized likelihood uncertainty estimation (GLUE) method, were used for calibration of model parameters, and the PSO method was selected for this study based on its performance results at the four gauging stations. Although the level of sensitivity of flow parameters differs from catchment to catchment, the curve number (CN2) was found to be the most sensitive parameter in all gauged catchments. To facilitate the transfer of data from gauged to ungauged catchments, clustering of hydrologic response units (HRUs) was done based on the physical similarity measured between gauged and ungauged catchment attributes. From the SWAT land use/soil/slope reclassification of LTB, a total of 142 HRUs were identified, and these HRUs were clustered into 39 similar hydrologic groups. In order to transfer the optimized model parameters from gauged to ungauged catchments based on these clustered hydrologic groups, this study evaluates three parameter transfer schemes: parameter transfer based on homogeneous regions (PT-I), parameter transfer based on global averaging (PT-II), and parameter transfer by considering the Gilgel Abay catchment as a representative catchment (PT-III), since its model performance values are better than those of the other three gauged catchments. The performance of these parameter transfer approaches was evaluated based on values of the Nash-Sutcliffe efficiency (NSE) and the coefficient of determination (R2). The computed NSE values were 0.71, 0.58, and 0.31 for PT-I, PT-II, and PT-III respectively, and the computed R2 values were 0.93, 0.82, and 0.95 for PT-I, PT-II, and PT-III respectively. Based on these performance evaluation criteria, PT-I was selected for modelling ungauged catchments by transferring optimized model parameters from gauged catchments. From the model result, the yearly average stream flow over the period 1989-2005 was 29.54 m3/s, 112.92 m3/s, and 130.10 m3/s for region-I, region-II, and region-III respectively.
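
For reference, the two metrics reported above can be computed as follows; this is a minimal sketch assuming paired observed and simulated flow series, and the values below are hypothetical rather than the study's data.

```python
import numpy as np

def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)

def r_squared(observed, simulated):
    """Coefficient of determination as the squared Pearson correlation."""
    r = np.corrcoef(observed, simulated)[0, 1]
    return r ** 2

# Hypothetical daily flows (m3/s) for one catchment.
obs = [12.1, 15.3, 30.2, 22.8, 18.4]
sim = [11.5, 16.0, 27.9, 24.1, 17.2]
print(nash_sutcliffe(obs, sim), r_squared(obs, sim))
```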


A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim; Kim, Ji Hui; Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been a continuous demand in various fields for product-level market information. However, such information has generally been provided at the industry level or in broad categories based on classification standards, making it difficult to obtain specific and appropriate information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than those previously offered. We applied the Word2Vec algorithm, a neural-network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, the data related to product information is collected, refined, and restructured into a form suitable for applying the Word2Vec model. Next, the preprocessed data is embedded into vector space by Word2Vec, and the product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data on the extracted products is summed to estimate the market size of the product groups. As experimental data, text data of product names from Statistics Korea's microdata (345,103 cases) were mapped into a multidimensional vector space by Word2Vec training. We performed parameter optimization for training and then applied a vector dimension of 300 and a window size of 15 as optimized parameters for further experiments. We employed index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. The product names similar to KSIC index words were extracted based on cosine similarity, and the market size of the extracted products as one product category was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market size of some items; the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied for the first time to market size estimation, overcoming the limitations of traditional methods that rely on sampling or multiple assumptions. In addition, the level of market category can be easily and efficiently adjusted to the purpose of information use by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical applications since it can resolve unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic word embedding module can be advanced by giving a proper order to the preprocessed dataset or by combining another measure such as Jaccard similarity with Word2Vec. Also, the method of product group clustering can be changed to another type of unsupervised machine learning algorithm. Our group is currently working on subsequent studies, and we expect that they can further improve the performance of the basic model conceptually proposed in this study.
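
The pipeline described above (embed product names, pull names similar to a KSIC index word by cosine similarity, then sum the matched companies' sales) could be sketched roughly as follows with gensim's Word2Vec; the product names, sales figures, index word, and similarity threshold are hypothetical placeholders, not the paper's data.

```python
# A rough sketch of the bottom-up market size estimation pipeline.
from gensim.models import Word2Vec

# Hypothetical per-company product names and sales figures.
product_sales = {
    "stainless kitchen knife": 120, "chef knife set": 95,
    "ceramic fruit knife": 40, "office chair": 300,
}
# Tokenize product names into word sequences for training.
sentences = [name.split() for name in product_sales]

model = Word2Vec(sentences, vector_size=300, window=15, min_count=1, epochs=50)

# Group products whose name tokens are similar to a (hypothetical) KSIC index word.
index_word, threshold = "knife", 0.0
group = [name for name in product_sales
         if index_word in model.wv and
         any(model.wv.similarity(index_word, tok) > threshold for tok in name.split())]

# Market size of the product group = sum of the member companies' sales.
print(group, sum(product_sales[name] for name in group))
```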

An Energy-Efficient Topology Control Scheme based on Application Layer Data in Wireless Sensor Networks (응용 계층 정보 기반의 에너지 효율적인 센서 네트워크 토폴로지 제어 기법)

  • Kim, Seung-Mok; Kim, Seung-Hoon
    • Journal of Korea Multimedia Society / v.12 no.9 / pp.1297-1308 / 2009
  • The lifetime of a wireless sensor network composed of numerous sensor nodes depends on the lifetimes of its individual sensor nodes. Energy-efficient operation of nodes is therefore one of the crucial factors in designing the network. Studies based on hierarchical network topologies have been proposed and refined in terms of energy efficiency. In existing work, however, the application layer data obtained from sensor nodes are not properly considered when composing clusters, and nodes communicate with their cluster heads under TDMA scheduling regardless of that data. In this paper, we suggest an energy-efficient topology control scheme based on application layer data in wireless sensor networks. By using application layer data, sensor nodes form a section, defined as an area of adjacent nodes whose application environments have similar characteristics; these sections are then organized into clusters. We suggest an algorithm for selecting a cluster head as well as a scheduling method that reduces the number of unnecessary transmissions from each node to its cluster head, based on the degree and duration of similarity between the node's data and its head's data in each cluster, without seriously damaging the integrity of the application data. The results show that the suggested scheme can save node energy and increase the lifetime of the entire network.
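
A highly simplified sketch of the transmission-suppression idea (a node skips its TDMA slot when its readings have stayed close to its cluster head's readings for several consecutive slots) might look like this; the similarity threshold and slot count are invented for illustration, and this is not the authors' implementation.

```python
# Suppress redundant transmissions when a node's readings stay close to
# its cluster head's readings for a sustained number of TDMA slots.
SIMILARITY_THRESHOLD = 0.5   # hypothetical absolute-difference threshold
SUPPRESS_AFTER_SLOTS = 3     # hypothetical duration before suppression

def should_transmit(node_readings, head_readings):
    """Transmit unless the last few slots were all 'similar enough' to the head."""
    recent = list(zip(node_readings, head_readings))[-SUPPRESS_AFTER_SLOTS:]
    if len(recent) < SUPPRESS_AFTER_SLOTS:
        return True
    return any(abs(n - h) > SIMILARITY_THRESHOLD for n, h in recent)

node = [20.1, 20.2, 20.3, 20.2, 25.0]
head = [20.0, 20.1, 20.2, 20.1, 20.2]
print(should_transmit(node[:4], head[:4]))  # False: readings stayed similar
print(should_transmit(node, head))          # True: latest reading diverged
```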


Analysis of Knowledge Community for Knowledge Creation and Use (지식 생성 및 활용을 위한 지식 커뮤니티 효과 분석)

  • Huh, Jun-Hyuk; Lee, Jung-Seung
    • Journal of Intelligence and Information Systems / v.16 no.4 / pp.85-97 / 2010
  • Internet communities are a typical space for knowledge creation and use on the Internet, as people discuss their common interests within them. When we define 'knowledge communities' as internet communities related to knowledge creation and use, they can be categorized into four types: 'Search Engine,' 'Open Communities,' 'Specialty Communities,' and 'Activity Communities.' A knowledge community does not remain one type; rather, it changes with time and is also affected by the external business environment. Therefore, it is critical to develop processes for the practical use of such changeable knowledge communities. Yet there is little research on a strategic framework for knowledge communities as a source of knowledge creation and use. The purposes of this study are (1) to find factors that can affect knowledge creation and use for each type of knowledge community and (2) to develop a strategic framework for the practical use of knowledge communities. Based on previous research, we identified seven factors that have considerable impact on knowledge creation and use: 'Fitness,' 'Reliability,' 'Systemicity,' 'Richness,' 'Similarity,' 'Feedback,' and 'Understanding.' We created 30 different questions from each type of knowledge community. The questions covered common sense, IT, business, and hobbies, and were uniformly selected from various knowledge communities. Instead of using a survey, we used these questions to query users of four representative web sites: Google for Search Engine, NAVER Knowledge iN for Open Communities, SLRClub for Specialty Communities, and Wikipedia for Activity Communities. These four web sites were selected based on popularity (i.e., the four most popular sites in Korea); they were also among the four most frequently mentioned sites in previous research. The answers to the 30 knowledge questions were collected and evaluated by 11 IT experts who have worked for IT companies for more than 3 years, using the above seven knowledge factors as criteria. Using stepwise linear regression on the evaluations of the seven knowledge factors, we found that each factor affects knowledge creation and use differently for each type of knowledge community. The results of the stepwise linear regression analysis showed the relationship between 'Understanding' and the other knowledge factors, and this relationship differed by type of knowledge community: 'Understanding' was significantly related to 'Reliability' for the Search Engine type, to 'Fitness' for the Open Community type, to 'Reliability' and 'Similarity' for the Specialty Community type, and to 'Richness' and 'Similarity' for the Activity Community type. A strategic framework was created from these results, and such a framework can be useful for knowledge communities that are not stable over time. For the success of a knowledge community, the results of this study suggest that it is essential to ensure the presence of factors that can influence knowledge communities, and to reinforce each factor, since each has its unique influence on the related type of knowledge community. Thus, changeable knowledge communities should be transformed into an adequate type with proper business strategies and objectives, and may progress into a type that covers various types of knowledge communities. For example, DCInside started as a small specialty community focusing on digital camera hardware and camerawork and was then transformed into an open community focusing on social issues through its well-known photo galleries. NAVER started as a typical search engine and now covers an open community and a specialty community through additional web services such as NAVER Knowledge iN, NAVER Cafe, and NAVER Blog. NAVER is currently competing with an activity community such as Wikipedia through NAVER Encyclopedia, which provides similar user-contributed services. Finally, the results of this study provide practical guidance for practitioners on which type of knowledge community is most appropriate in a fluctuating business environment, as knowledge communities themselves evolve over time.
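
The factor analysis above uses stepwise linear regression; a small forward-selection illustration on synthetic expert scores (not the study's data) might look like the following, assuming statsmodels and an arbitrary 0.05 p-value threshold.

```python
# Forward stepwise selection of factors explaining 'Understanding'.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
factors = ["Fitness", "Reliability", "Systemicity", "Richness", "Similarity", "Feedback"]
df = pd.DataFrame(rng.uniform(1, 7, size=(30, 6)), columns=factors)
df["Understanding"] = 0.8 * df["Reliability"] + rng.normal(0, 0.5, 30)  # synthetic target

selected, remaining = [], factors.copy()
while remaining:
    # Add the candidate whose coefficient has the lowest p-value below 0.05.
    pvals = {}
    for cand in remaining:
        X = sm.add_constant(df[selected + [cand]])
        pvals[cand] = sm.OLS(df["Understanding"], X).fit().pvalues[cand]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:
        break
    selected.append(best)
    remaining.remove(best)

print("Selected factors:", selected)
```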

A Study on the Impact Factors of Contents Diffusion in Youtube using Integrated Content Network Analysis (일반영향요인과 댓글기반 콘텐츠 네트워크 분석을 통합한 유튜브(Youtube)상의 콘텐츠 확산 영향요인 연구)

  • Park, Byung Eun; Lim, Gyoo Gun
    • Journal of Intelligence and Information Systems / v.21 no.3 / pp.19-36 / 2015
  • Social media is an emerging issue in content services and in the current business environment. YouTube is the most representative social media service in the world, and it differs from other conventional content services in its open user participation and content creation methods. To promote a content item on YouTube, it is important to understand the diffusion phenomena of contents and the structural characteristics of the network. Most previous studies analyzed impact factors of content diffusion from the viewpoint of general behavioral factors; more recently, some researchers have used network structure factors, but the two approaches have been applied separately. This study analyzes the general impact factors on the view count together with content-based network structures. In addition, when building the content-based network, this study forms the network structure by analyzing user comments on 22,370 YouTube contents, rather than building an individual-user-based network. From this study, we statistically verified the causal relations between the view count and both general factors and network factors. By analyzing this integrated research model, we found that these factors affect the view count of YouTube in the following order: uploader followers, video age, betweenness centrality, comments, closeness centrality, clustering coefficient, and rating; degree centrality and eigenvector centrality, however, affect the view count negatively. From this research, some strategic points for utilizing content diffusion are as follows. First, it is necessary to manage general factors such as the number of uploader followers or subscribers, the video age, the number of comments, and average rating points. The impact of average rating points is not as important as previously thought; rather, it is important to increase the number of uploader followers strategically and to keep contents in the service as long as possible. Second, attention should be paid to the impacts of betweenness centrality and closeness centrality among the network factors. Users seem to search for related subjects or similar contents after watching a content item, so it helps to shorten the distance to other popular contents in the service; that is, this study showed that view counts benefit from decreasing the number of search attempts and increasing similarity with many other contents, which is consistent with the result of the clustering coefficient impact analysis. Third, it is important to note the negative impact of degree centrality and eigenvector centrality on the view count. If the number of connections with other contents increases too much, it means there are many similar contents, which may eventually spread the view counts across them. Moreover, a very high eigenvector centrality means that the content is connected to popular contents around it, and it might lose views to those popular contents; it is therefore better to avoid connections with overly powerful popular contents. In this study we analyzed the diffusion phenomenon and verified the diffusion factors of YouTube contents by using an integrated model consisting of general factors and network structure factors. In terms of social contribution, this study may provide useful information to the music and movie industries and other content vendors for effective content services, and it provides basic schemes that can be applied strategically in online content marketing. One limitation of this study is that the network structure analysis was performed on a content-based network, which is an indirect way to observe the content network structure; more direct methods could be used to establish a content network. Future research includes more detailed analyses according to the types of contents, domains, or the characteristics of contents and users.
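
For reference, the network-structure factors named above (degree, betweenness, closeness, and eigenvector centrality, plus the clustering coefficient) can be computed on a comment-based content graph with networkx, as in this toy sketch; the graph is illustrative, not the paper's 22,370-content network.

```python
# Compute the network-structure features for each content node.
import networkx as nx

# Nodes are contents; an edge links two contents commented on by the same user.
G = nx.Graph()
G.add_edges_from([("v1", "v2"), ("v1", "v3"), ("v2", "v3"),
                  ("v3", "v4"), ("v4", "v5")])

features = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    "clustering": nx.clustering(G),
}
for video in G.nodes:
    print(video, {name: round(vals[video], 3) for name, vals in features.items()})
```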

Color-related Query Processing for Intelligent E-Commerce Search (지능형 검색엔진을 위한 색상 질의 처리 방안)

  • Hong, Jung A; Koo, Kyo Jung; Cha, Ji Won; Seo, Ah Jeong; Yeo, Un Yeong; Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.109-125 / 2019
  • As interest in intelligent search engines increases, various studies have been conducted to extract and utilize product-related features intelligently. In particular, when users search for goods in e-commerce search engines, the 'color' of a product is an important feature that describes the product. Therefore, it is necessary to handle synonyms of color terms in order to produce accurate results for users' color-related queries. Previous studies have suggested a dictionary-based approach to processing synonyms for color features. However, the dictionary-based approach has the limitation that it cannot handle color-related terms in user queries that are not registered in the dictionary. To overcome this limitation, this research proposes a model that extracts RGB values from an internet search engine in real time and outputs similar color names based on the designated color information. First, a color term dictionary was constructed containing color names and the R, G, B values of each color, drawn from the Korean color standard digital palette program and the Wikipedia color list, for basic color search. The dictionary was made more robust by adding 138 color names rendered from English color names as loanwords in Korean, together with their RGB values, so the final color dictionary includes a total of 671 color names and corresponding RGB values. The proposed method starts from the specific color a user searched for and checks whether that color is present in the built-in color dictionary. If the color exists in the dictionary, its RGB values in the dictionary are used as the reference values of the retrieved color. If the searched color does not exist in the dictionary, the top-5 Google image search results for the searched color are crawled and average RGB values are extracted from a central area of each image. To extract the RGB values from images, a variety of approaches was attempted, since simply averaging the RGB values of the center area of an image has limits; clustering the RGB values in a certain area of the image and taking the average of the densest cluster as the reference values showed the best performance. Based on the reference RGB values of the searched color, the RGB values of all colors in the previously constructed color dictionary are compared, and a color list is created with colors within a range of ±50 for each of the R, G, and B values. Finally, using the Euclidean distance between these candidates and the reference RGB values of the searched color, up to five colors with the highest similarity become the final outcome. To evaluate the usefulness of the proposed method, we performed an experiment in which 300 color names and corresponding RGB values were obtained through questionnaires and used to compare the RGB values obtained from four different methods, including the proposed method. The average CIE-Lab Euclidean distance using our method was about 13.85, a relatively low distance compared to 3088 for the case using the synonym dictionary only and 30.38 for the case using the dictionary with the Korean synonym website WordNet. The case that did not use the clustering step of the proposed method showed an average Euclidean distance of 13.88, which implies that the DBSCAN clustering of the proposed method can reduce the Euclidean distance. This research suggests a new color synonym processing method based on RGB values that combines the dictionary method with real-time synonym processing for new color names. The method overcomes the limitation of the dictionary-based approach, the conventional synonym processing method, and can contribute to improving the intelligence of e-commerce search systems, especially for color search features.
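
A condensed sketch of the matching step described above, assuming pixel RGB samples have already been collected from crawled images: DBSCAN picks the densest pixel cluster as the reference color, dictionary colors outside ±50 on any channel are filtered out, and the survivors are ranked by Euclidean distance. The dictionary entries and pixel samples below are illustrative, not the study's 671-entry dictionary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical slice of a color dictionary: name -> (R, G, B).
color_dict = {"crimson": (220, 20, 60), "scarlet": (255, 36, 0),
              "navy": (0, 0, 128), "coral": (255, 127, 80)}

def reference_rgb_from_pixels(pixels, eps=10, min_samples=5):
    """Cluster sampled pixel RGBs with DBSCAN and average the densest cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pixels)
    valid = labels[labels != -1]
    if len(valid) == 0:
        return pixels.mean(axis=0)
    densest = np.bincount(valid).argmax()
    return pixels[labels == densest].mean(axis=0)

def similar_colors(reference, top_k=5):
    """Filter dictionary colors within +/-50 per channel, rank by Euclidean distance."""
    candidates = {name: rgb for name, rgb in color_dict.items()
                  if all(abs(r - q) <= 50 for r, q in zip(rgb, reference))}
    return sorted(candidates,
                  key=lambda n: np.linalg.norm(np.array(color_dict[n]) - reference))[:top_k]

pixels = np.random.default_rng(1).normal((230, 30, 55), 8, size=(200, 3))  # fake image pixels
ref = reference_rgb_from_pixels(pixels)
print(similar_colors(ref))
```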

Automatic Clustering of Same-Name Authors Using Full-text of Articles (논문 원문을 이용한 동명 저자 자동 군집화)

  • Kang, In-Su; Jung, Han-Min; Lee, Seung-Woo; Kim, Pyung; Goo, Hee-Kwan; Lee, Mi-Kyung; Goo, Nam-Ang; Sung, Won-Kyung
    • Proceedings of the Korea Contents Association Conference / 2006.11a / pp.652-656 / 2006
  • Bibliographic information retrieval systems require bibliographic data such as authors, organizations, and sources of publication to be uniquely identified using keys. In particular, when authors are represented simply by their names, users bear the burden of manually distinguishing different authors who share the same name. Previous approaches to resolving the same-name author problem rely on bibliographic data such as co-author information, article titles, etc. However, these methods cannot handle articles with a single author, or articles whose titles share no common terms. To complement the previous methods, this study introduces a classification-based approach using similarity between the full texts of articles. Experiments using recent domestic proceedings showed that the proposed method has the potential to supplement previous metadata-based approaches.
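
One plausible way to realize the full-text similarity idea (not necessarily the authors' classifier) is to vectorize article bodies with TF-IDF and cluster them by cosine similarity; the toy texts below stand in for articles sharing one author name, and scikit-learn 1.2 or later is assumed for the `metric` argument.

```python
# Group articles attributed to one author name by full-text similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

full_texts = [
    "semantic web ontology reasoning and knowledge representation",
    "ontology based knowledge retrieval for the semantic web",
    "protein folding molecular dynamics simulation of membranes",
    "molecular simulation of protein structure and dynamics",
]  # four articles whose author field reads the same name

tfidf = TfidfVectorizer().fit_transform(full_texts)
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(tfidf.toarray())
print(labels)  # e.g. [0 0 1 1]: two distinct authors behind one name
```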


Morphological Characteristics and Genetic Diversity Analysis of Cultivated Sancho (Zanthoxylum schinifolium) and Chopi (Zanthoxylum piperitum) in Korea (국내 재배지의 산초(Zanthoxylum schinifolium)와 초피(Zanthoxylum piperitum)의 형태학적 특성과 유전적 다양성)

  • Ryu, Jaihyunk; Choi, Hae-Sik; Lyu, Jae-il; Bae, Chang-Hyu
    • Korean Journal of Plant Resources / v.29 no.5 / pp.555-563 / 2016
  • The morphological characteristics and genetic relationships among 32 germplasms of Zanthoxylum schinifolium and Zanthoxylum piperitum collected from two farms in Korea were investigated. The traits with the most variability were seed color, leaf size, and spine size. The intraspecific polymorphism of Z. schinifolium and Z. piperitum was 96.5% and 60.3%, respectively. The genetic diversity and Shannon's information index values ranged from 0.11 to 0.33 and 0.19 to 0.50, with average values of 0.26 and 0.42, respectively. Two ISSR primers (UBC861 and UBC862) were able to distinguish the two species. The genetic similarity matrix (GSM) revealed variability among the accessions ranging from 0.116 to 0.816; the intraspecific GSM for Z. schinifolium and Z. piperitum was 0.177-0.780 and 0.250-0.816, respectively. The GSM findings indicate that Z. schinifolium and Z. piperitum accessions have high genetic diversity and possess germplasms that qualify as good genetic resources for cross breeding. The clustering analysis separated Z. schinifolium and Z. piperitum into independent groups, and all accessions could be classified into three categories, with Z. schinifolium var. inermis forming an independent group. Comparison of the clusters based on morphological analysis with those based on ISSR data resulted in an unclear pattern of division among the accessions. The study findings indicate that Z. schinifolium and Z. piperitum accessions have genetic diversity, and ISSR markers were useful for identifying Z. schinifolium and Z. piperitum.
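
As background for the clustering analysis mentioned above, a genetic similarity matrix can be derived from binary ISSR band data, for example with the Dice coefficient, and then clustered hierarchically; the band patterns below are made up for illustration and are not the study's dataset.

```python
# Build a genetic similarity matrix from presence/absence band data and cluster it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows: accessions; columns: presence (1) / absence (0) of ISSR bands.
bands = np.array([
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

def dice(a, b):
    """Dice similarity: 2 * shared bands / total bands in both accessions."""
    shared = np.sum((a == 1) & (b == 1))
    return 2 * shared / (a.sum() + b.sum())

n = len(bands)
similarity = np.array([[dice(bands[i], bands[j]) for j in range(n)] for i in range(n)])
distance = 1 - similarity

# Average-linkage (UPGMA-style) clustering on the condensed distance matrix.
condensed = distance[np.triu_indices(n, k=1)]
groups = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(np.round(similarity, 2))
print(groups)  # accessions split into two groups
```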

Assessment of Genetic Relationship among Date (Zizyphus jujuba) Cultivars Revealed by I-SSR Marker (I-SSR 표지자분석을 이용한 대추나무 품종간 유연관계 분석)

  • Nam, Jae-Ik; Kim, Young-Mi; Choi, Go-Eun; Lee, Gwi-Young; Park, Jae-In
    • Journal of Korean Society of Forest Science / v.102 no.1 / pp.59-65 / 2013
  • The jujube is an important fruit tree species in Korea. Traditionally, classification of jujube cultivars has been based on morphological characters; however, morphological identification can be problematic because morphological traits are affected by environmental conditions. Therefore, DNA markers are now being used for the rapid and accurate identification of plant species. Inter-simple sequence repeat (I-SSR) is one of the best DNA-based molecular marker techniques and is useful for studying genetic relations and for the identification of closely related cultivars. In this study, 5 Korean jujube trees and 1 jujube tree imported from China were analyzed with 16 I-SSR primers. Amplification of the genomic DNA of the jujube cultivars using I-SSR analysis generated 100 bands, with an average of 6.25 bands per primer, of which 45 bands (45%) were polymorphic. The number of amplified fragments per I-SSR primer ranged from 2 to 13, and the percentage of polymorphism ranged from 10% to 100%. I-SSR fingerprinting profiles showed that 'Boeun jujube' and 'Daeri jujube' had characteristic DNA patterns, indicating unequivocal cultivar identification at the molecular level. According to the results of the clustering analysis, the genetic similarity coefficient ranged from 0.68 to 0.92. 'Boeun jujube' and 'Daeri jujube' were divided into independent groups, and 'Bokjo jujube', 'Geumseong jujube', 'Wolchul jujube', and 'Mudeung jujube' were placed in the same group. Therefore, I-SSR markers are suitable for discriminating the 'Boeun jujube' and 'Daeri jujube' cultivars.

Managing the Reverse Extrapolation Model of Radar Threats Based Upon an Incremental Machine Learning Technique (점진적 기계학습 기반의 레이더 위협체 역추정 모델 생성 및 갱신)

  • Kim, Chulpyo; Noh, Sanguk
    • The Journal of Korean Institute of Next Generation Computing / v.13 no.4 / pp.29-39 / 2017
  • Various electronic warfare situations drive the need to develop an integrated electronic warfare simulator that can perform electronic warfare modeling and simulation of radar threats. In this paper, we analyze the components of a simulation system that reversely models radar threats emitting electromagnetic signals based on the parameters of the electronic information, and we propose a method to incrementally maintain the reverse extrapolation models of RF threats. In the experiments, we evaluate the effectiveness of the incremental model update and also assess methods for integrating the reverse extrapolation models. The individual models of RF threats are constructed using a decision tree, a naive Bayesian classifier, an artificial neural network, and clustering algorithms based on Euclidean distance and cosine similarity, respectively. Experimental results show that the accuracy of the reverse extrapolation models improves as the size of the threat sample increases. In addition, we use voting, weighted voting, and the Dempster-Shafer algorithm to integrate the results of the five different RF threat models. As a result, the final decision of reverse extrapolation through the Dempster-Shafer algorithm shows the best accuracy.
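
The model-integration step can be illustrated with the two simpler fusion rules mentioned above, majority voting and weighted voting; the model names, predicted threat classes, and weights below are hypothetical stand-ins for the five RF-threat models, and the Dempster-Shafer combination is omitted for brevity.

```python
# Combine per-model threat predictions by majority voting and weighted voting.
from collections import Counter

# Each model's predicted threat class for one observed signal (hypothetical).
predictions = {"decision_tree": "SAM-A", "naive_bayes": "SAM-A",
               "neural_net": "SAM-B", "euclidean_cluster": "SAM-A",
               "cosine_cluster": "SAM-B"}
# Per-model weights, e.g. each model's validation accuracy (hypothetical).
weights = {"decision_tree": 0.82, "naive_bayes": 0.78, "neural_net": 0.90,
           "euclidean_cluster": 0.75, "cosine_cluster": 0.88}

# Majority voting: the class predicted by the most models wins.
majority = Counter(predictions.values()).most_common(1)[0][0]

# Weighted voting: each model's vote counts with its weight.
scores = {}
for model, label in predictions.items():
    scores[label] = scores.get(label, 0.0) + weights[model]
weighted = max(scores, key=scores.get)

print(majority, weighted)  # compare the two fusion rules
```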