• Title/Summary/Keyword: Matrix system

Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

  • An, Sehwan;Kim, Youngmin
    • Journal of Intelligence and Information Systems / v.28 no.1 / pp.263-286 / 2022
  • This study acquires NBA statistical information for a total of 32 years from 1990 to 2022 using web crawling, observes variables of interest through exploratory data analysis, and generates related derived variables. Unused variables were removed through a purification process on the input data, and correlation analysis, t-test, and ANOVA were performed on the remaining variables. For the variable of interest, the difference in the mean between the groups that advanced to the playoffs and did not advance to the playoffs was tested, and then to compensate for this, the average difference between the three groups (higher/middle/lower) based on ranking was reconfirmed. Of the input data, only this year's season data was used as a test set, and 5-fold cross-validation was performed by dividing the training set and the validation set for model training. The overfitting problem was solved by comparing the cross-validation result and the final analysis result using the test set to confirm that there was no difference in the performance matrix. Because the quality level of the raw data is high and the statistical assumptions are satisfied, most of the models showed good results despite the small data set. This study not only predicts NBA game results or classifies whether or not to advance to the playoffs using machine learning, but also examines whether the variables of interest are included in the major variables with high importance by understanding the importance of input attribute. Through the visualization of SHAP value, it was possible to overcome the limitation that could not be interpreted only with the result of feature importance, and to compensate for the lack of consistency in the importance calculation in the process of entering/removing variables. It was found that a number of variables related to three points and errors classified as subjects of interest in this study were included in the major variables affecting advancing to the playoffs in the NBA. 
This study resembles earlier sports data analyses in that it covers match results, playoff advancement, and championship prediction, and compares several machine learning models. It differs in that the features of interest were set in advance and statistically verified, and then compared with the machine learning results. It is further differentiated from existing studies by presenting explanatory visualizations using SHAP, one of the XAI techniques.
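
The evaluation protocol described in this abstract (latest season held out as the test set, 5-fold cross-validation on the rest) can be sketched roughly as follows. This is only an illustration, not the authors' code: the `season` field name, the contiguous fold layout, and both function names are assumptions.

```python
from typing import Dict, Iterator, List, Tuple


def split_latest_season(rows: List[Dict], season_key: str = "season") -> Tuple[List[Dict], List[Dict]]:
    """Hold out the most recent season as the test set (assumed field name: 'season')."""
    latest = max(row[season_key] for row in rows)
    train = [row for row in rows if row[season_key] != latest]
    test = [row for row in rows if row[season_key] == latest]
    return train, test


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size
```

Comparing the mean validation score across the five folds with the score on the held-out season is one way to check for the overfitting the abstract mentions; in practice the rows would usually be shuffled before folding.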

Export Prediction Using Separated Learning Method and Recommendation of Potential Export Countries (분리학습 모델을 이용한 수출액 예측 및 수출 유망국가 추천)

  • Jang, Yeongjin;Won, Jongkwan;Lee, Chaerok
    • Journal of Intelligence and Information Systems / v.28 no.1 / pp.69-88 / 2022
  • One of the characteristics of South Korea's economic structure is that it is highly dependent on exports. Thus, many businesses are closely related to the global economy and diplomatic situation. In addition, small and medium-sized enterprises(SMEs) specialized in exporting are struggling due to the spread of COVID-19. Therefore, this study aimed to develop a model to forecast exports for next year to support SMEs' export strategy and decision making. Also, this study proposed a strategy to recommend promising export countries of each item based on the forecasting model. We analyzed important variables used in previous studies such as country-specific, item-specific, and macro-economic variables and collected those variables to train our prediction model. Next, through the exploratory data analysis(EDA) it was found that exports, which is a target variable, have a highly skewed distribution. To deal with this issue and improve predictive performance, we suggest a separated learning method. In a separated learning method, the whole dataset is divided into homogeneous subgroups and a prediction algorithm is applied to each group. Thus, characteristics of each group can be more precisely trained using different input variables and algorithms. In this study, we divided the dataset into five subgroups based on the exports to decrease skewness of the target variable. After the separation, we found that each group has different characteristics in countries and goods. For example, In Group 1, most of the exporting countries are developing countries and the majority of exporting goods are low value products such as glass and prints. On the other hand, major exporting countries of South Korea such as China, USA, and Vietnam are included in Group 4 and Group 5 and most exporting goods in these groups are high value products. Then we used LightGBM(LGBM) and Exponential Moving Average(EMA) for prediction. 
Considering the characteristics of each group, models were built using LGBM for Group 1 to 4 and EMA for Group 5. To evaluate the performance of the model, we compare different model structures and algorithms. As a result, it was found that the separated learning model had best performance compared to other models. After the model was built, we also provided variable importance of each group using SHAP-value to add explainability of our model. Based on the prediction model, we proposed a second-stage recommendation strategy for potential export countries. In the first phase, BCG matrix was used to find Star and Question Mark markets that are expected to grow rapidly. In the second phase, we calculated scores for each country and recommendations were made according to ranking. Using this recommendation framework, potential export countries were selected and information about those countries for each item was presented. There are several implications of this study. First of all, most of the preceding studies have conducted research on the specific situation or country. However, this study use various variables and develops a machine learning model for a wide range of countries and items. Second, as to our knowledge, it is the first attempt to adopt a separated learning method for exports prediction. By separating the dataset into 5 homogeneous subgroups, we could enhance the predictive performance of the model. Also, more detailed explanation of models by group is provided using SHAP values. Lastly, this study has several practical implications. There are some platforms which serve trade information including KOTRA, but most of them are based on past data. Therefore, it is not easy for companies to predict future trends. By utilizing the model and recommendation strategy in this research, trade related services in each platform can be improved so that companies including SMEs can fully utilize the service when making strategies and decisions for exports.
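
A minimal sketch of three ingredients named in this abstract: splitting records into five groups by export value, an exponential moving average forecast, and a BCG-matrix quadrant lookup. The equal-size rank bins, the smoothing factor `alpha`, and the growth/share cut-offs are all assumptions made for illustration; the abstract does not state how the authors chose them.

```python
from typing import List, Sequence


def assign_groups(targets: Sequence[float], n_groups: int = 5) -> List[int]:
    """Label each record 1..n_groups by the rank of its target (assumed equal-size bins)."""
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    labels = [0] * len(targets)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_groups // len(targets) + 1
    return labels


def ema_forecast(series: Sequence[float], alpha: float = 0.5) -> float:
    """Exponential moving average; the last smoothed value is used as the next forecast."""
    if not series:
        raise ValueError("series must not be empty")
    smoothed = float(series[0])
    for value in series[1:]:
        smoothed = alpha * value + (1 - alpha) * smoothed
    return smoothed


def bcg_quadrant(growth: float, share: float, growth_cut: float, share_cut: float) -> str:
    """Classic BCG quadrants; the two cut-offs are hypothetical parameters."""
    if growth >= growth_cut:
        return "Star" if share >= share_cut else "Question Mark"
    return "Cash Cow" if share >= share_cut else "Dog"
```

In a separated learning setup, each group label would then select its own model (the abstract reports LGBM for Groups 1 to 4 and EMA for Group 5).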

Comparative Study on the Carbon Stock Changes Measurement Methodologies of Perennial Woody Crops-focusing on Overseas Cases (다년생 목본작물의 탄소축적 변화량 산정방법론 비교 연구-해외사례를 중심으로)

  • Hae-In Lee;Yong-Ju Lee;Kyeong-Hak Lee;Chang-Bae Lee
    • Korean Journal of Agricultural and Forest Meteorology / v.25 no.4 / pp.258-266 / 2023
  • This study analyzed methodologies for estimating the carbon stocks of perennial woody crops and related research cases overseas. We found that Australia, Bulgaria, Canada, and Japan use the stock-difference method, while Austria, Denmark, and Germany estimate carbon stock change with the gain-loss method. Some countries have moved beyond developing country-specific factors (Tier 2) to estimating carbon stock change from image data (Tier 3). In South Korea, third-stage convergence studies have been conducted in the forestry field, but advanced research in the agricultural field is at an early stage. Based on these results, we suggest four directions for future research: 1) securing country-specific factors for emissions and removals in the agricultural field by developing allometric equations and carbon conversion factors for perennial woody crops, to improve the completeness of emission and removal statistics; 2) conducting policy studies on refining the calculation of cultivation area using fruit tree biomass-based maturity; 3) developing a more advanced estimation technique for perennial woody crops in the agricultural sector using allometric equations and remote sensing based on the agricultural and forestry satellite scheduled for launch in 2025, and establishing a matrix and monitoring system for perennial woody crop cultivation areas; and 4) estimating soil carbon stock change, currently estimated by treating all agricultural areas as one, by land sub-classification in order to implement a dynamic carbon cycle model. This study offers detailed guidelines and advanced methods for calculating carbon stock change in perennial woody crops, which can support the 2050 Carbon Neutral Strategy of the Ministry of Agriculture, Food and Rural Affairs and stimulate related research in the agricultural sector.
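
The two estimation approaches contrasted in this abstract differ only in what is measured. The sketch below follows the general form used in the IPCC inventory guidelines (stock-difference: change in stock between two inventories divided by the elapsed years; gain-loss: annual gains minus annual losses). Function and parameter names are illustrative assumptions, and units (for example tonnes C per year) are left to the caller.

```python
def stock_difference(stock_t1: float, stock_t2: float, year_t1: float, year_t2: float) -> float:
    """Annual carbon stock change from two inventories: (C_t2 - C_t1) / (t2 - t1)."""
    if year_t2 <= year_t1:
        raise ValueError("year_t2 must be later than year_t1")
    return (stock_t2 - stock_t1) / (year_t2 - year_t1)


def gain_loss(annual_gain: float, annual_loss: float) -> float:
    """Annual carbon stock change as gains (growth) minus losses (removal, mortality)."""
    return annual_gain - annual_loss
```

The stock-difference form needs repeated measurements of the same area, while the gain-loss form needs growth and removal rates, which is one reason countries choose differently.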

The Effect of Information Quality and System Quality on Knowledge Service Competence: Focusing on Knowledge Service Types (지식서비스의 정보품질과 시스템품질이 지식서비스 역량에 미치는 영향: 지식서비스 유형을 중심으로)

  • Geun-Wan Park;Hyun-Ji Park;Sung-Hoon Mo;Cheol-Hyun Lim;Hee-Seok Choi;Seok-Hyoung Lee;Hye-Jin Lee;Seung-June Hwang;Chang-Hee Han
    • Information Systems Review / v.21 no.4 / pp.1-29 / 2019
  • The knowledge resources take a role in promoting the sustainable growth of organization. Therefore, it is important for the members of organization to acquire knowledge consistently so that the company can continue to grow. Knowledge service is the field that provides information and infrastructure which enable the members of organization to acquire new knowledge. As we recognized the importance of knowledge services, we analyzed the level of knowledge service management and development through the impact of knowledge quality on user capabilities. First, the matrix of knowledge patterns was presented based on the type of information and the level of customer interaction. According to patterns, the knowledge service was classified into three types of information providing, information analysis, and infrastructure, and then the results of structural model analysis were presented for each type. It found that the impact of knowledge service quality on user competence was different according to the type of service. The results suggested new indicators for measuring the performance of knowledge services, and provided information for reconstructing services based on the user considering the integrated operation of knowledge service and organizational designing knowledge service.

Soil Classification of Paddy Soils by Soil Taxonomy (미국신분류법(美國新分類法)에 의(依)한 답토양의 분류(分類)에 관한 연구)

  • Joo, Yeong-Hee;Shin, Yong-Hwa
    • Korean Journal of Soil Science and Fertilizer / v.11 no.2 / pp.97-104 / 1979
  • Soils in Korea have been classified according to Soil Taxonomy, which was developed over the past 20 years by the Soil Conservation Service of the USDA. This system is well suited to classifying most soils. Paddy field soils, however, are difficult to classify because Soil Taxonomy states that no proposals have yet been developed for classifying artificially irrigated soils. This paper discusses some problems in applying Soil Taxonomy and suggests a classification of paddy field soils in Korea. The following is a summary of the paper.
    1. Anthroaquic, Aquic Udipsamments: The topsoils are saturated with irrigation water at some time of year and have mottles of low chroma (2 or less) at more than 50 cm below the soil surface. (Ex. Sadu, Geumcheon series)
    2. Anthroaquic Udipsamments: These soils are like Anthroaquic, Aquic Udipsamments except that the mottles of low chroma occur within 50 cm of the soil surface. (Ex. Baegsu series)
    3. Halic Psammaquents: These soils contain enough salts distributed in the profile to interfere with the growth of most crop plants, and they are located on coastal dunes. The water table fluctuates with the tides. (Ex. Nagcheon series)
    4. Anthroaquic, Aquic Udifluvents: These soils have some mottles with chroma of 2 or less at more than 50 cm below the surface. The upper horizon is saturated with irrigation water at some time. (Ex. Maryeong series)
    5. Anthroaquic Udifluvents: These soils are saturated with irrigation water at some time of year and have mottles of low chroma (2 or less) within 50 cm of the soil surface. (Ex. Haenggog series)
    6. Fluventic Haplaquepts: These soils have an organic carbon content that decreases irregularly with depth and do not have an argillic horizon in any part of the pedon. Because groundwater occurs at or near the surface, they are dominantly gray soils in a thick mineral regolith. (Ex. Baeggu, Hagseong series)
    7. Fluventic Thapto-Histic Haplaquepts: These soils have a buried organic matter layer whose upper boundary is within 1 m of the surface. Other properties are the same as Fluventic Haplaquepts. (Ex. Gongdeog, Seotan series)
    8. Fluventic Aeric Haplaquepts: These soils have a horizon with chroma too high for Fluventic Haplaquepts. The higher chroma is thought to indicate either a shorter period of saturation of the whole soil with water or somewhat deeper groundwater than in the Fluventic Haplaquepts. The correlation of color with soil drainage classes is imperfect. (Ex. Mangyeong, Jeonbug series)
    9. Fluventic Thapto-Histic Aeric Haplaquepts: These soils are similar to Fluventic Thapto-Histic Haplaquepts except for the deeper groundwater. (Ex. Bongnam series)
    10. Fluventic Aeric Sulfic Haplaquepts: These soils are similar to Fluventic Aeric Haplaquepts except for the yellow mottles and low pH (<4.0) in some part between 50 and 150 cm below the surface. (Ex. Deunggu series)
    11. Fluventic Sulfaquepts: These soils are extremely acid and toxic to most plants. Their horizons are mostly dark gray and have yellow mottles of iron sulfate within 50 cm of the soil surface. They occur mainly in coastal marshes near the mouths of rivers. (Ex. Bongrim, Haecheog series)
    12. Fluventic Aeric Sulfaquepts: These soils have a horizon with chroma too high for Fluventic Sulfaquepts. Other properties are the same as Fluventic Sulfaquepts. (Ex. Gimhae series)
    13. Anthroaquic Fluvaquentic Eutrochrepts: These soils have mottles of low chroma at more than 50 cm below the surface due to irrigation water. The base saturation is 60 percent or more in some subhorizon between depths of 25 and 75 cm below the surface. (Ex. Jangyu, Chilgog series)
    14. Anthroaquic Dystric Fluventic Eutrochrepts: These soils are similar to Anthroaquic Fluvaquentic Eutrochrepts except that the low chroma occurs within 50 cm of the surface. (Ex. Weolgog, Gyeongsan series)
    15. Anthroaquic Fluventic Dystrochrepts: These soils have mottles with chroma of 2 or less within 50 cm of the soil surface due to artificial irrigation. They have lower base saturation (<60 percent) in all subhorizons between depths of 25 and 75 cm below the soil surface. (Ex. Gocheon, Bigog series)
    16. Anthroaquic Eutrandepts: These soils are similar to Anthroaquic Dystric Fluventic Eutrochrepts except for the lower bulk density in the horizon. (Ex. Daejeong series)
    17. Anthroaquic Hapludalfs: These soils have a surface that is saturated with irrigation water at some time and have a matrix chroma of 2 or less with mottles of higher chroma within 50 cm of the surface. (Ex. Hwadong, Yongsu series)
    18. Anthroaquic, Aquic Hapludalfs: These soils are similar to Anthroaquic Hapludalfs except that the matrix with chroma of 2 or less and the mottles of higher chroma occur at more than 50 cm below the surface. (Ex. Geugrag, Deogpyeong series)

An Analytical Study on Stem Growth of Chamaecyparis obtusa (편백(扁栢)의 수간성장(樹幹成長)에 관(關)한 해석적(解析的) 연구(硏究))

  • An, Jong Man;Lee, Kwang Nam
    • Journal of Korean Society of Forest Science / v.77 no.4 / pp.429-444 / 1988
  • Considering the recent trend toward developing multiple uses of forest trees, comprehensive information on young stands of Hinoki cypress is necessary for rational forest management. From this point of view, 83 sample trees were selected and felled from 23-year-old stands of Hinoki cypress at Changsung-gun, Chonnam-do. Various stem growth factors of the felled trees were measured, and canonical correlation analysis, principal component analysis, and factor analysis were applied to investigate the stem growth characteristics and the relationships among stem growth factors, and to obtain latent and comprehensive information. The results are as follows. The canonical correlation coefficient between stem volume and the quality growth factors was 0.9877. The coefficients of the canonical variates showed that DBH among the diameter growth factors and height among the height growth factors had important effects on stem volume. From the analysis of the relationship between stem volume and the canonical variates, in which DBH and height were linearly combined as one set, DBH had a greater influence on volume growth than height. In the principal component analysis of 12 stem growth factors, the 1st and 2nd principal components were adopted to meet the effective value of 85%; together they had a cumulative contribution rate of 88.10%. The 1st and 2nd principal components were interpreted as a "size factor" and a "shape factor", respectively. From the summed proportion of the effective principal components for each variate, the information of all variates except crown diameter, clear length, and form height was explained by more than 87%. Two common factors were set by the eigenvalues obtained with the SMC (squared multiple correlation) as the diagonal elements of the canonical matrix. There were 2 latent factors, $f_1$ and $f_2$. The former was interpreted as the nature of the diameter growth system. In the inherent phenomena of the 12 growth factors, the communalities, except for clear length and crown diameter, had great explanatory power of 78.62-98.30%. The 83 sample trees could be classified into 5 stem types as follows: medium type within a radius of ${\pm}1$ standard deviation of the factor scores, uniform type in diameter and height growth in the 1st quadrant, slim type in the 2nd quadrant, dwarfish type in the 3rd quadrant, and full-boled type in the 4th quadrant.
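
The rule used in this abstract for keeping principal components (retain the leading components until their cumulative contribution rate reaches 85%) can be sketched as below. The eigenvalues in the usage note are hypothetical and are not taken from the paper; the function name is likewise an assumption.

```python
from typing import List, Sequence, Tuple


def components_for_threshold(eigenvalues: Sequence[float], threshold: float = 0.85) -> Tuple[int, List[float]]:
    """Return (number of components kept, cumulative contribution rates)."""
    if not eigenvalues or any(value < 0 for value in eigenvalues):
        raise ValueError("eigenvalues must be non-empty and non-negative")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("eigenvalues must not all be zero")
    cumulative = []
    running = 0.0
    for value in ordered:
        running += value
        cumulative.append(running / total)
    # Keep the smallest number of leading components that reaches the threshold.
    for k, ratio in enumerate(cumulative, start=1):
        if ratio >= threshold:
            return k, cumulative
    return len(ordered), cumulative
```

For example, with made-up eigenvalues `[8.0, 2.6, 0.8, 0.6]` the first component alone explains about 67% and the first two about 88%, so two components would be kept under an 85% threshold.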

Visualizing the Results of Opinion Mining from Social Media Contents: Case Study of a Noodle Company (소셜미디어 콘텐츠의 오피니언 마이닝결과 시각화: N라면 사례 분석 연구)

  • Kim, Yoosin;Kwon, Do Young;Jeong, Seung Ryul
    • Journal of Intelligence and Information Systems / v.20 no.4 / pp.89-105 / 2014
  • Since the emergence of the Internet, social media with highly interactive Web 2.0 applications has provided very user-friendly means for consumers and companies to communicate with each other. Users routinely publish content involving their opinions and interests in social media such as blogs, forums, chat rooms, and discussion boards, and the content is released on the Internet in real time. For that reason, many researchers and marketers regard social media content as a source of information for business analytics to develop business insights, and many studies have reported results on mining business intelligence from social media content. In particular, opinion mining and sentiment analysis, as techniques to extract, classify, understand, and assess the opinions implicit in text content, are frequently applied to social media content analysis because they emphasize determining sentiment polarity and extracting authors' opinions. A number of frameworks, methods, techniques, and tools have been presented by these researchers. However, we have found some weaknesses in their methods, which are often technically complicated and not sufficiently user-friendly for supporting business decisions and planning. In this study, we attempted to formulate a more comprehensive and practical approach to conducting opinion mining with visual deliverables. First, we described the entire cycle of practical opinion mining using social media content, from the initial data gathering stage to the final presentation session. Our proposed approach to opinion mining consists of four phases: collecting, qualifying, analyzing, and visualizing. In the first phase, analysts have to choose the target social media. Each target medium requires a different way for analysts to gain access: open APIs, search tools, DB2DB interfaces, purchased content, and so on. The second phase is pre-processing to generate useful materials for meaningful analysis. 
If we do not remove garbage data, results of social media analysis will not provide meaningful and useful business insights. To clean social media data, natural language processing techniques should be applied. The next step is the opinion mining phase where the cleansed social media content set is to be analyzed. The qualified data set includes not only user-generated contents but also content identification information such as creation date, author name, user id, content id, hit counts, review or reply, favorite, etc. Depending on the purpose of the analysis, researchers or data analysts can select a suitable mining tool. Topic extraction and buzz analysis are usually related to market trends analysis, while sentiment analysis is utilized to conduct reputation analysis. There are also various applications, such as stock prediction, product recommendation, sales forecasting, and so on. The last phase is visualization and presentation of analysis results. The major focus and purpose of this phase are to explain results of analysis and help users to comprehend its meaning. Therefore, to the extent possible, deliverables from this phase should be made simple, clear and easy to understand, rather than complex and flashy. To illustrate our approach, we conducted a case study on a leading Korean instant noodle company. We targeted the leading company, NS Food, with 66.5% of market share; the firm has kept No. 1 position in the Korean "Ramen" business for several decades. We collected a total of 11,869 pieces of contents including blogs, forum contents and news articles. After collecting social media content data, we generated instant noodle business specific language resources for data manipulation and analysis using natural language processing. In addition, we tried to classify contents in more detail categories such as marketing features, environment, reputation, etc. 
In these phases, we used freeware packages of the R project such as TM, KoNLP, ggplot2, and plyr. As a result, we presented several useful visualization outputs, such as domain-specific lexicons, volume and sentiment graphs, topic word clouds, heat maps, valence tree maps, and other visualized images, to provide vivid, full-color examples using open library packages of the R project. Business actors can detect at a glance which areas are weak, strong, positive, negative, quiet, or loud. A heat map can explain the movement of sentiment or volume in a category-by-time matrix, which shows the density of color over time periods. A valence tree map, one of the most comprehensive and holistic visualization models, should be very helpful for analysts and decision makers to quickly understand the "big picture" of the business situation with a hierarchical structure, since a tree map can present buzz volume and sentiment as a visualized result for a certain period. This case study offers real-world business insights from market sensing, which would demonstrate to practical-minded business users how they can use these types of results for timely decision making in response to ongoing changes in the market. We believe our approach can provide a practical and reliable guide to opinion mining with visualized results that are immediately useful, not just in the food industry but in other industries as well.
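
The category-by-time matrix behind the heat map described above can be sketched as a simple aggregation. The study itself used R packages; the Python below is only an illustration, and the record layout `(category, period, score)`, the choice of mean sentiment per cell, and the function name are assumptions.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple


def sentiment_matrix(
    records: Iterable[Tuple[str, str, float]]
) -> Tuple[List[str], List[str], List[List[Optional[float]]]]:
    """Aggregate (category, period, score) records into a category x period matrix of means."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, period, score in records:
        sums[(category, period)] += score
        counts[(category, period)] += 1
    categories = sorted({category for category, _ in counts})
    periods = sorted({period for _, period in counts})
    # Cells with no documents stay None so a plot can leave them blank.
    matrix = [
        [sums[(c, p)] / counts[(c, p)] if (c, p) in counts else None for p in periods]
        for c in categories
    ]
    return categories, periods, matrix
```

Each row of the returned matrix corresponds to one category and each column to one time period, which is the layout a heat-map plotting function typically expects.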

The Relationship between Expression of EGFR, MMP-9, and C-erbB-2 and Survival Time in Resected Non-Small Cell Lung Cancer (수술을 시행한 비소세포 폐암 환자에서 EGFR, MMP-9 및 C-erbB-2의 발현과 환자 생존율과의 관계)

  • Lee, Seung Heon;Jung, Jin Yong;Lee, Kyoung Ju;Lee, Seung Hyeun;Kim, Se Joong;Ha, Eun Sil;Kim, Jeong-Ha;Lee, Eun Joo;Hur, Gyu Young;Jung, Ki Hwan;Jung, Hye Cheol;Lee, Sung Yong;Lee, Sang Yeub;Kim, Je Hyeong;Shin, Chol;Shim, Jae Jeong;In, Kwang Ho;Kang, Kyung Ho;Yoo, Se Hwa;Kim, Chul Hwan
    • Tuberculosis and Respiratory Diseases / v.59 no.3 / pp.286-297 / 2005
  • Background: Non-small cell lung cancer (NSCLC) is a common cause of cancer-related death in North America and Korea, with an overall 5-year survival rate of between 4 and 14%. The TNM staging system is the best prognostic index for operable NSCLC. However, epidermal growth factor receptor (EGFR), matrix metalloproteinase-9 (MMP-9), and C-erbB-2 have all been implicated in the pathogenesis of NSCLC and might provide prognostic information. Methods: Immunohistochemical staining of 81 specimens of resected primary non-small cell lung cancer was evaluated in order to determine the role of these biological markers in NSCLC. Immunohistochemical staining for EGFR, MMP-9, and C-erbB-2 was performed on paraffin-embedded tissue sections to observe the expression pattern according to pathologic type and surgical stage. The correlation between the expression of each biological marker and survival time was determined. Results: When positive immunohistochemical staining was defined as a stained area of >20% (more than Grade 2), the positive rates for EGFR, MMP-9, and C-erbB-2 staining were 71.6%, 44.3%, and 24.1% of the 81 patients, respectively. The positive rates of EGFR and MMP-9 staining according to surgical stages I, II, and IIIa were 75.0% and 41.7%, 66.7% and 47.6%, and 76.9% and 46.2%, respectively. The median survival time of the EGFR(-) group, 71.8 months, was significantly longer than that of the EGFR(+) group, 33.5 months (p=0.018, Kaplan-Meier method, log-rank test). The MMP-9(+) group had a shorter median survival time than the MMP-9(-) group, 35.0 and 65.3 months, respectively (p=0.2). Co-expression of EGFR and MMP-9 was associated with a worse prognosis, with a median survival time of 26.9 months compared with 77 months for the group negative for both markers (p=0.0023). There were no significant differences between the C-erbB-2(+) and C-erbB-2(-) groups. Conclusion: In NSCLC, the expression of EGFR might be a prognostic factor, and the co-expression of EGFR and MMP-9 was found to be associated with a poor prognosis. However, C-erbB-2 expression had no prognostic significance.
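
The median survival times quoted in this abstract come from Kaplan-Meier estimates. A minimal sketch of the product-limit estimator is shown below for readers unfamiliar with it; it is a textbook formulation, not the authors' analysis, and the function names are assumptions. The log-rank test is not included.

```python
from typing import List, Optional, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct event time; events[i] is False when censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


def median_survival(curve: Sequence[Tuple[float, float]]) -> Optional[float]:
    """First time at which the survival estimate drops to 0.5 or below, if any."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

Applied separately to, say, a marker-positive and a marker-negative group, the two medians can then be compared; a significance statement such as the p-values above would additionally require a log-rank test.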

A Study on the Expression of CD44s and CD44v6 in Non-Small Cell Lung Carcinomas (비소세포성 폐암종의 CD44s 및 CD44v6의 발현에 대한 연구 -CD44의 발현에 대한 연구-)

  • Chang, Woon-Ha;Oh, Tae-Yun;Kim, Jung-Tae
    • Journal of Chest Surgery / v.39 no.1 s.258 / pp.1-11 / 2006
  • Background: CD44 is a cell-surface glycoprotein involved in cell-to-cell and cell-to-matrix interactions. The standard form, CD44s, and multiple isoforms are determined by alternative splicing of 10 exons. Recent studies have suggested that CD44 may promote invasion and metastasis of various epithelial tumors as well as activation of lymphocytes and monocytes. The expression pattern of CD44 can differ according to tumor type. The author studied the expression pattern of CD44s and one of its variants, CD44v6, in non-small cell lung carcinomas (NSCLC) to find their clinicopathologic implications, including patient survival. Material and Method: A total of 89 primary NSCLCs (48 squamous cell carcinomas, 33 adenocarcinomas, and 8 undifferentiated large cell carcinomas) were retrieved from the years between 1985 and 1994. Immunohistochemistry was performed using monoclonal antibodies, and the relationship of CD44 expression to angiogenesis was evaluated by counting the number of tumor microvessels. Result: Seventy-one (79.8%) and 64 (71.9%) of the 89 NSCLCs expressed CD44s and CD44v6, respectively. The expression of CD44s was well correlated with that of CD44v6 (r=0.710, p < 0.0001). The expression of CD44s and CD44v6 was associated with the histopathologic type of the NSCLCs, and squamous cell carcinoma was the type that showed the highest expression of CD44s and CD44v6 (p < 0.0001). Microvessel count was the highest in adenocarcinomas (113.6$\pm$69.7 at 200-fold magnification and 54.8$\pm$41.1 at 400-fold magnification) and correlated with the tumor size of the TNM system (r=0.217, p=0.043) and with CD44s expression (r=0.218, p=0.040). In adenocarcinoma, patients with higher CD44s expression had shorter survival than those with lower CD44s expression (p=0.0194), but the difference was not statistically significant on multivariate analysis (p=0.3298). Conclusion: The expression of both CD44s and CD44v6 may be associated with squamous differentiation in non-small cell lung carcinomas. The relationship of CD44s expression with the microvessel density of the tumor suggests an involvement of CD44s in tumor angiogenesis, which in turn would help tumor growth.

Subject-Balanced Intelligent Text Summarization Scheme (주제 균형 지능형 텍스트 요약 기법)

  • Yun, Yeoil;Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.2 / pp.141-166 / 2019
  • Recently, channels such as social media and SNS have been generating enormous amounts of data. Among all kinds of data, the share of unstructured data represented as text has grown geometrically. Because it is difficult to read all of this text, it is important to access it rapidly and grasp its key points. To meet this need for efficient understanding, many studies on text summarization for handling and using large volumes of text have been proposed. In particular, many summarization methods using machine learning and artificial intelligence algorithms, called "automatic summarization", have recently been proposed to generate summaries objectively and effectively. However, almost all text summarization methods proposed to date construct summaries based on the frequency of contents in the original documents. Such summaries tend to omit low-weight subjects that are mentioned less often in the original text. If a summary includes only the contents of major subjects, bias occurs and information is lost, making it hard to ascertain every subject the documents contain. To avoid this bias, one can summarize with a balance between the topics a document has so that all of its subjects can be ascertained, but an imbalance in the distribution of those subjects still remains. To retain a balance of subjects in the summary, it is necessary to consider the proportion of every subject the documents originally have and also to allocate portions to subjects equally, so that even sentences on minor subjects are sufficiently included. In this study, we propose a "subject-balanced" text summarization method that secures balance among all subjects and minimizes the omission of low-frequency subjects. For subject-balanced summarization, we use two summary evaluation metrics, "completeness" and "succinctness". Completeness means that a summary should fully include the contents of the original documents, and succinctness means that a summary has minimal internal duplication. The proposed method has three phases. The first phase constructs subject term dictionaries. Topic modeling is used to calculate topic-term weights, which indicate the degree to which each term is related to each topic. From the derived weights, highly related terms can be identified for every topic, and the subjects of the documents can be found from topics composed of terms with similar meanings. A few terms that represent each subject well, called "seed terms", are then selected. Because the seed terms are too few to explain each subject, enough terms similar to them are needed for a well-constructed subject dictionary. Word2Vec is used for this word expansion: word vectors are created by Word2Vec modeling, and the similarity between all terms is derived from those vectors using cosine similarity. The higher the cosine similarity between two terms, the stronger the relationship between them is taken to be. Terms with high similarity to the seed terms of each subject are selected, and after filtering these expanded terms the subject dictionary is finally constructed. The second phase allocates subjects to every sentence in the original documents. To grasp the contents of all sentences, frequency analysis is first conducted with the terms that make up the subject dictionaries. The TF-IDF weight of each subject is then calculated, which shows how much each sentence explains each subject. Because TF-IDF weights have no upper bound, the weights of every subject for each sentence are normalized to values between 0 and 1. Each sentence is then allocated to the subject with its maximum TF-IDF weight, and a sentence group is finally constructed for each subject. The last phase generates the summary. Sen2Vec is used to compute the similarity between subject sentences, forming a similarity matrix. By repeatedly selecting sentences, it is possible to generate a summary that fully includes the contents of the original documents and minimizes duplication within the summary itself. To evaluate the proposed method, 50,000 TripAdvisor reviews were used to construct the subject dictionaries and 23,087 reviews were used to generate summaries. A comparison between summaries from the proposed method and frequency-based summaries verified that the proposed method better retains the balance of all subjects that the documents originally have.
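
Three small building blocks from the pipeline described above can be sketched as follows: cosine similarity between term or sentence vectors, scaling of subject weights into the 0 to 1 range, and allocation of a sentence to its highest-weighted subject. Min-max scaling is an assumption (the abstract only says the weights are normalized to values between 0 and 1), and all function names and the example subjects are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalize_01(weights: Sequence[float]) -> List[float]:
    """Min-max scale weights into [0, 1] (assumed normalization scheme)."""
    low, high = min(weights), max(weights)
    if high == low:
        return [0.0 for _ in weights]
    return [(w - low) / (high - low) for w in weights]


def assign_subject(subject_weights: Dict[str, float]) -> str:
    """Allocate a sentence to the subject with the largest normalized weight."""
    return max(subject_weights, key=subject_weights.get)
```

A pairwise similarity matrix for the final phase would be obtained by calling `cosine_similarity` on every pair of sentence vectors within a subject group.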