A New Approach to Automatic Keyword Generation Using Inverse Vector Space Model (키워드 자동 생성에 대한 새로운 접근법: 역 벡터공간모델을 이용한 키워드 할당 방법)

  • Cho, Won-Chin; Rho, Sang-Kyu; Yun, Ji-Young Agnes; Park, Jin-Soo
    • Asia Pacific Journal of Information Systems / v.21 no.1 / pp.103-122 / 2011
  • Recently, numerous documents have been made available electronically. Internet search engines and digital libraries commonly return query results containing hundreds or even thousands of documents. In this situation, it is virtually impossible for users to examine complete documents to determine whether they might be useful. For this reason, some online documents are accompanied by a list of keywords specified by the authors in an effort to guide users by facilitating the filtering process. In this way, a set of keywords is often considered a condensed version of the whole document and therefore plays an important role in document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. Since many academic journals ask authors to provide a list of five or six keywords on the first page of an article, keywords are most familiar in the context of journal articles. However, many other types of documents could also benefit from the use of keywords, including Web pages, email messages, news reports, magazine articles, and business papers. Although the potential benefit is large, implementation is the obstacle: manually assigning keywords to all documents is a daunting, even impractical, task because it is extremely tedious and time-consuming and requires a certain level of domain knowledge. Therefore, it is highly desirable to automate the keyword generation process. There are two main approaches to achieving this aim: keyword assignment and keyword extraction. Both approaches use machine learning methods and require, for training purposes, a set of documents with keywords already attached. In the former approach, there is a given vocabulary, and the aim is to match its terms to the texts. In other words, the keyword assignment approach seeks to select the words from a controlled vocabulary that best describe a document.
Although this approach is domain dependent and is not easy to transfer and expand, it can generate implicit keywords that do not appear in a document. In the latter approach, by contrast, the aim is to extract keywords according to their relevance in the text, without a prior vocabulary. In this approach, automatic keyword generation is treated as a classification task, and keywords are commonly extracted based on supervised learning techniques. Thus, keyword extraction algorithms classify candidate keywords in a document into positive or negative examples. Several systems, such as Extractor and Kea, were developed using the keyword extraction approach. The most indicative words in a document are selected as keywords for that document; as a result, keyword extraction is limited to terms that appear in the document and cannot generate implicit keywords that are not included in it. According to the experimental results of Turney, about 64% to 90% of keywords assigned by the authors can be found in the full text of an article. Conversely, 10% to 36% of the author-assigned keywords do not appear in the article and therefore cannot be generated by keyword extraction algorithms. Our preliminary experiment also shows that 37% of keywords assigned by the authors are not included in the full text. This is why we adopted the keyword assignment approach. In this paper, we propose a new approach to automatic keyword assignment, namely IVSM (Inverse Vector Space Model). The model is based on the vector space model, a conventional information retrieval model that represents documents and queries as vectors in a multidimensional space. IVSM generates an appropriate keyword set for a specific document by measuring the distance between the document and the keyword sets.
The keyword assignment process of IVSM is as follows: (1) calculating the vector length of each keyword set based on each keyword weight; (2) preprocessing and parsing a target document that does not have keywords; (3) calculating the vector length of the target document based on the term frequency; (4) measuring the cosine similarity between each keyword set and the target document; and (5) generating keywords that have high similarity scores. Two keyword generation systems were implemented applying IVSM: an IVSM system for a Web-based community service and a stand-alone IVSM system. The first system is implemented in a community service for sharing knowledge and opinions on current trends such as fashion, movies, social problems, and health information. The stand-alone IVSM system is dedicated to generating keywords for academic papers, and it has been tested on a number of academic papers, including those published by the Korean Association of Shipping and Logistics, the Korea Research Academy of Distribution Information, the Korea Logistics Society, the Korea Logistics Research Association, and the Korea Port Economic Association. We measured the performance of IVSM by the number of matches between the IVSM-generated keywords and the author-assigned keywords. According to our experiment, the precision of IVSM applied to the Web-based community service and to academic journals was 0.75 and 0.71, respectively. The performance of both systems is much better than that of baseline systems that generate keywords based on simple probability. IVSM also shows performance comparable to that of Extractor, a representative keyword extraction system developed by Turney. As the number of electronic documents increases, we expect that the IVSM proposed in this paper can be applied to many electronic documents in Web-based communities and digital libraries.
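
The five steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword profiles, their weights, the tokenizer, and the names `KEYWORD_PROFILES` and `assign_keywords` are all hypothetical, and the abstract does not say how keyword weights are actually derived.

```python
import math
import re
from collections import Counter

# Hypothetical keyword sets: each candidate keyword maps to weighted terms.
# The weights are invented for illustration only.
KEYWORD_PROFILES = {
    "information retrieval": {"query": 2.0, "search": 1.5, "document": 1.0},
    "logistics": {"port": 2.0, "shipping": 1.5, "cargo": 1.0},
}

def vector_length(weights):
    """Euclidean length of a sparse term-weight vector (steps 1 and 3)."""
    return math.sqrt(sum(w * w for w in weights.values()))

def tokenize(text):
    """Very rough preprocessing (step 2): lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse vectors (step 4)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    len_a, len_b = vector_length(a), vector_length(b)
    return dot / (len_a * len_b) if len_a and len_b else 0.0

def assign_keywords(text, profiles, top_n=1):
    """Return the top_n keywords with the highest non-zero similarity (step 5)."""
    term_freq = Counter(tokenize(text))  # term-frequency vector of the document
    scored = sorted(
        ((cosine(profile, term_freq), keyword) for keyword, profile in profiles.items()),
        reverse=True,
    )
    return [keyword for score, keyword in scored[:top_n] if score > 0]

print(assign_keywords("Search engines return a document list for each query",
                      KEYWORD_PROFILES))
```

Because the keyword comes from the profile table rather than from the text, the sketch can return a keyword such as "information retrieval" even though that phrase never appears in the document, which is the property the abstract attributes to keyword assignment.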

Sentiment Analysis of Movie Review Using Integrated CNN-LSTM Model (CNN-LSTM 조합모델을 이용한 영화리뷰 감성분석)

  • Park, Ho-yeon; Kim, Kyoung-jae
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.141-154 / 2019
  • Internet technology and social media are growing rapidly. Data mining technology has evolved to enable unstructured document representations in a variety of applications. Sentiment analysis is an important technology that can distinguish low-quality from high-quality content using text data about products, and it has proliferated within text mining. Sentiment analysis mainly analyzes people's opinions in text data by assigning predefined categories such as positive and negative. It has been studied in various directions in terms of accuracy, from simple rule-based approaches to dictionary-based approaches using predefined labels. In fact, sentiment analysis is one of the most active research areas in natural language processing and is widely studied in text mining. Online reviews are openly available, so the information is not only easy to collect but also affects business. In marketing, real-world information from customers is gathered on websites, not through surveys. Whether a website's posts are positive or negative is reflected in sales, so firms try to identify this information. However, the many reviews on a website are not always of good quality, and their sentiment is difficult to identify. Earlier studies in this research area used review data from the Amazon.com shopping mall, whereas recent studies use data from stock market trends, blogs, news articles, weather forecasts, IMDB, Facebook, and so on. However, accuracy remains a recognized problem because sentiment calculations change with the subject, paragraph, sentiment lexicon direction, and sentence strength. This study aims to classify sentiment polarity into positive and negative categories and to increase the prediction accuracy of polarity analysis using the pretrained IMDB review data set.
First, for text classification related to sentiment analysis, popular machine learning algorithms such as NB (naive Bayes), SVM (support vector machines), XGBoost, RF (random forests), and Gradient Boost are adopted as comparative models. Second, deep learning has demonstrated the ability to extract complex, discriminative features from data. Representative algorithms are CNN (convolutional neural networks), RNN (recurrent neural networks), and LSTM (long short-term memory). CNN can be used similarly to BoW when processing a sentence in vector format, but it does not consider sequential data attributes. RNN handles ordered data well because it takes the time information of the data into account, but it suffers from the long-term dependency problem. LSTM is used to solve the problem of long-term dependence. For the comparison, CNN and LSTM were chosen as simple deep learning models. In addition to classical machine learning algorithms, CNN, LSTM, and the integrated models were analyzed. Because the algorithms have many parameters, we examined the relationship between parameter values and precision to find the optimal combination. We also tried to determine how well the models perform for sentiment analysis and how they work. This study proposes integrated CNN and LSTM algorithms to extract the positive and negative features of text analysis. The reasons for mixing these two algorithms are as follows. CNN can extract features for classification automatically by applying convolution layers and massively parallel processing. LSTM is not capable of highly parallel processing. Like faucets, the LSTM has input, output, and forget gates that can be moved and controlled at a desired time. These gates have the advantage of placing memory blocks on hidden nodes. The memory block of the LSTM may not store all the data, but it can address the CNN's long-term dependency problem.
Furthermore, when LSTM is used in CNN's pooling layer, it has an end-to-end structure, so that spatial and temporal features can be designed simultaneously. The combined CNN-LSTM model achieved 90.33% accuracy. It is slower than CNN but faster than LSTM. The presented model was more accurate than the other models. In addition, each word embedding layer can be improved when training the kernel step by step. CNN-LSTM can compensate for the weaknesses of each model, and the end-to-end structure of LSTM has the advantage of improving learning layer by layer. For these reasons, this study tries to enhance the classification accuracy of movie reviews using the integrated CNN-LSTM model.
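
To make the "faucet" description of the input, output, and forget gates concrete, here is a minimal sketch of one step of a standard LSTM cell, written with scalars in plain Python for readability. It uses the textbook LSTM equations, not the authors' CNN-LSTM model; the function name `lstm_step` and the parameter layout are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each of "forget", "input", "output", "candidate" to a
    hypothetical (w_x, w_h, bias) triple.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))      # how much old memory to keep
    i = sigmoid(pre_activation("input"))       # how much new content to admit
    o = sigmoid(pre_activation("output"))      # how much memory to expose
    g = math.tanh(pre_activation("candidate")) # proposed new content

    c = f * c_prev + i * g   # updated memory cell
    h = o * math.tanh(c)     # updated hidden state
    return h, c

# Forget gate nearly open, input gate nearly closed: the memory is carried over.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.5, params=params)
print(h, c)
```

With the forget gate close to 1 and the input gate close to 0, the cell state stays near its previous value, which is how an LSTM can carry information across many time steps.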

Studies on the morphological variation of plant organs of elongating node-part in rice plant (수도 신장 절위 경엽의 형태변이에 관한 연구)

  • 김만수
    • KOREAN JOURNAL OF CROP SCIENCE / v.5 no.1 / pp.1-35 / 1969
  • Attempts were made to obtain fundamental knowledge on the quantitative constitution of leaves and stem of the elongating node-part, and the relationships between these morphological characteristics, the nitrogen content of leaves, and grain yield were examined under varying nitrogen application amounts in rice plants. I. The agronomic characteristics of leaves and nodes of the elongating node-part (4 node-parts from the top of the stem) were observed at heading stage with 20 leading rice varieties of Kang Won district. The results are summarized as follows: 1. Leaf area of the flag and the fourth leaf was smaller than that of the second and the third, with average values of 18.61 $cm^2$ for the flag leaf, 21.84 $cm^2$ for the second leaf, 21.52 $cm^2$ for the third and 18.56 $cm^2$ for the fourth. The weight of leaf blade showed the same tendency as leaf area, with values of 97.0 mg for the flag leaf, 117.1 mg for the second leaf, 115.4 mg for the third, and 95.3 mg for the fourth. The weight of each leaf sheath was remarkably larger at the higher node-part than at the lower node-part of the stem, with values of 176.3 mg for the flag leaf sheath, 163.7 mg for the second, 163.4 mg for the third and 123.9 mg for the fourth. Accordingly, the total leaf weight of each part was larger at the second and the third leaf than at the first and the fourth. Total plant weight of each part (weight of leaf blade, leaf sheath, and culm) was also larger at the middle node-part. 2. Coefficients of variation for the varietal differences of the morphological characteristics of the elongating node-part were 12.75% for the leaf area, 15.29% for the weight of leaf blade, 15.90% for the weight of leaf sheath, 11.42% for the weight of internode, 15.45% for the leaf weight (leaf blade & leaf sheath) and 13.24% for the straw weight.
These coefficient values for most characteristics were, on the whole, smaller at the second and the third node-part than at the first and the fourth node-part, but the coefficient value of the internode weight was rather small at the third and fourth node-part. 3. Constitutional ratio of each plant organ to the total plant weight in terms of dry matter weight (excluding head and root weight) was 39.2% for the leaf sheath, 34.2% for the culm, and 26.6% for the leaf blade. The constitutional ratio of leaf sheath in terms of dry matter weight was larger at the higher position, in contrast with that of culm. 4. Average weight ratios of leaf blade to culm, leaf sheath to culm, leaf blade to leaf sheath, and leaf blade to culm plus leaf sheath were 77.7%, 114.5%, 67.9% and 36.2%, respectively. With regard to the position of the plant organ, the weight ratio of leaf blade to culm and that of leaf sheath to culm were larger at the higher part, in contrast with that of leaf blade to leaf sheath. 5. Generally, close relationships were found between grain yield and each morphological characteristic of the plant organs of the elongating node-part, as follows: The correlation coefficient between total area of 4 leaves (from flag to the fourth leaf) and grain yield was ${\gamma}$=0.666$^{**}$. In regard to the position of leaves, correlation coefficient values of flag, the second, the third and the fourth leaf were ${\gamma}$=0.659$^{**}$, ${\gamma}$=0.609$^{**}$, ${\gamma}$=0.464$^{*}$ and ${\gamma}$=0.523$^{*}$, respectively. The correlation coefficient between total weight of leaf blades and grain yield was ${\gamma}$=0.678$^{**}$. In regard to the position of leaves, that of flag leaf was ${\gamma}$=0.691$^{**}$, and ${\gamma}$=0.654$^{**}$ for the second leaf, ${\gamma}$=0.570$^{**}$ for the third, and ${\gamma}$=0.544$^{**}$ for the fourth. Correlation between the weight of leaves (blade weight plus sheath weight) and the grain yield showed similar values.
In the relationship between plant weight and grain yield there was also significant correlation, but with a highly significant value only for the first node-part. There appeared a correlation between total weight of leaf sheath and grain yield, with the value of ${\gamma}$=0.572$^{**}$, and in regard to the position of each leaf sheath the values were ${\gamma}$=0.623$^{**}$ for the flag leaf, ${\gamma}$=0.486$^{**}$ for the second leaf, ${\gamma}$=0.513$^{**}$ for the third, ${\gamma}$=0.450$^{**}$ for the fourth. However, there was no significant correlation between culm weight and grain yield. 6. With respect to grain yield, varietal differences in magnitude of leaf area, weight of leaf blade, leaf weight per unit area, weight of leaf sheath, culm weight, and total leaf and stem weight were larger in the case of high yielding varieties and decreased in accordance with decreasing yield. This tendency was also shown in the varietal differences of magnitude of each part. Variation in magnitude of each part for the leaf area, weight of leaf blade, and culm weight was significantly smaller in high yielding varieties than in low yielding varieties. 7. The constitutional ratio of each organ of the elongating node-part in terms of weight varied to some extent according to variety: leaf blade 27.6%, leaf sheath 39.5%, culm 32.9% in the case of high yielding varieties; leaf blade 25.5%, leaf sheath 38.1%, culm 36.4% in the case of low yielding varieties; medium yielding varieties showed intermediate values. 8. Far higher values of the weight ratio of leaf blade to culm and leaf sheath to culm were found in the high yielding varieties than in low yielding varieties, and medium yielding varieties showed intermediate values. II. 
Effects of application rate of nitrogen on the morphological characteristics of the elongating node-part, nitrogen content of leaf blade, and their relation with the grain yield of rice were observed with 3 rice varieties (Shin No.2, Shirogane, and Jinheung), varying application amounts of nitrogen as 8kg, 12kg and 16kg per 10 are. 1. As for the variation of morphological magnitude as affected by the amounts of nitrogen application, total leaf area (4 leaves from the flag leaf) increased by 16.5% at the 12kg N plot and by about 30% at the 16kg N plot compared to the 8kg N plot, and total weight of leaf blade also increased to a similar extent, in contrast with weight of leaf sheath, which increased by 4.9% and 7.8%, respectively. However, the weight of culm decreased by 1.5% and 11.2% at the 12kg N plot and 16kg N plot, respectively, and this rate of decrease was noted at the nodes of the lower part. 2. As for the varietal differences in variation of morphological magnitude as affected by the amount of nitrogen fertilization, the coefficient of variation of the total leaf area was 15.40% for Shin No.2, 12.87% for Shirogane, and 10.99% for Jinheung. With respect to the position of nodes, the largest variation of leaf blade magnitude was observed at the fourth leaf for Shin No.2, the second for Shirogane, and the flag leaf for Jinheung. There was also a parallel varietal difference in the weight of leaf blade. Variation in total culm weight showed varietal differences, with coefficient values of 7.72% for Shin No.2, 12.11% for Shirogane, and 0.94% for Jinheung. There were also varietal differences in the variation according to the position of nodes. 3. Variation of each elongating node-part related to the fertilization amount decreased with the increase of fertilization amount in the items of leaf area, weight of leaf sheath, culm weight, but weight of leaf sheath varied more at heavier fertilization than at others. 4. 
Constitutional ratio of each organ excluding head also varied with fertilization amount; the constitutional ratio of leaf blade increased much with the increasing amount of fertilization, in contrast with the response of culm weight. However, the constitutional ratio of the weight of leaf sheath was not much affected. 5. A lower value of the ratio of leaf blade to culm was given to the 8kg N per 10 are plot, and the ratio of leaf blade to leaf sheath decreased with the increasing amount of fertilization, in contrast with the increase in the ratio of leaf sheath to culm. However, the ratio of leaf blade to culm plus leaf sheath decreased. 6. With the increase of nitrogen fertilization, leaf area and the weight of leaf blade and leaf sheath increased. Accordingly, grain yield also increased to some extent. It was noted that culm weight changed inversely to the changes in grain yield, but the degree of this variation varied with varietal characteristics. 7. Nitrogen content of leaves at heading and fruiting stage varied with the fertilization amount; the average nitrogen content of leaves of the varieties used was 2.19%, 2.49% and 2.74% at the plots of 8kg N, 12kg N and 16kg N per 10 are, respectively, at heading time, and 0.80%, 0.92% and 1.03% at each plot at fruiting stage. Thus, nitrogen content of leaves increased much with the increasing amount of fertilization, and a higher value was given to the leaves at the higher position of the elongating node-part. 8. There was also variation of nitrogen content of leaves in accordance with the varieties. However, higher grain yield was obtained from the plants retaining higher nitrogen content in leaves at heading or fruiting stage.
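
As a quick arithmetic check, the organ-to-organ weight ratios in item I-4 of this abstract can be approximated from the constitutional ratios in item I-3 (leaf sheath 39.2%, culm 34.2%, leaf blade 26.6%). The Python sketch below assumes the reported ratios are roughly ratios of these mean shares; the abstract calls them average ratios, so small differences are expected. The variable and function names are illustrative only.

```python
# Constitutional ratios (% of dry weight) quoted in item I-3 of the abstract.
share = {"leaf_blade": 26.6, "leaf_sheath": 39.2, "culm": 34.2}

def ratio_pct(numerator, denominator):
    """Ratio of two weight shares, expressed as a percentage with one decimal."""
    return round(100.0 * numerator / denominator, 1)

blade_to_culm = ratio_pct(share["leaf_blade"], share["culm"])
sheath_to_culm = ratio_pct(share["leaf_sheath"], share["culm"])
blade_to_sheath = ratio_pct(share["leaf_blade"], share["leaf_sheath"])
blade_to_culm_plus_sheath = ratio_pct(
    share["leaf_blade"], share["culm"] + share["leaf_sheath"]
)

# Values reported in item I-4: 77.7%, 114.5%, 67.9% and 36.2%.
print(blade_to_culm, sheath_to_culm, blade_to_sheath, blade_to_culm_plus_sheath)
```

The recomputed figures agree with the reported ones to within about 0.1 percentage point, which is consistent with rounding or with averaging per-variety ratios rather than dividing mean shares.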
