• Title/Summary/Keyword: term frequency

Normalized Term Frequency Weighting Method in Automatic Text Categorization (자동 문서분류에서의 정규화 용어빈도 가중치방법)

  • 김수진;박혁로
    • Proceedings of the IEEK Conference / 2003.11b / pp.255-258 / 2003
  • This paper defines a normalized term frequency weighting method for automatic text categorization based on the Box-Cox transformation and applies it to automatic text categorization. The Box-Cox transformation is a statistical transformation that brings data closer to a normal distribution. This paper applies that transformation to propose a new term frequency weighting method. Because the normalized term frequency is computed differently for each term, unlike existing term frequency weighting methods, it is a more general method than fixed weighting methods such as log or root. The validity of the normalized term frequency weighting method was demonstrated through experiments on 8,000 newspaper articles divided into 4 groups, which yielded high categorization accuracy in all cases.
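
The abstract does not give the paper's exact weighting formula, so the following is only a minimal sketch of the standard one-parameter Box-Cox transform applied to raw term frequencies. The function name `box_cox`, the sample frequencies, and the choice of `lam` values are illustrative assumptions. It shows why the transform generalizes fixed schemes: `lam = 0` reduces to the log weighting, and `lam = 0.5` is a shifted and scaled square root.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Standard one-parameter Box-Cox transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        # Limiting case of (x**lam - 1) / lam as lam approaches 0.
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Hypothetical raw term frequencies of one term in five documents.
raw_tf = [1, 2, 4, 8, 16]

# In practice lam would be estimated per term (for example by maximum
# likelihood); here a few fixed values are shown for comparison only.
for lam in (0.0, 0.5, 1.0):
    weights = [round(box_cox(tf, lam), 3) for tf in raw_tf]
    print(f"lam={lam}: {weights}")
```

Estimating a separate `lam` for each term is what would make the weighting term-specific rather than fixed.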

A Term Importance-based Approach to Identifying Core Citations in Computational Linguistics Articles

  • Kang, In-Su
    • Journal of the Korea Society of Computer and Information / v.22 no.9 / pp.17-24 / 2017
  • Core citation recognition aims to identify the influential ones among the prior articles that a scholarly article cites. Previous approaches have employed citing-text occurrence information, textual similarities between the citing and cited articles, etc. This study proposes a term-based approach to core citation recognition, which exploits the importance of individual terms appearing in in-text citations to calculate the influence strength of each cited article. Term importance is computed using various kinds of frequency information, such as term frequency (tf) in the in-text citation, tf in the citing article, inverse sentence frequency in the citing article, and inverse document frequency in a collection of articles. Experiments using a previous test set consisting of computational linguistics articles show that the term-based approach performs comparably with the previous approaches. The proposed technique could easily be extended by employing other term units such as n-grams and phrases, or by using new term-importance formulae.
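
The abstract lists the frequency signals but not how they are combined, so the sketch below is purely illustrative. It assumes a multiplicative combination (tf in the citation × tf in the citing article × inverse sentence frequency × inverse document frequency) and sums it over the terms of one in-text citation. The function names, the `log(N / n)` form of both inverse frequencies, and the toy data are all assumptions, not the paper's formula.

```python
import math
from collections import Counter


def inverse_frequency(n_units: int, n_units_with_term: int) -> float:
    """log(N / n), used here for sentences (isf) and documents (idf) alike."""
    return math.log(n_units / n_units_with_term)


def influence_strength(citation_terms, article_sentences, doc_freq, n_docs):
    """Sum an assumed tf_cite * tf_article * isf * idf score over citation terms.

    citation_terms:    tokens of one in-text citation context
    article_sentences: the citing article as a list of token lists
    doc_freq:          term -> number of collection articles containing it
    n_docs:            number of articles in the collection
    """
    tf_cite = Counter(citation_terms)
    n_sentences = len(article_sentences)
    score = 0.0
    for term, tf in tf_cite.items():
        tf_article = sum(s.count(term) for s in article_sentences)
        sentences_with_term = sum(1 for s in article_sentences if term in s)
        isf = inverse_frequency(n_sentences, sentences_with_term)
        idf = inverse_frequency(n_docs, doc_freq[term])
        score += tf * tf_article * isf * idf
    return score


# Toy citing article (four tokenized sentences) and collection statistics.
sentences = [
    ["parsing", "model", "cite"],
    ["model", "results"],
    ["parsing", "baseline"],
    ["results", "table"],
]
strength = influence_strength(
    ["parsing", "model"], sentences, {"parsing": 10, "model": 50}, n_docs=100
)
print(strength)
```

Under this assumed formula, cited articles would be ranked by `strength`, with higher values suggesting a more influential citation.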

Automatic Classification of Blog Posts using Various Term Weighting (다양한 어휘 가중치를 이용한 블로그 포스트의 자동 분류)

  • Kim, Su-Ah;Jho, Hee-Sun;Lee, Hyun Ah
    • Journal of Advanced Marine Engineering and Technology / v.39 no.1 / pp.58-62 / 2015
  • Most blog sites provide predefined classes based on contents or topics, but few bloggers choose classes for their posts because the manual process is cumbersome. This paper proposes an automatic blog post classification method that combines term frequency, document frequency, and class frequency for each class in various ways to find an appropriate weighting scheme. In experiments, the combination of term frequency, category term frequency, and inverse document frequency computed with the target category excluded achieved a classification precision of 77.02%.

Term Frequency-Inverse Document Frequency (TF-IDF) Technique Using Principal Component Analysis (PCA) with Naive Bayes Classification

  • J.Uma;K.Prabha
    • International Journal of Computer Science & Network Security / v.24 no.4 / pp.113-118 / 2024
  • Sentiment analysis on Twitter is difficult, yet it is widely used for analyzing reviews. The difficulty arises because tweets are extremely short and mostly contain slang, emoticons, hashtags, and other tweet-specific words. Feature extraction covers every technique for deriving structural and aspect-level features from individual tweets. Each element of a feature vector is an integer that contributes to assigning a sentiment class to a tweet. The purpose of feature extraction is to select the right features in order to improve the accuracy of the classification models. In this paper, we propose a Term Frequency-Inverse Document Frequency (TF-IDF) method combined with Principal Component Analysis (PCA) and a Naïve Bayes classifier. In the classification process, the proposed work can derive distinct aspects from features with widely varying values in a Twitter dataset.

Estimation of Occurrence Frequency of Short Term Air Pollution Concentration Using Texas Climatological Model (Texas Climatological Model에 의한 短期 大氣汚染濃度 發生頻度의 推定)

  • Lee, Chong-Bum
    • Journal of Korean Society for Atmospheric Environment / v.4 no.2 / pp.67-71 / 1988
  • A procedure was developed and added to the Texas Climatological Model version 2 to estimate the probability of short-term air pollution concentrations from the long-term arithmetic average concentration. The procedure applies the statistical property that the frequency distribution of short-term concentrations can be approximated by a lognormal distribution. It can estimate not only the highest concentration for a variety of averaging times but also concentrations for an arbitrary occurrence frequency. An evaluation against short-term concentrations calculated by the Texas Episodic Model version 8, using meteorological and emission data for Seoul, shows that the procedure estimates concentrations fairly well over a wide range of percentiles.
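
The abstract does not describe the procedure's internals, so the following is only a sketch of the underlying lognormal arithmetic, not the Texas Climatological Model code. It assumes the geometric standard deviation of the short-term concentrations is known (the parameter `geo_sd` is an assumption introduced here) and converts a long-term arithmetic mean into the concentration at a chosen cumulative percentile.

```python
import math
from statistics import NormalDist


def lognormal_percentile(arith_mean: float, geo_sd: float, percentile: float) -> float:
    """Concentration at a cumulative percentile of a lognormal distribution.

    arith_mean: long-term arithmetic mean concentration (> 0)
    geo_sd:     geometric standard deviation (>= 1), assumed known
    percentile: cumulative percentile, strictly between 0 and 100
    """
    sigma = math.log(geo_sd)
    # For a lognormal distribution, mean = exp(mu + sigma**2 / 2).
    mu = math.log(arith_mean) - 0.5 * sigma ** 2
    z = NormalDist().inv_cdf(percentile / 100.0)
    return math.exp(mu + z * sigma)


# Hypothetical inputs: annual mean of 100 (arbitrary units), geo_sd of 2.
for p in (50, 90, 99):
    print(p, round(lognormal_percentile(100.0, 2.0, p), 1))
```

The median comes out below the arithmetic mean, as expected for a right-skewed distribution, and high percentiles correspond to the rare high-concentration episodes.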

Analysis of Drought Characteristics in Gyeongbuk Based on the Duration of Standard Precipitation Index

  • Ahn, Seung Seop;Park, Ki bum;Yim, Dong Hee
    • Journal of Environmental Science International / v.28 no.10 / pp.863-872 / 2019
  • Using the Standard Precipitation Index (SPI), this study analyzed the drought characteristics of ten weather stations in Gyeongbuk, South Korea, that have precipitation data covering a period of 30 years. From the number of months with an SPI of -1.0 or less, the drought occurrence index was calculated, and the maximum shortage months, resilience, and vulnerability of each weather station were analyzed. According to the analysis, in terms of vulnerability, the weather stations with acute short-term drought were Andong, Bonghwa, Moongyeong, and Gumi. The weather stations with acute medium-term drought were Daegu and Uljin. Finally, the weather stations with acute long-term drought were Pohang, Youngdeok, and Youngju. In terms of severe drought frequency, the stations with a relatively high frequency of mid-term droughts were Andong, Bonghwa, Daegu, Uiseong, Uljin, and Youngju. Gumi station had a high frequency of short-term droughts. Pohang station had severe short-term and long-term droughts. Youngdeok had severe droughts over all terms. Based on the analysis results, it is inferred that the magnitude of a drought should be evaluated according to how serious the vulnerability, resilience, and drought index are. Proper evaluation of drought makes it possible to take systematic measures for the duration of the drought.
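
As a rough illustration of the counting step described above, the sketch below tallies months with an SPI at or below -1.0 and finds the longest consecutive run of such months. Reading "maximum shortage months" as the longest consecutive run is an assumption, as are the function name and the sample series. The resilience and vulnerability indices are not defined in the abstract and are not attempted here.

```python
def drought_summary(spi_series, threshold=-1.0):
    """Return (number of drought months, longest consecutive drought run).

    A month counts as a drought month when its SPI is <= threshold.
    """
    drought_months = 0
    longest_run = 0
    current_run = 0
    for spi in spi_series:
        if spi <= threshold:
            drought_months += 1
            current_run += 1
            longest_run = max(longest_run, current_run)
        else:
            current_run = 0
    return drought_months, longest_run


# Hypothetical monthly SPI values for one station.
monthly_spi = [0.3, -1.2, -1.0, -0.5, -1.5, -2.0, -1.1, 0.8]
print(drought_summary(monthly_spi))  # (5, 3)
```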

Time-Frequency Analysis of Electrohysterogram for Classification of Term and Preterm Birth

  • Ryu, Jiwoo;Park, Cheolsoo
    • IEIE Transactions on Smart Processing and Computing / v.4 no.2 / pp.103-109 / 2015
  • In this paper, a novel method for the classification of term and preterm birth is proposed based on time-frequency analysis of the electrohysterogram (EHG) using multivariate empirical mode decomposition (MEMD). EHG is a promising modality for preterm birth prediction because it is low-cost and accurate compared to other preterm birth prediction methods, such as tocodynamometry (TOCO). Previous studies on preterm birth prediction applied prefiltering based on Fourier analysis of the EHG, followed by feature extraction and classification, even though Fourier analysis is suboptimal for biomedical signals such as EHG because of their nonlinearity and nonstationarity. Therefore, the proposed method applies prefiltering based on MEMD instead of Fourier-based prefilters before extracting the sample entropy feature and classifying the term and preterm birth groups. For the evaluation, the Physionet term-preterm EHG database was used, and the proposed method was compared with the Fourier prefiltering-based method. The results showed that the area under the curve (AUC) of the receiver operating characteristic (ROC) increased by 0.0351 when MEMD was used instead of the Fourier-based prefilter.

Comparison of term weighting schemes for document classification (문서 분류를 위한 용어 가중치 기법 비교)

  • Jeong, Ho Young;Shin, Sang Min;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics / v.32 no.2 / pp.265-276 / 2019
  • The document-term frequency matrix is a common data object in text mining. In this study, we introduce the traditional term weighting scheme TF-IDF (term frequency-inverse document frequency), which is applied to the document-term frequency matrix and used for text classification. In addition, we introduce and compare the TF-IDF-ICSDF and TF-IGM schemes, which have recently become well known. This study also provides a method for extracting keywords that enhance the quality of text classification. Based on the extracted keywords, we applied a support vector machine for text classification. To compare the performance of the term weighting schemes, we used performance metrics such as precision, recall, and F1-score. The results show that the TF-IGM scheme yielded high performance metrics and was optimal for text classification.
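
For orientation, here is a minimal sketch of the baseline TF-IDF weighting and of the evaluation metrics named above. It uses raw counts for tf and `log(N / df)` for idf, which is one common variant and not necessarily the one used in the paper. The TF-IDF-ICSDF and TF-IGM schemes are not reproduced, and the toy corpus and confusion counts are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term of each tokenized document by tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


docs = [
    ["stock", "market", "rises"],
    ["market", "news", "news"],
    ["soccer", "match", "news"],
]
weights = tf_idf(docs)
print(weights[1])
print(precision_recall_f1(tp=8, fp=2, fn=4))
```

A term that occurs in every document gets weight zero under this variant, which is the behavior the more recent schemes try to refine with class-level information.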

A Study on the Pivoted Inverse Document Frequency Weighting Method (피벗 역문헌빈도 가중치 기법에 대한 연구)

  • Lee, Jae-Yun
    • Journal of the Korean Society for Information Management / v.20 no.4 s.50 / pp.233-248 / 2003
  • The Inverse Document Frequency (IDF) weighting method is based on the hypothesis that the lower the frequency of a term in the document collection, the more important the term is as a subject word. This well-known hypothesis is, however, somewhat questionable because some low-frequency terms turn out to be inadequate subject words. This study proposes the pivoted IDF weighting method for better retrieval effectiveness, on the assumption that medium-frequency terms are more important than low-frequency terms. We thoroughly evaluated this method on three test collections, and it showed performance improvements, especially at high ranks.

Sensor Fusion of GPS/INS/Baroaltimeter Using Wavelet Analysis (GPS/INS/기압고도계의 웨이블릿 센서융합 기법)

  • Kim, Seong-Pil;Kim, Eung-Tai;Seong, Kie-Jeong
    • Journal of Institute of Control, Robotics and Systems / v.14 no.12 / pp.1232-1237 / 2008
  • This paper introduces an application of wavelet analysis to the sensor fusion of GPS/INS/baroaltimeter. Using wavelet analysis, the baro-inertial altitude is decomposed into low-frequency content and high-frequency content. The high-frequency components, the 'details', represent the perturbed altitude change around the long-term trend. The GPS altitude is also broken down by a wavelet decomposition. The low-frequency components, the 'approximations', of the decomposed signal capture the long-term trend of the altitude. It is proposed that the final altitude be determined as the sum of the details of the baro-inertial altitude and the approximations of the GPS altitude. The final altitude then excludes long-term baro-inertial errors and short-term GPS errors. Finally, test results show that the proposed method successfully produces a continuous and sensitive altitude.
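
The fusion rule described above can be sketched as follows. The abstract does not name the wavelet family or the decomposition depth, so this sketch assumes a single-level Haar decomposition in its averaging form. All function names and the sample altitudes are illustrative assumptions.

```python
def haar_decompose(signal):
    """Single-level Haar split into approximations (pair means) and details."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert haar_decompose: each (a, d) pair yields samples a + d and a - d."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out


def fuse_altitude(gps_alt, baro_alt):
    """Combine the GPS approximations with the baro-inertial details."""
    gps_approx, _ = haar_decompose(gps_alt)
    _, baro_detail = haar_decompose(baro_alt)
    return haar_reconstruct(gps_approx, baro_detail)


# Hypothetical altitude samples in metres; the baro-inertial channel carries
# a slowly varying bias of roughly 10 m.
gps = [100.0, 102.0, 104.0, 102.0]
baro = [110.0, 111.0, 113.0, 112.0]
print(fuse_altitude(gps, baro))  # [100.5, 101.5, 103.5, 102.5]
```

In this toy case the fused signal follows the GPS trend while the sample-to-sample changes come from the baro-inertial channel, so the baro bias drops out.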