• Title/Summary/Keyword: datasets

Reference String Recognition based on Word Sequence Tagging and Post-processing: Evaluation with English and German Datasets

  • Kang, In-Su
    • Journal of the Korea Society of Computer and Information / v.23 no.5 / pp.1-7 / 2018
  • Reference string recognition extracts individual reference strings from the reference section of an academic article, which consists of a sequence of reference lines. The task has been addressed with heuristic-based, clustering-based, and classification-based approaches that exploit lexical and layout characteristics of reference lines. Most classification-based methods use sequence labeling to assign labels either to a sequence of tokens within reference lines or to a sequence of reference lines. Unlike previous token-level sequence labeling approaches, this study assigns different labels to the beginning, intermediate, and terminating tokens of a reference string. Post-processing is then applied to identify reference strings by predicting their beginning and/or terminating tokens. Experimental evaluation using English and German reference string recognition datasets shows that the proposed method achieves a macro-averaged F1 above 94%.
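
The post-processing step described above can be illustrated with a minimal sketch. The label names `B`, `I`, and `E` (beginning, intermediate, terminating) and the function `group_reference_strings` are assumptions made for illustration; the abstract does not give the tag set or the exact rules. The sketch closes a reference string when it sees an `E` token or when the next `B` token arrives, so a single mislabeled boundary can still be recovered from the other one.

```python
from typing import List


def group_reference_strings(tokens: List[str], labels: List[str]) -> List[str]:
    """Group tokens into reference strings using hypothetical B/I/E labels.

    A string ends at an 'E' token, or just before a 'B' token if the
    terminating token was missed by the tagger.
    """
    references: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        # A new beginning closes any string that is still open.
        if label == "B" and current:
            references.append(" ".join(current))
            current = []
        current.append(token)
        # A terminating token closes the current string.
        if label == "E":
            references.append(" ".join(current))
            current = []
    # Flush a trailing string that never received an 'E' label.
    if current:
        references.append(" ".join(current))
    return references


tokens = ["Kang", "2018", ".", "Kim", "2012", "."]
print(group_reference_strings(tokens, ["B", "I", "E", "B", "I", "E"]))
```

Using both boundary labels is what makes the grouping tolerant: the same two strings come out if the first `E` or the second `B` is mislabeled as `I`.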

Search for galaxy clusters in SA22

  • Kim, Jae-Woo;Im, Myungshin;Hyun, Minhee
    • The Bulletin of The Korean Astronomical Society / v.37 no.2 / pp.83.1-83.1 / 2012
  • Galaxy clusters are good laboratories for testing cosmological models as well as the evolution of galaxies in dense regions. However, the lack of wide and deep near-IR datasets has hindered the identification of galaxy clusters at z>1. Here we merge wide, deep near-IR datasets from UKIDSS DXS (J and K bands) and IMS (J band) with the CFHT Legacy Survey (CFHTLS) ugriz catalogue to detect galaxy clusters. We identify candidate galaxy clusters at z>0.8, where the near-IR dataset plays an important role in detecting galaxies efficiently. The cluster mass is also estimated based on the cluster richness and the semi-analytical cosmological simulation.

Pruning the Boosting Ensemble of Decision Trees

  • Yoon, Young-Joo;Song, Moon-Sup
    • Communications for Statistical Applications and Methods / v.13 no.2 / pp.449-466 / 2006
  • We propose to use variable selection methods based on penalized regression for pruning decision tree ensembles. Pruning methods based on LASSO and SCAD are compared with the cluster pruning method. Comparative studies are performed on some artificial datasets and real datasets. According to the results of comparative studies, the proposed methods based on penalized regression reduce the size of boosting ensembles without decreasing accuracy significantly and have better performance than the cluster pruning method. In terms of classification noise, the proposed pruning methods can mitigate the weakness of AdaBoost to some degree.

An Improved Semi-Empirical Model for Radar Backscattering from Rough Sea Surfaces at X-Band

  • Jin, Taekyeong;Oh, Yisok
    • Journal of electromagnetic engineering and science / v.18 no.2 / pp.136-140 / 2018
  • We propose an improved semi-empirical scattering model for X-band radar backscattering from rough sea surfaces. This new model has a wider validity range of wind speeds than the existing semi-empirical sea spectrum (SESS) model. First, we retrieved the small-roughness parameters from the sea surfaces, which were numerically generated using the Pierson-Moskowitz spectrum and measurement datasets for various wind speeds. Then, we computed the backscattering coefficients of the small-roughness surfaces for various wind speeds using the integral equation method model. Finally, the large-roughness characteristics were taken into account by integrating the small-roughness backscattering coefficients, multiplied by the surface slope probability density function, over all possible surface slopes. The new model includes a wind speed range below 3.46 m/s, which was not covered by the existing SESS model. The accuracy of the new model was verified with two measurement datasets for various wind speeds from 0.5 m/s to 14 m/s.

Enhanced Markov-Difference Based Power Consumption Prediction for Smart Grids

  • Le, Yiwen;He, Jinghan
    • Journal of Electrical Engineering and Technology / v.12 no.3 / pp.1053-1063 / 2017
  • Power prediction is critical to improving power efficiency in Smart Grids. Markov chains provide a useful tool for power prediction. After careful investigation of practical power datasets, we find that the stochastic properties of these datasets do not follow the Markov property. This mismatch degrades prediction accuracy when Markov prediction methods are applied directly. In this paper, we propose a spatial-transform-based data processing step to alleviate this inconsistency. Furthermore, we propose an enhanced power prediction method, named Spatial Mapping Markov-Difference (SMMD), to improve prediction accuracy. In particular, SMMD adopts a second prediction adjustment based on the differential data to reduce the stochastic error. Experimental results show that the proposed SMMD improves prediction accuracy over state-of-the-art solutions.

ModifiedFAST: A New Optimal Feature Subset Selection Algorithm

  • Nagpal, Arpita;Gaur, Deepti
    • Journal of information and communication convergence engineering / v.13 no.2 / pp.113-122 / 2015
  • Feature subset selection is a pre-processing step in learning algorithms. In this paper, we propose an efficient algorithm, ModifiedFAST, for feature subset selection. This algorithm is suitable for text datasets, and uses the concept of information gain to remove irrelevant and redundant features. A new optimal value of the threshold for symmetric uncertainty, used to identify relevant features, is found. The thresholds used by previous feature selection algorithms such as FAST, Relief, and CFS were not optimal. The threshold value is shown to greatly affect the percentage of selected features and the classification accuracy. A new unified performance metric that combines accuracy and the number of selected features is proposed and applied in the algorithm. Experiments show that the percentage of selected features obtained by the proposed algorithm was lower than that obtained using existing algorithms on most of the datasets. The effectiveness of our algorithm with the optimal threshold was statistically validated against other algorithms.
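
Symmetric uncertainty, the relevance measure mentioned above, has a standard definition for discrete variables: SU(X, Y) = 2 · IG(X; Y) / (H(X) + H(Y)), where IG is the information gain and H the entropy. The sketch below computes it and filters features against a threshold. The function names and the threshold argument are illustrative assumptions; the abstract does not state the optimal threshold found by ModifiedFAST, so none is hard-coded here.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(x, y)))
    info_gain = hx + hy - joint
    return 2.0 * info_gain / (hx + hy)


def relevant_features(
    features: Dict[str, Sequence[Hashable]],
    target: Sequence[Hashable],
    threshold: float,
) -> List[str]:
    """Keep features whose SU with the class label exceeds the threshold.

    The threshold is a caller-supplied, hypothetical value.
    """
    return [
        name
        for name, column in features.items()
        if symmetric_uncertainty(column, target) > threshold
    ]


labels = [0, 0, 1, 1]
columns = {"copy_of_label": [0, 0, 1, 1], "unrelated": [0, 1, 0, 1]}
print(relevant_features(columns, labels, threshold=0.5))
```

A feature identical to the label scores 1 and an independent one scores 0, which is why the choice of threshold directly controls how many features survive.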

Selectivity Estimation for Spatial Databases

  • Chi, Jeong-Hee;Lee, Jin-Yul;Ryu, Keun-Ho
    • Proceedings of the KSRS Conference / 2003.11a / pp.766-768 / 2003
  • Selectivity estimation for spatial queries is crucial in Spatial Database Management Systems (SDBMS). Much work has been done to estimate selectivity accurately. Although existing methods deal with problems such as false-count and multi-count, which arise from the properties of spatial datasets, they cannot achieve these effects in a small memory space. Therefore, the spatial dataset must be compressed into a small amount of memory. In this paper, we propose a new technique called MW Histogram, which is able to compress summary data and obtain reasonable results. Our method is based on two techniques: (a) the MinSkew partitioning algorithm, which deals with skewed spatial datasets efficiently, and (b) the wavelet transformation, whose compression effect is proven. We evaluate our method on real datasets. The experimental results show that the MW Histogram provides estimates with low relative error and retains similar estimates even when memory space is small.
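
To make the wavelet compression idea concrete, here is a minimal sketch that compresses a vector of histogram bucket counts. It assumes an unnormalized Haar wavelet and a simple keep-the-largest-coefficients rule; the abstract does not say which wavelet or thresholding rule MW Histogram uses, and the MinSkew partitioning step is not shown. All function names are illustrative.

```python
from typing import List, Sequence


def haar_transform(data: Sequence[float]) -> List[float]:
    """Unnormalized Haar decomposition by pairwise averaging and differencing.

    The input length must be a power of two.
    """
    coeffs = [float(v) for v in data]
    n = len(coeffs)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = averages + details
        n = half
    return coeffs


def inverse_haar(coeffs: Sequence[float]) -> List[float]:
    """Rebuild the original vector from Haar coefficients."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged: List[float] = []
        for avg, det in zip(out[:n], out[n:2 * n]):
            merged += [avg + det, avg - det]
        out[:2 * n] = merged
        n *= 2
    return out


def compress(coeffs: Sequence[float], keep: int) -> List[float]:
    """Zero all but the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


bucket_counts = [2, 2, 0, 2, 3, 5, 4, 4]  # hypothetical bucket frequencies
coeffs = haar_transform(bucket_counts)
print(inverse_haar(compress(coeffs, keep=4)))
```

Storing only the retained coefficients and their positions is what saves memory; the reconstruction then approximates the bucket counts used for selectivity estimates.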

A Study on Classifications of Useful Customer Reviews by Applying Text Mining Approach (텍스트 마이닝을 활용한 고객 리뷰의 유용성 지수 개선에 관한 연구)

  • Lee, Hong Joo
    • Journal of Information Technology Services / v.14 no.4 / pp.159-169 / 2015
  • Customer reviews are an important source of information for purchase decisions in online stores. Online stores have tried to provide useful reviews to customers on product pages. To assess the usefulness of customer reviews before other users have cast enough votes on them, previous studies have used diverse aspects of reviews, and many have used style and semantic information. This study tests diverse algorithms and datasets to identify a proper classification method and threshold for classifying useful reviews. In particular, most studies have used a ratio-type helpfulness index, as Amazon.com does. However, TripAdviser.com and Yelp.com use another type of usefulness index, the count-type helpfulness index, for which no proper threshold for classifying useful reviews has yet been established. This study used reviews of restaurants from Yelp.com and their usefulness votes to construct diverse datasets, and applied text mining approaches to classify useful reviews. Random Forest, SVM, and GLMNET achieved higher accuracy than the other approaches.

Cooperative Coevolution Differential Evolution Based on Spark for Large-Scale Optimization Problems

  • Tan, Xujie;Lee, Hyun-Ae;Shin, Seong-Yoon
    • Journal of information and communication convergence engineering / v.19 no.3 / pp.155-160 / 2021
  • Differential evolution is an efficient algorithm for solving continuous optimization problems. However, its performance deteriorates rapidly, and the runtime increases exponentially when differential evolution is applied for solving large-scale optimization problems. Hence, a novel cooperative coevolution differential evolution based on Spark (known as SparkDECC) is proposed. The divide-and-conquer strategy is used in SparkDECC. First, the large-scale problem is decomposed into several low-dimensional subproblems using the random grouping strategy. Subsequently, each subproblem can be addressed in a parallel manner by exploiting the parallel computation capability of the resilient distributed datasets model in Spark. Finally, the optimal solution of the entire problem is obtained using the cooperation mechanism. The experimental results on 13 high-benchmark functions show that the new algorithm performs well in terms of speedup and scalability. The effectiveness and applicability of the proposed algorithm are verified.
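
The decompose, optimize, and cooperate loop described above can be sketched in plain Python. This is a sequential illustration under stated assumptions, not SparkDECC itself: the subproblems are solved one after another rather than in parallel on Spark resilient distributed datasets, and the population size, generation count, DE/rand/1/bin variant, search bound, and the `sphere` test function are all hypothetical choices.

```python
import random
from typing import Callable, List, Sequence

Objective = Callable[[Sequence[float]], float]


def random_grouping(dim: int, group_size: int, rng: random.Random) -> List[List[int]]:
    """Split variable indices into random low-dimensional groups."""
    indices = list(range(dim))
    rng.shuffle(indices)
    return [indices[i:i + group_size] for i in range(0, dim, group_size)]


def sphere(x: Sequence[float]) -> float:
    """Illustrative benchmark: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def optimize_group(objective: Objective, context: List[float], group: List[int],
                   rng: random.Random, pop_size: int = 10, gens: int = 20,
                   f: float = 0.5, cr: float = 0.9, bound: float = 5.0) -> List[float]:
    """DE/rand/1/bin on the variables in `group`; the rest stay at `context`."""
    def evaluate(sub: Sequence[float]) -> float:
        trial = list(context)
        for idx, value in zip(group, sub):
            trial[idx] = value
        return objective(trial)

    size = len(group)
    pop = [[rng.uniform(-bound, bound) for _ in range(size)] for _ in range(pop_size)]
    pop[0] = [context[i] for i in group]  # seed with current values so fitness never worsens
    fit = [evaluate(p) for p in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(size)
            child = [
                pop[a][k] + f * (pop[b][k] - pop[c][k])
                if (rng.random() < cr or k == j_rand) else pop[i][k]
                for k in range(size)
            ]
            child_fit = evaluate(child)
            if child_fit <= fit[i]:  # greedy selection
                pop[i], fit[i] = child, child_fit
    return pop[min(range(pop_size), key=fit.__getitem__)]


def cooperative_coevolution(objective: Objective, dim: int, group_size: int,
                            cycles: int, seed: int = 0) -> List[float]:
    """Regroup the variables each cycle and merge subproblem results into one solution."""
    rng = random.Random(seed)
    context = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
    for _ in range(cycles):
        for group in random_grouping(dim, group_size, rng):
            best = optimize_group(objective, context, group, rng)
            for idx, value in zip(group, best):
                context[idx] = value
    return context


solution = cooperative_coevolution(sphere, dim=8, group_size=2, cycles=2, seed=1)
print(sphere(solution))
```

Because each subproblem only touches its own indices of the shared context vector, the inner loop over groups is the part a distributed runtime could run in parallel.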

Citation-based Article Summarization using a Combination of Lexical Text Similarities: Evaluation with Computational Linguistics Literature Summarization Datasets

  • Kang, In-Su
    • Journal of the Korea Society of Computer and Information / v.24 no.7 / pp.31-37 / 2019
  • Citation-based article summarization creates a shortened text for an academic article that reflects the content of citing sentences, which contain others' thoughts about the target article to be summarized. To address this problem, this study introduces an extractive summarization method based on a linear combination of several sentence salience scores, which represent the degrees to which a candidate sentence reflects the content of the author's abstract, the readers' citing text, and the target article itself. In the current study, salience scores are obtained by computing surface-level textual similarities. Experiments using CL-SciSumm datasets show that the proposed method matches or outperforms previous approaches in ROUGE evaluations against SciSumm-2017 human summaries and SciSumm-2016/2017 community summaries.
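
The scoring scheme described above, a linear combination of similarities to the abstract, the citing text, and the article, can be sketched as follows. Jaccard word overlap stands in for the "surface-level textual similarities", and the equal weights are placeholders; the abstract names neither the similarity measures nor the weights, so both are assumptions, as are the function names.

```python
from typing import List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two texts (assumed stand-in measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def salience(sentence: str, abstract: str, citing_text: str, article: str,
             weights: Tuple[float, float, float]) -> float:
    """Weighted sum of similarities to the abstract, citing text, and article."""
    sims = (
        jaccard(sentence, abstract),
        jaccard(sentence, citing_text),
        jaccard(sentence, article),
    )
    return sum(w * s for w, s in zip(weights, sims))


def summarize(sentences: Sequence[str], abstract: str, citing_text: str, k: int,
              weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> List[str]:
    """Return the k most salient sentences in their original order."""
    article = " ".join(sentences)
    ranked = sorted(
        sentences,
        key=lambda s: salience(s, abstract, citing_text, article, weights),
        reverse=True,
    )
    top = set(ranked[:k])
    return [s for s in sentences if s in top]


article_sentences = [
    "we propose a tagging method",
    "the weather was nice",
    "results show high f1",
]
print(summarize(article_sentences,
                abstract="a tagging method for reference strings",
                citing_text="their tagging method works well",
                k=1))
```

Tuning the three weights shifts the summary toward the author's view, the citing community's view, or the article body.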