Title/Summary/Keyword: datasets

An Analysis of IT Trends Using Tweet Data (트윗 데이터를 활용한 IT 트렌드 분석)

  • Yi, Jin Baek;Lee, Choong Kwon;Cha, Kyung Jin
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.143-159 / 2015
  • Predicting IT trends has long been an important subject for information systems research. IT trend prediction makes it possible to recognize emerging areas of innovation and to allocate budgets in preparation for rapidly changing technological trends. Toward the end of each year, various domestic and global organizations predict and announce IT trends for the following year. For example, Gartner predicts the top 10 IT trends for the coming year, and these predictions affect IT and industry leaders and shape organizations' basic assumptions about technology and the future of IT; the accuracy of these reports, however, is difficult to verify. Social media data can be a useful tool for such verification. As social media services have gained popularity, they are used in a variety of ways, from posting about personal daily life to keeping up to date with news and trends. In recent years, social media activity in Korea has reached unprecedented levels: hundreds of millions of users participate in online social networks and share their opinions and thoughts with colleagues and friends. In particular, Twitter is currently the major microblogging service; its tweets allow users to report their current thoughts and actions, comment on news, and engage in discussions. We chose tweet data for the analysis of IT trends because it not only produces massive amounts of unstructured textual data in real time but also serves as an influential channel for opinion leading on technology. Previous studies found that tweet data provide useful information and detect societal trends effectively; they also showed that Twitter can track issues faster than other media such as newspapers. This study therefore investigates how frequently the IT trends predicted for the following year by public organizations are mentioned on social network services like Twitter. IT trend predictions for 2013, announced near the end of 2012 by two domestic organizations, the National IT Industry Promotion Agency (NIPA) and the National Information Society Agency (NIA), were used as the basis for this research. The present study analyzes Twitter data generated in Seoul (Korea) and compares it with the predictions of the two organizations. Twitter data analysis requires various natural language processing techniques, including stop word removal and noun extraction, to process various unrefined forms of unstructured data. To meet these challenges, we used SAS Information Retrieval Studio (IRS) to process Twitter's big streaming datasets in real time; the system offers a framework for crawling, normalizing, analyzing, indexing, and searching tweet data. We crawled the Twittersphere in the Seoul area and obtained 21,589 tweets from 2013 to examine how frequently the IT trend topics announced by the two organizations were mentioned by people in Seoul. The results show that most IT trends predicted by NIPA and NIA were mentioned frequently on Twitter, except for topics such as 'new types of security threat', 'green IT', and 'next generation semiconductor'; these topics are not generalized compound words, so they may be mentioned on Twitter under other wordings.
To test whether the IT trend tweets from Korea are related to the following year's IT trends in the real world, we compared Twitter's trending topics with those in Nara Market, Korea's online e-Procurement system, a nationwide web-based system that handles the whole procurement process of all public organizations in Korea. The correlation analysis shows that tweet frequencies on the IT trend topics predicted by NIPA and NIA are significantly correlated with the frequencies of IT topics mentioned in project announcements by Nara Market in 2012 and 2013. The main contributions of our research are the following: i) the IT topic predictions announced by NIPA and NIA can provide an effective guideline for IT professionals and researchers in Korea who are looking for verified IT topic trends for the following year, and ii) researchers can use Twitter to obtain useful ideas for detecting and predicting dynamic trends in technological and social issues.
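
The abstract does not publish its code, but the two computational steps it describes, counting trend-keyword mentions in tweets and correlating them with procurement-announcement frequencies, can be sketched as below. The keyword lists and counts are hypothetical, and Pearson's r stands in for the unspecified correlation measure.

```python
# Hypothetical sketch: keyword counting plus correlation analysis.
from scipy.stats import pearsonr

trend_keywords = ["big data", "cloud computing", "mobile security"]  # assumed examples
tweets = [
    "cloud computing is changing everything",
    "so much hype around big data this year",
]

# Step 1: count how often each predicted trend topic is mentioned in tweets
mentions = {kw: sum(kw in t.lower() for t in tweets) for kw in trend_keywords}
print(mentions)

# Step 2: correlate per-topic tweet frequencies with per-topic frequencies
# from Nara Market project announcements (both vectors are hypothetical)
tweet_freq = [120, 95, 40]
nara_freq = [30, 22, 11]
r, p = pearsonr(tweet_freq, nara_freq)
print(f"r={r:.3f}, p={p:.3f}")
```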

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.105-122 / 2019
  • Dimensionality reduction is one of the methods used to handle big data in text mining. When reducing dimensionality, we should consider the density of the data, which has a significant influence on the performance of sentence classification: higher-dimensional data requires more computation and can eventually cause high computational cost and overfitting in the model. A dimension reduction process is therefore necessary to improve model performance. Diverse methods have been proposed, from merely lessening the noise in the data, such as misspellings or informal text, to incorporating semantic and syntactic information. Moreover, the representation and selection of text features affect the performance of the classifier for sentence classification, one of the fields of natural language processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods utilize various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words capturing semantic and syntactic information, are also utilized. To improve performance, recent studies have suggested methods in which the word dictionary is modified according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations: once a feature selection algorithm identifies unimportant words, we assume that words similar to them also have little impact on sentence classification. This study proposes two ways to achieve more accurate classification: selective word elimination under specific rules and selective word embedding based on Word2Vec. To select words of low importance from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. First, we eliminate words with comparatively low information gain values from the raw text and form word embeddings. Second, we additionally remove words that are similar to the words with low information gain values and build word embeddings. Finally, the filtered text and word embeddings are fed to two deep learning models: a convolutional neural network and an attention-based bidirectional LSTM. This study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes, with a ratio of helpful votes over 70%, were classified as helpful reviews; Yelp shows only the number of helpful votes, so from 750,000 Yelp reviews we extracted 100,000 reviews with more than five helpful votes by random sampling. Minimal preprocessing, such as removing numbers and special characters from the text, was applied to each dataset. To evaluate the proposed methods, we compared their performance against Word2Vec and GloVe embeddings that use all the words, and showed that one of the proposed methods outperforms the embeddings with all the words: removing unimportant words yields better performance, although removing too many words lowers it.
For future research, diverse preprocessing approaches and an in-depth analysis of word co-occurrence should be considered when measuring similarity between words. Also, we applied the proposed method only with Word2Vec; other embedding methods such as GloVe, fastText, and ELMo could be combined with the proposed methods, making it possible to examine the possible combinations of word embedding methods and elimination methods.
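
A minimal sketch of the proposed filtering idea, not the authors' code: drop words with low information gain (computed here as mutual information, its standard form for feature selection), then also drop words close to them in a Word2Vec space. The toy corpus, threshold, and similarity cutoff are illustrative.

```python
# Sketch of selective word elimination: information gain + Word2Vec similarity.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from gensim.models import Word2Vec

docs = [  # toy reviews with toy helpfulness labels
    "the battery life is great",
    "great screen and great battery",
    "terrible screen and slow shipping",
    "slow and terrible service",
]
labels = [1, 1, 0, 0]

# 1) Importance of each word w.r.t. the class label (information gain,
#    computed as mutual information over binary term presence)
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
ig = mutual_info_classif(X, labels, discrete_features=True)
words = np.array(vec.get_feature_names_out())
low_ig = set(words[ig <= np.percentile(ig, 20)])  # illustrative threshold

# 2) Expand the removal set with words similar to low-IG words in a
#    Word2Vec space (cosine similarity cutoff is illustrative)
w2v = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1, seed=0)
to_remove = set(low_ig)
for w in low_ig:
    if w in w2v.wv:
        to_remove |= {s for s, sim in w2v.wv.most_similar(w, topn=5) if sim > 0.9}

# 3) Filter the corpus before building embeddings / training a classifier
filtered = [" ".join(t for t in d.split() if t not in to_remove) for d in docs]
print(filtered)
```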

Comparison of the Mineral Contents of Sun-dried Salt Depending on Wet Digestion and Dissolution (습식분해 및 직접용해법에 따른 천일염 중 무기성분 함량 비교)

  • Jin, Yong-Xie;Je, Jeong-Hwan;Lee, Yeon-Hee;Kim, Jin-Hyo;Cho, Young-Suk;Kim, So-Young
    • Food Science and Preservation / v.18 no.6 / pp.993-997 / 2011
  • The aims of this research were to determine the proximate composition of various salts and to compare two preparation methods (direct dissolution without heating, and microwave digestion) for the determination of the main mineral contents of the salts. Twelve salt samples were divided into three groups of four samples each (imported, Korean gray, and Korean white salts). The NaCl contents of the Korean white, Korean gray, and imported salts were 85.1, 89.3, and 91.3%, respectively. The salts in the three groups were analyzed for their main mineral contents via AAS. The sodium (Na) content of the Korean white salt was slightly lower than that of the imported salt, while the magnesium (Mg) and potassium (K) contents of the Korean white salt were higher than those of the imported salt. The mineral compositions (% Na:Mg) obtained using the microwave-assisted digestion procedure and the direct dissolution procedure were 89:1 for both the imported and Korean gray salts, and 82:3 versus 81:3 for the Korean white salt, respectively. The mineral contents and compositions of the sun-dried salts obtained through wet digestion and through the dissolution procedure were compared, and no significant difference was found between the two datasets. Consequently, this paper suggests a direct dissolution procedure for the analysis of the mineral composition of salt.
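
The "no significant difference" conclusion rests on a paired comparison of the two preparation methods; a hedged sketch with made-up numbers (the paper's measurements are not reproduced here) might look like this:

```python
# Hedged sketch with made-up numbers: paired comparison of two methods.
import numpy as np
from scipy.stats import ttest_rel

# Mg content (arbitrary units) of the same 12 salt samples, prepared by
# wet digestion and by direct dissolution (values are illustrative only)
wet_digestion = np.array([1.1, 0.9, 1.3, 1.0, 2.9, 3.1, 2.8, 3.0, 0.5, 0.6, 0.4, 0.5])
dissolution   = np.array([1.0, 0.9, 1.2, 1.1, 3.0, 3.0, 2.9, 2.9, 0.5, 0.5, 0.5, 0.4])

t, p = ttest_rel(wet_digestion, dissolution)
print(f"t={t:.3f}, p={p:.3f}")  # a large p suggests no significant difference
```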

Comparative Research of Image Classification and Image Segmentation Methods for Mapping Rural Roads Using a High-resolution Satellite Image (고해상도 위성영상을 이용한 농촌 도로 매핑을 위한 영상 분류 및 영상 분할 방법 비교에 관한 연구)

  • CHOUNG, Yun-Jae;GU, Bon-Yup
    • Journal of the Korean Association of Geographic Information Studies / v.24 no.3 / pp.73-82 / 2021
  • Rural roads are significant infrastructure for developing and managing rural areas; hence, utilizing remote sensing datasets for managing rural roads is necessary for expanding rural transportation infrastructure and improving the quality of life of rural residents. In this research, two different methods, image classification and image segmentation, were compared for mapping rural roads from a given high-resolution satellite image acquired over rural areas. In the image classification method, deep learning with multiple neural networks was applied to the satellite image to generate an object classification map, and the rural roads were then mapped by extracting the road objects from the generated map. In the image segmentation method, multiresolution segmentation was applied to the same satellite image to generate a segment image, and the rural roads were then mapped by merging the segments located on the rural roads in the satellite image. We used 100 checkpoints to assess the accuracy of the two rural road maps produced by the different methods and drew the following conclusions. The image segmentation method performed better than the image classification method for mapping the rural roads from the given satellite image: some of the rural roads mapped by the image classification method were not identified due to misclassification errors in the object classification map, while all of the rural roads mapped by the image segmentation method were identified. However, some of the rural roads mapped by the image segmentation method also contained misclassification errors, because some rural road segments included non-road objects. In future research, object-oriented classification or convolutional neural networks, which are widely used for detecting precise objects in image sources, could be employed to improve the accuracy of rural road mapping from high-resolution satellite images.
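
The 100-checkpoint accuracy assessment reduces to checking whether each checkpoint falls on a mapped road. A small illustrative sketch, with an assumed binary raster layout rather than the paper's data:

```python
# Illustrative checkpoint-based accuracy check on an assumed road raster.
import numpy as np

road_map = np.zeros((100, 100), dtype=bool)  # True where a method mapped road
road_map[40:60, :] = True                    # toy road strip

rng = np.random.default_rng(0)
checkpoints = rng.integers(0, 100, size=(100, 2))  # 100 (row, col) checkpoints

hits = road_map[checkpoints[:, 0], checkpoints[:, 1]]
print(f"identified: {hits.sum()} / {len(hits)} ({hits.mean():.1%})")
```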

Imputation Accuracy from 770K SNP Chips to Next Generation Sequencing Data in a Hanwoo (Korean Native Cattle) Population using Minimac3 and Beagle (Minimac3와 Beagle 프로그램을 이용한 한우 770K chip 데이터에서 차세대 염기서열분석 데이터로의 결측치 대치의 정확도 분석)

  • An, Na-Rae;Son, Ju-Hwan;Park, Jong-Eun;Chai, Han-Ha;Jang, Gul-Won;Lim, Dajeong
    • Journal of Life Science / v.28 no.11 / pp.1255-1261 / 2018
  • Whole genome analyses have been made possible by the development of DNA sequencing technologies and the discovery of many single nucleotide polymorphisms (SNPs). Large numbers of SNPs can be analyzed with SNP chips, since SNPs of human as well as livestock genomes are available. Among the various genotype imputation programs, the Minimac3 software is reported to be highly accurate, with a simplified workflow and relatively fast execution. In the present study, we used the Minimac3 program to impute the 770K SNP chip genotypes of 1,226 animals up to sequence level, using next-generation sequencing data from 311 animals as the reference. The accuracy on each chromosome was about 94~96%, and the accuracy for individual samples was about 92~98%. After imputation of the genotypes, the percentages of SNPs with R square ($R^2$) values above 0.4, 0.6, and 0.8 were 91%, 84%, and 70%, respectively. Across seven minor allele frequency intervals, (0, 0.025), (0.025, 0.05), (0.05, 0.1), (0.1, 0.2), (0.2, 0.3), (0.3, 0.4), and (0.4, 0.5), the corresponding $R^2$ values ranged from 64% to 88%. The total analysis time was about 12 hr. In future SNP chip studies, as the size and complexity of genomic datasets increase, we expect that genomic imputation using Minimac3 can improve the reliability of chip data for Hanwoo discrimination.
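
One common way to compute the reported per-SNP accuracy is the squared correlation between imputed dosages and true genotypes; the sketch below illustrates this metric on synthetic arrays (names, shapes, and noise level are assumptions, not the paper's Minimac3 pipeline).

```python
# Synthetic illustration of per-SNP imputation accuracy as squared correlation.
import numpy as np

rng = np.random.default_rng(1)
true_geno = rng.integers(0, 3, size=(311, 1000)).astype(float)  # samples x SNPs
imputed = true_geno + rng.normal(0, 0.3, true_geno.shape)       # noisy dosages

r2 = np.empty(true_geno.shape[1])
for j in range(true_geno.shape[1]):
    r = np.corrcoef(true_geno[:, j], imputed[:, j])[0, 1]
    r2[j] = r * r

for cut in (0.4, 0.6, 0.8):
    print(f"SNPs with R^2 > {cut}: {(r2 > cut).mean():.1%}")
```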

The Effect of Data Size on the k-NN Predictability: Application to Samsung Electronics Stock Market Prediction (데이터 크기에 따른 k-NN의 예측력 연구: 삼성전자주가를 사례로)

  • Chun, Se-Hak
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.239-251 / 2019
  • Statistical methods such as moving averages, Kalman filtering, exponential smoothing, regression analysis, and ARIMA (autoregressive integrated moving average) have been used for stock market prediction. However, these statistical methods have not produced superior performance. In recent years, machine learning techniques have been widely used in stock market prediction, including artificial neural networks, SVMs, and genetic algorithms. In particular, a case-based reasoning method known as k-nearest neighbor (k-NN) is also widely used for stock price prediction. Case-based reasoning retrieves several similar cases from previous cases when a new problem occurs and combines the class labels of the similar cases to create a classification for the new problem. However, case-based reasoning has some problems. First, it tends to search for a fixed number of neighbors in the observation space and always selects the same number of neighbors rather than the best similar neighbors for the target case, so it may take more cases into account even when fewer cases are applicable to the subject. Second, it may select neighbors that are far away from the target case. Thus, case-based reasoning does not guarantee an optimal pseudo-neighborhood for various target cases, and predictability can be degraded by deviation from the desired similar neighbors. This paper examines how the size of the learning data affects stock price predictability with k-nearest neighbor and compares the predictability of k-nearest neighbor with that of the random walk model according to the size of the learning data and the number of neighbors. In this study, Samsung Electronics stock prices were predicted by dividing the learning dataset into two types. For the prediction of the next day's closing price, we used four variables: the opening value, daily high, daily low, and daily close. In the first experiment, data from January 1, 2000 to December 31, 2017 were used for the learning process; in the second experiment, data from January 1, 2015 to December 31, 2017 were used. The test data cover January 1, 2018 to August 31, 2018 for both experiments. We compared the performance of k-NN with that of the random walk model on the two learning datasets. With the smaller learning dataset, the mean absolute percentage error (MAPE) was 1.3497 for the random walk model and 1.3570 for k-NN; with the larger learning dataset, the MAPE was 1.3497 for the random walk model and 1.2928 for k-NN. These results show that prediction power is higher when more learning data are used. This paper thus shows that k-NN generally produces better predictive power than the random walk model for larger learning datasets but not when the learning dataset is relatively small. Future studies should consider macroeconomic variables related to stock price forecasting in addition to the opening, low, high, and closing prices. Also, to produce better results, it is recommended that the k-nearest neighbor method find its nearest neighbors using a second-step filtering method that considers fundamental economic variables as well as a sufficient amount of learning data.
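
A minimal sketch of the experimental setup on synthetic prices (not the Samsung Electronics dataset): k-NN regression on [open, high, low, close] features against a random-walk baseline, scored by MAPE.

```python
# Synthetic sketch: k-NN next-day close prediction vs. a random-walk baseline.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
close = 100 + np.cumsum(rng.normal(0, 1, 1000))  # synthetic price path
open_ = close + rng.normal(0, 0.5, 1000)
high = np.maximum(open_, close) + rng.uniform(0, 1, 1000)
low = np.minimum(open_, close) - rng.uniform(0, 1, 1000)

X = np.column_stack([open_, high, low, close])[:-1]  # today's OHLC
y = close[1:]                                        # next day's close
split = 800
knn = KNeighborsRegressor(n_neighbors=5).fit(X[:split], y[:split])

def mape(actual, pred):
    return np.mean(np.abs((actual - pred) / actual)) * 100

print("k-NN MAPE:       ", mape(y[split:], knn.predict(X[split:])))
print("random-walk MAPE:", mape(y[split:], close[split:-1]))  # predicts today's close
```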

BVOCs Estimates Using MEGAN in South Korea: A Case Study of June in 2012 (MEGAN을 이용한 국내 BVOCs 배출량 산정: 2012년 6월 사례 연구)

  • Kim, Kyeongsu;Lee, Seung-Jae
    • Korean Journal of Agricultural and Forest Meteorology / v.24 no.1 / pp.48-61 / 2022
  • South Korea is quite a vegetation-rich country, with 63% forest and 16% cropland area. Massive NOx emissions from megacities therefore readily combine with BVOCs emitted from forest and cropland areas, producing high ozone concentrations. BVOC emissions have been estimated using well-known emission models such as BEIS (Biogenic Emission Inventory System) and MEGAN (Model of Emissions of Gases and Aerosols from Nature), which were developed using non-Korean emission factors. In this study, we ran the MEGAN v2.1 model to estimate BVOC emissions in Korea. MODIS land cover and LAI (leaf area index) products over Korea were used to run the MEGAN model for June 2012. Isoprene and monoterpene emissions from the model were compared against enclosure chamber measurements from the Taehwa research forest in Korea during June 11 and 12, 2012, with emission rates estimated from the enclosure chamber measurement data. The initial results show that isoprene emissions from the MEGAN model were up to 6.4 times higher than those from the enclosure chamber measurements, while monoterpene emissions from the enclosure chamber measurements were up to 5.6 times higher than the MEGAN emissions. The differences between the two datasets, however, were much smaller during the times of high emissions. More inter-comparison results and the possibilities of improving MEGAN modeling performance using local measurement data over Korea will be presented and discussed.
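
MEGAN-family models scale plant-specific emission factors by light and temperature activity factors. As a rough illustration only, the sketch below implements the classic Guenther et al. (1993) isoprene responses with that paper's constants; it is a simplification, not MEGAN v2.1's full canopy algorithm.

```python
# Guenther et al. (1993) isoprene light/temperature activity factors, as a
# rough stand-in for the responses inside MEGAN-family models.
import numpy as np

R = 8.314                       # J mol-1 K-1
ALPHA, C_L1 = 0.0027, 1.066     # light-response constants
C_T1, C_T2 = 95000.0, 230000.0  # J mol-1
T_M, T_S = 314.0, 303.0         # K (optimum and standard temperatures)

def gamma_light(par):  # par: photosynthetically active radiation, umol m-2 s-1
    return ALPHA * C_L1 * par / np.sqrt(1.0 + ALPHA**2 * par**2)

def gamma_temp(t):     # t: leaf temperature, K
    num = np.exp(C_T1 * (t - T_S) / (R * T_S * t))
    den = 1.0 + np.exp(C_T2 * (t - T_M) / (R * T_S * t))
    return num / den

# Emission = standard-condition emission factor x activity factors
emission_factor = 10.0  # ug m-2 h-1, illustrative value
print(emission_factor * gamma_light(1000.0) * gamma_temp(303.0))
```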

Physical Offset of UAVs Calibration Method for Multi-sensor Fusion (다중 센서 융합을 위한 무인항공기 물리 오프셋 검보정 방법)

  • Kim, Cheolwook;Lim, Pyeong-chae;Chi, Junhwa;Kim, Taejung;Rhee, Sooahm
    • Korean Journal of Remote Sensing / v.38 no.6_1 / pp.1125-1139 / 2022
  • In unmanned aerial vehicle (UAV) systems, a physical offset can exist between the global positioning system/inertial measurement unit (GPS/IMU) sensor and an observation sensor such as a hyperspectral sensor or a lidar sensor. As a result of this physical offset, a misalignment between images can occur along the flight direction. In particular, in a multi-sensor system the observation sensor must be replaced regularly to mount another observation sensor, and a high cost must then be paid to acquire new calibration parameters. In this study, we establish a precise sensor model equation applicable to multiple sensors in common and propose an independent physical offset estimation method. The proposed method consists of three steps. First, we define an appropriate rotation matrix for our system and an initial sensor model equation for direct georeferencing. Next, an observation equation for physical offset estimation is established by extracting correspondences between ground control points and the data observed by a sensor. Finally, the physical offset is estimated from the observations, and the precise sensor model equation is established by applying the estimated parameters to the initial sensor model equation. Datasets from four regions (Jeonju, Incheon, Alaska, Norway) with different latitudes and longitudes were compared to analyze the effects of the calibration parameters. We confirmed that the misalignment between images was corrected after applying the physical offset in the sensor model equation. Absolute position accuracy was analyzed in the Incheon dataset against ground control points: the root mean square error (RMSE) in the X and Y directions was 0.12 m for the hyperspectral image and 0.03 m for the point cloud. Furthermore, the relative position accuracy at a specific point between the adjusted point cloud and the hyperspectral images was 0.07 m, confirming that precise data mapping is possible for observations without ground control points through the proposed estimation method, and demonstrating the feasibility of multi-sensor fusion. From this study, we expect that a flexible multi-sensor platform system can be operated economically through the independent parameter estimation method.
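
The direct-georeferencing relation at the heart of the method, ground point = platform position + R(attitude)(offset + sensor-frame vector), makes the offset solvable by linear least squares once correspondences are known. The sketch below uses a toy yaw-only attitude model and simulated correspondences, not the paper's sensor model or data.

```python
# Toy lever-arm estimation: gcp = pos + R @ (offset + ray)  =>
# R @ offset = gcp - pos - R @ ray, stacked over correspondences and
# solved by linear least squares.
import numpy as np

def rot_z(yaw):  # minimal attitude model for illustration (yaw only)
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(3)
true_offset = np.array([0.15, -0.05, 0.30])  # unknown physical offset (m)

A_rows, b_rows = [], []
for _ in range(20):  # 20 simulated GCP correspondences
    pos, yaw = rng.normal(0, 50, 3), rng.uniform(0, 2 * np.pi)
    ray = rng.normal(0, 5, 3)             # sensor-frame vector to the GCP
    Rm = rot_z(yaw)
    gcp = pos + Rm @ (true_offset + ray)  # simulated ground observation
    A_rows.append(Rm)
    b_rows.append(gcp - pos - Rm @ ray)

A, b = np.vstack(A_rows), np.concatenate(b_rows)
offset_est, *_ = np.linalg.lstsq(A, b, rcond=None)
print("estimated offset:", offset_est)    # recovers ~[0.15, -0.05, 0.30]
```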

Reliability and validity of Korean version of the OHIP for edentulous subjects: A pilot study (무치악 환자들을 위한 한국어 버전의 구강건강영향지수 신뢰도와 타당성 평가를 위한 모의연구)

  • Shin, Jae Seob;Bae, So Young;Park, Jin Hong;Shim, Ji Suk;Lee, Jeong Yol
    • The Journal of Korean Academy of Prosthodontics / v.59 no.3 / pp.305-313 / 2021
  • Purpose. The purpose of this pilot study was to evaluate the reliability and validity of the Korean version of the oral health impact profile for edentulous patients (OHIP-EDENT K). Materials and methods. The study was conducted on 12 patients for whom overdentures were fabricated in the Department of Prosthodontics, Korea University Guro Hospital. All subjects completed the Korean version of the Oral Health Impact Profile (OHIP K) questionnaire. Shortened versions of the OHIP, OHIP-14 K and OHIP-EDENT K, were derived from the datasets. Cronbach's alpha was used to measure the internal consistency of the summary scores for OHIP-EDENT K. Spearman's correlation coefficient between the summary scores for OHIP-EDENT K and OHIP K was calculated to evaluate concurrent validity. Results. The reliability of the summary scores for OHIP-EDENT K was acceptable (α=.736). Spearman's correlation coefficient between the summary scores for OHIP-EDENT K and OHIP K was 0.966, which was statistically significant (P<.001). OHIP-EDENT K exhibited less susceptibility to floor effects than OHIP-14 K and appeared to measure change as effectively as OHIP K. Further studies with more samples are needed to establish the reliability, responsiveness, and validity of OHIP-EDENT K. Conclusion. The OHIP-EDENT K, a 19-item questionnaire on oral health-related quality of life, has measurement properties comparable to those of the full 49-item version. This shortened version can be an alternative to the full version of OHIP K and to OHIP-14 K for edentulous patients.
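
The two statistics reported, Cronbach's alpha and Spearman's correlation between summary scores, are straightforward to compute; the sketch below uses synthetic responses (12 subjects, 19- and 49-item forms) rather than the study data.

```python
# Synthetic sketch of the reported statistics: Cronbach's alpha and
# Spearman's correlation between short- and full-form summary scores.
import numpy as np
from scipy.stats import spearmanr

def cronbach_alpha(items):  # items: subjects x questionnaire items
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(7)
base = rng.integers(0, 5, size=(12, 1))                # 12 subjects
short_form = base + rng.integers(0, 2, size=(12, 19))  # 19 items, OHIP-EDENT-like
full_form = base + rng.integers(0, 2, size=(12, 49))   # 49 items, full-OHIP-like

print("Cronbach's alpha:", cronbach_alpha(short_form))
rho, p = spearmanr(short_form.sum(axis=1), full_form.sum(axis=1))
print("Spearman rho:", rho, "p:", p)
```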

Future Prospects of Forest Type Change Determined from National Forest Inventory Time-series Data (시계열 국가산림자원조사 자료를 이용한 전국 산림의 임상 변화 특성 분석과 미래 전망)

  • Kim, Eun-Sook;Jung, Byung-Heon;Bae, Jae-Soo;Lim, Jong-Hwan
    • Journal of Korean Society of Forest Science / v.111 no.4 / pp.461-472 / 2022
  • Natural and anthropogenic factors cause forest types to change continuously. Since the ratio of forest area by forest type is important information for identifying the characteristics of national forest resources, an accurate understanding of prospective forest type change is required. The aim of this study was to use National Forest Inventory (NFI) time-series data to understand the characteristics of forest type change and to estimate future nationwide changes. We used forest type change information from the fifth and seventh NFI datasets, together with climate, topography, forest stand, and disturbance variables related to forest type change, to analyze the trends and characteristics of forest type change. The results showed that the forests in Korea are changing in the direction of decreasing coniferous forests and increasing mixed and broadleaf forests. The sites changing from coniferous to mixed forests or from mixed to broadleaf forests were mainly located in wet topographic environments and climatic conditions. Forest type changes occurred more frequently in sites with high disturbance potential (high temperature, young or sparse forest stands, and non-forest areas). We built a forest type change model based on a support vector machine (SVM) and applied a climate change scenario (RCP 8.5) to predict future changes. Over the 40-year period from 2015 to 2055, the SVM predicted that coniferous forests will decrease from 38.1% to 28.5%, broadleaf forests will increase from 34.2% to 38.8%, and mixed forests will increase from 27.7% to 32.7%. These results can be used as basic data for establishing future forest management strategies.
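
A hedged sketch of the modeling step, an SVM classifier mapping site variables to forest type classes, on synthetic features rather than the NFI data:

```python
# Synthetic sketch of the modeling step: SVM from site variables to forest type.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
# columns stand in for temperature, precipitation, elevation, stand age, density
X = rng.normal(size=(500, 5))
y = rng.integers(0, 3, 500)  # 0=coniferous, 1=mixed, 2=broadleaf (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```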