• Title/Abstract/Keyword: high dimensional large-scale data

Search results: 45

An Efficient Multidimensional Scaling Method based on CUDA and Divide-and-Conquer

  • 박성인;황규백
    • Journal of KIISE: Computing Practices and Letters / Vol. 16, No. 4 / pp.427-431 / 2010
  • Multidimensional scaling (MDS) is a method that represents the similarity among data points by mapping high-dimensional data onto a low-dimensional space; it is mainly used for feature selection and data visualization. Among MDS variants, classical multidimensional scaling is hard to apply when the number of objects is large because of its long running time and large memory footprint: it must solve the eigenpair problem for an $n{\times}n$ dissimilarity matrix based on Euclidean distance, where n is the number of objects. As n grows, the running time lengthens, and the increase in memory usage limits the size of data that can be handled. To alleviate these problems, this paper proposes an efficient multidimensional scaling method based on CUDA, a GPGPU technology, and a divide-and-conquer strategy, and shows through various experiments that the proposed method can be highly efficient when the number of objects is large.
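
The bottleneck the abstract identifies is the eigendecomposition of the double-centered squared-distance matrix. Below is a minimal NumPy sketch of the baseline classical MDS (CPU, no division); the function name and the 2-D target dimension are illustrative, not the paper's implementation:

```python
import numpy as np

def classical_mds(X, out_dim=2):
    """Classical MDS: embed the rows of X into out_dim dimensions.

    Solves the eigenpair problem on the double-centered squared Euclidean
    distance matrix, which is the O(n^2)-memory, O(n^3)-time bottleneck
    targeted by the paper's CUDA + divide-and-conquer method.
    """
    n = X.shape[0]
    # Squared Euclidean dissimilarity matrix (n x n).
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # Double centering: B = -1/2 * J @ D2 @ J, with J = I - (1/n) 11^T.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # The top eigenpairs of the symmetric matrix B give the embedding.
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:out_dim]
    lams = np.maximum(eigvals[top], 0.0)
    return eigvecs[:, top] * np.sqrt(lams)

# Example: embed 500 random 50-dimensional points into the plane.
emb = classical_mds(np.random.rand(500, 50))
print(emb.shape)  # (500, 2)
```

Since both D2 and B are $n{\times}n$, memory and runtime grow quickly with the number of objects, which is why the paper offloads the work to the GPU and splits the problem by divide-and-conquer.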

F_MixBERT: Sentiment Analysis Model using Focal Loss for Imbalanced E-commerce Reviews

  • Fengqian Pang;Xi Chen;Letong Li;Xin Xu;Zhiqiang Xing
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 18, No. 2 / pp.263-283 / 2024
  • Users' comments after online shopping are critical to product reputation and business improvement. These comments, often called e-commerce reviews, influence other customers' purchasing decisions. To cope with the large volume of e-commerce reviews, automatic analysis based on machine learning and deep learning has drawn increasing attention. A core task therein is sentiment analysis. However, e-commerce reviews exhibit the following characteristics: (1) inconsistency between comment content and the star rating; (2) a large amount of unlabeled data, i.e., comments without a star rating; and (3) data imbalance caused by the scarcity of negative comments. This paper employs Bidirectional Encoder Representations from Transformers (BERT), one of the best-performing natural language processing models, as the base model. Targeting the above data characteristics, we propose the F_MixBERT framework to make more effective use of the inconsistent, low-quality, and unlabeled data and to resolve the problem of data imbalance. In the framework, the proposed MixBERT incorporates the MixMatch approach into BERT's high-dimensional vectors to train on the unlabeled and low-quality data with generated pseudo-labels. Meanwhile, data imbalance is addressed by focal loss, which down-weights the contribution of majority-class and easily identified samples to the total loss. Comparative experiments demonstrate that the proposed framework outperforms BERT and MixBERT for sentiment analysis of e-commerce comments.
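
Focal loss, which the framework uses against class imbalance, down-weights well-classified samples through a modulating factor $(1-p_t)^{\gamma}$. A minimal PyTorch sketch of the binary case follows; the $\gamma$ and $\alpha$ values are common illustrative defaults, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor shrinks the loss of easy, confidently
    classified samples, so scarce negative reviews are not drowned out.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: 4 reviews, label 1 = positive sentiment.
logits = torch.tensor([2.3, -1.1, 0.4, -3.0])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(focal_loss(logits, labels))
```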

A Novel Fundus Image Reading Tool for Efficient Generation of a Multi-dimensional Categorical Image Database for Machine Learning Algorithm Training

  • Park, Sang Jun;Shin, Joo Young;Kim, Sangkeun;Son, Jaemin;Jung, Kyu-Hwan;Park, Kyu Hyung
    • Journal of Korean Medical Science / Vol. 33, No. 43 / pp.239.1-239.12 / 2018
  • Background: We described a novel multi-step retinal fundus image reading system for providing high-quality, large-scale data for machine learning algorithms, and assessed grader variability in the large-scale dataset generated with this system. Methods: A 5-step retinal fundus image reading tool was developed that rates image quality, presence of abnormality, findings with location information, diagnoses, and clinical significance. Each image was evaluated by 3 different graders, and agreement among graders was evaluated for each decision. Results: In total, 234,242 readings of 79,458 images were collected from 55 licensed ophthalmologists over 6 months. 34,364 images were graded as abnormal by at least one rater; all three raters agreed on abnormality for 46.6% of these, while 69.9% were rated as abnormal by two or more raters. The agreement rate of at least two raters on a given finding was 26.7%-65.2%, and the complete agreement rate of all three raters was 5.7%-43.3%. For diagnoses, the agreement of at least two raters was 35.6%-65.6%, and the complete agreement rate was 11.0%-40.0%. Agreement on findings and diagnoses was higher when restricted to images with prior complete agreement on abnormality. Retinal and glaucoma specialists showed higher agreement on findings and diagnoses within their corresponding subspecialties. Conclusion: This novel reading tool for retinal fundus images generated a large-scale dataset with a high level of information, which can be utilized in the future development of machine learning-based algorithms for automated identification of abnormal conditions and clinical decision support systems. These results emphasize the importance of addressing grader variability in algorithm development.
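
As an aside, the agreement statistics quoted above reduce to counting majority and unanimous votes over per-image rating triples. A hypothetical Python sketch (labels and data are invented for illustration):

```python
from collections import Counter

def agreement_rates(ratings):
    """ratings: list of (r1, r2, r3) labels from three graders per image.

    Returns the fraction of images on which at least two graders agree
    and on which all three agree, as in the rates reported above.
    """
    at_least_two = all_three = 0
    for triple in ratings:
        top_count = max(Counter(triple).values())
        at_least_two += top_count >= 2
        all_three += top_count == 3
    n = len(ratings)
    return at_least_two / n, all_three / n

# Example: 'A' = abnormal, 'N' = normal.
triples = [("A", "A", "A"), ("A", "A", "N"), ("A", "N", "N"), ("N", "N", "N")]
print(agreement_rates(triples))  # (1.0, 0.5)
```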

CNN based data anomaly detection using multi-channel imagery for structural health monitoring

  • Shajihan, Shaik Althaf V.;Wang, Shuo;Zhai, Guanghao;Spencer, Billie F. Jr.
    • Smart Structures and Systems / Vol. 29, No. 1 / pp.181-193 / 2022
  • Data-driven structural health monitoring (SHM) of civil infrastructure can be used to continuously assess the state of a structure, allowing preemptive safety measures to be carried out. Long-term monitoring of large-scale civil infrastructure often involves data collection using a network of numerous sensors of various types. Malfunctioning sensors in the network are common, which can disrupt condition assessment and even lead to false-negative indications of damage. The overwhelming size of the data collected renders manual approaches to ensuring data quality intractable, and the task of detecting and classifying anomalies in the raw data is non-trivial. We propose an approach to automate this task, improving upon the previously developed technique of image-based pre-processing of one-dimensional (1D) data by enriching the features of the neural network input with multiple channels. In particular, feature engineering is employed to convert each measured time history into a 3-channel image composed of (i) the time history, (ii) the spectrogram, and (iii) the probability density function representation of the signal. To demonstrate this approach, a CNN model is designed and trained on a dataset of acceleration records from sensors installed on a long-span bridge, with the goal of fault detection and classification. The effect of imbalance in the observed anomaly patterns is studied to better account for unseen test cases. The proposed framework achieves high overall accuracy and recall even when tested on an unseen dataset much larger than the training samples, offering a viable solution for implementation on full-scale structures where labeled training data are limited.
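
As a rough illustration of the feature-engineering step, the sketch below converts a 1D record into the three channels named in the abstract (time history, spectrogram, probability density). The image size, rendering choices, and signal parameters are assumptions for illustration, not the paper's design:

```python
import numpy as np
from scipy import signal

def to_three_channel_image(x, fs, size=64):
    """Convert a 1D acceleration record into a (3, size, size) 'image':
    (i) time-history trace, (ii) spectrogram, (iii) histogram-based PDF."""
    def norm(a):
        rng = a.max() - a.min()
        return (a - a.min()) / rng if rng > 0 else np.zeros_like(a)

    # Channel 1: the time history rendered as a binary trace image.
    ch1 = np.zeros((size, size))
    cols = np.linspace(0, len(x) - 1, size).astype(int)
    rows = (norm(x[cols]) * (size - 1)).astype(int)
    ch1[size - 1 - rows, np.arange(size)] = 1.0

    # Channel 2: log-magnitude spectrogram resampled to size x size.
    _, _, Sxx = signal.spectrogram(x, fs=fs, nperseg=min(len(x), 2 * size))
    log_s = np.log1p(Sxx)
    ri = np.linspace(0, log_s.shape[0] - 1, size).astype(int)
    ci = np.linspace(0, log_s.shape[1] - 1, size).astype(int)
    ch2 = norm(log_s[np.ix_(ri, ci)])

    # Channel 3: amplitude histogram (PDF estimate) tiled across columns.
    hist, _ = np.histogram(x, bins=size, density=True)
    ch3 = np.tile(norm(hist)[:, None], (1, size))

    return np.stack([ch1, ch2, ch3], axis=0)

# Example: a 10 s record sampled at 100 Hz, ready for a CNN input.
t = np.linspace(0, 10, 1000)
x = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(1000)
print(to_three_channel_image(x, fs=100).shape)  # (3, 64, 64)
```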

Proteomic Screening of Antigenic Proteins from the Hard Tick, Haemaphysalis longicornis (Acari: Ixodidae)

  • Kim, Young-Ha;Islam, Mohammad Saiful;You, Myung-Jo
    • Parasites, Hosts and Diseases / Vol. 53, No. 1 / pp.85-93 / 2015
  • Proteomic tools allow large-scale, high-throughput analyses for the detection, identification, and functional investigation of the proteome. To detect antigens from Haemaphysalis longicornis, a one-dimensional electrophoresis (1-DE) quantitative immunoblotting technique combined with two-dimensional electrophoresis (2-DE) immunoblotting was applied to whole-body proteins from unfed and partially fed female ticks. Reactive bands were detected, and 2-DE immunoblotting was performed following 2-DE to identify antigenic protein spots. The proteome of the partially fed female contained a larger number of lower-molecular-weight proteins than that of the unfed female tick. The total number of detected spots was 818 for unfed and 670 for partially fed female ticks. The 2-DE immunoblotting identified 10 antigenic spots from unfed females and 8 antigenic spots from partially fed females. Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF) of the relevant spots identified calreticulin, a putative secreted WC salivary protein, and a conserved hypothetical protein, using the National Center for Biotechnology Information and Swiss-Prot protein sequence databases. These findings indicate that most whole-body components of these ticks are non-immunogenic. The data reported here will provide guidance in the identification of antigenic proteins to prevent infestation and diseases transmitted by H. longicornis.

Estimating the Application Possibility of High-resolution Satellite Image for Update and Revision of Digital Map

  • 강준묵;이철희;이형석
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / Vol. 20, No. 3 / pp.313-321 / 2002
  • As high-resolution satellite imagery has become commercially available, there is growing interest in producing and updating digital topographic and thematic maps from satellite imagery. This study examined the feasibility of revising and updating existing 1/5,000- and 1/25,000-scale digital maps using IKONOS satellite imagery. Geometric correction of a single IKONOS image was performed using control points taken from the existing digital maps, and an orthoimage was generated from a digital elevation model extracted from the 3D contour data and elevation records of those maps. The orthorectified satellite image was overlaid on the existing digital map, and changed features were revised by screen digitizing; for positional accuracy analysis, features in unchanged areas were plotted directly on the satellite image and compared. The planimetric position error was estimated at a root mean square error of ±3.35 m, which is sufficient for updating digital maps at scales of 1/10,000 or smaller; if stereo imagery and ground control point surveying were used together, updating large-scale digital maps at 1/5,000 or larger should also be possible.
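
The quoted ±3.35 m figure is a root-mean-square error over planimetric coordinate differences between features digitized from the orthoimage and their reference positions. A minimal sketch with invented coordinates:

```python
import numpy as np

def planimetric_rmse(measured, reference):
    """RMSE of horizontal (planimetric) position errors, in metres.

    measured, reference: (n, 2) arrays of (x, y) map coordinates for the
    same check features, e.g. digitized from the orthoimage vs. the map.
    """
    d = np.asarray(measured) - np.asarray(reference)
    return np.sqrt(np.mean(np.sum(d ** 2, axis=1)))

# Example: three check features offset by a few metres.
meas = np.array([[1002.1, 2001.5], [1500.8, 2504.0], [2003.2, 2998.9]])
ref = np.array([[1000.0, 2000.0], [1498.0, 2506.5], [2000.0, 3001.0]])
print(round(planimetric_rmse(meas, ref), 2))
```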

3-D Resistivity Imaging of a Large Scale Tumulus

  • 오현덕;이명종;김정호;신종우
    • Geophysics and Geophysical Exploration / Vol. 14, No. 4 / pp.316-323 / 2011
  • To test the applicability of the electrical resistivity method to the excavation survey of a large tumulus, a three-dimensional resistivity survey was carried out on Tumulus No. 3 at Bokam-ri, Naju. Accurate information on the topography of the tumulus and on the electrode positions is essential for obtaining high-resolution subsurface images; accordingly, the electrodes were installed by adapting the string-based grid-layout method used in archaeological excavation. Survey data were acquired with an electrode spacing of 2 m and a line spacing of 1 m, and each survey line was staggered by 1 m relative to its neighbors so that the tumulus as a whole was covered by a 1 m ${\times}$ 1 m grid. Three-dimensional resistivity imaging was performed on the data and compared with the results of earlier excavation surveys, confirming that the resistivity images agree very well with the distribution of burial features identified there. This study demonstrates that 3D resistivity imaging is very useful for investigating burial features in large tumuli.
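
The acquisition geometry described above (2 m electrode spacing, 1 m line spacing, alternate lines staggered by 1 m for net 1 m × 1 m coverage) can be sketched as a coordinate generator; the survey dimensions below are invented for illustration:

```python
import numpy as np

def staggered_electrode_grid(width, length, dx=2.0, dy=1.0):
    """Electrode (x, y) positions: dx spacing along each line, dy line
    spacing, with every other line shifted by dx/2 so the layout covers
    the mound in an effective 1 m x 1 m grid."""
    points = []
    for j in range(int(length / dy) + 1):
        offset = dx / 2 if j % 2 else 0.0   # stagger alternate lines by 1 m
        xs = np.arange(offset, width + 1e-9, dx)
        points.extend((x, j * dy) for x in xs)
    return np.array(points)

# Example: a 20 m x 16 m coverage over a tumulus.
grid = staggered_electrode_grid(20.0, 16.0)
print(len(grid), grid[:3])
```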

A Study on the Architecture Design of Road and Facility Operation Management System for 3D Spatial Data Processing

  • 김덕호;김성진;이정욱
    • Journal of the Korean Association of Geographic Information Studies / Vol. 24, No. 4 / pp.136-147 / 2021
  • Autonomous-driving technology is advancing in stages according to the level of driving automation, and the technology for operating and managing the roads on which autonomous vehicles travel must keep pace with it. However, road operation and management currently relies on two-dimensional information only, which limits systematic management of lane and facility information and of maintenance. This study proposes an architecture for a lane and facility operation management system that enables operation and management based on 3D spatial information, by designing a converged database that combines high-definition (HD) road map data with real-time big-data processing, in place of the current system based on 2D spatial data. If an HD-map-based operation management system is built on this design and used for lane and facility maintenance, facilities can be visualized and managed, multiple users can edit and analyze the data, the system can interoperate with various GIS software, and strengthened security, backup, and recovery functions should allow large volumes of real-time data to be processed efficiently.

MASSIVE STRUCTURES OF GALAXIES AT HIGH REDSHIFTS IN THE GREAT OBSERVATORIES ORIGINS DEEP SURVEY FIELDS

  • Kang, Eugene;Im, Myungshin
    • Journal of the Korean Astronomical Society / Vol. 48, No. 1 / pp.21-55 / 2015
  • If the Universe is dominated by cold dark matter and dark energy as in the currently popular ${\Lambda}CDM$ cosmology, it is expected that large scale structures form gradually, with galaxy clusters of mass $M{\geq}10^{14}M_{\odot}$ appearing at around 6 Gyrs after the Big Bang (z ~ 1). Here, we report the discovery of 59 massive structures of galaxies with masses greater than a few times $10^{13}M_{\odot}$ at redshifts between z = 0.6 and 4.5 in the Great Observatories Origins Deep Survey fields. The massive structures are identified by running top-hat filters on the two dimensional spatial distribution of magnitude-limited samples of galaxies using a combination of spectroscopic and photometric redshifts. We analyze the Millennium simulation data in a similar way to the analysis of the observational data in order to test the ${\Lambda}CDM$ cosmology. We find that there are too many massive structures (M > $7{\times}10^{13}M_{\odot}$) observed at z > 2 in comparison with the simulation predictions by a factor of a few, giving a probability of < 1/2500 of the observed data being consistent with the simulation. Our result suggests that massive structures have emerged early, but the reason for the discrepancy with the simulation is unclear. It could be due to the limitation of the simulation such as the lack of key, unrecognized ingredients (strong non-Gaussianity or other baryonic physics), or simply a difficulty in the halo mass estimation from observation, or a fundamental problem of the ${\Lambda}CDM$ cosmology. On the other hand, the over-abundance of massive structures at high redshifts does not favor heavy neutrino mass of ~ 0.3 eV or larger, as heavy neutrinos make the discrepancy between the observation and the simulation more pronounced by a factor of 3 or more.
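
The structure-finding step, counting galaxies in a fixed circular aperture slid across the projected field, can be illustrated with a brute-force top-hat filter; the grid resolution, aperture radius, and overdensity measure below are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def top_hat_overdensity(ra, dec, radius, n_grid=100):
    """Count galaxies inside a circular top-hat aperture centered on each
    grid node and return the overdensity relative to the mean count.

    ra, dec: galaxy positions (degrees) in one redshift slice, assuming
    angles small enough to treat the sky patch as flat.
    """
    gx = np.linspace(ra.min(), ra.max(), n_grid)
    gy = np.linspace(dec.min(), dec.max(), n_grid)
    counts = np.empty((n_grid, n_grid))
    for i, cx in enumerate(gx):
        for j, cy in enumerate(gy):
            r2 = (ra - cx) ** 2 + (dec - cy) ** 2
            counts[i, j] = np.count_nonzero(r2 < radius ** 2)
    return counts / counts.mean() - 1.0   # delta = (N - <N>) / <N>

# Example: a uniform random field plus one artificial overdense clump.
rng = np.random.default_rng(0)
ra = np.concatenate([rng.uniform(0, 1, 2000), rng.normal(0.5, 0.01, 100)])
dec = np.concatenate([rng.uniform(0, 1, 2000), rng.normal(0.5, 0.01, 100)])
print(top_hat_overdensity(ra, dec, radius=0.05).max())  # strong positive peak
```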

Korea Emissions Inventory Processing Using the US EPA's SMOKE System

  • Kim, Soon-Tae;Moon, Nan-Kyoung;Byun, Dae-Won W.
    • Asian Journal of Atmospheric Environment / Vol. 2, No. 1 / pp.34-46 / 2008
  • Emissions inputs for air quality modeling of Korea were generated from the emissions inventory data of the National Institute of Environmental Research (NIER), maintained under the Clean Air Policy Support System (CAPSS) database. Source Classification Codes (SCC) in the Korea emissions inventory were adapted for use with the U.S. EPA's Sparse Matrix Operator Kernel Emissions (SMOKE) system by finding the best-matching SMOKE default SCCs for chemical speciation and temporal allocation. A set of 19 surrogate spatial allocation factors for South Korea was developed utilizing the Multi-scale Integrated Modeling System (MIMS) Spatial Allocator and Korean GIS databases. After temporal allocation, the mobile and area source emissions data show typical sinusoidal diurnal variations with high daytime peaks, while point source emissions show weak diurnal variations. The model-ready emissions are speciated for the carbon bond version 4 (CB-4) chemical mechanism. Volatile organic compound (VOC) emissions from painting-related industries in the area source category contribute significantly to TOL (toluene) and XYL (xylene) emissions. ETH (ethylene) emissions come largely from point-source industrial incineration facilities and various mobile sources. On the other hand, a large portion of OLE (olefin) emissions is speciated from mobile sources, in addition to the contribution of the polypropylene industry in the point source category. FORM (formaldehyde) was found to be emitted mostly by the petroleum industry and heavy-duty diesel vehicles. Chemical speciation of PM2.5 emissions shows that PEC (primary fine elemental carbon) and POA (primary fine organic aerosol) are the most abundant species from diesel and gasoline vehicles. To reduce the uncertainties introduced by mapping Korean SCCs to U.S. SCCs, it would be practical to develop and use domestic source profiles for the top 10 SCCs for area and point sources and the top 5 SCCs for on-road mobile sources whenever VOC emissions from those sources exceed 90% of the total.
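
The SCC matching and chemical speciation described above are essentially table joins: each Korean SCC is assigned its best-matching SMOKE default SCC, whose speciation profile splits a VOC total into CB-4 model species. A hypothetical sketch follows; all codes and split factors are invented for illustration, not taken from CAPSS or SMOKE:

```python
# Hypothetical SCC matching and CB-4 speciation; every code and
# mass fraction below is invented for illustration only.

# Korean SCC -> best-matching U.S. EPA SMOKE default SCC.
SCC_MAP = {
    "KR-040101": "2401005000",  # e.g., a painting-related area source
    "KR-070203": "2294000000",  # e.g., an on-road mobile source
}

# SMOKE SCC -> mass fractions splitting total VOC into CB-4 species.
SPECIATION = {
    "2401005000": {"TOL": 0.30, "XYL": 0.25, "PAR": 0.45},
    "2294000000": {"OLE": 0.10, "ETH": 0.15, "PAR": 0.75},
}

def speciate(korean_scc: str, voc_tons: float) -> dict:
    """Map a Korean SCC to its SMOKE default SCC and split the VOC
    total into CB-4 model species using the profile's mass fractions."""
    profile = SPECIATION[SCC_MAP[korean_scc]]
    return {species: frac * voc_tons for species, frac in profile.items()}

# Example: 100 tons of VOC from a painting-related area source.
print(speciate("KR-040101", 100.0))
# {'TOL': 30.0, 'XYL': 25.0, 'PAR': 45.0}
```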