• Title/Summary/Keyword: multi-dimensional databases

Multi-Dimensional Keyword Search and Analysis of Hotel Review Data Using Multi-Dimensional Text Cubes (다차원 텍스트 큐브를 이용한 호텔 리뷰 데이터의 다차원 키워드 검색 및 분석)

  • Kim, Namsoo;Lee, Suan;Jo, Sunhwa;Kim, Jinho
    • Journal of Information Technology and Architecture
    • /
    • v.11 no.1
    • /
    • pp.63-73
    • /
    • 2014
  • With the advance of the WWW, unstructured data, including text, is attracting more and more user interest. Such unstructured data created by WWW users represents their subjective opinions, so appropriate analysis can yield very useful information such as users' personal tastes or perspectives. In this paper, we provide various efficient analyses of unstructured text documents by taking advantage of OLAP (On-Line Analytical Processing) multidimensional cube technology. OLAP cubes have been widely used for multidimensional analysis of structured data such as simple alphabetic and numeric data, but they have not been used for unstructured data consisting of long texts. To provide multidimensional analysis of unstructured text data, the Text Cube model has recently been proposed. It incorporates term frequency and inverted index, which play key roles in information retrieval, as measures for searching and analyzing text databases. The primary goal of this paper is to apply the text cube model to a real data set from an Internet site that shares hotel information and to provide multidimensional analysis of users' written hotel reviews. To achieve this goal, we first build text cubes for the hotel review data. Using the text cubes, we design and implement a system that provides multidimensional keyword search features for searching and analyzing review texts along various dimensions. This system will be able to help users easily obtain valuable summary information reflecting guests' subjective opinions. Furthermore, this paper evaluates the proposed system through various experiments, which show its effectiveness.
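
The text cube idea in the abstract above can be illustrated with a minimal sketch. The two dimensions (city, star rating), the sample reviews, and the names `build_text_cube` and `keyword_search` are hypothetical choices made for illustration; this is not the paper's implementation. Each cube cell keeps term frequencies and an inverted index as its measures, and `"*"` stands for a rolled-up dimension.

```python
from collections import Counter, defaultdict

# Hypothetical sample reviews: (doc_id, city, stars, text)
REVIEWS = [
    (1, "Seoul", 5, "clean room friendly staff"),
    (2, "Seoul", 3, "noisy room"),
    (3, "Busan", 5, "friendly staff great view"),
]

def build_text_cube(reviews):
    """Aggregate term frequency and an inverted index for every (city, stars)
    cell, including rolled-up cells where a dimension is '*' (all values)."""
    cube = defaultdict(lambda: {"tf": Counter(), "inv": defaultdict(set)})
    for doc_id, city, stars, text in reviews:
        terms = text.split()
        for cell in [(city, stars), (city, "*"), ("*", stars), ("*", "*")]:
            for term in terms:
                cube[cell]["tf"][term] += 1
                cube[cell]["inv"][term].add(doc_id)
    return cube

def keyword_search(cube, cell, term):
    """Return (term frequency, sorted matching doc ids) for a term in one cell."""
    measures = cube.get(cell)
    if measures is None:
        return 0, []
    return measures["tf"][term], sorted(measures["inv"].get(term, ()))
```

For example, `keyword_search(build_text_cube(REVIEWS), ("Seoul", "*"), "room")` looks up the keyword across all star ratings for one city without rescanning the review texts.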

An Improved Algorithm for Building Multi-dimensional Histograms with Overlapped Buckets (중첩된 버킷을 사용하는 다차원 히스토그램에 대한 개선된 알고리즘)

  • 문진영;심규석
    • Journal of KIISE:Databases
    • /
    • v.30 no.3
    • /
    • pp.336-349
    • /
    • 2003
  • Histograms have been getting a lot of attention recently. They are commonly used in commercial database systems to capture attribute value distributions for query optimization. Recently, with the advent of research on approximate query answering and stream data, interest in histograms has been spreading widely. The simplest approach assumes that the attributes in relational tables are independent, under the AVI (Attribute Value Independence) assumption. However, this assumption is not generally valid for real-life datasets. To alleviate the problem of approximating multi-dimensional data with multiple one-dimensional histograms, several techniques such as wavelets, random sampling, and multi-dimensional histograms have been proposed. Among them, GENHIST is a multi-dimensional histogram designed to approximate the distribution of data with real-valued attributes. It uses overlapping buckets, which allow a more efficient approximation of the data distribution. In this paper, we propose a scheme, OPT, that determines the optimal frequencies of overlapping buckets so as to minimize the SSE (Sum Squared Error). A histogram with overlapping buckets is first generated by GENHIST, and OPT then improves the histogram by calculating the optimal frequency for each bucket. Our experimental results confirm that our technique significantly improves the accuracy of histograms generated by GENHIST.
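
One way to picture "optimal frequencies that minimize the SSE" is as a least-squares problem. The sketch below is an assumption-laden illustration, not the paper's OPT algorithm: it assumes the domain is split into small cells, that each bucket spreads its frequency uniformly over the cells it covers, and that the true per-cell counts are known. The names `optimal_frequencies`, `solve_linear`, and `sse` are hypothetical.

```python
def solve_linear(m, b):
    """Solve m x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def sse(coverage, counts, freqs):
    """Sum squared error between true cell counts and the histogram estimate."""
    return sum((c - sum(a * f for a, f in zip(row, freqs))) ** 2
               for row, c in zip(coverage, counts))

def optimal_frequencies(coverage, counts):
    """coverage[j][i] is the fraction of bucket i's area lying in cell j;
    counts[j] is the true number of points in cell j. Returns the bucket
    frequencies minimizing the SSE, via the normal equations A^T A f = A^T c
    (assumes A^T A is non-singular)."""
    k = len(coverage[0])
    ata = [[sum(row[p] * row[q] for row in coverage) for q in range(k)]
           for p in range(k)]
    atc = [sum(row[p] * c for row, c in zip(coverage, counts)) for p in range(k)]
    return solve_linear(ata, atc)
```

In a toy one-dimensional case with four cells, a wide bucket covering all cells and a narrow bucket covering the middle two would be described by `coverage = [[0.25, 0], [0.25, 0.5], [0.25, 0.5], [0.25, 0]]`.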

Similarity-Based Subsequence Search in Image Sequence Databases (이미지 시퀀스 데이터베이스에서의 유사성 기반 서브시퀀스 검색)

  • Kim, In-Bum;Park, Sang-Hyun
    • The KIPS Transactions:PartD
    • /
    • v.10D no.3
    • /
    • pp.501-512
    • /
    • 2003
  • This paper proposes an indexing technique for fast retrieval of similar image subsequences using the multi-dimensional time warping distance. The time warping distance is a more suitable similarity measure than the Lp distance in many applications where sequences may have different lengths and/or different sampling rates. Our indexing scheme employs a disk-based suffix tree as an index structure and uses a lower-bound distance function to filter out dissimilar subsequences without false dismissals. It applies normalization for easier control of the relative weighting of feature dimensions, and discretization to compress the index tree. Experiments on medical and synthetic image sequences verify that the proposed method significantly outperforms the naive method and scales well with large volumes of image sequence data.
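
For readers unfamiliar with the time warping distance mentioned above, here is a minimal sketch of the classic dynamic-programming formulation over sequences of feature vectors. The choice of Euclidean distance between individual vectors is an assumption; the paper's exact base distance, lower bound, and suffix-tree index are not reproduced here.

```python
import math

def time_warping_distance(s, q):
    """Classic dynamic-programming time warping distance between two
    sequences of equal-dimension feature vectors. Elements may be
    stretched (repeated) so that sequences of different lengths align."""
    n, m = len(s), len(q)
    inf = float("inf")
    # d[i][j] = warping distance between the first i elements of s
    # and the first j elements of q.
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(s[i - 1], q[j - 1])  # assumed base distance
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Two sequences that trace the same path at different sampling rates, such as `[(0, 0), (1, 1), (2, 2)]` and the same points with some repeated, get distance zero, which an Lp distance cannot express for unequal lengths.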

Efficient Execution of Range Top-$k$ Queries using a Hierarchical Max R-Tree (계층 최대 R-트리를 이용한 범위 상위-$k$ 질의의 효율적인 수행)

  • 홍석진;이상준;이석호
    • Journal of KIISE:Databases
    • /
    • v.31 no.2
    • /
    • pp.132-139
    • /
    • 2004
  • A range top-$k$ query returns the top $k$ records in order of a measure attribute within a specified region of multi-dimensional data, and it is a powerful tool for analysis in spatial databases and data warehouse environments. In this paper, we propose an algorithm for answering the query via selective traversal of a Hierarchical Max R-Tree (HMR-tree). The query can be executed by accessing only a small part of the leaf nodes in the query region, and query performance is nearly constant regardless of the size of the query region. The algorithm manages the priority queue efficiently to reduce the cost of handling the queue, and the proposed HMR-tree guarantees the same fan-out as the original R-tree.
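
The selective traversal described above can be sketched as a best-first search over a tree whose nodes store the maximum measure of their subtree. This is a simplified illustration under stated assumptions, not the HMR-tree itself: the tree is built by hand, and `Node`, `point`, `group`, and `range_top_k` are hypothetical names.

```python
import heapq
from itertools import count

class Node:
    def __init__(self, box, children=None, record=None):
        self.box = box                    # (xmin, ymin, xmax, ymax)
        self.children = children or []
        self.record = record              # (x, y, measure) for leaf entries
        # Maximum measure in this subtree, stored at build time.
        if record is not None:
            self.max_measure = record[2]
        else:
            self.max_measure = max(c.max_measure for c in self.children)

def point(x, y, measure):
    return Node((x, y, x, y), record=(x, y, measure))

def group(children):
    box = (min(c.box[0] for c in children), min(c.box[1] for c in children),
           max(c.box[2] for c in children), max(c.box[3] for c in children))
    return Node(box, children=children)

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_top_k(root, region, k):
    """Pop nodes in decreasing order of their stored maximum; a record popped
    from the queue cannot be beaten by anything still in the queue."""
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = []
    if intersects(root.box, region):
        heap.append((-root.max_measure, next(tie), root))
    result = []
    while heap and len(result) < k:
        _, _, node = heapq.heappop(heap)
        if node.record is not None:
            result.append(node.record)
            continue
        for child in node.children:
            if intersects(child.box, region):
                heapq.heappush(heap, (-child.max_measure, next(tie), child))
    return result
```

Because subtrees with small maxima stay unexpanded in the queue, only a few leaf entries inside the query region are visited once $k$ results have been found.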

Korea Emissions Inventory Processing Using the US EPA's SMOKE System

  • Kim, Soon-Tae;Moon, Nan-Kyoung;Byun, Dae-Won W.
    • Asian Journal of Atmospheric Environment
    • /
    • v.2 no.1
    • /
    • pp.34-46
    • /
    • 2008
  • Emissions inputs for use in air quality modeling of Korea were generated with the emissions inventory data from the National Institute of Environmental Research (NIER), maintained under the Clean Air Policy Support System (CAPSS) database. Source Classification Codes (SCC) in the Korea emissions inventory were adapted for use with the U.S. EPA's Sparse Matrix Operator Kernel Emissions (SMOKE) system by finding the best-matching SMOKE default SCCs for the chemical speciation and temporal allocation. A set of 19 surrogate spatial allocation factors for South Korea was developed utilizing the Multi-scale Integrated Modeling System (MIMS) Spatial Allocator and Korean GIS databases. The mobile and area source emissions data, after temporal allocation, show typical sinusoidal diurnal variations with high peaks during daytime, while point source emissions show weak diurnal variations. The model-ready emissions are speciated for the carbon bond version 4 (CB-4) chemical mechanism. Volatile organic compound (VOC) emissions from painting-related industries in the area source category contribute significantly to TOL (Toluene) and XYL (Xylene) emissions. ETH (Ethylene) emissions come largely from point industrial incineration facilities and various mobile sources. On the other hand, a large portion of OLE (Olefin) emissions is speciated from mobile sources, in addition to those contributed by the polypropylene industry among point sources. It was found that FORM (Formaldehyde) is mostly emitted from the petroleum industry and heavy-duty diesel vehicles. Chemical speciation of PM2.5 emissions shows that PEC (primary fine elemental carbon) and POA (primary fine organic aerosol) are the most abundant species from diesel and gasoline vehicles. To reduce uncertainties in processing the Korea emission inventory due to the mapping of Korean SCCs to those of the U.S., it would be practical to develop and use domestic source profiles for the top 10 SCCs for area and point sources and the top 5 SCCs for on-road mobile sources when VOC emissions from the sources are more than 90% of the total.

High-Dimensional Image Indexing based on Adaptive Partitioning and Vector Approximation (적응 분할과 벡터 근사에 기반한 고차원 이미지 색인 기법)

  • Cha, Gwang-Ho;Jeong, Jin-Wan
    • Journal of KIISE:Databases
    • /
    • v.29 no.2
    • /
    • pp.128-137
    • /
    • 2002
  • In this paper, we propose the LPC+-file for efficient indexing of high-dimensional image data. With the proliferation of multimedia data, there is an increasing need to support the indexing and retrieval of high-dimensional image data. Recently, the LPC-file (5), which is based on vector approximation, has been developed for indexing high-dimensional data. The LPC-file performs well, especially when the dataset is uniformly distributed. However, its performance degrades when the dataset is clustered. We improve the performance of the LPC-file for strongly clustered image datasets. The basic idea is to adaptively partition the data space to find subspaces with high-density clusters and to assign more bits to them than to others, increasing the discriminatory power of the vector approximations. The total number of bits used to represent vector approximations is rather less than that of the LPC-file, since the partitioned cells in the LPC+-file share bits. An empirical evaluation shows that the LPC+-file yields significant performance improvements for real image data sets that are strongly clustered.

A search mechanism for moving objects in a spatial database (공간 데이타베이스에서 이동 객체의 탐색기법)

  • 유병구;황수찬;백중환
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.35C no.1
    • /
    • pp.25-33
    • /
    • 1998
  • This paper presents an algorithm for quickly searching for an object that contains a continuously moving object in multi-dimensional spatial databases. The algorithm improves the R-tree search method for the case in which a target object is continuously moving in a spatial database. It starts the search from the current node instead of the root of the R-tree. Thus, in most cases the algorithm finds the target object among the entries of the current node or its sibling nodes. The performance analysis shows that it is more efficient than the existing R-tree algorithm when search windows or target objects are continuously moving.

A Clustering Method for Optimizing Spatial Locality (공간국부성을 최적화하는 클러스터링 방법)

  • 김홍기
    • Journal of KIISE:Databases
    • /
    • v.31 no.2
    • /
    • pp.83-90
    • /
    • 2004
  • In this paper, we study the CCD (Clustering with Circular Distance) and COD (Clustering with Obstructed Distance) problems, which arise when objects are clustered in a circular search space and in a search space containing obstacles. We also propose a new clustering algorithm for efficiently clustering objects in a multi-dimensional search space where insertions and deletions occur frequently. The distance function for solving the CCD and COD problems is defined in the proposed clustering algorithm. The algorithm includes a clustering method that creates clusters with high spatial locality in minimal computation time.

VP Filtering for Efficient Query Processing in R-tree Variants Index Structures (R-tree 계열의 인덱싱 구조에서의 효율적 질의 처리를 위한 VP 필터링)

  • Kim, Byung-Gon;Lee, Jae-Ho;Lim, Hae-Chull
    • Journal of KIISE:Databases
    • /
    • v.29 no.6
    • /
    • pp.453-463
    • /
    • 2002
  • With the prevalence of multi-dimensional data such as images, content-based retrieval of data is becoming increasingly important. To handle multi-dimensional data, multi-dimensional index structures such as the R-tree, R*-tree, TV-tree, and MVP-tree have been proposed. Numerous research results on how to effectively manipulate these structures have been presented during the last decade. Query processing strategies, which are important for reducing processing time, are one such area of research. In this paper, we propose query processing algorithms for R-tree based structures. The novel aspect of these algorithms is that they make use of VP filtering, a concept borrowed from the MVP-tree. Filtering allows computational overhead to be delayed until absolutely necessary. By doing so, we attain considerable performance benefits while paying insignificant overhead during construction of the index structure. We implemented our algorithms and carried out experiments to demonstrate the capability and usefulness of our method. For both range queries and incremental queries, and for index trees of all dimensionalities, the response time with VP filtering was always shorter than without it. We quantitatively showed that VP filtering is closely related to the response time of the query.
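
The filtering idea can be sketched with the triangle inequality: if distances from every data point to a vantage point (VP) are precomputed, a range query can skip the full distance computation whenever |d(q, vp) - d(p, vp)| exceeds the query radius. The sketch below applies this to a flat list of points; how the paper embeds the filter into R-tree nodes is not shown, and `build_vp_table` and `range_query` are hypothetical names.

```python
import math

def build_vp_table(points, vantage_point):
    """Precompute each point's distance to the vantage point (done once,
    at index construction time)."""
    return [math.dist(p, vantage_point) for p in points]

def range_query(points, vp_dists, vantage_point, query, radius):
    """Return (matches, number of full distance computations performed)."""
    dq = math.dist(query, vantage_point)
    matches, computed = [], 0
    for p, dp in zip(points, vp_dists):
        # Triangle inequality: d(q, p) >= |d(q, vp) - d(p, vp)|,
        # so this point cannot lie within the radius.
        if abs(dq - dp) > radius:
            continue
        computed += 1
        if math.dist(query, p) <= radius:
            matches.append(p)
    return matches, computed
```

The filter compares two scalars per point, so the saving grows with the dimensionality of the vectors, where each full distance computation is more expensive.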

Design of an Efficient Parallel High-Dimensional Index Structure (효율적인 병렬 고차원 색인구조 설계)

  • Park, Chun-Seo;Song, Seok-Il;Sin, Jae-Ryong;Yu, Jae-Su
    • Journal of KIISE:Databases
    • /
    • v.29 no.1
    • /
    • pp.58-71
    • /
    • 2002
  • Generally, multi-dimensional data such as image and spatial data require a large amount of storage space. There is a limit to how much of such data a single workstation can store and manage. If we manage the data in a parallel computing environment, which is being actively researched these days, we can obtain greatly improved performance. In this paper, we propose a parallel high-dimensional index structure that exploits the parallelism of the parallel computing environment. The proposed index structure is an nP(processor)-n$\times$mD(disk) architecture, a hybrid of the nP-nD and 1P-nD types. Its node structure increases fan-out and reduces the height of the index tree. Also, a range search algorithm that maximizes I/O parallelism is devised and applied to K-nearest neighbor queries. Through various experiments, it is shown that the proposed method outperforms other parallel index structures.