• Title/Summary/Keyword: Database management


CNN-based Recommendation Model for Classifying HS Code (HS 코드 분류를 위한 CNN 기반의 추천 모델 개발)

  • Lee, Dongju;Kim, Gunwoo;Choi, Keunho
    • Management & Information Systems Review
    • /
    • v.39 no.3
    • /
    • pp.1-16
    • /
    • 2020
  • Under the current tariff reporting system, taxpayers must calculate the tax amount themselves and pay it on their own responsibility. In other words, in principle, the duty and responsibility of the self-reporting payment system are imposed solely on the taxpayer, who is required to calculate and pay the tax accurately. If the taxpayer fails to fulfill this duty, additional taxes are imposed by collecting the tax shortfall and levying penalty charges. For this reason, item classification, together with tariff assessment, is among the most difficult tasks and can pose a significant risk to entities if items are misclassified. Import declarations are therefore often consigned to customs brokers, who are customs experts, at a substantial fee. The purpose of this study is to classify the HS items to be reported upon import declaration and to suggest the HS codes to be recorded on the declaration. HS items were classified using the images attached to the item classification decision cases published by the Korea Customs Service. For image classification, CNN, a deep learning algorithm commonly used for image recognition, was employed, using the VGG16, VGG19, ResNet50, and Inception-V3 models. To improve classification accuracy, two datasets were created: Dataset1 selected the five HS code types with the most images, and Dataset2 consisted of five types drawn from Chapter 87, the chapter with the most images among the two-digit HS codes. Classification accuracy was highest when training on Dataset2 with the Inception-V3 model, while ResNet50 showed the lowest accuracy. The study demonstrated the feasibility of classifying HS items from the first item image registered in an item classification decision case, and its second contribution is that HS item classification, which had not been attempted before, was carried out with CNN models.
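To make the transfer-learning setup concrete, here is a minimal sketch, not the authors' code: the directory layout ("data/train"), image size, batch size, and epoch count are assumptions, and only the Inception-V3 variant, the paper's best performer, is shown.

```python
# Minimal transfer-learning sketch for HS-code image classification.
# Assumptions (not from the paper): images sorted into one folder per
# HS code under data/train, 224x224 inputs, and 5 target classes.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

NUM_CLASSES = 5  # e.g., the five HS-code types in Dataset1/Dataset2

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32)

# Reuse ImageNet features; train only a small classification head.
base = InceptionV3(include_top=False, weights="imagenet",
                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=10)
```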

A Semantic Classification Model for e-Catalogs (전자 카탈로그를 위한 의미적 분류 모형)

  • Kim Dongkyu;Lee Sang-goo;Chun Jonghoon;Choi Dong-Hoon
    • Journal of KIISE:Databases
    • /
    • v.33 no.1
    • /
    • pp.102-116
    • /
    • 2006
  • Electronic catalogs (or e-catalogs) hold information about the goods and services offered or requested by the participants and consequently form the basis of an e-commerce transaction. Catalog management is complicated by a number of factors, and product classification is at the core of these issues. The classification hierarchy is used for spend analysis, customs regulation, and product identification. Classification is the foundation on which product databases are designed and plays a central role in almost all aspects of the management and use of product information. However, product classification has received little formal treatment in terms of its underlying model, operations, and semantics. We believe that the lack of a logical model for classification introduces a number of problems not only for the classification itself but also for the product database in general. A classification needs to meet diverse user views to support efficient and convenient use of product information. It needs to change and evolve frequently, without breaking consistency, as new products are introduced, existing products become extinct, and classes are reorganized or specialized. It also needs to be merged and mapped with other classification schemes without information loss when B2B transactions occur. To satisfy these requirements, a classification scheme must be dynamic enough to accommodate such changes at reasonable time and cost. The classification schemes in wide use today, such as UNSPSC and eClass, however, have many limitations with respect to these dynamic features. In this paper, we try to understand what it means to classify products and present how best to represent classification schemes so as to capture the semantics behind the classifications and facilitate mappings between them. Product information carries a great deal of semantics, such as class attributes like material, time, and place, as well as integrity constraints. We analyze the dynamic features of product databases and the limitations of existing code-based classification schemes, and we describe a semantic classification model that satisfies the requirements for the dynamic features of product databases. It provides a means to explicitly and formally express richer semantics for product classes and organizes class relationships into a graph. We believe the model proposed in this paper satisfies the requirements and challenges that have been raised by previous works.
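As a rough illustration of what organizing class relationships into a graph with explicit attributes can look like, here is a toy sketch; it is not the paper's formal model, and the class names, attribute keys, and the eClass-style code are invented for the example.

```python
# Toy sketch of a semantic classification scheme as a graph (not the
# paper's formal model): classes carry explicit attributes, and edges
# capture subclass-of and cross-scheme mapping relationships.
from dataclasses import dataclass, field

@dataclass
class ProductClass:
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. material, place
    parents: list = field(default_factory=list)     # subclass-of edges
    mapped_to: list = field(default_factory=list)   # cross-scheme mappings

classes = {}

def add_class(name, attributes=None, parents=()):
    cls = ProductClass(name, attributes or {}, list(parents))
    classes[name] = cls
    return cls

add_class("Furniture")
add_class("Office Chair", {"material": "steel/fabric"}, parents=["Furniture"])

# Map a class in one scheme to a code in another (eClass-style code is
# hypothetical) without losing attributes -- a mapping edge, not a copy.
classes["Office Chair"].mapped_to.append("eClass:23-33-01-01")

def inherited_attributes(name):
    """Collect attributes along subclass-of edges (semantic inheritance)."""
    cls = classes[name]
    attrs = {}
    for p in cls.parents:
        attrs.update(inherited_attributes(p))
    attrs.update(cls.attributes)
    return attrs

print(inherited_attributes("Office Chair"))
```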

Present Status on the Pesticide Residue Monitoring Program of South Korea and Its Improvement (한국의 잔류농약 모니터링 프로그램 현황과 개선)

  • Lee, Mi-Gyung
    • Journal of Food Hygiene and Safety
    • /
    • v.34 no.3
    • /
    • pp.219-226
    • /
    • 2019
  • This study was conducted to understand the overall status of South Korea's monitoring program for pesticide residues in foods, and propositions for its improvement were made. From this study, the status of the program can be summarized as follows. In South Korea, the Ministry of Food and Drug Safety (MFDS) is responsible for the overall control of pesticide residue monitoring. Depending on the time of monitoring (sampling at the distribution or production step), the responsible government agency differs: the MFDS, the Regional Offices of Food and Drug Safety, and local governments are responsible for monitoring foods at the distribution step, while the National Agricultural Products Quality Management Service (NAQS) and local governments are responsible for monitoring foods at the production step (and partially at the sale and distribution steps). According to the purpose of monitoring, domestic monitoring programs can be divided into two types: the MFDS's "Residue Survey" and the NAQS's "National Residue Survey" are conducted mainly for risk assessment purposes, while the various monitoring programs of the Regional Offices of Food and Drug Safety and local governments are conducted mainly for regulatory purposes. For imported foods, monitoring should be conducted at both the customs clearance and distribution steps: the MFDS and the Regional Offices of Food and Drug Safety are responsible for the former, and local governments are additionally responsible for the latter. However, it appeared that systematic and consistent monitoring programs are not being conducted for imported foods at the distribution step. Based on the information described above and the more detailed information included in this paper, the following proposals for improving the monitoring program were put forward: i) further clarification of the monitoring program's purpose, ii) strengthening of the monitoring program for imported foods, and iii) providing the public with monitoring results through publication of an annual report and a database. An exhaustive review of the pesticide residue monitoring program and efforts toward its improvement are needed in order to assure both food safety and the success of the recently introduced positive list system (PLS).

Establishment of A WebGIS-based Information System for Continuous Observation during Ocean Research Vessel Operation (WebGIS 기반 해양 연구선 상시관측 정보 체계 구축)

  • HAN, Hyeon-Gyeong;LEE, Cholyoung;KIM, Tae-Hoon;HAN, Jae-Rim;CHOI, Hyun-Woo
    • Journal of the Korean Association of Geographic Information Studies
    • /
    • v.24 no.1
    • /
    • pp.40-53
    • /
    • 2021
  • Research vessels (R/Vs) used for ocean research move to a planned research area and perform ocean observations suited to the research purpose. The five research vessels of the Korea Institute of Ocean Science & Technology (KIOST) are equipped with global positioning system (GPS), water depth, weather, and sea surface layer temperature and salinity measurement equipment that can operate continuously during a cruise. An information platform is required to systematically manage and utilize the data produced by such continuous observation equipment. Therefore, the data flow was defined through a series of business analyses covering the research vessel operation plan, observation during vessel operation, data collection, data processing, data storage, display, and service. After creating a functional design for each stage of the business process, the KIOST Underway Meteorological & Oceanographic Information System (KUMOS), a Web-based Geographic Information System (WebGIS) platform, was built. Since the data produced during R/V cruises are temporally and spatially variable, a quality management system was developed that takes these variabilities into account. For the systematic management and service of data, the KUMOS integrated database (DB) was established, and functions such as R/V tracking, data display, search, and provision were implemented. The dataset provided by KUMOS for each cruise consists of the cruise report, raw data, quality control (QC) flagged data, filtered data, cruise track line data, and a data report. The business processing procedures and system functions of KUMOS developed through this study are expected to serve as a benchmark for domestic ocean-related institutions and universities that operate research vessels capable of continuous observation during cruises.
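As an illustration of what flag-based quality control on underway time series can look like, here is a toy sketch; the variable names, thresholds, and flag values are assumptions for the example, not KUMOS's actual quality-management rules.

```python
# Toy range-check QC flagging for underway time-series data
# (column names and thresholds are illustrative assumptions,
# not KUMOS's actual quality-management rules).
import pandas as pd

# 1 = good, 4 = bad, following common oceanographic QC flag conventions.
GOOD, BAD = 1, 4

RANGES = {
    "sst_c": (-2.0, 35.0),        # sea surface temperature, deg C
    "salinity_psu": (2.0, 41.0),  # practical salinity
    "depth_m": (0.0, 11000.0),    # water depth
}

def qc_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Attach a <var>_qc column per variable using gross range checks."""
    out = df.copy()
    for col, (lo, hi) in RANGES.items():
        ok = df[col].between(lo, hi)
        out[col + "_qc"] = ok.map({True: GOOD, False: BAD})
    return out

obs = pd.DataFrame({
    "time": pd.date_range("2021-03-01", periods=3, freq="min"),
    "sst_c": [12.3, 12.4, 48.0],           # last value fails range check
    "salinity_psu": [33.1, 33.2, 33.1],
    "depth_m": [85.0, 86.0, 86.5],
})
flagged = qc_flag(obs)
print(flagged[["time", "sst_c", "sst_c_qc"]])
```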

A Study on the Linkage and Development of the BRM Based National Tasks and the Policy Information Contents (BRM기반 국정과제와 정책정보콘텐츠 연계 및 구축방안에 관한 연구)

  • Younghee, Noh;Inho, Chang;Hyojung, Sim;Woojung, Kwak
    • Journal of the Korean Society for information Management
    • /
    • v.39 no.4
    • /
    • pp.191-213
    • /
    • 2022
  • With a view to providing a high-quality policy information service beyond the existing national task service of the national policy information portal (POINT) of the National Library of Korea, Sejong, it is necessary to effectively provide the policy data needed for implementing the new national tasks. Accordingly, this study attempts to find a way to link and develop the BRM-based national tasks and the policy information contents. Toward this end, first, the types of national tasks and the contents of each field and area of the government function classification system were analyzed, focusing on the 120 national tasks of the new administration. Furthermore, by comparing and analyzing the national tasks of the previous and current administrations, the contents to be reflected in developing contents related to the identified national tasks were derived. Second, a method for linking and collecting policy information was sought based on an analysis of the current status of policy information and the national information portal. As a result of the study, first, examination of the first-stage BRM of the national tasks showed 21 tasks for social welfare; 14 for unification and diplomacy; 17 for industry, trade, and small and medium-sized businesses; 12 for general public administration; 8 for economy, taxation, and finance; 6 each for culture, sports, and tourism, science and technology, and education; 5 each for communication and public order and safety; 4 each for health, transportation and logistics, and environment; 3 for agriculture and forestry; 2 each for national defense and regional development; and 1 for maritime affairs and fisheries, among others. For the new administration, science, technology, and IT are clearly important, and this should be considered when developing information services for the core national tasks. Second, to link the database with external organizations, it is necessary to form a linked-operation council, link and collect information on the national tasks, and link and provide national task-related information through POINT.

Preservation of World Records Heritage in Korea and Further Registry (한국의 세계기록유산 보존 현황 및 과제)

  • Kim, Sung-Soo
    • Journal of Korean Society of Archives and Records Management
    • /
    • v.5 no.2
    • /
    • pp.27-48
    • /
    • 2005
  • This study investigates the current preservation and management of the four Korean records and documentary heritage items in UNESCO's Memory of the World Register, analyzes the problems encountered in digitizing them, and proposes corresponding solutions. The study also reviews four additional Korean documentary works on the wish list for UNESCO's Memory of the World Register. The study is organized as follows. Chapter 2 examines the value and meaning of Korea's world records and documentary heritage, along with the registry requirements and procedures of UNESCO's Memory of the World Register. The currently registered records of Korea are Hunmin-Chongum, the Annals of the Choson Dynasty, the Diaries of the Royal Secretariat (Seungjeongwon Ilgi), and Buljo-Jikji-Simche-Yojeol (vol. II). The worth and significance of these records are carefully analyzed. For example, Hunmin-Chongum ("訓民正音") consists of unique and systematic letters, which were carefully explained with examples in the original manual at the time of their creation, an unparalleled case in world documentary history. The Annals of the Choson Dynasty ("朝鮮王朝實錄") are the most comprehensive historical documents covering the longest period of time; their truthfulness and reliability in describing history lend the annals credibility. The Royal Secretariat Diary (Seungjeongwon-Ilgi, "承政院日記") is the most voluminous primary historical source, exceeding the Annals of the Choson Dynasty and the Twenty-Five Histories of China. Jikji ("直指") is the oldest existing book in the world printed with movable metal type; it evidences the beginning of metal printing in world printing history and is worthy of standing as world heritage. The review of the four registered records confirms that they are valuable world documentary heritage that transmits the culture of mankind to future generations and should be preserved carefully and safely, without deterioration or loss. Chapter 3 investigates the current preservation and management of the three repositories that hold the four registered records: the Kyujanggak Archives at Seoul National University, the Pusan Records and Information Center of the National Records and Archives Service, and the Gansong Art Museum. The quality of preservation and management is excellent at all three institutions in the following respects: 1) detailed security measures are close to perfection; 2) archiving practices are meticulous, using special stack rooms with steady temperature and humidity and depositing items in stacks or archival boxes made of paulownia wood; and 3) fire prevention, lighting, and fumigation are thoroughly provided for. Chapter 4 summarizes the status of digitization projects for records heritage in Korea. The most important issue in digitization and database construction for Korean records heritage is the standardization of digitization processes and facilities; it is urgently necessary to develop comprehensive standards for digitization. Two institutions are closely involved in these tasks: 1) the National Records and Archives Service, experienced in developing government records management systems, and 2) the Cultural Heritage Administration, interested in the digitization of old Korean documents. In collaboration between these two institutions, a new standard system can be designed for digitizing records heritage in Korean studies.
Chapter 5 deals with the additional Korean records heritage on the wish list for UNESCO's Memory of the World Register: 1) the Wooden Printing Blocks (經板) of Koryo-Taejangkyong (高麗大藏經) in Haein Temple (海印寺); 2) Dongui-Bogam ("東醫寶鑑"); 3) Samguk-Yusa ("三國遺事"); and 4) Mugujeonggwangdaedaranigyeong. Their world value and importance are examined as follows. The Wooden Printing Blocks of Koryo-Taejangkyong in Haein Temple are the world's oldest surviving wooden printing blocks of the Buddhist canon, created over 750 years ago. They need special conservation treatment to disinfect the microorganisms residing on the surface and inside of the wooden plates; otherwise, they may be seriously damaged. For their effective conservation and preservation, we hope that UNESCO and the government will arrange special care and budget and add them to the Memory of the World Register. Dongui-Bogam is the most comprehensive and well-written medical book in Korean history, summarizing all the medical books of Korea and China from ancient times through the early 17th century and concentrating on Korean herbal medicine and prescriptions. It proved to be the best clinical guidebook of the 17th century, easy for doctors and practitioners to use. The book was also published in China and Japan in the 18th century and greatly influenced the development of clinical practice and medical research in Asia at that time. This is why Dongui-Bogam is on the wish list for registration in the Memory of the World. Samguk-Yusa is regarded as one of the most comprehensive history books and treasured sources in Korea, illustrating the foundations of the Korean people and covering the histories and cultures of the ancient Korean peninsula and nearby countries. The book contains the oldest fixed-form verse, called Hyang-Ka (鄕歌), and became the origin of Korean literature. In particular, the Gi-ee (紀異篇) section describes the historical process of dynastic transition from the first dynasty, Gochosun (古朝鮮), to Goguryeo (高句麗) and illustrates the identity of the Korean people from its historical origin. This book is worthy of being added to the Memory of the World Register. Mugujeonggwangdaedaranigyeong is the oldest existing book printed from wood blocks, estimated to have been printed between 706 and 751. It is a great documentary heritage, representing the oldest surviving woodblock-printed book in the world and illustrating the history of woodblock printing in Korea.

Design and Implementation of MongoDB-based Unstructured Log Processing System over Cloud Computing Environment (클라우드 환경에서 MongoDB 기반의 비정형 로그 처리 시스템 설계 및 구현)

  • Kim, Myoungjin;Han, Seungho;Cui, Yun;Lee, Hanku
    • Journal of Internet Computing and Services
    • /
    • v.14 no.6
    • /
    • pp.71-84
    • /
    • 2013
  • Log data, which record the multitude of information created when operating computer systems, are utilized in many processes, from computer system inspection and process optimization to customized user services. In this paper, we propose a MongoDB-based unstructured log processing system in a cloud environment for processing the massive amount of log data generated by banks. Most of the log data generated during banking operations come from handling clients' business. Therefore, in order to gather, store, categorize, and analyze the log data generated while processing clients' business, a separate log data processing system needs to be established. However, in existing computing environments it is difficult to realize the flexible storage expansion needed to process massive amounts of unstructured log data and to execute the considerable number of functions needed to categorize and analyze them. Thus, in this study, we use cloud computing technology to realize a cloud-based log data processing system for unstructured log data that are difficult to process using the existing computing infrastructure's analysis tools and management systems. The proposed system uses an IaaS (Infrastructure as a Service) cloud environment to provide flexible expansion of computing resources, including the ability to flexibly expand resources such as storage space and memory under conditions such as extended storage or a rapid increase in log data. Moreover, to overcome the processing limits of existing analysis tools when real-time analysis of the aggregated unstructured log data is required, the proposed system includes a Hadoop-based analysis module for quick and reliable parallel-distributed processing of the massive amount of log data. Furthermore, because HDFS (Hadoop Distributed File System) stores data by generating copies of the block units of the aggregated log data, the proposed system offers automatic restore functions that allow it to continue operating after recovering from a malfunction. Finally, by establishing a distributed database using the NoSQL-based MongoDB, the proposed system provides methods for effectively processing unstructured log data. Relational databases such as MySQL have rigid schemas that are inappropriate for processing unstructured log data; moreover, such strict schemas make it difficult to expand by adding nodes when the stored data must be distributed across various nodes as the amount of data rapidly increases. NoSQL does not provide the complex computations that relational databases offer, but it can easily expand the database through node dispersion when the amount of data increases rapidly; it is a non-relational database with a structure appropriate for processing unstructured data. NoSQL data models are usually classified as Key-Value, column-oriented, or document-oriented. Of these, the representative document-oriented data store, MongoDB, which has a free schema structure, is used in the proposed system. MongoDB was chosen because its flexible schema structure makes it easy to process unstructured log data, it facilitates flexible node expansion when the amount of data grows rapidly, and it provides an Auto-Sharding function that automatically expands storage.
The proposed system is composed of a log collector module, a log graph generator module, a MongoDB module, a Hadoop-based analysis module, and a MySQL module. When the log data generated over the entire client business process of each bank are sent to the cloud server, the log collector module collects and classifies the data according to log type and distributes them to the MongoDB module and the MySQL module. The log graph generator module generates the results of the log analyses of the MongoDB module, the Hadoop-based analysis module, and the MySQL module per analysis time and type of aggregated log data, and provides them to the user through a web interface. Log data that require real-time analysis are stored in the MySQL module and provided in real time by the log graph generator module. The log data aggregated per unit time are stored in the MongoDB module and plotted in graphs according to the user's various analysis conditions. The aggregated log data in the MongoDB module are processed in parallel-distributed fashion by the Hadoop-based analysis module. A comparative evaluation of log insert and query performance is carried out against a log data processing system that uses only MySQL; this evaluation demonstrates the proposed system's superiority. Moreover, an optimal chunk size is confirmed through a MongoDB log data insert performance evaluation for various chunk sizes.
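As a small illustration of why a schema-free document store suits heterogeneous log records, here is a minimal pymongo sketch; the database and collection names, fields, and the aggregation are invented for the example and are not the paper's implementation.

```python
# Minimal sketch of schema-free log storage with MongoDB (pymongo).
# Database/collection names and fields are illustrative assumptions,
# not the paper's actual implementation.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
logs = client["bankops"]["logs"]

# Documents in one collection may carry different fields -- MongoDB's
# flexible schema is what makes unstructured log data easy to ingest.
logs.insert_many([
    {"ts": datetime.now(timezone.utc), "type": "transaction",
     "branch": "001", "amount": 250000, "status": "ok"},
    {"ts": datetime.now(timezone.utc), "type": "auth_failure",
     "client_ip": "10.0.0.7", "reason": "expired certificate"},
])

# Aggregate log counts per type, roughly what a log graph generator
# might query to plot aggregated log data per unit time.
for row in logs.aggregate([
    {"$group": {"_id": "$type", "count": {"$sum": 1}}},
]):
    print(row)
```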

Development of a complex failure prediction system using Hierarchical Attention Network (Hierarchical Attention Network를 이용한 복합 장애 발생 예측 시스템 개발)

  • Park, Youngchan;An, Sangjun;Kim, Mintae;Kim, Wooju
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.4
    • /
    • pp.127-148
    • /
    • 2020
  • A data center is a physical facility for accommodating computer systems and related components and is an essential foundation for next-generation core industries such as big data, smart factories, wearables, and smart homes. In particular, with the growth of cloud computing, proportional expansion of data center infrastructure is inevitable. Monitoring the health of data center facilities is a way to maintain and manage the system and prevent failure. If a failure occurs in some element of the facility, it may affect not only the relevant equipment but also other connected equipment, causing enormous damage. Failures in IT facilities are irregular because of interdependence, and their causes are difficult to determine. Previous studies on failure prediction in data centers treated each server as a single, isolated state, without assuming interaction among devices. Therefore, in this study, data center failures were classified into failures occurring inside a server (Outage A) and failures occurring outside a server (Outage B), and the analysis focused on complex failures occurring within servers. Server-external failures include power, cooling, and user errors; since such failures can be prevented in the early stages of data center construction, various solutions are being developed. On the other hand, the causes of failures occurring inside servers are difficult to determine, and adequate prevention has not yet been achieved. This is because server failures do not occur in isolation: a failure in one server may trigger failures in other servers or be triggered by failures elsewhere. In other words, while existing studies analyzed failures under the assumption of a single server that does not affect other servers, this study assumes that failures have effects between servers. To define complex failure situations in the data center, failure history data for each piece of equipment in the data center were used. Four major failure types were considered in this study: Network Node Down, Server Down, Windows Activation Services Down, and Database Management System Service Down. The failures occurring in each device were sorted in chronological order, and when a failure occurred in one piece of equipment, any failure occurring in another within 5 minutes was defined as simultaneous. After constructing sequences of devices that failed at the same time, the 5 devices that most frequently failed together within the constructed sequences were selected, and the cases in which the selected devices failed simultaneously were confirmed through visualization. Since the server resource information collected for failure analysis is a time series with temporal flow, we used Long Short-Term Memory (LSTM), a deep learning algorithm that can predict the next state from previous states. In addition, unlike the single-server case, the Hierarchical Attention Network deep learning model structure was used, in consideration of the fact that the contribution of each server to a complex failure differs. This algorithm increases prediction accuracy by giving greater weight to servers with a greater impact on the failure. The study began by defining the types of failure and selecting the analysis targets.
In the first experiment, the same collected data were modeled both as single-server states and as multiple-server states, and the results were compared and analyzed. The second experiment improved prediction accuracy for complex failures by optimizing the threshold for each server. In the first experiment, under the single-server assumption, three of the five servers were predicted to have no failure even though failures actually occurred; under the multiple-server assumption, all five servers were correctly predicted to have failed. These results support the hypothesis that there are effects between servers. This study thus confirmed that prediction performance is superior when multiple servers are assumed rather than a single server. In particular, applying the Hierarchical Attention Network algorithm, under the assumption that each server's effect differs, improved the analysis, and applying a different threshold for each server further improved prediction accuracy. This study showed that failures whose causes are difficult to determine can be predicted from historical data, and it presents a model that can predict failures occurring on servers in data centers. The results are expected to help prevent failures in advance.
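To make the hierarchical idea concrete, here is a speculative toy sketch, not the authors' model: a shared LSTM encodes each server's metric sequence, and an attention layer weights the servers before the failure prediction; the tensor shapes and layer sizes are assumptions.

```python
# Toy hierarchical-attention sketch for multi-server failure prediction
# (assumed shapes and sizes; not the paper's actual architecture).
import tensorflow as tf
from tensorflow.keras import layers, models

N_SERVERS, T, N_METRICS = 5, 60, 8  # 5 servers, 60 time steps, 8 metrics

inp = layers.Input(shape=(N_SERVERS, T, N_METRICS))

# Lower level: a shared LSTM encodes each server's time series.
server_encoder = layers.LSTM(32)
encoded = layers.TimeDistributed(server_encoder)(inp)  # (batch, 5, 32)

# Upper level: attention over servers -- servers whose state matters
# more for the complex failure receive larger weights.
scores = layers.Dense(1)(encoded)           # (batch, 5, 1)
weights = layers.Softmax(axis=1)(scores)    # attention weights per server
context = layers.Lambda(
    lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([weights, encoded])

out = layers.Dense(1, activation="sigmoid")(context)  # failure probability
model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

A per-server decision threshold on the predicted probability, tuned separately for each server as in the second experiment, would then turn these probabilities into failure alerts.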

A Study on the Identifying OECMs in Korea for Achieving the Kunming-Montreal Global Biodiversity Framework - Focusing on the Concept and Experts' Perception - (쿤밍-몬트리올 글로벌 생물다양성 보전목표 성취를 위한 우리나라 OECM 발굴방향 연구 - 개념 고찰 및 전문가 인식을 중심으로 -)

  • Hag-Young Heo;Sun-Joo Park
    • Korean Journal of Environment and Ecology
    • /
    • v.37 no.4
    • /
    • pp.302-314
    • /
    • 2023
  • This study aims to explore the direction of Korea's effective response to Target 3 (30by30), the core of the Kunming-Montreal Global Biodiversity Framework (K-M GBF) of the Convention on Biological Diversity (CBD), and to find a direction for systematic identification of OECMs (Other Effective area-based Conservation Measures) at the national level through a review of the global concept and a survey of expert perceptions of OECMs. The study examined ① the use of Korean terms related to OECM, ② the derivation of determination criteria reflecting global standards, ③ the derivation of types of potential OECM candidates in Korea, and ④ considerations for OECM identification and reporting, in order to explore a direction for identifying systematic, national-level OECMs that comply with global standards and reflect the Korean context. First, there was consensus on using Korean terminology that reflects the concept of OECM rather than a simple translation, and "nature coexistence area" was the most preferred term (12 people), consistent with the CBD 2050 Vision of "a world of living in harmony with nature." The study suggests using four criteria (1. Not a protected area; 2. Geographic boundaries; 3. Governance/management; 4. Biodiversity value) that reflect OECM's core characteristics in the first-stage selection process, carrying out a consensus-building process with the relevant agencies in stage 2, and adding two criteria (3-1. Effectiveness and sustainability of governance and management; 4-1. Long-term conservation) for in-depth diagnosis in stage 3 (full assessment for reporting). The 28 types examined in this study were generally compatible with the OECM concept (4.45-6.21 out of 7 points, mean 5.24). In particular, "Conservation Properties (6.21 points)" and "Conservation Agreements (6.07 points)", which are managed by the National Nature Trust, were found to be most in line with the OECM concept. They were followed by "Buffer zone of a World Natural Heritage site (5.77 points)", "Temple forest (5.73 points)", "Green belt (restricted development zone, 5.63 points)", "DMZ (5.60 points)", and "Buffer zone of a biosphere reserve (5.50 points)", all judged to have high potential. In the case of "Uninhabited Islands under Absolute Conservation", agreement that they conform to protected area status (5.83/7 points) was higher than their OECM compatibility (5.52/7 points), so it would be preferable in the future to promote listing these islands in the Korea Database on Protected Areas (KDPA) along with their surrounding waters (1 km). Based on the results of the global OECM standards review and the expert perception survey, 10 items were suggested as considerations for identifying OECMs in the Korean context. Continuous research is needed to identify potential OECMs through site-level assessment based on these considerations and to establish an effective in-situ conservation system at the national level by linking existing protected area systems with identified OECMs.

Index-based Searching on Timestamped Event Sequences (타임스탬프를 갖는 이벤트 시퀀스의 인덱스 기반 검색)

  • 박상현;원정임;윤지희;김상욱
    • Journal of KIISE:Databases
    • /
    • v.31 no.5
    • /
    • pp.468-478
    • /
    • 2004
  • It is essential in various application areas of data mining and bioinformatics to effectively retrieve the occurrences of interesting patterns from sequence databases. For example, consider a network event management system that records the types and timestamp values of events occurring in a specific network component (e.g., a router). A typical query for discovering temporal causal relationships among network events is as follows: 'Find all occurrences of CiscoDCDLinkUp that are followed by MLMStatusUP and subsequently followed by TCPConnectionClose, under the constraint that the interval between the first two events is not larger than 20 seconds and the interval between the first and third events is not larger than 40 seconds.' This paper proposes an indexing method that enables such queries to be answered efficiently. Unlike previous methods, which rely on inefficient sequential scans or on data structures not easily supported by DBMSs, the proposed method uses a multi-dimensional spatial index, proven to be efficient in both storage and search, to find the answers quickly without false dismissals. Given a sliding window W, the input to the multi-dimensional spatial index is an n-dimensional vector whose i-th element is the interval between the first event of W and the first occurrence of the event type Ei in W. Here, n is the number of event types that can occur in the system of interest. The 'curse of dimensionality' may arise when n is large; therefore, dimension selection or event type grouping is used to avoid this problem. The experimental results reveal that the proposed technique can be a few orders of magnitude faster than the sequential scan and ISO-Depth index methods.
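To illustrate the windowing idea, here is a toy sketch, not the paper's exact algorithm: for each sliding window it builds the vector of offsets of the first occurrence of each event type from the window's first event; the event types and window length are taken from the example query, while the sentinel handling for absent types is an assumption.

```python
# Toy sketch of the window-vector construction (an illustration, not
# the paper's exact algorithm): the i-th element is the offset of the
# first occurrence of event type E_i from the window's first event.
EVENT_TYPES = ["CiscoDCDLinkUp", "MLMStatusUP", "TCPConnectionClose"]

def window_vector(window):
    """window: list of (timestamp, event_type), sorted by timestamp."""
    t0 = window[0][0]
    firsts = {}
    for ts, etype in window:
        firsts.setdefault(etype, ts - t0)  # keep only first occurrence
    # Event types absent from the window get None (a sentinel value
    # would be stored in the index in practice).
    return [firsts.get(e) for e in EVENT_TYPES]

events = [(0, "CiscoDCDLinkUp"), (15, "MLMStatusUP"),
          (38, "TCPConnectionClose"), (52, "CiscoDCDLinkUp")]

W = 40  # window length in seconds
for i, (t0, _) in enumerate(events):
    win = [e for e in events[i:] if e[0] - t0 <= W]
    print(window_vector(win))  # vectors to insert into the spatial index
```

The example query then amounts to a range search over these vectors: a CiscoDCDLinkUp offset of 0, an MLMStatusUP offset of at most 20 seconds, and a TCPConnectionClose offset of at most 40 seconds.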