• Title/Summary/Keyword: Text Collection

Building a text collection for Urdu information retrieval

  • Rasheed, Imran;Banka, Haider;Khan, Hamaid M.
    • ETRI Journal / v.43 no.5 / pp.856-868 / 2021
  • Urdu is a widely spoken language in the Indian subcontinent with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared with those in European and other Asian languages. Therefore, following Text Retrieval Conference standards, we constructed an extensive text collection of 85,304 documents from diverse categories, covering over 52 topics, with relevance judgment sets built at a pool depth of 100. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
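
The "pool depth" mentioned in the abstract above refers to TREC-style pooling: for each topic, the top-k documents from every participating retrieval run are merged, and only that merged pool is judged for relevance. The sketch below illustrates the general idea only. The run names, topic IDs, and document IDs are hypothetical, and the abstract does not say which systems contributed to the pools.

```python
from collections import defaultdict


def build_pools(runs, depth=100):
    """Merge the top `depth` documents of every run into one pool per topic.

    runs: mapping of run name -> mapping of topic id -> ranked list of
          document ids (best first).
    Returns a mapping of topic id -> set of document ids to be judged.
    """
    pools = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic, ranked_docs in ranking_by_topic.items():
            # Only the top `depth` documents of each run enter the pool.
            pools[topic].update(ranked_docs[:depth])
    return dict(pools)


# Hypothetical toy runs; a real collection would use depth=100.
runs = {
    "run_a": {"T1": ["d3", "d1", "d7"], "T2": ["d2", "d9"]},
    "run_b": {"T1": ["d1", "d4", "d3"], "T2": ["d9", "d5"]},
}
pools = build_pools(runs, depth=2)
```

Documents outside the pool are conventionally treated as non-relevant at evaluation time, which is why a deeper pool generally gives more complete judgments at a higher assessment cost.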

A study on the xylographica of 「Classified Collection of Medical Prescriptions」 ("의방류취(醫方類聚)"에 대한 판본(版本) 연구)

  • Shin, Soon-Shik;Choi, Hwan-Soo
    • Korean Journal of Oriental Medicine / v.3 no.1 / pp.1-15 / 1997
  • 「Classified Collection of Medical Prescriptions」 (1445) compiles the medical achievements of China and Choseon up to that time, and it is a source of pride for this country. It also deserves careful investigation because it can provide clues to the features of books now lost in China and Korea. How accurately the xylographica of an old book is established determines whether further in-depth study is possible, so the authors investigated the xylographica of 「Classified Collection of Medical Prescriptions」, one of the three main books in Korea. The earlier investigations by Miki Sakae and Kim Doo Jong are noteworthy. Building on their respective works, we searched the 'Annals of the Choseon Dynasty' for records related to 「Classified Collection of Medical Prescriptions」 and estimated the circumstances of its publication. We also examined the situation in China, Japan, and Korea (including North Korea) and tried to estimate the book's original xylographica as far as we could. At King Sejong's command, civil officials and medical officers collaborated from 1443 to 1445 to produce the first draft of 「Classified Collection of Medical Prescriptions」, consisting of 365 books. Then, from 1451 (the first year of Moonjong's reign) to 1464 (the 10th year of Sejo's reign), a large workforce carried out countless rounds of erasure, proofreading, arrangement, and rearrangement to complete the revised version, known as the Sejo text. After three years of woodblock engraving, the first printed edition of 「Classified Collection of Medical Prescriptions」 (also called the Seongjong text), in folding cases and consisting of 266 chapters in 264 volumes, appeared in 1477 (the 8th year of Seongjong's reign).
This was 32 years after the initial completion of the edition. 「Classified Collection of Medical Prescriptions」 therefore exists in three forms: the Sejong text, the Sejo text, and the Seongjong text. Because those texts were plundered during the Japanese invasion of Korea in 1592, none of the original copies remains in Korea. The texts were moved repeatedly, to Kadeungcheongieong, to Kongdeungpyeongio, to the Jesookoan of Edo, to the department of classic books of East University, to the Cheoncho archives, and to the Imperial Museum, and they are now kept in the royal palace (Doseoryo text, Eulhae printing type). Reduced-size republications of 「Classified Collection of Medical Prescriptions」 in wooden type were imported at the time of the Byeongja Korea-Japan Treaty of 1876. Of those two copies, one was treasured in the Royal Household of the Yi Dynasty and then lost during the Korean War, circa 1950. The other was held successively, following Kojong's imperial grant, by the royal doctor Hong Cheol Bo, Hong Taek Joo, and Hong Ik Pyo the book agent; it is now kept in the Yonsei University Library and is the only existing copy in Korea at present. In 1965, Dongyang Medical College published a transcription of 「Classified Collection of Medical Prescriptions」 consisting of 11 books. In 1981, after editing and arrangement by the Choonghoa (中華) publishing company, a photoprint copy of 「Classified Collection of Medical Prescriptions」 was published by the Keumgang (金剛) publishing company. In October 1991, the Yeokang (驛江) publishing company produced photocopies of 「Classified Collection of Medical Prescriptions」, which had previously been translated into Korean by the North Korea Institute of Oriental Medicine and issued by a medical publishing company.
In China, two institutes, the Zhejiang Institute of Traditional Chinese Medicine and the Huzhou Traditional Chinese Medical Hospital, cooperated to publish a revised and punctuated text consisting of 11 books by adding punctuation marks to the Japanese Edohakhoondang text, which was used as the reference. Both the Korean and the Chinese editions were based on the 「Classified Collection of Medical Prescriptions」 kept in the royal palace. Any further study of 「Classified Collection of Medical Prescriptions」 can be accurate and objective only when the Japanese text kept in the royal palace is taken as the original copy.

WCTT: Web Crawling System based on HTML Document Formalization (WCTT: HTML 문서 정형화 기반 웹 크롤링 시스템)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.4 / pp.495-502 / 2022
  • Web crawlers, which are mainly used today to collect text from the web, are difficult to maintain and extend because researchers must analyze the tags and styles of HTML documents and then implement different collection logic for each collection channel. To solve this problem, a web crawler should be able to collect text by formalizing HTML documents into the same structure. In this paper, we designed and implemented WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that collects text with a single collection logic by formalizing HTML documents based on tag path and text appearance frequency. Because WCTT collects text with the same logic for all collection channels, it is easy to maintain and to add collection channels. In addition, it provides a preprocessing function that removes stopwords and extracts only nouns for uses such as keyword network analysis.

A Study Comparing the Han Period Bamboo Slats of the Beijing University Collection with the Laoguanshan Collection (북경대학 소장 한대의간(漢代醫簡)과 노관산 의간(老官山醫簡)의 비교 연구)

  • Kim, Beomsu;Kim, Kiwang
    • Journal of Korean Medical classics / v.36 no.1 / pp.33-43 / 2023
  • Objectives: Overlapping contents have been identified between two recently discovered sets of Han period bamboo slats, the so-called "Beidahanjian" and the "Liushibingfang". This study aims to present new knowledge that can be inferred from the concordance of these two texts. Methods: The most recent original texts of the medical part of the Beidahanjian and the medical texts excavated from Laoguanshan, in addition to the Liushibingfang, were compared with each other to determine identical parts. The meaning of these concordances was explored. Results: Identical sentences in two verses in the Beidahanjian and the Laoguanshan texts were identified. Conclusions: The Beidahanjian is a credible Western Han period text, of which the medical bamboo slats are likely to comprise an independent text that combines ancient folk prescriptions with those of doctors.

HTML Text Extraction Using Tag Path and Text Appearance Frequency (태그 경로 및 텍스트 출현 빈도를 이용한 HTML 본문 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.12 / pp.1709-1715 / 2021
  • To extract the necessary text from a web page accurately, one method is to tell the web crawler which tag and style attributes contain the main content. The problem with this method is that the extraction logic must be modified whenever the web page configuration changes. To solve this problem, a previous study proposed extracting text by analyzing the frequency of text appearance, but its performance varied widely depending on the collection channel of the web page. Therefore, in this paper, we propose a method that extracts text with high accuracy from various collection channels by analyzing not only the frequency of text appearance but also the parent tag paths of text nodes extracted from the DOM tree of web pages.
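
To make the tag-path idea in the abstract above concrete, here is a minimal sketch using only the Python standard library. It is not the authors' algorithm: it groups text nodes by their parent tag path and simply keeps the path holding the most text, using total character count as a stand-in for "text appearance frequency". The abstract does not give the actual scoring, and the names `TagPathCollector`, `extract_main_text`, and `SAMPLE` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is code or styling, not page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Groups the text nodes of a document by their parent tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                          # currently open tags
        self.texts_by_path = defaultdict(list)   # "html/body/..." -> texts

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the most recent matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.texts_by_path["/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return the text under the tag path that holds the most characters."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    if not collector.texts_by_path:
        return ""
    best_path = max(
        collector.texts_by_path,
        key=lambda path: sum(len(t) for t in collector.texts_by_path[path]),
    )
    return "\n".join(collector.texts_by_path[best_path])


# Hypothetical page: a menu, an article body, and a footer.
SAMPLE = """
<html><body>
  <div><ul><li>Home</li><li>Login</li></ul></div>
  <div><article>
    <p>First paragraph of the article body.</p>
    <p>Second paragraph with more body text.</p>
  </article></div>
  <div><span>Copyright</span></div>
</body></html>
"""
print(extract_main_text(SAMPLE))
```

Because the selection depends only on tag paths and text volume, the same function can be applied to pages from different sites without site-specific selectors, which is the maintenance benefit the abstract describes. A production extractor would need a more careful score than raw character count.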

Dissemination of the Tale of meifeizhuan to Korea and its Translation Practice (《매비전(梅妃傳)》의 국내유입과 번역양상)

  • Yoo, Hee June;Min, Kuan dong
    • Cross-Cultural Studies / v.27 / pp.255-289 / 2012
  • In the course of completing a National Research Foundation project, I recently found that a handwritten Korean manuscript of The Tale of Mei Fei is kept in the Adan Collection, a significant scholarly discovery given that no relevant research is available. The editions of The Tale of Mei Fei available in Korea include the 《藝苑捃華》 edition, the 《說郛》 edition, and the handwritten Korean manuscript in the Adan Collection. The Adan Collection manuscript, the only handwritten Korean translation of the work, has translations of 《한셩뎨됴비연합덕젼》 and 《당고종무후뎐》 appended to it. As for the practice of translation, most of the text was translated literally, word for word, with some sentences occasionally translated freely. For the poems in the text, the pronunciation of each Chinese character was provided along with the translated text.

A Study of the Taesangugeupbang (Emergency Prescriptions for Childbirth) in the Context of Related Historical Medical Texts (태산구급방 정본화 연구)

  • Park, Hun-Pyeong
    • The Journal of Korean Medical History / v.32 no.1 / pp.1-10 / 2019
  • The Taesangugeupbang (Emergency Prescriptions for Childbirth) is a medical text written by Li-Chengong of China in the early 14th century. It incorporates forms of obstetrics and gynecology in use in the Chosun Dynasty and is quoted in the Hyangyakjibsungbang (Compendium of Prescription from the Countryside), the Euibangyoochui (Classified Collection of Medical Prescriptions), and the Taesanjibyo (Collection of Essentials for Childbirth). The recent rediscovery of Taesangugeupbang manuscripts in Japan has enabled full-scale research on this text. This article is based on a study of these manuscripts and attempts to synthesize the text from the various documents. The article suggests that: (1) critical texts for understanding the Taesangugeupbang include the Uijeoggo (A Review of Medical Books), the Euibangyoochui, and the Taesanjibyo; (2) the Taesangugeupbang may have disappeared from use in Joseon by the late 15th century; (3) the Taesangugeupbang complemented the treatment regimens of other texts and influenced the development of early Chosun ophthalmology; (4) the Taesangugeupbang is quoted in many Joseon medical texts and is related to the author's mentor.

A Study about Inter-Textuality in Modern Hair Style - Focused on Collections - (현대 헤어스타일에 표현된 텍스트의 다원화 현상에 관한 연구 - 컬렉션을 중심으로 -)

  • Kim, Sung-Ah;Yoo, Tae-Soon
    • Fashion & Textile Research Journal / v.11 no.6 / pp.934-941 / 2009
  • The purpose of this study is to examine how the pluralistic phenomenon in text operates in hairstyles, in comparison with fashion, as shown in collections. The results indicate that the pluralistic image of text in modern fashion appears as a pluralistic phenomenon by gender, T.P.O. (time, place, and occasion), coordination, and material, whereas in hairstyles it appears as a pluralistic phenomenon by gender and by material and cultural category. As for the method, the study was limited to fashion collections from 2001 to 2007; hairstyle features were analyzed mainly from photos taken from style.com, an online site specializing in fashion, and a literature review on the theoretical background of intertextuality was carried out in parallel. Analyzing works in terms of the pluralistic phenomenon in text made it possible to view them from a new perspective, different from past perceptions, and opened up the possibility of understanding many unfamiliar representations that could not be understood before. The process of imitating and reconstructing each text according to compositional principles also showed the need for an artist's ability to realize an original world.