Proceedings of the Korean Society for Language and Information Conference (한국언어정보학회:학술대회논문집)
Korean Society for Language and Information (KSLI)
- Annual
Domain
- Linguistics > Linguistics, General
2007.11a
-
Coverage has been a constant thorn in the side of deployed deep linguistic processing applications, largely because of the difficulty in constructing, maintaining and domain-tuning the complex lexicons that they rely on. This paper reviews various strands of research on deep lexical acquisition (DLA), i.e. the (semi-)automatic creation of linguistically-rich language resources, particularly from the viewpoint of DLA for precision grammars.
-
This is a speculative paper, describing a recently started effort to give a formal semantics to semantic annotation schemes. Semantic annotations are intended to capture certain semantic information in a text, which means that it only makes sense to use semantic annotations if these have a well-defined semantics. In practice, however, semantic annotation schemes are used that lack any formal semantics. In this paper we outline how existing approaches to the annotation of temporal information, semantic roles, and reference relations can be integrated in a single XML-based format and can be given a formal semantics by translating them into second-order logic. This is argued to offer an incremental approach to the incorporation of semantic information in natural language processing that does not suffer from the problems of ambiguity and lack of robustness that are common to traditional approaches to computational semantics.
-
Concepts of greater and greater complexity can be constructed by building systems of entities, by relating other entities to that system with a figure-ground relation, by embedding concepts of figure-ground in the concept of change, by embedding that in causality, and by coarsening the granularity and beginning the process over again. This process can be called the Ontological Ascent. It pervades natural language discourse, and suggests that to do lexical semantics properly, we must carefully axiomatize abstract theories of systems of entities, the figure-ground relation, change, causality, and granularity. In this paper, I outline what these theories should look like.
-
Looking back at the history of the formal treatment of linguistics, we cannot disregard the contribution of possible world semantics. The intensional logic of Montague semantics, DRT (Discourse Representation Theory), mental spaces, and situation theory are all closely related to or compared with the notion of possible world. All these theories have commonly clarified the structure of belief contexts or uncertain knowledge, employing hypothesized worlds. In this talk, I will first briefly survey the pedigree of these theories. Next, I will introduce recent developments in modal logic for the representation of (i) knowledge and belief and (ii) time, in which belief modality is precisely discussed together with the accessibility among possible worlds. I will refer to BDI (belief-desire-intention) logic, CTL (computation tree logic), and the sphere-based model in belief revision. Finally, I will discuss how these theories could be applied to the further development of analyses of natural language.
-
Case markers in Korean are omissible in colloquial speech. Previous discourse studies of Caseless bare NPs in Korean show that the information structure of the zero Nominative differs not only from that of the overt Nominative but also from that of the zero Accusative in many respects. This paper aims to provide a basis for these semantic/pragmatic properties of Caseless NPs through the syntactic difference between bare subjects and bare objects: namely, the former are left-dislocated NPs, whereas the latter form complex predicates with the subcategorizing verbs. Our analysis will account for the facts that (i) the distribution of bare subject NPs is more restricted than that of bare object NPs; (ii) bare subject NPs must be specific or topical; (iii) Acc-marked NPs in canonical position tend to be focalized.
-
This paper presents the architectural aspects of a phrase analyzer that attempts to recognize phrases and identify their functional roles in sentences of formal Japanese documents. Since the object of interest is a phrase, the current system, designed in an object-oriented architecture, contains a Phrase class, and makes use of the linguistic generalization about languages with Case markers that a phrase, whether a noun phrase, a verb phrase, a postposition (or preposition) phrase or a clause, can be separated into a content component and a function component. Working without a dictionary, and drawing on orthographic information about the words to be parsed, the system also contains a class that identifies the types of characters, a class representing the grammar, and a class playing the role of a controller. The system has a simple and intuitive structure, externally and internally, and is therefore easy to modify and extend.
-
This paper addresses the task of extracting opinions from given documents and classifying them as positive or negative. We propose a sentence classification method using the notion of a syntactic piece. A syntactic piece is a minimal unit of structure, used as an alternative processing unit to the n-gram and the whole tree structure. We compute the semantic orientation of each piece, and classify opinion sentences as positive or negative. We have conducted an experiment on more than 5000 opinion sentences from multiple domains, and shown that our approach attains high performance at 91% precision.
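The piece-based classification idea can be sketched as follows. This is an illustrative toy, not the authors' implementation: the piece inventory, the head-dependent encoding, and the orientation scores are all assumed for the example.

```python
# Hedged sketch (not the paper's model): classify an opinion sentence
# by summing precomputed semantic-orientation scores of its syntactic
# pieces; the piece inventory and scores below are illustrative.
ORIENTATION = {
    ("battery", "lasts"): 0.8,    # head-dependent piece, positive
    ("screen", "cracked"): -0.9,  # negative
    ("price", "low"): 0.5,
}

def classify_sentence(pieces):
    """Return 'positive' or 'negative' from summed piece orientations."""
    score = sum(ORIENTATION.get(p, 0.0) for p in pieces)
    return "positive" if score >= 0 else "negative"
```

Unknown pieces contribute a neutral score of zero, so the decision rests on the pieces whose orientation the training data has fixed.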
-
This paper introduces BEYTrans (Better Environment for Your TRANSlation), the first experimental environment for free online collaborative computer-aided translation. The requirements and functionalities related to individual translators and communities of translators are distinguished and described. These functionalities have been integrated in a Wiki-based complete environment, equipped with all currently possible asynchronous linguistic resources and translation aids. Functions provided by BEYTrans are also compared with existing CAT systems and ongoing experiments are discussed.
-
We explore a computational algebraic approach to grammar via pregroups. We examine how the structures of Japanese causatives can be treated in the framework of a pregroup grammar. In our grammar, the dictionary assigns one or more syntactic types to each word, and the grammar rules are used to infer the types of strings of words. We developed a practical parser implementing our pregroup grammar, which validates our analysis.
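The core pregroup mechanism, type reduction by cancelling adjacent adjoints, can be sketched in a few lines. This is a generic illustration of pregroup reduction, not the paper's parser or lexicon; the SOV type assignment below is an assumed toy example.

```python
# Hedged sketch of pregroup type reduction (not the paper's parser):
# a simple type is a pair (base, adjoint_order), so x^l = (x, -1) and
# x^r = (x, +1). An adjacent pair (b, n) (b, n+1) cancels to the unit;
# a word string is grammatical if its concatenated types reduce to the
# single sentence type ("s", 0).
def reduces_to_s(simple_types):
    stack = []
    for t in simple_types:
        if stack and stack[-1][0] == t[0] and stack[-1][1] + 1 == t[1]:
            stack.pop()          # x^(n) x^(n+1) -> 1
        else:
            stack.append(t)
    return stack == [("s", 0)]

# Toy SOV example: two nouns followed by a transitive verb of type
# n^r n^r s (an illustrative type assignment, not the paper's lexicon).
sentence = [("n", 0), ("n", 0), ("n", 1), ("n", 1), ("s", 0)]
```

The stack-based scan suffices here because each cancellation only ever involves the most recently pushed simple type.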
-
This paper addresses a method for customizing an English-to-Korean machine translation system from the general domain to the patent domain. The customization consists of the following steps: 1) a linguistic study of the characteristics of patent documents, 2) extracting unknown words from large patent document collections and constructing a large bilingual terminology, 3) extracting and constructing patent-specific translation patterns, 4) customizing the translation engine modules of the existing general-domain MT system according to the linguistic study of patent documents, and 5) evaluating the accuracy of the translation modules and the translation quality. This research was performed under the auspices of the MIC (Ministry of Information and Communication) of the Korean government during 2005-2006. The translation accuracy of the customized English-Korean patent translation system is 82.43% on average across 5 patent fields (machinery, electronics, chemistry, medicine and computer) according to the evaluation of 7 professional human translators. In 2006, the patent MT system began an on-line patent MT service in the IPAC (International Patent Assistance Center) under the MOCIE (Ministry of Commerce, Industry and Energy) in Korea. In 2007, the KIPO (Korean Intellectual Property Office) plans to launch an English-Korean patent MT service.
-
The purpose of this paper is to propose a new type of NPI licensing context through the French subjunctive and expletive ne. The distribution of NPIs in previous studies does not exactly correspond to negative function types; the French subjunctive and expletive ne are good guidelines for reclassifying NPI licensing contexts. My classification is by a hierarchy of strength in negative force: overtly negative proposition > negative entailment > negative implicature. The new types of NPI licensing context are: (i) the I-domain for negative implicature, (ii) the E-domain for negative entailment, and (iii) overt negation.
-
We propose and test several computational methods to automatically determine possible saliency cut-off points in the Sketch Engine (Kilgarriff and Tugwell, 2001). The Sketch Engine currently displays collocations in descending order of importance, as well as according to grammatical relations. However, it does not suggest a cut-off point such that items above it may be considered significantly salient. This proposal suggests an improvement to the present Sketch Engine interface by calculating three different cut-off points, so that the presentation of results can be made more meaningful to users. In addition, our findings also contribute to linguistic analyses based on empirical data.
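One plausible cut-off heuristic of the kind described above is to cut the descending salience list at its largest gap between consecutive scores. The abstract does not specify the three methods tested, so this sketch is an assumed example of the general idea, with invented scores.

```python
# Hedged sketch: one possible cut-off method (largest-gap), not
# necessarily one of the three methods the paper evaluates.
def largest_gap_cutoff(scores):
    """Return the index at which to cut a descending salience list:
    the position of the largest drop between consecutive scores."""
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return gaps.index(max(gaps)) + 1

# Illustrative collocate salience scores, already sorted descending.
salience = [9.1, 8.7, 8.5, 4.2, 4.0, 3.9]
# Largest drop is 8.5 -> 4.2, so the first 3 collocates are kept.
```

Everything before the returned index would be displayed as significantly salient; everything after it would be folded away or de-emphasized in the interface.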
-
This paper explains how we define and represent modality in E-HowNet. Following Lyons (1977, reviewed in Hsieh 2003, among others), we hold that modals express a speaker's opinion or attitude toward a proposition and hence have a pragmatic dimension, and we recognize five modal categories: epistemic, deontic, ability, volition and expectation modality. We then present a representational formalism that contains the three most basic components of modal meaning: modal category, polarity (positive or negative), and strength. Such a formula can define not only modal words but also words that contain modal meanings, and can cope with co-compositions of modals and the negation construction.
-
AutoCor is a method for the automatic acquisition and classification of corpora of documents in closely-related languages. It is an extension and enhancement of CorpusBuilder, a system that automatically builds minority-language corpora from a closed corpus, motivated by the observation that some Tagalog documents retrieved by CorpusBuilder are actually documents in other closely-related Philippine languages. AutoCor uses the query generation method odds ratio, and introduces the concept of common word pruning to differentiate between documents of closely-related Philippine languages and Tagalog. The performance of the system with and without pruning is compared, and common word pruning is found to improve the precision of the system.
-
In this study, we explore the polysemy of da3 through the ontological conceptual structure found in SUMO. First, we distinguish several different senses of da3, clustering them into physical event senses and metaphorical event senses. Here, we focus only on the physical event senses of da3, which we divide into two main categories: 1) hit and 2) pump. We then use SUMO ontological concepts to identify these physical senses. Finally, we observe the common patterns of the "hit" sense group and the "pump" sense group for da3.
-
This paper examines some syntactic and semantic properties of the negative construction V+bo NP (VbN) in Taiwanese Southern Min (TSM). It finds that the VbN construction is ambiguous between an episodic reading and a generic reading, an ambiguity that requires further investigation and explanation. The goal of this paper is therefore to account for the ambiguity of the negative VbN construction.
-
This is a pilot semantic study of how people in Taiwan process the temporal metaphors: the ego-moving metaphor and the time-moving metaphor. Motivated by the research of Gentner, Imai, and Boroditsky (2002), in which English native speakers comprehended ego-moving metaphors faster than time-moving metaphors, the present study reexamines whether the faster reaction to ego-moving metaphors is shared by Chinese native speakers who are also EFL learners. To this end, 25 Chinese/English bilinguals were examined via 16 Chinese and 16 English test sentences. The recordings of their accuracy on each item serve as the data compared with the study of Gentner, Imai, and Boroditsky (2002). The two findings presented here are: (1) when the subjects were tested in their native language, Chinese, they processed ego-moving metaphors better; (2) when tested in the foreign language, English, they conceptualized time-moving metaphors much better.
-
This paper gives a thorough investigation of Mandarin sentence-final particles (henceforth SFPs). First, I induce the core grammatical functions and semantic interpretations of SFPs. Based on Rizzi's (1997) Split-CP hypothesis, I make some modifications to accommodate Mandarin SFPs and map them onto separate functional heads within a proper hierarchy. I also examine some empirical evidence on head directionality and tentatively assume that Mandarin C is head-initial. To explain the surface head-final order, in light of Chomsky's (2001) Phase Theory and Hsieh's (2005) revised Spell-out hypothesis, I posit a CP-complement-to-Spec movement. Following Moro's (2000) idea, I further claim that the motivation behind it is to achieve antisymmetry.
-
This paper is a study on constructing a natural language interface to databases, concentrating on generating textual answers. TGEN, a system that generates textual answers from query result tables, is presented. The TGEN architecture guarantees its portability across domains. A combination of a frame-based approach and natural language generation techniques in TGEN provides text fluency and flexibility. The implementation results show that this approach is feasible, while a deep NLG approach is still far from reach.
-
Compound verbs in Korean show properties of both syntactic phrases and lexical items. Earlier studies of compound verbs have either assumed two homonymous types, i.e. one as a syntactic phrase and the other as a lexical item, or posited some sort of transformation from a syntactic phrase into a lexical item. In this paper, I show empirical and conceptual problems for earlier studies, and present an alternative account in terms of Talmy's (2000) theory of lexicalization. Unlike Talmy, who proposed [Path] conflation into [MOVE] for Korean, I suggest several types of [Co-Event] conflation, e.g. [Co-Event Manner] conflation as in kwul-e-kata 'to go by rolling', [Co-Event Concomitance] conflation as in ttal-a-kata 'to follow', [Co-Event Concurrent Result] conflation as in cap-a-kata 'to catch somebody and go', etc. The present proposal not only places Korean compound verbs in a broader picture of cross-linguistic generalizations but, when viewed from Jackendoff's (1997) productive vs. semi-productive morphology, also provides a natural account for distinguishing the compounds that allow -se intervention from those that do not.
-
Regarding Korean psych-adjectives and their -e ha- counterparts, i.e., [psych-adjective + -e ha-] constructions, what is at issue is how to capture the semantic difference and similarity between the two. Concerning this issue, one of the most controversial and difficult problems is whether the psych-construction has Action (Agency) as part of its meaning. The purpose of this paper is to solve this problem by answering the question of why psych-constructions are much more natural when used as negative imperatives than as positive imperatives. First, in order to figure out why the positive imperative is not allowed, we show that -e ha- adds the meaning of non-volitional action to psych-adjectives, using Jackendoff's Conceptual Semantics. Secondly, in accounting for why the negative imperative is so natural, we show, with Talmy's Force Dynamics theory, that what the speaker requires from the hearer is internal volitional action.
-
The so-called Korean BNC (bound noun construction) displays complex syntactic, semantic, and constructional properties. This paper, couched in a constraint-based approach, proposes two different syntactic structures for the construction, with articulated lexical properties for the BNs and the relevant predicates. The paper reports an implementation of this analysis in the LKB (Linguistic Knowledge Building) system and shows that this direction is robust enough to parse the relevant sentences.
-
This paper argues that various kinds of displaced structures in English should be licensed by a more explicitly formulated type of rule schema in order to deal with what is called weak connectivity in English. It claims that the filler and the gap site cannot maintain total identity of features but only a partial overlap, since the two positions need to obey the structural forces that come from occupying their respective positions. One such case is the missing object construction, where the subject fillers and the object gaps must observe requirements imposed on their respective positions. Other cases include passive constructions and topicalized structures. It is argued that the feature discrepancy comes from the different syntactic positions in which the fillers are assumed to be located before and after displacement. To capture this type of mismatch, syntactically relevant features are handled separately from semantically motivated features so as to deal with the syntactically imposed requirements.
-
In this paper, we propose a Transformation-Based Learning (TBL) method for generating the Korean standard pronunciation. Previous studies of phonological processing have focused on phonological rule application and finite state automata (Johnson 1984; Kaplan and Kay 1994; Koskenniemi 1983; Bird 1995). In Korean computational phonology, some earlier work has taken a phonological rule-based approach to pronunciation generation (Lee et al. 2005; Lee 1998). This study suggests a corpus-based, data-oriented rule learning method for generating the Korean standard pronunciation. In order to replace rule-based generation with a corpus-based one, a corpus aligning each input with its pronunciation counterpart has been devised. We conducted an experiment on generating the standard pronunciation with the TBL algorithm, based on this aligned corpus.
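The greedy TBL loop over an aligned corpus can be sketched as follows. This is a deliberately minimal toy: real Korean pronunciation rules are context-sensitive, whereas this version learns only context-free character rewrites, and the romanized example pairs are invented; it shows only the "pick the highest-gain transformation, apply it, repeat" skeleton.

```python
# Hedged, minimal TBL sketch over a character-aligned corpus
# (illustrative only; not the paper's rule templates or data).
from collections import Counter

def learn_tbl_rules(aligned, max_rules=5):
    """aligned: list of (orthographic_char, pronounced_char) pairs.
    Start from the identity baseline, then greedily add the rewrite
    rule with the largest net error reduction on the corpus."""
    current = [src for src, _ in aligned]
    targets = [tgt for _, tgt in aligned]
    alphabet = set(targets)
    rules = []
    for _ in range(max_rules):
        gain = Counter()
        for c, t in zip(current, targets):
            for t2 in alphabet:
                if t2 == c:
                    continue
                # Net effect of rule c -> t2 at this position:
                # +1 if it fixes an error, -1 if it introduces one.
                gain[(c, t2)] += (c != t) - (t2 != t)
        if not gain:
            break
        best, best_gain = max(gain.items(), key=lambda kv: kv[1])
        if best_gain <= 0:
            break                # no transformation still helps
        rules.append(best)
        current = [best[1] if c == best[0] else c for c in current]
    return rules
```

On a toy corpus where "g" should surface as "k" twice and "b" as "p" once, the learner first picks the rewrite with two fixes, then the one with one fix, and then stops because no remaining rule has positive gain.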
-
This paper presents an implementation of a grammar of Dynamic Syntax for Japanese. Dynamic Syntax is a grammar formalism that enables a parser to process a sentence in an incremental fashion, establishing the semantic representation. Currently, the application of lexical rules and transition rules in Dynamic Syntax is carried out arbitrarily, and this leads to inefficient parsing. This paper provides an algorithm of rule application and a partitioned parsing state for efficient parsing, with special reference to processing Japanese, a head-final language. At the present stage the parser is still small, but it can parse scrambled sentences, relative clause constructions, and embedded clauses. The parser is written in Prolog, and this paper shows that it can process null arguments in complex sentences in Japanese.
-
This paper sets forth the phenomenon of Contrastive Reduplication (CR) in English as relevant to the notion of contrastive focus (CF). CR differs from other reduplicative patterns in that, rather than the general intensive function, the reduplicated form denotes a more prototypical, default meaning of the lexical item, resulting in a semantic contrast with the meaning of the non-reduplicated word. Thus, CR is in concordance with CF under the concept of contrastivity. However, much of the previous work on CF has associated contrastivity with the construction of a set of alternatives, taking a semantic approach. We claim that a recent discourse-pragmatic account is better placed to explain the vague contrast in informativeness of CR. Zimmermann's (2006) Contrastive Focus Hypothesis characterizes contrastivity in terms of the speaker's assumptions about the hearer's expectation of the focused element. This approach can be adapted to CR and recovers the possible subsets of meaning of a reduplicated form in a more refined way, showing contrastivity in informativeness. Additionally, CR in other languages, along with similar set-limiting phenomena in various languages, will be introduced.
-
The present paper investigates a particular structure in Taiwan Mandarin, "(NP) + (intensifier) + gei3 ta1 'give him/it' + adjective", in terms of construction grammar. The structure is mostly observed in utterances of the younger generation. Though it is not regarded as a grammatical or standard structure, it is still a register of the language. The structure lays emphasis on the speaker's attitude toward an undesired, unpleasant event; in most cases, the attitude tends to be negative. The events or propositions must have existed or been completed, and the adjectives compatible with this structure belong to the category of higher degree. The grammatical usage illustrates the semantic bleaching of gei3 ta1, and the change from giving to a grammatical particle denoting subjective belief is a kind of subjectification. Moreover, ta1 can refer to events or situations expressed by a more complicated grammatical structure, or denote nothing as a dummy word. Though many previous studies have paid attention to this newly developed structure resulting from language contact, an adequate account has not been provided. It is hoped that through this investigation we will gain a better understanding of this particular structure.
-
Among the languages that allow long-distance reflexives, some languages show blocking effects, whereas others don't. The goal of this paper is to provide computational algorithms that can handle the presence and absence of blocking effects of long-distance reflexives. We examine the blocking effects in Chinese and Korean and develop computational algorithms for handling blocking effects in those two languages. The algorithms are developed by incorporating Chierchia's Binding Theory into Steedman's Combinatory Categorial Grammar (CCG). Through the analyses and implementations, this paper illustrates how blocking effects can be implemented computationally.
-
The parallel corpus is an important resource in data-driven natural language processing, but only a few parallel corpora are publicly available, mostly due to the heavy labor required to construct this kind of resource. This paper presents a novel strategy for automatically fetching parallel text from the web, which may help to solve the shortage of high-quality parallel corpora. The system we developed first downloads web pages from certain hosts. Then candidate parallel page pairs are selected from the page set based on outer features of the web pages. In the last step the candidate page pairs are evaluated: the sentences in each candidate pair of pages are extracted and aligned, and the similarity of the two pages is then computed from the similarities of the aligned sentences. Experiments on a multilingual web site show the satisfactory performance of the system.
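The last evaluation step described above can be sketched as follows. The sentence alignment and the per-sentence similarity metric are assumed given; the aggregation (a plain mean) and the acceptance threshold are illustrative choices, not the paper's exact formula.

```python
# Hedged sketch: score a candidate page pair from the similarities
# of its aligned sentence pairs (aggregation and threshold assumed).
def page_pair_similarity(sentence_sims):
    """Mean similarity over aligned sentence pairs; 0.0 if none align."""
    if not sentence_sims:
        return 0.0
    return sum(sentence_sims) / len(sentence_sims)

def is_parallel(sentence_sims, threshold=0.6):
    """Accept the page pair as parallel if the mean similarity of its
    aligned sentences clears the (assumed) threshold."""
    return page_pair_similarity(sentence_sims) >= threshold
```

A pair of pages with mostly well-aligned, highly similar sentences passes; a pair whose candidate alignment is weak everywhere is rejected even if the outer features matched.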
-
The present paper targets the excessive structural particle dao in "X + dao + si" phrases, aiming to see how the excessive meaning is generated. The generation of excessiveness will be analyzed from the perspective of cognition, including conceptual structure and metaphor. It will be concluded that the position indicated by si in the conceptual structure plays a crucial role, which in turn shows the importance of collocation. What is more, a comparison of dao with Southern Min kah will be made to gauge the degree of grammaticalization of dao.
-
Named Entity Recognition (NER) is often limited by low recall, resulting from the asymmetric data distribution in which the NONE class dominates the entity classes. This paper presents an approach that exploits non-local information to improve NER recall. Several kinds of non-local features encoding entity token occurrence, entity boundary and entity class are explored under the Conditional Random Fields (CRFs) framework. Experiments on the SIGHAN 2006 MSRA (CityU) corpus indicate that non-local features can effectively enhance the recall of state-of-the-art NER systems. Incorporating the non-local features into NER systems using local features alone, our best system achieves a 23.56% (25.26%) relative error reduction on recall and a 17.10% (11.36%) relative error reduction on the F1 score; the improved F1 score of 89.38% (90.09%) is significantly superior to the best NER system that participated in the closed track, with an F1 of 86.51% (89.03%).
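One common style of non-local feature of the kind described, encoding entity token occurrence, records what label other occurrences of the same token string received in a first pass, so a token missed in one context can be recovered from a context where it was recognized. The sketch below is illustrative of that idea only; it is not the paper's exact feature set, and the token/label data are invented.

```python
# Hedged sketch: a token-majority non-local feature computed from a
# first-pass tagger's output (assumed), to be fed back into a second
# pass as an extra feature column.
from collections import Counter, defaultdict

def token_majority_features(tokens, first_pass_labels):
    """For each token, return the majority label the first pass gave
    to occurrences of the same token string across the document."""
    votes = defaultdict(Counter)
    for tok, lab in zip(tokens, first_pass_labels):
        votes[tok][lab] += 1
    return [votes[tok].most_common(1)[0][0] for tok in tokens]

# Toy document: "MSRA" is tagged ORG twice but missed once.
tokens = ["MSRA", "is", "in", "MSRA", "city", "MSRA"]
labels = ["ORG", "NONE", "NONE", "NONE", "NONE", "ORG"]
```

The missed third occurrence inherits the ORG majority vote, which is exactly the kind of evidence that lifts recall without touching precision-critical local features.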
-
Interrogative sentences are generally used to perform the speech acts of directly asking a question or making a request, but they are also used to convey such speech acts indirectly. In utterances, such indirect uses of interrogative sentences usually carry the speaker's emotion with a negative attitude, one close to an expression of anger. The identification of such negative emotion is known to be a difficult problem that requires relevant information from syntax, semantics, discourse, pragmatics, and speech signals. In this paper, we argue that interrogatives used for indirect speech acts can serve as a dominant marker for identifying emotional attitudes, such as anger, as compared to other emotion-related markers, such as discourse markers, adverbial words, and syntactic markers. To support this argument, we analyze dialogues collected from Korean soap operas, and examine the individual and cooperative influences of the emotion-related markers on emotional realization. The user study shows that interrogatives can be utilized as a promising device for emotion identification.
-
This paper reports on the application of network analysis approaches to investigate the characteristics of graph representations of Japanese word associations. Two semantic networks are constructed from two separate Japanese word association databases. The basic statistical features of the networks indicate that they have scale-free and small-world properties and that they exhibit hierarchical organization. A graph clustering method is also applied to the networks with the objective of generating hierarchical structures within the semantic networks. The method is shown to be an efficient tool for analyzing large-scale structures within corpora. As a utilization of the network clustering results, we briefly introduce two web-based applications: the first is a search system that highlights various possible relations between words according to association type, while the second is to present the hierarchical architecture of a semantic network. The systems realize dynamic representations of network structures based on the relationships between words and concepts.
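The small-world diagnosis mentioned above rests on simple graph statistics such as the average local clustering coefficient. The sketch below computes it in pure Python for an adjacency-set representation; it is a generic illustration of the measure, not the paper's analysis pipeline, and the tiny graph in the test is invented.

```python
# Hedged sketch: average local clustering coefficient of an undirected
# graph given as {node: set(neighbors)} - one standard small-world
# indicator for semantic networks.
def avg_clustering(adj):
    """Mean over all nodes of (edges among a node's neighbors) /
    (possible edges among them); nodes of degree < 2 contribute 0."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges among the neighbors (each unordered pair once).
        links = sum(1 for u in nbrs for w in nbrs
                    if u < w and w in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)
```

A high average clustering coefficient combined with a short average path length is the usual signature of a small-world network, which is what the two association databases are reported to exhibit.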
-
We extracted English expressions that appear in Japanese sentences in newspaper articles and on the Internet. The results from the newspaper articles showed that the preposition "in" has been in regular use for more than ten years and is still regularly used now. The results from the Internet articles showed many kinds of English expressions from various parts of speech. We extracted some interesting expressions that included English prepositions and verb phrases; these were interesting because their word orders differ from the normal order of Japanese expressions. Comparing the extracted English and katakana expressions, we found that the expressions commonly used in Japanese are often written in the katakana syllabary, while expressions not so often used in Japanese, such as prepositions, are hardly ever written in katakana.
-
This paper deals with the complex verb formation of passive and potential predicates and syntactic structures projected by these verbs. Though both predicates are formed with the suffix -rare which has been assumed to originate from the same stem, they show significantly different syntactic behaviors. We propose two kinds of concatenation of base verbs and auxiliaries; passive verbs are lexically formed with the most restrictive mode of combination, while potential verbs are formed syntactically via more flexible combinatory operations of function composition. The difference in the mode of complex verb formation has significant consequences for their syntactic structures and semantic interpretations, including different combination with the honorific morphemes and subjectivization of arguments/adjuncts of base verbs. We also consider the case alternation phenomena and their implications for scope construals found in potential sentences, which can be accounted for in a unified manner in terms of the optional application of function composition.
-
Named entities (NEs) are important in many Natural Language Processing (NLP) applications, and discovering NE-related relations in texts may be beneficial for these applications. This paper proposes a method to extract the ISA relation between a "named entity" and its category, and an IS-RELATED-TO relation between the category and its related object. Based on the pattern extraction algorithm "Person Category Extraction" (PCE), we extend it for solving our problem. Our experiments on Wall Street Journal (WSJ) corpus show promising results. We also demonstrate a possible application of these relations by utilizing them for semantic search.
-
Syntax parsers can benefit from speakers' intuition about constituent structures indicated in the input string in the form of parentheses. Focusing on languages like Korean, whose orthographic convention requires more than one word to be written without spaces, we describe an algorithm for passing the bracketing information across the tagger to the probabilistic CFG parser, together with one for heightening (or penalizing, as the case may be) probabilities of putative constituents as they are suggested by the parser. It is shown that two or three constituents marked in the input suffice to guide the parser to the correct parse as the most likely one, even with sentences that are considered long.
-
This paper investigates the nature of Japanese argument cluster (Steedman 2000b). Based on Combinatory Categorial Grammar, a type-raising analysis of case particles which captures some aspects of the information structure in Japanese is discussed, including contrastive interpretation of coordination, wh-constructions, and some theme and rheme-related grammatical phenomena. These observations offer further support for the study of syntax, semantics, and phonology interface and the earlier analysis of English information structure.
-
This paper describes a method for the automatic acquisition of wide-coverage treebank-based deep linguistic resources for Japanese, as part of a project on treebank-based induction of multilingual resources in the framework of Lexical-Functional Grammar (LFG). We automatically annotate LFG f-structure functional equations (i.e. labelled dependencies) onto the Kyoto Text Corpus version 4.0 (KTC4) (Kurohashi and Nagao 1997) and onto the output of the Kurohashi-Nagao Parser (KNP) (Kurohashi and Nagao 1998), a dependency parser for Japanese. The original KTC4 and KNP provide unlabelled dependencies. Our method also includes zero pronoun identification. The performance of the f-structure annotation algorithm with zero-pronoun identification on KTC4 is evaluated against a manually corrected Gold Standard of 500 sentences randomly chosen from KTC4, and achieves a pred-only dependency f-score of 94.72%. The parsing experiments on KNP output yield a pred-only dependency f-score of 82.08%.
-
Corpora annotated with rich linguistic information are required to develop robust statistical natural language processing systems. Building such corpora, however, is an expensive, labor-intensive, and time-consuming task. To help with this work, we designed and implemented an annotation tool for building a Korean dependency tree-tagged corpus. Compared with other annotation tools, our tool is characterized by the following features: independence from applications, localization of errors, powerful error checking, instant sharing of annotated information, and user-friendliness. Using our tool, we have annotated 100,904 Korean sentences with dependency structures. The number of annotators is 33, the average annotation time is about 4 minutes per sentence, and the total annotation period was 5 months. We are confident that the tool yields accurate and consistent annotations as well as reduced labor and time.
-
This paper explores the nature of multiple sluicing in English, which has two or more remnant wh-phrases in clause-edge position. In the first part of the paper we argue against Nishigauchi's (1998) and Lasnik's (2007) Gapping analysis of multiple sluicing, which says that the two remnant wh-phrases occupy the left and right edges of a clause, respectively, with the intervening string of words undergoing Gapping. We argue instead that multiple sluicing in English is of the same kind as that found in Bulgarian and Serbo-Croatian. In other words, multiple sluicing in English is also derived by multiple wh-fronting, which otherwise does not apply. We demonstrate that some important properties of the construction noted by Lasnik (2007) under the Gapping approach can be accounted for in a principled way by our proposed analysis.
-
This paper describes preliminary work on the prosody modeling component of a text-to-speech system for Thai. Specifically, the model is designed to predict symbolic markers from text (i.e., prosodic phrase boundaries, accents, and intonation boundaries), and then to use these markers to generate pitch, intensity, and durational patterns for the synthesis module of the system. In this paper, a novel method for annotating the prosodic structure of Thai sentences based on a dependency representation of syntax is presented. The goal of the annotation process is to predict from text the rhythm of the input sentence when spoken according to its intended meaning. The encoding of the prosodic structure is established by minimizing speech dysrhythmy while maintaining congruency with syntax. That is, each word in the sentence is assigned a prosodic feature called strength dynamic, which is based on the dependency representation of syntax. The assigned strength dynamics are then used to obtain rhythmic groupings in terms of a phonological unit called the foot. Finally, the foot structure is used to predict the durational pattern of the input sentence. This process has been tested on a set of ambiguous sentences representing various structural ambiguities involving five types of compounds in Thai.
-
This paper proposes a convolution tree kernel-based approach for relation extraction in which the parse tree is expanded with entity features such as entity type, subtype, and mention level. Our study indicates that our method not only effectively captures both the syntactic structure and the entity information of relation instances, but also avoids the difficulty of tuning the parameters in composite kernels. We also demonstrate that predicate verb information can be used to further improve performance, though the gain is limited. Evaluation on the ACE2004 benchmark corpus shows that our system slightly outperforms both the previous best-reported feature-based and kernel-based systems.
-
The rapid growth of online information services has led to information explosion. Automatic text summarization techniques are essential for dealing with this problem. There are different approaches to text summarization, and different systems have used one or a combination of them. Given the wide variety of summarization techniques, there should be an evaluation mechanism to assess the summarization process. Evaluating automatic summarization is important and challenging, since in general it is difficult to agree on an ideal summary of a text. Currently, evaluating summaries is a laborious task that cannot easily be done by humans alone, so automatic evaluation techniques are emerging to address this problem. In this paper, we survey summarization approaches and examine the general architecture of summarizers. The importance of evaluation methods is discussed, as is the need for better automatic systems for evaluating summaries.
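As a hedged sketch of one widely used family of automatic evaluation techniques (not one proposed in this abstract), a ROUGE-style n-gram recall scores a system summary by its n-gram overlap with a human reference; the function names and toy texts here are illustrative:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_recall(reference, summary, n=1):
    """Fraction of reference n-gram types that also appear in the
    system summary (a ROUGE-N-style recall over n-gram types)."""
    ref = set(ngrams(reference.split(), n))
    sys = set(ngrams(summary.split(), n))
    return len(ref & sys) / len(ref) if ref else 0.0

print(ngram_recall("the cat sat on the mat", "the cat on a mat"))  # 0.8
```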
-
In this paper, we use non-negative matrix factorization (NMF) to refine document clustering results. NMF is a dimensionality reduction method and is effective for document clustering, because a term-document matrix is high-dimensional and sparse. The initial matrix of the NMF algorithm can be regarded as a clustering result; we can therefore use NMF as a refinement method. First we perform min-max cut (Mcut), a powerful spectral clustering method, and then refine its result via NMF to obtain a more accurate clustering. However, NMF often fails to improve the given clustering result. To overcome this problem, we use the Mcut objective function as a criterion for stopping the NMF iteration.
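A minimal sketch, not the authors' implementation, of the refinement idea: the NMF factor H is seeded from a given cluster assignment and refined with standard multiplicative updates for the Frobenius objective (the Mcut-based stopping criterion is omitted here; a fixed iteration count stands in for it, and the toy matrix is invented):

```python
import numpy as np

def refine_with_nmf(A, labels, k, iters=50, eps=1e-9):
    """Refine a clustering of the columns (documents) of term-document
    matrix A by NMF, seeding H from the cluster-indicator matrix."""
    n_terms, n_docs = A.shape
    # Seed H from the given clustering: H[c, d] ~ 1 if doc d is in cluster c.
    H = np.full((k, n_docs), eps)
    H[labels, np.arange(n_docs)] = 1.0
    W = np.random.default_rng(0).random((n_terms, k)) + eps
    for _ in range(iters):
        # Lee & Seung multiplicative updates for ||A - WH||_F.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return H.argmax(axis=0)  # refined cluster assignment per document

# Two well-separated "topics" in a toy term-document matrix.
A = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 3., 2.],
              [0., 0., 2., 3.]])
print(refine_with_nmf(A, np.array([0, 0, 1, 1]), k=2))
```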
-
This paper provides a fine-grained analysis of Korean serial verb constructions within the HPSG framework, covering the major descriptive characteristics of the phenomena. The paper discusses constraints on serial verb constructions in terms of four aspects: transitivity, argument structure, semantic properties, and complementizers. As a result, 17 constraints have been formulated, which support the type hierarchies for Korean serial verb constructions. The paper also presents a sample derivation on the basis of the constraints and the type hierarchies.
-
Tombstone inscriptions represent a linguistic genre that yields insights into culture and language. Creating corpora from tombstones is thus a complementary approach to the study of languages and cultures. For the annotation of tombstone corpora, we propose TSML, the Tombstone Markup Language, developed during the large-scale annotation of Taiwanese tombstones and a number of tombstones from China, Indonesia, and Europe. We discuss our conceptual framework for the annotation of tombstones and present preliminary research data to show the usefulness of the annotations. Finally, we encourage researchers to participate in the specification of TSML, so as to soon obtain an annotation language that works across cultures and languages.
-
This study explores the textual characteristics, more precisely the quantity and diversity of nouns, of Japanese prime ministers' Diet addresses. In the field of stylistics, textual characteristics independent of content have been examined with the aim of detecting the authors, genres, and chronological variations of texts. This study focuses instead on textual characteristics related to the content of texts, namely the quantity and diversity of nouns, because our aim is to analyze texts to better understand two political phenomena: (a) the difference between the two types of Diet addresses delivered by Japanese prime ministers, and (b) the perceived changes made to these addresses by two powerful prime ministers. It is a case study in the microscopic characterization of texts, which has become increasingly important with the expanding scope of stylistics and the production of a wide variety of new types of texts following the advent of the Web.
-
Japanese does not exhibit the deontic-epistemic polysemy that is recognized across typologically different languages. Hence, in Japanese linguistics, it has been debated which of the two types of modality is more prototypical. This study brings Chinese learners' acquisition data on Japanese modality to bear on this question, using the Competition Model (Bates and MacWhinney 1981). The Competition Model notion of 'cues' as processing strategies adopted by learners reveals the continuity/discontinuity between these two modality domains.
-
In this paper, we propose a method to classify movie review documents as expressing positive or negative opinions. There are several approaches to classifying such documents; previous studies, however, used only a single classifier for the classification task. We describe a multiple-classifier method for the review document classification task. It consists of three classifiers based on SVMs, maximum entropy (ME), and score calculation. We apply two voting methods and SVMs to integrate the single classifiers. The integrated methods improved accuracy compared with each of the three single classifiers, and the experimental results show the effectiveness of our method.
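A minimal sketch (not the authors' system) of the voting step in such an integration: per-document labels from the three component classifiers are combined by simple majority vote. The three output lists below are hypothetical stand-ins for the SVM, ME, and score-based components:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: labels for one document, one per classifier."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-document outputs of the three single classifiers.
svm_out   = ["pos", "neg", "pos"]
me_out    = ["pos", "pos", "neg"]
score_out = ["neg", "pos", "neg"]

ensemble = [majority_vote(votes) for votes in zip(svm_out, me_out, score_out)]
print(ensemble)  # ['pos', 'pos', 'neg']
```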
-
Named entity translation plays an important role in many applications, such as information retrieval and machine translation. In this paper, we focus on translating person names, the most common type of named entity, in Korean-Chinese cross-language information retrieval (KCIR). Unlike many other languages, Chinese uses characters (ideographs), which makes person name translation difficult because one syllable may map to several Chinese characters. We propose an effective hybrid person name translation method to improve the performance of KCIR. First, we use Wikipedia as a translation tool, based on the inter-language links between the Korean edition and the Chinese or English editions. Second, we adopt the Naver people search engine to find the query name's Chinese or English translation. Third, we extract Korean-English transliteration pairs from Google snippets, and then search for the English-Chinese transliteration in the database of Taiwan's Central News Agency or in Google. The performance of KCIR using our method is over five times better than that of a dictionary-based system. The mean average precision is 0.3490 and the average recall is 0.7534. The method can handle Chinese, Japanese, and Korean, as well as non-CJK, person name translation from Korean to Chinese. Hence, it substantially improves the performance of KCIR.
-
In this paper, I use data from Taiwan Southern Min, a Sinitic variety spoken in Taiwan, to argue that the pseudo-cleft construction is derived from the cleft construction.
-
In order to extract important information about a person from text, an extraction model is proposed. The person's name is recognized using a maximum entropy statistical model and a training corpus. The sentences surrounding the person's name are then analyzed according to a conceptual knowledge base. The three main elements of events, namely domain, situation, and background, are extracted from these sentences to construct the structure of events involving the person.
-
Text categorization is an important field in automatic text information processing, and the authorship identification of a text can be treated as a special case of text categorization. This paper adopts a conceptual-primitive representation based on the Hierarchical Network of Concepts (HNC) theory, which describes word meanings with hierarchical symbols, in order to avoid the data-sparseness problem caused by natural language surface features in text categorization. The KNN algorithm is used as the classification component. Experiments were then conducted on Chinese text authorship identification. The experimental results show that the approach put forward in this paper achieves a high accuracy rate, so it is feasible for text authorship identification.
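A minimal sketch, not the paper's implementation, of KNN classification for authorship identification: texts are represented as feature vectors, and a test text receives the majority label among its k nearest training texts by cosine similarity. In the paper the features would be HNC conceptual-primitive symbols; the toy vectors and author labels below are invented:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(train, vector, k=3):
    """train: list of (feature_vector, author_label) pairs.
    Returns the majority label among the k most similar training texts."""
    neighbours = sorted(train, key=lambda ex: cosine(ex[0], vector),
                        reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy training data: feature vectors for texts by two authors.
train = [([3, 0, 1], "A"), ([2, 1, 0], "A"),
         ([0, 3, 2], "B"), ([1, 2, 3], "B")]
print(knn_classify(train, [2, 0, 1]))  # A
```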
-
This article investigates the use of the distal demonstrative Hitlo in Taiwanese Southern Min (TSM) from a discourse-pragmatic perspective. The analysis is based on a 5-hour corpus of spoken data, including daily conversations, radio interviews, TV drama series, and some random examples. A total of 172 tokens of Hitlo are identified in the data. They can be divided into six categories according to their functions: first, exophoric uses, which refer non-linguistically to an object identifiable in the immediate situation; second, endophoric uses, which refer to an element textually; third, referent-introducing uses, which introduce a new but identifiable referent into the conversation (the referent usually has topical importance); fourth, hedging uses, which serve as markers of imprecision; fifth, condition-introducing uses, which signal a coming conditional sentence; and finally, pause fillers, which help speakers manage speech turns or indicate mental states. In addition, an interactive function that Hitlo is found to serve is discussed. Moreover, a grammaticalization process involving semantic bleaching that Hitlo is probably undergoing is revealed. Finally, a filled-demonstrative principle is proposed, stating that the use of demonstratives as filled pauses may be a universal phenomenon.
-
This paper investigates children's comprehension and production of the demonstrative pronouns (DPs) 'zhege' (this) and 'nage' (that) in Mandarin Chinese. The subjects are children aged three, four, five, and six. Based on the results of the present experiment, children's developmental stages and the corresponding age grading are provided. The present study also incorporates a physical cue into the experiment. The results suggest that in the acquisition of deixis, children rely heavily on physical context to work out the meaning distinction. In addition, Piaget's egocentrism hypothesis and H. Clark's marking hypothesis are examined. The results seem to support the egocentrism hypothesis: subjects under the age of six do fail to shift the deictic center when they and the experimenter have different perspectives. As for the marking hypothesis, the study seems to challenge it: the results show that children actually performed better on the marked term 'zhege' than on the unmarked term 'nage'.
-
It is suggested that the difference between co-referential and bound reflexive pronouns found in many languages can be accounted for by using the notion of the case extension of a type <1> quantifier. Under this proposal, co-referential pronouns get their meaning when the corresponding NP takes the nominal case extension first. Bound reflexives are reflexivisers in the sense that they are not case extensions of quantifiers, although they also transform binary relations into sets. Examples from Japanese and Polish are discussed.
-
We present an automatic semantic annotation system for Korean on the EXCOM (EXploration COntextual for Multilingual) platform. The purpose of natural language processing is to enable computers to understand human language so that they can perform more sophisticated tasks. Accordingly, current research concentrates more and more on extracting semantic information. The realization of semantic processing requires the widespread annotation of documents. However, compared to that for inflectional languages, processing technology for agglutinative languages such as Korean still has shortcomings. EXCOM identifies semantic information in Korean text using our new method, the Contextual Exploration Method. Our initial system properly annotates approximately 88% of standard Korean sentences, and this annotation rate holds across text domains.