Resolving co-reference Anaphora using Semantic Constraints
Anaphora resolution is a cornerstone of computational natural language systems and therefore accounts for a large part of the computational linguistics literature. The study of anaphora is as old as the origin of languages, but its computational study began only in the late seventies. Since then, there has been substantial progress both in our understanding of anaphora and in their resolution by computational systems. A variety of anaphora exist in natural languages and, correspondingly, a variety of strategies are needed to resolve them. While some anaphora can be resolved using basic syntactic and pragmatic strategies, the remainder require the use of various types of semantics. This thesis presents an enhanced framework for resolving anaphora that integrates the existing syntactic and pragmatic strategies with the use of semantics.
In this thesis, we present the use of anaphora in a new light: we emphasize its role as a shortcut elaborative device in addition to its role as a co-reference pointer, the role that previous studies have concentrated on. We show that this elaborative function is achieved by using an alternative word, or a combination of words, to form a noun phrase that serves as an anaphor. This additional functionality can be seen across the whole range of anaphora, from simple pronouns to multi-word noun phrases and compound nouns. We show how the knowledge embedded in an anaphor can be interpreted to identify its antecedent; this is the key strategy behind our new approach.
The resulting anaphora resolution algorithm is directly based on anaphora functioning as a shortcut elaborative device. The algorithm extracts knowledge from the document itself and from WordNet, and uses it to uncover the elaborative information embedded in the anaphor. This information is then used to identify the antecedent entity. The implemented system, named aCAR, is written in Java and takes as input a shallow-parsed clausal structure of each sentence found in newspaper articles. This structure is produced by an in-house, LISP-based parser. The resolution algorithm relies on the fact that information about an entity in a discourse is expressed sequentially, so critical information that can help resolve an anaphor may be expressed after the anaphor is mentioned. This fact gives rise to two crucial aspects of our algorithm. Firstly, an anaphor that is difficult to resolve with the information currently available is left in a semi-resolved state, to be resolved later, instead of being decided immediately on the basis of that information. Hence, our algorithm uses a multi-pass approach. Secondly, our algorithm resolves anaphora at the level of the discourse, not to a single antecedent at a local level. This approach relies on the fact that an entity is referred to by various noun phrases (NPs) in a document, and the information required to resolve an anaphoric NP can be embedded in any one of the NPs referring to the same entity, whether it occurs before or after the anaphor. Hence, when an anaphor is resolved, all the information pertaining to the anaphor NP and the antecedent NP is merged, and this accumulated information is used for all subsequent resolutions as well as for further attempts to resolve anaphors left in a semi-resolved state.
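The two aspects described above (deferring hard anaphors and merging information at the entity level) can be illustrated with a minimal Java sketch. It is not the aCAR implementation: the class MultiPassSketch, the representation of knowledge as plain feature strings, and the matching rule (an anaphor is resolved only when exactly one entity shares a feature with it) are all assumptions made for illustration. In the actual system, such information would be drawn from the document and from WordNet.

```java
import java.util.*;

// Illustrative sketch only; names and the matching rule are hypothetical.
class MultiPassSketch {
    /** A discourse entity that accumulates features from every NP resolved to it. */
    static final class Entity {
        final String name;
        final Set<String> features = new HashSet<>();
        Entity(String name, String... fs) {
            this.name = name;
            features.addAll(Arrays.asList(fs));
        }
    }

    /** An anaphoric NP with the (assumed, pre-extracted) features it carries. */
    static final class Anaphor {
        final String text;
        final Set<String> features;
        Anaphor(String text, String... fs) {
            this.text = text;
            this.features = new HashSet<>(Arrays.asList(fs));
        }
    }

    /** Returns the single entity sharing a feature with the anaphor, or null if none or several do. */
    static Entity uniqueMatch(List<Entity> entities, Anaphor a) {
        Entity found = null;
        for (Entity e : entities) {
            if (!Collections.disjoint(e.features, a.features)) {
                if (found != null) return null; // ambiguous: defer the decision
                found = e;
            }
        }
        return found;
    }

    /** Repeats passes over the pending anaphors until a pass resolves nothing new. */
    static Map<String, String> resolve(List<Entity> entities, List<Anaphor> anaphors) {
        Map<String, String> resolved = new LinkedHashMap<>();
        List<Anaphor> pending = new ArrayList<>(anaphors);
        boolean progress = true;
        while (progress && !pending.isEmpty()) {
            progress = false;
            Iterator<Anaphor> it = pending.iterator();
            while (it.hasNext()) {
                Anaphor a = it.next();
                Entity match = uniqueMatch(entities, a);
                if (match == null) continue;       // stays semi-resolved for a later pass
                match.features.addAll(a.features); // merge anaphor knowledge into the entity
                resolved.put(a.text, match.name);
                it.remove();
                progress = true;
            }
        }
        return resolved;
    }

    /** Toy discourse: "the politician" is mentioned before the NP that makes it resolvable. */
    static Map<String, String> demo() {
        List<Entity> entities = Arrays.asList(
            new Entity("Smith", "smith"),
            new Entity("Jones", "jones"));
        List<Anaphor> anaphors = Arrays.asList(
            new Anaphor("the politician", "politician"),
            new Anaphor("Smith, the MP", "smith", "politician"),
            new Anaphor("the farmer", "farmer"));
        return resolve(entities, anaphors);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy discourse, "the politician" cannot be matched on the first pass and is deferred. Once "Smith, the MP" is resolved and its features are merged into the Smith entity, the second pass resolves "the politician" to the same entity. "The farmer" never gains a match and remains semi-resolved in this sketch.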
The anaphora resolution system was tested using a corpus of 35 online newspaper articles from The Press, The Dominion and The Herald. Of these, 20 were used as training data and 15 as test data. These gave us a total of 915 (out of 2323) anaphoric NPs for training data and 723 (out of 1895) anaphoric NPs for test data. The results were evaluated in terms of correct resolutions, as determined by the author, and were also compared with the results obtained by several similar systems. Our system performed at similar or better levels than most of the systems it was evaluated against, while resolving a much wider range of anaphora. The resolution task most similar to ours, in terms of the range of anaphora attempted and the level of resolution, was that of the Message Understanding Conference (MUC-6). The highest precision rate achieved by the systems participating in this competition was 71%, compared with a precision rate of 78% for our system.
In summary, this thesis delivers results on three levels. Firstly, it enhances the theory of the natural language phenomenon of anaphora usage. Secondly, it provides a relational framework to substantiate the theory. Thirdly, it reports the results of implementing the framework and resolving anaphora in naturally occurring discourses, including a careful evaluation of the system's performance.