An extension and implementation of a computational theory of language
In the past 40 years, there has been much debate between nativists and empiricists about how children acquire the grammar of their first language. Nativists believe that children are born with a “Language Acquisition Device” containing innate knowledge of grammar that bootstraps grammar learning, while empiricists insist that the input and feedback children receive from their parents or caretakers are sufficient.
More recently, Yeap offered an alternative explanation: children acquire their initial knowledge of grammar from understanding multiword utterances. It is argued that what is learned, as part of one's syntactic knowledge, is how the meanings of individual words are passed between adjacent words. Some words pass their meaning to the word on their left or on their right, while others accept the meanings of words from their left, their right, or both. It was claimed that encoding these basic movements in a word is similar to learning the word's syntactic category as identified by linguists. Then, rather than learning formal grammar rules, children learn rules/heuristics for processing words in terms of these allowable movements.
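The idea of words passing and accepting meaning can be illustrated with a toy sketch. This is not UGE itself: the lexicon entries, the movement codes, and the greedy combination strategy below are all hypothetical stand-ins invented for illustration.

```python
# Toy illustration (NOT the actual UGE rules): each word encodes allowable
# "movements" of meaning -- it may pass its meaning to the right, and/or
# accept meanings arriving from its left. The lexicon is hypothetical.
LEXICON = {
    "the":    {"passes": "right"},                      # passes meaning rightwards
    "red":    {"passes": "right"},                      # passes meaning rightwards
    "ball":   {"accepts": {"left"}, "passes": "right"}, # absorbs modifiers, then passes on
    "rolled": {"accepts": {"left"}},                    # absorbs its subject from the left
}

def parse(words):
    """Greedily combine adjacent words according to their movement codes."""
    stack = []
    for word in words:
        entry = LEXICON.get(word, {})
        node = {"head": word, "parts": []}
        # A word that accepts from its left absorbs every pending neighbour
        # whose meaning is passed to the right.
        if "left" in entry.get("accepts", set()):
            while stack and LEXICON.get(stack[-1]["head"], {}).get("passes") == "right":
                node["parts"].insert(0, stack.pop())
        stack.append(node)
    return stack

# "the" and "red" pass right into "ball"; the resulting noun phrase
# passes right into "rolled", leaving a single structure on the stack.
result = parse(["the", "red", "ball", "rolled"])
```

The point of the sketch is that no formal phrase-structure rule is consulted: structure emerges purely from the per-word movement codes, which is the spirit of the claim above.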
Yeap tested his theory with a program, code-named UGE, written in LISP. However, the test used only a small sample of sentences that were either individually constructed or drawn from well-edited published English text. The grammar covered is only a fraction of what one encounters in daily use of the English language. This thesis extends UGE to enable it to parse real-world sentences drawn from New Zealand Herald newspaper articles across different domains (such as business, sports, etc.). Many of these sentences are therefore non-trivial: highly ambiguous and lengthy. If UGE could be extended to parse many of these sentences, then UGE would have “learned” a sophisticated English grammar, and one could be more confident that the theory is feasible.
The work involved testing UGE on more than 1900 sentences drawn from more than 100 newspaper articles. As a result, UGE was re-organised into four main modules, namely: a pre-processor, a more powerful UGE, an intelligent dictionary, and a post-UGE. Furthermore, a fixed test set of 843 sentences, together with the expected parsing results from UGE, was collected and used to automatically re-test UGE after any major modification. This ensures that new modifications do not undo what UGE was doing correctly before being modified. Missing labels were identified and introduced without deviating from the theory, and new rules were added to process these labels. Several bugs were fixed, and the dictionary was extended from the initial 5000 words to 145,699 words. In short, a more powerful UGE has been developed, and the theory was found to be feasible.
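The fixed-test-set regression check described above can be sketched as a small harness. The file format, the `parse` callable, and the gold file name are all hypothetical; the thesis does not specify how the 843 expected results were stored.

```python
# Minimal sketch of a regression harness over a fixed test set, assuming
# the gold data is a JSON list of {"sentence": ..., "expected": ...} records.
# The parse() callable and file name are hypothetical stand-ins.
import json

def regression_test(parse, gold_path):
    """Re-parse every gold sentence; return the cases whose output changed."""
    with open(gold_path) as f:
        gold = json.load(f)
    failures = []
    for case in gold:
        actual = parse(case["sentence"])
        if actual != case["expected"]:
            failures.append((case["sentence"], case["expected"], actual))
    return failures  # empty list => the modification broke nothing
```

Running such a harness after every major change is what guarantees that a fix for one construction does not silently regress sentences the parser already handled.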
The performance of UGE was also evaluated, both in terms of how well it parses sentences and in comparison with existing parsers. The former was assessed via its speed, accuracy, and the number of parse options generated; the latter via comparison with the Link Grammar and the Stanford Parser. The overall result shows that UGE performed better than both, and that UGE can now be put to use in practical applications.