Impact of different speech interfaces of personal devices on users' perception

Wadea, Mazen
Symonds, Judith
Master of Computer and Information Sciences
Auckland University of Technology

Because Text-to-Speech (TTS) lacks both the clarity and the prosody of normal human speech, it sounds unnatural and is unpleasant to listen to. It is generally accepted that natural speech should be used for static prompts and synthetic speech for dynamic content. However, most commercial applications on the market mix human speech and TTS within the same sentence and/or between sentences, and this mixing approach leads to an inconsistent interface (Gong & Lai, 2001). An immediate issue in the design of such speech interfaces is therefore what type of speech should be used. The goal of this project is to explore users’ perceptions of different types of speech in order to investigate the acceptability of personal speech interfaces. The study is aimed at members of the public who use mobile applications. The project explored the redevelopment of the speech interface of the Goal Management Training (GMT) system, based on results from testing different speech samples with the VoiceTester mobile application delivered as part of the project. VoiceTester was developed for the iPhone in this study to facilitate the listening task; by simulating the environment of speech interfaces on personal devices, it adds validity to the participants’ responses. The contribution of this study is to give developers and health researchers some insight into the impact of different types of speech interface on users’ perception. The findings are ultimately helpful to Traumatic Brain Injury (TBI) patients, because the recommended software will support them in undertaking activities and help prevent them from making errors (McPherson, Kayes, & Weatherall, 2009). Six participants from different age groups were chosen as three couples, each couple comprising both genders. The types of speech examined were computer-generated voice (CV), natural voice (NV), and familiar voice (FV).
The synthetic voices were generated by computer software, the natural speech samples were provided by two native speakers of New Zealand English, and the familiar voices for each couple were simply recordings of each other’s voices. Participants completed a paper-and-pencil self-perception of task performance scale three times, once after each listening test, followed by an interview. The evaluative data were used to inform the participants and the researcher about the study and to guide the interview process. The methods were largely qualitative: semi-structured interviews were used to explore users’ perceptions of the manner of speaking and of the speaker in the three examined speech samples, and to investigate the importance of the voice characteristics used. The interviews were analysed to discover themes and patterns related to an analysis framework derived from the literature review. The findings revealed differences between the three couples in their perceptions of the different types of speech. A slight gender effect was present, as the participants showed a more positive attitude towards the opposite gender. Both human voices, NV and FV, were acceptable to the majority of participants, with many reporting improved mood and goal attainment. Participants found working with CV both challenging and rewarding. NV seemed particularly helpful in engaging people in the task process, while FV appeared particularly helpful in providing a structured framework for error prevention in attempting goal performance.

Computer generated voice , Perception of TTS interfaces , Acceptability of Text-to-Speech , Interpretive, exploratory, qualitative research approaches , Methodological triangulation
