Dissertation Title: Voice-Centered AI Systems for Analyzing and Exploring Spoken Discourse
Abstract:
Machine learning and artificial intelligence (AI) systems offer significant potential for enhancing human-to-human communication. Speech is a particularly noteworthy medium in this context; spoken language is ubiquitous in our day-to-day lives, and it carries rich paralinguistic information that extends beyond the literal words spoken. This dissertation presents work that leverages AI to support the creation of, and extend the capabilities of, "conversation networks": civic communication infrastructure grounded in spoken discourse and aimed at bridging, listening, and deliberation. In particular, we introduce novel applications of voice-based AI systems and machine learning models designed to foster broader participation in such discourse and to aid its analysis at scale.
In the first part of the dissertation, we explore the use of voice anonymization methods in civic conversations. We investigate AI speech transformation and synthesis approaches for anonymization and study their impact on listeners' feelings of empathy and trust towards speakers, as well as speakers' own perceptions of anonymity, agency, and being heard within civic processes. This work demonstrates how speech-based AI technologies can be used to encourage broader and safer participation in civic discourse.
The second part of the dissertation proposes machine learning models for analyzing and exploring spoken language. We present a method for augmenting large language models (LLMs) with speech-understanding capabilities and apply it to speech summarization, enabling the organization of spoken content directly from audio, without the need for transcription. Recognizing the importance of expressivity in the human voice, we further extend this system to capture paralinguistic as well as semantic information. We then introduce a model for expressive speech retrieval that enables querying for spoken content based on descriptions of speaking style. This expands the utility of information retrieval systems for speech corpora by enabling the exploration of emotionally significant or powerful moments.
In the final part of the dissertation, we introduce an LLM-based system for spoken highlight detection, the task of identifying semantically or emotionally substantive moments in conversations. We study how the system's behavior changes depending on the balance of textual and acoustically derived expressive information it is provided, with implications for tailoring the system to the specific requirements of different domains. Finally, we adapt this system into an interactive tool that enables qualitative researchers to efficiently locate salient content within spoken corpora for further analysis.
Altogether, this dissertation aims to advance the role of machine learning and AI in understanding and facilitating the analysis of spoken communication. By developing systems that transform and process speech in meaningful ways, we contribute novel methods for studying conversational data at scale and tools for amplifying diverse voices within civic and social discourse.
Committee members:
Deb Roy
Professor of Media Arts and Sciences
Massachusetts Institute of Technology
James Glass
Senior Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Elena Glassman
Assistant Professor of Computer Science
Harvard John A. Paulson School of Engineering & Applied Sciences