From phonemes to syllables: a multi-scale behavioral and modeling investigation of speech chunking
Abstract
It is widely understood that fluent speech relies on retrieving pre-assembled motor “chunks,” yet the grain size(s) of these chunks, and how different brain areas exploit them, remain unsettled. This dissertation tackles the problem first with a traditional functional magnetic resonance imaging (fMRI) contrast analysis and then with a computational modeling approach. In the first study, twenty-four adults produced three syllable types in a magnetic resonance imaging (MRI) scanner: fully learned syllables (FL), syllables whose consonant clusters had been learned but within a different syllable (LC), and novel syllables (N). Behavioral testing revealed a graded benefit, FL > LC > N, in speech speed and accuracy, whereas reaction-time gains did not differ across conditions. Whole-brain and region-of-interest (ROI) fMRI contrasts (N – FL, N – LC, LC – FL) showed that left ventral and middle premotor cortex (vPMC/midPMC), posterior inferior frontal sulcus (pIFS), pre-supplementary motor area (preSMA), and anterior insula (aINS) displayed the same FL < LC < N activity pattern, indicating reliance on both consonant-cluster and whole-syllable chunks. In contrast, areas such as the intraparietal sulcus (IPS) and the visual word form area (VWFA) differentiated only LC from FL, implicating syllable-sized chunking in these regions. Regions such as ventral motor cortex (vMC) and supplementary motor area (SMA) showed no significant differences in any of the fMRI contrasts, suggesting exclusive reliance on phoneme-level representations. Furthermore, regions of the default-mode network (DMN) displayed decreased activity with increasing task difficulty. Together, these results suggest a flexible, multi-scale chunking architecture.
To investigate these mechanistic interpretations quantitatively, we developed an open-source framework that fits any cognitive model, specified by node coordinates and computational-load functions, to whole-brain neuroimaging data. Model activity is generated via Gaussian kernels whose mean and standard deviation parameters are optimized by gradient descent, and competing models are compared using the Akaike Information Criterion (AIC). Validation on a working memory dataset confirmed the tool’s ability to rank competing models, prune superfluous nodes, and propose model refinements.
To apply this framework to the fMRI results, we used the speech ROIs from the first study as model nodes and focused the analysis on the decrease in each ROI’s activity after cluster learning and syllable learning. For simplicity, we termed the activity decreases in the N – LC and LC – FL contrasts the computational benefits of cluster learning and syllable learning, respectively. Fitting the model to the fMRI contrasts revealed that the VWFA benefits three times more from syllable learning than from cluster learning, indicating that this region uses syllable-sized chunks when available. Similarly, left IPS, preSMA, aINS, and the superior temporal gyrus (STG) benefit twice as much from syllable learning as from cluster learning. Left pIFS and vPMC nodes benefited equally from cluster and syllable learning. By integrating behavioral, neural, and computational evidence, this dissertation demonstrates that speech sequencing flexibly recruits both sub-syllabic and syllabic chunks, and it provides a quantitative pathway for embedding such insights into next-generation models of speech production.
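To make the modeling approach concrete, the following is a minimal Python (NumPy) sketch of the general fitting idea the abstract describes: node-based activity predictions built from Gaussian kernels, kernel means and standard deviations tuned by gradient descent, and model comparison via AIC. It is not the dissertation's released code; all function names (gaussian_map, fit_model, aic), the voxelwise formulation, the numerical-gradient implementation, and the synthetic data are illustrative assumptions.

import numpy as np

def gaussian_map(coords, centers, sigmas, loads):
    # Predicted activity at each voxel coordinate: a sum of isotropic Gaussian
    # kernels, one per model node, scaled by that node's computational load.
    pred = np.zeros(len(coords))
    for c, s, w in zip(centers, sigmas, loads):
        d2 = np.sum((coords - c) ** 2, axis=1)
        pred += w * np.exp(-d2 / (2.0 * s ** 2))
    return pred

def fit_model(coords, data, node_xyz, loads, n_steps=1500, lr=1e-2):
    # Optimize each kernel's mean (center) and standard deviation by plain
    # gradient descent on mean squared error; numerical gradients keep it short.
    k = len(node_xyz)
    params = np.concatenate([np.asarray(node_xyz, float).ravel(), np.full(k, 8.0)])

    def loss(p):
        centers, sigmas = p[:3 * k].reshape(k, 3), p[3 * k:]
        return np.mean((data - gaussian_map(coords, centers, sigmas, loads)) ** 2)

    eps = 1e-3
    for _ in range(n_steps):
        grad = np.array([(loss(params + eps * e) - loss(params - eps * e)) / (2 * eps)
                         for e in np.eye(len(params))])
        params -= lr * grad
    return params, loss(params)

def aic(mse, n_obs, n_params):
    # Akaike Information Criterion under a Gaussian-noise likelihood (up to a constant).
    return n_obs * np.log(mse) + 2 * n_params

# Toy comparison of two hypothetical node models on synthetic "contrast" data.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 60, size=(500, 3))                      # voxel coordinates (mm)
truth = gaussian_map(coords, np.array([[20., 30., 25.]]), [6.0], [1.0])
data = truth + 0.05 * rng.standard_normal(len(coords))

candidates = {
    "one-node model": (np.array([[20., 30., 25.]]), [1.0]),
    "two-node model": (np.array([[20., 30., 25.], [45., 10., 40.]]), [1.0, 0.5]),
}
for name, (nodes, loads) in candidates.items():
    fitted, mse = fit_model(coords, data, nodes, loads)
    print(f"{name}: AIC = {aic(mse, len(coords), len(fitted)):.1f}")

In this sketch, a lower AIC favors the model that explains the simulated contrast map best after penalizing its additional nodes and kernel parameters, which is the sense in which the framework can rank competing models and prune superfluous nodes.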
Description
2025