Language modeling for personality prediction
This dissertation addresses two broad questions. The first is a supervised learning problem: given text written by an individual, how much can be inferred about their personality? The second is more fundamental: what personality structure is embedded in modern language models? To address the first question, three language models are used to predict a range of traits from Facebook statuses: gender, religion, politics, Big Five personality, sensational interests, impulsiveness, IQ, fair-mindedness, and self-disclosure. Linguistic Inquiry and Word Count (Pennebaker et al., 2015), the dominant model in psychology, explains close to zero variance on many labels. Bag of Words performs well, and its model weights provide valuable insight into why predictions are made. Neural networks perform best by a wide margin on personality traits, especially when few training samples are available. A pretrained personality model is made available online that can explain 10% of the variance of a trait with as few as 400 samples, a sample size within the range of typical psychology studies. It is a good replacement for Linguistic Inquiry and Word Count in predictive settings. In psychology, personality structure is defined by dimensionality reduction of word vectors (Goldberg, 1993). To address the second question, factor analysis is performed on embeddings of personality words produced by the language model RoBERTa (Liu et al., 2019). This recovers two factors that resemble Digman’s α and β (Digman, 1997) rather than the more popular Big Five. The structure is shown to be robust to the choice of context around an embedded word, language model, factorization method, word set, and language (English vs. Spanish). The method is a flexible tool for exploring personality structure that can easily be applied to other languages.
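As a rough illustration of two ideas in the abstract, the sketch below shows how a Bag of Words representation feeds a linear model whose weights can be read directly, and how "variance explained" (R²) is computed. It is a minimal sketch, not the dissertation's implementation: the vocabulary, weights, and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def predict(features, weights, bias=0.0):
    """Linear prediction: each weight shows how one word moves the score."""
    return bias + sum(f * w for f, w in zip(features, weights))


def r_squared(y_true, y_pred):
    """Fraction of variance in y_true explained by y_pred."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical vocabulary and weights, made up for illustration only.
vocabulary = ["party", "friends", "alone"]
weights = [0.5, 0.3, -0.4]

features = bag_of_words("Party with friends party", vocabulary)
score = predict(features, weights)
```

Under this reading, a model that "explains 10% of the variance" of a trait is one whose predictions give `r_squared` of about 0.10 on held-out data, and inspecting `weights` alongside `vocabulary` is what makes a Bag of Words model easy to interpret.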