Developing a Language Model for Analyzing German-Language Tweets with a Focus on Gender-Inclusive Language
Data-Method-Monitoring Cluster
Project head: Dr. Jannes Jacobsen
Project team members: Long Nguyen
This research project develops and applies a novel BERT-based language model specifically designed to analyze discourse on gender-inclusive language (GIL) on German-language Twitter.
The first phase of the project is dedicated to methodological development: we train a BERT model from scratch. Our goal is to capture the semantics of language as it is used in tweets, including linguistic elements such as emojis and hashtags, which are common in everyday online communication. Existing pretrained models are typically trained on longer texts such as Wikipedia articles or news reports and do not adequately represent informal social media language, particularly in the German-speaking context. To address this gap, we draw on an extensive corpus of German-language Twitter data collected in collaboration with Tsolak et al. at Bielefeld University, spanning September 2018 to March 2023 and comprising 2 billion tweets. The outcomes of this phase are a technical preprint on arXiv detailing the model training process and the public release of the model on Hugging Face, making it broadly accessible to the research community.
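To illustrate what keeping emojis and hashtags as meaningful units can look like before tokenizer training, here is a minimal sketch using only the Python standard library. It is not the project's actual preprocessing pipeline, which is not described here. The placeholder tokens `[URL]` and `[USER]` and the function names `normalize_tweet` and `pre_tokenize` are assumptions made for illustration.

```python
import re

# Assumed placeholder tokens; the project's real preprocessing may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")

# Order matters: placeholders and hashtags are tried before plain words,
# and any remaining non-space symbol (including a single-code-point emoji)
# becomes its own token.
TOKEN_RE = re.compile(r"\[URL\]|\[USER\]|#\w+|\w+|[^\w\s]")


def normalize_tweet(text: str) -> str:
    """Mask URLs and user mentions, and collapse runs of whitespace."""
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("[USER]", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def pre_tokenize(text: str) -> list[str]:
    """Split a normalized tweet into coarse units, keeping hashtags whole."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    raw = "Hallo @anna  schau mal https://t.co/abc #Gendern 🌈"
    print(pre_tokenize(normalize_tweet(raw)))
```

This sketch treats each emoji code point as a separate token, so multi-code-point emoji sequences would be split. A subword tokenizer trained on tweets would then operate on units like these instead of discarding them.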
The second phase of the project shifts to content analysis, investigating the prevalence of gender-inclusive language as well as perceptions of and attitudes toward it. We examine how GIL is used and conduct hypothesis-driven tests at both the aggregate and the individual level. This includes analyzing regional and spatial variation in GIL usage in relation to demographic and socio-economic factors. Additionally, we explore correlations between GIL usage and users' stances on political and social issues such as migration and integration.
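As a rough illustration of an aggregate-level measure, the sketch below flags common orthographic GIL markers (gender star, colon, underscore, and internal capital I, as in "Kolleg*innen", "Student:innen", "Lehrer_innen", "LehrerInnen") with a regular expression and computes the share of tweets per region that contain at least one. This is a simple rule-based heuristic used for illustration, not the project's BERT-based method. The names `GIL_RE`, `find_gil_forms`, and `gil_share_by_region`, and the `(region, text)` input format, are assumptions.

```python
import re
from collections import defaultdict
from typing import Iterable

# Heuristic pattern (an assumption, not the project's classifier):
# a word ending in a lowercase letter, followed by *in/:in/_in or a
# capital-I "In", optionally extended to the plural "nen".
GIL_RE = re.compile(r"\b\w*[a-zäöüß](?:[*:_]in(?:nen)?|In(?:nen)?)\b")


def find_gil_forms(text: str) -> list[str]:
    """Return all substrings that look like gender-inclusive forms."""
    return GIL_RE.findall(text)


def gil_share_by_region(tweets: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Share of tweets per region with at least one GIL marker.

    `tweets` is an iterable of (region, text) pairs; the region label
    is assumed to come from some upstream geolocation step.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for region, text in tweets:
        totals[region] += 1
        if GIL_RE.search(text):
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}


if __name__ == "__main__":
    sample = [
        ("Berlin", "Liebe Kolleg*innen und Student:innen"),
        ("Berlin", "Guten Morgen"),
        ("Bayern", "Die LehrerInnen streiken"),
        ("Bayern", "Termin: in Berlin"),
    ]
    print(gil_share_by_region(sample))
```

A pattern like this will miss other inclusive strategies, such as neutral participles ("Studierende") or paired forms ("Lehrerinnen und Lehrer"), and can produce false positives, which is one reason a model trained on tweet language is attractive for this task.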
By combining advancements in natural language processing (NLP) with social and political analysis, this research contributes both to the methodological development of German-language NLP models and to a deeper understanding of gender-inclusive language dynamics within online discourse in the German-speaking community.
Funding: Federal Ministry for Education, Family Affairs, Senior Citizens, Women and Youth (Institutional funding)