Developing a Language Model for Analyzing German-Language Tweets with a Focus on Gender-Inclusive Language

Data-Method-Monitoring Cluster

Project head: Dr. Jannes Jacobsen

Project team members: Long Nguyen

Duration: January 2025 until December 2026
Status: Current project

The project develops and applies a language model specifically tailored to German-language tweets in order to empirically investigate gender-inclusive language in social media discourse.

Guiding research questions

  • How can a language model be specifically adapted to the informal, dynamic language of social media, particularly in the context of gender-inclusive language?
  • Which social and regional dynamics shape the spread and acceptance of gender-inclusive language on German-language Twitter?
  • How do network structures and thematic contexts influence the diffusion and rejection of gender-inclusive language?

"The increase in gender-inclusive language is not only a linguistic phenomenon but also a social one."
Dr. Anica Waldendorf, Oxford University

The research project develops specialized tools in natural language processing (NLP) to analyze everyday communication in German-language online social networks, with gender-inclusive language discourse on German-language Twitter as the central use case.

The methodological focus lies on the systematic benchmarking and fine-tuning of powerful pre-trained German or multilingual language models to efficiently and reliably capture the linguistic patterns of German social media communication. The starting point is a large German-language Twitter dataset from 2018 to 2023.
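Benchmarking candidate models can be pictured as scoring each model's predictions against the same hand-labeled tweets. The following is a minimal sketch, assuming binary labels (1 = tweet contains a gender-inclusive form) and macro-averaged F1 as the metric. The metric choice, the function names, and the model names are illustrative assumptions, not the project's evaluation protocol.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    pairs = list(zip(gold, predicted))
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def rank_models(gold, predictions_by_model):
    """Return (model name, macro-F1) pairs, best model first."""
    scores = {
        name: macro_f1(gold, preds)
        for name, preds in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical gold labels and model outputs, for illustration only.
gold_labels = [1, 1, 0, 0]
ranking = rank_models(
    gold_labels,
    {"model_a": [1, 0, 0, 0], "model_b": [1, 1, 0, 0]},
)
```

Macro averaging weights both classes equally, which matters when tweets with gender-inclusive forms are much rarer than tweets without them.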

Substantively, the project examines the discursive spread of gender-inclusive language and its relationship to social, regional, and political factors. Network-based and temporal approaches are used to understand how linguistic practices spread among users and how thematic contexts moderate the diffusion mechanisms of gender-inclusive forms. The project thus combines methodological innovation in language modeling with sociological research on public language debates.

  • Lack of NLP models specifically trained on informal German online language that enable reliable analysis of phenomena such as gender-inclusive language
  • Lack of empirical research on the everyday use of gender-inclusive language beyond institutional communication
  • Insufficient knowledge about network effects and diffusion mechanisms of linguistic innovations in social media

The project aims to develop and evaluate a language-model-based classifier for identifying and analyzing gender-inclusive language, and to investigate the temporal, thematic, and social distribution of such language in digital spaces. Methodologically, the goal is to create a flexible model that can also be applied in other research areas.
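To make the classification task concrete, the sketch below shows a simple rule-based baseline of the kind a learned classifier would be compared against. It is not the project's classifier. The patterns and the function name are illustrative assumptions covering only common surface forms: gender star, colon, underscore, slash, and Binnen-I.

```python
import re

# Illustrative patterns only; a hypothetical baseline, not the project's method.
# Separator forms: Student*innen, Lehrer:innen, Kolleg_innen, Bürger/-innen
SEPARATOR_FORM = re.compile(r"\b\w+(?:[*:_]|/-?)in(?:nen)?\b", re.IGNORECASE)
# Binnen-I: StudentInnen, LehrerIn (capital I inside a capitalized word)
BINNEN_I = re.compile(r"\b[A-ZÄÖÜ][a-zäöüß]+In(?:nen)?\b")


def find_gender_inclusive_forms(text):
    """Return matching surface forms in order of appearance.

    URLs should be removed beforehand, because a path such as
    "example.com/in" would match the slash pattern.
    """
    matches = [(m.start(), m.group()) for m in SEPARATOR_FORM.finditer(text)]
    matches += [(m.start(), m.group()) for m in BINNEN_I.finditer(text)]
    return [form for _, form in sorted(matches)]


forms = find_gender_inclusive_forms(
    "Die StudentInnen und Kolleg_innen treffen Bürger/-innen"
)
```

Rules like these miss neutral alternatives such as "Studierende" and produce false positives: the Binnen-I pattern also matches the brand name "LinkedIn". Limitations of this kind are one reason to prefer a classifier that takes context into account.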

Empirically, the project seeks to analyze the acceptance, spread, and context-dependence of gender-inclusive language in digital discourse and to expand knowledge about the social diffusion and rejection of innovative linguistic practices.

The approach begins with the construction and preprocessing of a large German Twitter corpus. Based on this, existing general-purpose language models are benchmarked and fine-tuned to capture the specific features of gender-inclusive language and the platform.
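A preprocessing step for such a corpus might look like the following minimal sketch. It assumes a normalization in the style described for BERTweet (Nguyen et al., 2020), where user mentions and links are replaced by placeholder tokens. Whether the project uses these exact placeholders is an assumption, and the function name is hypothetical.

```python
import html
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
WHITESPACE = re.compile(r"\s+")


def normalize_tweet(text):
    """Normalize a raw tweet while leaving gender markers (*, :, _) intact."""
    text = html.unescape(text)         # decode entities such as "&amp;"
    text = URL.sub("HTTPURL", text)    # replace links with a placeholder
    text = MENTION.sub("@USER", text)  # replace user handles with a placeholder
    return WHITESPACE.sub(" ", text).strip()


cleaned = normalize_tweet(
    "@anna  Liebe Kolleg:innen &amp; Freund*innen,\nsiehe https://example.org/x"
)
```

The sketch deliberately avoids stripping punctuation inside words, because characters such as the asterisk and the colon carry the gender-inclusive marking the project sets out to analyze.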

This is followed by a quantitative, network-based analysis of patterns and diffusion mechanisms of gender-inclusive language, taking into account temporal trends, thematic contexts, and social structures. Finally, the developed methods will be made publicly available to the research community.
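One simple quantity a network-based diffusion analysis can start from is prior exposure: the share of a user's contacts who used a gender-inclusive form before the user did. The sketch below is an illustration under stated assumptions, not the project's analysis. The data structures (an adjacency mapping and a mapping from users to the time of their first use) and the function name are hypothetical.

```python
def prior_exposure(neighbors, first_use):
    """For each adopter, the share of contacts who adopted strictly earlier.

    neighbors: dict mapping a user to the set of users they are connected to
    first_use: dict mapping adopters to the time of their first inclusive form
    """
    exposure = {}
    for user, adopted_at in first_use.items():
        contacts = neighbors.get(user, set())
        if not contacts:
            continue  # exposure is undefined for users without contacts
        earlier = sum(
            1 for c in contacts
            if c in first_use and first_use[c] < adopted_at
        )
        exposure[user] = earlier / len(contacts)
    return exposure


# Toy network for illustration only.
toy_neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
toy_first_use = {"b": 1, "a": 2, "d": 3}
toy_exposure = prior_exposure(toy_neighbors, toy_first_use)
```

In the toy network, user "a" adopts after one of two contacts, giving an exposure of 0.5, while user "b" adopts before any contact and has an exposure of 0.0. Comparing such values between adopters and non-adopters, or across thematic contexts, is one possible entry point into the diffusion questions raised above.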

An initial study shows temporal and regional differences in the use of gender-inclusive language on Twitter and indicates correlations with demographic, socioeconomic, and political characteristics.

  • Adler, A., & Hansen, K. (2020). Studenten, StudentInnen, Studierende? Aktuelle Verwendungspräferenzen bei Personenbezeichnungen. Muttersprache. Themenheft "Sprache und Geschlecht". Beiträge zur Gender-Debatte, 130(1), 47–63.
  • Chan, B., Schweter, S., & Möller, T. (2020). German’s next language model. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th international conference on computational linguistics (pp. 6788–6796). International Committee on Computational Linguistics. doi.org/10.18653/v1/2020.coling-main.598
  • Dargiewicz, A. (2021). Verstärkung (m/w/d) gesucht. Zur Geschlechtsneutralität in den gegenwärtigen deutschen Stellenanzeigen. Acta Neophilologica, 1(XXIII), 123–140. 
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (No. arXiv:1810.04805). arXiv. doi.org/10.48550/arXiv.1810.04805 
  • Krome, S. (2021). Gendern zwischen Sprachpolitik, orthografischer Norm, Sprach- und Schreibgebrauch. Bestandsaufnahme und orthografische Perspektiven zu einem umstrittenen Thema. Sprachreport, 37(2), 22–29.
  • Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets (No. arXiv:2005.10200). arXiv. doi.org/10.48550/arXiv.2005.10200 
  • Nguyen, H. L., Tsolak, D., Karmann, A., Knauff, S., & Kühne, S. (2022). Efficient and reliable geocoding of German Twitter data to enable spatial data linkage to official statistics and other data sources. Frontiers in Sociology, 7, 910111. 
  • Scheible, R., Frei, J., Thomczyk, F., He, H., Tippmann, P., Knaus, J., Jaravine, V., Kramer, F., & Boeker, M. (2024). GottBERT: A pure German Language Model. In Y. Al-Onaizan, M. Bansal, & Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 21237–21250). Association for Computational Linguistics. doi.org/10.18653/v1/2024.emnlp-main.1183
  • Tsolak, D., Knauff, S., Kühne, S., & Nguyen, H. L. (2023). X-GTA: The Cross-Topic German Twitter Archive. osf.io/preprints/socarxiv/9tbd4/ 
  • Waldendorf, A. (2024). Words of change: The increase of gender-inclusive language in German media. European Sociological Review, 40(2), 357–374. doi.org/10.1093/esr/jcad044 
  • Zhang, X., Malkov, Y., Florez, O., Park, S., McWilliams, B., Han, J., & El-Kishky, A. (2023). TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter (No. arXiv:2209.07562). arXiv. doi.org/10.48550/arXiv.2209.07562

Funding: Federal Ministry for Education, Family Affairs, Senior Citizens, Women and Youth (Institutional funding)

Cooperation partners:

The project runs from January 1, 2025, to December 31, 2026. Cooperation partners include former colleagues from Bielefeld University (Institute for Interdisciplinary Research on Conflict and Violence), particularly within the framework of the FoDiRa preliminary work.

For the second substantive study, there is a collaboration with Dr. Anica Waldendorf (Nuffield College, University of Oxford).