CBR: The Dangerous Art of Text Mining, Chapter 1: Why Textual Data from the Past Is Dangerous

Jo Guldi. The Dangerous Art of Text Mining: A Methodology for Digital History. Cambridge: Cambridge University Press, 2023.

Chapter Reviewed: Chapter 1: Why Textual Data from the Past Is Dangerous

Review by: Jessica Corona.

In The Dangerous Art of Text Mining, Jo Guldi emphasizes the urgent need for critical approaches to text mining, especially when engaging with historical data that has been shaped by omissions, biases, and ideological distortions. Chapter 1 examines how historical records, particularly those digitized for computational analysis, often embed systemic prejudices that can distort our understanding of the past. This matters because archives have historically been constructed in ways that reflect dominant ideologies and power structures, which determine what gets recorded and how. Guldi questions the reliability of historical data not only because of what is missing, but also because of the ideological frameworks within which the data was created. For instance, historical records often diminish women’s agency, framing them in relation to moral panics, scandals, or institutional control rather than as active historical agents. These distortions complicate large-scale analysis, where the frequency and visibility of some voices can obscure entrenched patterns of exclusion.

Another major concern raised in the chapter is how language encodes gender and social bias. Guldi notes that historical texts often use patriarchal terminology that reinforces stereotypes, failing to reflect the lived experiences of women and other marginalized people. Computational methods, such as sentiment analysis or topic modeling, risk replicating these patterns if they prioritize word frequency without interrogating context. For example, texts may associate women primarily with terms like “virtue” or “modesty,” while ignoring narratives of leadership or resistance. Guldi argues that text mining privileges the most visible figures—political leaders, lawmakers, and elites—while ignoring individuals excluded from the official record. This raises a critical challenge: how can computational methods uncover the presence and contributions of those whose voices were deliberately erased? Guldi calls for interdisciplinary collaboration between historians and digital humanists to question sources, analyze bias, and recover silenced narratives.
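The risk of prioritizing word frequency over context can be illustrated with a minimal sketch. The three-sentence corpus below is invented purely for illustration and is not drawn from Guldi's book; the function names (tokenize, context_counts, kwic) and the window size are likewise assumptions. The sketch counts which words appear near "women" or "woman" and then prints each occurrence in its surrounding context, which is roughly what a keyword-in-context reading adds to a raw count.

```python
import re
from collections import Counter

# Hypothetical toy corpus: sentences invented for illustration only.
CORPUS = [
    "The committee praised the virtue and modesty of the women.",
    "The women organized a strike and led the march to the town hall.",
    "A woman of modesty and virtue was commended by the magistrate.",
]

# A small, hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "and", "of", "to", "by", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_counts(docs, targets, window=5):
    """Count non-stopword tokens within `window` tokens of any target word."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok in targets:
                lo = max(0, i - window)
                neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
                counts.update(
                    t for t in neighbours
                    if t not in STOPWORDS and t not in targets
                )
    return counts


def kwic(docs, targets, window=5):
    """Return each target occurrence with its left and right context."""
    lines = []
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok in targets:
                lo = max(0, i - window)
                left = " ".join(tokens[lo:i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines


if __name__ == "__main__":
    targets = {"women", "woman"}
    # Frequency view: the most common neighbouring words.
    print(context_counts(CORPUS, targets).most_common())
    # Context view: every occurrence, including the less frequent ones.
    for line in kwic(CORPUS, targets):
        print(line)
```

In this toy case the frequency view ranks "virtue" and "modesty" highest, while the sentence about organizing a strike contributes only words that each appear once. Reading the occurrences in context keeps that sentence visible, though on a real archive the interpretive work of judging such passages would still fall to the historian.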

A key concept in this discussion is archival silences—gaps in the historical record that suppress testimonies and reinforce dominant narratives. This is particularly evident in cases of gender-based violence, where institutional records often frame women as passive victims and omit their voices. Feminist researchers and digital humanists must respond to these omissions by seeking alternative sources such as oral histories and community archives. Scholars like Miriam Posner and Lauren Klein emphasize that data is never neutral; it is shaped by political, cultural, and historical contexts. One strategy to address these challenges is metadata activism, which restructures data categories to reflect marginalized experiences. For instance, explicitly labeling feminicidio—the gender-based killing of women—in digital archives helps counteract the historical erasure of gendered violence.
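As a rough illustration of what restructuring data categories might look like in practice, the sketch below adds a reviewer-assigned label to a record while preserving the category assigned by the originating institution. Everything here is hypothetical: the ArchiveRecord structure, the field names, the record identifiers, and the example categories are assumptions made for illustration, not a description of any existing archive or of a method proposed by Guldi. The decision to apply a label is assumed to be made by a human reviewer; the code only records it.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveRecord:
    """Hypothetical catalog entry; all field names are illustrative."""
    record_id: str
    original_category: str  # category assigned by the originating institution
    added_labels: list = field(default_factory=list)  # labels added on review


def add_reviewed_label(record, label, reviewer):
    """Attach a reviewer-assigned label without overwriting the original."""
    entry = {"label": label, "reviewer": reviewer}
    if entry not in record.added_labels:  # avoid duplicate entries
        record.added_labels.append(entry)
    return record


def find_by_label(records, label):
    """Return the records that carry the given reviewer-assigned label."""
    return [
        r for r in records
        if any(e["label"] == label for e in r.added_labels)
    ]


if __name__ == "__main__":
    # Invented records for illustration only.
    records = [
        ArchiveRecord("rec-001", "domestic incident"),
        ArchiveRecord("rec-002", "theft"),
    ]
    add_reviewed_label(records[0], "feminicidio", reviewer="reviewer-A")
    for r in find_by_label(records, "feminicidio"):
        print(r.record_id, "|", r.original_category, "|", r.added_labels)
```

Keeping the original category alongside the added label is a deliberate choice in this sketch: the institutional framing remains available as evidence of how the record was produced, while the new label makes the case retrievable under a term the original catalog did not use.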

Guldi’s critique of “dirty data” aligns with broader concerns about how knowledge is produced and who is granted visibility in historical narratives. While feminist scholars have long worked to recover women’s silenced voices, archival silences also affect other marginalized groups, such as racialized communities and the working class. If text mining is to be a meaningful tool for historical analysis, it must not only identify patterns but also interrogate the origins and biases of the data itself. A critical approach to text mining must go beyond diagnosing inequalities: it must actively work to recover, document, and amplify the voices that have been systematically excluded from the archive.