CBR: The Dangerous Art of Text Mining Chapter 12: Attacks on Environmentalists in Congress

Jo Guldi. The Dangerous Art of Text Mining: A Methodology for Digital History. Cambridge: Cambridge University Press, 2023.

Chapter Reviewed: Chapter 12: Attacks on Environmentalists in Congress

Review by: Sadahisa Watanabe

Chapter 12, titled “Attacks on Environmentalists in Congress,” presents a case study that demonstrates both the potential and the limitations of data-driven analysis in historical inquiry. In this chapter, Guldi examines the treatment of environmentalism in U.S. congressional debates, illustrating the shift in congressional discussions of climate change. The research corpus consists of transcripts of speeches by members of Congress recorded in the Congressional Record between 1970 and 2010.

Methodologically, Guldi combines word embedding with word frequency counts. Word embedding is a natural language processing technique that represents words as numerical vectors, learned from word co-occurrence patterns in a large text corpus. Because words that appear in similar contexts receive vectors that lie close together, a model can treat proximity in the vector space as a measure of semantic similarity. Using the word2vec model introduced by Tomas Mikolov and colleagues at Google in 2013, Guldi generates, for each five-year slice of the corpus, a list of the words and two-word phrases most similar to the target word “environmentalist”¹. After reviewing the word embedding output by hand, Guldi finds that “many of the words and phrases co-located with the keyword ‘environmentalist’ were terms of reproach” (pp. 362-363)².
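The idea that words used in similar contexts end up with nearby vectors can be illustrated with a minimal Python sketch. It is not word2vec and not Guldi's pipeline: it substitutes raw co-occurrence counts for learned embeddings, and the sentences, the window size, and the function names (`cooccurrence_vectors`, `cosine`, `most_similar`) are assumptions invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy sentences invented for illustration; NOT drawn from the Congressional Record.
SENTENCES = [
    "the radical environmentalist opposed the dam",
    "the extreme environmentalist opposed the mine",
    "the radical activist opposed the dam",
    "the senator praised the farmer",
    "the senator praised the rancher",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(target, vectors, topn=3):
    """Rank every other word by cosine similarity to `target`."""
    target_vec = vectors.get(target, Counter())
    scores = [
        (word, cosine(target_vec, vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


vectors = cooccurrence_vectors(SENTENCES)
for word, score in most_similar("environmentalist", vectors):
    print(f"{word}\t{score:.3f}")
```

In this toy corpus “activist” scores closer to “environmentalist” than “farmer” does, because the first two share the contexts “radical” and “opposed.” A trained word2vec model learns dense vectors rather than counting neighbors, but the resulting similarity lists are read in the same way.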

Humanities scholars have already used word embedding to capture conceptual change in large corpora³. What is distinctive about Guldi’s approach is that she pairs word embedding with word frequency to compensate for the “black box” character of embeddings. Guldi calls word embedding a “black box” in the sense that “the algorithms behind their analysis are only open to limited inspection and adjustment” (p. 366). By visualizing the raw counts of the terms similar to the key term “environmentalist,” Guldi demonstrates that the frequencies of phrases such as “radical environmentalist” and “extreme environmentalist” increased from 1995 to 1999.
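The frequency check described above is simple enough to sketch in a few lines of Python. The following is a hedged illustration only, not Guldi's code: the speeches are invented, and the names `period_start` and `bigram_counts_by_period`, along with the choice of 1970 as the origin of the five-year buckets, are assumptions.

```python
from collections import Counter

# Hypothetical (year, speech text) pairs invented for illustration.
SPEECHES = [
    (1991, "The environmentalist lobby met with farmers."),
    (1996, "A radical environmentalist agenda threatens jobs."),
    (1997, "Extreme environmentalist groups and radical environmentalist lawyers sued."),
    (1998, "The radical environmentalist movement opposes logging."),
]


def period_start(year, span=5, origin=1970):
    """Return the first year of the `span`-year bucket containing `year`."""
    return origin + ((year - origin) // span) * span


def bigram_counts_by_period(speeches, keyword="environmentalist"):
    """Count two-word phrases ending in `keyword`, grouped by five-year period."""
    counts = {}
    for year, text in speeches:
        # Crude tokenization: split on whitespace, trim punctuation, lowercase.
        tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
        tokens = [t for t in tokens if t]
        bucket = counts.setdefault(period_start(year), Counter())
        for first, second in zip(tokens, tokens[1:]):
            if second == keyword:
                bucket[f"{first} {second}"] += 1
    return counts


counts = bigram_counts_by_period(SPEECHES)
for period in sorted(counts):
    print(period, dict(counts[period]))
```

Unlike the embedding step, every number here can be traced back to specific speeches, which is what makes raw counts a useful check on a model whose internals are hard to inspect.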

Furthermore, Guldi analyzes how the frequencies of negative bigrams containing the word “environmentalist” shifted by speaker over the same time frame (p. 376). By tracing the history of U.S. environmental policy and the tactics of particular politicians, Guldi explains why the use of the term “environmentalist” changed over the years.

Overall, this chapter serves as a valuable research model for historians and data scientists alike, offering insights into the practical application of word embedding and frequency in historical analysis while emphasizing the continued importance of humanistic analysis and contextual understanding.

¹ word2vec offers two model architectures: continuous bag-of-words (CBOW) and skip-gram. For details, see Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781 (2013).
² Guldi notes that she reviews the word embedding output by hand, rather than using sentiment analysis algorithms, because she wanted to examine the terms “based not on pure sentimentality but on what I was learning about the gradual formation of an attack on environmentalists after 1970” (p. 363).
³ Previous work using word embedding to study conceptual shifts in large corpora includes Jaap Verheul et al., “Using word vector models to trace conceptual change over time and space in historical newspapers, 1840–1914,” Digital Humanities Quarterly 16, no. 2 (2022).