Chapter 3 focuses on the pivotal role of data, which the author explains is the driving force of AI today. In this chapter, Crawford critiques the extraction, abstraction, and use of data in AI systems (mainly computer vision systems), highlighting the complexities and challenges that come with the widespread use of AI systems that rely heavily on data sources for their functionality. The chapter explores how data shapes power dynamics, influences decision-making processes, and raises ethical concerns regarding privacy, bias, and surveillance in the digital age.
Crawford begins the chapter with NIST's mugshot datasets, explaining the transformation of mugshots from a means of identification into mere data used to train facial recognition systems. The people pictured and their families have no say in the use of these images and most likely do not know they are being used for AI systems. Beyond the issue of privacy, this dataset introduces the issue of data abstraction. It underscores the shift of personal data from being seen as individual material to being treated as mere infrastructure, where the specific context or meaning of an image or video is often disregarded. In the book, Crawford states, “The personal, the social, and the political meanings are all imagined to be neutralized,” demonstrating a core ideology of data extraction: that this important contextual information is now treated as just data for technological innovation.
Before exploring the issues associated with data and its abstraction, it helps to understand why large training datasets are needed in the first place. The author explains that training data serves as the foundational material that AI systems use to make inferences and predictions about the world. The more labelled examples available for training, the better the algorithm becomes at producing precise predictions. Therefore, AI systems require a substantial amount of data. This heavy reliance on data has led to phrases like “data is the new oil”. Just as industries thrive on abundant oil reserves, AI systems thrive on vast quantities of labelled data for training.
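To make this relationship concrete, here is a minimal sketch (my own illustration, not from the book) that trains the same classifier on progressively larger slices of labelled data. The dataset (scikit-learn's bundled digits) and the model are arbitrary choices, but accuracy typically climbs as the number of labelled examples grows:

```python
# Illustrative sketch: train the same classifier on growing subsets of
# labelled data and compare test accuracy. Dataset and model choices are
# mine, not Crawford's.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for n in (50, 200, 800, len(X_train)):
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_train[:n], y_train[:n])  # train on the first n labelled examples
    print(f"{n:5d} training examples -> test accuracy "
          f"{clf.score(X_test, y_test):.3f}")
```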
Crawford goes further to explain that this data (images, videos, etc.) is “anything but neutral”. It carries crucial information and knowledge boundaries that ultimately govern how AI systems see the world. An example given is the collection of text from sources such as Reddit versus the Enron email corpus for language systems. Sentences from these sources differ considerably from one another, introducing skews and biases that are then built into the system. This shows that the context and origin of data matter, and data is hence not neutral.
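A toy sketch of this point, using made-up stand-in sentences rather than the actual Reddit or Enron corpora: the same simple bigram model, trained on two different text sources, completes the same prompt differently, because its “knowledge” is just the statistics of whatever text it was fed.

```python
# Toy illustration (not from the book): the source corpus determines what
# a language model "knows". The same bigram model trained on two different
# text samples completes the same prompt differently.
from collections import Counter, defaultdict

def bigram_model(text):
    words = text.lower().split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

# Invented stand-ins for "Reddit-like" vs "Enron-like" training text.
casual = "the meeting was lol honestly the meeting was a meme"
corporate = "the meeting was rescheduled per the meeting agenda attached"

for name, corpus in [("casual", casual), ("corporate", corporate)]:
    model = bigram_model(corpus)
    prediction = model["was"].most_common(1)[0][0]
    print(f"{name}: after 'was' the model predicts '{prediction}'")
```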
The chapter also delves into the process of collecting and organizing data from the internet to create training datasets, focusing on the ImageNet project as a significant example. The dataset was designed to map out the world of objects by extracting millions of images, primarily from search engines, forming a large-scale ontology of images for machine learning applications. The process of gathering and labelling data from the internet shapes AI models and algorithms, affecting how artificial intelligence operates in the world and which communities are most affected, potentially leading to exploitative outcomes through biased interpretations and applications of AI systems. ImageNet itself illustrates this through the series of offensive labels applied to images of people, such as “alcoholic” and “hooker”.
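ImageNet's categories are drawn from WordNet's noun hierarchy, which is what makes it an “ontology” of images; the offensive person-labels Crawford describes were inherited from that hierarchy. The short sketch below (my own illustration, using NLTK's WordNet interface and “dog” as an arbitrary example) prints the chain of increasingly general categories a single label sits under:

```python
# ImageNet's labels come from WordNet's noun hierarchy. This sketch
# (requires: pip install nltk, then nltk.download('wordnet')) walks one
# label's path from the most general category down to the label itself.
from nltk.corpus import wordnet as wn

synset = wn.synset("dog.n.01")  # an arbitrary example label
# hypernym_paths() returns chains from the root ("entity") to this node.
path = synset.hypernym_paths()[0]
print(" -> ".join(s.lemmas()[0].name() for s in path))
# e.g. entity -> physical_entity -> ... -> animal -> ... -> dog
```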
Additionally, connecting to Chapter 2, which focuses on labour: data plays a crucial role in the unseen labour (for example, the Amazon Mechanical Turk workers used to label ImageNet) required to build and maintain AI systems, involving underpaid workers in different forms. This hidden labour involves tasks like labelling training data and reviewing content essential for AI systems to function, yet these workers are often poorly compensated despite their essential contributions. Another issue highlighted by Crawford is that data is a form of capital that influences the distribution of advantages and disadvantages across markets. High achievers in the mainstream economy tend to benefit from data-scoring economies, while the poorest individuals become targets of harmful data surveillance and extraction.
Crawford further explains how, during data extraction, little regard is given to the consent and privacy of the people whose data ends up in training datasets. An example is the DukeMTMC project, in which footage of over 2,000 students was secretly captured for a facial recognition training dataset. One justification for this unconsented form of data collection is that the data is anonymized before release. However, Crawford demonstrates that such seemingly anonymized datasets can contain unexpected and highly personal forms of information. She demonstrates this using the New York City Taxi and Limousine Commission dataset, whose records were quickly de-anonymized, leading to the identification of sensitive information like incomes and home addresses.
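The taxi-data failure has a simple technical core. The released records reportedly hashed medallion numbers with MD5, but because the space of valid medallion numbers is tiny and structured, anyone can hash every candidate and reverse the “anonymization”. The sketch below uses an invented four-digit ID format to keep it short; the real medallion format differs, but the attack is the same:

```python
# Simplified sketch (the ID format here is invented, not the actual NYC
# schema) of why hashing a small, structured identifier space is not real
# anonymization: hash every possible ID and build a reverse lookup table.
import hashlib

def md5_hex(s):
    return hashlib.md5(s.encode()).hexdigest()

# Pretend IDs are 4-digit medallion-style numbers: only 10,000 candidates.
reverse_table = {md5_hex(f"{i:04d}"): f"{i:04d}" for i in range(10_000)}

# A "de-identified" record as it might appear in a released dataset.
released_hash = md5_hex("7359")  # the publisher thought this hid the ID

print("recovered ID:", reverse_table[released_hash])  # -> 7359
```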
Throughout the chapter, the author does a good job of highlighting the dangers associated with data for AI. She provides real-world examples of how the extraction, abstraction, and use of data could harm people and communities, reminding me of the phrase, “Just because you can, doesn’t mean you should”. I remembered this especially when Crawford explained the gang-crime prediction project, a project to classify crimes, particularly whether a crime is gang-related. The data used in this project came from the Los Angeles Police Department’s crime dataset, which included a disproportionate number of Black and Latinx people. One could imagine that a system built on such a biased dataset would also be biased. When asked about this potential for bias, Hau Chan, a computer scientist at Harvard University and the presenter of the work, simply replied that he was just a researcher, implying that he had no hand in, or idea of, how these systems would be used.

For me, this lack of ethical consideration by some researchers is what stood out. I found it alarming that scientists could dissociate and distance themselves from the impact their work could have on real people. It made me wonder why they were creating these systems in the first place. Crawford expands on my concern by quoting Joseph Weizenbaum, an AI scientist writing back in 1976, who said, “… scientist and technologist must… counter the forces that tend to remove him from the consequences of his actions. He must—it is as simple as this—think of what he is actually doing.” With this, I wonder: if AI scientists and technologists were to more deeply consider the consequences and implications their research and creations could have on people and societies, would the technology be less harmful? So, my question to Crawford would be: are AI systems the problem? Or is it in fact the people who build them?
In the chapter, Crawford highlights invasive and unconsented methods of data extraction. However, while reading, I couldn’t help but wonder whether these methods should all be categorized the same way. For example, there are datasets built from ordinary people’s images posted on the internet, and then there are unconsented recordings of people in physical spaces. To me, the latter is obviously a violation of a person’s right to privacy, and those building AI systems employing such methods should be held accountable. In the former case, however, one could argue that people put this data on the internet deliberately. Before using most technology, especially social media applications, people even agree to the “terms and conditions” of these apps and then post their data publicly. As such, considering the datasets that use these images that are seemingly marked as free to use, is this also a violation of people’s privacy when they themselves made the data publicly available?
In conclusion, the chapter keeps returning to the idea that “data is the new oil”. This concept is explored as a metaphor that portrays data as a resource to be consumed and an investment to be harnessed. It emphasizes the shift of data from something personal and intimate to something inert and nonhuman, highlighting the transformation of personal data into infrastructure and the use of all kinds of data to improve AI algorithms. So, despite the risks and potential harms, the collection of such data continues, driven by machine learning’s reliance on large datasets and the normalization of mass data extraction.