The Dark Origins of Artificial Intelligence for Language Processing

    OpenAI’s GPT-3 is an artificial intelligence-powered language prediction model. It was developed by OpenAI and released in 2020, and it builds on the Transformer architecture that Google researchers introduced in 2017. GPT-3 could reshape the writing, publishing, and editing industries because it predicts which word should come next in a sentence accurately enough to produce fluent text. It is not limited to English; it also works for other languages such as French, Arabic, and Urdu. A future GPT-4 could take this one step further by predicting whole sentences ahead of time rather than single words, which would be an even bigger shift.
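
    To make "predicting the next word" concrete, here is a minimal sketch that simply counts which word most often follows another in a training text. This is not how GPT-3 works internally (GPT-3 is a very large neural network), and the function names and the toy sentence are made up for illustration; the sketch only shows the basic idea of learning next-word statistics from text.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy training text, invented for this example (not real email data).
corpus = "please review the report and send the report to the team"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))    # "report" follows "the" twice, "team" once
print(predict_next(model, "zebra"))  # word never seen in training
```

    The model can only echo patterns present in its training text, which is why the choice of that text matters so much.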

    In June 2020, a new and powerful AI named GPT-3 dazzled the global tech community. OpenAI, based in San Francisco, California, created the language model; it was trained on billions of words from books and websites at an estimated cost of tens of millions of dollars.

    Enron’s Role In AI for Language Modeling

    The Enron Corpus is an email data set of 1.6 million emails sent between Enron employees, publicly released by the Federal Energy Regulatory Commission in 2003. The cleaned version that researchers most often use contains roughly 500,000 messages.

    It is one of the best-known public email datasets and one of very few of its kind. Because its content spans corporate intrigue and personal affairs, the archive is widely used in artificial intelligence and natural language processing systems. NYU Tandon industry assistant professor Tega Brain explains that “the archive has been an invaluable resource for computer scientists, who used it to train spam filters and other early machine learning systems, and apparently the first version of Apple’s Siri.”

    Figure: An analysis of the Enron Corpus. The nodes represent people within the corpus, and the links between them represent incoming and outgoing emails.
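
    The structure behind an analysis like this can be sketched in a few lines: each person is a node, and each email adds weight to a directed link from sender to recipient. The names and messages below are hypothetical placeholders, not entries from the Enron Corpus, and the helper functions are illustrative rather than taken from any particular analysis.

```python
from collections import Counter, defaultdict

# Hypothetical (sender, recipient) pairs; not real Enron data.
emails = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("alice", "bob"),
]

def build_email_graph(pairs):
    """Map each sender to a Counter of recipients (a weighted directed graph)."""
    graph = defaultdict(Counter)
    for sender, recipient in pairs:
        graph[sender][recipient] += 1
    return graph

def degree_counts(pairs):
    """Return (outgoing, incoming) message counts per person."""
    outgoing = Counter(sender for sender, _ in pairs)
    incoming = Counter(recipient for _, recipient in pairs)
    return outgoing, incoming

graph = build_email_graph(emails)
outgoing, incoming = degree_counts(emails)

print(dict(graph["alice"]))                   # links leaving alice, with weights
print(outgoing["alice"], incoming["alice"])   # messages sent and received by alice
```

    Counting links this way is what lets researchers see who sits at the center of a company's communication.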

    “Machine learning systems make decisions based on statistics rather than explicit instructions from human programmers, but these systems are only as good as the data you give them,” Brain explained, alluding to one of the evening’s discussion topics, dataset bias. “While the archive was assumed to be representative of how people communicated, it wasn’t really representative of the general population.” Her collaborator Sam Lavigne added that Google and Facebook now have much larger datasets from their users for machine learning purposes.
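
    Brain’s point about data can be illustrated with a deliberately tiny classifier that learns only from word counts. Everything here is an assumption made for the example: the labels, the two training sets, and the scoring rule are invented, and real systems are far more sophisticated. The sketch shows the same message receiving different labels depending on which data the system learned from.

```python
from collections import Counter

def train(examples):
    """Count how often each word appears under each label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

# Two invented training sets. In the first, "deal" signals urgency;
# in the second, "deal" mostly appears in routine shopping messages.
corporate_data = [
    ("close the deal now", "urgent"),
    ("deal deadline today", "urgent"),
    ("lunch menu attached", "routine"),
]
general_data = [
    ("great deal on shoes", "routine"),
    ("weekend deal newsletter", "routine"),
    ("server down call now", "urgent"),
]

message = "new deal"
print(classify(train(corporate_data), message))  # learned from corporate email
print(classify(train(general_data), message))    # learned from general email
```

    No rule about the word “deal” was ever written by a programmer; the differing answers come entirely from the statistics of each training set.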

    The Enron emails have become a commonly used data set for training A.I. systems. “If you think there might be significant biases embedded in emails sent among employees of [a] Texas oil-and-gas company that collapsed under federal investigation for fraud stemming from systemic, institutionalized unethical culture, you’d be right,” writes Amanda Levendowski. “Researchers have used the Enron emails specifically to analyze gender bias and power dynamics.” In other words, the most popular email data set for training A.I. has also been recognized by researchers as useful for studying misogyny, and our machines may be learning to display the same toxic masculinity as the Enron execs.

    It’s a tremendously important point to consider as everything around us becomes artificially intelligent and ruled by algorithms. We wrote about bias in algorithms last year, when former Kickstarter data chief Fred Benenson coined the term “mathwashing.” An Eyebeam fellow, Mimi Onuoha, is also doing interesting work on data and bias, and on why we should not assume that computers process the world with any less bias than the humans who build them.

    You can read more about it in Brain and Lavigne’s own words over at Rhizome. “There are many ways to enjoy the Enron corpus, but by far the most pleasurable is to read all 500,000 emails in the order they were sent,” they write.
