Everest Pipkin - Corpora as medium: on the work of curating a poetic textual dataset
As a follow-up to Robin Sloan’s talk (where he used a custom corpus to retrain GPT-2), I think Everest Pipkin’s talk is an important complement, because their main point is that a carefully curated personal dataset can be more effective for generative work than a vast but generic corpus.
A corpus (plural: corpora) is a collection of data, originally in a textual context, such as the training corpora used for natural language processing. For many generative projects (particularly ones that involve machine learning), the choice of data matters a lot. (It’s not limited to text: Helena Sarin has made similar remarks about her visual art practice, using her own art as a corpus.)
Everest’s point is that generative text comes from other text. “Care is really important”: we should work with data that we care about. As they say, “…computational power can’t make up for a lack of argument or poetics,” which to my mind is one of the big gaps in procgen right now. It’s easy to make a sophisticated generator where no one cares about the generated artifacts because the entire thing lacks motivation and soul. Everest presents a way forward, where our care for the input leads the viewer to care about the output.