Corpus

A lot of people who are interested in procgen get intimidated by the vocabulary. I think that’s a pity, particularly since some of the concepts are pretty simple once you get past the jargon.

One term that comes up a lot is “corpus” (plural: “corpora”). In procgen, this just means “a collection of stuff that we use as data for the generator”. Most often, it’s used in the context of text: a corpus of words for a Tracery grammar, or a training corpus for a chatbot.
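To make that concrete, here is a minimal sketch in Python of a text generator drawing on a corpus. Everything in it is made up for illustration: the word lists, the `CORPUS` name, and the `generate` helper are assumptions, and this is not how Tracery itself works, just the same general idea of filling slots from a collection of data.

```python
import random

# Hypothetical miniature corpora; a real project would usually load
# much longer lists from files rather than hard-coding them.
CORPUS = {
    "color": ["crimson", "teal", "ochre"],
    "river": ["Danube", "Mekong", "Yukon"],
}


def generate(template, corpus, rng):
    """Fill each {slot} in the template with a random corpus entry."""
    choices = {slot: rng.choice(entries) for slot, entries in corpus.items()}
    return template.format(**choices)


# Seeding the generator makes the output repeatable between runs.
rng = random.Random(0)
print(generate("The {color} boat drifted down the {river}.", CORPUS, rng))
```

The generator code stays tiny; almost all of the variety comes from what you put in the corpus, which is why curating one matters so much.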

A corpus is useful for more than just text: a building generator might have a corpus of 3D models of architectural elements, and a music generator might have a corpus of motifs.

Corpora show up in a lot of places. The Corpora project is a repository of small text corpora: colors, books, rivers, etc. Last year at Roguelike Celebration, everest pipkin gave a talk about curating your own personal corpora of text.