GINC (Generative In-Context learning Dataset) is a small-scale synthetic dataset for studying in-context learning. The pretraining data is generated by a mixture of HMMs, and the in-context learning prompt examples are also generated from HMMs (either from the mixture or not). The prompts are out-of-distribution with respect to the pretraining data because the examples in a prompt are sampled independently, concatenated, and separated by delimiters. The GitHub repository provides code to generate GINC-style datasets with varying vocabulary sizes, numbers of HMMs, and other parameters.
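
The generation scheme described above can be illustrated with a minimal sketch. This is not the code from the GINC repository: the function names (`random_hmm`, `sample_sequence`, `make_pretraining_doc`, `make_prompt`), the Dirichlet-sampled HMM parameters, and the choice of delimiter token are all assumptions made for illustration only.

```python
import numpy as np

def random_hmm(n_states, vocab_size, rng):
    # Hypothetical helper: random row-stochastic transition and emission matrices.
    trans = rng.dirichlet(np.ones(n_states), size=n_states)
    emit = rng.dirichlet(np.ones(vocab_size), size=n_states)
    return trans, emit

def sample_sequence(hmm, length, rng):
    # Sample `length` tokens from one HMM, starting from a uniformly random state.
    trans, emit = hmm
    n_states, vocab_size = emit.shape
    state = rng.integers(n_states)
    tokens = []
    for _ in range(length):
        tokens.append(int(rng.choice(vocab_size, p=emit[state])))
        state = rng.choice(n_states, p=trans[state])
    return tokens

def make_pretraining_doc(hmms, length, rng):
    # Pretraining document: pick one HMM from the mixture, then sample one long sequence.
    hmm = hmms[rng.integers(len(hmms))]
    return sample_sequence(hmm, length, rng)

def make_prompt(hmm, n_examples, example_len, delimiter, rng):
    # Prompt: independent short samples from a single HMM, each followed by a delimiter.
    # Restarting the chain for every example is what makes the prompt differ
    # from one continuous pretraining sequence.
    prompt = []
    for _ in range(n_examples):
        prompt.extend(sample_sequence(hmm, example_len, rng))
        prompt.append(delimiter)
    return prompt
```

For example, with `rng = np.random.default_rng(0)` and `hmms = [random_hmm(4, 10, rng) for _ in range(3)]`, calling `make_prompt(hmms[0], 3, 5, delimiter=10, rng=rng)` returns 18 tokens: three 5-token examples, each followed by the delimiter. Using an id outside the vocabulary as the delimiter is a simplification of this sketch; the actual dataset may handle delimiters differently.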