WebText is an internal OpenAI corpus created by scraping web pages with emphasis on document quality. The authors scraped all outbound links from Reddit which received at least 3 karma. The authors used the approach as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
334 PAPERS • NO BENCHMARKS YET
OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB).
141 PAPERS • NO BENCHMARKS YET