AI founders need to watch out. It’s no secret that publishers and writers are sensitive about large-language models using their data for training. But just how sensitive became clear in the recent shuttering of a six-year-old publishing dataset called Prosecraft.
The service, created in 2017 by Benji Smith, founder and CEO of the word processor Shaxpir, was intended as a helpful resource for authors. It ranked titles based on how passive or vivid their language is; for instance, it granted Lewis Carroll’s “Alice’s Adventures in Wonderland” an 83.94% “vividness” score. One issue: it scraped those books from the web without the authors’ permission. Even so, people didn’t seem to pay it much mind until the rise of generative AI, which has raised concerns about LLMs training on copyrighted material.
While Prosecraft itself has little to do with LLMs, the sheer amount of easily accessible textual data it offered rang alarm bells with writers like “Little Fires Everywhere” author Celeste Ng, who worried that Prosecraft’s corpus could be used to train LLMs. A few concerned tweets this weekend from writers who discovered the site kicked off a firestorm on X (formerly known as Twitter). Smith quickly buckled and took down the dataset, which had ingested 27,000 books by that point. He did not respond to a request for comment.