ToxicFrog
Apr 26, 2008


Parahexavoctal posted:

I believe the 'removing the DRM' stage may be illegal, which I'd normally scoff at but you said the legalities are important here. You could limit your project to publishers that omit DRM (e.g., Baen, Tor), or the vast "free!" sections on Smashwords, Kobo, etc?

Also, I very much hope this isn't something involving large language models so that you could automate the generation of prose-like content.

Even if it's a publisher like Tor that doesn't publish with DRM, the library borrowing software (OverDrive, cloudLibrary, etc.) will add its own DRM layer to stop you from keeping books checked out indefinitely or sharing them with other people. And most ebook DRM-stripping tools don't support library DRM in the first place.

So yeah, the easiest approach is going to be free books. Next easiest (but quite expensive given your book counts) is purchasing DRM-free books. After that comes purchasing DRMed books and stripping the DRM from them. Using library ebooks for this is going to be quite difficult on technical grounds alone.

Actually, the easiest approach of all, on a technical level, is probably to find someone you know who is a voracious reader, has lots of recent books specifically, buys DRM-free or strips the DRM from the books they buy as a matter of policy, and is willing to let you run your data science project on their personal library. I don't know if your project would be damaged by the bias inherent in a dataset curated by a single reader's tastes, though, nor do I know anyone who has that many recent books; most people do not read a thousand books a year. I'm also not sure that would satisfy your legal requirements.
