pangstrom
Jan 25, 2003

Wedge Regret
Not sure my thinking is 100% clear on this, but I think a more subtle form of "data leakage" is when there's a common causal factor behind both your training and test sets that you aren't factoring out* (or you're including data markers that maybe you shouldn't), and that factor isn't going to be meaningfully predictive going forward. Like if you were predicting mortgage defaults on data that included 2008-2010 and the algorithm "figures out" to just triple (or whatever) the rates for those years, you could think your model was much better at predicting than it actually will be. Probably a bad example, since it kind of muddles in the time-travel issue unless you had just one data point per mortgage or something... so maybe crashes in some regions of the US would be a better example (unless those crashes were endemic and non-correcting). Data separation isn't always the best fix there (compared to factoring out), though: if your training set has the years/regions with lots of crashes and your test set doesn't, your model might actually end up worse than it could be.

*Factoring out could mean changing the performance metric to something like "rate of default above/below the baseline for that time period" (or, in the better example, "for that region and time period"), with the baselines computed separately for the test and training sets... or you could just strip the years/regions, though then you'll miss any endemic signals, if there are any.
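A minimal sketch of the 2008-2010 example, with all numbers invented: a "model" that just memorizes per-year default rates looks great evaluated in-sample, but the year column it leaned on carries no information about loans written after the crash.

```python
import random
from collections import defaultdict

random.seed(0)

def make_loans(years, crisis=(2008, 2009, 2010), base_rate=0.03, n=5000):
    """One (year, defaulted) row per loan; crisis years triple the rate."""
    rows = []
    for year in years:
        rate = base_rate * (3 if year in crisis else 1)
        rows.extend((year, random.random() < rate) for _ in range(n))
    return rows

train = make_loans(range(2006, 2011))  # includes the 2008-2010 crash

# A "model" that just memorizes per-year default rates.
tally = defaultdict(lambda: [0, 0])
for year, defaulted in train:
    tally[year][0] += defaulted
    tally[year][1] += 1
per_year = {y: k / n for y, (k, n) in tally.items()}

# In-sample the year column looks hugely predictive (crisis years come
# out around 3x the baseline)...
print({y: round(r, 3) for y, r in sorted(per_year.items())})

# ...but that "signal" says nothing about loans written after the crash:
# the learned table has no entry for any later year.
print(2013 in per_year)
```

The footnote's "factoring out" fix would score each loan against a baseline for its own period, computed separately on the training and test sides, so the crash itself stops being the thing the model learns.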

pangstrom fucked around with this message at 21:13 on Mar 22, 2022


pangstrom
Jan 25, 2003

Wedge Regret

CarForumPoster posted:

That's not data leakage.

Data leakage is when your training set knows a factor it otherwise would not know in practice. Sometimes this can be a restatement of your Y variable.

E.g. I am trying to predict which leads will convert to customers. My data include all info I know about customers, including their lifetime values. Obviously only companies with an LTV > 0 are customers, so I can't include the LTV column.

The underlying ecosystem changing in a way that makes your data inaccurate is just reality being a bitch and the nature of time-variant systems.
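The quoted LTV example can be sketched as follows; all column names and values here are invented. Since only customers have LTV > 0, the column restates the target, so it has to be dropped before training.

```python
# Target leakage sketch: lifetime value is a restatement of the label
# (customers are exactly the rows with LTV > 0), so a model given the
# 'ltv' column can just read the answer off. Columns are hypothetical.
leads = [
    {"emails_opened": 12, "demo_booked": True,  "ltv": 480.0, "converted": True},
    {"emails_opened": 1,  "demo_booked": False, "ltv": 0.0,   "converted": False},
    {"emails_opened": 7,  "demo_booked": True,  "ltv": 0.0,   "converted": False},
]

LEAKY = {"ltv"}  # anything that restates or post-dates the label

def features(row):
    """Feature dict with the label and any leaky columns removed."""
    return {k: v for k, v in row.items() if k not in LEAKY | {"converted"}}

X = [features(r) for r in leads]
y = [r["converted"] for r in leads]
print(X[0])  # no 'ltv' key left for the model to cheat with
```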
Well, the more subtle form of data leakage I'm talking about, in the converting-customers case, could be something like differences in sales employee quality, with the same employees appearing in both the training and test sets. Your model would know Brad is awesome, Todd sucks, etc., but with new employees, or if you put the model to use at a different company, it would do worse than you'd expect.

The tail end of this page has stuff like what I'm talking about: https://towardsdatascience.com/data-leakage-in-machine-learning-how-it-can-be-detected-and-minimize-the-risk-8ef4e3a97562
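The Brad/Todd point can be sketched like this, with invented reps and conversion rates: a model that just memorizes per-rep conversion rates aces a random train/test split (the same reps sit on both sides), but falls back to guessing on a rep it has never seen, which is what a group-aware split like scikit-learn's GroupKFold is designed to expose.

```python
import random
from collections import defaultdict

random.seed(1)

# Invented conversion rates; "NewRep" never appears in training.
RATES = {"Brad": 0.6, "Todd": 0.1, "Ana": 0.35, "Kim": 0.45, "NewRep": 0.5}

def make_leads(reps, n=2000):
    """One (rep, converted) row per lead."""
    return [(rep, random.random() < RATES[rep])
            for rep in (random.choice(reps) for _ in range(n))]

data = make_leads(["Brad", "Todd", "Ana", "Kim"])
random.shuffle(data)
train, test = data[:1000], data[1000:]

# "Model": memorize each rep's training conversion rate; predict a
# conversion iff that rep beats the overall rate.
tally = defaultdict(lambda: [0, 0])
for rep, converted in train:
    tally[rep][0] += converted
    tally[rep][1] += 1
rep_rate = {r: k / n for r, (k, n) in tally.items()}
overall = sum(y for _, y in train) / len(train)

def predict(rep):
    return rep_rate.get(rep, overall) > overall

# Random split: the same reps are on both sides, so rep identity leaks
# and accuracy looks comfortably better than chance.
acc_random_split = sum(predict(r) == y for r, y in test) / len(test)

# Group-style holdout: a rep the model has never seen. The memorized
# identities are useless and accuracy collapses toward guessing.
unseen = make_leads(["NewRep"])
acc_unseen = sum(predict(r) == y for r, y in unseen) / len(unseen)

print(round(acc_random_split, 3), round(acc_unseen, 3))
```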

pangstrom
Jan 25, 2003

Wedge Regret
I agree it's different from simple leakage between the training and target sets, but it seems to fit the "use of information in the model training process which would not be expected to be available at prediction time" definition? But whatever, semantics.

pangstrom
Jan 25, 2003

Wedge Regret
This is a factoid from a previous life but you used to hear that backprop was biologically implausible except for maybe kinda-sorta in the cerebellum.
