pangstrom
Jan 25, 2003

Wedge Regret
Not sure my thinking is 100% clear on this, but I think a more subtle form of "data leakage" is when there's a common causal factor behind both your training and test sets that you aren't factoring out* (or you're including data markers that maybe you shouldn't), and that factor isn't going to be meaningfully predictive going forward. Like if you were predicting mortgage defaults on data that included 2008-2010 and the algorithm "figures out" to just triple (or whatever) the rates for those years, you could think your model was much better at predicting than it actually will be. Probably a bad example, since it kind of muddles in the time-travel issue unless you had just one data point per mortgage or something... so maybe crashes in some regions of the US would be a better example (unless those crashes were endemic and non-correcting). Data separation isn't always the best fix there (compared to factoring out), though: if your training set has the years/regions with lots of crashes and your test set doesn't, your model might actually end up worse than it could be.

*Factoring out could mean changing the performance metric to something like "rate of default above/below the baseline for that time period" (or, in the better example, "for that region and time period"), with the baselines computed separately for the test and training sets... or you could just strip the years/regions, though then you'll miss any endemic signals, if there are any.
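A minimal sketch of the 2008-2010 example, with all numbers invented: a "model" that just memorizes per-year default rates looks great evaluated in-sample, but the year column it leaned on carries no information about loans written after the crash.

```python
import random
from collections import defaultdict

random.seed(0)

def make_loans(years, crisis=(2008, 2009, 2010), base_rate=0.03, n=5000):
    """One (year, defaulted) row per loan; crisis years triple the rate."""
    rows = []
    for year in years:
        rate = base_rate * (3 if year in crisis else 1)
        rows.extend((year, random.random() < rate) for _ in range(n))
    return rows

train = make_loans(range(2006, 2011))  # includes the 2008-2010 crash

# A "model" that just memorizes per-year default rates.
tally = defaultdict(lambda: [0, 0])
for year, defaulted in train:
    tally[year][0] += defaulted
    tally[year][1] += 1
per_year = {y: k / n for y, (k, n) in tally.items()}

# In-sample the year column looks hugely predictive (crisis years come
# out around 3x the baseline)...
print({y: round(r, 3) for y, r in sorted(per_year.items())})

# ...but that "signal" says nothing about loans written after the crash:
# the learned table has no entry for any later year.
print(2013 in per_year)
```

The footnote's "factoring out" fix would score each loan against a baseline for its own period, computed separately on the training and test sides, so the crash itself stops being the thing the model learns.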

pangstrom fucked around with this message at 21:13 on Mar 22, 2022


pangstrom
Jan 25, 2003

Wedge Regret

CarForumPoster posted:

That's not data leakage.

Data leakage is when your training set knows a factor it otherwise would not know in practice. Sometimes this can be a restatement of your Y variable.

E.g. I am trying to predict which leads will convert to customers. My data include all info I know about customers, including their lifetime values. Obviously only companies with an LTV > 0 are customers, so I can't include the LTV column.

The underlying ecosystem changing in a way that makes your data inaccurate is just reality being a bitch and the nature of time-variant systems.
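The quoted LTV example can be sketched as follows; all column names and values here are invented. Since only customers have LTV > 0, the column restates the target, so it has to be dropped before training.

```python
# Target leakage sketch: lifetime value is a restatement of the label
# (customers are exactly the rows with LTV > 0), so a model given the
# 'ltv' column can just read the answer off. Columns are hypothetical.
leads = [
    {"emails_opened": 12, "demo_booked": True,  "ltv": 480.0, "converted": True},
    {"emails_opened": 1,  "demo_booked": False, "ltv": 0.0,   "converted": False},
    {"emails_opened": 7,  "demo_booked": True,  "ltv": 0.0,   "converted": False},
]

LEAKY = {"ltv"}  # anything that restates or post-dates the label

def features(row):
    """Feature dict with the label and any leaky columns removed."""
    return {k: v for k, v in row.items() if k not in LEAKY | {"converted"}}

X = [features(r) for r in leads]
y = [r["converted"] for r in leads]
print(X[0])  # no 'ltv' key left for the model to cheat with
```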
Well, the more subtle form of data leakage I'm talking about, in the converting-customers case, could be something like differences in sales employee quality, with the same employees appearing in both the training and test sets. Your model would know Brad is awesome, Todd sucks, etc., but with new employees, or if you put the model to use at a different company, it would do worse than you'd expect.

The tail end of this page has stuff like what I'm talking about: https://towardsdatascience.com/data-leakage-in-machine-learning-how-it-can-be-detected-and-minimize-the-risk-8ef4e3a97562
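The Brad/Todd point can be sketched like this, with invented reps and conversion rates: a model that just memorizes per-rep conversion rates aces a random train/test split (the same reps sit on both sides), but falls back to guessing on a rep it has never seen, which is what a group-aware split like scikit-learn's GroupKFold is designed to expose.

```python
import random
from collections import defaultdict

random.seed(1)

# Invented conversion rates; "NewRep" never appears in training.
RATES = {"Brad": 0.6, "Todd": 0.1, "Ana": 0.35, "Kim": 0.45, "NewRep": 0.5}

def make_leads(reps, n=2000):
    """One (rep, converted) row per lead."""
    return [(rep, random.random() < RATES[rep])
            for rep in (random.choice(reps) for _ in range(n))]

data = make_leads(["Brad", "Todd", "Ana", "Kim"])
random.shuffle(data)
train, test = data[:1000], data[1000:]

# "Model": memorize each rep's training conversion rate; predict a
# conversion iff that rep beats the overall rate.
tally = defaultdict(lambda: [0, 0])
for rep, converted in train:
    tally[rep][0] += converted
    tally[rep][1] += 1
rep_rate = {r: k / n for r, (k, n) in tally.items()}
overall = sum(y for _, y in train) / len(train)

def predict(rep):
    return rep_rate.get(rep, overall) > overall

# Random split: the same reps are on both sides, so rep identity leaks
# and accuracy looks comfortably better than chance.
acc_random_split = sum(predict(r) == y for r, y in test) / len(test)

# Group-style holdout: a rep the model has never seen. The memorized
# identities are useless and accuracy collapses toward guessing.
unseen = make_leads(["NewRep"])
acc_unseen = sum(predict(r) == y for r, y in unseen) / len(unseen)

print(round(acc_random_split, 3), round(acc_unseen, 3))
```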

pangstrom
Jan 25, 2003

Wedge Regret
I agree it's different from simple leakage between the training and target sets, but it seems to fit the "use of information in the model training process which would not be expected to be available at prediction time" definition? But whatever, semantics.

pangstrom
Jan 25, 2003

Wedge Regret
This is a factoid from a previous life but you used to hear that backprop was biologically implausible except for maybe kinda-sorta in the cerebellum.
