|
Not sure my thinking is 100% clear on this, but I think a more subtle form of "data leakage" is when some common causal factor shows up in both your training and test sets that you aren't factoring out* (or you're including data markers that maybe you shouldn't) and that factor isn't going to be meaningfully predictive going forward. Like if you were predicting mortgage defaults and had data including 2008-2010, and the algorithm "figures out" to just triple (or whatever) rates for those years, you could think your model was much better at predicting than it actually will be.

Probably a bad example, since it kind of mushes in the time travel issue unless you just had one data point per mortgage or something... so maybe crashes in some regions of the US would be a better example (unless those crashes were endemic and non-correcting).

Data separation isn't always the best approach in those cases (compared to factoring out), though. Like if your training set has those years or regions with lots of crashes and your test set doesn't, then your model might actually be worse than it could be.

*Factoring out could mean changing the performance metric to something like "rate of default below/above baseline for that time period", or in the better example "rate of default above/below baseline for that region and time period", though you would compute the baselines separately for the test and training sets... or you could just strip the years/regions (though then you're going to miss any endemic signals, if there are any).

pangstrom fucked around with this message at 21:13 on Mar 22, 2022
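For what it's worth, here's a rough sketch of the "factoring out" idea: re-express each loan's outcome as "default minus the baseline default rate for its region and year", with the baselines computed separately inside the training set and inside the test set. The data is made up, and the field names (`region`, `year`, `defaulted`) and helper functions are hypothetical, just for illustration.

```python
from collections import defaultdict


def group_baselines(rows):
    """Mean default rate per (region, year) group within one dataset."""
    totals = defaultdict(lambda: [0, 0])  # group -> [defaults, loan count]
    for row in rows:
        key = (row["region"], row["year"])
        totals[key][0] += row["defaulted"]
        totals[key][1] += 1
    return {key: d / n for key, (d, n) in totals.items()}


def relative_targets(rows):
    """Outcome minus the baseline of the row's own group, using only `rows`."""
    base = group_baselines(rows)
    return [row["defaulted"] - base[(row["region"], row["year"])] for row in rows]


# Made-up toy data, not real mortgage numbers.
train = [
    {"region": "NV", "year": 2009, "defaulted": 1},
    {"region": "NV", "year": 2009, "defaulted": 1},
    {"region": "NV", "year": 2009, "defaulted": 0},
    {"region": "NV", "year": 2009, "defaulted": 1},
    {"region": "OH", "year": 2009, "defaulted": 0},
    {"region": "OH", "year": 2009, "defaulted": 1},
]
test = [
    {"region": "NV", "year": 2015, "defaulted": 0},
    {"region": "NV", "year": 2015, "defaulted": 1},
    {"region": "OH", "year": 2015, "defaulted": 0},
    {"region": "OH", "year": 2015, "defaulted": 0},
]

# Baselines are computed separately for each set, so the test set's
# baselines never touch anything used in training.
train_baselines = group_baselines(train)
train_targets = relative_targets(train)
test_targets = relative_targets(test)

print(train_baselines)
print(train_targets)
print(test_targets)
```

With targets like these, the model gets no credit for just learning "2009 in this region was bad"; it only gets credit for ranking loans within a region and period. The catch is that small groups give noisy baselines, so you'd probably want some minimum group size in practice.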
# ¿ Mar 22, 2022 21:11 |
|
CarForumPoster posted:
That's not data leakage.

The tail end of this page has stuff like what I'm talking about: https://towardsdatascience.com/data-leakage-in-machine-learning-how-it-can-be-detected-and-minimize-the-risk-8ef4e3a97562
|
# ¿ Mar 22, 2022 21:54 |
|
I agree it's different from the simple leakage between training/target sets, but it seems to fit the "use of information in the model training process which would not be expected to be available at prediction time" definition? But whatever, semantics.
|
# ¿ Mar 23, 2022 13:48 |
|
This is a factoid from a previous life but you used to hear that backprop was biologically implausible except for maybe kinda-sorta in the cerebellum.
|
# ¿ Jun 14, 2023 14:46 |