|
Anyone got advice on how to prevent leaking data between training and test datasets when doing feature engineering? My previous models didn't require very much, but the next one I'm going to take a crack at would require substantial amounts of feature engineering. More specifically, I've read that mostly you should be fine if you are only combining data from one "row", but if you combine several rows there is risk of leakage. If anyone has great resources to dig into feature engineering in general, I'd very much appreciate that too!
|
# ¿ Mar 10, 2022 09:46 |
|
|
# ¿ May 17, 2024 13:46 |
|
ultrafilter posted:The key is not to use information from the test set in your feature engineering. Do your split first, and then do feature engineering. Thank god I posted! I was literally about to do feature engineering at the dataset level before splitting. I saved the book link and will dive into it! You just saved me possibly hours and hours of frustration. Will this book help me understand how to handle hierarchical data in general? For example, say I have a bunch of property data from 100 cities, and I want to cluster the cities based on the property data inside them. I have an idea of how I would cluster the underlying properties, as that is simple, but how I would cluster the upper hierarchy of cities that hold the properties has baffled me for at least a year, and I haven't been able to find info about it despite a lot of googling. (I might just be bad at googling, though, and definitely bad at ML.)
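In scikit-learn terms, my understanding of "split first, then engineer" looks something like this (toy data, all names mine):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataset standing in for real features/labels.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)

# Split FIRST, so nothing about the test rows leaks into the features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit transformers (scalers, encoders, imputers...) on the training data only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# ...and only apply (never fit) them to the test data.
X_test_scaled = scaler.transform(X_test)
```

The anti-pattern would be `StandardScaler().fit(X)` on the full dataset before splitting, since then the test rows' statistics bleed into the training features.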
|
# ¿ Mar 11, 2022 09:38 |
|
Thanks for the tips! That sure explains why I wasn't able to find good, basic information on dealing with hierarchical data. I'll try to work around the hierarchies for now. dexefiend posted:I love SKLearn. My current problem has data coming in every day. Can't I repeat the test on the final holdout by just waiting a month or so for fresh data and building a new holdout set from that? I don't see why not.
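For the record, the workaround I'm leaning toward for the city problem: flatten the hierarchy by aggregating each city's properties into summary statistics, then cluster the cities on those. A rough sketch with invented columns:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy property-level data: each row is one property (columns invented).
props = pd.DataFrame({
    "city":  ["A", "A", "B", "B", "C", "C"],
    "price": [100, 120, 300, 310, 95, 105],
    "sqm":   [50, 60, 120, 130, 45, 55],
})

# Aggregate up the hierarchy: one row per city, summarizing its properties.
city_feats = props.groupby("city").agg(["mean", "std"])
city_feats.columns = ["_".join(c) for c in city_feats.columns]

# Now the cities are ordinary rows and any clusterer works on them.
X = StandardScaler().fit_transform(city_feats)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Quantiles or histograms per city would preserve more of each distribution than mean/std, at the cost of more columns.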
|
# ¿ Mar 15, 2022 11:08 |
|
ultrafilter posted:If you're not worried about dataset shift (but you should be). I am, kinda, if you mean that the holdout set will be biased in a different way than the train/test set. However, since only performance on the freshest/holdout data matters, I guess this is not a critical issue? (Compared to if performance on the entire dataset were the goal.)
|
# ¿ Mar 15, 2022 18:27 |
|
CarForumPoster posted:Again I wouldn't call myself very knowledgeable but IMO the answer to this is "it depends". If your goal is to build a model to predict on NEW data, then using data later in time as the test set is a good idea. That said, you can get seasonality or other "errant" behavior in datasets in many weird ways. In my business our marketing response rate drops every February. If I was to make a model to predict whether a customer responds to marketing on 10 months of data and then have a validation set that is mostly February leads, I might think my model sucked. I should probably have two test/validation sets: Right, good points. I could very well see myself jumping to conclusions like that, for example concluding that a model in production has turned into trash when it simply hit a seasonality it can't handle well. dexefiend posted:But you need to "stand where you will be standing when you make your prediction". Could you clarify what you mean by this standing thing? It sounds vaguely familiar but I just can't recall it.
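To make the two-test-set idea concrete for myself: it could be as simple as slicing on dates instead of splitting randomly, with one validation set from the future and one from the problem season (toy frame, column names invented):

```python
import pandas as pd

# Toy lead data with a timestamp column (names invented).
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=365, freq="D"),
    "responded": 0,
})

# Time-ordered split: train on the past, validate on the future,
# mimicking how the model will actually be used in production.
train = df[df["date"] < "2021-10-01"]
future_val = df[df["date"] >= "2021-10-01"]

# Second, season-specific validation set: how does the model do in February,
# when the response rate is known to drop?
feb_val = df[df["date"].dt.month == 2]
```

A big gap between the future-validation score and the February score would flag exactly the "model hit a season it can't handle" failure mode, rather than looking like the model rotted.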
|
# ¿ Mar 17, 2022 09:43 |
|
dexefiend posted:That is what I say when I am thinking about a model. Ah! Yes! I got 99 problems but training the network on data not available during production ain't one. I've tried to be very mindful of this stuff.
|
# ¿ Mar 17, 2022 21:26 |
|
cinci zoo sniper posted:Not sure what’s a good alternative for it though, for general NN intro. One of the things on my docket is to figure out a replacement curriculum for future hires that may need it. Allow me to (possibly) help you! Personally, I used Kirill Eremenko et al.'s "Deep Learning A-Z™: Hands-On Artificial Neural Networks" to learn the basics. I had no previous understanding of ML at all, and only very basic knowledge of programming, perhaps 80-160 hours total under my belt. The course was pretty good in my opinion, and I learnt the basics; my thesis network probably still uses bits of code from those lectures I did years ago. I don't know how current the course is, and obviously it doesn't teach deep insight into deep learning, but it's a very good hands-on lecture series. Or at least I found it good; I'm not saying it's the best one out there. Oh, and in case anyone isn't familiar: Udemy's "sales" are all bullshit. Courses always cost 10-30 bucks no matter what the list prices say, and those very "urgent" 95%-off deals on several-hundred-dollar/euro courses are essentially fake; there's always a sale going on that ends in a few hours or days. Link: https://www.udemy.com/course/deeplearning/ EDIT: I still use the course every now and then; for example, I'm looking into RNNs at the moment and rewatching the LSTM lectures. It's going to be a good stepping stone to time-series analysis or whatever the hell I want to do later.
|
# ¿ Jun 13, 2022 09:42 |
|
Raenir Salazar posted:So here's my current problem for a university project and my attempt to make it go a little faster using my GPU. With the really big caveat that I am a relative novice and probably wrong, it sounds like there is a bottleneck on data transfer to the GPU, causing idle times. Possible reasons: the batch size is small, or the model is too small to get a good GPU advantage.
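If it is a transfer bottleneck and you happen to be in PyTorch, these are the usual first knobs to try (a sketch, not a diagnosis; the numbers are guesses to tune):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your real one.
ds = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))

loader = DataLoader(
    ds,
    batch_size=256,   # larger batches amortize per-transfer overhead
    num_workers=2,    # load/preprocess batches in parallel with GPU compute
    pin_memory=True,  # page-locked host memory speeds up copies to the GPU
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for xb, yb in loader:
    # non_blocking overlaps the host-to-device copy with compute
    # when the source memory is pinned
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    break  # one batch is enough for the sketch
```

If utilization is still low after that, the model itself may just be too small for the GPU to beat the launch/transfer overhead.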
|
# ¿ Jun 21, 2022 13:16 |
|
At a quick glance it seems like you have an enormous sequential layer there; it might be the culprit. I would look into scikit-learn's GridSearchCV and randomized architecture search to tune network size or dropout values.
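A minimal sketch of the idea, using sklearn's own MLPClassifier as a stand-in for your network (with a Keras/PyTorch model you'd need a wrapper to get the same fit/score interface):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Toy classification data in place of the real problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cross-validated grid over network width/depth; swap in RandomizedSearchCV
# when the grid gets too big to search exhaustively.
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(16,), (64,), (64, 64)]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern extends to dropout or learning-rate parameters once the model is wrapped with a scikit-learn-compatible interface.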
|
# ¿ Jun 24, 2022 09:02 |
|
Turambar posted:I'm really starting to fear for my job Lmao, that is amazing. ChatGPT isn't usually so catty. Edit: But it's probably fake, or it was asked to provide insulting responses.
|
# ¿ May 17, 2023 12:48 |
|
poty posted:Thanks for the suggestion, but I don't think it works, if I understand things correctly. There is only one call to .fit() for the whole RandomizedSearchCV optimization process, as opposed to one call to .fit() for each of the 100 parameter sets (5 of which might never complete). I can put the global .fit() call inside a timer, but then I don't get results for any parameter set if one is slow (the same result as what I'm doing now, killing the kernel in Jupyter Notebook). Sadly, the only solution I have come up with is to extend the RandomizedSearchCV/GridSearchCV code itself to allow for this. And I can't be arsed, so I just dealt with it.
|
# ¿ Jun 4, 2023 07:11 |
|
Hammerite posted:it's a weird thing to talk about and he's being weird about it. especially as a prominent corporate guy who's in the news Yeah, and what the hell does he even mean by a "genuine" rape fantasy? That one strikes me as "some women really enjoy/want to be raped" type poo poo. No one wants to be raped, as rape by definition is unwanted; if someone "wants" to be raped, what they actually have is a taboo fetish or something. Moreover, if we presume that the studies cited were done scientifically properly and aren't hackjobs, the conclusion to draw isn't "40-60% of women WANT to be raped" but more along the lines of "this study would indicate that about half of the population, men included, might have a particular taboo fetish. How surprising." How he just glossed over the ~55% of men who also have these fantasies and presented it as somehow unique to women is odd. Where this study would be appropriate to bring up is a good question; maybe a university psychology class on sexuality or sexual violence, or some fetish site. But it's a weird, weird thing to bring up on Twitter, triply so as an executive of a big corporation. EDIT: Managed to open the Twitter thread, and yup, it's weird. Getting some Elon Musk level weird vibes. Keisari fucked around with this message at 11:57 on Nov 21, 2023 |
# ¿ Nov 21, 2023 11:16 |
|
Has anyone been able to fiddle properly with the custom GPTs, and more specifically the "knowledge" part of them? You can upload all kinds of poo poo to be their knowledge base. I made one to help me build a program against a certain API, and uploaded a bunch of JSON files that describe the API. Another was inspired by the built-in board game explainer: a GPT focused on explaining and clarifying rules, with some game manuals uploaded. The API GPT basically completely errored out and either ignored all the knowledge or just crashed. Meanwhile, the game rules explainer worked a bit better, but is poor at applying the knowledge or combining it with other sources of information. It did reasonably OK at finding specific points from the manuals, but was poor at answering certain types of fringe questions that required combining different rules from the manual. Below are some chatlogs and my commentary about two board games: A Distant Plain, a COIN game about the war in Afghanistan, and War of the Ring, a super nerdy (and loving amazing) LOTR board game. Chatlog, Custom Board Game GPT and A Distant Plain, the boardgame posted:User The rules specify a "rally" action, which allows the insurgent factions, such as the Taliban and Warlords, to recruit guerillas on the board. I tried asking about both "recruit" and "rally", and it couldn't find either in the rules. Meanwhile, I asked if the Coalition player can move Government police cubes with their sweep action, and it evidently did find the Sweep part of the rulebook, which proves that the file is readable by the GPT. It was also correctly able to deduce that because the rule mentions only troops and not police, sweep doesn't include police. The problem is that it seems very hit and miss whether it can locate the correct info from the data. I've tried to find good sources that explain the intricacies of this feature, but haven't been able to find any good ones.
(The promising "Medium" article was hidden behind a paywall.) Does anyone have any? The good news is that during my testing, hallucination has been basically zero. I'll take "I have no idea" over hallucinating any day. I also wonder how I can have the AI access its internal knowledge in case it can't find anything in the rulebook; it'd be even better if it could combine knowledge from the rulebook with its internal knowledge. Here's another test in case anyone is interested, this time with the War of the Ring rulebook uploaded: Chatlog, Custom Board Game GPT, War of the Ring posted:User quote:User I was able to prompt it to search other sources, and then it was successful. If I recall correctly, it used Bing to find a website that answered the question, and it gave a good answer. quote:User quote:User It would appear that the AI first answered based on its internal memory and got the answer wrong. When I specifically prompted it to use the rulebook, it got the answer right, although it had to guess.
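My understanding is that the "knowledge" feature is doing retrieval under the hood: chop the uploaded documents into chunks, index them, and fetch the chunks closest to the question before answering. A stripped-down sketch of that idea, with TF-IDF standing in for real embeddings (all data invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Pretend rulebook, pre-split into chunks (invented text).
chunks = [
    "Rally: an insurgent faction places guerrillas in a province.",
    "Sweep: Coalition troops move and may activate guerrillas.",
    "Fellowship: the Free Peoples player may move the fellowship.",
]

# Index the chunks once, then score each question against them.
vec = TfidfVectorizer().fit(chunks)
doc_matrix = vec.transform(chunks)

def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    sims = cosine_similarity(vec.transform([question]), doc_matrix)[0]
    return [chunks[i] for i in sims.argsort()[::-1][:k]]
```

A keyword-ish retriever like this finds nothing useful if the question's wording never appears in the text, which would match the hit-and-miss behaviour above; embedding-based retrieval handles synonyms better but can still miss, and the model only sees whatever chunks come back.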
|
# ¿ Nov 22, 2023 10:49 |
|
|
|
BAD AT STUFF posted:Did you give it an OpenAPI spec? And were you asking it to write code that queries an API or setting that up as an action for a custom GPT? Completely forgot to answer this, sorry! I am unsure what spec it was; I just plonked it in. Nektu posted:This fits the restrictions of LLMs described here: https://aiguide.substack.com/p/can-large-language-models-reason. That was an interesting read, thanks! Well, even in its current state it does provide some value as what basically amounts to an improved PDF search engine.
|
# ¿ Dec 27, 2023 09:35 |