Keisari
May 24, 2011

Anyone got advice on how to prevent leaking data between training and test datasets when doing feature engineering? My previous models didn't require very much, but the next one I'm going to take a crack at would require substantial amounts of feature engineering.

More specifically, I've read that mostly you should be fine if you are only combining data from one "row", but if you combine several rows there is risk of leakage.

If anyone has great resources to dig into feature engineering in general, I'd very much appreciate that too!


Keisari
May 24, 2011

ultrafilter posted:

The key is not to use information from the test set in your feature engineering. Do your split first, and then do feature engineering.

If your data has a hierarchical structure, it's best to split at the highest level of the hierarchy. For example, if you have an ad data set where each record represents an ad but there are multiple ads for a given campaign, you run the risk of leakage if you include data from a single campaign in the training and test set.

Beyond that, the authors of Applied Predictive Modeling wrote a book on feature engineering.

Thank god I posted! I was literally about to feature engineer at the dataset level before splitting. I saved the book link and will dive into it! You just saved me possibly hours and hours of frustration.

Will this book help me understand how to handle hierarchical data in general? For example, say I have a bunch of property data from 100 cities, and I want to cluster the cities based on the property data inside them. I have an idea of how I would cluster the underlying properties, as that is simple, but how to cluster the upper hierarchy of cities that hold the properties has absolutely baffled me for at least a year, and I haven't been able to find info about it despite a lot of googling. (I might just be bad at googling though, and definitely bad at ML.) :eng99:
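For what it's worth, the "split at the highest level of the hierarchy" advice above can be sketched with scikit-learn's GroupShuffleSplit, which keeps every record of a group entirely on one side of the split. A minimal sketch with synthetic data; the column names (`campaign_id`, `clicks`) are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 200 ads spread over 20 campaigns (all made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "campaign_id": rng.integers(0, 20, size=200),
    "clicks": rng.integers(0, 100, size=200),
})

# Split BEFORE any feature engineering, grouped by campaign so no
# campaign ends up in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["campaign_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no campaign leaks across the split.
overlap = set(train["campaign_id"]) & set(test["campaign_id"])
```

Any per-group aggregates (say, average clicks per campaign) would then be computed on `train` only.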

Keisari
May 24, 2011




Thanks for the tips! That sure explains why I wasn't able to find good, basic information on dealing with hierarchical data. I'll try to work around the hierarchies for now.


dexefiend posted:

I love SKLearn.

It's probably a deep hole to jump into, but I like to use the Pipeline approach for feature engineering and training, that way I can just call pipeline_obj.fit(x_train, y_train) [and iterate my development until the model is good enough]



[Wait until I am done with all model development, and I can think of no further ways to improve]


[No seriously, you only get to do this one time]



[I mean it. Don't peek at your test set until you are done with your development.]


and then pipeline_obj.score(x_test, y_test).


My current problem has data coming in every day. Can't I repeat the test on the final holdout by just waiting like a month to get fresh data and building a new holdout set? I don't see why not.
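For anyone following along, the Pipeline pattern dexefiend describes looks roughly like this. The data is synthetic and the scaler/classifier choice is just for illustration; the point is that the feature-engineering steps live inside the pipeline, so they are fitted on training data only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline_obj = Pipeline([
    ("scale", StandardScaler()),    # fitted on x_train only
    ("clf", LogisticRegression()),
])
pipeline_obj.fit(x_train, y_train)           # iterate on this during development
score = pipeline_obj.score(x_test, y_test)   # touch the test set once, at the very end
```

Because the scaler is inside the pipeline, `score()` applies the training-set scaling parameters to the test set instead of refitting them, which is exactly the leakage-free behaviour you want.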

Keisari
May 24, 2011

ultrafilter posted:

If you're not worried about dataset shift (but you should be).

I am, kinda, if you mean that the holdout set will be biased in a different way compared to the train/test set. However, since only performance on the freshest/holdout data matters, I guess this is not a critical issue? (Compared to if performance on the entire dataset were the goal.)

Keisari
May 24, 2011

CarForumPoster posted:

Again I wouldn't call myself very knowledgeable, but IMO the answer to this is "it depends". If your goal is to build a model to predict on NEW data, then using data later in time as the test set is a good idea. That said, you can get seasonality or other "errant" behavior in datasets in many weird ways. In my business our marketing response rate drops every February. If I were to make a model to predict whether a customer responds to marketing on 10 months of data and then have a validation set that is mostly February leads, I might think my model sucked. I should probably have two test/validation sets:
- a "test" set - random sample split off before I build my training set and
- a "validation" set - inputs that are most similar to my real world data, i.e. later in time.

If you build a model on a subset of time that doesn't encompass this seasonality, it can lead you to bad conclusions.

On the flip side, if you want to classify currently unclassified "old" data (for example, predicting what a bunch of companies' revenue was at different times based on inputs from those companies and a model trained on other companies from 2005-2020), you probably DON'T want a validation set that is all 2020 company data.

Right, good points. I could very well see myself jumping to conclusions like that, for example concluding that a model in production has turned into trash even though it simply hit a seasonality that it can't assess well.
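CarForumPoster's two-split idea can be sketched like this. The dates, column names, and the 30-day window are all invented; the point is that the "validation" slice is the most recent data while the "test" set is a random sample from the rest:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy daily data for one year (all made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=365, freq="D"),
    "x": rng.normal(size=365),
})
df["responded"] = (df["x"] > 0).astype(int)

# "Validation" set: the most recent data, closest to production conditions.
cutoff = df["date"].max() - pd.Timedelta(days=30)
recent_val = df[df["date"] > cutoff]
historic = df[df["date"] <= cutoff]

# "Test" set: a random sample split off from the historic portion.
train, test = train_test_split(historic, test_size=0.2, random_state=0)
```

If the later-in-time validation score is much worse than the random-split test score, seasonality or drift is a prime suspect.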

dexefiend posted:

But you need to "stand where you will be standing when you make your prediction".


Could you clarify what you mean by this standing thing? It sounds vaguely familiar but I just can't recall it.

Keisari
May 24, 2011

dexefiend posted:

That is what I say when I am thinking about a model.

Make sure that all your inputs are only available at the point in time that you will be predicting from.

The term that is often used is "data leakage."

My clumsy thoughts:

Make sure:
  • data in your testing set is never incorporated in your training -- either for feature engineering or fitting the model
  • you aren't using time travel*

Time travel is how I think about the issue of availability of data or timeliness of data in relation to using it.

Clumsy example:
So, if you are doing some model on retail sales, and in model development you have an input that is monthly unemployment.

Junior Data Scientist uses Monthly Unemployment as an input on Monthly Sales.

The problem is that monthly unemployment is released with a lag. You would never have February monthly unemployment available at the time you would want to predict February Sales. You would already have your February Sales completely accounted for by the time you got the February Unemployment numbers.


Does that help?

Ah! Yes! I got 99 problems but training the network on data not available during production ain't one. I've tried to be very mindful of this stuff.
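The lagged-input fix for the unemployment example might look something like this in pandas. The numbers and the one-month lag are illustrative; the real lag depends on when the statistic is actually published:

```python
import pandas as pd

# Toy monthly data (figures invented).
df = pd.DataFrame({
    "month": pd.period_range("2023-01", periods=6, freq="M"),
    "sales": [100, 120, 90, 110, 130, 95],
    "unemployment": [3.6, 3.7, 3.9, 3.8, 3.6, 3.5],
})

# February's prediction may only use January's unemployment figure,
# since February's number isn't released until after the fact.
df["unemployment_lag1"] = df["unemployment"].shift(1)
df = df.dropna()  # the first month has no lagged value available
```

Doing the shift in the raw data, before modelling, is the simplest way to make sure every row only contains inputs that would have existed at prediction time.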

Keisari
May 24, 2011

cinci zoo sniper posted:

Not sure what’s a good alternative for it though, for general NN intro. One of the things on my docket is to figure out a replacement curriculum for future hires that may need it.

Allow me to (possibly) help you! :eng101:

Personally, I used Kirill Eremenko et al's "Deep Learning A-Z™: Hands-On Artificial Neural Networks" to learn the basics. I had no previous understanding of ML at all, and only very basic knowledge of programming, perhaps 80-160 hours total under my belt. The course was pretty good in my opinion, and I learnt the basics. My thesis network probably still uses bits of code from those lectures I did years ago. I don't know how "current" the course is, and obviously it doesn't teach deep insight into deep learning, but it's a very good hands-on lecture series. Or at least I found it good; I'm not saying it's the best one out there.

Oh and just in case anyone isn't familiar, Udemy's "sales" are all bullshit, courses always cost 10-30 bucks no matter what the list prices say. Those very "urgent" 95 % off deals of several hundred dollar/euro courses are essentially fake. There's always a sale going on that ends in a few hours or days.

Link: https://www.udemy.com/course/deeplearning/

EDIT:

I still use the course every now and then; for example, I am looking into RNNs at the moment and rewatching the LSTM lectures. It's going to be a good stepping stone to time-series analysis or whatever the hell I want to do later.

Keisari
May 24, 2011

Raenir Salazar posted:

So here's my current problem for a university project and my attempt to make it go a little faster using my GPU.

code:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    if torch.cuda.is_available():
        print("CUDA is Available")
        print(torch.cuda.device(0))
        print(torch.cuda.get_device_name(0))
    else:
        print("Get a better computer.")
    
    model = CNN()
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=m_learning_rate)
    
    #training
    total_step = len(train_loader)
    loss_list = []
    acc_list = []
    
    for epoch in range(m_num_epochs):
        print("Epoch: ", epoch)
        for i, (images, labels) in enumerate(train_loader):
            #images = images.view(images.size(0), -1)
            images, labels = images.to(device), labels.to(device)
            #print("i: ", i)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss_list.append(loss.item())
    
            # Backprop and optimisation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Train accuracy
            total = labels.size(0)
            _, predicted = torch.max(outputs.data, 1)
            correct = (predicted == labels).sum().item()
            acc_list.append(correct / total)
            if (i + 1) % 10 == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
                .format(epoch + 1, m_num_epochs, i + 1, total_step, loss.item(),
                (correct / total) * 100))
This seems to work, is faster, and uses the GPU, albeit only 3-5% of the GPU is actually being utilized.

However using skorch it seems to be much slower and not using the GPU at all:

code:
    from torch.utils.data import random_split
    from sklearn.model_selection import cross_val_score
    from skorch.helper import SliceDataset
    
    m = len(train_set)
    train_set_size = int(m - m * 0.2)
    test_set_size = int(m * 0.2)
    rem = m - train_set_size - test_set_size
    train_data, val_data = random_split(train_set, [train_set_size + rem, test_set_size])
    y_train = np.array([y for x, y in iter(train_data)])
    
    classes = ( 'cloth', 'n95', 'none', 'surgical' )
    new_net = NeuralNetClassifier(
        CNN,    # should match the part 1 for the model training
        max_epochs=m_num_epochs,
        iterator_train__num_workers=8,
        iterator_valid__num_workers=8,
        lr=m_learning_rate,  # maybe need to match
        batch_size=m_batch_size,  # maybe need to match
        optimizer=optim.Adam,
        criterion=nn.CrossEntropyLoss,
        device='cuda'
    )
Any idea what's up? I've tried googling but not much comes up that seems relevant to my use case.

With the really big caveat that I am a relative novice and probably wrong, it sounds like there is a bottleneck on data transfer to the GPU, causing idle time.

Possible reasons: the batch size is small, or the model is too small to benefit much from the GPU.
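With the same caveat that this is guesswork: the usual first things to try for low GPU utilization are larger batches, more loader workers, pinned memory, and asynchronous host-to-device copies. A rough sketch with a tiny placeholder model and random data:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder dataset: 512 samples, 10 features, 2 classes.
data = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))
loader = DataLoader(
    data,
    batch_size=128,   # bigger batches keep the GPU busier per transfer
    num_workers=0,    # raise this (e.g. 4-8) to overlap loading with compute
    pin_memory=torch.cuda.is_available(),  # enables faster async copies
)

model = nn.Linear(10, 2).to(device)
for x, y in loader:
    # non_blocking only helps with pinned memory, but is harmless otherwise
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    out = model(x)
```

If utilization stays low after that, the model itself may just be too small for the GPU to matter much, and profiling the input pipeline is the next step.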

Keisari
May 24, 2011

At a quick glance it seems like you have an enormous sequential layer there. It might be the culprit here. I would look into scikit-learn's GridSearchCV and random architecture search to tune network size or dropout values.
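As a sketch of what that search could look like, here using scikit-learn's MLPClassifier as a stand-in for the torch model (skorch wraps a torch net in the same estimator interface, so the pattern carries over); the layer sizes in the grid are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Grid-search over network size; dropout or learning rate would be
# tuned the same way with a skorch-wrapped torch model.
search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid={"hidden_layer_sizes": [(16,), (64,), (16, 16)]},
    cv=3,
)
search.fit(X, y)
```

`search.best_params_` then tells you which architecture cross-validated best, without hand-running each variant.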

Keisari
May 24, 2011


Lmao, that is amazing. ChatGPT isn't usually so catty.

Edit:
But it's probably fake, or it was asked to provide insulting responses.

Keisari
May 24, 2011

poty posted:

Thanks for the suggestion, but I don't think it works if I understand things correctly. There is only one call to .fit() for the whole RandomizedSearchCV optimization process, as opposed to one call to .fit() for each of the 100 sets of parameters (5 of which might never complete). I can put the global .fit() call inside a timer but then I don't get results for any set of parameters if one is slow (it's the same result as what I'm doing now, killing the kernel in Jupyter Notebook).

I guess the solution is to evaluate the sets of parameters "manually" without using something like RandomizedSearchCV or GridSearchCV, then I would in fact have access to the individual .fit() calls and could do that.

Sadly the only solution I have come up with is to extend the RandomizedSearchCV/GridSearchCV code itself to allow for this. And I can't be arsed so I just dealt with it.
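For the record, the "manual" evaluation poty describes might look something like this sketch. The SVC and the parameter grid are placeholders; note the comment about hard timeouts, since a wall-clock check alone can't interrupt a fit that hangs:

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterSampler, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, random_state=0)

budget_per_fit = 60.0  # seconds; candidates exceeding this get dropped
results = []
for params in ParameterSampler(
    {"C": [0.1, 1.0, 10.0], "gamma": ["scale", "auto"]},
    n_iter=4, random_state=0,
):
    start = time.monotonic()
    score = cross_val_score(SVC(**params), X, y, cv=3).mean()
    elapsed = time.monotonic() - start
    if elapsed <= budget_per_fit:
        results.append((params, score))
    # NOTE: this only discards a slow candidate after it finishes. To
    # hard-kill a fit that hangs indefinitely, you'd run each evaluation
    # in a separate process (e.g. ProcessPoolExecutor) and use
    # future.result(timeout=...) to abandon it.
```

It's more plumbing than RandomizedSearchCV, but each candidate gets its own fit call, which is exactly the hook the built-in search doesn't expose.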

Keisari
May 24, 2011

Hammerite posted:

it's a weird thing to talk about and he's being weird about it. especially as a prominent corporate guy who's in the news

it's only half about what he's saying about it, it's at least that much about the fact he's talking about it at all. and that's partly because people infer from the fact he's talking about it, that he might have strongly held and off-putting opinions in that area

like, I don't know how many women have rape fantasies. And I don't particularly care so I wouldn't get into an animated discussion about it online. The fact that he did, and under his IRL name makes his behaviour Weird with a capital W.

Yeah, and what the hell does he even mean by "genuine" rape fantasy? That one strikes me as "some women really enjoy/want to be raped" type poo poo. No one, no one wants to be raped, as rape by definition is unwanted. If someone "wants" to be raped, what they actually have is a taboo fetish or something.

Moreover, if we presume that the studies cited were done scientifically properly and aren't some hackjobs, the conclusion to draw isn't ":actually: 40-60 % of women WANT to be raped", but could be more along the lines of "this study would indicate that about half of the population, men included, might have a particular taboo fetish. How surprising." How he just glossed over the ~55 % of men who also have these fantasies and presented it as somehow unique to women is odd.

Where this study would be appropriate to bring up is a good question. Maybe some university psychology class on sexuality, sexual violence or some fetish site. But weird, weird thing to bring up on Twitter, triply so as an executive of a big corporation.

EDIT:

Yeah managed to open the twitter thread and yup, it's weird. Getting some Elon Musk level weird vibes.

Keisari fucked around with this message at 11:57 on Nov 21, 2023

Keisari
May 24, 2011

Has anyone really been able to fiddle with the custom GPTs, and more specifically the "knowledge" part of them? You can upload all kinds of poo poo to be their knowledgebase. I have tried to make one to help me build a program to use a certain API, and uploaded a bunch of JSON files that describe the API. Another one I made was inspired by the built-in board game explainer: I made one focused on explaining and clarifying rules, and uploaded some game manuals.

The API GPT basically completely errored out and either ignored all the knowledge or just crashed.

Meanwhile, the game rules explainer worked a bit better, but it is poor at applying the knowledge or combining it with other sources of information. It did reasonably OK at finding specific points from the manuals, but was poor at answering certain types of fringe questions that required combining different rules from the manual.

Below are some chatlogs and my commentary about two boardgames: A Distant Plain, a COIN game about the War in Afghanistan, and War of the Ring, a super nerdy (and loving amazing) LOTR boardgame.

Chatlog, Custom Board Game GPT and A Distant Plain, the boardgame posted:

User
In a Distant Plain, how does the Taliban recruit?

GPT
I wasn't able to find specific information on how the Taliban recruit in the game "A Distant Plain" from the provided rules document. However, based on my knowledge, in "A Distant Plain," the Taliban's recruitment typically involves placing their guerrillas on the board. This process often depends on various factors like the presence of opposition forces, control of provinces, and the game's current political climate.

The recruitment action generally allows the Taliban to increase their presence and influence on the board, which is crucial for their strategy. The specifics, such as the cost and limitations of recruitment actions, may vary depending on the game's scenario and the evolving situation on the board.

For precise details on the Taliban's recruitment mechanics in "A Distant Plain," I recommend consulting the game's rulebook, specifically sections discussing Taliban operations and actions.

User
In a Distant Plain, how does the Taliban rally?

GPT
I have searched through the "A Distant Plain" rulebook in my knowledge source but was unable to find specific details regarding how the Taliban faction conducts a Rally in the game. This information might be in a section of the rulebook that I couldn't access or might not be explicitly detailed in the material I have. For precise rules and guidelines on how the Taliban Rally is performed in "A Distant Plain," I would recommend referring directly to the game's rulebook, particularly the sections that detail the operations and activities of the Taliban faction.



User
Can the Coalition player move Government police cubes with sweep?
GPT
In "A Distant Plain," the Coalition player's ability to move Government police cubes is not explicitly mentioned in the context of the Sweep operation. According to the rules, during a Sweep operation, the Coalition player may move Troops to activate Taliban and Warlord Guerrillas, primarily focusing on the highest Population space possible and aiming to add COIN Control​​. However, the rules do not specifically state that Government police cubes can be moved as part of this operation. Therefore, based on the available information, it appears that the Sweep operation for the Coalition player involves moving Troops rather than Government police cubes. [CORRECT]




The rules specify a "rally" action, which allows the insurgent factions, such as the Taliban and Warlords, to recruit guerrillas on the board. I tried asking about both "recruit" and "rally", and it couldn't find it in the rules. Meanwhile, I asked if the Coalition player can move Government police cubes with their sweep action, and it evidently did find the Sweep part of the rulebook, which proves that the file is readable by the GPT. It was also correctly able to deduce that because the rule only mentions troops and not police, sweep doesn't include police.

The problem is that it seems very hit-and-miss whether it can locate the correct info from the data. I've tried to find good sources that explain the intricacies of this feature, but haven't been able to find any good ones. (The promising "Medium" article was hidden behind a paywall.) Does anyone have any?

The good news is that during my testing, hallucination has been basically zero. I'll take "I have no idea" over hallucinating any day. I also wonder how I can have the AI access its internal knowledge in case it can't find anything in the rulebook, and it'd be even better if it could combine knowledge from the rulebook with its internal knowledge.

Here’s another test in case anyone is interested, this time I uploaded the War of the Ring rulebook:

Chatlog, Custom Board Game GPT and War of the Ring posted:

User
I am playing war of the ring. Explain how I can move the fellowship to mordor?
GPT
In "War of the Ring", moving the Fellowship to Mordor is a critical part of the quest to destroy the One Ring. Here's how you can achieve this:

Reaching Minas Morgul or Morannon: The Fellowship must first reach either Minas Morgul or Morannon. This is a mandatory requirement and must be done during a Fellowship phase to begin the last leg of the journey to Mount Doom. [CORRECT]

Entering Mordor: Once the Fellowship is in Minas Morgul or Morannon during a Fellowship phase, the Free Peoples player may decide to enter Mordor. At this point, you will proceed with the next steps. [CORRECT]

On the Mordor Track: Upon entering Mordor, you need to:

Place the Ring-bearers figure on the first step of the Mordor Track, which is marked with the Elven numeral ‘0’. From this moment, the Fellowship is considered to be “On the Mordor Track.” [CORRECT]
The Fellowship Progress counter is no longer advanced on the Fellowship Track but is still used to indicate whether the Fellowship is Hidden or Revealed. [CORRECT]
A new Hunt Pool is created by placing all Eye tiles previously drawn back with the remaining tiles of the Hunt Pool, and also adding any special tiles put in play by Event cards. Note that any Eye tile permanently removed from the game should not be placed back in the pool. [CORRECT]
Remember, in Mordor, the rules for the Fellowship change, reflecting the increasing danger and the burden of the Ring as they draw closer to Mount Doom.
Here the GPT again managed to fetch a simple query from the rulebook. Next I wanted to test if I could "gotcha" the AI. Normally in the game, the Shadow player rolls "hunt dice" to determine if they catch the Fellowship and do damage. If the Shadow player has any Nazgul, including the Witch-King, they can reroll one of these dice. When the Fellowship reaches Mordor, these dice are no longer rolled. So the "gotcha" here is that normally the Witch-King can reroll a die, but specifically in Mordor it can't.

quote:

User
Can the Witch King be used to reroll hunt dice if the fellowship is in mordor?
GPT
My search within the provided rulebook did not yield specific information about whether the Witch King can be used to reroll hunt dice when the Fellowship is in Mordor in "War of the Ring". Based on the available information, I cannot confirm or deny this rule. For detailed rules regarding unique character abilities like those of the Witch King, especially in specific scenarios like the Fellowship being in Mordor, it would be best to consult the complete rulebook or additional expansion rules if applicable.
The GPT wasn’t able to, in essence, “combine” the information in the rulebook to answer the query. The rulebook both states that Nazgul allow a reroll, but elsewhere it says that in Mordor there are no rolls. (And therefore no rerolls.)

I was able to prompt it to search from other sources, and then it was successful. If I recall correctly, it used Bing to find a website that answered the question and it gave a good answer.

quote:

User
Can you search other sources?
GPT
Certainly! I'll search for information regarding whether the Witch King can be used to reroll hunt dice when the Fellowship is in Mordor in "War of the Ring." Let me look that up for you.

In "War of the Ring", the rules for the Hunt and rerolls change when the Fellowship is on the Mordor Track:

Hunt Re-rolls Outside Mordor: Normally, if the Fellowship is in a region with a Shadow Stronghold, Shadow Army units, or one or more Nazgul, the Shadow player may reroll one failed Hunt roll die for each of these conditions. For example, if there are both Army units and Nazgul in the region with the Fellowship, the Shadow player can reroll two dice. [CORRECT]

Special Rules on the Mordor Track: However, when the Fellowship is on the Mordor Track, the rules differ. The Hunt dice are not rolled during the Action Resolution phase. Instead, one tile is automatically drawn from the Hunt Pool. The effects of this tile are applied as for a successful Hunt, with specific rules for tiles showing an Eye or a Stop icon. [CORRECT]

Therefore, the Witch King's ability to reroll hunt dice would not be applicable when the Fellowship is on the Mordor Track, as the hunt dice are not rolled in this scenario. The game shifts to a different mechanism for representing the Hunt through the drawing of tiles from the Hunt Pool. [CORRECT]
Next, I asked an in-depth question about how casualties are taken. The correct answer would be that if there are no regular units in casualties, one can be taken from reinforcements. If there are none in reinforcements either, the entire elite unit must be taken as a casualty with no downgrading. (=more damage)

quote:

User
What if the free people's player needs to downgrade an elite unit but there is no regular unit as a casualty?
GPT
In "War of the Ring", if the Free Peoples player is required to downgrade an Elite unit but there are no regular units available as casualties (i.e., in the dead pile), the usual practice is that the Elite unit is removed from the board and considered destroyed. This simulates the situation where there are no reinforcements available to replace the Elite unit's losses, effectively resulting in the unit's elimination. This rule maintains the balance and realism of the game, reflecting the logistical and tactical challenges of maintaining an army in a prolonged conflict. [PARTIALLY WRONG]

User
Can you look from the rulebook?
GPT
According to the "War of the Ring" rulebook, when removing casualties, a player can either remove one Regular unit or replace one Elite unit with a Regular unit of the same Nation for each hit scored by the opponent. If an Elite unit is to be replaced but there are no Regular units available either from previous casualties or from the available reinforcements, the rulebook does not explicitly state what happens. It implies that if no replacements are available, the Elite unit cannot be replaced. This suggests that in such a case, the Elite unit would likely be removed as a casualty without replacement. [CORRECT]

It would appear that the AI first answered based on its internal memory, and got the answer wrong. When I specifically prompted it to use the rulebook, it got the answer right, although it had to guess.


Keisari
May 24, 2011

BAD AT STUFF posted:

Did you give it an OpenAPI spec? And were you asking it to write code that queries an API or setting that up as an action for a custom GPT?

I got good results with regular ChatGPT when uploading a basic JSON file describing a service I wanted it to create using FastAPI. I had much less success trying to set up an action for my own GPT. It gave an unhelpful error message and crashed each time. I think the root of the issue was that I gave it the full JSON file that I pulled from Swagger. I want to go back with a cut down file to try adding endpoints one at a time. That might help with finding the source of the error, but also I think it should perform better if I can limit extraneous context.

If anyone has found things that work well for cleaning up knowledge inputs to GPTs, I'd love to hear about that.

...also this OpenAI drama makes me feel better about the dumb politics poo poo that happens at my company. :munch:

Completely forgot to answer this, sorry! I am unsure what spec it was, I just plonked it in.

Nektu posted:

This fits to the restrictions of LLMs described here: https://aiguide.substack.com/p/can-large-language-models-reason.

They are good at identifying patterns in the training data that fit to your prompt, but they are incapable of actual reasoning.

That was an interesting read, thanks! Well, even in its current state it does provide some value as what basically amounts to an improved PDF search engine.
