The Artificial Intelligence & Machine Learning Megathread

dexefiend: Apr 25, 2003; THE GOGGLES DO NOTHING!

I am supposedly a data scientist, but I feel more like a data janitor.

# ¿ Mar 7, 2022 14:03

dexefiend: Apr 25, 2003; THE GOGGLES DO NOTHING!

I love SKLearn.

It's probably a deep hole to jump into, but I like to use the Pipeline approach for feature engineering and training. That way I can just call pipeline_obj.fit(x_train, y_train) [and iterate my development until the model is good enough]

[Wait until I am done with all model development, and I can think of no further ways to improve]

[No seriously, you only get to do this one time]

[I mean it. Don't peek at your test set until you are done with your development.]

and then pipeline_obj.score(x_test, y_test).
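For anyone who wants to see the shape of that workflow, here is a minimal sketch. The synthetic data, the StandardScaler step, and the LogisticRegression model are placeholders I picked for illustration, not anything from a real project. Cross-validation on the training split stands in for the "iterate" step, and the test set gets scored exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in your own features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature engineering and the model live in one object, so every
# preprocessing step is fitted on whatever data is passed to fit().
pipeline_obj = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Development loop: this only ever touches the training split.
cv_scores = cross_val_score(pipeline_obj, x_train, y_train, cv=5)
print("mean CV accuracy:", cv_scores.mean())

# Done iterating: fit once on all training data, score once on test.
pipeline_obj.fit(x_train, y_train)
test_score = pipeline_obj.score(x_test, y_test)
print("test accuracy:", test_score)
```

Because the scaler sits inside the pipeline, cross_val_score refits it on each training fold, so the held-out fold never leaks into the scaling statistics.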

# ¿ Mar 11, 2022 15:50

dexefiend: Apr 25, 2003; THE GOGGLES DO NOTHING!

But you need to "stand where you will be standing when you make your prediction".

If you are predicting every day, you need to be evaluating data "every day".
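One way to approximate "evaluating every day" is a walk-forward loop: train on everything up to a day, predict the next day, then slide forward. The sketch below is plain Python with a hypothetical helper name (walk_forward_splits) and made-up sizes; scikit-learn's TimeSeriesSplit offers a similar idea if you'd rather not roll your own.

```python
def walk_forward_splits(n_samples, initial_train, step=1):
    """Yield (train_indices, test_indices) with training strictly before testing."""
    for start in range(initial_train, n_samples, step):
        stop = min(start + step, n_samples)
        yield list(range(0, start)), list(range(start, stop))


# Assumption for illustration: 10 daily observations, first 7 used to warm up.
splits = list(walk_forward_splits(n_samples=10, initial_train=7))

for train_idx, test_idx in splits:
    # At each step the model only sees days that are already in the past.
    print(f"train on days 0..{train_idx[-1]}, predict day(s) {test_idx}")
```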

If you have "plenty of data", using 60% train, 20% validation, and 20% testing is also a good approach.

You can peek at the validation set for evaluation purposes, but never train on it. So, you iteratively train/evaluate/improve using train and validation, and then do the final final final evaluation against your test set.
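A rough sketch of the 60/20/20 split, assuming your rows are already sorted by time so that the split is chronological rather than shuffled (for data with no time ordering, a shuffled split would be the usual choice). The function name and the toy data are hypothetical.

```python
def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered rows into train / validation / test without shuffling."""
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Toy stand-in: 100 observations, oldest first.
observations = list(range(100))
train, val, test = chronological_split(observations)

print(len(train), len(val), len(test))
```

Train and validation are the two pieces you loop over during development; the last slice stays untouched until the very end.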

With multiple time series, you could hold out 20% of all your observations as test, and then use the last 20% of each individual series as validation.
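Here is one possible reading of that, sketched in plain Python. I am assuming the test holdout is done by setting aside whole series (about 20% of them), which is only one way to interpret "20% of your observations"; the series names and values are toy placeholders.

```python
import random

# Hypothetical toy data: series name -> observations in time order.
all_series = {f"store_{i}": list(range(10)) for i in range(5)}

rng = random.Random(0)  # fixed seed so the split is repeatable
names = sorted(all_series)

# Assumption: hold out roughly 20% of the series, whole, as the test set.
n_test = max(1, int(len(names) * 0.2))
test_names = set(rng.sample(names, k=n_test))
test_set = {name: all_series[name] for name in test_names}

# For every remaining series, the last 20% of its timeline is validation.
train_set, val_set = {}, {}
for name in names:
    if name in test_names:
        continue
    values = all_series[name]
    cut = int(len(values) * 0.8)
    train_set[name] = values[:cut]
    val_set[name] = values[cut:]
```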

However, you always must make sure that you are standing in the right spot for your predictions.

# ¿ Mar 15, 2022 18:53

dexefiend: Apr 25, 2003; THE GOGGLES DO NOTHING!

That is what I say when I am thinking about a model.

Make sure that you only use inputs that are actually available at the point in time that you will be predicting from.

The term that is often used is "data leakage."

My clumsy thoughts:

Make sure:

data in your testing set is never incorporated in your training -- either for feature engineering or fitting the model
you aren't using time travel*

* Time travel is how I think about the issue of availability of data or timeliness of data in relation to using it.
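The first item in that list (test data sneaking into feature engineering) can be sketched with something as simple as standardizing a column. The numbers are made up for illustration. The point is that the mean and standard deviation come from the training split only and are then reused on the test split.

```python
from statistics import mean, pstdev

# Made-up values, for illustration only.
train_values = [10.0, 12.0, 14.0, 16.0]
test_values = [18.0, 20.0]

# Learn the scaling parameters from the training split only.
mu = mean(train_values)
sigma = pstdev(train_values)


def standardize(values):
    """Scale values with the statistics learned from the training split."""
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_values)
test_scaled = standardize(test_values)

# The leaky version: the test rows shift the mean the model gets to "see".
leaky_mu = mean(train_values + test_values)
print(mu, leaky_mu)
```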

Clumsy example:
Say you are building a model of retail sales, and in model development you have monthly unemployment as an input.

Junior Data Scientist uses Monthly Unemployment as an input on Monthly Sales.

The problem is that monthly unemployment is released with a lag. You would never have February monthly unemployment available at the time you would want to predict February Sales. You would already have your February Sales completely accounted for by the time you got the February Unemployment numbers.
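A small sketch of the fix: only feed the model the newest figure that would already have been published when the prediction is made. The unemployment numbers below are invented, and the one-month release lag is an assumption you would replace with the real publication schedule for your data source.

```python
# Hypothetical monthly figures, for illustration only.
months = ["2021-11", "2021-12", "2022-01", "2022-02"]
unemployment = {"2021-11": 4.2, "2021-12": 3.9, "2022-01": 4.0, "2022-02": 3.8}

# Assumption: the figure for a month is not available until the next month.
RELEASE_LAG_MONTHS = 1


def unemployment_feature(target_month):
    """Return the newest figure already published when predicting target_month."""
    idx = months.index(target_month) - RELEASE_LAG_MONTHS
    if idx < 0:
        return None  # nothing has been published yet
    return unemployment[months[idx]]


# Predicting February sales uses January's figure, not February's.
feb_input = unemployment_feature("2022-02")
print(feb_input)
```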

Does that help?

# ¿ Mar 17, 2022 17:00

dexefiend: Apr 25, 2003; THE GOGGLES DO NOTHING!

In addition to PCA, try t-SNE. It is loving amazing.

You might need to do some matrix simplification before t-SNE, but...

Seriously, try t-SNE.

t-distributed Stochastic Neighbor Embedding
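If you want something to copy and poke at, here is a minimal sketch using scikit-learn's bundled digits dataset. I am reading "matrix simplification" as reducing dimensions with PCA first, which is my assumption, and the subset size, the 30 components, and the perplexity are arbitrary choices for a quick run.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small subset of the bundled digits data (64 features per sample).
digits = load_digits()
X = digits.data[:300]
labels = digits.target[:300]

# Assumed "simplification" step: shrink 64 features to 30 with PCA.
X_reduced = PCA(n_components=30, random_state=0).fit_transform(X)

# Embed into 2D for plotting; perplexity must be below the sample count.
embedding = TSNE(
    n_components=2, perplexity=30, random_state=0
).fit_transform(X_reduced)

print(embedding.shape)
```

Scatter-plot the two columns of embedding colored by labels to see how the digits cluster. Keep in mind t-SNE is for looking at data, and distances between far-apart clusters in the plot are not reliable.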

# ¿ Mar 20, 2024 14:45
