dexefiend
Apr 25, 2003

THE GOGGLES DO NOTHING!
I am supposedly a data scientist, but I feel more like a data janitor.

dexefiend
Apr 25, 2003

THE GOGGLES DO NOTHING!
I love SKLearn.

It's probably a deep hole to jump into, but I like to use the Pipeline approach for feature engineering and training. That way I can just call pipeline_obj.fit(x_train, y_train) [and iterate on my development until the model is good enough]



[Wait until I am done with all model development, and I can think of no further ways to improve]


[No seriously, you only get to do this one time]



[I mean it. Don't peek at your test set until you are done with your development.]


and then pipeline_obj.score(x_test, y_test).
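A minimal sketch of that workflow, for anyone who wants to see it end to end. The synthetic data from make_classification and the StandardScaler + LogisticRegression steps are stand-ins chosen for illustration, not anything specified in the post; only pipeline_obj, x_train, y_train, x_test, and y_test come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (assumption, not from the thread).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature engineering and the model live in one object, so the scaler's
# statistics are learned from the training rows only.
pipeline_obj = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Iterate on this step as much as you like.
pipeline_obj.fit(x_train, y_train)

# Do this once, at the very end.
test_accuracy = pipeline_obj.score(x_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Because the scaler sits inside the pipeline, score() applies the training-set scaling to the test rows instead of refitting on them, which is the main reason to bundle the steps this way.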

dexefiend
Apr 25, 2003

THE GOGGLES DO NOTHING!
But you need to "stand where you will be standing when you make your prediction".

If you are predicting every day, you need to be evaluating data "every day".

If you have "plenty of data", using 60% train, 20% validation, and 20% testing is also a good approach.

You can peek at the validation set for evaluation purposes, but never train on it. So you iteratively train/evaluate/improve using train and validation, and then do the final final final evaluation against your test set.

With multiple time series, you could hold out 20% of your entire observations as test, and then partition off the last 20% of each of your individual series as validation.

However, you always must make sure that you are standing in the right spot for your predictions.
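For a single time-ordered series, the 60/20/20 idea can be sketched like this. The helper name chronological_split and the toy list of day numbers are hypothetical, made up for the example; the key assumption is that the rows are already sorted oldest to newest, so nothing gets shuffled and the test block is always the most recent data.

```python
def chronological_split(rows, train_frac=0.6, valid_frac=0.2):
    """Split time-ordered rows into train/validation/test without shuffling.

    Hypothetical helper: assumes rows are sorted from oldest to newest.
    """
    n = len(rows)
    train_end = round(n * train_frac)
    valid_end = train_end + round(n * valid_frac)
    return rows[:train_end], rows[train_end:valid_end], rows[valid_end:]


# Toy stand-in for 100 daily observations, oldest first.
days = list(range(100))
train, valid, test = chronological_split(days)

print(len(train), len(valid), len(test))
```

Every validation row comes after every training row, and every test row comes after both, so each evaluation "stands" later in time than the data the model was fit on.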

dexefiend
Apr 25, 2003

THE GOGGLES DO NOTHING!
That is what I say when I am thinking about a model.

Make sure that you only use inputs that are actually available at the point in time that you will be predicting from.

The term that is often used is "data leakage."

My clumsy thoughts:

Make sure:
  • data in your testing set is never incorporated in your training -- either for feature engineering or fitting the model
  • you aren't using time travel*

Time travel is how I think about the issue of availability of data or timeliness of data in relation to using it.

Clumsy example:
So, say you are building some model on retail sales, and in model development you have an input that is monthly unemployment.

Junior Data Scientist uses Monthly Unemployment as an input on Monthly Sales.

The problem is that monthly unemployment is released with a lag. You would never have February monthly unemployment available at the time you would want to predict February Sales. You would already have your February Sales completely accounted for by the time you got the February Unemployment numbers.
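One way to avoid the time travel is to shift the lagged input so each row only sees what had been published by then. This is a rough sketch with pandas: the numbers are invented for illustration, the column names are hypothetical, and the one-month release lag is an assumption (check the real publication schedule for whatever series you use).

```python
import pandas as pd

# Made-up illustrative numbers, not real sales or unemployment data.
monthly = pd.DataFrame(
    {
        "sales": [100.0, 110.0, 105.0, 120.0],
        "unemployment": [4.1, 4.0, 4.2, 4.3],
    },
    index=["2023-01", "2023-02", "2023-03", "2023-04"],
)

# Assumption: the figure for a month is only published the following month.
RELEASE_LAG_MONTHS = 1

# Each row now carries the most recent unemployment figure that would
# actually have been available when predicting that month's sales.
monthly["unemployment_known"] = monthly["unemployment"].shift(RELEASE_LAG_MONTHS)

# The first month has no published figure yet, so it drops out.
features = monthly.dropna(subset=["unemployment_known"])[["unemployment_known"]]
target = monthly.loc[features.index, "sales"]

print(features.join(target))
```

With this layout, February sales are paired with January unemployment, which is what you would really have on hand at prediction time.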


Does that help?

dexefiend
Apr 25, 2003

THE GOGGLES DO NOTHING!
In addition to PCA, try T-SNE. It is loving amazing.


You might need to do some matrix simplification before T-SNE, but...


Seriously, try T-SNE.

t-Distributed Stochastic Neighbor Embedding
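If you want to try it, here is a small sketch using scikit-learn. It reads "matrix simplification" as reducing the dimensionality with PCA first, which is my interpretation rather than something the post spells out. The digits dataset, the 300-row subset, the 30 PCA components, and the perplexity of 30 are all arbitrary choices for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small built-in dataset, used here only as a stand-in for your own matrix.
digits = load_digits()
X = digits.data[:300]  # 300 samples, 64 features each

# Step 1: compress to a few dozen dimensions with PCA.
X_reduced = PCA(n_components=30, random_state=0).fit_transform(X)

# Step 2: embed into 2D with t-SNE for plotting.
embedding = TSNE(
    n_components=2, perplexity=30, random_state=0
).fit_transform(X_reduced)

print(embedding.shape)
```

The two columns of embedding can go straight into a scatter plot, colored by digits.target[:300] if you want to see how the classes cluster. Keep in mind that t-SNE output depends on perplexity and the random seed, so it is worth trying a few settings before reading much into the picture.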
