|
I am supposedly a data scientist, but I feel more like a data janitor.
|
# ¿ Mar 7, 2022 14:03 |
|
|
# ¿ May 17, 2024 16:22 |
|
I love SKLearn. It's probably a deep hole to jump into, but I like to use the Pipeline approach for feature engineering and training. That way I can just call pipeline_obj.fit(x_train, y_train) [and iterate my development until the model is good enough] [wait until I am done with all model development and can think of no further ways to improve] [no seriously, you only get to do this one time] [I mean it. Don't peek at your test set until you are done with your development.] and then pipeline_obj.score(x_test, y_test).
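A rough sketch of that workflow, in case it helps. The synthetic dataset, the step names, and the choice of StandardScaler plus LogisticRegression are my assumptions for illustration, not a recommendation for any particular problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real problem (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature engineering and the model live in one object, so the scaler is
# only ever fit on whatever data the pipeline is fit on.
pipeline_obj = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Iterate here: cross-validation on the training data only.
cv_scores = cross_val_score(pipeline_obj, x_train, y_train, cv=5)

# Once development is finished: fit on all training data, score the test set once.
pipeline_obj.fit(x_train, y_train)
test_score = pipeline_obj.score(x_test, y_test)
print(cv_scores.mean(), test_score)
```

The point of cross-validating on the training data is that you get a number to iterate against without ever touching x_test.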
|
# ¿ Mar 11, 2022 15:50 |
|
But you need to "stand where you will be standing when you make your prediction". If you will be predicting every day, you need to evaluate on data as it looks "every day". If you have "plenty of data", using 60% train, 20% validation, and 20% test is also a good approach. You can peek at the validation set for evaluation purposes, but never train on it. So you iteratively train/evaluate/improve using train and validation, and then run the final final final evaluation against your test set. With multiple time series, you could hold out 20% of your entire observations as test, and then partition the last 20% of each of your individual series as validation. However, you must always make sure that you are standing in the right spot for your predictions.
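Here is a minimal sketch of one way to do a 60/20/20 split on several series without shuffling. It is a simplified variant of the above: it takes both validation and test from the end of each series, in time order. The function name chronological_split and the store data are made up for illustration:

```python
def chronological_split(values, val_frac=0.2, test_frac=0.2):
    """Split one time-ordered sequence into train/validation/test, oldest first."""
    n = len(values)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    n_train = n - n_val - n_test
    train = values[:n_train]
    val = values[n_train:n_train + n_val]
    test = values[n_train + n_val:]
    return train, val, test


# Hypothetical series: each list is already sorted by time.
series_by_store = {
    "store_a": list(range(10)),
    "store_b": list(range(100, 120)),
}

# Split every series separately so no series leaks its future into training.
splits = {
    name: chronological_split(values)
    for name, values in series_by_store.items()
}

for name, (train, val, test) in splits.items():
    print(name, len(train), len(val), len(test))
```

Because nothing is shuffled, every validation point is later than every training point in the same series, and every test point is later still.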
|
# ¿ Mar 15, 2022 18:53 |
|
That is what I say when I am thinking about a model. Make sure that all your inputs are only available at the point in time that you will be predicting from. The term that is often used is "data leakage." My clumsy thoughts: Make sure:
Time travel is how I think about the issue of availability or timeliness of data in relation to using it. Clumsy example: say you are building a model of retail sales, and in model development you have monthly unemployment as an input. Junior Data Scientist uses Monthly Unemployment as an input on Monthly Sales. The problem is that monthly unemployment is released with a lag. You would never have February monthly unemployment available at the time you would want to predict February Sales. You would already have your February Sales completely accounted for by the time you got the February Unemployment numbers. Does that help?
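The unemployment example, sketched in code. All the numbers are invented, and the one-month release lag is an assumption; check the actual release schedule of whatever series you use. build_rows is a hypothetical helper name:

```python
# Made-up monthly figures, for illustration only.
months = ["2021-11", "2021-12", "2022-01", "2022-02"]
unemployment = {"2021-11": 4.2, "2021-12": 3.9, "2022-01": 4.0, "2022-02": 3.8}
sales = {"2021-11": 120.0, "2021-12": 180.0, "2022-01": 95.0, "2022-02": 101.0}


def build_rows(months, sales, unemployment, release_lag=1):
    """Pair each month's sales with the unemployment figure known at prediction time.

    release_lag is how many months behind the latest published figure is
    (assumed to be 1 here).
    """
    rows = []
    for i, month in enumerate(months):
        if i < release_lag:
            # No published figure exists yet for the earliest months.
            continue
        known_month = months[i - release_lag]
        rows.append(
            {
                "month": month,
                "unemployment_known": unemployment[known_month],
                "sales": sales[month],
            }
        )
    return rows


rows = build_rows(months, sales, unemployment)
for row in rows:
    print(row)
```

The February row uses the January unemployment figure, because that is the latest one you would actually have in hand when predicting February.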
|
# ¿ Mar 17, 2022 17:00 |
|
In addition to PCA, try t-SNE. It is loving amazing. You might need to do some matrix simplification before t-SNE, but... Seriously, try t-SNE: t-distributed Stochastic Neighbor Embedding.
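A quick sketch if you want to try it with scikit-learn. I am reading "matrix simplification" as reducing the dimensions with PCA first, which is my assumption. The digits dataset, the 300-row subset, the 30 components, and the perplexity value are all arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small built-in dataset; a subset keeps the run time short.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# Step 1 (assumed meaning of "matrix simplification"): shrink 64 features to 30.
X_reduced = PCA(n_components=30, random_state=0).fit_transform(X)

# Step 2: embed into 2 dimensions for plotting.
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X_reduced)

print(embedding.shape)
```

Plot embedding[:, 0] against embedding[:, 1], colored by y, and see whether the classes separate. Keep in mind that t-SNE is for looking at the data; distances between far-apart clusters in the plot are not reliable.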
|
# ¿ Mar 20, 2024 14:45 |