I'm a practicing data scientist looking to improve the state of the art in both industry and academia. I can't speak to frameworks or systems too much, but I can say a lot about how to manage AI/ML products, how that discipline fits into data science as a whole, and how data science fits into an organization's strategy. I also do research in areas related to AI/ML, although I don't publish in any of the big conferences (and I can tell you all about the horrible problems with that system). I'm glad to see this thread up, and I'm going to toss in links to the SAL data science thread as well as a collection of papers I've been curating during the pandemic.

# Mar 6, 2022 19:32

Forgot to mention: there's also a scientific programming thread in SAL that may have some overlap with topics discussed here.

# Mar 7, 2022 03:33

The key is not to use information from the test set in your feature engineering: do your split first, then engineer features. If your data has a hierarchical structure, it's best to split at the highest level of the hierarchy. For example, in an ad data set where each record represents an ad but multiple ads belong to a single campaign, you risk leakage if ads from the same campaign end up in both the training and test sets. Beyond that, the authors of Applied Predictive Modeling wrote a whole book on feature engineering.
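As a concrete sketch of that campaign-level split (the campaign IDs, features, and sizes here are all made up for illustration), scikit-learn's `GroupShuffleSplit` keeps every group entirely on one side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_ads = 100
campaign_ids = rng.integers(0, 10, size=n_ads)  # hypothetical: 10 campaigns, several ads each
X = rng.normal(size=(n_ads, 5))                 # placeholder ad-level features
y = rng.integers(0, 2, size=n_ads)              # e.g. clicked / not clicked

# Split by campaign so no campaign contributes rows to both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=campaign_ids))

train_campaigns = set(campaign_ids[train_idx])
test_campaigns = set(campaign_ids[test_idx])
# train_campaigns and test_campaigns are disjoint by construction
```

Any feature engineering (scaling, target encoding, etc.) then gets fit on the training indices only.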

# Mar 10, 2022 15:08

I've never seen a systematic treatment of hierarchical data in ML. There are papers on adapting individual classifiers to it, but not much general theory. For clustering, I'd create one record per city, with features derived from the underlying data using meaningful domain knowledge, and cluster those. But clustering is a trickier problem than people generally realize, so I'd also spend some time thinking about what I actually want out of the data.
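A minimal sketch of that aggregate-then-cluster approach, assuming toy transaction-level data; the city names, columns, and aggregates are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy low-level records: several rows per city.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C", "C"],
    "spend": [10.0, 12.0, 200.0, 180.0, 50.0, 55.0, 60.0],
    "visits": [1, 2, 5, 4, 3, 3, 2],
})

# One record per city: aggregate the rows into domain-driven features.
city_features = df.groupby("city").agg(
    mean_spend=("spend", "mean"),
    total_visits=("visits", "sum"),
)

# Scale before clustering so no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(city_features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
```

The interesting work is in choosing the aggregates, which is exactly where the domain knowledge comes in.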

# Mar 11, 2022 14:53

If you're not worried about dataset shift (but you should be).

# Mar 15, 2022 15:05

I know what you're talking about, but I think it's better described as a spurious correlation rather than data leakage. Leakage is something you did wrong; a spurious correlation is just there in the data set regardless of what you do. There's some very recent research on how to handle that (search for "invariant risk minimization"), but it's still a big open issue.
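A toy illustration of the difference, with an entirely synthetic "spurious" feature that agrees with the label at training time but flips in deployment (all numbers and names here are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_env(n, spurious_agreement):
    # x1 is the invariant signal; x2 is spurious, agreeing with y only
    # with probability `spurious_agreement`, which varies by environment.
    y = rng.integers(0, 2, size=n)
    x1 = y + rng.normal(scale=0.5, size=n)
    agree = rng.random(n) < spurious_agreement
    x2 = np.where(agree, y, 1 - y) + rng.normal(scale=0.1, size=n)
    return np.column_stack([x1, x2]), y

X_train, y_train = make_env(2000, 0.95)  # spurious feature looks nearly perfect
X_test, y_test = make_env(2000, 0.05)    # the correlation reverses at test time

clf = LogisticRegression().fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
# The model leans on x2, so accuracy collapses when the correlation flips,
# even though nothing about the train/test split was done "wrong".
```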

# Mar 22, 2022 22:29

Boris Galerkin posted:
> I'm just not really sure how this is any different from newton's method or like basically any numerical method that minimizes an objective/loss function.

The goal in traditional optimization is to fit the points you have. The goal here is to fit the points you don't have. There are a lot of methods in common, but the problems are fundamentally different.
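A small numpy sketch of the distinction: the optimizer can drive training error to essentially zero, but what ML cares about is the error on points you didn't fit (the curve and polynomial degree here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=10)

# Degree-9 polynomial through 10 points: the "optimization" succeeds,
# interpolating the training data almost exactly.
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# But the error on fresh points from the underlying curve is much larger:
# we fit the noise, not the function.
x_new = np.linspace(0.05, 0.95, 10)
y_new = np.sin(2 * np.pi * x_new)
test_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
```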

# Jun 15, 2022 20:12

That's entirely plausible. ChatGPT is trying to predict what a human would say in response to the prompt, and it probably has a lot of examples of that response in its training data.

# May 17, 2023 14:32

https://twitter.com/stokel/status/1726502623967392060

# Nov 20, 2023 17:33

Nektu posted:
> I read something that apparently generative AI is used to recognize diseases in very early stages, and I realized that I have no idea at all how AI can be applied to problems, and why it returns result.

Can you link us to what you read?

# Feb 25, 2024 15:51

If a generative AI produces images that are similar to a target image, how strongly does that suggest that the target image is AI-generated?

# Mar 12, 2024 16:57

Replication crisis in AI/ML when?

# Mar 19, 2024 21:17


# May 21, 2024 05:30

You might look into Bayesian optimization as an alternative to any RL-based approach.
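For flavor, here's a bare-bones Bayesian-optimization loop with a Gaussian-process surrogate and an upper-confidence-bound acquisition, on a made-up one-dimensional objective; everything here (the objective, kernel, and loop budget) is a toy assumption, not a recipe:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Stand-in for an expensive black-box objective we want to maximize.
def objective(x):
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))   # a few random initial evaluations
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
candidates = np.linspace(0, 1, 201).reshape(-1, 1)

for _ in range(10):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 1.96 * sigma               # explore where the model is uncertain
    x_next = candidates[[np.argmax(ucb)]]  # evaluate where the UCB is highest
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

best_x = float(X[np.argmax(y), 0])  # should land near the true optimum at 0.3
```

The appeal over RL here is sample efficiency: each loop iteration spends exactly one evaluation of the expensive objective.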

# Apr 10, 2024 22:01