|
Welcome to the Artificial Intelligence & Machine Learning megathread! Use this thread to discuss all things AI, ML, deep learning, GANs, CNNs, RNNs, transformers, data science, and all other kinds of gobbledygook!

This OP is a work in progress, and I'll admit upfront that I have a traditional software development background and therefore no significant formal education in data science or statistics. So help make this OP better by PMing me suggestions or posting them in this thread.

What is Artificial Intelligence?
Broadly, Wikipedia defines artificial intelligence as "intelligence demonstrated by machines, as opposed to natural intelligence displayed by animals including humans." And I think it's fair to call it a broad concept, because there are a ton of definitions for AI. In practice, AI is used in the form of machine learning, and more specifically, deep learning.

Okay, so what is Machine Learning then?
Again ripping from Wikipedia: "Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data." Simply put: given data, an algorithm will roughly learn the relationship between the variables in your data. You might already be familiar with a line of best fit; it's a simplistic example of how machine learning works, but it should hopefully give you a rough idea of what machine learning aims to accomplish. Of course, more excitingly, there's:

Deep Learning
Obligatory Wikipedia definition: "Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised." That's a lot of stuff! The operative word here is "deep", because these neural networks usually have multiple layers. Each layer typically extracts higher-level features from the raw input. A tangible example would be detecting faces in an image.
Here's what the layers might learn in this case:

Naturally, the problem with deep learning and having many layers is that it's memory- and compute-intensive. This is why desktop GPUs are often used for training deep learning models, and GPUs with high amounts of VRAM are preferred. That doesn't mean you need an RTX 3090; I've gotten away with an RTX 3070 with "only" 8GB of VRAM for deep learning, though I also have a machine with an RTX 2080 Ti that I use for more memory-intensive tasks. But a lot of modern deep learning frameworks (e.g., TensorFlow and PyTorch) are accelerated by NVIDIA's CUDA, and that means you might want a new-ish NVIDIA card to train deep neural networks. Some state-of-the-art models are even trained on multiple GPUs, but I've found that in practice that's only necessary if you want to squeeze the highest mean Average Precision out of your dataset or whatnot.

Okay, so what can you do with machine learning and deep learning?
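Even without a GPU, the line-of-best-fit example from the machine learning section above fits in a few lines. This is only a minimal sketch in plain Python; the data points are made up and `fit_line` is a hypothetical helper name, not something from any library:

```python
# Toy data that follows y = 2x + 1 exactly (invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training" is just estimating the two parameters from the data.
slope, intercept = fit_line(xs, ys)

# "Prediction" is applying the learned relationship to an unseen x.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```

The learned "model" here is just the pair `(slope, intercept)`; deep learning does the same kind of fitting with millions of parameters instead of two.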
Machine Learning Frameworks
For better or for worse, a lot of the popular frameworks are in Python these days. The two major deep learning frameworks are PyTorch and TensorFlow, and standard machine learning algorithms such as logistic regression or Support Vector Machines are part of scikit-learn. These three are all fairly easy to use, and there are often larger libraries built on top of them, such as the OpenMMLab ecosystem and Facebook's own libraries such as detectron2.

How to get started with AI/ML
There are tons of resources of varying quality out there. A few notable ones:
There's a lot to think about in the field of AI, ML, deep learning, or whatever buzzword is hot at the current time. My personal expertise is in computer vision, object detection and those sorts of tasks, so I'm happy to discuss and help out people with that in this thread.

In closing, I give you an image of a robot looking at a wall of science-y greeble, as this definitely represents AI/ML in a picture:

Macichne Leainig fucked around with this message at 18:01 on Feb 9, 2022 |
# ? Feb 9, 2022 17:38 |
|
Reserved for more.
|
# ? Feb 9, 2022 17:38 |
|
I peer review AI/ML papers for several journals, AMA

If you ask me whether your paper is novel or even slightly good, the nice answer is "no" and the mean answer is "lol no"
|
# ? Feb 10, 2022 06:59 |
|
QuarkJets posted:
I peer review AI/ML papers for several journals, AMA

what AI/ML research do you publish, personally?
|
# ? Feb 10, 2022 13:33 |
|
I'm a practicing data scientist looking to improve the state of the art both in industry and academia. I can't speak to frameworks or systems too much, but I can say a lot about how to manage AI/ML products, how this discipline fits into data science as a whole, and how data science fits into an organization's strategy. I also do research on areas related to AI/ML although I don't publish in any of the big conferences (and I can tell you all about the horrible problems with that system). I'm glad to see this thread up, and I'm going to toss in links to the SAL data science thread as well as a collection of papers I've been curating during the pandemic.
|
# ? Mar 6, 2022 19:32 |
|
pmchem posted:
what AI/ML research do you publish, personally?

I don't want to dox myself so I won't go into details, but I tend to work with applied imagery research
|
# ? Mar 6, 2022 19:46 |
|
Forgot to mention there's also a scientific programming thread in SAL that may have some overlap with topics discussed here.
|
# ? Mar 7, 2022 03:33 |
|
I am supposedly a data scientist, but I feel more like a data janitor.
|
# ? Mar 7, 2022 14:03 |
|
I feel the same. It's good that I have a background in both front-end and back-end development, because most of what I've done for work has been setting up tools to manage our datasets, models, etc and not, y'know, data science. But hey, someone's gotta do it I guess. Besides, we hired some people who actually know more of the data science stuff, and I'm absorbing a ton of information from them. It's nice.
|
# ? Mar 7, 2022 16:18 |
|
what ai/ml stuff should i be learning as a dev that does microservices, frontends, etc for business?
|
# ? Mar 8, 2022 03:06 |
|
barkbell posted:
what ai/ml stuff should i be learning as a dev that does microservices, frontends, etc for business?

None, probably. Alternatively, I'd suggest the fast.ai courses.
|
# ? Mar 8, 2022 10:49 |
|
okay thanks, ill check it out
|
# ? Mar 8, 2022 17:45 |
|
Anyone got advice on how to prevent leaking data between training and test datasets when doing feature engineering? My previous models didn't require very much, but the next one I'm going to take a crack at would require substantial amounts of feature engineering. More specifically, I've read that mostly you should be fine if you are only combining data from one "row", but if you combine several rows there is risk of leakage. If anyone has great resources to dig into feature engineering in general, I'd very much appreciate that too!
|
# ? Mar 10, 2022 09:46 |
|
The key is not to use information from the test set in your feature engineering. Do your split first, and then do feature engineering. If your data has a hierarchical structure, it's best to split at the highest level of the hierarchy. For example, if you have an ad data set where each record represents an ad but there are multiple ads for a given campaign, you run the risk of leakage if you include data from a single campaign in both the training and test sets. Beyond that, the authors of Applied Predictive Modeling wrote a book on feature engineering.
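A minimal sketch of splitting at the campaign level, assuming scikit-learn and NumPy are available. The `campaign_id` array and all the data are invented for illustration; `GroupShuffleSplit` is one scikit-learn splitter that keeps whole groups together, not necessarily what anyone in this thread uses:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical ad data: 12 ads spread over 4 campaigns (all values invented).
campaign_id = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X = np.arange(24, dtype=float).reshape(12, 2)  # raw, un-engineered columns
y = np.array([0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1])

# Split at the campaign level so no campaign lands on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=campaign_id))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

train_campaigns = set(campaign_id[train_idx].tolist())
test_campaigns = set(campaign_id[test_idx].tolist())
print("train campaigns:", sorted(train_campaigns))
print("test campaigns:", sorted(test_campaigns))

# Feature engineering would start here, using X_train / y_train only.
```

Any feature that combines several rows (campaign averages, counts, rankings) is then computed from training campaigns alone.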
|
# ? Mar 10, 2022 15:08 |
|
ultrafilter posted:
The key is not to use information from the test set in your feature engineering. Do your split first, and then do feature engineering.

Thank god I posted! I was literally about to feature engineer at the dataset level before splitting. I saved the book link and will dive into it! You just saved me possibly hours and hours of frustration.

Will this book help me understand how to handle hierarchical data in general? For example, say I have a bunch of property data from 100 cities, and I would want to cluster the cities based on the property data inside them. I have an idea how I would cluster the underlying properties as that is simple, but how I would cluster the upper hierarchy of cities that hold the properties has absolutely baffled me for at least a year, and I haven't been able to find info about it despite a lot of googling. (I might just be bad at googling though, and definitely bad at ML.)
|
# ? Mar 11, 2022 09:38 |
|
I've never seen a systematic treatment of dealing with hierarchical data in ML. There are papers on how to convert individual classifiers to deal with it, but not much on general theory. For clustering, I'd try to create a record for each city with features generated from the data based on some meaningful domain knowledge and try to cluster those. But clustering is a trickier problem than people generally realize, so I'd also spend some time thinking about what I want out of the data.
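A rough sketch of the "one record per city" idea, assuming pandas and scikit-learn. The column names, the numbers, the choice of summary statistics, and the use of k-means with two clusters are all placeholders; real features would have to come from domain knowledge:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical property-level data; columns and values are invented.
properties = pd.DataFrame({
    "city":  ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
    "price": [100, 120, 110, 130, 150, 900, 950, 870, 880, 940],
    "sqm":   [50, 60, 55, 62, 70, 120, 130, 110, 115, 125],
})

# One record per city: summarize the properties inside it.
city_features = properties.groupby("city").agg(
    median_price=("price", "median"),
    mean_sqm=("sqm", "mean"),
)

# Put the features on a comparable scale, then cluster the cities.
scaled = StandardScaler().fit_transform(city_features)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
city_features["cluster"] = kmeans.fit_predict(scaled)
print(city_features)
```

The cluster labels themselves are arbitrary integers; what matters is which cities end up sharing a label, and whether that grouping answers the question you actually care about.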
|
# ? Mar 11, 2022 14:53 |
|
Keisari posted:
Thank god I posted! I was literally about to feature engineer at the dataset level before splitting. I saved the book link and will dive into it! You just saved me possibly hours and hours of frustration.

I'm not experienced enough to answer authoritatively and am very curious to hear the answer from a practitioner as well. How I'd approach it based on what I think you're trying to do:

If you're flattening the data into a dataset, simply split the flattened-but-raw data set into test and training before feature engineering it. Put the test set on a shelf and don't touch it. Do your FE on your training set and build your model. Okay, now you're ready to test, but your test set doesn't have the same features. Apply the same feature engineering code you wrote to the test/validation set (reusing whatever it learned from the training set rather than refitting it), then feed the result to your model.
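The split-then-engineer order described above can be sketched as follows, assuming scikit-learn and NumPy. Standard scaling stands in for "feature engineering" here purely as an example, and the data is randomly generated:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # invented raw features
y = (X_raw[:, 0] > 50.0).astype(int)

# 1. Split the raw data first and shelve the test portion.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=0
)

# 2. Learn the feature-engineering parameters (here: per-column means and
#    standard deviations) from the training rows only.
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)

# 3. Much later, push the test rows through the already-fitted step.
#    Calling fit() again here would leak test-set statistics.
X_test = scaler.transform(X_test_raw)
print(X_train.shape, X_test.shape)
```

The same pattern applies to anything that learns from data: imputation values, category encodings, bin edges, and so on.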
|
# ? Mar 11, 2022 14:59 |
|
I love SKLearn.

It's probably a deep hole to jump into, but I like to use the Pipeline approach for feature engineering and training, that way I can just call pipeline_obj.fit(x_train, y_train)

[and iterate my development until the model is good enough]
[Wait until I am done with all model development, and I can think of no further ways to improve]
[No seriously, you only get to do this one time]
[I mean it. Don't peek at your test set until you are done with your development.]

and then pipeline_obj.score(x_test, y_test).
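A small sketch of that workflow, assuming scikit-learn. The synthetic dataset, the scaler and logistic regression steps, and the use of `cross_val_score` for the "iterate" phase are my own stand-ins, not details from the post:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature engineering and model live in one object, so the preprocessing
# is always fitted on whatever data fit() receives and nothing else.
pipeline_obj = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Development loop: cross-validate on the training data as often as needed.
cv_scores = cross_val_score(pipeline_obj, x_train, y_train, cv=5)
print("cross-validated accuracy:", cv_scores.mean())

# Once development is finished: fit on all training data, score the test set once.
pipeline_obj.fit(x_train, y_train)
test_accuracy = pipeline_obj.score(x_test, y_test)
print("held-out accuracy:", test_accuracy)
```

Because cross-validation refits the whole pipeline inside each fold, the scaler never sees the fold it is being evaluated on.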
|
# ? Mar 11, 2022 15:50 |
|
I like transformers. My favorite one is soundwave. I wonder if it sucked to be the one that turned into a gun for the other one... Like p sure they all already had lazers and stuff so that kinda seems like a wack special power. But actually this is a cool thread, looking forward to lurking.
|
# ? Mar 11, 2022 18:24 |
|
Thanks for the tips! That sure explains a lot why I wasn't able to find good, basic information on dealing with hierarchical data. I'll try to work around the hierarchies for now.

dexefiend posted:
I love SKLearn.

My current problem has data coming in every day. Can't I repeat the test on the final holdout by just waiting like a month to get fresh data and building a new holdout set? I don't see why not.
|
# ? Mar 15, 2022 11:08 |
|
If you're not worried about dataset shift (but you should be).
|
# ? Mar 15, 2022 15:05 |
|
ultrafilter posted:
If you're not worried about dataset shift (but you should be).

I am, kinda, if you mean that the holdout set will be biased in a different way compared to the train/test set. However, since only performance on the most fresh/holdout data matters, I guess this is not a critical issue? (Compared to if performance on the entire dataset was the goal)
|
# ? Mar 15, 2022 18:27 |
|
Keisari posted:
Thanks for the tips! That sure explains a lot why I wasn't able to find good, basic information on dealing with hierarchical data. I'll try to work around the hierarchies for now.

Again I wouldn't call myself very knowledgeable but IMO the answer to this is "it depends". If your goal is to build a model to predict on NEW data, then using data later in time as the test set is a good idea.

That said, you can get seasonality or other "errant" behavior in datasets in many weird ways. In my business our marketing response rate drops every February. If I was to make a model to predict whether a customer responds to marketing on 10 months of data and then have a validation set that is mostly February leads, I might think my model sucked. I should probably have two test/validation sets:
- a "test" set: a random sample split off before I build my training set, and
- a "validation" set: inputs that are most similar to my real world data, i.e. later in time.

If you build a model on a subset of time that doesn't encompass this seasonality, it can lead you to bad conclusions.

On the flip side, if you want to classify currently unclassified, "old" data, for example if you want to predict what a bunch of companies' revenue was at different times based on inputs from those companies and a model trained on other companies from 2005-2020, you probably DON'T want to have a validation set that is all 2020 company data.

CarForumPoster fucked around with this message at 18:53 on Mar 15, 2022 |
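The two-holdout idea from the post above (a random "test" set plus a later-in-time "validation" set) might look roughly like this, assuming pandas, NumPy, and scikit-learn. The `leads` table, the cutoff date, and the 20% fraction are all invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical leads table: one row per day for a year (values invented).
leads = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=365, freq="D"),
    "responded": np.random.default_rng(0).integers(0, 2, size=365),
})

# "Validation" set: the most recent slice, closest to what production will see.
cutoff = pd.Timestamp("2021-11-01")
recent_validation = leads[leads["date"] >= cutoff]
older = leads[leads["date"] < cutoff]

# "Test" set: a random sample from the older period, so it covers every season
# that the training data covers.
train, random_test = train_test_split(older, test_size=0.2, random_state=0)

print(len(train), len(random_test), len(recent_validation))
```

If the two holdouts disagree badly, that is a hint that seasonality or drift, rather than the model itself, is driving the score.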
# ? Mar 15, 2022 18:47 |
|
But you need to "stand where you will be standing when you make your prediction". If you are predicting every day, you need to be evaluating data "every day".

If you have "plenty of data", using 60% train, 20% validation, and 20% testing is also a good approach. You can peek at the validation set for evaluation purposes, but never train on it. So, you iteratively train/evaluate/improve using train and validation, and then do the final final final evaluation against your test set.

On multiple time series, you could hold out 20% of your entire observations as test, and then partition off the last 20% of each of your individual series as validation.

However, you must always make sure that you are standing in the right spot for your predictions.
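One way to get a 60/20/20 split is two calls to scikit-learn's `train_test_split`; a sketch with placeholder arrays. Note the assumption that rows are independent: this shuffles, so it is not appropriate for time-ordered data, where the validation slice should come from the end of each series instead:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(500, 2)  # placeholder features
y = np.arange(500) % 2               # placeholder labels

# First carve off 20% as the final test set ...
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# ... then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Train and validation get reused throughout development; the test arrays are only touched for the final evaluation.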
|
# ? Mar 15, 2022 18:53 |
|
CarForumPoster posted:
Again I wouldn't call myself very knowledgeable but IMO the answer to this is "it depends". If your goal is to build a model to predict on NEW data, then using data later in time as the test set is a good idea. That said, you can get seasonality or other "errant" behavior in datasets in many weird ways. In my business our marketing response rate drops every February. If I was to make a model to predict whether a customer responds to marketing on 10 months of data and then have a validation set that is mostly February leads, I might think my model sucked. I should probably have two test/validation sets:

Right, good points. I very well could see myself jumping to conclusions like that, for example concluding that a model in production has turned into trash even if it simply hit a seasonality that it can't assess well.

dexefiend posted:
But you need to "stand where you will be standing when you make your prediction".

Could you clarify what you mean by this standing thing? It sounds vaguely familiar but I just can't recall it.
|
# ? Mar 17, 2022 09:43 |
|
That is what I say when I am thinking about a model. Make sure that you only use inputs that are actually available at the point in time that you will be predicting from. The term that is often used is "data leakage." My clumsy thoughts: Make sure:
Time travel is how I think about the issue of availability of data or timeliness of data in relation to using it.

Clumsy example: So, if you are doing some model on retail sales, and in model development you have an input that is monthly unemployment. Junior Data Scientist uses Monthly Unemployment as an input on Monthly Sales.

The problem is that monthly unemployment is released with a lag. You would never have February monthly unemployment available at the time you would want to predict February Sales. You would already have your February Sales completely accounted for by the time you got the February Unemployment numbers.

Does that help?
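The unemployment example can be sketched with pandas by shifting the lagging input so each row only holds what was known at prediction time. All numbers are invented, and the one-month publication lag is an assumption made for the sketch, not a statement about any real statistical release schedule:

```python
import pandas as pd

# Hypothetical monthly data (all numbers invented).
monthly = pd.DataFrame(
    {
        "sales": [100, 90, 120, 130],
        "unemployment": [5.0, 5.2, 5.1, 4.9],
    },
    index=pd.period_range("2022-01", periods=4, freq="M"),
)

# Assumption for this sketch: the unemployment figure for a month is
# published about one month later, so when predicting month t the latest
# value actually in hand is the one for month t-1.
PUBLICATION_LAG_MONTHS = 1
monthly["unemployment_available"] = monthly["unemployment"].shift(
    PUBLICATION_LAG_MONTHS
)

# Train only on rows where the lagged input exists, and never on the
# same-month "unemployment" column.
model_table = monthly.dropna(subset=["unemployment_available"])
print(model_table[["sales", "unemployment_available"]])
```

February sales are now paired with January unemployment, which is what you would actually have had on hand.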
|
# ? Mar 17, 2022 17:00 |
|
dexefiend posted:
That is what I say when I am thinking about a model.

Ah! Yes! I got 99 problems but training the network on data not available during production ain't one. I've tried to be very mindful of this stuff
|
# ? Mar 17, 2022 21:26 |
|
Not sure my thinking is 100% clear on this, but I think a more subtle form of "data leakage" is when there is some common causal factor in both your training and test sets that you aren't factoring out* (or you're including data markers that maybe you shouldn't) and that isn't going to be meaningfully predictive going forward. Like if you were predicting mortgage defaults and had data including 2008-2010 and the algorithm "figures out" to just triple (or whatever) rates for those years, you could think your model was much better at predicting than it actually will be. Probably a bad example, since it kind of mushes in the time travel issue unless you just had one data point per mortgage or something... so maybe if there were crashes in some regions of the US, that would be a better example (unless those crashes were endemic and non-correcting). Data separation isn't always the best approach in those cases (compared to factoring out), though: if your training set has those years or regions with lots of crashes and your test set doesn't, then your model might actually be worse than it could be.

*factoring out could mean something like changing the performance metric to "rate of default below/above baseline for that time period", or in the better example "rate of default above/below baseline for that region and time period", though you would compute the baselines separately for the test and training sets.... or you could just strip the years/regions (though then you're going to miss any endemic signals, if there are any)

pangstrom fucked around with this message at 21:13 on Mar 22, 2022 |
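One possible reading of the "factoring out" footnote, sketched with pandas: express each loan's outcome relative to the baseline default rate for its region and year, with the baseline computed separately inside the training and test tables. The helper name `add_excess_default`, the column names, and the data are all hypothetical:

```python
import pandas as pd


def add_excess_default(loans):
    """Return a copy with the default flag expressed relative to the
    region/year baseline rate computed within this same table."""
    out = loans.copy()
    out["baseline_rate"] = (
        out.groupby(["region", "year"])["defaulted"].transform("mean")
    )
    out["excess_default"] = out["defaulted"] - out["baseline_rate"]
    return out


# Invented example data: one row per mortgage.
train = pd.DataFrame({
    "region": ["north"] * 4 + ["south"] * 4,
    "year": [2008] * 8,
    "defaulted": [1, 1, 1, 0, 0, 0, 1, 0],
})
test = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2008] * 4,
    "defaulted": [1, 0, 0, 0],
})

# Baselines are computed separately for each table, as the footnote suggests.
train_adj = add_excess_default(train)
test_adj = add_excess_default(test)
print(train_adj)
print(test_adj)
```

Whether a baseline-relative target like this is the right call depends on the problem; it removes the region/year effect but, as noted above, it also hides any signal that really is persistent.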
# ? Mar 22, 2022 21:11 |
|
pangstrom posted:
Not sure my thinking is 100% clear on this, but I think a more subtle form of "data leakage" is when there is some common causal factor in both your training and test sets that you aren't factoring out* (or you're including data markers that maybe you shouldn't) and that isn't going to be meaningfully predictive going forward. Like if you were predicting mortgage defaults and had data including 2008-2010 and the algorithm "figures out" to just triple (or whatever) rates for those years, you could think your model was much better at predicting than it actually will be. Probably a bad example, since it kind of mushes in the time travel issue unless you just had one data point per mortgage or something... so maybe if there were crashes in some regions of the US, that would be a better example (unless those crashes were endemic and non-correcting). Data separation isn't always the best approach in those cases (compared to factoring out), though: if your training set has those years or regions with lots of crashes and your test set doesn't, then your model might actually be worse than it could be.

That's not data leakage. Data leakage is when your training set knows a factor it otherwise would not know in practice. Sometimes this can be a restatement of your Y variable. E.g. I am trying to predict which leads will convert to customers. My data include all info I know about customers, including their lifetime values. Obviously only companies with an LTV > 0 are customers, so I can't include the LTV column.

The underlying ecosystem changing in a way that makes your data inaccurate is just reality being a bitch and the nature of time-variant systems.
|
# ? Mar 22, 2022 21:49 |
|
CarForumPoster posted:
That's not data leakage.

tail end of this page has stuff like what I'm talking about: https://towardsdatascience.com/data-leakage-in-machine-learning-how-it-can-be-detected-and-minimize-the-risk-8ef4e3a97562
|
# ? Mar 22, 2022 21:54 |
|
I know what you're talking about but I think it's better described as spurious correlations rather than data leakage. Data leakage is something you did wrong, but a spurious correlation is just there in the data set regardless of what you do. There's some very recent research on how to handle that (search for "invariant risk minimization") but it's still a big issue.
|
# ? Mar 22, 2022 22:29 |
|
pangstrom posted:
Well, the more subtle form of data leakage I'm talking about in converting customers could be like a difference in sales employee quality and those employees being in both training and test sets. Like your model would know Brad is awesome, Todd sucks, etc., but with new employees or putting the model in use for a different company it would do worse than you would expect.

That is a great (and 100% real thing I encountered!) example of the tyranny of time-variant systems and the limitations of using past/incomplete data to predict the future. How to handle that situation could be very important to the usefulness of your model! It leads down the path of asking what problem this model solves. Do you remove the employee column to make the model more generalizable/prevent overfitting (random forest loves this sort of thing)? Should you come up with an employee quality score feature based on their performance?

not data leakage tho
|
# ? Mar 23, 2022 12:06 |
|
I agree it's different than the simple leakage between training/target sets, but seems to fit the "use of information in the model training process which would not be expected to be available at prediction time" definition? But whatever, semantics.
|
# ? Mar 23, 2022 13:48 |
|
pangstrom posted:
I agree it's different than the simple leakage between training/target sets, but seems to fit the "use of information in the model training process which would not be expected to be available at prediction time"

It’s not tho, you have a priori knowledge of the employee. How you handle that employee knowledge, incl. for employees that no longer exist or didn't at time of training, is something you can handle in various ways. You definitely have it at prediction time though.
|
# ? Mar 25, 2022 10:31 |
|
I just ran across this: https://algorithmsbook.com/files/dm.pdf via https://news.ycombinator.com/item?id=31123683 “Algorithms for Decision Making”. The pdf will always be free. Looks like a nice survey. Does anyone else have free (not pirated) e-pdf algorithm books to recommend? I like to collect them for my little PDF library, covering a range of topics. HPC/Scientific computing, AI/ML, basic comp sci, computational geometry, etc.
|
# ? Apr 26, 2022 15:00 |
|
mediocre book you've actually read and did problems from beats great book you hoard in your pdf trove any day of the week
|
# ? Apr 26, 2022 15:29 |
|
bob dobbs is dead posted:
mediocre book you've actually read and did problems from beats great book you hoard in your pdf trove any day of the week

While I don’t disagree, I’ve already done my homework years ago and actively work. Some of us still like to keep a personal library up to date.
|
# ? Apr 26, 2022 16:29 |
|
so have i and so do i, altho this 'relevance eng' job just involves normal software dev nowadays. still applies imo
|
# ? Apr 26, 2022 16:30 |
|
bob dobbs is dead posted:
so have i and so do i, altho this 'relevance eng' job just involves normal software dev nowadays. still applies imo

That’s great. Care to share any resources you’ve found useful?
|
# ? Apr 26, 2022 16:38 |
|
one of my more deranged things i do is to get a pdf copy, then a separate paper copy, and then start tearing out pages from textbook as i read em. gives sense of progress. i also write on textbooks very liberally

same w papers, but this is less deranged behavior w papers

one book ive read that punches above its weight in my thinkin is michael i jordan and sejnowski's graphical models book. sejnowski's rbm was material part of the start of all this and if you wanna be a weenie about explainability imo you're better off shoving graphical models in poo poo. koller and friedman have a deec graphical models textbook but it goes less hard into the statistical physics than id like

hybrid graphical models and neural crap was how waymo did stuff for a fair bit. now i'm given to understand they have more rl poo poo in there. that's been my reading lately but i dont think there's a definitive good neural rl book yet. but shoving rl and pgm into one book is unfortunately pretty vast and pgm stuff has a lot of practitioner voodoo in it unfortunately

bob dobbs is dead fucked around with this message at 17:03 on Apr 26, 2022 |
# ? Apr 26, 2022 16:47 |