CarForumPoster

barkbell posted:

what ai/ml stuff should i be learning as a dev that does microservices, frontends, etc for business?

None, probably.

Alternatively, I'd suggest the fast.ai courses.


CarForumPoster

Keisari posted:

Thank god I posted! I was literally about to feature engineer at the dataset level before splitting. I saved the book link and will dive into it! You just saved me possibly hours and hours of frustration.

Will this book help me understand how to handle hierarchical data in general? For example, say I have a bunch of property data from 100 cities, and I want to cluster the cities based on the property data inside them. I have an idea of how I would cluster the underlying properties, as that is simple, but how to cluster the upper hierarchy of cities that hold the properties has absolutely baffled me for at least a year, and I haven't been able to find info about it despite a lot of googling. (I might just be bad at googling though, and definitely bad at ML.) :eng99:

I'm not experienced enough to answer authoritatively and am very curious to hear the answer from a practitioner as well.

How I'd approach it, based on what I think you're trying to do: if you're flattening the data into a dataset, simply split the flattened-but-raw dataset into training and test sets before feature engineering it. Put the test set on a shelf and don't touch it. Do your FE on your training set and build your model. Now you're ready to test, but your test set doesn't have the same features: apply whatever code you wrote for feature engineering to the test/validation set only, then feed it to your model.
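A minimal sketch of that workflow in pandas/sklearn (the file name, column names, and the add_features() helper are all made up for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical flattened property dataset with a "target" column;
# file and column names are invented for the example.
raw = pd.read_csv("properties_flat.csv")

# 1. Split the raw data FIRST and put the test set on a shelf.
train_raw, test_raw = train_test_split(raw, test_size=0.2, random_state=42)

def add_features(df):
    # Placeholder for whatever feature engineering you do. Anything that is
    # *learned* (scalers, encoders, aggregate stats) should be fit on the
    # training set only and then reused here.
    df = df.copy()
    df["price_per_sqm"] = df["price"] / df["area_sqm"]
    return df

# 2. Feature-engineer and fit on the training set only.
train = add_features(train_raw)
model = RandomForestClassifier(random_state=0)
model.fit(train.drop(columns="target"), train["target"])

# 3. Only now apply the SAME feature code to the untouched test set.
test = add_features(test_raw)
print(model.score(test.drop(columns="target"), test["target"]))
```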

CarForumPoster

Keisari posted:

Thanks for the tips! That explains a lot about why I wasn't able to find good, basic information on dealing with hierarchical data. I'll try to work around the hierarchies for now.

My current problem has data coming in every day. Can't I repeat the test on the final holdout by just waiting like a month to get fresh data and building a new holdout set? I don't see why not.

Again, I wouldn't call myself very knowledgeable, but IMO the answer to this is "it depends". If your goal is to build a model to predict on NEW data, then using data later in time as the test set is a good idea. That said, you can get seasonality or other "errant" behavior in datasets in many weird ways. In my business our marketing response rate drops every February. If I were to build a model to predict whether a customer responds to marketing on 10 months of data and then have a validation set that is mostly February leads, I might think my model sucked. I should probably have two test/validation sets:
- a "test" set: a random sample split off before I build my training set, and
- a "validation" set: inputs that are most similar to my real-world data, i.e. later in time.

If you build a model on a subset of time that doesn't encompass this seasonality, it can lead you to bad conclusions.

On the flip side, if you want to classify currently unclassified "old" data (for example, predicting what a bunch of companies' revenue was at different times based on inputs from those companies and a model trained on other companies from 2005-2020), you probably DON'T want a validation set that is all 2020 company data.
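Roughly what keeping both sets could look like, assuming a pandas DataFrame with a date column (all the names here are invented for the sketch):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical marketing-leads table with a "created_at" date column and a
# binary "responded" target; names are made up for the example.
leads = pd.read_csv("leads.csv", parse_dates=["created_at"])

# "Validation" set: the most recent ~10% of leads, i.e. the slice that looks
# most like the data the deployed model will actually score.
cutoff = leads["created_at"].quantile(0.9)
validation = leads[leads["created_at"] >= cutoff]
earlier = leads[leads["created_at"] < cutoff]

# "Test" set: a plain random sample split off before any feature engineering.
train, test = train_test_split(earlier, test_size=0.2, random_state=0)

# train -> fit the model; test -> generic generalization check;
# validation -> does it survive the later-in-time (e.g. February) data?
```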


CarForumPoster

pangstrom posted:

Not sure my thinking is 100% clear on this, but I think a more subtle form of "data leakage" is when there is some common causal factor in both your training and test sets that you aren't factoring out* (or you're including data markers that maybe you shouldn't), and that isn't going to be meaningfully predictive going forward. Like if you were predicting mortgage defaults and had data including 2008-2010, and the algorithm "figures out" to just triple (or whatever) the rates for those years, you could think your model was much better at predicting than it actually will be. Probably a bad example, since it kind of mushes in the time-travel issue unless you just had one data point per mortgage or something... so maybe crashes in some regions of the US would be a better example (unless those crashes were endemic and non-correcting). Data separation isn't always the best approach there (compared to factoring out), though: if your training set has those years or regions with lots of crashes and your test set doesn't, then your model might actually be worse than it could be.

*Factoring out could mean changing the performance metric to "rate of default below/above baseline for that time period", or in the better example "rate of default above/below baseline for that region and time period", though you would compute the baselines separately for the test and training sets... or you could just strip out the years/regions (though then you're going to miss any endemic signals, if there are any).

That's not data leakage.

Data leakage is when your training set knows a factor it otherwise would not know in practice. Sometimes this can be a restatement of your Y variable.

E.g. I am trying to predict which leads will convert to customers. My data include all info I know about customers, including their lifetime values. Obviously only companies with an LTV > 0 are customers, so I can't include the LTV column.
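In pandas terms, that's just making sure the leaky column never reaches the model. A tiny sketch with invented file and column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical lead table; column names are invented for the example.
leads = pd.read_csv("leads.csv")

# "lifetime_value" is only ever > 0 for leads that already became customers,
# so it's basically a restatement of the target -- drop it before training.
LEAKY_COLUMNS = ["lifetime_value"]

X = leads.drop(columns=["converted"] + LEAKY_COLUMNS)
y = leads["converted"]

model = LogisticRegression(max_iter=1000).fit(X, y)  # assumes numeric features
```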

The underlying ecosystem changing in a way that makes your data inaccurate is just reality being a bitch and the nature of time-variant systems.

CarForumPoster

pangstrom posted:

Well, the more subtle form of data leakage I'm talking about in converting customers could be like a difference in sales employee quality, with those employees being in both the training and test sets. Like your model would know Brad is awesome, Todd sucks, etc., but with new employees, or when putting the model to use for a different company, it would do worse than you would expect.

That is a great example (and a 100% real thing I've encountered!) of the tyranny of time-variant systems and the limitations of using past/incomplete data to predict the future. How you handle that situation could be very important to the usefulness of your model! It leads down the path of asking: what problem does this model solve? Do you remove the employee column to make the model more generalizable and prevent overfitting (random forest loves this sort of thing)? Should you come up with an employee quality score feature based on their performance?

not data leakage tho
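For what it's worth, here's roughly what that employee-quality-score feature could look like. It's a sketch only (file and column names invented); the key point is that the score is computed from training rows only, so it doesn't become leakage itself:

```python
import pandas as pd

# Hypothetical train/test lead records with a "sales_rep" column and a binary
# "converted" target; names are made up for the example.
train = pd.read_csv("train_leads.csv")
test = pd.read_csv("test_leads.csv")

# Historical conversion rate per rep, computed on the TRAINING rows only.
rep_score = train.groupby("sales_rep")["converted"].mean()
overall_rate = train["converted"].mean()

# Map the score onto both sets; reps unseen in training fall back to the
# overall rate.
train["rep_quality"] = train["sales_rep"].map(rep_score)
test["rep_quality"] = test["sales_rep"].map(rep_score).fillna(overall_rate)

# Drop the raw "sales_rep" column before fitting if you want the model to
# generalize to Brads and Todds it has never seen.
train = train.drop(columns="sales_rep")
test = test.drop(columns="sales_rep")
```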

CarForumPoster

pangstrom posted:

I agree it's different than the simple leakage between training/target sets, but seems to fit the "use of information in the model training process which would not be expected to be available at prediction time"

It's not, though: you have a priori knowledge of the employee. How you use that employee knowledge, including for employees who no longer exist or didn't exist at training time, is something you can handle in various ways. You definitely have it at prediction time, though.
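For the employees-that-didn't-exist-at-training-time part, one common option is an encoder that simply zeroes out unseen categories. A small sklearn sketch with made-up names:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy example: "NewHire" never appears in the training data.
train = pd.DataFrame({"sales_rep": ["Brad", "Todd", "Brad"]})
new_leads = pd.DataFrame({"sales_rep": ["Todd", "NewHire"]})

# handle_unknown="ignore" encodes reps never seen during training as an
# all-zeros row instead of raising an error at prediction time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["sales_rep"]])
print(enc.transform(new_leads[["sales_rep"]]).toarray())
# [[0. 1.]    <- Todd
#  [0. 0.]]   <- NewHire, unknown at training time
```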

CarForumPoster

poty posted:

I'm currently working on "babby's first ML project" for a school assignment: setting up an SVM classifier to guess whether a stock will go up the next day, from a handful of features I came up with (that obviously can't predict that), using sklearn in Python.

The thing that frustrates me about sklearn is that I don't know how to guess whether a model fitting (as part of a RandomizedSearchCV for example) is going to take 5 seconds, or 5 minutes, or seemingly infinite time. I wish you could set some sort of timeout value like "if this set of parameters hasn't finished the fitting process in 5 seconds just kill it" but it doesn't look like that exists.

You can create watchdog timers in Python, then run .fit() inside them.
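Roughly like this: a sketch (toy data, arbitrary parameters) of a watchdog that runs the fit in a separate process, so a slow candidate can actually be killed:

```python
import multiprocessing as mp

from sklearn.datasets import make_classification
from sklearn.svm import SVC

def fit_job(params, X, y):
    # The fitted model is thrown away here; we only care whether it finishes.
    SVC(**params).fit(X, y)

def fit_with_timeout(params, X, y, timeout=5):
    """Return True if SVC(**params).fit() finished within `timeout` seconds."""
    p = mp.Process(target=fit_job, args=(params, X, y))
    p.start()
    p.join(timeout)
    if p.is_alive():          # still fitting after the deadline -> kill it
        p.terminate()
        p.join()
        return False
    return True

if __name__ == "__main__":
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    print(fit_with_timeout({"C": 1.0, "kernel": "rbf"}, X, y, timeout=5))
```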

CarForumPoster

poty posted:

Thanks for the suggestion, but I don't think it works if I understand things correctly. There is only one call to .fit() for the whole RandomizedSearchCV optimization process, as opposed to one call to .fit() for each of the 100 sets of parameters (5 of which might never complete). I can put the global .fit() call inside a timer but then I don't get results for any set of parameters if one is slow (it's the same result as what I'm doing now, killing the kernel in Jupyter Notebook).

I guess the solution is to evaluate the sets of parameters "manually" without using something like RandomizedSearchCV or GridSearchCV, then I would in fact have access to the individual .fit() calls and could do that.

Ahh, yea, I misunderstood.
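If you do end up evaluating the parameter sets manually, sklearn's ParameterSampler draws the same kind of random candidates RandomizedSearchCV would, and each candidate's cross-validated fit can then get its own timeout. A rough sketch on toy data:

```python
import multiprocessing as mp

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterSampler, cross_val_score
from sklearn.svm import SVC

def score_job(params, X, y, queue):
    # Cross-validated score for one candidate, sent back through the queue.
    queue.put(cross_val_score(SVC(**params), X, y, cv=3).mean())

def score_with_timeout(params, X, y, timeout=5):
    """Return the mean CV score, or None if the candidate blows the timeout."""
    queue = mp.Queue()
    p = mp.Process(target=score_job, args=(params, X, y, queue))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        p.join()
        return None
    try:
        return queue.get(timeout=1)   # guard against a worker that died early
    except Exception:
        return None

if __name__ == "__main__":
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    space = {"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e1)}
    results = []
    for params in ParameterSampler(space, n_iter=20, random_state=0):
        score = score_with_timeout(params, X, y, timeout=5)
        results.append((score, params))
        print(params, "->", score)
    kept = [r for r in results if r[0] is not None]
    print("best:", max(kept, key=lambda r: r[0]))
```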

CarForumPoster

They fired the face of the company a few days after a major public update?

CarForumPoster

I can't really figure out Twitter, so forgive me... is there a controversial part of this or something? Seems like a dude asking *presumably* genuine questions about sex topics. If the questions are being asked in bad faith or something, it's not shown in this tweet, and I can't find the context easily by clicking, or I hit a login wall.


CarForumPoster

Jabor posted:

You can't figure out what's controversial about saying 40-60% of women have rape fantasies? Like this is just a normal discussion you'd have with workmates around the water cooler?

I don't think statements of fact derived from a credible source and presented in good faith in a public discussion on that topic should be particularly controversial. It'd be extremely inappropriate to say this in a work environment, especially when you're a CEO or manager, but Twitter isn't the workplace and I don't have any reason to know they work together.

CarForumPoster
For the record, I agree it's in bad taste to discuss publicly, particularly when you have hundreds of employees. "A guy said some dumb but not hateful internet stuff" is not something I'd want to be hung for, though.


On to AI:

Last year I made an easy FastAI image recognition API to classify and grade American coins, as well as pull out a few attributes. Lil hobby project, but I definitely put a couple weeks into it. Worked well enough to be useful.

I tried a customized GPT on 10 or so samples via ChatGPT, and in like 6 or 7 prompts it gave nearly comparable results, including now being an API that returns JSON. It was better at extracting dates but generally worse (though not uselessly so; I'd probably use it in a voting system) at classifying the type of coin. Absolutely incredible results for how little effort it took. The big problem, though, is that I can't feed its mislabeled data back in to improve it; I can seemingly only tune the prompts.
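By "voting system" I just mean something as simple as a majority vote across the models' labels, falling back to the stronger model on ties. A toy sketch (the labels and structure are made up):

```python
from collections import Counter

def vote(labels, fallback):
    """Majority vote over candidate labels; ties go to the fallback label."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return fallback
    return top_label

# Hypothetical predictions for one coin image.
fastai_label = "morgan_dollar"     # fine-tuned FastAI classifier
gpt_label = "peace_dollar"         # parsed from the custom GPT's JSON reply
gpt_label_retry = "morgan_dollar"  # a second GPT call on the same image

print(vote([fastai_label, gpt_label, gpt_label_retry], fallback=fastai_label))
# -> morgan_dollar
```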


CarForumPoster

Xun posted:

When you look at a GitHub page for a research paper, what do you guys want to see in the readme? I'm updating mine for a publication and sadly the only advice I'm getting from my labmates is "the code is all there and there's a bibtex citation, why are you worrying :confused:"

Honestly I'm usually pretty happy with just "type this to run model" but idk if that's the norm lol

Ideally, a working demo, or at least a GIF of what to expect, if it's something that has dataviz or can be made into a web app fairly easily.


QuarkJets posted:

I review research papers from time to time, and the amount of stuff in a project's README tends to be all over the map, from less than useless to comprehensive.

Put in whatever you personally would want to see if you were coming in fresh, having no experience with the specific project. Include details on how to run the model and any additional details that you feel are important. It doesn't need to be a comprehensive SDK, but it should provide high-level details that would be helpful to anyone seeking to build upon your work.

Assume that you'll want to show this project and its README to future prospective employers, too - put in the extra 60 seconds to give it a title, section headings, pass it through `aspell`, etc.

This
