huhu posted:Edit: I'm doing a bunch of dumb crap. Taking a break. Carry on. Another rubber duck success story
|
|
# ? Mar 11, 2020 12:32 |
|
|
# ? May 14, 2024 06:53 |
|
Plasmafountain fucked around with this message at 23:58 on Feb 27, 2023 |
# ? Mar 15, 2020 14:52 |
|
Zero Gravitas posted:Is there a way to create pandas dataframes in a class object? (I think that's the term.) code:
code:
|
# ? Mar 15, 2020 15:32 |
|
skull mask mcgee posted:You need to assign the dataframe to an instance variable, like so: I've never thought to use a df in a class before, good example. If you're adding one df of N rows to the end of another (think adding a new row in a db) you use pd.concat([df1, df2], ignore_index=True/False)
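A minimal sketch of both points above (the class name, column names, and method are made up for illustration): a DataFrame held as an instance variable, with new rows appended via pd.concat.

```python
import pandas as pd

class CallLog:
    """Hypothetical example: each object owns its own DataFrame."""

    def __init__(self):
        # Instance variable, created in __init__ so instances don't share state
        self.df = pd.DataFrame()

    def add_row(self, name, value):
        new_row = pd.DataFrame([{"name": name, "value": value}])
        if self.df.empty:
            self.df = new_row
        else:
            # ignore_index=True renumbers the combined index 0..n-1
            self.df = pd.concat([self.df, new_row], ignore_index=True)

log = CallLog()
log.add_row("a", 1)
log.add_row("b", 2)
```

Note that repeated concat in a loop is O(n^2); for bulk loads it's cheaper to collect rows in a list and build the frame once.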
|
# ? Mar 15, 2020 16:15 |
|
CarForumPoster posted:I've never thought to use a df in a class before, good example. Yeah, concat is right. I hadn’t really thought of doing what the OP is doing either. The closest I’ve come is having a method return a dataframe, with the data coming from a sqlite database.
|
# ? Mar 15, 2020 17:16 |
|
Why not use a dictionary? A dataframe seems like unnecessary overkill code:
code:
code:
lazerwolf fucked around with this message at 17:56 on Mar 15, 2020 |
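A sketch of the dict-first approach suggested above (the record fields are invented): accumulate plain dicts cheaply, and only build a DataFrame at the end if you actually need pandas operations.

```python
import pandas as pd

# Accumulate rows as a plain list of dicts -- no pandas needed yet
records = []
for i in range(3):
    records.append({"id": i, "squared": i ** 2})

# Plain-dict lookup, zero overhead
by_id = {r["id"]: r["squared"] for r in records}

# One DataFrame construction at the end beats repeated appends
df = pd.DataFrame(records)
```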
# ? Mar 15, 2020 17:25 |
|
Question: How should I one hot encode some categorical features where one DF element can contain multiple categories? I have a string list of dept of state executive titles in a DF element. They can appear in any order. For example: code:
e.g. desired output: code:
|
# ? Mar 16, 2020 20:37 |
|
CarForumPoster posted:Question: How should I one hot encode some categorical features where one DF element can contain multiple categories? Haven't tested it on the data set yet but here's a solution to share w/ the thread. I thought it was nifty. Pandas is just so loving powerful. code:
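The stripped code block isn't recoverable, but a common pandas approach to this exact problem (one cell holding several delimiter-separated categories, in any order) is `Series.str.get_dummies`. The titles and column name below are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "titles": ["Secretary, Ambassador", "Ambassador", "Ambassador, Envoy"]
})

# str.get_dummies splits each cell on the separator and emits one 0/1
# column per distinct category, regardless of the order they appear in
dummies = df["titles"].str.get_dummies(sep=", ")
out = df.join(dummies)
```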
|
# ? Mar 18, 2020 03:37 |
|
2 hour quarantine project https://gist.github.com/blindstitch/368940ed98993b78e07b75a66b2b6cb5
|
# ? Mar 19, 2020 20:06 |
|
I want to visualize how calls get routed in my company by looking at a count or the % that go through each hop. I have 1 call per row with all the hops. Something like: code:
I know graphviz/pydot/fastdot can make decision tree graphs from SKlearn, but any advice on how to do it for this application? EDIT: Solved it, solution was to do a groupby with .size() as the aggfunc then plot with fastdot (graphviz/pydot wrapper) Solution: code:
CarForumPoster fucked around with this message at 18:57 on Mar 23, 2020 |
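The groupby-with-.size() step described in the edit can be sketched like this (hop columns and values are invented; the graph-drawing half with fastdot/pydot is only indicated in comments):

```python
import pandas as pd

# One row per call, one column per hop
calls = pd.DataFrame({
    "hop1": ["IVR", "IVR", "IVR", "IVR"],
    "hop2": ["Sales", "Sales", "Support", "Sales"],
})

# Count how often each (hop1, hop2) transition occurs -- these become
# edge weights for the routing graph
edges = calls.groupby(["hop1", "hop2"]).size().reset_index(name="count")
edges["pct"] = 100 * edges["count"] / len(calls)

# Each row of `edges` is one labelled edge, e.g. IVR -> Sales (75%);
# feeding rows to pydot.Edge(hop1, hop2, label=...) draws the graph.
```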
# ? Mar 23, 2020 16:20 |
|
Think I'm fully onboard the attrs train now. The ability to define a converter function for attributes is extremely convenient when dealing with nested data structures.
|
# ? Mar 28, 2020 22:32 |
|
Can you show an example? I just use asdict(). Requires hashable (frozen=True) for something that will nest as keys though. But has the advantage of just dumping the whole thing to json-able (simplejson that is) string. And then loading it back in and converting back with cattrs structure_attrs_from_dict(object, attrs_class_type) works perfectly. Unless you use forward annotations. https://github.com/Tinche/cattrs/issues/80 But I also like that decorators work with attrs (they are broken in @dataclasses). I haven't needed getters/setters/deleters with attrs classes but I do use @property.
|
# ? Mar 30, 2020 20:39 |
|
What’s a good place to ask dumb machine learning questions? I’ve got a training pipeline set up for image segmentation and it works quite well overall, but due to memory constraints on my GPU I had to dial my batch size down to 1. I have a total set of 100 training images so I said screw it and pumped it up to 2, and that made my val_loss super jittery and the purported accuracy stopped around 0.8, whereas with a batch size of 1 it’s not jittery and the accuracy metric goes up to 0.93. The resulting mask from predictions is much less accurate too. This is one of those things where I definitely know enough to just get into trouble, and I’m just curious why increasing the batch size to 2 would affect that.
|
# ? Mar 31, 2020 01:07 |
|
A larger batch size is more likely to converge into a sharp minimum, which can cause overall worse testing accuracy. Your training dataset is very small, what's your testing dataset size?
|
# ? Mar 31, 2020 01:27 |
|
QuarkJets posted:A larger batch size is more likely to converge into a sharp minimum, which can cause overall worse testing accuracy. Your training dataset is very small, what's your testing dataset size? It’s 100 total images with an 80/20 split for training/validation, but I’m using the Unity3D image synthesis plugin to generate training data so I can always generate more. I figured 100 was plenty to get an idea of whether or not I could do anything meaningful.
|
# ? Mar 31, 2020 02:14 |
|
Showerthought project: Create a python interpreter that's a subset of the current one:

- Dramatically trimmed-down std lib
- Some struct-likes removed or streamlined (ie consolidate namedtuple, NamedTuple, dataclass, TypedDict, class etc)
- (related to above) All classes have freebee dunder methods like __repr__, and constructors (make dataclass default, and don't require the @?)
- Remove some secondary functional APIs like for enum
- Won't run unless it passes the type checker like mypy (perhaps with a flag to disable/enable?)
- Doesn't accept third-party modules that don't use wheel format and/or don't specify dependencies/metadata properly

More about removing than adding. I haven't dug into the Python code yet. Dumb idea? Where would you start?
|
# ? Mar 31, 2020 07:21 |
|
Dominoes posted:Showerthought project: Create a python interpreter that's a subset of the current one: Why
|
# ? Apr 1, 2020 00:13 |
|
https://micropython.org/ ?
|
# ? Apr 1, 2020 00:51 |
|
Hah I guess. Was thinking any target. Just a wishlist for a cleaned up Py. Have you tried Micropython? What do you think?
Dominoes fucked around with this message at 01:41 on Apr 1, 2020 |
# ? Apr 1, 2020 01:32 |
|
I think this article nails it: https://glyph.twistedmatrix.com/2019/06/kernel-python.html
|
# ? Apr 1, 2020 04:54 |
|
Dominoes posted:Showerthought project: Create a python interpreter that's a subset of the current one: My personal experience trying to recreate much of the Python user experience in a custom interpreter is that there's a lot of moon magic hiding under the hood that comes out and destroys your life. Things you think you can eliminate--or even just compromise--for the sake of parsimony eventually explode to their true forms. I guess if you're trying to start from CPython and just snip, then that's one thing. It'll probably crash on some dangling bit though. I haven't even pondered most of the data types you listed, but I can contrast the super type versus a regular class. Super type is a sneaky little fucker. Internally it overrides __getattribute__ but you have to dig to figure out that's what it's doing. If you casually poke it with a stick in the REPL then you think you're seeing different results.
|
# ? Apr 1, 2020 06:39 |
|
It's called rpython and it powers pypy op
|
# ? Apr 1, 2020 07:10 |
|
Hey, I figured out my neural network problem. Turns out if you preprocess and normalize the images going into the neural network, you also have to preprocess and normalize images going into the prediction workflow. Who’d a thunk? Not me That said, I’ve been reading a lot about neural networks and feel much more confident about my knowledge. I’m not going to be breaking any records or doing anything noteworthy but I at least know what overfit and underfit mean now.
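The lesson above generalises: route training and prediction through one shared preprocessing function so the network always sees the same input distribution. A minimal sketch (the scaling scheme is just one common convention):

```python
import numpy as np

def preprocess(images):
    """Scale uint8 pixel values [0, 255] to float32 in [0, 1]."""
    return np.asarray(images, dtype=np.float32) / 255.0

# Same function called in BOTH paths -- the bug above was skipping it
# on the prediction side
train_batch = preprocess(np.full((2, 4, 4, 3), 255, dtype=np.uint8))
pred_batch = preprocess(np.full((1, 4, 4, 3), 255, dtype=np.uint8))
```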
|
# ? Apr 2, 2020 02:25 |
|
Depending on what you’re doing, you can build in normalization to the network, e.g.: https://keras.io/layers/normalization/ that’s sometimes more convenient
|
# ? Apr 2, 2020 02:46 |
|
Dominoes posted:Showerthought project: Create a python interpreter that's a subset of the current one: Consider the effort required vs the potential upside..? Why would you do that? There is this weird pattern for programmers moving into Python with experience from large code bases in other languages. They start focusing on fixing 'problems' with python to future proof applications. Which would matter if we were building this massive monolith of a code base that will be used for 25 years. But usually the requirements is just to create~100 lines of code we can move into a docker container to solve some small problem.
|
# ? Apr 2, 2020 20:17 |
|
Protocol7 posted:Hey, I figured out my neural network problem. You probably know this already, but be sure to use n-fold validation to estimate performance of your pipelines. Also, include ALL the choices made in creating that pipeline. It is tempting to omit part of the preprocessing in the validation step for faster results, but the preprocessing steps matter a lot. Sometimes even more than model tuning. Don't just use n-fold validation for model performance, use it to estimate performance for the entire pipeline.
|
# ? Apr 2, 2020 20:20 |
|
Tracking stupid statistics about data flowing through your pipeline is surprisingly useful for catching mistakes like this. The mean pixel value entering the network is pretty opaque on its own, but is immediately suspicious when it's different between training and prediction.
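The "stupid statistics" idea above can be as small as this (the threshold and values are invented): log the mean/std of what actually enters the network in both paths, and flag a mismatch instead of silently predicting garbage.

```python
import numpy as np

def input_stats(batch):
    """Cheap sanity statistics for a batch entering the network."""
    batch = np.asarray(batch, dtype=np.float64)
    return {"mean": batch.mean(), "std": batch.std()}

train_stats = input_stats(np.array([[0.0, 0.5, 1.0]]))      # scaled to [0, 1]
pred_stats = input_stats(np.array([[0.0, 127.0, 255.0]]))   # forgot to scale!

# The mean alone is opaque, but a large train/predict gap is suspicious
suspicious = abs(train_stats["mean"] - pred_stats["mean"]) > 1.0
```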
|
# ? Apr 2, 2020 21:23 |
Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"}
|
|
# ? Apr 2, 2020 21:52 |
|
hhhmmm posted:Consider the effort required vs the potential upside..? Why would you do that? Given how long folks stayed on Py2, I'm not optimistic. Dominoes fucked around with this message at 21:57 on Apr 2, 2020 |
# ? Apr 2, 2020 21:54 |
|
Data Graham posted:Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"} It makes more sense if you remember that foo[bar] is basically just a nicer looking way to write foo.__getitem__(bar), regardless of foo’s type. (And of course it's actually a little more complicated than that when you consider assignment and del, and the corresponding __setitem__ and __delitem__)
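A toy class (entirely made up) makes the sugar concrete: subscript syntax dispatches to the item dunders regardless of the container's type.

```python
class Shouty:
    """Dict-like toy that upper-cases values on retrieval."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key].upper()

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

s = Shouty()
s["greeting"] = "hello"   # calls __setitem__
loud = s["greeting"]      # calls __getitem__
del s["greeting"]         # calls __delitem__
```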
|
# ? Apr 2, 2020 22:15 |
Data Graham posted:Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"} I think it's a useful analogy with array indices.
|
|
# ? Apr 2, 2020 22:21 |
|
Data Graham posted:Random thought, it seems wonky to me that a dict subscript is written foo["bar"] and not foo{"bar"} Those characters do opposite things, nominally {} creates dict entries and [] fetches them. So the real wonkiness is in the fact that foo["bar"] suddenly becomes a setter if you put it on the left side of an equal sign, when it's normally a getter in other contexts
|
# ? Apr 3, 2020 03:02 |
|
-
|
# ? Apr 3, 2020 04:02 |
|
Nippashish posted:Tracking stupid statistics about data flowing through your pipeline is surprisingly useful for catching mistakes like this. The mean pixel value entering the network is pretty opaque on its own, but is immediately suspicious when it's different between training and prediction. Yeah, I definitely get that. With NNs it’s pretty easy to see the effects of garbage in, garbage out. Thankfully I’ve been slowly adding more robustness to the pipeline so in general it helps normalize a lot of that stuff. If anyone else ever deals with image segmentation this library is super powerful in my own limited experience. https://github.com/qubvel/segmentation_models
|
# ? Apr 4, 2020 02:29 |
|
Protocol7 posted:Yeah, I definitely get that. With NNs it’s pretty easy to see the effects of garbage in, garbage out. Thankfully I’ve been slowly adding more robustness to the pipeline so in general it helps normalize a lot of that stuff. You'll likely enjoy this part of the fast.ai deep learning course as they cover image localization/segmentation: https://www.youtube.com/watch?v=MpZxV6DVsmM Bonus that you can compare it to segmentation_models for output, ease of use, etc. It's PyTorch based rather than TF and is also pretty easy to use.
|
# ? Apr 4, 2020 18:23 |
|
Can someone help point me in the direction of some good resources on using SQL in Python, or alternatively give some advice? My internship's big project for me involves a ~60x200,000 excel spreadsheet that requires very specific, weird analysis that no one has really wrapped their head around how to do, specifically involving simultaneous spatial and temporal analysis between different entries, so they tossed it to me. Originally I tried loving with it in Pandas but I'm running out of ways to break it over my knee in a way that gets what I want out of it so I thought about turning to SQL or some other relational DB. My background is in GIS and I've more or less been learning python and pandas on the fly
|
# ? Apr 6, 2020 10:39 |
|
From what you’re describing it doesn’t sound like putting the data into a SQL database is going to help you. It might be a better option for storing/retrieving the data in future (though you should think long and hard before creating a table with 200,000 columns), but if you’ve already got the data in a pandas data frame and don’t know what to do next it is not going to help. Some more detail would be useful. What is the structure of your data and what analysis are you trying to run on it?
|
# ? Apr 6, 2020 11:41 |
|
Yeah SQL isn't going to help you find new data analysis solutions, Pandas tables basically replicate what you can produce with SQL tables anyway
|
# ? Apr 6, 2020 11:47 |
|
DoctorTristan posted:From what you’re describing it doesn’t sound like putting the data into a SQL database is going to help you. It might be a better option for storing/retrieving the data in future (though you should think long and hard before creating a table with 200,000 columns), but if you’ve already got the data in a pandas data frame and don’t know what to do next it is not going to help. I work with people who get called out from central areas all over a city to various places. If every person in one central area's territory is busy, another person from a nearby territory responds to any new calls in that saturated area. It gets more complex because sometimes multiple people get called out to a site, as well as a few other variables. We have a big business and the spreadsheet is automatically generated from the dispatching system with a ton of data including dates and times and locations of where the person is going, and that's the data source. Currently what my code does is take the sheet (which I've done some manual excel magic on to help the process, more on that later) and put it into a pandas dataframe, do some joins to gather all the data of what person belongs where into one df, and then queries every instance of where a person went to a place out of their territory dependent on a few extra variables via an embarrassingly long conditional query. It's a very vague query and not entirely representative and that shows because it includes like 60% of the whole table. I'm trying to find a way to figure out, for example, if when this happens, was every available person already occupied determined via timestamp analysis, but I have no idea how that would work in pandas. Likewise, I've been trying to find a way to help sanitize or validate data conditionally, too.
Ideally I'd run a for loop through pandas to look at our in house codes for each territory on a row by row basis and strip useful info from them (they're six digit characters, with each pair representing something about the territory in decreasing scale ie city, neighborhood, street). That stuff I've mostly been doing in excel manually, which sucks but works, but ideally I want that to be automated. Hopefully that explains what I'm doing? Monday mornings are never my strong point
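For the six-digit territory codes, vectorised `.str` slicing replaces both the for loop and the manual Excel work. The codes and column names below are invented; only the pair-per-level layout comes from the post:

```python
import pandas as pd

df = pd.DataFrame({"territory": ["011203", "014405", "020101"]})

# Each digit pair encodes one level, in decreasing scale
df["city"] = df["territory"].str[0:2]
df["neighborhood"] = df["territory"].str[2:4]
df["street"] = df["territory"].str[4:6]
```

This runs on the whole column at once, so there's no need to iterate row by row.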
|
# ? Apr 6, 2020 12:27 |
|
Spaced God posted:I'll try to explain in vagueries while hopefully giving the scope of the problem. I don't want anyone knowing a comedy web forum helped me be a good intern lol

I can only give vague advice on this kind of vague info, but it sounds like you are trying to do something quite difficult and have a moderate-to-severe case of running before you can walk. A few things stand out:

- When you say that your data is 60*200000, did you get that the wrong way round or do you really have 60 rows and 200,000 columns?
- What does each row in your data frame represent? Is it an event where someone is called out, with date/time/location?
- Can you do more basic queries, such as ‘how many call outs were there on June 19 2019?’ or ‘how many call outs were there in total in region X during 2018?’ Can you do visualisations of these?

It sounds like your extraction, data cleaning and analysis steps are getting quite mixed up with each other - this is invariably a bad idea.
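The "basic queries first" advice sketched in pandas, with invented columns and dates (one row per call-out): simple boolean filters answer exactly the sanity-check questions listed above.

```python
import pandas as pd

calls = pd.DataFrame({
    "when": pd.to_datetime([
        "2019-06-19 09:00", "2019-06-19 14:30", "2018-03-01 08:00",
    ]),
    "region": ["X", "Y", "X"],
})

# How many call outs were there on June 19 2019?
on_june_19 = calls[calls["when"].dt.date == pd.Timestamp("2019-06-19").date()]

# How many call outs in region X during 2018?
region_x_2018 = calls[(calls["region"] == "X") & (calls["when"].dt.year == 2018)]
```

If queries this simple are hard to write against the current frame, that's a sign the extraction/cleaning step needs fixing before any spatial-temporal analysis.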
|
# ? Apr 6, 2020 22:26 |