|
I don't think I'd call myself an "expert" so maybe someone will weigh in with a better solution, but using ffill seems like it would work: code:
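The snippet itself was lost in quoting, but a mask + ffill approach along these lines is probably what was meant (the column names and layout here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical layout: category names sit on their own rows, with the
# other columns blank
df = pd.DataFrame({
    "item": ["Fruit", "apple", "banana", "Veg", "carrot"],
    "price": [np.nan, 1.2, 0.5, np.nan, 0.8],
})

# A row with no price is a category header (assumption about the sheet)
is_header = df["price"].isna()

# Mask + ffill: copy header labels into a new column, forward-fill them
# down over the data rows, then drop the header rows themselves
df["category"] = df["item"].where(is_header).ffill()
df = df[~is_header].reset_index(drop=True)
```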
|
# ? Feb 21, 2024 03:24 |
|
|
# ? May 15, 2024 04:21 |
|
I'm playing around with Polars and this seems to work: code:
Worth noting that polars does away with an explicit index column, but has some fast and powerful filtering. E: Also worth noting ChatGPT is poo poo at helping with Polars because it is too new and re-uses some common terms.
|
# ? Feb 21, 2024 03:52 |
|
WHERE MY HAT IS AT posted:I don't think I'd call myself an "expert" so maybe someone will weigh in with a better solution, but using ffill seems like it would work: Awesome! This is perfect, exactly what I was looking for. This sort of mask + ffill method would probably work in a lot of use cases for these kinds of tables.
|
# ? Feb 21, 2024 03:53 |
|
Is there a standard tool to look at parquet files? I'm trying to go through a slog of parquet files and keep getting an exception: Python code:
I understand I'm trying to have the efficiency of polars in lazy mode, but I'd love to know where it specifically blows up to help figure out the problem upstream. Is there a better place to ask polars / data science questions?
|
# ? Feb 21, 2024 21:06 |
|
There's a burgeoning data engineering thread here that might get you a better answer, but I think the crossover of posters is high anyways: https://forums.somethingawful.com/showthread.php?threadid=4050611
|
# ? Feb 21, 2024 21:22 |
|
PyArrow has a Parquet module. You might try that if Polars is being uncooperative.
|
# ? Feb 21, 2024 21:49 |
|
Oysters Autobio posted:Any pandas experts here? Stupid simple option if they’re grouped in order Get list of your categories Save it as a CSV Use the category list to break it into chunks for each category. Make each chunk df, adding a column for the category pd.concat Otherwise I’d try to read the excel spreadsheet such that the index is ordered numerically and get the indexes that contain the categories (eg by filtering to where the other columns have nans) then just map the category to its applicable data rows between each category
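The chunk-per-category recipe described above might look roughly like this (same invented layout: header rows carry the category name and NaN elsewhere):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item": ["Fruit", "apple", "banana", "Veg", "carrot"],
    "price": [np.nan, 1.2, 0.5, np.nan, 0.8],
})

# Index positions of the category header rows
header_idx = df.index[df["price"].isna()].tolist()

# Slice out each category's data rows, add the category column, concat
chunks = []
bounds = header_idx + [len(df)]
for start, end in zip(bounds, bounds[1:]):
    chunk = df.iloc[start + 1:end].copy()      # data rows under this header
    chunk["category"] = df.loc[start, "item"]  # label them with the header
    chunks.append(chunk)

out = pd.concat(chunks, ignore_index=True)
```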
|
# ? Feb 22, 2024 05:47 |
|
CarForumPoster posted:Stupid simple option if they’re grouped in order I agree. IMO trying to do this in Pandas is the wrong approach, part of your ETL pipeline is sanitizing and standardizing your data before loading it, so you should just handle that in code first before feeding it over to pandas. Especially if this is an adhoc spreadsheet. Here's an extremely naive example: Python code:
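The "extremely naive example" didn't survive the quote; the pure-Python sanitizing step probably looked something like this (rows and layout are invented):

```python
import pandas as pd

# Raw rows as they might come out of the spreadsheet reader
rows = [
    ("Fruit", ""),
    ("apple", "1.2"),
    ("banana", "0.5"),
    ("Veg", ""),
    ("carrot", "0.8"),
]

# Plain-Python pass: carry the current category along and only emit
# real data rows, so pandas receives clean, rectangular data
cleaned = []
current = None
for item, price in rows:
    if price == "":  # header row under the assumed layout
        current = item
    else:
        cleaned.append({"category": current, "item": item, "price": float(price)})

df = pd.DataFrame(cleaned)
```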
|
# ? Feb 23, 2024 00:28 |
|
I have a quick question that I cannot figure out and the keywords involved make googling difficult. I’m also having a hard time explaining this so bear with me. I am trying to write an if statement that checks that one variable isn’t in a list of values, or if that variable isn’t equal to a specific value while another variable is equal to a specific value at the same time. My test case is below. In this case I only want to see ‘Right’ when x is either ‘a’, ‘b’, or ‘c’ OR x is ‘d’ while y is also 3. code:
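The original test case was lost, but from the follow-up posts the trap is that `not` binds tighter than `or`. A reconstruction of the pitfall (the variable values are just illustrative):

```python
x, y = 'z', 0  # a value that should NOT count as 'Right'

# Written with a leading `not`, the negation only covers the first test:
# this parses as (not (x in [...])) or (x == 'd' and y == 3),
# which is True for x = 'z'
buggy = not x in ['a', 'b', 'c'] or (x == 'd' and y == 3)

# Spelling out the intended grouping instead:
wanted = x in ['a', 'b', 'c'] or (x == 'd' and y == 3)

print('Right' if wanted else 'Wrong')
```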
|
# ? Feb 25, 2024 16:37 |
|
Is the not negating the entire statement or just the part before the or? I think you need more parentheses, because the not isn't applying precisely how you want it. I realize this is a simplified example to demonstrate the issue, but in production do you want "right" to be an outcome? Because if so, I don't see why you're bothering with the not and starting from the negative case.
|
# ? Feb 25, 2024 17:07 |
|
FISHMANPET posted:Is the not negating the entire statement or just the part before the or? I think you need more parenthesis, because the not isn't applying precisely how you want it. Yes, the ‘not’ should only apply to the first conditions wrapped in parentheses and not the one after the ‘or’. I tried wrapping the whole thing in parentheses like (not (x == ‘d’ and y == 3)) but that doesn’t seem to work either. ‘Right’ is the desired result when x = ‘d’ and y = 3. The test case is just an example from a much larger program that I can’t easily restructure so I just need to figure out if what I’m trying to do is possible. It feels like there should be a way to do this but I can’t find any info.
|
# ? Feb 25, 2024 17:17 |
|
yeah, why are you writing it like that? why not just write: Python code:
and the second condition evaluates to True because x is NOT in ['a', 'b', 'c'] so what it's doing in your example is: Python code:
e: using negatives like that is absolute hell though so if i caught anybody writing code like that i would judge them forever and probably try to get them fired. out of a cannon, into a volcano boofhead fucked around with this message at 17:24 on Feb 25, 2024 |
# ? Feb 25, 2024 17:21 |
|
And the reason it's absolute hell is... exactly the case you're running into now. Sometimes programming conversations can feel preachy when doing it the "wrong" way works for now, but this is a pretty clear case of the "wrong" way directly making the program harder to understand and troubleshoot.
|
# ? Feb 25, 2024 17:28 |
|
boofhead posted:yeah, why are you writing it like that? why not just write I have to do it this way because these two new conditions are just an addition to a very long list of existing conditions, and at least some of them will need to be negated. I inherited this program and I can’t change how it’s structured without causing a lot of drama so I’m trying to do the best with what I have. I agree that it’s going to be a (more) confusing mess from here on out.
|
# ? Feb 25, 2024 17:35 |
|
Deadite posted:I have to do it this way because these two new conditions are just an addition to a very long list of existing conditions, and at least some of them will need to be negated. I inherited this program and I can’t change how it’s structured without causing a lot of drama so I’m trying to do the best with what I have. I agree that it’s going to be a (more) confusing mess from here on out. refactor the whole thing imo but if you're determined to go ahead with it, this should do what you're asking Python code:
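The suggested snippet is missing too; presumably it was the fully parenthesized version of the two conditions, something like:

```python
def check(x, y):
    # Explicit parentheses so each clause reads unambiguously
    if (x in ['a', 'b', 'c']) or (x == 'd' and y == 3):
        return 'Right'
    return 'Wrong'
```

With the grouping made explicit, check('a', 0) and check('d', 3) come out 'Right' while check('d', 2) is 'Wrong'.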
|
# ? Feb 25, 2024 17:43 |
|
boofhead posted:refactor the whole thing imo Perfect, thank you. And I'm sure that team doesn't do any unit testing. I got pulled in to help out because they were running behind. At the end of the month this program will not be my problem again until an ironic reorg forces me to maintain it. Here's a more accurate representation of the issue, but pretend there are about 20 more variables that need to be tested: Python code:
|
# ? Feb 25, 2024 17:51 |
|
Deadite posted:Perfect, thank you. And I'm sure that team doesn't do any unit testing. I got pulled in to help out because they were running behind. At the end of the month this program will not be my problem again until an ironic reorg forces me to maintain it. What version of Python are you running this on? (Lol I know, it's probably going to be 3.8) I ask just on the off chance you might be able to leverage pattern matching to make this less of a rat's nest. (https://docs.python.org/3.10/reference/compound_stmts.html#the-match-statement)
|
# ? Feb 25, 2024 17:59 |
|
Deadite posted:Perfect, thank you. And I'm sure that team doesn't do any unit testing. I got pulled in to help out because they were running behind. At the end of the month this program will not be my problem again until an ironic reorg forces me to maintain it. i wonder why they're running behind! i also don't mean to take a dig at you, capitalism makes us do terrible, terrible things, code included but are you really sure you can't convince them to implement something more maintainable? pydantic, or if you can't use external tools, do it in a class that's just a little cleaner? something like Python code:
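The class example here is another lost snippet; in that spirit, one way to make twenty-odd conditions maintainable without external tools is to give each rule its own named method (everything below is invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Record:
    x: str
    y: int

    def is_valid(self) -> bool:
        # One rule per named method beats one giant boolean expression
        return self._x_allowed() or self._special_case()

    def _x_allowed(self) -> bool:
        return self.x in ('a', 'b', 'c')

    def _special_case(self) -> bool:
        return self.x == 'd' and self.y == 3
```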
|
# ? Feb 25, 2024 18:25 |
|
QuarkJets posted:The missing docstring and bad function name are probably the two most important items on that list, new developers should definitely be thinking about these things when writing a new function. Every time. Why docstrings and not comments? Follow-up question: what to use comments for, then?
|
# ? Feb 26, 2024 17:03 |
|
Seventh Arrow posted:Follow-up question: what to use comments for, then? Disabling your debug print statements of course.
|
# ? Feb 26, 2024 17:26 |
|
I'm also a big fan of Python code:
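This snippet didn't survive either; given the joke about commenting out debug prints, it was plausibly the logging module, which lets you flip debug output in one place instead of editing every print:

```python
import logging

log = logging.getLogger("demo")
log.setLevel(logging.INFO)  # flip to DEBUG here to re-enable debug output

log.debug("only emitted when the level is DEBUG")  # silent at INFO
log.info("normal operational message")
```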
|
# ? Feb 26, 2024 17:32 |
|
Seventh Arrow posted:Why docstrings and not comments?

The PEP 8 style guide says this about docstrings:

quote:Write docstrings for all public modules, functions, classes, and methods. Docstrings are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the def line.

PEP 8 is the industry standard, but even if you're not obeying PEP 8, docstrings are still considered best practice. That string automatically becomes the special __doc__ attribute of your function or class and should explain the fundamentals of the code: what it takes as input, what it does to the input, what it returns. This summary is useful for future developers looking at the code, including potentially the person who wrote it.

Other comments (comments that aren't basically a docstring) may or may not be necessary. Comments can be useful for explaining some tricky behavior, but they shouldn't be there if the behavior is obvious. PEP 8 agrees with this:

quote:Inline comments are unnecessary and in fact distracting if they state the obvious. Don't do this:

Comments have a tendency to rot; developers sometimes forget to tend to the comments while making changes to the code, especially if a comment is not immediately preceding a line that's being changed, and then you wind up with comments that do not accurately describe the code. This isn't just unhelpful, it's harmful and worse than not having a comment at all. Docstrings have all of these risks as well, but they're a little less prone to rotting: each object can only have one docstring, so it's easy to verify that the docstring is still accurate, whereas comments can be found anywhere.

I felt that the example code was easy to read and didn't need comments. A docstring is required and would be sufficient.
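As a concrete sketch of that advice (the function is invented), a docstring that covers input, behavior, and return value:

```python
def label_rows(rows):
    """Assign each data row the most recent category header.

    Args:
        rows: list of (label, value) pairs; header rows have value None.

    Returns:
        List of (category, label, value) triples for the data rows only.
    """
    out, current = [], None
    for label, value in rows:
        if value is None:
            current = label
        else:
            out.append((current, label, value))
    return out
```

The string is then reachable as label_rows.__doc__ and shows up in help(label_rows).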
|
# ? Feb 26, 2024 17:34 |
|
Falcon2001 posted:I agree. IMO trying to do this in Pandas is the wrong approach, part of your ETL pipeline is sanitizing and standardizing your data before loading it, so you should just handle that in code first before feeding it over to pandas. Especially if this is an adhoc spreadsheet. Here's an extremely naive example: The nice thing about your data cleaning approach is that you can easily farm out bugs in the data cleaning to people who are basically complete strangers to data engineering, but if this isn't a problem you have, there's nothing wrong with vectorizing if it makes sense. The Pandas approach will probably, ironically, make less sense to people who are very familiar with Python, but feel extremely natural to people looking at Python data engineering from an R background
|
# ? Feb 26, 2024 19:26 |
|
QuarkJets posted:Other comments (comments that aren't basically a docstring) may or may not be necessary. Comments can be useful for explaining some tricky behavior, but they shouldn't be there if the behavior is obvious. I've heard the most common case for this is new devs who add comments to everything except the parts that don't make sense, so you have stuff like # iterate through list. Personally, the best explanation for comments I've found is that comments tell you why the code is written this way, not what the code is doing, unless your code has some reason to be unreadable, in which case, sure...but you should explain why you have to do that, if I'm reviewing your code. The why should be more like business logic, or reasons why you chose a particular approach over another, and not simply an explanation of basic algorithms. Vulture Culture posted:I agree that it should usually be a separate pipeline step, but you should feel fine doing whatever is idiomatically familiar to the people who need to maintain it while being reasonably performant Yeah, I suppose that 'the wrong approach' was too opinionated on my part, I agree with the rest of your post.
|
# ? Feb 27, 2024 02:23 |
|
Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes.
|
# ? Feb 27, 2024 06:20 |
|
DoctorTristan posted:Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes. It shouldn't matter as long as you aren't blindly merging new changes to production without testing. If you are, pandas isn't the only thing that's going to screw your day up.
|
# ? Feb 27, 2024 07:34 |
|
DoctorTristan posted:Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes. Better to never curate the API ever.
|
# ? Feb 27, 2024 12:24 |
|
Vulture Culture posted:I agree that it should usually be a separate pipeline step, but you should feel fine doing whatever is idiomatically familiar to the people who need to maintain it while being reasonably performant Thanks for all the examples. I'll admit as a data analyst, I'm more comfortable with pandas than Python itself. To your point about idiomatic code, not gonna lie, the pandas-based masking and forward-filling method made much better sense to me than the pure Python method. But since my question was geared towards ETL, I can understand why people might recommend pure Pythonic methods for maintainability's sake. I am still very much learning. Though to be honest, I've found myself drawn towards functional approaches (like using map(), apply(), etc.). Sometimes people's really deeply nested iterative loops are hard for me to understand because I'm really new to imperative programming and mainly familiar with SQL as a data analyst. Absolutely love me a nice CASE...WHEN...THEN syntax.
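For what it's worth, the CASE...WHEN...THEN feel translates pretty directly to np.select in pandas-land; a small sketch (data invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [95, 72, 40]})

# Reads like: CASE WHEN score >= 90 THEN 'A'
#                  WHEN score >= 60 THEN 'B'
#                  ELSE 'F' END
conditions = [df["score"] >= 90, df["score"] >= 60]
choices = ["A", "B"]
df["grade"] = np.select(conditions, choices, default="F")
```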
|
# ? Feb 27, 2024 13:48 |
|
Falcon2001 posted:Personally, the best explanation for comments I've found is that comments tell you why the code is written this way, not what the code is doing, unless your code has some reason to be unreadable, in which case, sure...but you should explain why you have to do that, if I'm reviewing your code. The why should be more like business logic, or reasons why you chose a particular approach over another, and not simply an explanation of basic algorithms. One thing I used to do was write a comment describing what the next few lines were doing at a high level. Something like "# retrieve batch metadata from API" and then a few lines to create a requests session, query the API, pull data out of the response object. I thought it made the code clearer by separating things into sections. Now, I use that urge as an indication that those lines should be in their own function. I think that's something often not explicitly stated in discussions of self-documenting code. Individual lines can be clear enough to not need comments, but if the whole function needs comments to keep track of what you're doing, it should be split up.
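That instinct can be sketched concretely: the section comment becomes a function name and then disappears (everything below is invented for illustration):

```python
# Before: one function with section comments
def process(order):
    # compute subtotal
    subtotal = sum(i["price"] * i["qty"] for i in order["items"])
    # apply discount
    return subtotal * (1 - order.get("discount", 0))

# After: each commented section is its own small, named function
def compute_subtotal(items):
    return sum(i["price"] * i["qty"] for i in items)

def apply_discount(subtotal, discount):
    return subtotal * (1 - discount)

def process_order(order):
    return apply_discount(compute_subtotal(order["items"]),
                          order.get("discount", 0))
```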
|
# ? Feb 27, 2024 15:03 |
|
Chin Strap posted:Better to never curate the API ever. And at least it has a pretty long deprecation time so you can catch warnings. Meanwhile I'm embracing chaos and have switched over to polars. Every week there's a new deprecation to chase
|
# ? Feb 27, 2024 18:01 |
|
BAD AT STUFF posted:One thing I used to do was write a comment describing what the next few lines were doing at a high level. Something like "# retrieve batch metadata from API" and then a few lines to create a requests session, query the API, pull data out of the response object. I think that's a good instinct, I see that difference between more experienced and less experienced developers all the time - more experienced developers write few if any comments, good docstrings, and several smaller functions, whereas less experienced developers write fewer but much bigger functions. There are no hard and fast rules here, but that's the trend I notice
|
# ? Feb 27, 2024 18:40 |
|
This is probably a very luddite opinion of mine, and I'm sure it's objectively wrong, but I really hate when there are too many functions being called (especially within functions within functions etc). I somewhat frequently find myself tracing very closely through some random code base trying to figure out exactly what's going on, and going deeper and deeper with functions makes that more difficult. You'd think in theory that I should be able to look at a function and say "It's going to do X" the same way I can look at a line like "a = b + c" and know what's going to happen. But in practice, it doesn't work out that way, and I end up having to read through those functions to figure out exactly what's happening.
|
# ? Feb 27, 2024 18:49 |
|
Yeah I am absolutely not a credible authority on anything but I hate clicking through a million sub calls to figure out how it all fits together. But I suspect the difference in opinion results from people dealing with vastly different code bases, code and documentation quality, function and variable naming schemes, and even just how different programmers approach problems I agree in theory with most of the concepts behind clean code but I've had to deal with a lot of code from people who are far from perfect programmers and some additional comments on what they were TRYING to do would have been very very helpful for debugging, refactoring, and extending If I have to scroll down twice within the same function it's too far, but if you have code that could have fit on one screen and that's only used by one function and you still split it up, I'm gonna get annoyed by that too after a while. Unless the code needs regular work for whatever reason
|
# ? Feb 27, 2024 18:59 |
|
FISHMANPET posted:This is probably a very luddite opinion of mine, and I'm sure it's objectively wrong, but I really hate when there are too many functions being called (especially within functions within functions etc). I somewhat frequently find myself tracing very closely through some random code base trying to figure out exactly what's going on, and going deeper and deeper with functions makes that more difficult. You'd think in theory that I should be able to look at a function and say "It's going to do X" the same way I can look at a line like "a = b + c" and know what's going to happen. But in practice, it doesn't work out that way, and I end up having to read through those functions to figure out exactly what's happening. That's why docstrings and good function names were called out as the most important things earlier, that's what helps you to navigate code bases designed that way. Also good tooling that allows you to jump to/from function definition easily, and get hover over docstrings/types. Modern software engineering best practices make a lot of assumptions about the environment the person reading the code is operating in. That said, people can and often do go way too far with OOP stuff and make it an unreadable mess in the name of "clean code."
|
# ? Feb 27, 2024 19:05 |
|
Something I've kinda-sorta struggled with over the last couple of years is reconciling PEP8 and other best practices with the rise of notebooks. Does splitting sections into functions make sense when those sections are in individual cells? Do you avoid referencing results from previous cells when you don't have to, because of what can be a non-linear way of executing code (sometimes I'll just reread data from disk each cell if it's quick)? Docstrings are redundant when we have Markdown, right? What about line lengths?
|
# ? Feb 27, 2024 23:40 |
|
basically if you have to descend into a function because you didn't know it existed, that's fine, but if you have to bring something from inside it with you and then continue your search, it shouldn't have been a function
|
# ? Feb 28, 2024 00:20 |
|
Notebooks live in their own world of bad code and data science, where you prove something out with a notebook and then you throw it away.
|
# ? Feb 28, 2024 00:21 |
|
idk, I like the concept of notebooks having rich-text documentation with the code, and able to show a process step-by-step. I do wish there was a way to prevent rerunning previous sections of code (apart from an import cell) or altering a variable in a cell, changing the cell and altering the variable again.
|
# ? Feb 28, 2024 00:29 |
|
Notebooks entirely live in a different use of coding. There's using code to make a tool, package, or something reusable: creating a system. Then there's using code to analyze data, explore, test, or share lessons and learning: creating a product, something to be consumed once by any number of people. But after recently working in a data platform designed by data scientists who were inspired by Netflix circa 2015, I think it's hell. Every ETL job is mashed into a notebook. Parameterized notebooks that generate other notebooks. Absolutely zero tests. If I'm writing an analytical report, a notebook is fantastic because I can weave in visualizations and text and images. Or often if I'm testing and building a number of functions, it's nice to call them in the REPL on demand and debug things that way. But once that's finished, it goes into a Python file. VS Code has a great extension that can convert a .ipynb to a .py file. But for straight code a notebook is a mess and frankly I find it slows you down. With a plain .py file I just navigate up and down, comment as I need, etc. Finally, once you've tried a full-featured IDE like VS Code, you'll never want to go back to JupyterLab. The tab completion is way snappier, you've got great variable viewers, and you can leverage a big ecosystem of extensions. I'm a complete amateur at Python and am only a data analyst, but I'm so glad I moved to VS Code.
|
# ? Feb 28, 2024 01:25 |
|
|
FISHMANPET posted:This is probably a very luddite opinion of mine, and I'm sure it's objectively wrong, but I really hate when there are too many functions being called (especially within functions within functions etc). I somewhat frequently find myself tracing very closely through some random code base trying to figure out exactly what's going on, and going deeper and deeper with functions makes that more difficult. You'd think in theory that I should be able to look at a function and say "It's going to do X" the same way I can look at a line like "a = b + c" and know what's going to happen. But in practice, it doesn't work out that way, and I end up having to read through those functions to figure out exactly what's happening. Functions within functions get confusing fast. There was one project I inherited where you had to go three levels deep from the entry point of the PySpark job before you started to see actual logic, and often you'd be in a different module than where you started. That was a nightmare. Comments to explain different sections of a larger code block are definitely a code smell rather than something that should never be done. I'd argue that having to trace a long call stack to figure out where stuff happens is also a code smell, although I can't come up with a specific software design principle to justify that. Dependency inversion, maybe? ...now notebooks, I'm throwing everything at the wall and seeing what sticks. BAD AT STUFF fucked around with this message at 02:22 on Feb 28, 2024 |
# ? Feb 28, 2024 02:19 |