WHERE MY HAT IS AT
Jan 7, 2011
I don't think I'd call myself an "expert" so maybe someone will weigh in with a better solution, but using ffill seems like it would work:

code:
import pandas as pd

data = {'Category': ['Southwest', '1', '2', '3', 'West', '1', '2', '3', '4', 'North', '1', '2']}
df = pd.DataFrame(data)

# Identify rows that are NOT numeric (i.e., region names) and mask others
df['Category'] = df['Category'].where(~df['Category'].str.isnumeric())
df['Category'] = df['Category'].ffill()

print(df)
This will replace all the numbers with null values and then ffill (forward fill) will propagate any remaining values down the sequence until the next non-null.


StumblyWumbly
Sep 12, 2007

Batmanticore!
I'm playing around with Polars and this seems to work
code:
import polars as pl

file_path = 'my_file.csv'

# Read the CSV into a Polars DataFrame
df = pl.read_csv(file_path)

# Create a new column with only the names (non-numeric values)
numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
df = df.with_columns(new_col=pl.when(~pl.col("first_column").str.contains_any(numbers)).then(pl.col("first_column")))

# Fill the new column down so missing values in 'new_col' get the most recent word
df = df.with_columns(pl.col('new_col').fill_null(strategy="forward"))

# Filter out the rows where there's a name (non-numeric value) in the first column
df = df.filter(pl.col("first_column").str.contains_any(numbers))
print(df)
All you have to do is change your framework and your problems will disappear also change!
Worth noting that polars does away with an explicit index column, but has some fast and powerful filtering.

E: Also worth noting ChatGPT is poo poo at helping with Polars because it is too new and re-uses some common terms.

Oysters Autobio
Mar 13, 2017

WHERE MY HAT IS AT posted:

I don't think I'd call myself an "expert" so maybe someone will weigh in with a better solution, but using ffill seems like it would work:

code:
import pandas as pd

data = {'Category': ['Southwest', '1', '2', '3', 'West', '1', '2', '3', '4', 'North', '1', '2']}
df = pd.DataFrame(data)

# Identify rows that are NOT numeric (i.e., region names) and mask others
df['Category'] = df['Category'].where(~df['Category'].str.isnumeric())
df['Category'] = df['Category'].ffill()

print(df)
This will replace all the numbers with null values and then ffill (forward fill) will propagate any remaining values down the sequence until the next non-null.

Awesome! This is perfect, exactly what I was looking for. This sort of mask + ffill method would probably work in a lot of use cases for these kinds of tables.

Hed
Mar 31, 2004

Fun Shoe
Is there a standard tool to look at parquet files?

I'm trying to go through a slog of parquet files and keep getting an exception:

Python code:
Traceback (most recent call last):
  File "log_count.py", line 57, in <module>
    daily_output = result.collect()
                   ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\polars\lazyframe\frame.py", line 1937, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^

polars.exceptions.ComputeError: not implemented: reading parquet type Double to Int64 still not implemented
I know what this means, but I don't have a good way to find which files the offending types are in, so I end up moving groups of files around until it works, then putting them back in one by one until I find the offender.

I understand I'm trying to have the efficiency of polars in lazy mode, but I'd love to know where it specifically blows up to help figure out the problem upstream.

Is there a better place to ask polars / data science questions?

WHERE MY HAT IS AT
Jan 7, 2011
There's a burgeoning data engineering thread here that might get you a better answer, but I think the crossover of posters is high anyways: https://forums.somethingawful.com/showthread.php?threadid=4050611

Zugzwang
Jan 2, 2005

You have a kind of sick desperation in your laugh.


Ramrod XTreme
PyArrow has a Parquet module. You might try that if Polars is being uncooperative.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Oysters Autobio posted:

Any pandas experts here?

I've got an adhoc spreadsheet I need to clean for ETL (why this is my job as a data analyst, I don't know) but I'm having trouble even articulating the problem decently enough to leverage ChatGPT or Google.

Spreadsheet (.XLSX file) is structured like this


code:

+----------+-------+------------+
| category | names | dates      |
+----------+-------+------------+
| Western  |       |            |
| 1        | Jim   | 2023.02.01 |
| 2        | Greg  | 2013.12.11 |
| 3        | Bob   | 2003.07.11 |
| Eastern  |       |            |
| 1        | Jess  | 2002.02.01 |
| 2        | Bill  | 2001.10.11 |
|          |       |            |
|          |       |            |
+----------+-------+------------+


Repeat for many more categories (i.e. "Southern", "Southwestern"), each with a different range of numbers following them with matching names/dates.

Reading it in as a data frame with defaults has pandas assigning the "category" column as the index, but despite reading docs and googling I still can't really wrap my head around multi-indexes. Lots of tutorials on creating them, or analysing them, but I can't find anything for the reverse (taking a multi-level index and transforming it into a repeating column value). The target model should be something like this:

code:

+---+----------+-------+------------+
|   | category | names | dates      |
+---+----------+-------+------------+
| 1 | Western  | Jim   | 2023.02.01 |
| 2 | Western  | Greg  | 2013.12.11 |
| 3 | Western  | Bob   | 2003.07.11 |
| 1 | Eastern  | Jess  | 2002.02.01 |
| 2 | Eastern  | Bill  | 2001.10.11 |
|   |          |       |            |
+---+----------+-------+------------+

edit: should specify that we'll likely receive these types of spreadsheets with different labels and values, so I'd like to write it to be somewhat parameterized and reusable, which is why I'm not just hacking through this for a one-time ETL

Stupid simple option if they’re grouped in order
Get list of your categories
Save it as a CSV
Use the category list to break it into chunks for each category.
Make each chunk df, adding a column for the category
pd.concat

Otherwise I’d try to read the excel spreadsheet such that the index is ordered numerically and get the indexes that contain the categories (eg by filtering to where the other columns have nans) then just map the category to its applicable data rows between each category
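That second idea can be sketched roughly like this in pandas (toy data standing in for the spreadsheet; column names assumed from the example above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "category": ["Western", "1", "2", "Eastern", "1"],
    "names": [np.nan, "Jim", "Greg", np.nan, "Jess"],
    "dates": [np.nan, "2023.02.01", "2013.12.11", np.nan, "2002.02.01"],
})

# Category header rows are the ones where the other columns are all NaN
is_header = df["names"].isna() & df["dates"].isna()

# Keep the category only on header rows, forward-fill it down,
# then drop the header rows themselves
df["category"] = df["category"].where(is_header).ffill()
df = df[~is_header].reset_index(drop=True)
print(df)
```

Same mask + ffill trick as above, just keyed off the empty columns instead of "is the value numeric", which is a bit more robust if a category name ever contains a digit.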

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

CarForumPoster posted:

Stupid simple option if they’re grouped in order
Get list of your categories
Save it as a CSV
Use the category list to break it into chunks for each category.
Make each chunk df, adding a column for the category
pd.concat

Otherwise I’d try to read the excel spreadsheet such that the index is ordered numerically and get the indexes that contain the categories (eg by filtering to where the other columns have nans) then just map the category to its applicable data rows between each category

I agree. IMO trying to do this in Pandas is the wrong approach, part of your ETL pipeline is sanitizing and standardizing your data before loading it, so you should just handle that in code first before feeding it over to pandas. Especially if this is an adhoc spreadsheet. Here's an extremely naive example:

Python code:
from csv import DictReader

with open('file_path', 'r') as csvfile:
    reader = DictReader(csvfile)
    data = [row for row in reader]

out_data = []
current_category = ""
for row in data:
    if row['category'] and not (row['names'] or row['dates']):
        current_category = row['category']
    else:
        row['category'] = current_category
        out_data.append(row)
# Assumptions made:
# the category identifier row always shows up before the rows in that category

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.
I have a quick question that I cannot figure out and the keywords involved make googling difficult.

I’m also having a hard time explaining this so bear with me.

I am trying to write an if statement that checks that one variable isn’t in a list of values, or if that variable isn’t equal to a specific value while another variable is equal to a specific value at the same time. My test case is below. In this case I only want to see ‘Right’ when x is either ‘a’, ‘b’, or ‘c’ OR x is ‘d’ while y is also 3.

code:
x = 'd'
y = 3

if not (x == 'd' and y == 3) or (x not in ['a', 'b', 'c']):
    print('Wrong')
else:
    print('Right')
I can get the ‘Right’ result if I only use one of these tests at a time, but when I link them with an ‘or’ it stops working. Am I missing something obvious here?

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
Is the not negating the entire statement or just the part before the or? I think you need more parenthesis, because the not isn't applying precisely how you want it.

I realize this is a simplified example to demonstrate the issue, but in production do you want "right" to be an outcome? Because if so, I don't see why you're even bothering with the not to start with the negative case.

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.

FISHMANPET posted:

Is the not negating the entire statement or just the part before the or? I think you need more parenthesis, because the not isn't applying precisely how you want it.

I realize this is a simplified example to demonstrate the issue, but in production do you want "right" to be an outcome? Because if so, I don't see why you're even bothering with the not to start with the negative case.

Yes, the ‘not’ should only apply to the first conditions wrapped in parentheses and not the one after the ‘or’. I tried wrapping the whole thing in parentheses like (not (x == ‘d’ and y == 3)) but that doesn’t seem to work either.

‘Right’ is the desired result when x = ‘d’ and y = 3. The test case is just an example from a much larger program that I can’t easily restructure so I just need to figure out if what I’m trying to do is possible. It feels like there should be a way to do this but I can’t find any info.

boofhead
Feb 18, 2021

yeah, why are you writing it like that? why not just write

Python code:
x = 'd'
y = 3

if (x == 'd' and y == 3) or (x in ['a', 'b', 'c']):
    print('right')
else:
    print('wrong')
the reason your example doesn't work the way you want it to is because combining conditions with an 'or' means that it checks if any of the conditions is True and goes down that path

and the second condition evaluates to True because x is NOT in ['a', 'b', 'c']

so what it's doing in your example is:

Python code:
# for x == 'd' and y == 3

=> not (x == 'd' and y == 3) or (x not in ['a', 'b', 'c'])

=> not (True) or (True)

=> False or True
so it goes down that path every time

e: using negatives like that is absolute hell though so if i caught anybody writing code like that i would judge them forever and probably try to get them fired. out of a cannon, into a volcano

boofhead fucked around with this message at 17:24 on Feb 25, 2024

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
And the reason it's absolute hell is... exactly the case you're running into now. Programming conversations can feel preachy when doing it the "wrong" way works for now, but this is a pretty clear case where doing it the "wrong" way is directly making the program harder to understand and troubleshoot.

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.

boofhead posted:

yeah, why are you writing it like that? why not just write

Python code:
x = 'd'
y = 3

if (x == 'd' and y == 3) or (x in ['a', 'b', 'c']):
    print('right')
else:
    print('wrong')
the reason your example doesn't work the way you want it to is because combining conditions with an 'or' means that it checks if any of the conditions is True and goes down that path

and the second condition evaluates to True because x is NOT in ['a', 'b', 'c']

so what it's doing in your example is:

Python code:
# for x == 'd' and y == 3

=> not (x == 'd' and y == 3) or (x not in ['a', 'b', 'c'])

=> not (True) or (True)

=> False or True
so it goes down that path every time

e: using negatives like that is absolute hell though so if i caught anybody writing code like that i would judge them forever and probably try to get them fired. out of a cannon, into a volcano

I have to do it this way because these two new conditions are just an addition to a very long list of existing conditions, and at least some of them will need to be negated. I inherited this program and I can’t change how it’s structured without causing a lot of drama so I’m trying to do the best with what I have. I agree that it’s going to be a (more) confusing mess from here on out.

boofhead
Feb 18, 2021

Deadite posted:

I have to do it this way because these two new conditions are just an addition to a very long list of existing conditions, and at least some of them will need to be negated. I inherited this program and I can’t change how it’s structured without causing a lot of drama so I’m trying to do the best with what I have. I agree that it’s going to be a (more) confusing mess from here on out.

refactor the whole thing imo

but if you're determined to go ahead with it, this should do what you're asking

Python code:
if not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c'])):
    print('wrong')
else:
    print('right')
hope that code has amazing unit tests btw

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.

boofhead posted:

refactor the whole thing imo

but if you're determined to go ahead with it, this should do what you're asking

Python code:
if not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c'])):
    print('wrong')
else:
    print('right')
hope that code has amazing unit tests btw

Perfect, thank you. And I'm sure that team doesn't do any unit testing. I got pulled in to help out because they were running behind. At the end of the month this program will not be my problem again until an ironic reorg forces me to maintain it.

Here's a more accurate representation of the issue, but pretend there are about 20 more variables that need to be tested:

Python code:
v = 'N'
w = 'N'
x = 'a'
y = 3

if (v == 'N' or w == 'N') and (not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c']))):
    print('do something')
else:
    print('do something else')

Armitag3
Mar 15, 2020

Forget it Jake, it's cybertown.


Deadite posted:

Perfect, thank you. And I'm sure that team doesn't do any unit testing. I got pulled in to help out because they were running behind. At the end of the month this program will not be my problem again until an ironic reorg forces me to maintain it.

Here's a more accurate representation of the issue, but pretend there are about 20 more variables that need to be tested:

Python code:
v = 'N'
w = 'N'
x = 'a'
y = 3

if (v == 'N' or w == 'N') and (not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c']))):
    print('do something')
else:
    print('do something else')

What version of Python are you running this on? (Lol I know, it's probably going to be 3.8) I ask just on the off chance you might be able to leverage pattern matching to make this less of a rat's nest. (https://docs.python.org/3.10/reference/compound_stmts.html#the-match-statement)

boofhead
Feb 18, 2021

Deadite posted:

Perfect, thank you. And I'm sure that team doesn't do any unit testing. I got pulled in to help out because they were running behind. At the end of the month this program will not be my problem again until an ironic reorg forces me to maintain it.

Here's a more accurate representation of the issue, but pretend there are about 20 more variables that need to be tested:

Python code:
v = 'N'
w = 'N'
x = 'a'
y = 3

if (v == 'N' or w == 'N') and (not ((x == 'd' and y == 3) or (x in ['a', 'b', 'c']))):
    print('do something')
else:
    print('do something else')

i wonder why they're running behind!

i also don't mean to take a dig at you, capitalism makes us do terrible, terrible things, code included

but you really sure you can't convince them to implement something more maintainable? pydantic or if you cant use external tools, do it in a class that's just a little cleaner? something like

Python code:
class ConditionsWrangler:
    """You have conditions? We have a wrangler."""
    v = 'N'
    w = 'N'
    x = 'd'
    y = 3

    # individual checks
    def v_or_w_is_N(self) -> bool:
        return self.v == 'N' or self.w == 'N'

    # helper checks
    def x_is_d_and_y_is_3(self) -> bool:
        return self.x == 'd' and self.y == 3
    
    def x_in_abc(self) -> bool:
        return self.x in ['a', 'b', 'c']
    
    # grouped conditions
    def x_is_valid(self) -> bool:
        return self.x_is_d_and_y_is_3() or self.x_in_abc()

    @property
    def is_valid(self) -> bool:
        return all(condition for condition in [
            self.v_or_w_is_N(),
            self.x_is_valid()
        ])

whatever = ConditionsWrangler()

if whatever.is_valid:
    print('right')
else:
    print('wrong')
i'm sure there are many better ways of doing it, but i'd still rather see the above than a long list of weird if or not and spaghetti conditions

Seventh Arrow
Jan 26, 2005

QuarkJets posted:

The missing docstring and bad function name are probably the two most important items on that list, new developers should definitely be thinking about these things when writing a new function. Every time.

Why docstrings and not comments?

Follow-up question: what to use comments for, then?

xzzy
Mar 5, 2009

Seventh Arrow posted:

Follow-up question: what to use comments for, then?

Disabling your debug print statements of course.

boofhead
Feb 18, 2021

I'm also a big fan of

Python code:

# TODO! fix this

QuarkJets
Sep 8, 2008

Seventh Arrow posted:

Why docstrings and not comments?

Follow-up question: what to use comments for, then?

The pep8 style guide says this about docstrings:

quote:

Write docstrings for all public modules, functions, classes, and methods. Docstrings are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the def line.

Pep8 is the industry standard, but even if you're not obeying pep8 docstrings are still considered best practice. That string automatically becomes the special __doc__ attribute of your function or class and should explain the fundamentals of the code: what it takes as input, what it does to the input, what it returns. This summary is useful for future developers looking at the code, including potentially the person who wrote it
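As a throwaway illustration (the function is invented), the docstring really is just an attribute, and its first line is what help() and most editors surface:

```python
def scale(values, factor):
    """Multiply each element of values by factor and return a new list."""
    return [v * factor for v in values]

print(scale([1, 2, 3], 2))  # -> [2, 4, 6]
print(scale.__doc__)        # the string is stored on the function itself
```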

Other comments (comments that aren't basically a docstring) may or may not be necessary. Comments can be useful for explaining some tricky behavior, but they shouldn't be there if the behavior is obvious. Pep8 agrees with this:

quote:

Inline comments are unnecessary and in fact distracting if they state the obvious. Don’t do this:
Python code:
x = x + 1                 # Increment x
But sometimes, this is useful:
Python code:
x = x + 1                 # Compensate for border

Comments have a tendency to rot; developers sometimes forget to tend to the comments while making changes to the code, especially if a comment is not immediately preceding a line that's being changed, and then you wind up with comments that do not accurately describe the code. This isn't just unhelpful, it's harmful and worse than not having a comment at all. Docstrings have all of these risks as well, but they're a little less prone to rotting: each object can only have 1 docstring, so it's easy to verify that the docstring is still accurate, whereas comments can be found anywhere

I felt that the example code was easy to read and didn't need comments. A docstring is required and would be sufficient

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Falcon2001 posted:

I agree. IMO trying to do this in Pandas is the wrong approach, part of your ETL pipeline is sanitizing and standardizing your data before loading it, so you should just handle that in code first before feeding it over to pandas. Especially if this is an adhoc spreadsheet. Here's an extremely naive example:

Python code:
from csv import DictReader

with open('file_path', 'r') as csvfile:
    reader = DictReader(csvfile)
    data = [row for row in reader]

out_data = []
current_category = ""
for row in data:
    if row['category'] and not (row['names'] or row['dates']):
        current_category = row['category']
    else:
        row['category'] = current_category
        out_data.append(row)
# Assumptions made:
# the category identifier row always shows up before the rows in that category
I agree that it should usually be a separate pipeline step, but you should feel fine doing whatever is idiomatically familiar to the people who need to maintain it while being reasonably performant

The nice thing about your data cleaning approach is that you can easily farm out bugs in the data cleaning to people who are basically complete strangers to data engineering, but if this isn't a problem you have, there's nothing wrong with vectorizing if it makes sense. The Pandas approach will probably, ironically, make less sense to people who are very familiar with Python, but feel extremely natural to people looking at Python data engineering from an R background

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

Other comments (comments that aren't basically a docstring) may or may not be necessary. Comments can be useful for explaining some tricky behavior, but they shouldn't be there if the behavior is obvious.

I've heard the most common case for this is new devs who add comments to everything except the parts that actually need explaining, so you have stuff like # iterate through list.

Personally, the best explanation for comments I've found is that comments tell you why the code is written this way, not what the code is doing. If your code has some reason to be unreadable, then sure, comment it, but you should explain why you had to do it that way if I'm reviewing your code. The why should be more like business logic, or the reasons you chose a particular approach over another, not an explanation of basic algorithms.

Vulture Culture posted:

I agree that it should usually be a separate pipeline step, but you should feel fine doing whatever is idiomatically familiar to the people who need to maintain it while being reasonably performant

Yeah, I suppose that 'the wrong approach' was too opinionated on my part, I agree with the rest of your post.

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?
Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

DoctorTristan posted:

Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes.

It shouldn't matter as long as you aren't blindly merging new changes to production without testing. If you are, pandas isn't the only thing that's going to screw your day up.

Chin Strap
Nov 24, 2002

I failed my TFLC Toxx, but I no longer need a double chin strap :buddy:
Pillbug

DoctorTristan posted:

Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes.

Better to never curate the API ever.

Oysters Autobio
Mar 13, 2017

Vulture Culture posted:

I agree that it should usually be a separate pipeline step, but you should feel fine doing whatever is idiomatically familiar to the people who need to maintain it while being reasonably performant

The nice thing about your data cleaning approach is that you can easily farm out bugs in the data cleaning to people who are basically complete strangers to data engineering, but if this isn't a problem you have, there's nothing wrong with vectorizing if it makes sense. The Pandas approach will probably, ironically, make less sense to people who are very familiar with Python, but feel extremely natural to people looking at Python data engineering from an R background

Thanks for all the examples. I'll admit as a data analyst, I'm more comfortable with pandas than python itself.

To your point about idiomatic, not gonna lie, the pandas-based masking and then forward filling method made much better sense to me than the pure python method.

But, since my question was geared towards ETL, I can understand why ppl might recommend pure pythonic methods for maintainability's sake.

I am still very much learning. Though to be honest I've found myself drawn towards functional approaches (like using map(), apply() etc). Sometimes people's really deeply nested iterative loops are hard for me to understand because I'm really new to imperative programming and mainly familiar with SQL and such as a data analyst.

Absolutely love me a nice CASE...WHEN... THEN syntax.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Falcon2001 posted:

Personally, the best explanation for comments I've found is that comments tell you why the code is written this way, not what the code is doing, unless your code has some reason to be unreadable, in which case, sure...but you should explain why you have to do that, if I'm reviewing your code. The why should be more like business logic, or reasons why you chose a particular approach over another, and not simply an explanation of basic algorithms.

One thing I used to do was write a comment describing what the next few lines were doing at a high level. Something like "# retrieve batch metadata from API" and then a few lines to create a requests session, query the API, pull data out of the response object.

I thought it made the code clearer by separating things into sections. Now, I use that urge as an indication that those lines should be in their own function. I think that's something often not explicitly stated in discussions of self-documenting code. Individual lines can be clear enough to not need comments, but if the whole function needs comments to keep track of what you're doing, it should be split up.
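As a toy before/after (all names invented), the section comment becomes the function name and goes away:

```python
# Before: one long function with section comments like
#     # retrieve batch metadata from API
#     session = make_session(); resp = session.get(url); meta = resp.json()

# After: the commented section is its own small function
def get_batch_metadata(batch_id):
    """Retrieve metadata for one batch (stubbed out for illustration)."""
    return {"batch_id": batch_id, "status": "ready"}

def process_batch(batch_id):
    metadata = get_batch_metadata(batch_id)  # the name says what the comment said
    return metadata["status"]

print(process_batch(7))  # -> ready
```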

Tayter Swift
Nov 18, 2002

Pillbug

Chin Strap posted:

Better to never curate the API ever.

And at least it has a pretty long deprecation time so you can catch warnings.

Meanwhile I'm embracing chaos and have switched over to polars. Every week there's a new deprecation to chase :twisted:

QuarkJets
Sep 8, 2008

BAD AT STUFF posted:

One thing I used to do was write a comment describing what the next few lines were doing at a high level. Something like "# retrieve batch metadata from API" and then a few lines to create a requests session, query the API, pull data out of the response object.

I thought it made the code clearer by separating things into sections. Now, I use that urge as an indication that those lines should be in their own function. I think that's something often not explicitly stated in discussions of self-documenting code. Individual lines can be clear enough to not need comments, but if the whole function needs comments to keep track of what you're doing, it should be split up.

I think that's a good instinct, I see that difference between more experienced and less experienced developers all the time - more experienced developers write few if any comments, good docstrings, and several smaller functions, whereas less experienced developers write fewer but much bigger functions. There are no hard and fast rules here, but that's the trend I notice

FISHMANPET
Mar 3, 2007

Sweet 'N Sour
Can't
Melt
Steel Beams
This is probably a very luddite opinion of mine, and I'm sure it's objectively wrong, but I really hate when there are too many functions being called (especially within functions within functions etc). I somewhat frequently find myself tracing very closely through some random code base trying to figure out exactly what's going on, and going deeper and deeper with functions makes that more difficult. You'd think in theory that I should be able to look at a function and say "It's going to do X" the same way I can look at a line like "a = b + c" and know what's going to happen. But in practice, it doesn't work out that way, and I end up having to read through those functions to figure out exactly what's happening.

boofhead
Feb 18, 2021

Yeah I am absolutely not a credible authority on anything but I hate clicking through a million sub calls to figure out how it all fits together. But I suspect the difference in opinion results from people dealing with vastly different code bases, code and documentation quality, function and variable naming schemes, and even just how different programmers approach problems

I agree in theory with most of the concepts behind clean code but I've had to deal with a lot of code from people who are far from perfect programmers and some additional comments on what they were TRYING to do would have been very very helpful for debugging, refactoring, and extending

If I have to scroll down twice within the same function it's too far, but if you have code that could have fit on one screen and that's only used by one function and you still split it up, I'm gonna get annoyed by that too after a while. Unless the code needs regular work for whatever reason

M. Night Skymall
Mar 22, 2012

FISHMANPET posted:

This is probably a very luddite opinion of mine, and I'm sure it's objectively wrong, but I really hate when there are too many functions being called (especially within functions within functions etc). I somewhat frequently find myself tracing very closely through some random code base trying to figure out exactly what's going on, and going deeper and deeper with functions makes that more difficult. You'd think in theory that I should be able to look at a function and say "It's going to do X" the same way I can look at a line like "a = b + c" and know what's going to happen. But in practice, it doesn't work out that way, and I end up having to read through those functions to figure out exactly what's happening.

That's why docstrings and good function names were called out as the most important things earlier, that's what helps you to navigate code bases designed that way. Also good tooling that allows you to jump to/from function definition easily, and get hover over docstrings/types. Modern software engineering best practices make a lot of assumptions about the environment the person reading the code is operating in. That said, people can and often do go way too far with OOP stuff and make it an unreadable mess in the name of "clean code."

Tayter Swift
Nov 18, 2002

Pillbug
Something I've kinda-sorta struggled with over the last couple of years is reconciling PEP8 and other best practices with the rise of notebooks. Does splitting sections into functions make sense when those sections are in individual cells? Do you avoid referencing results from previous cells when you don't have to, because of what can be a non-linear way of executing code (sometimes I'll just reread data from disk each cell if it's quick)? Docstrings are redundant when we have Markdown, right? What about line lengths?

12 rats tied together
Sep 7, 2006

basically if you have to descend into a function because you didn't know it existed, that's fine, but if you have to bring something from inside it with you and then continue your search, it shouldn't have been a function
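A contrived sketch of what I mean (everything here is invented for illustration): if callers keep needing a value that's computed *inside* the helper, the extraction boundary was drawn in the wrong place.

```python
# Awkward: the discount rate is trapped inside the helper, so a caller
# who needs it has to descend into the function and carry it back out.
def total_with_discount(subtotal: float, is_member: bool) -> float:
    rate = 0.10 if is_member else 0.0  # callers can't see or reuse this
    return subtotal * (1 - rate)

# Better: expose the thing callers actually needed to bring with them.
def discount_rate(is_member: bool) -> float:
    return 0.10 if is_member else 0.0

def total(subtotal: float, rate: float) -> float:
    return subtotal * (1 - rate)
```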

a dingus
Mar 22, 2008

Rhetorical questions only
Fun Shoe
Notebooks live in their own world of bad code and data science, where you prove something out with a notebook and then you throw it away.

Tayter Swift
Nov 18, 2002

Pillbug
idk, I like the concept of notebooks having rich-text documentation with the code, and able to show a process step-by-step.

I do wish there were a way to prevent rerunning previous sections of code (apart from an import cell), or to stop the cycle of altering a variable in one cell, changing that cell, and altering the variable again.
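In the meantime, one workaround (a toy sketch, not a real notebook feature) is to keep each cell a pure step that returns a new value instead of mutating something defined in an earlier cell, so running cells out of order or twice can't corrupt state:

```python
raw = [3, 1, 2]  # "load" cell: the only place state is defined

def cleaned(data: list[int]) -> list[int]:
    # sorted() returns a new list, so re-running this cell is harmless
    return sorted(data)

def doubled(data: list[int]) -> list[int]:
    return [x * 2 for x in data]

result = doubled(cleaned(raw))  # raw is untouched, still [3, 1, 2]
```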

Oysters Autobio
Mar 13, 2017
Notebooks live in an entirely different use of coding.

There's using code to make a tool, a package, or something else reusable: you're creating a system.

Then there's using code to analyze data, explore, test, or share lessons learned. That's creating a product: something to be consumed once by any number of people.

But after recently working in a data platform designed by data scientists who were inspired by Netflix circa 2015, I think it's hell. Every ETL job is mashed into a notebook. Parameterized notebooks that generate other notebooks. Absolutely zero tests.

If I'm writing an analytical report it's fantastic because I can weave in visualizations and text and images.

Or often if I'm testing and building a number of functions, it's nice to call them in the REPL on demand and debug things that way. But once that's finished, it goes into a Python file. VS Code has a great extension that can convert an .ipynb to a .py file.
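The conversion itself is nothing magic, by the way: an .ipynb is just JSON. Here's a rough sketch of what those exporters do under the hood (the real tools, like `jupyter nbconvert --to script` or the VS Code exporter, also handle magics, outputs, and markdown far more carefully):

```python
import json

def notebook_to_script(nb_json: str) -> str:
    """Minimal sketch of an .ipynb -> .py conversion.

    An .ipynb file is JSON; code lives in cells whose cell_type is
    "code", with the text stored as a list of source lines.
    """
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            chunks.append("".join(cell.get("source", [])))
    return "\n\n".join(chunks)
```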

But for straight code it's a mess and frankly I find it slows you down. With a plain .py file I just navigate up and down, comment as I need, etc.

Finally, once you've tried a full-featured IDE like VS Code, you'll never want to go back to JupyterLab. The tab completion is way snappier, you've got great variable viewers, and you can leverage a big ecosystem of extensions.

I'm a complete amateur at python and am only a data analyst, but I'm so glad I moved to VS Code.


BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

FISHMANPET posted:

This is probably a very luddite opinion of mine, and I'm sure it's objectively wrong, but I really hate when there are too many functions being called (especially within functions within functions etc). I somewhat frequently find myself tracing very closely through some random code base trying to figure out exactly what's going on, and going deeper and deeper with functions makes that more difficult. You'd think in theory that I should be able to look at a function and say "It's going to do X" the same way I can look at a line like "a = b + c" and know what's going to happen. But in practice, it doesn't work out that way, and I end up having to read through those functions to figure out exactly what's happening.

Functions within functions gets confusing fast. There was one project I inherited where you had to go three levels deep from the entry point of the Pyspark job before you started to see actual logic, and often you'd be in a different module than where you started. That was a nightmare.

Comments to explain different sections of a larger code block are definitely a code smell rather than something that should never be done. I'd argue that having to trace a long call stack to figure out where stuff happens is also a code smell, although I can't come up with a specific software design principle to justify that. Dependency inversion, maybe?

...now notebooks, I'm throwing everything at the wall and seeing what sticks.

BAD AT STUFF fucked around with this message at 02:22 on Feb 28, 2024
