Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

nullfunction posted:

Since you're offering it up for a roast, here are some things to consider:
  • Your error handling choices only look good on the surface. Yes, you've made an error message slightly fancier by adding some text to it, and yes, the functions will always raise errors of a consistent type. But they don't react in any meaningful way to handle the errors that might be raised, or enrich any of the error messages with context that would be useful to an end user (or a logging system). You could argue that they make the software worse, because they swallow the stack trace that might contain something meaningful (they don't raise from the base exception).


Do you have any good resources that discuss the proper way to deal with error handling?

QuarkJets
Sep 8, 2008

Fender posted:

There is some good feedback in here. Some of it I feel would be appropriate for a conversation. Like, why I check for None a lot. It's just a habit I learned in my last role where we parsed a lot of unreliable data sourced from scrapers. I would happily explain my position there.

One problem I see is that the None-checking is everywhere. Since you need to filter out empty strings anyway, it'd be better if you translated None to an empty string - then all of the code that's downstream from the file parsers doesn't have to do any None-checking. Basically you have these 3 interface functions that are letting None values filter into the rest of your functions, so you need to handle the possibility of None everywhere; if you handled the None values in the interface functions instead, the rest of your code would be a bit nicer.

Another problem is that the None-checking has become boilerplate; it'd be better to use some small utility functions to simplify the code. This also gets back to the issue I talked about where the street values have the possibility of raising an exception if any of them happen to be None: you did the None-checking for each of the street components in the XML function, but then you forgot that any of them could be None when you concatenated them together. If you had an interface function that returned only strings instead, the code would look nicer and this bug would go away.

For example:
Python code:
            name = ent.find("NAME").text.strip() if ent.find("NAME") is not None else None
            company = ent.find("COMPANY").text.strip() if ent.find("COMPANY") is not None else None
            street_1 = ent.find("STREET").text.strip() if ent.find("STREET") is not None else None
            street_2 = ent.find("STREET_2").text.strip() if ent.find("STREET_2") is not None else None
            street_3 = ent.find("STREET_3").text.strip() if ent.find("STREET_3") is not None else None
            street = street_1 + street_2 + street_3
            city = ent.find("CITY").text.strip() if ent.find("CITY") is not None else None
            county = ent.find("COUNTY").text.strip() if ent.find("COUNTY") is not None else None
            state = ent.find("STATE").text.strip() if ent.find("STATE") is not None else None
            country = ent.find("COUNTRY").text.strip() if ent.find("COUNTRY") is not None else None
            postal_code = ent.find("POSTAL_CODE").text.strip() if ent.find("POSTAL_CODE") is not None else None

            data = {
                "name": name,
                "organization" : company,
                "street": street,
                "city": city,
                "county" : county,
                "state": state,
                "zip": self._parse_zip(postal_code),
                "country": country,
            }
Wouldn't it be great to have a function that does most of this stuff?
Python code:
def parse_ent(ent, name):
    return x.text.strip() if (x := ent.find(name)) is not None else ""

def xml_ent_to_address(ent):
    data = {'name': parse_ent(ent, 'NAME'),
            'company': parse_ent(ent, 'COMPANY'),
            # ...
            'zip': parse_zip(parse_ent(ent, "POSTAL_CODE")),
            'country': parse_ent(ent, 'COUNTRY')
    }
    # street can have several components
    street_parts = (parse_ent(ent, x) for x in ('STREET', 'STREET_2', 'STREET_3'))
    data['street'] = ' '.join(x for x in street_parts if x)
    # And check this out: splat operators let you avoid having a funny "if data is not None" in this class
    return Address(**data)

quote:

And stuff like the Address class having that weird view_dict method, would love to chat about that. That is like that bc while they never said it, the examples both showed the printed json ignoring any keys without values, so I went out of my way to provide that since all the example data had different fields present/missing depending on the format. I was really getting in my own head by hour number 4.

I got that it was necessary to filter keys without values; I was saying that the view_dict method could have been a lot more concise. If you'd used a dataclass:
Python code:
    def view_dict(self):
        # asdict comes from the dataclasses module.
        # If we put the .strip() here, we don't need to have it on every line
        # of the __init__, and we don't need None-checking or len-checking either.
        # This one-liner does it all:
        return {name: value.strip() for name, value in asdict(self).items() if value}
As it stands now, the Address class is kind of pointless; only its view_dict method is useful, the rest of it provides no additional value. Its __init__ strips input strings, but the file parsers are already doing that. A class like this could be beneficial for verifying that you're using valid names in your dictionaries, but since the class allows everything to be None, it doesn't actually provide that benefit: any dictionary with a mistyped key would see that value silently dropped on the floor, as the Address instance auto-assigns the corresponding field to None, quietly permitting buggy outcomes.

Permitting a dictionary as an optional input argument that you then parse with .get() is to blame there; if you make the keyword arguments explicit and use splat operators while creating the instance, instead of a loose dictionary interface, then you'll be alerted right away if a bad key gets provided.

Python code:
# Too permissive:
class Address():
    def __init__(self, data=None, **kwargs) -> None:
        if data is None:
            data = kwargs
        self.street = data.get("street").strip() if data.get("street") else None
        self.city = data.get("city").strip() if data.get("city") else None
...


# Good:
class Address:
    def __init__(self, street='', city=''):  # etc., you get the idea
        self.street = street
        self.city = city
...


# Better:
@dataclass  # from dataclasses import dataclass
class Address:
    street: str = ""
    city: str = ""
...
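To make that concrete, here's a quick illustration (hypothetical data, using the dataclass version above):
Python code:
# A mistyped key now fails loudly instead of silently becoming None:
data = {"steet": "123 Main St", "city": "Springfield"}  # note the typo in "steet"
address = Address(**data)  # raises TypeError: unexpected keyword argument 'steet'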

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Chillmatic posted:

I wrote a script to get all the xml files from the show in question, and then iterate through them using the intro marker data in those xml files to decide where to make the edits using ffmpeg, except now I'm running into losing the subtitles from the original files, and converting them is a whole thing because they're PGS format. :negative:

Comedy option: Send the edited files to OpenAI's Whisper API and let it create new transcripts for you!

The March Hare
Oct 15, 2006

Je rêve d'un
Wayne's World 3
Buglord

Chillmatic posted:

I wrote a script to get all the xml files from the show in question, and then iterate through them using the intro marker data in those xml files to decide where to make the edits using ffmpeg, except now I'm running into losing the subtitles from the original files, and converting them is a whole thing because they're PGS format. :negative:

You "simply" need to discover the correct invocation Obi-Wan. (I do not know it, sorry...)

Chillmatic
Jul 25, 2003

always seeking to survive and flourish
I ended up doing the most hacky bullshit ever, but finally I will never have to listen to the Office theme song ever again. gently caress that piano. gently caress that doo doo rear end elementary school recorder instrument or whatever the gently caress it is. Boom bitch, bye.

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

QuarkJets posted:

More stuff.

My entire thought process for that Address class was basically, "this data is crap, I'm just gonna pass None all the way to the constructor and not worry about it." Which sounds so obviously pointless after reading your advice. It's rough getting your code roasted, but drat if this isn't good poo poo.

QuarkJets
Sep 8, 2008

Thank you! Yeah that data sucks for sure, what a pain. And irl someone is going to come to you with some stupid bullshit like "Oops between dates X and Y the last and first names are swapped and in the wrong columns, can you fix that" and depending on the request you have to know how to ride the line between "no that's an unreasonable request" and "this is important, I will help you with that".

nullfunction
Jan 24, 2005

Nap Ghost

Cyril Sneer posted:

Do you have any good resources that discuss the proper way to deal with error handling?

Afraid not. My knowledge is built up over many years and many versions and there's no one place I can point to that has everything you would conceivably want to know. I can generalize a bit but you're going to need to get your hands dirty if you really want to understand.

Always catch the most specific type you can in any given situation. Overly-broad exception handling is definitely a smell and you can configure your linter to warn you about this if you're not already being warned. There are a couple of patterns you can use to help manage complex exception possibilities:

Python code:
try:
    might_raise()
except TypeError as data_format_error:
    ... # Nothing fancy here, but you can except multiple times for a single try
except (KeyError, ValueError) as data_validity_error:
    ... # Both KeyError and ValueError are handled with the same logic
except Exception as e:
    ... # This broad statement comes after the specific exceptions we really care about and provides a catch-all for anything unhandled above.
except: # DO NOT DO THIS DO NOT DO THIS
    ... # You have now swallowed KeyboardInterrupt so good luck stopping your code while this is running, idiot
If I catch an exception, generally I'm doing one of three things with it:

Python code:
# I want to add some logic but still get the original error, which 
# can be accomplished with a bare raise like QuarkJets pointed out.
try:
    might_raise()
except SomeException:
    logger.exception("Oh no!")
    raise
Python code:
# I want to wrap an existing exception in an exception type that I
# control so that I have more granular control elsewhere in my code
# or I expose predictable errors to my consumers
try:
    might_raise()
except SomeException as e:
    raise MyOtherExceptionType("context info") from e
Python code:
# I want to swallow an exception because I have a way to handle
# it in code, or it is not impactful to the outcome of my program.
try:
    might_raise()
except SomeNonBlockingException:
    # Handle the failure here, or don't.
    pass

# The above is fine but contextlib (stdlib) gives us an even nicer 
# way to disregard an exception with suppress as a context manager:
from contextlib import suppress

def nan_divider(a: int, b: int) -> float:
    """Returns NaN instead of ZeroDivisionError"""
    with suppress(ZeroDivisionError):
        return a / b
    return float("nan")
Yes, you can do other things with exceptions, like raising an exception disconnected from the exceptions that caused it; I just don't find myself reaching for this very often.
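For completeness, that disconnection is spelled with from None, which severs the exception chain so the original exception isn't shown as the cause; a minimal sketch:
Python code:
try:
    might_raise()
except SomeException:
    # "from None" hides the original exception from the traceback chain;
    # use sparingly, since you lose the context that caused the failure
    raise MyOtherExceptionType("context info") from None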

It's a good practice to put the absolute minimum you can get away with inside each try block; often this is a single line of code. It's much nicer to reason about lots of try/except blocks that catch the same error from different calls than to have all the calls in a single try/except with no clear indication of which call raised the exception short of digging through the exception's guts. If you intend to handle an exception from either of those calls in the same way and it makes no difference to you where that exception originated, feel free to use the more concise form.

Python code:
# Concise but ambiguous
try:
    might_raise_one()
    might_raise_two()
except TheSameException:
    ... # raised, but which one failed?

# Longer, but clear and easier to modify
try:
    might_raise_one()
except TheSameException:
    ...

try:
    might_raise_two()
except TheSameException:
    ...
Finally, handling exceptions in the right place is critical. I only handle exceptions when I know I have all the necessary context to put into that exception so that it's useful to whoever is receiving it. Exceptions sometimes happen at deeper levels of your software and need to be propagated up to the right place where you have enough meaningful info to be able to handle it, whether that's retrying a certain number of times, putting a placeholder in for some missing data, or logging an error message to be addressed later. This is the hardest bit to illustrate with examples, because the examples have to be nontrivial to really show why it's important. It's left as an exercise for the reader -- you're almost certainly going to get this wrong a few times before you start getting it right.
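To at least show the shape of it, here's a deliberately trivial sketch (hypothetical fetch_record and retry policy): the low-level function doesn't catch anything because it has no useful context, while the caller that knows the retry policy does the handling:
Python code:
def fetch_record(record_id):
    """Low level: no retry context here, so let exceptions propagate."""
    ...  # may raise ConnectionError

def fetch_with_retries(record_id, attempts=3):
    """High level: knows how many attempts are acceptable, so it handles."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_record(record_id)
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries; propagate to whoever decides what's next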

Consider a script that munges CSV files from a list of provided directories and outputs the result of some calculation as a different CSV file. How many failure modes can you think of that might need to be handled differently?

What if we were provided an empty list of directories? What if we can't access one of those directories? What if we can't access any of the directories? What if we can't access a single file in one of the directories but can access the rest? What happens if a provided directory doesn't contain any CSV data? What happens if we load something that is named CSV but doesn't contain CSV data, just plain text or a blank file? What happens if it contains binary data? What happens if the CSV headers don't include the field(s) we're after? What happens if the data itself is not valid? What if we can't write the output file?

That's a lot of failures and I'm sure you could come up with many more in this vein! What's not great is that these are largely technical concerns which naturally lend themselves to error handling conversations. The other side of the coin is how the script fits into the larger picture. If your script is run nightly on a server using automation, your error handling story will be completely different than if Bob in Accounting has to run this script ad-hoc at the end of the month to come up with the numbers needed to close the books.

Hed
Mar 31, 2004

Fun Shoe
I have a docker container that just runs a Python script for a TCP server I built. It's worked great for the last 3 years.

I was changing some logging in the code and in the meantime upgraded python and the underlying linux. Python 3.9 -> 3.12 is no problem but bullseye -> bookworm seems to break it.

Python code:
server = await asyncio.start_server(client_connected, host, port)
await server.serve_forever()
It launches fine but when I connect to it I get
pre:
[2024-05-02 01:23:14,025] - ERROR - Unhandled exception in client_connected_cb
handle_traceback: Handle created at (most recent call last):
  File "/usr/local/app/server.py", line 442, in <module>
    asyncio.run(main(bind_address, bind_port), debug=True)
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
  File "/usr/local/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 674, in run_until_complete
    self.run_forever()
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 641, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1979, in _run_once
    handle._run()
  File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
transport: <_SelectorSocketTransport fd=7 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/usr/local/app/server.py", line 285, in client_connected
    fwd_reader, fwd_writer = await asyncio.open_connection(host_ip, port, family=AF_INET,
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1080, in create_connection
    infos = await self._ensure_resolved(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 1456, in _ensure_resolved
    return await loop.getaddrinfo(host, port, family=family, type=type,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 901, in getaddrinfo
    return await self.run_in_executor(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 863, in run_in_executor
    executor.submit(func, *args), loop=self)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 179, in submit
    self._adjust_thread_count()
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 202, in _adjust_thread_count
    t.start()
  File "/usr/local/lib/python3.12/threading.py", line 992, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
"Can't start new thread" is weird.
It works fine if I run Python 3.12 in bullseye, just not bookworm, so I'm not sure what changed in bookworm. Running ulimit and ulimit -u inside the container both result in "unlimited", and the host has thousands of spare threads. Any idea what it could be?

Hed fucked around with this message at 13:55 on May 3, 2024

necrotic
Aug 2, 2005
I owe my brother big time for this!
I found this thread on the Docker forums that mentions that the --privileged flag resolved it: https://forums.docker.com/t/runtimeerror-cant-start-new-thread/138142/4

Upgrading both the OS and Python at the same time can make it difficult to nail down exactly what side changed, but my hunch is something related to Linux and containers and security contexts.

There may be a more limited seccomp change you can make to allow threading without the privileged flag.

https://docs.docker.com/engine/security/seccomp/

Hed
Mar 31, 2004

Fun Shoe
Thanks, I’ll give that a shot. Great find.
And I posted in error: I was upgrading all the way from buster. So everything but the latest Linux release seems to work. I’ll report back.

Jose Cuervo
Aug 25, 2004
I have a SQLAlchemy model called Subject, and each Subject model has an attribute named 'heart_rate': Mapped[list['HeartRateValue']], where the HeartRateValue model stores the time_stamp and value of each heart rate value. I know if I have the ID of the subject I can use

session.get(Subject, subject_id)

to get the Subject object where the subject ID is subject_id. Is there a way with SQLAlchemy to then query the Subject object for the heart rate values which fall into a certain time interval (say start_date_time, end_date_time)?

Jose Cuervo
Aug 25, 2004

Jose Cuervo posted:

I have a SQLAlchemy model called Subject, and each Subject model has an attribute named 'heart_rate': Mapped[list['HeartRateValue']], where the HeartRateValue model stores the time_stamp and value of each heart rate value. I know if I have the ID of the subject I can use

session.get(Subject, subject_id)

to get the Subject object where the subject ID is subject_id. Is there a way with SQLAlchemy to then query the Subject object for the heart rate values which fall into a certain time interval (say start_date_time, end_date_time)?

Figured out how to do this:
Python code:
session.get(Subject, subject_id) \
       .hr_values \
       .where(HeartRateValue.datetime >= start_dt) \
       .where(HeartRateValue.datetime < \
                (pd.to_datetime(start_dt) + \
                 pd.DateOffset(hours=tss_length_hrs))) \
       .all()
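An alternative that would skip loading the Subject entirely, assuming HeartRateValue has a subject_id foreign key column (an assumption on my part - adjust to the real model, with end_dt standing in for start_dt plus your offset), is to query the child table directly with a 2.0-style select:
Python code:
from sqlalchemy import select

stmt = (
    select(HeartRateValue)
    .where(HeartRateValue.subject_id == subject_id)  # hypothetical FK column
    .where(HeartRateValue.datetime >= start_dt)
    .where(HeartRateValue.datetime < end_dt)
)
hr_values = session.scalars(stmt).all()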

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
Anyone attending PyCon this weekend?

Hed
Mar 31, 2004

Fun Shoe
I have lots of Python experience but no pandas, and while I'm trying to learn polars I clearly don't get it.

I'm trying to make column selections in one dataframe based on rows in another.

I have a dataframe widgets:
pre:
shape: (3, 4)
┌───────┬───────┬──────┬──────────┐
│ group ┆ month ┆ year ┆ quantity │
│ ---   ┆ ---   ┆ ---  ┆ ---      │
│ u8    ┆ str   ┆ u16  ┆ i16      │
╞═══════╪═══════╪══════╪══════════╡
│ 1     ┆ 07    ┆ 2024 ┆ 520      │
│ 1     ┆ 09    ┆ 2024 ┆ 640      │
│ 1     ┆ 12    ┆ 2024 ┆ 108      │
└───────┴───────┴──────┴──────────┘
and one prices (only relevant columns shown):
pre:
shape: (13_992, 134)
┌─────────────┬─────────────┬────────────┐
│ 072024      ┆ 092024      ┆ 122024     │
│ ---         ┆ ---         ┆ ---        │
│ f64         ┆ f64         ┆ f64        │
╞═════════════╪═════════════╪════════════╡
│ null        ┆ null        ┆ null       │
│ 1947.050278 ┆ 1252.862918 ┆ 857.521461 │
│ 1917.11333  ┆ 1234.664637 ┆ 846.660942 │
│ 1917.11333  ┆ 1234.664637 ┆ 846.660942 │
│ 1917.11333  ┆ 1234.664637 ┆ 846.660942 │
│ …           ┆ …           ┆ …          │
│ null        ┆ null        ┆ null       │
│ null        ┆ null        ┆ null       │
│ null        ┆ null        ┆ null       │
│ null        ┆ null        ┆ null       │
│ null        ┆ null        ┆ null       │
└─────────────┴─────────────┴────────────┘
Now I want to take the str_concat(month, year) from the widgets dataframe, use it to pick the matching price column, and multiply that column by the quantity. So the result would be the column (pl.col("072024") * 520), then (pl.col("092024") * 640), and so on. I keep wanting to do this with a for loop, but I can't find any way to do this the "right" way. Do I need to join and filter?

I have read the docs and the O'Reilly preview for "Python Polars: The Definitive Guide", but while I get the individual functions I'm having trouble getting into the mindset.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Hed posted:

I have lots of Python experience but no pandas, and while I'm trying to learn polars I clearly don't get it.

I'm trying to make column selections in one dataframe based on rows in another.

I unfortunately don't know Polars at all, but I can confirm that you're not crazy for feeling like there's a mindset difference. It took me a while to figure out how the hell I was supposed to do basic operations in pandas; a lot of it boiled down to 'nah, you just need to apply things across a whole series' rather than more traditional for-looping.
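Roughly the mindset shift, as a toy pandas example (made-up data):
Python code:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# The for-loop instinct:
# for i in range(len(df)):
#     df.loc[i, "doubled"] = df.loc[i, "x"] * 2

# The series mindset: one vectorized expression over the whole column
df["doubled"] = df["x"] * 2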

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
Are you familiar with SQL? Python libraries with DataFrames (like pandas, polars, or pyspark) use a lot of SQL idioms. You don't need a for loop because a select statement applies to all of your rows.

Python code:

df.select(pl.concat_str('month', 'year').cast(pl.UInt32) * pl.col('quantity'))

Jose Cuervo
Aug 25, 2004

BAD AT STUFF posted:

Are you familiar with SQL? Python libraries with DataFrames (like pandas, polars, or pyspark) use a lot of SQL idioms. You don't need a for loop because a select statement applies to all of your rows.

Python code:
df.select(pl.concat_str('month', 'year').cast(pl.UInt32) * pl.col('quantity'))

Do you have a recommended resource to learn some basics of SQL?

Tayter Swift
Nov 18, 2002

Pillbug
I think the problem is your price table isn't in a tidy, "one observation per row" format. I'd unpivot it with a melt() function, convert the dates in both dataframes to a common datetime format, then join on that column.

edit: this should get you started:
code:
price = (
    price
    .melt(
        variable_name='date', 
        value_name='price'
    )
    .with_columns(
        col('date').str.to_date(format='%m%Y', strict=False)
    )
)

Tayter Swift fucked around with this message at 00:34 on May 22, 2024

Tayter Swift
Nov 18, 2002

Pillbug
know what, I needed a break from the other polars code I've been hitting my head with all day:

Python code:
import polars as pl
from polars import col, lit

qty = pl.DataFrame({
    'group': [1, 1, 1],
    'month': ['07', '09', '12'],
    'year': ['2024', '2024', '2024'],
    'quantity': [520, 640, 108],}
)

price = pl.DataFrame({
    '072024': [None, 1947.050278, 1917.11333, 1917.333, 1917.11333],
    '092024': [None, 1252.8629, 1234.6646, 1234.6646, 1234.6646],
    '122024': [None, 857.521, 846.660, 846.660, 846.660],
})

price = (
    price
    .melt(variable_name='date',  value_name='price')
    .with_columns(
        col('date').str.slice(0,2).alias('month'),
        (lit('20') + col('date').str.slice(-2,None)).alias('year'),
    )
)

joined = (
    qty
    .join(price, on=['month', 'year'], how='inner')
    .with_columns(cost = col('quantity') * col('price'))
    .drop('date')
)
code:
shape: (15, 6)
+-------+-------+------+----------+-------------+-------------+
| group | month | year | quantity | price       | cost        |
| ---   | ---   | ---  | ---      | ---         | ---         |
| i64   | str   | str  | i64      | f64         | f64         |
+=============================================================+
| 1     | 07    | 2024 | 520      | null        | null        |
| 1     | 07    | 2024 | 520      | 1947.050278 | 1.0125e6    |
| 1     | 07    | 2024 | 520      | 1917.11333  | 996898.9316 |
| 1     | 07    | 2024 | 520      | 1917.333    | 997013.16   |
| 1     | 07    | 2024 | 520      | 1917.11333  | 996898.9316 |
| …     | …     | …    | …        | …           | …           |
| 1     | 12    | 2024 | 108      | null        | null        |
| 1     | 12    | 2024 | 108      | 857.521     | 92612.268   |
| 1     | 12    | 2024 | 108      | 846.66      | 91439.28    |
| 1     | 12    | 2024 | 108      | 846.66      | 91439.28    |
| 1     | 12    | 2024 | 108      | 846.66      | 91439.28    |
+-------+-------+------+----------+-------------+-------------+

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
Oh, good call. I was too focused in on the "(pl.col("072024") * 520)" portion of the question (and still misunderstood it) and missed the point. I should have wondered why the second table was mentioned...

Jose Cuervo posted:

Do you have a recommended resource to learn some basics of SQL?

I don't have a specific recommendation, other than I would look at relational databases in general and not just SQL on its own. Maybe from a university that posts its stuff online like MIT or Stanford.

Hed
Mar 31, 2004

Fun Shoe

Tayter Swift posted:

know what, I needed a break from the other polars code I've been hitting my head with all day:

Python code:
import polars as pl
from polars import col, lit

qty = pl.DataFrame({
    'group': [1, 1, 1],
    'month': ['07', '09', '12'],
    'year': ['2024', '2024', '2024'],
    'quantity': [520, 640, 108],}
)

price = pl.DataFrame({
    '072024': [None, 1947.050278, 1917.11333, 1917.333, 1917.11333],
    '092024': [None, 1252.8629, 1234.6646, 1234.6646, 1234.6646],
    '122024': [None, 857.521, 846.660, 846.660, 846.660],
})

price = (
    price
    .melt(variable_name='date',  value_name='price')
    .with_columns(
        col('date').str.slice(0,2).alias('month'),
        (lit('20') + col('date').str.slice(-2,None)).alias('year'),
    )
)

joined = (
    qty
    .join(price, on=['month', 'year'], how='inner')
    .with_columns(cost = col('quantity') * col('price'))
    .drop('date')
)
code:
shape: (15, 6)
+-------+-------+------+----------+-------------+-------------+
| group | month | year | quantity | price       | cost        |
| ---   | ---   | ---  | ---      | ---         | ---         |
| i64   | str   | str  | i64      | f64         | f64         |
+=============================================================+
| 1     | 07    | 2024 | 520      | null        | null        |
| 1     | 07    | 2024 | 520      | 1947.050278 | 1.0125e6    |
| 1     | 07    | 2024 | 520      | 1917.11333  | 996898.9316 |
| 1     | 07    | 2024 | 520      | 1917.333    | 997013.16   |
| 1     | 07    | 2024 | 520      | 1917.11333  | 996898.9316 |
| …     | …     | …    | …        | …           | …           |
| 1     | 12    | 2024 | 108      | null        | null        |
| 1     | 12    | 2024 | 108      | 857.521     | 92612.268   |
| 1     | 12    | 2024 | 108      | 846.66      | 91439.28    |
| 1     | 12    | 2024 | 108      | 846.66      | 91439.28    |
| 1     | 12    | 2024 | 108      | 846.66      | 91439.28    |
+-------+-------+------+----------+-------------+-------------+

Holy crap, thank you. I didn't realize "tidy" was the concept I was missing, so I found this, and yes, I agree, I will do this: https://r4ds.had.co.nz/tidy-data.html. It makes sense.
I guess the thought is that since the data is columnar, I shouldn't worry about the seemingly inefficient operations because this is actually optimized? I will probably play around to make sure I really understand it.

Thanks for getting me on the right path.

Tayter Swift
Nov 18, 2002

Pillbug
I can't read Rust code very well, but it looks like care is taken to not make it an O(n²) operation.

But yeah, manipulating columns and rows like it's an Excel PivotTable is core to DataFrame functionality, whether it's pandas or polars.

StumblyWumbly
Sep 12, 2007

Batmanticore!
It's a great solution and I think I'm closer to understanding melt now. Two minor style comments (quick sketch below):
- you can replace .with_columns(<stuff>.alias(name)) with .with_columns(name=<stuff>)
- instead of using .drop to remove the columns you don't want, use .select to keep the ones you do. I've heard that the polars devs encourage doing it that way; it makes what's in the dataframe clearer.
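For example (toy dataframe; each pair is equivalent):
Python code:
import polars as pl
from polars import col

df = pl.DataFrame({'a': [1, 2, 3]})

# Naming a new column: .alias() form vs keyword form
df = df.with_columns((col('a') * 2).alias('doubled'))
df = df.with_columns(tripled=col('a') * 3)

# Shedding the intermediate column: drop what you don't want,
# or select only what you do
df.drop('a')
df.select('doubled', 'tripled')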

Tayter Swift
Nov 18, 2002

Pillbug
I think the "x = foo" syntax showed up just after I picked up polars -- I prefer it but have a bad habit of still using .alias() sometimes.

I'll die on the .drop() hill though, when I only want to drop a column or two. It's explicitly dropping the columns, while .select()ing everything but what I want is implicitly dropping (and can be difficult to state if you don't know what columns are in the DataFrame -- df.select(pl.all().exclude('foo')) is a bit much for my taste).

Tayter Swift fucked around with this message at 03:44 on May 23, 2024

StumblyWumbly
Sep 12, 2007

Batmanticore!
Polars is definitely still changing pretty quickly in kinda major ways. I saw a talk recently where the guy was pretty clear that x= and .alias(x) are the same, but alias has become such an ingrained habit it's getting passed on long past its life.

I can see the reasoning behind using drop, especially if you're just creating a temporary column for some math. I might be misremembering, but I thought at the talk the guy said the devs behind polars preferred select and that drop was less efficient, but I should really test that before saying it too loudly.

Tayter Swift
Nov 18, 2002

Pillbug
It definitely shouldn't make a difference if you use LazyFrames.

Tayter Swift
Nov 18, 2002

Pillbug
Huh, looking at the issue tracker it looks like someone did experience a performance degradation when using drop() on LazyFrames "dozens to hundreds of times per second." Maybe to do with being more difficult to parallelize. Ritchie did some tweaking but it did not fix this person's problem, so he reverted to using select().

I dunno, that seems like a corner case to me and something to keep in mind, but not worth losing the readability of the code for most work.

HorseHeadBed
May 6, 2009
Sorry in advance but I am a loving moron when it comes to programming and as much as I think I understand Python when I follow tutorials, trying to do something in the real world leads to me holding my head in my hands.

I'm trying to make 600 PDFs. They all have different first pages (covers), which are individually named PDFs in a folder (1.pdf, 2.pdf etc). I want to add the same PDF after each of these covers ('666-interior.pdf'). I'm trying to use PyPDF2 for this and I thought I knew what I was doing, but I do not.

This code is what I got, but it's not incrementing the cover file name (each cover is just 1.pdf), and it's doubling up the interior each time: every iteration repeats the 666-interior.pdf. What am I doing wrong? I would really appreciate some help here.

code:
import PyPDF2

from PyPDF2 import PdfReader, PdfWriter

merger = PdfWriter()

interiorPDF = '666-interior.pdf'

x = 1

while x <= 600:
    increment = str(x)
    readPDF = increment + '.pdf'
    print(readPDF)
   
    merger.append(readPDF, 'rb')
    merger.append(interiorPDF, 'rb')
    outputName = 'Six Six Six - ' + str(x) + '.pdf'
    merger.write(outputName)
    merger.close
    print("Done!")
    x = x + 1

boofhead
Feb 18, 2021

I'm not following your description of the bug and what output you actually get - do you still get 600 files? Or just 1 with 600 iterations of the content file? Do you get any specific error messages from python when you run it?

I haven't used that tool before but in the meantime, you could try moving the definition of merger = PdfWriter() to the start of each loop, because you're closing it again at the end of every loop, and I'd assume it's not supposed to work that way. Also I imagine you need parentheses on close, like merger.close()

HorseHeadBed
May 6, 2009

boofhead posted:

I'm not following your description of the bug and what output you actually get - do you still get 600 files? Or just 1 with 600 iterations of the content file? Do you get any specific error messages from python when you run it?

I haven't used that tool before but in the meantime, you could try moving the definition of merger = PdfWriter() to the start of each loop, because you're closing it again at the end of every loop, and I'd assume it's not supposed to work that way. Also I imagine you need parentheses on close, like merger.close()

Sorry - I should have been clearer. It's producing 600 files. Each one has the same cover (first file) which doesn't seem to be incrementing. Furthermore, it's doubling up the file size each time, adding another duplicate of the interior file. I don't think it's a bug so much as my own incompetence. I was just hoping someone would be able to point out the logic flow that's doing something wrong.

And yes, I need parentheses on the end of that line.

Data Graham
Dec 28, 2009

📈📊🍪😋



First thing is that you're .append()'ing in every iteration of the loop and it never gets reset, meaning it will just keep accumulating new pages every time it loops. That reset is what the instantiation of merger = PdfWriter() does, so that needs to be done fresh at the beginning of each iteration, i.e. within the loop, like boofhead said. See if that helps with the initial cover read issue too.

QuarkJets
Sep 8, 2008

If you want 600 files, shouldn't `merger = PdfWriter()` go inside the loop? If it's on the outside then you're appending to that same object on each iteration.

The line "merger.close" doesn't do anything. "merge.close()" will call the close method

e: Looks like I have the same suggestions as boofhead, I should have read their post before replying!

QuarkJets fucked around with this message at 18:48 on May 27, 2024

QuarkJets
Sep 8, 2008

As an aside, I'll suggest f-strings and integer 0-padding to make your file names neater:


code:
outputName = f'Six Six Six - {x}.pdf'  # Without 0-padding
outputName = f'Six Six Six - {x:03d}.pdf'  # Zero-pads up to 3 digits, e.g. 1 becomes 001

boofhead
Feb 18, 2021

E: other people replied with the same stuff in the meantime

Have you tried the changes I suggested? Looking at your code I imagine it's just ignoring merger.close and then adding the next cover page then contents to the same pdf file, so your 600th file probably looks like:

Cover page 1
Contents
Cover page 2
Contents
...
Cover page 600
Contents

Move the definition of merger inside the loop so it creates a new object each time, and close it properly, and see if that fixes it
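Putting it all together, the fixed loop would look something like this (untested sketch, reusing the file names from your post):
Python code:
from PyPDF2 import PdfWriter

interiorPDF = '666-interior.pdf'

for x in range(1, 601):
    merger = PdfWriter()          # fresh writer each iteration, so pages don't accumulate
    merger.append(f'{x}.pdf')     # this iteration's cover
    merger.append(interiorPDF)    # the shared interior
    merger.write(f'Six Six Six - {x}.pdf')
    merger.close()                # note the parentheses: this actually calls close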

HorseHeadBed
May 6, 2009
Ah that's great - thank you! Moving the merger = PdfWriter() into the loop seems to have done it. That and the merger.close() parentheses have worked a treat.

I really, really appreciate the guidance.

PS - boofhead, the interiors were repeating, but not the covers.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
I can’t remember what I replaced them with at the moment, but both PyPDF2 and PyPDF4 are old and unsupported at this point. I recall running into annoying stuff with them a lot and ditching them for a more modern package 1-2 years ago.

Edit: it was pymupdf
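For the curious, the same cover-plus-interior merge looks roughly like this in pymupdf (untested sketch, reusing the hypothetical file names from above):
Python code:
import fitz  # PyMuPDF

doc = fitz.open('1.pdf')                       # start from the cover
doc.insert_pdf(fitz.open('666-interior.pdf'))  # append the interior pages
doc.save('Six Six Six - 1.pdf')
doc.close()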

CarForumPoster fucked around with this message at 20:55 on May 27, 2024
