Python

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

«‹›22 »

QuarkJets: Sep 8, 2008

Data Graham posted:

Wait a minute, I was just using dataclasses for something similar and was going to mention something annoying about them, which is
Python code:
----> 1 thing1 = Thing()

TypeError: __init__() missing 3 required positional arguments: 'a', 'b', and 'c'

You can assign defaults in the dataclass definition, see my example

# ¿ Apr 26, 2024 22:42

Adbot: ADBOT LOVES YOU

# ¿ May 16, 2024 02:00

QuarkJets: Sep 8, 2008

I don't like how the function code is all wrapped in large try/except blocks that raise new errors. That's not really error checking, it's error obfuscation; yes, you print the caught exception object, but it's hiding the stack trace for no good reason. If you absolutely felt like you had to add more context to exceptions, like if you wanted to use logging or send a notification to someone, then you could use exception chaining to raise the original exception

Python code:

def test():
    try:
        raise RuntimeError('foo')
    except RuntimeError as e:
        logging.warning(f'gently caress, I saw exeption {e}!!!')
        raise  # <-- This simple line raises the original exception as though it wasn't caught at all

But I think you should ditch the try/except blocks entirely, they're not helping you and bare Exception catching (where you catch the base Exception type instead of a more specific type) is a code smell.

Don't run pip install commands in your notebook code. A comment that describes what's needed is fine.

# ¿ Apr 27, 2024 03:15

QuarkJets: Sep 8, 2008

The commit message is "challenge completed"; one of the requirements is to write a concise commit message, but I think that's probably too concise.

After looking it over a little I realized that the code isn't pep8-compliant, which is another requirement. A common tool to verify this is the standalone utility named "pycodestyle", which used to literally be called "pep8". An even better tool is "flake8", which combines pycodestyle and some other linters into one utility. Try installing flake8 (with pip) and running it on your script (`flake8 challenge.py`), it'll print out a list of line numbers with specific issues to fix. You can also run pycodestyle in the same way if you want to see only the pep8 problems. Linting is integrated directly into most modern IDEs but every one has its own default flavor of what's "right" and not all of them are pep8-compliant; lesson learned, run flake8 on your code before you commit it
(you should look up pre-commit-hooks, during a git commit they can not only automatically fix a lot of common issues like unnecessary whitespace at the end of lines, you can also incorporate a flake8 pre-commit stage that will cause the commit to be rejected if flake8 has complaints; this enforces pep8 compliance on anything that you try to commit, it's very handy!)

Those two issues (PEP8 compliance, vague commit message) + the result not being sorted by zip code is 3 failed requirements. Those are probably the big issues here. I have some other suggestions for your code that you can think about, because I like providing code review, but I don't think these matter as much:

The `Address` class would be a lot more concise if it was a dataclass. Then the `view_dict` method reduces to 2 lines (one of which is the asdict function from the dataclasses module). As it stands, `Address` also has no documentation

A class-based approach like this has fallen out of favor with a lot of Python developers. The methods in `Parser` could have all been functions.

`file_path[-3:].lower()` is less effective at extracting file extensions than `os.path.splitext(file_path)[-1]`. Technically it satisfies the assignment, but this line is a brittle implementation that could require rework if it had to be extended to other file formats in the future. In a code review I'd reject this line. The rest of the implementation in `parse_file` looks good. A dictionary dispatch approach would be equally valid

I'm not great at XML parsing. But when you extract the street components I think you're missing some logic to correctly handle street names. Here's your approach to concatenating these:

Python code:

            street_1 = ent.find("STREET").text.strip() if ent.find("STREET") is not None else None
            street_2 = ent.find("STREET_2").text.strip() if ent.find("STREET_2") is not None else None
            street_3 = ent.find("STREET_3").text.strip() if ent.find("STREET_3") is not None else None
            street = street_1 + street_2 + street_3

First, it doesn't make sense that you're allowing assignment to None. If any of these are None then an exception is going to be raised when you try to add them together, so a lot of this code either doesn't do anything or will fail. I think that you could just use findtext() to provide empty strings, then you could concatenate those together safely even when one of the tags isn't present:

Python code:

            street_1 = ent.findtext("STREET", "").strip()
            street_2 = ent.findtext("STREET_2", "").strip()
            street_3 = ent.findtext("STREET_3", "").strip()
            street = street_1 + street_2 + street_3

But the concatenation I think is also wrong; for instance here's some of the example data:

XML code:

      <STREET>135 West 20th Street</STREET>
      <STREET_2>Suite 201</STREET_2>
      <STREET_3> </STREET_3>

Since you just add these together, street would be "135 West 20th StreetSuite 201" right? That's a problem. Check out this approach:

Python code:

    street_components = (s for t in ("STREET", "STREET_2", "STREET_3") if (s := ent.findtext(t, "").strip()))
    street = " ".join(street_components)

- The list of valid tag names is a concise tuple of elements
- We use a walrus operator to only keep elements that have length > 0
- We join together the remaining elements with spaces

Moving beyond the issues with start parsing in xml, I'm not enamored with the "x = y if y is not None else None" approach in general. 90% of this code is duplicate effort. I think this could be a lot cleaner with a looped approach. I'm imagining a dict comprehension but even a for loop could be cleaner, I think

The postal code parser is a little too complicated and I think it doesn't work right with XML values. The Zip+4 values in the XML are formatted as "NNNNN - MMMM" but your regex is `r"^\d{5}-\d{4}$`. That lack of a space means that your parser assumes it's not a Zip+4 value, so it always returns the first 5 digits. This would work better:

Python code:

def parse_zip(zip):
    """Given a zip code, remove spaces and strip off trailing dashes."""
    return zip.replace(' ','').strip('-')

I think there's a lot of unnecessary None-checking in this code. For instance parse_zip shouldn't be receiving None objects, you should be eliminating that possibility sooner. This is kind of a nitpicky suggestion, it's fine to use None but I personally don't seeing it used this often

In the tcv parser I think you were meant to infer that sometimes an Organization label is actually in the "last" column; it looks like you're parsing "Central Trading Company Ltd." as a last name for a person with no first name.

If a middle name is "N/M/N" you should probably just not include that value. E.g. "no middle name" == "N/A" == don't bother keeping this vlaue

Probably don't want to leave a "Hello, world!" docstring in your code

# ¿ Apr 29, 2024 08:08

QuarkJets: Sep 8, 2008

Fender posted:

There is some good feedback in here. Some of it I feel would be appropriate for a conversation. Like, why I check for None a lot. It's just a habit I learned in my last role where we parsed a lot of unreliable data sourced from scrapers. I would happily explain my position there.

One problem I see is that the None-checking is everywhere. Since you need to filter out empty strings anyway, it'd be better if you translated "None" to an empty string - then all of the code that's downstream from the file parsers doesn't have to do any None checking. Basically you have these 3 interface functions that are letting None values filter into the rest of your functions, so you need to handle the possibility of None everywhere, but if you handled the None values in the interface functions then the rest of your code would be a bit nicer. Another problem is that the None checking has become boilerplate; it'd be better to use some small utility functions to simplify the code. This also gets back to the issue I talked about where the street values have the possibility of raising an exception if any of them happen to be None; you did the none-checking for each of the street components in the XML function, but then you forgot that any of them could be None when you concatenated them together. If you had an interface function that returned only strings instead, the code looks nicer and this bug goes away.

For example:

Python code:

            name = ent.find("NAME").text.strip() if ent.find("NAME") is not None else None
            company = ent.find("COMPANY").text.strip() if ent.find("COMPANY") is not None else None
            street_1 = ent.find("STREET").text.strip() if ent.find("STREET") is not None else None
            street_2 = ent.find("STREET_2").text.strip() if ent.find("STREET_2") is not None else None
            street_3 = ent.find("STREET_3").text.strip() if ent.find("STREET_3") is not None else None
            street = street_1 + street_2 + street_3
            city = ent.find("CITY").text.strip() if ent.find("CITY") is not None else None
            county = ent.fine("COUNTY").text.strip() if ent.find("COUNTY") is not None else None
            state = ent.find("STATE").text.strip() if ent.find("STATE") is not None else None
            country = ent.find("COUNTRY").text.strip() if ent.find("COUNTRY") is not None else None
            postal_code = ent.find("POSTAL_CODE").text.strip() if ent.find("POSTAL_CODE") is not None else None0

            data = {
                "name": name,
                "organization" : company,
                "street": street,
                "city": city,
                "county" : county,
                "state": state,
                "zip": self._parse_zip(postal_code),
                "country": country,
            }

Wouldn't it be great to have a function that does most of this stuff?

Python code:

def parse_ent(ent, name):
    return x.text.strip() if (x := ent.find("NAME")) is not None else ""

def xml_ent_to_address(ent):
    data = {'name': parse_ent(ent, 'NAME'),
            'company': parse_ent(ent, 'COMPANY'),
            # ...
            'zip': parse_zip(parse_ent(ent, "POSTAL_CODE")),
            'country': parse_ent(ent, 'COUNTRY')
    }
    # street can have several components
    street_parts = (parse_ent(ent, x) for x in ('STREET', 'STREET_2', 'STREET_3'))
    data['street'] = ' '.join(x for x in street_parts if x)
    # And check this out: splat operators let you avoid having a funny "if data is not None" in this class
    return Address(**data)

quote:

And stuff like the Address class having that weird view dict method, would love to chat about that. That is like that bc while they never said it, the examples both showed the printed json ignoring any keys without values, so I went out of my way to provide that since all the example data had different fields present/missing depending on the format. I was really getting in my own head by hour number 4.

I got that it was necessary to filter keys without values; I was saying that the view_dict method could have been a lot more concise. If you'd used a dataclass:

Python code:

    def view_dict(self):
        # If we put the .strip() then we don't need to have it on every line of the __init__
        # This also means we don't need None-checking or len-checking in the __init__
        # This one-liner does it all: 
        return {name: value.strip() for name, value in asdict(self).items() if value}

As it stands now, the Address class is kind of pointless; only its view_dict method is useful, the rest of it provides no additional value. Its __init__ strips input strings, but the file parsers are already doing that. A class like this could be beneficial for verifying that you're using valid names in your dictionaries, but the class allows everything to be None, so the class doesn't actually provide that benefit. Any dictionaries that were implemented with a mistyped key would see those values get silently dropped on the floor as each Address instance auto-assigns those fields to None instead, quietly permitting buggy outcomes. Permitting a dictionary as an optional input argument that you then parse with .get() is to blame there; if you make the keyword arguments explicit and use splat operators while creating the instance instead of a loose dictionary interface then you'll be alerted right away if a bad key gets provided.

Python code:

# Too permissive:
class Address():
    def __init__(self, data=None, **kwargs) -> None:
        if data is None:
            data = kwargs
        self.street = data.get("street").strip() if data.get("street") else None
        self.city = data.get("city").strip() if data.get("city") else None
...


# Good:
class Address:
    def __init__(self, street='', city='', etc you get the idea):
        self.street = street
        self.city = city
...


# Better:
@dataclass
class Address:
    street: str = ""
    city: str = ""
...

# ¿ May 1, 2024 03:37

QuarkJets: Sep 8, 2008

Thank you! Yeah that data sucks for sure, what a pain. And irl someone is going to come to you with some stupid bullshit like "Oops between dates X and Y the last and first names are swapped and in the wrong columns, can you fix that" and depending on the request you have to know how to ride the line between "no that's an unreasonable request" and "this is important, I will help you with that".

# ¿ May 1, 2024 22:27

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

«‹›22 »