BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

QuarkJets posted:

There may be some cutesy itertools solution; I always like something that avoids regex even if it winds up being a little slower. Might think about it tonight.

That's what I did. Well, cutesy index stuff rather than itertools. Regex is usually my go-to, but I haven't messed with string slicing at work in a while. It's probably less efficient, but since we're bailing out after the first number, it's not a big deal.


Python code:

digits = {
    'one': '1',
    'two': '2',
    'three': '3',
    'four': '4',
    'five': '5',
    'six': '6',
    'seven': '7',
    'eight': '8',
    'nine': '9',
}

# Reverse the spelled-out keys so they can be matched against a reversed line.
digits_reversed = {key[::-1]: value for key, value in digits.items()}

def find_first_number(line: str, digits: dict[str, str]) -> str:
    """Return the first digit in the line, whether numeric or spelled out."""
    for i, char in enumerate(line):
        if char.isdigit():
            return char
        for num, value in digits.items():
            if line[i:].startswith(num):
                return value
    raise RuntimeError('No digits found in string!')

def parse_calibration_value_v2(line: str) -> int:
    # Scan forwards for the first number and backwards (reversed line,
    # reversed keys) for the last, then combine the two digits.
    return int(find_first_number(line, digits) + find_first_number(line[::-1], digits_reversed))
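
If anyone wants a quick sanity check, it handles the overlapping spelled-out cases from the example input:

Python code:

print(parse_calibration_value_v2('two1nine'))     # 29
print(parse_calibration_value_v2('xtwone3four'))  # 24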

BAD AT STUFF fucked around with this message at 14:52 on Dec 14, 2023


BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
Does your organization already have Python build pipelines that you'd be using, or are you creating those as well? You mentioned GitLab CI, but I wasn't sure if that was something your team owns.

If you're creating your own build pipelines, I'd suggest using pyproject.toml files for your packages rather than setup.cfg or setup.py. I have a couple of libraries I'd like to update, but our build pipelines are shared across the org, and getting everyone else to ditch their old setup.py files is going to be an uphill battle. If you're starting fresh, use the newest standard.
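
For reference, a minimal pyproject.toml with the setuptools backend looks roughly like this (the project name and version pins are placeholders):

code:

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = [
    "requests",
]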

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
I've spent a ton of time over the past year dealing with Python cert issues and this was a new one. That, plus the fact that apparently things work (albeit with an error message) without setting an option like "accept insecure certs" immediately made me suspicious. Finally, I noted that the error message is about *parsing* the cert rather than an SSL error.

So I went to look at the source file that's throwing this error and found an interesting comment:

quote:

// TODO(mattm): this creates misleading log spam if one of the other Parse*
// methods is actually able to parse the data.

https://source.chromium.org/chromium/chromium/src/+/main:net/cert/internal/cert_issuer_source_aia.cc;l=32

When you said that you're not getting errors using the form in Chrome, were you checking the logs? I'm guessing that this is happening in the browser too, but the error isn't being surfaced to the user since it's not causing any problems.

I don't know if the StackOverflow fix will suppress any other error messages that you do care about, but it looks like they're right that this particular one can safely be ignored.
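
For what it's worth, if the log spam itself is the annoyance and you're driving Chrome through Selenium (an assumption on my part, and I don't know whether this is what the StackOverflow answer does), something like this quiets it down:

Python code:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Only surface fatal messages from Chrome itself.
options.add_argument('--log-level=3')
# Stop ChromeDriver from echoing Chrome's logging to the console.
options.add_experimental_option('excludeSwitches', ['enable-logging'])

driver = webdriver.Chrome(options=options)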

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Oysters Autobio posted:

Any good resources on managing ETL with Python? I'm still fairly new to Python, but I'm struggling to find good resources on building ETL tools and design patterns for automating different data cleaning and modeling requirements.

Our team is often responsible for taking ad hoc spreadsheets or datasets and transforming them to a common data model/schema, along with some processing like removing PII for GDPR compliance. I'm struggling to conceptualize what this would look like.

We have a mixed team of data analysts, some who are comfortable with Python and others only within the context of a Jupyter notebook.

I've seen other similar in-house projects which created custom wrappers (i.e., a "dataset" object that can then be passed through various functions), but I'd rather use tried-and-true patterns, or even better a framework/library made by smarter people.

Really what I'm looking for is inspiration on how others have done ETL in Python.

I'd agree with the advice that other folks gave already. One thing that makes it hard to give a single answer here is that the various frameworks operate at different scales. If you need distributed computing and parallel task execution, you'll want something like Airflow and PySpark. If you're working at the scale of spreadsheets or individual files, something like Pandas would likely be a better option. But there are certain data formats or types that probably don't lend themselves to Pandas dataframes.

Falcon's approach of working backwards is what I'd go with. Where will this data live once your process is done with it? Database, mart, fileserver, something else? I'd start with what kind of libraries you need to interact with that system and work backwards to what can read from your data sources. If you really like Pandas dataframes for ETL, but late in the process you realize you need some database connector library with its own ORM classes, then you'll have to figure out how to reconcile those.

In terms of general library advice:

  • PySpark (distributed dataframes)
  • Pandas (non-distributed dataframes)
  • Pydantic (for data separated into different entities/models, can be made to work with different db libraries)

Then in terms of how to structure your code, I would also advocate for a functional style. I'd focus on having the individual steps as standalone functions that are well tested, rather than putting too much effort into custom wrappers or pipeline automation. If you break your steps into discrete functions, you can start out with basic (but hopefully well structured) Jupyter notebooks to chain everything together. Once you have a few different pipelines, if you start seeing commonalities, then maybe you can abstract your code into wrappers or libraries.
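
As a rough sketch of the shape I mean (the column names and cleaning rules here are made up, not a recommendation of specifics):

Python code:

import pandas as pd

def load_spreadsheet(path: str) -> pd.DataFrame:
    # One step per function makes each piece easy to unit test with a fixture file.
    return pd.read_excel(path)

def drop_pii(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical PII columns; substitute whatever GDPR requires for your data.
    return df.drop(columns=['email', 'phone'], errors='ignore')

def conform_schema(df: pd.DataFrame) -> pd.DataFrame:
    # Rename to the common data model and enforce types.
    return df.rename(columns={'CustID': 'customer_id'}).astype({'customer_id': 'int64'})

def run_pipeline(path: str) -> pd.DataFrame:
    # The notebook (or later, a proper DAG task) just chains the steps together.
    return conform_schema(drop_pii(load_spreadsheet(path)))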

iceaim posted:

Your post has been extremely helpful.

Great, I'm glad to hear that!

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
That's one nice thing about moving to Databricks. It's much easier to keep current with Pyspark versions compared to the old days. Waiting for Cloudera to release an update, getting all the users of the cluster to get on board with the new version, needing downtime to update all of the VMs... all of that was a nightmare. And yeah, our packages would forever raise Xray alerts because of some old rear end CVE that Apache wasn't going to fix.

I still prefer the Pyspark dataframe API over Pandas (and I guess Dask?) though :shrug:

Are you spinning up your own compute for Dask, or is there a good managed service out there? We needed a pretty well fleshed out platform for the data scientists. It was hard enough to get them to stop using R for everything.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Oysters Autobio posted:

Thanks a bunch for you and everyone's advice.

Also, thanks especially for flagging pydantic. I think what it calls type coercion is very much what I need, but I'll need to see an actual example ETL project using it to make full sense of it.

I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure.

Nice. I'd also look at validators. They can be super helpful. https://docs.pydantic.dev/latest/concepts/validators/
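
Something along these lines, assuming pydantic v2 (the field names are made up):

Python code:

from pydantic import BaseModel, field_validator

class CustomerRecord(BaseModel):
    customer_id: int   # '42' gets coerced to 42 automatically
    email: str

    @field_validator('email')
    @classmethod
    def normalize_email(cls, value: str) -> str:
        # Reject obviously bad rows early, normalize the rest.
        if '@' not in value:
            raise ValueError('not an email address')
        return value.strip().lower()

record = CustomerRecord(customer_id='42', email=' Someone@Example.com ')
print(record.customer_id, record.email)  # 42 someone@example.com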

In terms of project design, again I absolutely agree with Falcon. Before getting too invested in how to lay out the code, I'd create a high-level flow chart of the required steps, then have a whiteboarding session with other engineers, get feedback, and revise as needed.

Once you get to the implementation stage, we generally have our projects separated into an automation directory and a Python package with the actual logic. That lets you modify or swap out the automation as needed. If you start with a basic driver script or notebook, you can move to something more complex that supports DAGs later if needed.
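
Roughly this kind of layout, to give you a picture (the directory and file names are just illustrative):

code:

my_etl_project/
├── automation/            # driver scripts, notebooks, or DAG definitions
│   └── run_daily.py
├── src/
│   └── my_etl/            # the actual logic, installable as a package
│       ├── extract.py
│       ├── transform.py
│       └── load.py
└── tests/
    └── test_transform.py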

I'd focus more on making sure that the individual components are independent and well tested. That way, when you need to modify the design later you don't have to rework things at a low level. When developing patterns that are new for your team, there will be things you don't get right the first time. I'm a big advocate for getting out something as an MVP, learning from it, and iterating.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Oysters Autobio posted:

Yeah, and to be honest, I can't tell you why, but I'm not a big fan of pandas in general and actually prefer pyspark / sparksql. Very much a subjective thing, but the whole "index" and "axes" abstractions are a bit annoying to my brain since I'm coming from SQL-world. I only really need the distributed processing like 50% of the time, but it's enough that I'd rather just spend my time in one library/framework rather than splitting between both.

Are you me? Because I agree with all of that. Beyond my preference for SQL idioms, it's rare that I need to reference anything by index. Columns are by name, rows are by row number (but only if I need a Window for some reason). If my data were all numbers in a multidimensional ndarray, then sure, give me indexes. But I find it weird and unhelpful when my data is a mix of different types and effectively always tabular. It's just another thing you can screw up if you only use Pandas infrequently.

I'm also guilty of running smaller workflows through Pyspark, but it's nice to work within one stack. Plus, I don't think anyone ever got in trouble for making their jobs too small :v:

I have been beating the drum of "we shouldn't use Pyspark automatically to parallelize everything just because it's there" for years, though. I think it's finally sinking in now that we can track costs of specific jobs through Databricks. If we're doing some hacky bullshit to call native Python libraries from Pandas UDFs, that's probably a sign that we could be doing this in Kubernetes. Now, we can finally attribute a dollar amount to that from Databricks license costs.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
I would love to have a place to bitch about Azure Data Factory, so I'm all for a data engineering thread.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Hughmoris posted:

I've only used ADF for learning and personal projects, and it seems like it could be a powerful tool. What don't you all like about it?

A million little things. One of the more recent: you can't resume a failed pipeline run. Instead, you can rerun the pipeline from the failed task. However, this does not preserve the variable state of the previous run. So if you've been writing out files to a working directory, say, and you have some variable that is referenced throughout the pipeline for those file names, then you lose that state for the new run. We ended up creating a sub-pipeline that could have all variables passed into it, and using another pipeline to calculate the variables and pass them in. Then if you need to restart the sub-pipeline, it will actually use the same params as the original run.

I thought leaving Airflow behind would make our lives easier. Little did I know...

CompeAnansi posted:

Disagree. It is its own thing that touches on both.

Yeah. It likely depends on the workplace, but where I'm at there's a pretty big distinction between data science and data engineering. I've picked up some science knowledge by osmosis, but I'm not going to be the one deciding whether we should be using MAPE or RMSE. I'm going to be making sure that the science team's code scales, is automated, and has tests.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

QuarkJets posted:

I use miniconda installed directly on the computer; containers are too fussy and hold no advantage for anything that I'm doing at home.

I've gone to miniconda over virtual environments for the ease of switching to new Python versions. I can leave the system Python alone but still get new features regularly now that releases are more frequent.
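
The workflow for that is basically just (environment name is arbitrary):

code:

# create an environment with its own Python, leaving the system one alone
conda create -n py312 python=3.12

# switch into it
conda activate py312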

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Seventh Arrow posted:

Actually, I think I can probably even do everything in a single line:

code:
def front_times(str, n):
    return str[:3] * n if n > 0 else "Please enter a non-negative number"


I'd prefer your first conditional approach, both for clarity and because it would be more Pythonic to raise a ValueError rather than saying what's wrong in the return value.
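
Something like this, roughly (renaming the str parameter is just to avoid shadowing the builtin, and I'm treating n == 0 as valid since the message says non-negative):

Python code:

def front_times(s: str, n: int) -> str:
    if n < 0:
        raise ValueError('n must be non-negative')
    return s[:3] * n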

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Falcon2001 posted:

Personally, the best explanation for comments I've found is that comments tell you why the code is written this way, not what the code is doing. Unless your code has some reason to be unreadable, in which case, sure... but then you should explain why you have to do that if I'm reviewing your code. The why should be more like business logic, or the reasons you chose a particular approach over another, not simply an explanation of basic algorithms.

One thing I used to do was write a comment describing what the next few lines were doing at a high level. Something like "# retrieve batch metadata from API" followed by a few lines to create a requests session, query the API, and pull data out of the response object.

I thought it made the code clearer by separating things into sections. Now, I use that urge as an indication that those lines should be in their own function. I think that's something often not explicitly stated in discussions of self-documenting code. Individual lines can be clear enough to not need comments, but if the whole function needs comments to keep track of what you're doing, it should be split up.
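
To make that concrete, the pattern is basically turning the comment into a function name (the URL shape and the 'metadata' field are made up):

Python code:

import requests

def retrieve_batch_metadata(api_url: str, batch_id: str) -> dict:
    # What used to be a commented-off section in the middle of a bigger function.
    with requests.Session() as session:
        response = session.get(f'{api_url}/batches/{batch_id}')
        response.raise_for_status()
        return response.json()['metadata']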

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

FISHMANPET posted:

This is probably a very luddite opinion of mine, and I'm sure it's objectively wrong, but I really hate when there are too many functions being called (especially within functions within functions etc). I somewhat frequently find myself tracing very closely through some random code base trying to figure out exactly what's going on, and going deeper and deeper with functions makes that more difficult. You'd think in theory that I should be able to look at a function and say "It's going to do X" the same way I can look at a line like "a = b + c" and know what's going to happen. But in practice, it doesn't work out that way, and I end up having to read through those functions to figure out exactly what's happening.

Nesting functions within functions gets confusing fast. There was one project I inherited where you had to go three levels deep from the entry point of the Pyspark job before you started to see actual logic, and often you'd end up in a different module than where you started. That was a nightmare.

Using comments to explain different sections of a larger code block is definitely a code smell rather than something that should never be done. I'd argue that having to trace a long call stack to figure out where stuff happens is also a code smell, although I can't come up with a specific software design principle to justify that. Dependency inversion, maybe?

...now notebooks, I'm throwing everything at the wall and seeing what sticks.

BAD AT STUFF fucked around with this message at 02:22 on Feb 28, 2024


BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

96 spacejam posted:

Currently working my way into intermediate courses using Think Python, Automate the Boring Stuff, and Intro to Python (in OP). As this is a career change for me and I did well enough in my previous life that I can take 2-3 years off to acquire a new skill, and while I'm sure I could go further on my own no problem, I'd like to start socializing and being able to discuss more complex concepts ... you know, like how we did in the Before Time? I'm in San Diego in case a goon has insight specific to my locale, but I'm not getting another degree; I already have my B.S.

Community Colleges are going to be my best bet, right?

You can also take classes at a traditional undergraduate college as a non-degree seeking student, if there are better options for that close to you.
