Armitag3
Mar 15, 2020

Forget it Jake, it's cybertown.


Zoracle Zed posted:

Maybe check out the 'sliding_window' iterator from the itertools recipes, which continues to drive me insane for not being importable but instead requires copy-pasting.

Hey now thanks for this

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Anyone familiar with Click know if there's a way to implement argument parsing into a shared dataclass or something of that nature?

Basically I have code that looks like this - it's a CLI that's mostly invoked via a runbook / change management process and has a lot of required arguments, and the signatures are starting to get ridiculous; I think we're up to around 10 common options, but individual commands need unique ones as well for the given workflow.

Python code:
@cli.command(name='example_cli')
@click.option('--common-one')
@click.option('--common-two')
@click.option('--common-three')
@click.option('--unique-one')
def launch_command(
    common_one,
    common_two,
    common_three,
    unique_one
):
    ...
You can group several options into a single decorator, so I was already working on doing the below:
Python code:
@cli.command(name='example_cli')
@common_options
@click.option('--unique-one')
def launch_command(
    common_one,
    common_two,
    common_three,
    unique_one
):
    ...
However, what I'd really like to do would be to build the common options into a shared dataclass or something, so that the individual functions are all a lot cleaner looking and don't have a ton of duplicated code in the signatures/decorators.

Python code:
@cli.command(name='example_cli')
@common_options
@click.option('--unique-one')
def launch_command(
    common_options: CommonOptions,
    unique_one
):
    ...
I can't seem to figure out a way to handle this without overriding a ton of the core Click logic. Anyone familiar with this have any suggestions?

Zoracle Zed
Jul 10, 2001
have you read the building a git clone documentation? I've used that approach a lot, where the base command has the @pass_context decorator and stashes a custom dataclass in ctx.obj, and then the subcommands have the @pass_obj decorator to receive the dataclass object directly.
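For reference, a minimal sketch of that pattern (option and dataclass names here are placeholders, not the real ones): the group collects the shared options into ctx.obj, and the subcommand receives it via @click.pass_obj.

Python code:
import dataclasses

import click


@dataclasses.dataclass
class CommonOptions:
    common_one: str
    common_two: str
    common_three: str


@click.group()
@click.option('--common-one')
@click.option('--common-two')
@click.option('--common-three')
@click.pass_context
def cli(ctx, common_one, common_two, common_three):
    # The group does no real work; it just stashes the shared options on the context.
    ctx.obj = CommonOptions(common_one, common_two, common_three)


@cli.command(name='example_cli')
@click.option('--unique-one')
@click.pass_obj
def launch_command(common: CommonOptions, unique_one):
    # pass_obj hands ctx.obj to the subcommand directly, keeping the signature small.
    click.echo(f'{common.common_one=} {unique_one=}')


if __name__ == '__main__':
    cli()

Invocation then looks like script.py --common-one a --common-two b --common-three c example_cli --unique-one d, which is the group-then-subcommand shape discussed below.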

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Zoracle Zed posted:

have you read the building a git clone documentation? I've used that approach a lot, where the base command has the @pass_context decorator and stashes a custom dataclass in ctx.obj, and then the subcommands have the @pass_obj decorator to receive the dataclass object directly.

I'll take another look. I thought this method involved having group options, which necessitated breaking your CLI args up in a weird way, e.g. (command group --opt-one subcommand --opt-two), but I might have misread it.

I'd like to keep all the actual command parsing and opts at the lowest level so the help etc. doesn't get janky, but I'll reread that in depth.

Zoracle Zed
Jul 10, 2001

Falcon2001 posted:

I'll take another look. I thought this method involved having group options, which necessitated breaking your CLI args up in a weird way, e.g. (command group --opt-one subcommand --opt-two), but I might have misread it.

yeah you end up with an api like

code:
butt --fart loud --n-turds 3 poo poo --no-flush
where the base 'butt' command doesn't really do anything except collect the common arguments and the subcommand 'poo poo' does the actual work and accepts the specialized args. I think git is the only real-world cli I use that works like that, but I find it pretty natural, ymmv

Zoracle Zed fucked around with this message at 03:47 on Dec 13, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Zoracle Zed posted:

yeah you end up with an api like

code:
butt --fart loud --n-turds 3 poo poo --no-flush
where the base 'butt' command doesn't really do anything except collect the common arguments and the subcommand 'poo poo' does the actual work and accepts the specialized args. I think git is the only real-world cli I use that works like that, but I find it pretty natural, ymmv

Yeah, I mean, maybe that is fine. I might just avoid this whole thing for now; the feature I'm working on has enough other stuff going on that I shouldn't try to fix this as well.

monochromagic
Jun 17, 2023

Zoracle Zed posted:

Maybe check out the 'sliding_window' iterator from the itertools recipes, which continues to drive me insane for not being importable but instead requires copy-pasting.

more-itertools has you covered: https://more-itertools.readthedocs.io/en/stable/
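For anyone following along, a quick usage sketch, assuming more-itertools is installed (pip install more-itertools); sliding_window yields tuples just like the itertools recipe does.

Python code:
from more_itertools import sliding_window

# Each window is a tuple of n consecutive elements.
for window in sliding_window([1, 2, 3, 4, 5], 3):
    print(window)
# (1, 2, 3)
# (2, 3, 4)
# (3, 4, 5)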

Jose Cuervo
Aug 25, 2004
Is there a way to have a Python script run automatically when a file (or really a set of files) is placed in a folder? I think I have to set up some sort of process which listens(?) for files being placed in the folder, and then once that is detected I can trigger the script, but I don't know what keywords to look for to start researching the solution.

Fuller story:
I have a script that cleans / formats 4 or sometimes 5 different CSV files. This process occurs 4 times during the first four days of the month. Right now I am the only one who runs the script, and the two people who rely on the output are not able to run the script. I want to avoid running into an issue where I am sick (or hit by a bus) and they are unable to access the data they need. I want to be able to designate a folder to them, and have them copy-paste the raw CSV files into the folder. Once they have done that, ideally something(?) would start the script running to clean / format the CSV files, and output the cleaned / formatted files into a different folder.

12 rats tied together
Sep 7, 2006

Windows has Scheduled Tasks, and Linux distros will have cron. You would need a way to tell your scheduled script that a file has already been processed and should not be processed again (maybe move it out of your to-do folder if it successfully produces output).

There are thousands of ways to try and solve this problem but given that you're asking here, and considering the language you used to ask about it, I would probably start with scheduled tasks or cron.

A piece of advice I would offer is to run your automation against a copy of the data at first, even if you have to manually copy it. It's easy to accidentally break files.
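A minimal sketch of that cron / Task Scheduler approach (folder names and the clean_csv stub are placeholders for the real cleaning script):

Python code:
from pathlib import Path
import shutil

INCOMING = Path('incoming')    # where the raw CSVs get dropped
PROCESSED = Path('processed')  # raw files get moved here after a successful run
OUTPUT = Path('output')        # cleaned/formatted files land here


def clean_csv(src: Path, dest: Path) -> None:
    # Placeholder: call your existing cleaning/formatting code here.
    dest.write_text(src.read_text())


def main() -> None:
    OUTPUT.mkdir(exist_ok=True)
    PROCESSED.mkdir(exist_ok=True)
    for csv_file in sorted(INCOMING.glob('*.csv')):
        clean_csv(csv_file, OUTPUT / csv_file.name)
        # Move the input out of the to-do folder so the next scheduled run skips it.
        shutil.move(str(csv_file), PROCESSED / csv_file.name)


if __name__ == '__main__':
    main()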

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Jose Cuervo posted:

Is there a way to have a Python script run automatically when a file (or really a set of files) is placed in a folder? I think I have to set up some sort of process which listens(?) for files being placed in the folder, and then once that is detected I can trigger the script, but I don't know what keywords to look for to start researching the solution.

Fuller story:
I have a script that cleans / formats 4 or sometimes 5 different CSV files. This process occurs 4 times during the first four days of the month. Right now I am the only one who runs the script, and the two people who rely on the output are not able to run the script. I want to avoid running into an issue where I am sick (or hit by a bus) and they are unable to access the data they need. I want to be able to designate a folder to them, and have them copy-paste the raw CSV files into the folder. Once they have done that, ideally something(?) would start the script running to clean / format the CSV files, and output the cleaned / formatted files into a different folder.

The easiest option is to have something running constantly in the background with a sleep timer of some duration; it wakes up, checks for files in the folder/etc., and takes action if appropriate. There is file-watching stuff, but it's finicky from what I've read previously.
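The "background loop with a sleep timer" version is just a thin wrapper around that same check (the folder and interval are placeholders):

Python code:
import time
from pathlib import Path

WATCH_DIR = Path('incoming')  # placeholder folder


def process_new_files() -> None:
    for path in WATCH_DIR.glob('*.csv'):
        ...  # hand off to the cleaning script, then move/rename the file


while True:
    process_new_files()
    time.sleep(60)  # wake up once a minute and look again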

QuarkJets
Sep 8, 2008

OP, what you're looking for is the concept of a "daemon": basically a process that lives forever in the background of a system doing some specific thing. There are lots of good options for this in Linux, and some okay options in Windows. Any process can be turned into a daemon, including Python scripts.

At a very basic level you could try using the built-in "sched" module. It's easy to use: basically you ask sched to execute a function at some future point. At the end of that function you can have it ask sched again, creating an infinite loop with whatever interval you want (for instance, 10 seconds). This is cute because only one instance of the function is ever running, and the next execution is only scheduled when the current one is ending, so it's very easy to manage. This should handle your use case easily; just make sure you have robust error handling and that you aren't trying to modify files while they're still being written.
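A bare-bones sketch of that sched pattern, with the folder-checking logic left as a stub:

Python code:
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
INTERVAL = 10  # seconds between checks


def check_folder() -> None:
    # Placeholder: look for new files and process them, with robust error handling.
    print('checking for new files...')
    # Re-schedule at the end, so only one run is ever queued at a time.
    scheduler.enter(INTERVAL, 1, check_folder)


scheduler.enter(INTERVAL, 1, check_folder)
scheduler.run()  # blocks forever, running check_folder every INTERVAL seconds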

Foxfire_
Nov 8, 2010

Any decent OS will have APIs that will react to directory changes instead of making you poll. They will generally be harder to use than just polling, so don't bother unless polling is actually a problem for your use case.

ReadDirectoryChangesW is the Win32 one; inotify and fanotify are the Linux ones.

You would need something to wrap those if you wanted to use them from python. This almost certainly already exists somewhere on pypi, but I don't know of it.

For Linux, there is also a program built on top of inotify that will block until a change event happens, then exit. If you don't care that much about it being self-contained, you could have a loop spawn that as a subprocess from your Python program, wait for it to terminate, handle whatever changed, then loop again.

I don't know of an equivalently packaged-up Windows utility. There's a .NET wrapper around ReadDirectoryChangesW (FileSystemWatcher) with a prettier interface that you could call from PowerShell.
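One PyPI package along those lines is watchdog, which wraps the per-OS change APIs behind a common observer interface. A minimal sketch, assuming pip install watchdog and a placeholder 'incoming' folder:

Python code:
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class CsvHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Called from the observer thread whenever something new appears in the folder.
        if not event.is_directory and event.src_path.endswith('.csv'):
            print(f'new file: {event.src_path}')  # hand off to the cleaning code here


observer = Observer()
observer.schedule(CsvHandler(), path='incoming', recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()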

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

QuarkJets posted:

There may be some cutesy itertools solution, I always like something that avoids regex even if it winds up being a little slower. Might think about it tonight

That's what I did. Well, cutesy index stuff rather than itertools. Regex is usually my thing, but I haven't messed with string slicing at work for a while. It's probably less efficient, but since we're bailing out after the first number it's not a big deal.


Python code:

digits = {
    'one': '1',
    'two': '2',
    'three': '3',
    'four': '4',
    'five': '5',
    'six': '6',
    'seven': '7',
    'eight': '8',
    'nine': '9',
}

digits_reversed = {key[::-1]: value for key, value in digits.items()}

def find_first_number(line: str, digits: dict[str, str]) -> str:
    for i, char in enumerate(line):
        if char.isdigit():
            return char
        else:
            for num, value in digits.items():
                if line[i:].startswith(num):
                    return value
    raise RuntimeError('No digits found in string!')

def parse_calibration_value_v2(line: str) -> int:
    return int(find_first_number(line, digits) + find_first_number(line[::-1], digits_reversed))

BAD AT STUFF fucked around with this message at 14:52 on Dec 14, 2023

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
My team of devops/platform/infra engineers are building and distributing our first internal Python library, which is going to provide helper functions and custom span processors for opentelemetry. We’re using semantic versioning and are publishing to an internal shared pypi repo.

Unfortunately, we don’t have a lot of SWEs on the team! We’ve got one SWE who comes from a Java background, and I’m a comfortable programmer but come from an ops background. The rest of my team are largely ops people who write scripts here and there.

What are some best practices we should consider as we transition from a team that maintains lovely little lambda functions to a team providing a library our production services will depend on?

We’ve got semantic versioning, conventional commits, and some basic unit tests so far. I’m pushing for choosing a shared linter and enforcing that in our gitlab CI pipelines and precommit hooks (which incidentally I’ll happily take recommendations on). We’ve got a mix of pycharm and vs code users.

What are some other best practices we should consider? Both for library distribution to the organization, but also policies you feel are helpful for improving code quality.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

The Iron Rose posted:

My team of devops/platform/infra engineers are building and distributing our first internal Python library, which is going to provide helper functions and custom span processors for opentelemetry. We’re using semantic versioning and are publishing to an internal shared pypi repo.

Unfortunately, we don’t have a lot of SWEs on the team! We’ve got one SWE who comes from a Java background, and I’m a comfortable programmer but come from an ops background. The rest of my team are largely ops people who write scripts here and there.

What are some best practices we should consider as we transition from a team that maintains lovely little lambda functions to a team providing a library our production services will depend on?

We’ve got semantic versioning, conventional commits, and some basic unit tests so far. I’m pushing for choosing a shared linter and enforcing that in our gitlab CI pipelines and precommit hooks (which incidentally I’ll happily take recommendations on). We’ve got a mix of pycharm and vs code users.

What are some other best practices we should consider? Both for library distribution to the organization, but also policies you feel are helpful for improving code quality.

A shared linter and enforcement is good; I highly recommend it. Our system works by using Black (run as a command in the environment, not in your IDE), and the enforcement is 'did you run Black and get no changes? Then you can proceed.'

My best practices recommendations would be to think about it from your customer's point of view. Some things I'd consider:
  • Semantic versioning is good, but you should specifically consider your API as needing to be backwards compatible outside of certain changes; most software uses major versions to denote when breaking changes are allowed. So 2.5 -> 2.6 could add features, but you would only deprecate something going from 2.5 -> 3.0, for example.
  • This goes double if your company's build system would auto-update software/etc. You need a way to ensure that you can make breaking changes without accidentally breaking builds/production/etc.
  • On that topic, if your security team doesn't handle this you might want to consider a way to identify and notify teams if an older version has some sort of vulnerability and they're still using it.

Generally just think about 'If I was depending on a library, what would piss me off the most?' and that's a good greatest hits list of things to solve for. Otherwise the rest all sounds good and you're thinking through the major code quality stuff already.

Ninja Edit: You may want to consider including MyPy in your package and the linter/enforcement checks, or at least evaluate if it's appropriate. MyPy raises the bar significantly on your code quality by basically faking type enforcement, but because of the intricacies of Python and the libraries you use, this is sometimes a MASSIVE amount of work.

If you don't use it, my recommendation for any professional Python dev is to approach typing like you're working in Java or C# or another statically typed language and use type hints/etc. everywhere possible. I think this is the single biggest easily-avoided Python footgun: you can churn out lovely, unreadable code incredibly quickly; with type hints you can at least help discovery on otherwise lovely code. Forcing yourself to do things like type hints and docstrings/etc. is going to mean that your code is easier to read later, which is the single most important thing for high-quality code. That's not to say it's the only thing - but from an effort perspective it's such a huge return on investment.
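As a contrived example of what that buys you (the names are made up): with hints and a docstring, both the reader and the type checker know what goes in and out without running anything.

Python code:
from dataclasses import dataclass


@dataclass
class HostRecord:
    hostname: str
    region: str


def filter_by_region(hosts: list[HostRecord], region: str) -> list[HostRecord]:
    """Return only the hosts that live in the given region."""
    return [host for host in hosts if host.region == region]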

Falcon2001 fucked around with this message at 19:42 on Dec 14, 2023

QuarkJets
Sep 8, 2008

Do you have a CI pipeline? Each change to the code should require a pull request with successful unit testing at minimum.

flake8 is a great addition, you can easily make it a required pre-commit hook. The only downside is that it won't fix stuff, it will only alert you of a problem. That could be considered an upside, in a different light. Also, add it to your CI pipeline so that people who aren't using pre-commit will be caught.

I like Sonarqube, if you want more sophisticated code analysis

If you use pytest, there's an add-on (pytest-cov) that will mark the test run as failed if it did not have enough coverage (e.g. percent of lines covered by tests). Crank that up to 90% at least imo

monochromagic
Jun 17, 2023

Try ruff in place of flake8 - it's much faster and can replace black as well.

LochNessMonster
Feb 3, 2005

I need about three fitty


I like pylint for the suggestions / code rating, and Black as pre-commit formatting.

As someone else said, try to use type hinting and docstrings to make your code easier to understand.

And once you have unit testing in place at an acceptable level, start building some integration tests on the most-used features.

Be relentless in sticking to all of the above. Devs always wanna take shortcuts and lower the standards (“let’s just disable rule x, it doesn’t really apply to us”, “90% code coverage is impossible, why is 70% not good enough!?”, “do we really need docstrings? Code is so easy to read”). If you give them an inch, they’ll take a mile.

I’d also give your project setup some good thought as changing things later on might impact your users/consumers.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
Does your organization already have Python build pipelines that you'd be using, or are you creating those as well? You mentioned GitLab CI, but I wasn't sure if that was something your team owns.

If you're creating your own build pipelines, I'd suggest using pyproject.toml files for your packages rather than setup.cfg or setup.py. I have a couple libraries I'd like to update, but our build pipelines are shared across the org. Getting everyone else to ditch their old setup.py files is going to be an uphill effort. If you're starting fresh, use the newest standard.

QuarkJets
Sep 8, 2008

Falcon2001 posted:

This goes double if your company's build system would auto-update software/etc. You need a way to ensure that you can make breaking changes without accidentally breaking builds/production/etc.

This requirement also needs to be levied on whatever group controls those auto-updates. If they're automatically updating across major version rolls of your software and that winds up breaking everything then that's really on them

sugar free jazz
Mar 5, 2008


just wanted to follow up on this. got a horrible 10 minute long copy and paste and menu horror show down to a single keystroke, so success story for pyautogui

iceaim
May 20, 2001

This is the Python mega thread for Python questions right?

Well this is mainly a Selenium headache, but I'm using Python 3.11 so I think the question belongs here. It's mainly a configuration issue of Selenium and poor documentation on the Selenium official site that is giving me a massive headache regarding a certificate parsing error.

Here's the script in question:

Python code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

#driver = webdriver.Chrome()
service = Service('C:\\py311seleniumbot\\chromedriver.exe')
driver = webdriver.Chrome(service=service)

driver.get("https://www.selenium.dev/selenium/web/web-form.html")

title = driver.title

driver.implicitly_wait(0.5)

text_box = driver.find_element(by=By.NAME, value="my-text")
submit_button = driver.find_element(by=By.CSS_SELECTOR, value="button")

text_box.send_keys("Selenium")
submit_button.click()

message = driver.find_element(by=By.ID, value="message")
text = message.text

print("Text:", text)

driver.quit()

I just want a baseline script that confirms my Selenium setup works before creating something more advanced that I will use on my target website.

So my script is working fine and is based on a script in the official documentation (https://www.selenium.dev/documentation/webdriver/getting_started/first_script/); however, the certificate errors are a huge concern:

Python terminal posted:

DevTools listening on ws://127.0.0.1:15090/devtools/browser/5c404594-b8a4-4fee-8c75-a78f5f6a9a61
[2944:20700:1219/211513.232:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

I've been trying to find a solution for days and can't find anything remotely adequate. These certificate errors seem to have started in the past year with multiple posts on them on reddit, stackexchange, etc without any real solution that addresses the root cause.

Example here:

https://www.reddit.com/r/selenium/comments/xpgx4o/error_message_that_just_doesnt_have_a_solution/

Yeah "error that doesn't have a solution" has been my experience and it's really bumming me out big time as Selenium is really cool despite being a bit crusty. There's a guy in that reddit thread that suggests "Most likely an expired ca public cert in the browsers cert store. so please update the browser and webdriver you use." well my normal Chrome installation 120.0.6099.110 doesn’t throw up any certificate errors if I visit the website I am submitting a form on and scraping (https://www.selenium.dev/selenium/web/web-form.html).

I thought maybe my Chrome webdriver is outdated. Since I am using Chrome 120.0.6099.110 I went to https://sites.google.com/chromium.org/driver/downloads?authuser=0 which instructed me to go here: https://googlechromelabs.github.io/chrome-for-testing/

I grabbed the chromedriver.exe that matches my Chrome web browser (Is the Selenium-spawned Chrome browser even the same as my regular web-surfing browser? Edit: Yes, it appears to be the same version. Just checked.) and dropped it into my project directory. Then I modified first_script.py to ensure that it calls that specific webdriver, and also commented out the original code:

Python code:
#driver = webdriver.Chrome()
service = Service('C:\\py311seleniumbot\\chromedriver.exe')
driver = webdriver.Chrome(service=service)
Well it’s using the new chromedriver.exe alright but I’m still getting the same cert error.

Some people on stackoverflow (https://stackoverflow.com/questions/75695413/xpath-wrong-in-selenium/75695484#75695484) suggest adding:

Python code:
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
But my understanding is this just ignores the errors and sweeps the problem under the rug. You are still insecurely connecting via https. I need a solution that actually addresses the certificate errors and I'm having one hell of a time finding it.

Are there any goon Selenium experts who can help with this configuration issue? I just want the Chrome browser and webdriver to work without throwing up these certificate errors.

Note that I am using a Python virtual environment (venv) just for the Selenium project.

iceaim fucked around with this message at 19:12 on Dec 23, 2023

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
I've spent a ton of time over the past year dealing with Python cert issues and this was a new one. That, plus the fact that apparently things work (albeit with an error message) without setting an option like "accept insecure certs" immediately made me suspicious. Finally, I noted that the error message is about *parsing* the cert rather than an SSL error.

So I went to look at the source file that's throwing this error and found an interesting comment:

quote:

// TODO(mattm): this creates misleading log spam if one of the other Parse*
// methods is actually able to parse the data.

https://source.chromium.org/chromium/chromium/src/+/main:net/cert/internal/cert_issuer_source_aia.cc;l=32

When you said that you're not getting errors using the form in Chrome, were you checking the logs? I'm guessing that this is happening in the browser too, but the error isn't being surfaced to the user since it's not causing any problems.

I don't know if the StackOverflow fix will suppress any other error messages that you should care about, but it looks like they're correct that this is one you don't need to care about.

Oysters Autobio
Mar 13, 2017
Any good resources on managing ETL with Python? I'm still fairly new to python but I'm struggling to find good resources on building ETL tools and design patterns for helping automate different data cleaning and modeling requirements.

Our team is often responsible for taking adhoc spreadsheets or datasets and transforming them to a common data model/schema, as well as some processing like removing PII for GDPR compliance, and I'm struggling to conceptualize what this would look like.

We have a mixed team of data analysts. So some who are comfortable with Python, others only within the context of a Jupyter notebook.

I've seen other similar in-house projects which created custom wrappers (e.g. a "dataset" object that can then be passed through various functions), but I'd rather use tried-and-true patterns or, even better, a framework/library made by smarter people.

Really what I'm looking for is inspiration on how others have done ETL in python.

Generic Monk
Oct 31, 2011

Oysters Autobio posted:

Any good resources on managing ETL with Python? I'm still fairly new to python but I'm struggling to find good resources on building ETL tools and design patterns for helping automate different data cleaning and modeling requirements.

Our team is often responsible for taking adhoc spreadsheets or datasets and transforming them to a common data model/schema, as well as some processing like removing PII for GDPR compliance, and I'm struggling to conceptualize what this would look like.

We have a mixed team of data analysts. So some who are comfortable with Python, others only within the context of a Jupyter notebook.

I've seen other similar in-house projects which created custom wrappers (e.g. a "dataset" object that can then be passed through various functions), but I'd rather use tried-and-true patterns or, even better, a framework/library made by smarter people.

Really what I'm looking for is inspiration on how others have done ETL in python.

i’m by no means a python/etl expert but i think the conventional wisdom is to use something off the shelf like dbt to handle the lion’s share of the transformation. dbt is written in python but the actual transformation is done with sql, which is typically better for that purpose. you do need to hook it up to a database tho.

python scripts are pretty useful to glue it all together; i find pandas invaluable for importing data or exporting in some weird esoteric custom format

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

Any good resources on managing ETL with Python?

So, I did something vaguely related before, and we didn't use any off-the-shelf stuff. (Probably because, as mentioned, this wasn't precisely the same thing.) We were combining multiple types of data sources to form a unified 'source list' for use in an auditing application, basically - so more ET and very little L. So here are the useful things I learned from that project, and maybe they're useful to know about. It's also entirely possible I'm full of poo poo, so if others have more context and I'm wrong, speak up.

  • Most important: work backwards from the requirements to figure out how the data should look at the end. If you don't have a set of requirements, talk to whoever's going to use the data. If you don't have a clear list there, sit down yourself before you start and write up a list of requirements. This will help you avoid some of the more insane data spaghetti I've seen before.
  • Make sure to normalize your data. This is, from what I've seen, a foundational part of ETL systems in the transform step and also a general programming practice, but it's extremely important when dealing with dissimilar data sources. For example, if you're combining data from System A and System B, you should end up with an object that allows you to directly compare the two. This means that you might have to basically make a brand new schema for the objects that allows you to transform things together.
  • Another way of putting this is that I wouldn't recommend treating Source A as a subclass of TargetData and Source B as another subclass, you should have just multiple instances of TargetData, each with an attribute that identifies the sources.
  • In contrast, it may be useful to have metadata/other source data available. For example, we had a source_data attribute that contained a flexibly structured JSON blob of the raw source data. This is likely to be the sort of thing that's highly domain dependent - you can argue that ideally you shouldn't need to preserve any of this, but I suspect that's also a Perfect World argument and doesn't really pan out. I would recommend making this consistent to find, so you're always looking in a certain place, and from there you can query it when required. Ideally, you can use this to identify places you're having to do these edge cases and work them back into your basic schema instead.
  • ETL is a domain that I think highly benefits from a more functional programming approach instead of a classic OO style - build your transform pipelines and other actions from atomic functions that get chained together and try and avoid having to worry about state as much as possible. If necessary, I'd try and keep it high level like a state machine or something of that nature instead of more flexible mutable objects.
  • Coincident with the previous suggestion, for god's sake please use type hints as much as possible (see the sketch after this list). If your data requires moving complex stuff around, use objects such as dataclasses instead of complex dictionaries. Functional approaches move a LOT of data around from function to function quickly, and one of my biggest nightmares was this massive complex nested dictionary the previous dev had implemented that we just passed around everywhere and had to throw print statements at to figure out what the hell was in it.
  • Come up with a way to test this early on and make sure it stays up to date. Unit testing is great, but you also need a way to run your workflow against sample source data with a debugger attached. This will absolutely save you a shitton of time later on.
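A rough sketch of what those bullets amount to (field names and sources here are invented for illustration): one normalized TargetData schema tagged with its source, a raw source_data blob kept in a consistent place, and small stateless transform steps chained together.

Python code:
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


@dataclass
class TargetData:
    # One normalized schema for every source, tagged rather than subclassed.
    record_id: str
    owner: str
    source: str  # e.g. 'system_a' or 'system_b'
    source_data: dict[str, Any] = field(default_factory=dict)  # raw blob, kept in one consistent place


def from_system_a(raw: dict[str, Any]) -> TargetData:
    return TargetData(record_id=raw['Id'], owner=raw['OwnerAlias'],
                      source='system_a', source_data=raw)


def from_system_b(raw: dict[str, Any]) -> TargetData:
    return TargetData(record_id=raw['uuid'], owner=raw['owner_email'].split('@')[0],
                      source='system_b', source_data=raw)


def drop_unowned(records: Iterable[TargetData]) -> list[TargetData]:
    return [r for r in records if r.owner]


def pipeline(records: Iterable[TargetData],
             steps: list[Callable[[Iterable[TargetData]], list[TargetData]]]) -> list[TargetData]:
    # Chain small, stateless transform steps instead of mutating shared state.
    for step in steps:
        records = step(records)
    return list(records)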

Falcon2001 fucked around with this message at 21:39 on Dec 23, 2023

iceaim
May 20, 2001

BAD AT STUFF posted:

I've spent a ton of time over the past year dealing with Python cert issues and this was a new one. That, plus the fact that apparently things work (albeit with an error message) without setting an option like "accept insecure certs" immediately made me suspicious. Finally, I noted that the error message is about *parsing* the cert rather than an SSL error.

So I went to look at the source file that's throwing this error and found an interesting comment:

https://source.chromium.org/chromium/chromium/src/+/main:net/cert/internal/cert_issuer_source_aia.cc;l=32

When you said that you're not getting errors using the form in Chrome, were you checking the logs? I'm guessing that this is happening in the browser too, but the error isn't being surfaced to the user since it's not causing any problems.

I don't know if the StackOverflow fix will suppress any other error messages that you should care about, but it looks like they're correct that this is one you don't need to care about.

I wasn't checking the logs, but I did after you suggested it. You were absolutely right. THANK YOU for the clarification. Your post has been extremely helpful. This Python mega thread rocks and I'm going to be hanging around here and browsing older posts. Stuff like Cavern of COBOL is exactly WHY SA is so much better than reddit and some of the federated clones going on now that have scaling issues.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Oysters Autobio posted:

Any good resources on managing ETL with Python? I'm still fairly new to python but I'm struggling to find good resources on building ETL tools and design patterns for helping automate different data cleaning and modeling requirements.

Our team is often responsible for taking adhoc spreadsheets or datasets and transforming them to a common data model/schema as well as some processing like removing PII for GDPR compliance, I'm struggling to conceptualize how this would look like.

We have a mixed team of data analysts. So some who are comfortable with Python, others only within the context of a Jupyter notebook.

I've seen in-house other similar projects which created custom wrappers (ie like a "dataset" object that then can be passed through various functions) but Id rather use tried/true patterns or even better a framework/library made by smarter people.

Really what I'm looking for is inspiration on how others have done ETL in python.

I'd agree with the advice that other folks gave already. One thing that makes it hard to have a single answer for this question is that there are different scales that the various frameworks operate on. If you need distributed computing and parallel task execution, then you'll want something like Airflow and PySpark to do this. If you're working on the scale of spreadsheets or individual files, something like Pandas would likely be a better option. But there are certain data formats or types that probably don't lend themselves to Pandas dataframes.

Falcon's approach of working backwards is what I'd go with. Where will this data live once your process is done with it? Database, mart, fileserver, something else? I'd start with what kind of libraries you need to interact with that system and work backwards to what can read from your data sources. If you really like Pandas dataframes for ETL, but late in the process you realize you need some database connector library with its own ORM classes, then you'll have to figure out how to reconcile those.

In terms of general library advice:

  • PySpark (distributed dataframes)
  • Pandas (non-distributed dataframes)
  • Pydantic (for data separated into different entities/models, can be made to work with different db libraries)

Then in terms of how to structure your code, I would also advocate for a functional style. I'd focus on having the individual steps as standalone functions that are well tested, rather than putting too much effort into custom wrappers or pipeline automation. If you break your steps into discrete functions, you can start out with basic (but hopefully well-structured) Jupyter notebooks to chain everything together. Once you have a few different pipelines, if you start seeing commonalities then maybe you can abstract your code into wrappers or libraries.

iceaim posted:

Your post has been extremely helpful.

Great, I'm glad to hear that!

QuarkJets
Sep 8, 2008

Dask is also a good library for distributed dataframes.

Pyspark is riddled with vulnerabilities; it makes my code and container scanners light up like Christmas trees. Apache is very bad about updating Spark's dependencies - back when the Log4Shell vulnerability came out, Apache said "it's ok, the log4j that comes with Spark is so old that it's not affected" :psyduck:

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
That's one nice thing about moving to Databricks. It's much easier to keep current with Pyspark versions compared to the old days. Waiting for Cloudera to release an update, getting all the users of the cluster to get on board with the new version, needing downtime to update all of the VMs... all of that was a nightmare. And yeah, our packages would forever raise Xray alerts because of some old rear end CVE that Apache wasn't going to fix.

I still prefer the Pyspark dataframe API over Pandas (and I guess Dask?) though :shrug:

Are you spinning up your own compute for Dask, or is there a good managed service out there? We needed a pretty well fleshed out platform for the data scientists. It was hard enough to get them to stop using R for everything.

mightygerm
Jun 29, 2002



BAD AT STUFF posted:

That's one nice thing about moving to Databricks. It's much easier to keep current with Pyspark versions compared to the old days. Waiting for Cloudera to release an update, getting all the users of the cluster to get on board with the new version, needing downtime to update all of the VMs... all of that was a nightmare. And yeah, our packages would forever raise Xray alerts because of some old rear end CVE that Apache wasn't going to fix.

I still prefer the Pyspark dataframe API over Pandas (and I guess Dask?) though :shrug:

Are you spinning up your own compute for Dask, or is there a good managed service out there? We needed a pretty well fleshed out platform for the data scientists. It was hard enough to get them to stop using R for everything.
Coiled/Saturn host managed Dask services, but I haven't used them personally so ymmv.
I'm starting to move away from Dask since there are just a lot of gotchas with it, especially when you are working with fairly large chunks of data (dataframes with tens of billions of rows, multi-TB raw data, etc). It's easy to write something that has a reshuffle hidden under the hood and completely balloons your task graph into oblivion. Also, it still has iffy support for multi-index, so if you use that a lot with Pandas you might have trouble moving over to Dask.

Oysters Autobio
Mar 13, 2017

BAD AT STUFF posted:

I'd agree with the advice that other folks gave already. One thing that makes it hard to have a single answer for this question is that there are different scales that the various frameworks operate on. If you need distributed computing and parallel task execution, then you'll want something like Airflow and PySpark to do this. If you're working on the scale of spreadsheets or individual files, something like Pandas would likely be a better option. But there are certain data formats or types that probably don't lend themselves to Pandas dataframes.

Falcon's approach of working backwards is what I'd go with. Where will this data live once your process is done with it? Database, mart, fileserver, something else? I'd start with what kind of libraries you need to interact with that system and work backwards to what can read from your data sources. If you really like Pandas dataframes for ETL, but late in the process you realize you need some database connector library with its own ORM classes, then you'll have to figure out how to reconcile those.

In terms of general library advice:

  • PySpark (distributed dataframes)
  • Pandas (non-distributed dataframes)
  • Pydantic (for data separated into different entities/models, can be made to work with different db libraries)

Then in terms of how to structure your code, I would also advocate for a functional style. I'd focus on having the individual steps as standalone functions that are well tested, rather than putting too much effort into custom wrappers or pipeline automation. If you break your steps into discrete functions, you can start out with basic (but hopefully well-structured) Jupyter notebooks to chain everything together. Once you have a few different pipelines, if you start seeing commonalities then maybe you can abstract your code into wrappers or libraries.

Great, I'm glad to hear that!

Thanks a bunch for you and everyone's advice.

Also, thanks especially for flagging pydantic. I think what it calls type coercion is very much what I need; I'll need to see an actual example ETL project using it to make full sense of it.

I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure.

StumblyWumbly
Sep 12, 2007

Batmanticore!
One thing to be slightly careful of is that Pydantic released v2 in 2023, so some internet info is outdated. Most significantly, ChatGPT is pretty much unaware of Pydantic v2.
It's a pretty straightforward tool tho, like dataclasses but safer.
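A minimal sketch of that "dataclasses but safer" point, in Pydantic v2 style (the field names are invented): values are coerced to the declared types on the way in, and bad input fails loudly.

Python code:
from datetime import date

from pydantic import BaseModel, ValidationError


class Row(BaseModel):
    amount: float
    when: date


# Strings from a CSV/JSON source get coerced to the declared types...
row = Row(amount='12.50', when='2023-12-27')
print(row.amount + 1)  # 13.5, a real float

# ...and garbage fails loudly instead of sneaking through.
try:
    Row(amount='twelve', when='2023-12-27')
except ValidationError as exc:
    print(exc)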

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure.

(Again, not a super professional at this, but here's some general advice for high level stuff.)

For the end state, you should, again, figure out what your requirements are - what will it be used for, where does it go, etc. Get feedback on this phase from the people who will use the data. From there, sketch out a basic structure for how you'd like the data to look that fits with wherever it's being stored. I wouldn't over-worry about perfection; just get something to start with as a target goal. You can iterate as you work through the design and implementation process.

From there, you have a starting point (the data comes in as X) and an end point (I want the data to all look like Y); put those down in a flowchart tool like draw.io etc. and then start adding steps between. At this phase you're literally just trying to make sure you cover the bases and then validate that the transitions work/etc. Get advice from others at this stage for sure; the audience for this flowchart would mostly be domain experts, so other devs, but also anyone that would catch that certain things aren't happening.

From there you have a general design. Designs aren't set in stone, they're designs, but the point is to try and figure out as much as possible before you start writing big chunks of code you might have to throw away.

After all that, you can start coding. (You can do some of your design in pseudocode or code if it makes you more comfortable but don't start writing functions/etc yet). Hopefully that helps get you through this phase.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Oysters Autobio posted:

Thanks a bunch for you and everyone's advice.

Also, thanks especially for flagging pydantic. I think what it calls type coercion is very much what I need; I'll need to see an actual example ETL project using it to make full sense of it.

I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure.

Nice. I'd also look at validators. They can be super helpful. https://docs.pydantic.dev/latest/concepts/validators/
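A small sketch of what a validator looks like in Pydantic v2 syntax (the model and the rule are made up):

Python code:
from pydantic import BaseModel, field_validator


class Customer(BaseModel):
    email: str
    country_code: str

    @field_validator('country_code')
    @classmethod
    def normalize_country(cls, value: str) -> str:
        # Runs during parsing; raising ValueError turns into a ValidationError.
        value = value.strip().upper()
        if len(value) != 2:
            raise ValueError('expected a two-letter country code')
        return value


print(Customer(email='a@example.com', country_code=' us '))  # country_code='US'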

In terms of project design, again I absolutely agree with Falcon. Before getting too invested in how to lay out the code, I create a high-level flow chart of the required steps. Then I'll have a whiteboarding session with other engineers, get feedback, and revise as needed.

Once you get to the implementation stage, we generally have our projects separated into an automation directory and a Python package with actual logic. That lets you modify or sub out the automation as needed. If you start with a basic driver script or notebook, you can go with something more complex that supports DAGs in the future if needed.

I'd focus more on making sure that the individual components are independent and well tested. That way, when you need to modify the design later you don't have to rework things at a low level. When developing patterns that are new for your team, there will be things you don't get right the first time. I'm a big advocate for getting out something as an MVP, learning from it, and iterating.

monochromagic
Jun 17, 2023

Oysters Autobio posted:

I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure.

In addition to all the great advice you have already received from other posters, I'll add: don't overthink it and don't over engineer it.

For most ETL pipelines, Dask/PySpark is shooting flies with a shotgun - look into them later if you have scaling issues.

It's great to have a good idea of structure from the get-go, and the flow charts people have mentioned are great for getting an overview, but structure can and will change - don't let it block you.

When I implement new pipelines I do an MVP of one table/data source/whatever and work out the kinks because actually coding is sometimes better than thinking about it.

Lastly, someone mentioned normalization - I don't think this was what they meant, but normalizing data in the database sense is usually not useful for analytics. A lot of an ETL/ELT workflow is denormalization.

WHERE MY HAT IS AT
Jan 7, 2011
If I can keep making this the ETL thread for a little longer, what would you all suggest for a totally greenfield project?

I do some contracting for my wife’s company, and inherited a mishmash of make.com no-code projects, single-purpose API endpoints in Heroku, and Airtable automations which sync data to Google Sheets.

They’re starting to hit a scale where this is a problem because things are unreliable, go out of sync, certain data exists only in some data stores, etc. The goal is to get them set up with a proper data store that can act as a single source of truth, and an actual ETL platform where I can write real code or use premade connectors to shift data around.

I did a bit of digging and something like Airbyte + Databricks looks nice, but maybe it’s overkill for what they need? Think thousands of rows a day rather than millions, and they only want to be able to do dashboarding and ad-hoc querying in Tableau. Would I regret just doing a managed Airflow and a Postgres instance at this point? I don’t want to have to redo everything in a year or two.

monochromagic
Jun 17, 2023

WHERE MY HAT IS AT posted:

I did a bit of digging and something like Airbyte + Databricks looks nice, but maybe it’s overkill for what they need? Think thousands of rows a day rather than millions, and they only want to be able to do dashboarding and ad-hoc querying in Tableau. Would I regret just doing a managed Airflow and a Postgres instance at this point? I don’t want to have to redo everything in a year or two.

Go for Dagster over Airflow imo. Keep Airbyte out of it until you need CDC or something similar - we run it in production and I'd say it's not actually mature enough for that yet. If you have good pipelines in Dagster it's easy to redo the ingestion. Look into dbt if anything for managing queries.
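For a flavour of what that looks like, here's a rough Dagster sketch using software-defined assets (the asset names and data are placeholders, written from memory, so double-check against the Dagster docs): dependencies are wired by parameter name, and a schedule or sensor would replace the manual materialize call in production.

Python code:
from dagster import asset, materialize


@asset
def raw_orders() -> list[dict]:
    # Placeholder: pull from Airtable / the warehouse API / wherever.
    return [{'id': 1, 'amount': '12.50'}, {'id': 2, 'amount': '7.00'}]


@asset
def cleaned_orders(raw_orders: list[dict]) -> list[dict]:
    # Dagster wires this to the raw_orders asset via the parameter name.
    return [{**row, 'amount': float(row['amount'])} for row in raw_orders]


if __name__ == '__main__':
    # One-off local run; in a deployment, schedules/sensors kick this off instead.
    materialize([raw_orders, cleaned_orders])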

Seventh Arrow
Jan 26, 2005

Mage is another possible Airflow alternative. I only say this as someone who hates Airflow, mind you.

WHERE MY HAT IS AT
Jan 7, 2011
Thanks! Those both look interesting, leaning towards dagster just because mage doesn’t have a managed hosted setup and I want to minimize the time I spend on this.


Say I have a pydantic model that represents my incoming data from a third party API (in this case a warehouse management system), what are folks using to actually write that to a raw table for transformation with dbt later? At work we use sqlalchemy for all our DB interactions but that seems heavy handed, especially if I’ve already got a list of models parsed from JSON or whatever. I could just hand roll a parameterized sql statement but surely there’s a library out there that will do what I need, right?

Edit: looks like Dagster can do this natively with a data frame, never mind!

WHERE MY HAT IS AT fucked around with this message at 10:23 on Dec 27, 2023
