|
Zoracle Zed posted:Maybe check out the 'sliding_window' iterator from the itertools recipes, which continues to drive me insane for not being importable but instead requires copy-pasting. Hey, thanks for this!
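For reference, the recipe being copy-pasted is short enough to drop in whole — this is essentially the version from the recipes section of the itertools docs:

```python
from collections import deque
from itertools import islice

def sliding_window(iterable, n):
    """sliding_window('ABCDE', 3) -> ABC BCD CDE"""
    it = iter(iterable)
    # Prime the window with the first n-1 items; maxlen makes the
    # deque drop the oldest item automatically as new ones arrive.
    window = deque(islice(it, n - 1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)
```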
|
# ? Dec 13, 2023 00:20 |
|
Anyone familiar with Click know if there's a way to implement argument parsing into a shared dataclass or something of that nature? Basically I have code that looks like this - it's a CLI that's mostly invoked via runbook / change management process and has a lot of required arguments and the signatures are starting to get ridiculous; I think we're up to like 10 common options, but individual commands need unique ones as well for the given workflow. Python code:
|
# ? Dec 13, 2023 02:23 |
|
have you read the building a git clone documentation? I've used that approach a lot, where the base command has the @pass_context decorator and stashes a custom dataclass in ctx.obj, and then the subcommands have the @pass_obj decorator to receive the dataclass object directly.
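A minimal sketch of that pattern — the option, command, and field names here are invented for illustration, not from the poster's actual CLI:

```python
from dataclasses import dataclass

import click

@dataclass
class Config:
    region: str
    dry_run: bool

@click.group()
@click.option('--region', required=True)
@click.option('--dry-run', is_flag=True)
@click.pass_context
def cli(ctx, region, dry_run):
    # Parse the shared options once at the group level and stash
    # them on the context for every subcommand to pick up.
    ctx.obj = Config(region=region, dry_run=dry_run)

@cli.command()
@click.option('--target', required=True)  # command-specific option
@click.pass_obj
def deploy(cfg: Config, target):
    # pass_obj hands us ctx.obj (the Config) directly.
    click.echo(f'{cfg.region} {cfg.dry_run} {target}')
```

Invocation then looks like `cli --region us-east-1 deploy --target web` — shared options before the subcommand name, command-specific ones after.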
|
# ? Dec 13, 2023 02:38 |
|
Zoracle Zed posted:have you read the building a git clone documentation? I've used that approach a lot, where the base command has the @pass_context decorator and stashes a custom dataclass in ctx.obj, and then the subcommands have the @pass_obj decorator to receive the dataclass object directly. I'll take another look. I thought this method involved having group options which necessitated breaking your cli args up in a weird way ex: (command group --opt-one subcommand --opt-two) but I might have misread it. I'd like to keep all the actual command parsing and opts at the lowest level so the help etc doesn't get janky, but I'll reread that in depth
|
# ? Dec 13, 2023 03:25 |
|
Falcon2001 posted:I'll take another look. I thought this method involved having group options which necessitated breaking your cli args up in a weird way ex: (command group --opt-one subcommand --opt-two) but I might have misread it. yeah you end up with an api like code:
Zoracle Zed fucked around with this message at 03:47 on Dec 13, 2023 |
# ? Dec 13, 2023 03:44 |
|
Zoracle Zed posted:yeah you end up with an api like Yeah I mean maybe that is fine. I might just avoid this whole thing for now, the feature I'm working on has enough other stuff going on I shouldn't try and fix this.
|
# ? Dec 13, 2023 04:24 |
|
Zoracle Zed posted:Maybe check out the 'sliding_window' iterator from the itertools recipes, which continues to drive me insane for not being importable but instead requires copy-pasting. more-itertools has you covered: https://more-itertools.readthedocs.io/en/stable/
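e.g. `windowed`, its importable equivalent of the recipe (assuming `pip install more-itertools`):

```python
from more_itertools import windowed  # pip install more-itertools

# Same idea as the itertools recipe, no copy-pasting required.
for win in windowed('ABCDE', 3):
    print(win)
```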
|
# ? Dec 13, 2023 11:33 |
|
Is there a way to have a Python script run automatically when a file (or really a set of files) is placed in a folder? I think I have to set up some sort of process which listens(?) for files being placed in the folder, and then once that is detected I can trigger the script, but I don't know what keywords to look for to start researching the solution. Fuller story: I have a script that cleans / formats 4 or sometimes 5 different CSV files. This process occurs 4 times during the first four days of the month. Right now I am the only one who runs the script, and the two people who rely on the output are not able to run the script. I want to avoid running into an issue where I am sick (or hit by a bus) and they are unable to access the data they need. I want to be able to designate a folder for them, and have them copy-paste the raw CSV files into the folder. Once they have done that, ideally something(?) would start the script running to clean / format the CSV files, and output the cleaned / formatted files into a different folder.
|
# ? Dec 13, 2023 19:34 |
|
Windows has scheduled tasks, linuxes will have cron. You would need a way to tell your scheduled script that a file has already been processed and should not be processed again (maybe move it out of your to-do folder if it successfully produces output). There are thousands of ways to try and solve this problem but given that you're asking here, and considering the language you used to ask about it, I would probably start with scheduled tasks or cron. A piece of advice I would offer is to run your automation against a copy of the data at first, even if you have to manually copy it. It's easy to accidentally break files.
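A hedged sketch of that approach — a one-shot pass you'd point Task Scheduler or cron at, with the move-when-done bookkeeping so nothing is processed twice (the `clean` callable stands in for the existing cleaning script):

```python
import shutil
from pathlib import Path

def process_pending(todo: Path, done: Path, out: Path, clean) -> int:
    """One-shot pass for cron / Task Scheduler: clean each CSV in `todo`,
    write the result to `out`, then move the original into `done` so the
    next scheduled run skips it. Returns the number of files handled."""
    done.mkdir(parents=True, exist_ok=True)
    out.mkdir(parents=True, exist_ok=True)
    handled = 0
    for csv in sorted(todo.glob('*.csv')):
        cleaned = clean(csv.read_text())              # your existing logic
        (out / csv.name).write_text(cleaned)          # output only on success
        shutil.move(str(csv), str(done / csv.name))   # mark as processed
        handled += 1
    return handled
```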
|
# ? Dec 13, 2023 19:41 |
|
Jose Cuervo posted:Is there a way to have a Python script run automatically when a file (or really a set of files) is placed in a folder? I think I have to set up some sort of process which listens(?) for files being placed in the folder, and then once that is detected I can trigger the script, but I don't know what keywords to look for to start researching the solution. Easiest option is to have something running constantly in the background with a sleep timer of some duration, then it wakes up and checks for files in the folder/etc and takes action if appropriate. there is file watching stuff but it's finicky from what I've read previously.
|
# ? Dec 13, 2023 19:42 |
|
OP what you're looking for is the concept of a "daemon", basically a process that lives forever in the background of a system doing some specific thing. There are lots of good options for this in linux, and some okay options in windows. Any process can be turned into a daemon, including python scripts. At a very basic level you could try using the built-in "sched" module. It's easy to use, basically you can ask sched to execute a function at some future point. At the end of that function you can have it schedule itself again, creating an infinite loop with whatever interval you want (for instance, 10 seconds). This is cute because only 1 instance of the function is ever running, and the next execution is only scheduled when the current one is ending, so it's very easy to manage. This should handle your use case easily, just make sure you have robust error handling and that you aren't trying to modify files while they're still being written.
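A sketch of that sched pattern — the `max_runs` knob is only there to keep the demo finite; in a real daemon you'd let it reschedule forever:

```python
import sched
import time

def run_forever(task, interval, max_runs=None):
    """Run `task` every `interval` seconds with the stdlib sched module.
    The next run is only scheduled as the current one finishes, so at
    most one instance is ever in flight."""
    s = sched.scheduler(time.monotonic, time.sleep)

    def step(remaining):
        task()  # e.g. poll the folder and process any new files
        if remaining is None or remaining > 1:
            nxt = None if remaining is None else remaining - 1
            s.enter(interval, 1, step, argument=(nxt,))

    s.enter(0, 1, step, argument=(max_runs,))
    s.run()  # blocks until nothing is left on the schedule
```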
|
# ? Dec 14, 2023 05:01 |
|
Any decent OS will have APIs that will react to directory changes instead of making you poll. They will generally be harder to use than just polling, so don't bother unless polling is actually a problem for your use case. ReadDirectoryChangesW is the Win32 one. inotify and fanotify are the linux ones. You would need something to wrap those if you wanted to use them from python. This almost certainly already exists somewhere on pypi, but I don't know of it. For linux, there is also a program built on top of inotify that will block until a change event happens, then exit. If you don't care that much about it being self contained, you could have a loop spawn that as a subprocess from your python program, wait for it to terminate, handle whatever changed, then loop again. I don't know of an equivalently packaged-up Windows utility. There's a .NET wrapper around ReadDirectoryChangesW (FileSystemWatcher) with a prettier interface that you could call from powershell.
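One such PyPI wrapper is `watchdog`, which hides inotify / FSEvents / ReadDirectoryChangesW behind a single API. A rough sketch (assuming `pip install watchdog`; the handler logic and folder name are invented):

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class NewFileHandler(FileSystemEventHandler):
    """Collects paths of newly created files; hook your script into on_created."""
    def __init__(self):
        self.seen = []

    def on_created(self, event):
        if not event.is_directory:
            self.seen.append(event.src_path)
            # ... kick off the cleaning script here ...

def watch(path, seconds):
    """Watch `path` for `seconds` seconds and return the handler."""
    handler = NewFileHandler()
    observer = Observer()
    observer.schedule(handler, path, recursive=False)
    observer.start()
    try:
        time.sleep(seconds)  # in a real daemon this would be a forever loop
    finally:
        observer.stop()
        observer.join()
    return handler
```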
|
# ? Dec 14, 2023 06:02 |
|
QuarkJets posted:There may be some cutesy itertools solution, I always like something that avoids regex even if it winds up being a little slower. Might think about it tonight That's what I did. Well, cutesy index stuff rather than itertools. Regex is usually my thing, but I haven't messed with string slicing at work for a while. It's probably less efficient, but since we're bailing out after the first number it's not a big deal. Python code:
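The original snippet didn't survive the quote, but the general shape of the no-regex approach — scan to the first digit, slice while digits last, bail out — looks something like this (a reconstruction, not the poster's actual code):

```python
def first_number(s: str):
    """Return the first integer embedded in `s`, or None if there isn't one.
    No regex: find the first digit, then extend the slice while digits last."""
    i, n = 0, len(s)
    while i < n and not s[i].isdigit():
        i += 1
    if i == n:
        return None  # no digits anywhere
    j = i
    while j < n and s[j].isdigit():
        j += 1
    return int(s[i:j])  # bail out after the first number
```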
BAD AT STUFF fucked around with this message at 14:52 on Dec 14, 2023 |
# ? Dec 14, 2023 14:49 |
|
My team of devops/platform/infra engineers are building and distributing our first internal Python library, which is going to provide helper functions and custom span processors for opentelemetry. We’re using semantic versioning and are publishing to an internal shared pypi repo. Unfortunately, we don’t have a lot of SWEs on the team! We’ve got one SWE who comes from a Java background, and I’m a comfortable programmer but come from an ops background. The rest of my team are largely ops people who write scripts here and there. What are some best practices we should consider as we transition from a team that maintains lovely little lambda functions to a team providing a library our production services will depend on? We’ve got semantic versioning, conventional commits, and some basic unit tests so far. I’m pushing for choosing a shared linter and enforcing that in our gitlab CI pipelines and precommit hooks (which incidentally I’ll happily take recommendations on). We’ve got a mix of pycharm and vs code users. What are some other best practices we should consider? Both for library distribution to the organization, but also policies you feel are helpful for improving code quality.
|
# ? Dec 14, 2023 17:55 |
|
The Iron Rose posted:My team of devops/platform/infra engineers are building and distributing our first internal Python library, which is going to provide helper functions and custom span processors for opentelemetry. We’re using semantic versioning and are publishing to an internal shared pypi repo. A shared linter and enforcement is good, I highly recommend it. Our system works by using black as our formatter (run as a command in the environment, not in your IDE) and then the enforcement is 'did you run Black and get no changes? Then you can proceed.' My best practices recommendations would be to think about it from your customer's point of view. Some things I'd consider:
Generally just think about 'If I was depending on a library, what would piss me off the most?' and that's a good greatest hits list of things to solve for. Otherwise the rest all sounds good and you're thinking through the major code quality stuff already. Ninja Edit: You may want to consider including MyPy in your package and the linter/enforcement checks, or at least evaluate if it's appropriate. MyPy raises the bar significantly on your code quality by basically faking type enforcement, but because of the intricacies of Python and the libraries you use, this is sometimes a MASSIVE amount of work. If you don't use it, my recommendation for any professional Python dev is to approach typing like you're working in Java or C# or another statically typed language and use type hints/etc everywhere possible. I think this is the single biggest easily-avoided Python footgun: You can churn out lovely, unreadable code incredibly quickly; with type hints you can at least help discovery on otherwise lovely code. Forcing yourself to do things like type hints and docstrings/etc is going to mean that your code is easier to read later, which is the most important single thing for high quality code. That's not to say it's the only thing - but from an effort perspective it's such a huge return on investment. Falcon2001 fucked around with this message at 19:42 on Dec 14, 2023 |
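As a tiny illustration of the point about hints aiding discovery — names here are invented, but the signature alone tells a caller what goes in and what comes out, and mypy can check it:

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    port: int

def reachable_hosts(hosts: list[Host], timeout_s: float = 1.0) -> list[str]:
    """Return the names of hosts considered reachable.
    (Stub logic for the demo: a real check would attempt a connection.)"""
    return [h.name for h in hosts if h.port > 0 and timeout_s > 0]
```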
# ? Dec 14, 2023 19:38 |
|
Do you have a CI pipeline? Each change to the code should require a pull request with successful unit testing at minimum. flake8 is a great addition, you can easily make it a required pre-commit hook. The only downside is that it won't fix stuff, it will only alert you of a problem. That could be considered an upside, in a different light. Also, add it to your CI pipeline so that people who aren't using pre-commit will be caught. I like Sonarqube, if you want more sophisticated code analysis. If you use pytest, there's an add-on (pytest-cov) that will mark the test run as failed if it did not have enough coverage (e.g. percent of lines covered by tests). Crank that up to 90% at least imo
|
# ? Dec 14, 2023 21:48 |
|
Try ruff in place of flake8 - it's much faster and can replace black as well.
|
# ? Dec 14, 2023 22:17 |
|
I like pylint for the suggestions / code rating and black as pre commit formatting. As someone else said, try to use type hinting and docstrings to make your code easier to understand. And once you have unit testing in place at an acceptable level, start building some integration tests on the most used features. Be relentless in sticking to all of the above. Devs always wanna take shortcuts and lower the standards (“let’s just disable rule x, it doesn’t really apply to us”, “90% code coverage is impossible, why is 70% not good enough!?”, “do we really need docstrings? Code is so easy to read”). If you give them an inch, they’ll take a mile. I’d also give your project setup some good thought as changing things later on might impact your users/consumers.
|
# ? Dec 14, 2023 22:29 |
|
Does your organization already have Python build pipelines that you'd be using, or are you creating those as well? You mentioned GitLab CI, but I wasn't sure if that was something your team owns. If you're creating your own build pipelines, I'd suggest using pyproject.toml files for your packages rather than setup.cfg or setup.py. I have a couple libraries I'd like to update, but our build pipelines are shared across the org. Getting everyone else to ditch their old setup.py files is going to be an uphill effort. If you're starting fresh, use the newest standard.
|
# ? Dec 14, 2023 23:30 |
|
Falcon2001 posted:This goes double if your company's build system would auto-update software/etc. You need a way to ensure that you can make breaking changes without accidentally breaking builds/production/etc. This requirement also needs to be levied on whatever group controls those auto-updates. If they're automatically updating across major version rolls of your software and that winds up breaking everything then that's really on them
|
# ? Dec 14, 2023 23:44 |
|
just wanted to follow up on this. got a horrible 10 minute long copy and paste and menu horror show down to a single keystroke, so success story for pyautogui
|
# ? Dec 19, 2023 22:52 |
|
This is the Python mega thread for Python questions right? Well this is mainly a Selenium headache, but I'm using Python 3.11 so I think the question belongs here. It's mainly a configuration issue of Selenium and poor documentation on the Selenium official site that is giving me a massive headache regarding a certificate parsing error. Here's the script in question: Python code:
So my script is working fine and is based off a script in the official documentation (https://www.selenium.dev/documentation/webdriver/getting_started/first_script/), however the certificate errors are a huge concern: Python terminal posted:DevTools listening on ws://127.0.0.1:15090/devtools/browser/5c404594-b8a4-4fee-8c75-a78f5f6a9a61 I've been trying to find a solution for days and can't find anything remotely adequate. These certificate errors seem to have started in the past year with multiple posts on them on reddit, stackexchange, etc without any real solution that addresses the root cause. Example here: https://www.reddit.com/r/selenium/comments/xpgx4o/error_message_that_just_doesnt_have_a_solution/ Yeah "error that doesn't have a solution" has been my experience and it's really bumming me out big time as Selenium is really cool despite being a bit crusty. There's a guy in that reddit thread that suggests "Most likely an expired ca public cert in the browsers cert store. so please update the browser and webdriver you use." Well, my normal Chrome installation 120.0.6099.110 doesn’t throw up any certificate errors if I visit the website I am submitting a form on and scraping (https://www.selenium.dev/selenium/web/web-form.html). I thought maybe my Chrome webdriver is outdated. Since I am using Chrome 120.0.6099.110 I went to https://sites.google.com/chromium.org/driver/downloads?authuser=0 which instructed me to go here: https://googlechromelabs.github.io/chrome-for-testing/ I grabbed chromedriver.exe that matches my Chrome web browser (Is the selenium spawned chrome browser even the same as my regular web surfing browser? Edit: Yes it appears to be the same version. Just checked.) and dropped it into my project directory. Then I modified first_script.py to ensure that it calls that specific web driver and also commented out the original code: Python code:
Some people on stackoverflow (https://stackoverflow.com/questions/75695413/xpath-wrong-in-selenium/75695484#75695484) suggest adding: Python code:
Are there any goon Selenium experts who can help with this configuration issue? I just want the Chrome browser and webdriver to work without throwing up these certificate errors. Note that I am using a Python virtual environment (venv) just for the Selenium project. iceaim fucked around with this message at 19:12 on Dec 23, 2023 |
# ? Dec 23, 2023 10:08 |
|
I've spent a ton of time over the past year dealing with Python cert issues and this was a new one. That, plus the fact that apparently things work (albeit with an error message) without setting an option like "accept insecure certs" immediately made me suspicious. Finally, I noted that the error message is about *parsing* the cert rather than an SSL error. So I went to look at the source file that's throwing this error and found an interesting comment: quote:// TODO(mattm): this creates misleading log spam if one of the other Parse* https://source.chromium.org/chromium/chromium/src/+/main:net/cert/internal/cert_issuer_source_aia.cc;l=32 When you said that you're not getting errors using the form in Chrome, were you checking the logs? I'm guessing that this is happening in the browser too, but the error isn't being surfaced to the user since it's not causing any problems. I don't know if the StackOverflow fix will suppress any other error messages that you should care about, but it looks like they're correct that this is one you don't need to care about.
|
# ? Dec 23, 2023 16:13 |
|
Any good resources on managing ETL with Python? I'm still fairly new to python but I'm struggling to find good resources on building ETL tools and design patterns for helping automate different data cleaning and modeling requirements. Our team is often responsible for taking adhoc spreadsheets or datasets and transforming them to a common data model/schema as well as some processing like removing PII for GDPR compliance, and I'm struggling to conceptualize what this would look like. We have a mixed team of data analysts. So some who are comfortable with Python, others only within the context of a Jupyter notebook. I've seen other similar in-house projects which created custom wrappers (ie like a "dataset" object that then can be passed through various functions) but I'd rather use tried/true patterns or even better a framework/library made by smarter people. Really what I'm looking for is inspiration on how others have done ETL in python.
|
# ? Dec 23, 2023 18:47 |
|
Oysters Autobio posted:Any good resources on managing ETL with Python? I'm still fairly new to python but I'm struggling to find good resources on building ETL tools and design patterns for helping automate different data cleaning and modeling requirements. i’m by no means a python/etl expert but i think the conventional wisdom is to use something off the shelf like dbt to handle the lion’s share of the transformation. dbt is written in python but the actual transformation is done with sql, which is typically better for that purpose. you do need to hook it up to a database tho. python scripts are pretty useful to glue it all together; i find pandas invaluable for importing data or exporting in some weird esoteric custom format
|
# ? Dec 23, 2023 18:59 |
|
Oysters Autobio posted:Any good resources on managing ETL with Python? So, I did something vaguely related before, and we didn't use any off the shelf stuff. (Probably because as mentioned this wasn't precisely the same thing.) We were combining multiple types of data sources to form a unified 'source list' for use in an auditing application, basically - so more ET and very little L. So here's the useful things I learned from that project, and maybe it's useful to know about. It's also entirely possible I'm full of poo poo so if others have more context and I'm wrong, speak up.
Falcon2001 fucked around with this message at 21:39 on Dec 23, 2023 |
# ? Dec 23, 2023 21:36 |
|
BAD AT STUFF posted:I've spent a ton of time over the past year dealing with Python cert issues and this was a new one. That, plus the fact that apparently things work (albeit with an error message) without setting an option like "accept insecure certs" immediately made me suspicious. Finally, I noted that the error message is about *parsing* the cert rather than an SSL error. I wasn't checking the logs, but I did after you suggested it. You were absolutely right. THANK YOU for the clarification. Your post has been extremely helpful. This Python mega thread rocks and I'm going to be hanging around here and browsing older posts. Stuff like Cavern of COBOL is exactly WHY SA is so much better than reddit and some of the federated clones going on now that have scaling issues.
|
# ? Dec 24, 2023 06:14 |
|
Oysters Autobio posted:Any good resources on managing ETL with Python? I'd agree with the advice that other folks gave already. One thing that makes it hard to have a single answer for this question is that there are different scales that the various frameworks operate on. If you need distributed computing and parallel task execution, then you'll want something like Airflow and PySpark to do this. If you're working on the scale of spreadsheets or individual files, something like Pandas would likely be a better option. But there are certain data formats or types that probably don't lend themselves to Pandas dataframes. Falcon's approach of working backwards is what I'd go with. Where will this data live once your process is done with it? Database, mart, fileserver, something else? I'd start with what kind of libraries you need to interact with that system and work backwards to what can read from your data sources. If you really like Pandas dataframes for ETL, but late in the process you realize you need some database connector library with its own ORM classes then you'll have to figure out how to reconcile those. In terms of general library advice:
Then in terms of how to structure your code, I would also advocate for a functional style. I'd focus on having the individual steps as standalone functions that are well tested, rather than putting too much effort into custom wrappers or pipeline automation. If you break your steps into discrete functions, you can start out with basic (but hopefully well structured) Jupyter notebooks to chain everything together. Once you have a few different pipelines, if you start seeing commonalities then maybe you can abstract your code into wrappers or libraries. iceaim posted:Your post has been extremely helpful. Great, I'm glad to hear that!
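A small sketch of the pydantic idea at an ETL boundary — coercion plus a validator (model and field names are made up; this uses the v2 API):

```python
from pydantic import BaseModel, field_validator

class Row(BaseModel):
    # Pydantic coerces compatible inputs ("42" -> 42) and raises a clear
    # error on junk, which covers most of the validate-at-the-boundary
    # step of an ETL pipeline.
    customer_id: int
    email: str

    @field_validator('email')
    @classmethod
    def normalize_email(cls, v: str) -> str:
        return v.strip().lower()

row = Row(customer_id='42', email='  Bob@Example.COM ')
```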
|
# ? Dec 25, 2023 01:49 |
|
Dask is also a good library for distributed dataframes. Pyspark is riddled with vulnerabilities; it makes my code and container scanners light up like Christmas trees. Apache is very bad about updating spark's dependencies; back when the Log4Shell vulnerability came out Apache said "it's ok, the log4j that comes with spark is so old that it's not affected"
|
# ? Dec 25, 2023 05:16 |
|
That's one nice thing about moving to Databricks. It's much easier to keep current with Pyspark versions compared to the old days. Waiting for Cloudera to release an update, getting all the users of the cluster to get on board with the new version, needing downtime to update all of the VMs... all of that was a nightmare. And yeah, our packages would forever raise Xray alerts because of some old rear end CVE that Apache wasn't going to fix. I still prefer the Pyspark dataframe API over Pandas (and I guess Dask?) though Are you spinning up your own compute for Dask, or is there a good managed service out there? We needed a pretty well fleshed out platform for the data scientists. It was hard enough to get them to stop using R for everything.
|
# ? Dec 25, 2023 08:22 |
|
BAD AT STUFF posted:That's one nice thing about moving to Databricks. It's much easier to keep current with Pyspark versions compared to the old days. Waiting for Cloudera to release an update, getting all the users of the cluster to get on board with the new version, needing downtime to update all of the VMs... all of that was a nightmare. And yeah, our packages would forever raise Xray alerts because of some old rear end CVE that Apache wasn't going to fix. I'm starting to move away from dask since there's just a lot of gotchas with it, especially when you are working with fairly large chunks of data (tens of billions of rows in a dataframe, multi-TB raw data, etc). It's easy to write something that has a reshuffle hidden under the hood and completely balloons your task graph into oblivion. Also it still has iffy support for multi-index so if you use that a lot with Pandas you might have trouble moving over to Dask.
|
# ? Dec 25, 2023 13:59 |
|
BAD AT STUFF posted:I'd agree with the advice that other folks gave already. One thing that makes it hard to have a single answer for this question is that there are different scales that the various frameworks operate on. If you need distributed computing and parallel task execution, then you'll want something like Airflow and PySpark to do this. If you're working on the scale of spreadsheets or individual files, something like Pandas would likely be a better option. But there are certain data formats or types that probably don't lend themselves to Pandas dataframes. Thanks a bunch to you and everyone for the advice. Also especially thanks for flagging pydantic. I think what it calls type coercion is very much what I need, though I'll need to see an actual example ETL project using it for it to fully make sense. I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure.
|
# ? Dec 25, 2023 15:40 |
|
One thing to be slightly careful of is that Pydantic released v2 in 2023, so some internet info is outdated. Most significantly, ChatGPT is pretty much unaware of Pydantic v2. It's a pretty straightforward tool tho, like Dataclasses but safer
|
# ? Dec 25, 2023 16:32 |
|
Oysters Autobio posted:I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure. (Again, not a super professional at this, but here's some general advice for high level stuff.) For the end state: again, figure out what your requirements are - what will it be used for, where does it go, etc. Get feedback on this phase from the people who will use the data. From there, sketch out a basic structure on how you'd like the data to look that fits with wherever it's being stored. I wouldn't over-worry about perfection, just get something to start with as a target goal. You can iterate as you work through the design and implementation process. From there, you have a starting point (the data comes in as X) and an end point (I want the data to all look like Y); put those down in a flowchart tool like Draw.io/etc and then start adding steps between. At this phase you're literally just trying to make sure you cover the bases and then validate that the transitions work/etc. Get advice from others at this stage for sure; for this flowchart that's mostly domain experts, so other devs, but also anyone who would catch that certain things aren't happening. From there you have a general design. Designs aren't set in stone, they're designs, but the point is to try and figure out as much as possible before you start writing big chunks of code you might have to throw away. After all that, you can start coding. (You can do some of your design in pseudocode or code if it makes you more comfortable but don't start writing functions/etc yet). Hopefully that helps get you through this phase.
|
# ? Dec 25, 2023 20:13 |
|
Oysters Autobio posted:Thanks a bunch for you and everyone's advice. Nice. I'd also look at validators. They can be super helpful. https://docs.pydantic.dev/latest/concepts/validators/ In terms of project design, again I absolutely agree with Falcon. Before getting too invested in how to lay out the code, I create a high-level flow chart of the required steps. Then I'll have a whiteboarding session with other engineers, get feedback, and revise as needed. Once you get to the implementation stage, we generally have our projects separated into an automation directory and a Python package with actual logic. That lets you modify or sub out the automation as needed. If you start with a basic driver script or notebook, you can go with something more complex that supports DAGs in the future if needed. I'd focus more on making sure that the individual components are independent and well tested. That way, when you need to modify the design later you don't have to rework things at a low level. When developing patterns that are new for your team, there will be things you don't get right the first time. I'm a big advocate for getting out something as an MVP, learning from it, and iterating.
|
# ? Dec 25, 2023 23:29 |
|
Oysters Autobio posted:I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure. In addition to all the great advice you have already received from other posters, I'll add: don't overthink it and don't over engineer it. For most ETL pipelines Dask/PySpark is shooting flies with shotguns - look into them later if you have scaling issues. It's great to have a good idea of structure from the get go, the flow charts people have mentioned are great for getting an overview, but structure will and can change - don't let it block you. When I implement new pipelines I do an MVP of one table/data source/whatever and work out the kinks because actually coding is sometimes better than thinking about it. Lastly, someone mentioned normalization - I don't think this was what they meant, but normalizing data in the database sense is usually not useful for analytics. A lot of an ETL/ELT workflow is denormalization.
|
# ? Dec 26, 2023 11:22 |
|
If I can keep making this the ETL thread for a little longer, what would you all suggest for a totally greenfield project? I do some contracting for my wife’s company, and inherited a mishmash of make.com no-code projects, single-purpose API endpoints in Heroku, and Airtable automations which sync data to Google Sheets. They’re starting to hit a scale where this is a problem because things are unreliable, go out of sync, certain data exists only in some data stores, etc. The goal is to get them set up with a proper data store that can act as a single source of truth, and an actual ETL platform where I can write real code or use premade connectors to shift data around. I did a bit of digging and something like Airbyte + Databricks looks nice, but maybe it’s overkill for what they need? Think thousands of rows a day rather than millions, and they only want to be able to do dashboarding and ad-hoc querying in Tableau. Would I regret just doing a managed Airflow and a Postgres instance at this point? I don’t want to have to redo everything in a year or two.
|
# ? Dec 26, 2023 12:27 |
|
WHERE MY HAT IS AT posted:I did a bit of digging and something like Airbyte + Databricks looks nice, but maybe it’s overkill for what they need? Think thousands of rows a day rather than millions, and they only want to be able to do dashboarding and ad-hoc querying in Tableau. Would I regret just doing a managed Airflow and a Postgres instance at this point? I don’t want to have to redo everything in a year or two. Go for Dagster over Airflow imo. Keep Airbyte out of it until you need CDC or something similar - we run it in production and I'd say it's not actually mature enough for that yet. If you have good pipelines in Dagster it's easy to redo the ingestion. Look into dbt if anything for managing queries.
|
# ? Dec 26, 2023 15:11 |
|
Mage is another possible Airflow alternative. I only say this as someone who hates Airflow, mind you.
|
# ? Dec 26, 2023 22:50 |
|
Thanks! Those both look interesting, leaning towards dagster just because mage doesn’t have a managed hosted setup and I want to minimize the time I spend on this. Say I have a pydantic model that represents my incoming data from a third party API (in this case a warehouse management system), what are folks using to actually write that to a raw table for transformation with dbt later? At work we use sqlalchemy for all our DB interactions but that seems heavy handed, especially if I’ve already got a list of models parsed from JSON or whatever. I could just hand roll a parameterized sql statement but surely there’s a library out there that will do what I need, right? Edit: looks like Dagster can do this natively with a data frame, never mind! WHERE MY HAT IS AT fucked around with this message at 10:23 on Dec 27, 2023 |
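For anyone landing here later, the models-to-raw-table step can be as small as this sketch (hypothetical names; pydantic v2's `model_dump` plus pandas `to_sql`, demoed against in-memory SQLite):

```python
import sqlite3

import pandas as pd
from pydantic import BaseModel

class Shipment(BaseModel):
    order_id: int
    sku: str
    qty: int

def load_raw(rows, conn):
    """Dump validated API records into a raw table for dbt to transform later.
    The pattern is just: validated models -> DataFrame -> to_sql."""
    df = pd.DataFrame([r.model_dump() for r in rows])
    df.to_sql('raw_shipments', conn, if_exists='append', index=False)
    return len(df)

conn = sqlite3.connect(':memory:')
n = load_raw([Shipment(order_id=1, sku='A-1', qty=3),
              Shipment(order_id=2, sku='B-2', qty=1)], conn)
```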
# ? Dec 27, 2023 00:47 |