|
monochromagic posted:In addition to all the great advice you have already received from other posters, I'll add: don't overthink it and don't over-engineer it. I'd recommend dask or spark if you predict that you'll need a solution that works with more than 1 CPU core. These frameworks work great on a single computer, in fact they're probably the easiest way to escape from the GIL - arguably with fewer pitfalls than concurrent.futures. But if you're doing something that takes 5 minutes once per month then yeah, do something simpler imo
|
# ? Dec 27, 2023 01:08 |
|
|
|
QuarkJets posted:I'd recommend dask or spark if you predict that you'll need a solution that works with more than 1 CPU core. These frameworks work great on a single computer, in fact they're probably the easiest way to escape from the GIL - arguably with fewer pitfalls than concurrent.futures. But if you're doing something that takes 5 minutes once per month then yeah, do something simpler imo Yeah, and to be honest, I can't tell you why but I'm not a big fan of pandas in general, and actually prefer pyspark / sparksql. Very much a subjective thing, couldn't tell you why, but the whole "index" and "axes" abstractions are a bit annoying to my brain since I'm coming from SQL-world. I only really need to use the distributed processing like 50% of the time, but it's enough that I'd rather just spend my time in one framework rather than splitting between both.
|
# ? Dec 28, 2023 02:26 |
|
Oysters Autobio posted:Yeah, and to be honest, I can't tell you why but I'm not a big fan of pandas in general, and actually prefer pyspark / sparksql. Very much a subjective thing, couldn't tell you why, but the whole "index" and "axes" abstractions are a bit annoying to my brain since I'm coming from SQL-world. I only really need to use the distributed processing like 50% of the time, but it's enough that I'd rather just spend my time in one framework rather than splitting between both. Are you me? Because I agree with all of that. Beyond my preference for SQL idioms, it's an infrequent thing that I need to reference anything by index. Columns are by name, rows are by row num (but only if I need a Window for some reason). If my data was all numbers in a multidimensional ndarray, then sure, give me indexes. But I find it weird and unhelpful when my data is a mix of different types and effectively always tabular. It's just another thing you can screw up if you only use Pandas infrequently. I'm also guilty of running smaller workflows through Pyspark, but it's nice to work within one stack. Plus, I don't think anyone ever got in trouble for making their jobs too small. I have been beating the drum of "we shouldn't use Pyspark automatically to parallelize everything just because it's there" for years, though. I think it's finally sinking in now that we can track costs of specific jobs through Databricks. If we're doing some hacky bullshit to call native Python libraries from Pandas UDFs, that's probably a sign that we could be doing this in Kubernetes. Now, we can finally attribute a dollar amount to that from Databricks license costs.
|
# ? Dec 28, 2023 03:49 |
|
WHERE MY HAT IS AT posted:If I can keep making this the ETL thread for a little longer, what would you all suggest for a totally greenfield project? My company has data on a similar scale. I do self-hosted airflow on a small VM and a managed postgres instance for the "warehouse". I use python pipelines, which use polars for dataframes for lazy evaluation and to handle the rare cases where there is a larger-than-memory job. Those pipelines are orchestrated by Airflow to sync data from all our sources (production db, various api sources), then airflow executes dbt jobs by tag (depending on the cadence - hourly jobs or daily jobs) to transform the data. Then we hook Tableau to the derived dbt tables. It works great. I'll be honest and say that there are two reasons I used Airflow over dagster/prefect/mage: (1) A good reason: there is better support, in that you can find more articles, wikis, blogs, etc. on airflow compared to the newer alternatives. (2) A selfish reason: I wanted deep experience with Airflow on my resume. If I were contracting, I'd seriously consider just using dagster instead since it seems easier to use all things considered. For self-host vs managed, it really depends on the budget. Since it sounds like they won't have anyone full-time on the data engineering side, that'd lean heavily towards managed for everything if it's within the budget. We do largely self-hosted because it's cheaper given that they are paying me anyway. I'd lean against spark in this context. Polars can do everything you need in cases where you're dealing with thousands of rows a day. Plus, if you're using dbt then all your actual computation for derived tables is being done on the database instance anyways. Speaking of the database instance, if you find that the dbt jobs are really slow, then you might need a columnar database instead of postgres.
The main issue with that is that if you want a managed columnar database, rather than self-hosting something like doris/starrocks, then things get really expensive fast because then you're usually looking at snowflake, bigquery, etc. CompeAnansi fucked around with this message at 04:26 on Dec 28, 2023 |
# ? Dec 28, 2023 04:16 |
|
CompeAnansi posted:Speaking of the database instance, if you find that the DBT jobs are really slow, then you might need a columnar database instead of postgres. The main issue with that is that if you want a managed columnar database, rather than self-hosting something like doris/starrocks, then things get really expensive fast because then you're usually looking at snowflake, bigquery, etc. You can also look into managed DuckDB, although I'm not sure how it compares cost-wise: https://motherduck.com/docs/intro/ We should maybe consider starting a data engineering thread.
|
# ? Dec 28, 2023 12:19 |
|
Yeah, they’re probably several years out from needing a full time engineer, if they ever get that far. So I’m it, and the less time spent on maintenance the better for both sides. I’ll start with Postgres since I have experience scaling that pretty far at work, and plan to move to BigQuery someday if they need it. A data engineering thread sounds like a good idea, I can work on an OP this week since I’m off on PTO and my nephews gave me the plague or something anyways.
|
# ? Dec 28, 2023 12:36 |
|
monochromagic posted:We should maybe consider starting a data engineering thread. I would happily contribute to such a thread if someone starts it.
|
# ? Dec 28, 2023 20:02 |
|
I would love to have a place to bitch about Azure Data Factory, so I'm all for a data engineering thread.
|
# ? Dec 28, 2023 21:04 |
|
we had this exact conversation in the SQL thread a while ago. I refused to start a data engineering thread because I did that a few years ago and there were no replies. So there
|
# ? Dec 28, 2023 21:32 |
|
BAD AT STUFF posted:I would love to have a place to bitch about Azure Data Factory, so I'm all for a data engineering thread. when i was hired at my current job i was started out on an ADF project and i jumped to something else as soon as the opportunity presented itself
|
# ? Dec 28, 2023 23:56 |
|
Adding to the idea about a data engineering thread. I don't really follow many threads right now in SH/SC because the data stuff spans multiple megathreads.
|
# ? Dec 29, 2023 00:49 |
|
The Science subforum has separate threads for data science and numerical analysis, I think data engineering falls under both of those
|
# ? Dec 29, 2023 01:21 |
|
BAD AT STUFF posted:I would love to have a place to bitch about Azure Data Factory, so I'm all for a data engineering thread. susan b buffering posted:when i was hired at my current job i was started out on an ADF project and i jumped to something else as soon as the opportunity presented itself I've only used ADF for learning and personal projects, and it seems like it could be a powerful tool. What don't you all like about it?
|
# ? Dec 29, 2023 02:34 |
|
QuarkJets posted:The Science subforum has separate threads for data science and numerical analysis, I think data engineering falls under both of those Disagree. It is its own thing that touches on both.
|
# ? Dec 29, 2023 05:43 |
|
Hughmoris posted:I've only used ADF for learning and personal projects, and it seems like it could be a powerful tool. What don't you all like about it? A million little things. One of the more recent: you can't resume a failed pipeline run. Instead, you can rerun the pipeline from the failed task. However, this does not preserve the variable state of the previous run. So if you've been writing out files to a working directory, say, and you have some variable that is referenced throughout the pipeline for those file names, then you lose that state for the new run. We ended up creating a sub-pipeline that could have all variables passed into it, and using another pipeline to calculate variables and pass them in. Then if you need to restart the sub-pipeline, it will actually use the same params as the original run. I thought leaving Airflow behind would make our lives easier. Little did I know... CompeAnansi posted:Disagree. It is its own thing that touches on both. Yeah. It likely depends on the workplace, but where I'm at there's a pretty big distinction between data science and data engineering. I've picked up some science knowledge by osmosis, but I'm not going to be the one deciding if we should be using MAPE or RMSE. I'm going to be making sure that science's code scales, is automated, and has tests.
|
# ? Dec 29, 2023 09:05 |
|
Data engineering thread: https://forums.somethingawful.com/showthread.php?threadid=4050611 WHERE MY HAT IS AT posted:A data engineering thread sounds like a good idea, I can work on an OP this week since I’m off on PTO and my nephews gave me the plague or something anyways. Sorry for the snipe, feel free to add any resources etc I've forgotten!
|
# ? Dec 29, 2023 13:21 |
|
For those of you who fiddle with Python at home, what does your dev environment look like? Do you have Python installed directly to your desktop? Utilizing containers, or a remote environment?
|
# ? Jan 22, 2024 00:34 |
|
I use miniconda installed directly on the computer, containers are too fussy and hold no advantage for anything that I'm doing at home
|
# ? Jan 22, 2024 00:40 |
|
I just install system python on windows or Linux and then one virtualenv for each project. At risk of starting a holy war, I just never got into the condas or whatever. Probably because I didn’t come from the data science analysis world and so I just do it the way I’m comfortable with.
|
# ? Jan 22, 2024 00:44 |
|
Hughmoris posted:For those of you who fiddle with Python at home, what does your dev environment look like? Do you have Python installed directly to your desktop? Utilizing containers, or a remote environment? System Python and a virtual environment per project is a low-fuss way to live, unless your system python is too old because you're riding CentOS into the ground for some reason.
|
# ? Jan 22, 2024 00:53 |
|
QuarkJets posted:I use miniconda installed directly on the computer, containers are too fussy and hold no advantage for anything that I'm doing at home I've gone to miniconda over virtual environments for the ease of switching to new Python versions. I can leave the system Python alone but still get new features regularly now that releases are more frequent.
|
# ? Jan 22, 2024 01:23 |
|
Hughmoris posted:For those of you who fiddle with Python at home, what does your dev environment look like? Do you have Python installed directly to your desktop? Utilizing containers, or a remote environment? Pyenv for versions and poetry for dependencies per project.
|
# ? Jan 22, 2024 01:30 |
|
pyenv + pyenv-alias plugin for isolated installs per project
|
# ? Jan 22, 2024 01:32 |
|
Hed posted:I just install system python on windows or Linux and then one virtualenv for each one. Conda environments are just another kind of environment: venv plus the ability to bring in additional libraries, including libraries that aren't python packages. If you're comfortable with venv or pyenv then conda would feel very familiar
|
# ? Jan 22, 2024 02:25 |
|
As a total moron, I found venv easier to use when I got started because I ran into it a lot in documentation for other things I was using. I have yet to understand Condas and would prefer not to, as again I am a total moron and barely understand what I'm doing with venv.
|
# ? Jan 22, 2024 12:37 |
|
tangentially related, is there a good way to generate a requirements.txt that only has what is being imported in my working directory and not everything in my entire environment?
|
# ? Jan 22, 2024 18:54 |
|
The Fool posted:tangentially related, is there a good way to generate a requirements.txt that only has what is being imported in my working directory and not everything in my entire environment Requirements.txt is usually made by pip freeze or by hand as you install stuff, so you can maintain order and versions. What’s being imported doesn’t necessarily have 1:1 naming with the package name on pip. So maybe, but there are likely going to be a lot of issues with that a year or two from now if the environment needs to get reinstalled.
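To illustrate the scanning half of the problem, here's a stdlib-only sketch that pulls top-level import names out of source files. The hard part a real tool has to solve is the mapping from import name to pip distribution name (e.g. yaml is PyYAML, sklearn is scikit-learn), which has no general solution without consulting installed metadata:

```python
import ast
from pathlib import Path

def top_level_imports(source: str) -> set[str]:
    """Collect the top-level module names imported by one source file."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import yaml.parser" counts as an import of "yaml"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # skip relative imports (level > 0); they aren't external deps
            names.add(node.module.split(".")[0])
    return names

def scan_tree(root: str) -> set[str]:
    """Union of top-level imports across every .py file under root."""
    found: set[str] = set()
    for path in Path(root).rglob("*.py"):
        found |= top_level_imports(path.read_text())
    return found
```

Filtering out the stdlib and translating the remaining names to pip packages is exactly the part tools like pipreqs try to automate, and exactly where they can get it wrong.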
|
# ? Jan 22, 2024 19:48 |
|
The Fool posted:tangentially related, is there a good way to generate a requirements.txt that only has what is being imported in my working directory and not everything in my entire environment pip freeze doesn’t look at your imports, just what’s installed. The point is to have _every_ dependency pinned.
|
# ? Jan 22, 2024 19:59 |
|
necrotic posted:pip freeze doesn’t look at your imports, just what’s installed. The point is to have _every_ dependency pinned. I understand that which is the entire point of my question.
|
# ? Jan 22, 2024 22:06 |
|
CarForumPoster posted:Requirements.txt is usually made by pip freeze or by hand as you install stuff so you can maintain orders and versions. What’s being imported doesn’t necessarily have 1:1 naming with the package name on pip. My issue is that my local workstation has a million unrelated things installed and I never did a good job of managing venvs. So when I write a thing to be deployed and need a requirements.txt I have to do it by hand right now and it's annoying.
|
# ? Jan 22, 2024 22:08 |
|
I use pipreqs
|
# ? Jan 22, 2024 22:09 |
|
The Fool posted:My issue is that my local workstation has a million unrelated things installed and I never did a good job of managing venvs. The pipreqs suggestion seems like what you want. But I recommend using a proper venv at some point. Could even use that as step one: make a new venv, then pip install, and bang, now you have a venv with just what your project needs.
|
# ? Jan 23, 2024 00:41 |
|
my boss has one giant folder that he opens in vscode for everything (dozens of subfolders), never uses venv, and complains about it regularly. I, a complete moron, continue to feel good about spending too much time failing to adequately understand venv. I may not know what I'm doing, but that's all the more reason to try to keep my bullshit contained.
|
# ? Jan 23, 2024 14:41 |
|
I have a Python library for opentelemetry that does some specific things for our business - mainly an overridden class and a few static methods for developers. I’d like to generate documentation for this library, which I intend to store in the gitlab repo where the library is stored. Is there any reason why I shouldn’t just use pdoc for this in my CI pipeline or in a git commit hook to automatically generate the documentation from docstrings?
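No strong reason not to; pdoc renders pages straight from docstrings, so the main thing to check before wiring it into CI is that the docstrings actually survive to runtime. One classic gotcha: a decorator that wraps your static methods will hide their docstrings unless it uses functools.wraps. A minimal sketch with hypothetical names:

```python
import functools

def logged(func):
    """A sketch of a wrapping decorator that keeps docstrings visible to pdoc."""
    @functools.wraps(func)  # copies __doc__, __name__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@logged
def start_span(name: str) -> str:
    """Start a span with the given name (hypothetical helper)."""
    return name
```

Without the functools.wraps line, start_span.__doc__ would be the decorator's wrapper docstring (here None), and the generated docs would be empty for every wrapped function.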
|
# ? Jan 23, 2024 20:00 |
|
I found several bizarre packages in our requirements.txt and installing them every time in a docker container is time consuming. Looks like every time someone goes looking for unused packages, they decide to reinvent the wheel and write their own script. I keep stumbling across these when searching. It's a Django project, not that it matters. Is there a bog-standard way to do this or am I really gonna have to grind this out?
|
# ? Jan 24, 2024 00:34 |
Grind it out imo. And in the future don't pip install anything without adding it to requirements.txt at the same time. Use venv and make your requirements.txt as sparse as you can get away with, while pinning everything you do specify. Or use pipenv and do a lockfile to pin everything, including all dependencies that get installed incidentally. (these are my opinions/habits and it's what works for me, I defer to almost anyone else in this thread if they have a better practice to promote)
|
|
# ? Jan 24, 2024 00:51 |
|
The Fool posted:My issue is that my local workstation has a million unrelated things installed and I never did a good job of managing venvs. Have you given Poetry a whirl? I've started playing around with it a few months ago. It can build a venv for your project and keeps track of dependencies in a .toml file. Haven't stumbled on any huge annoyances just yet.
|
# ? Jan 24, 2024 01:20 |
|
rich thick and creamy posted:Have you given Poetry a whirl? I've started playing around with it a few months ago. It can build a venv for your project and keeps track of dependencies in a .toml file. Haven't stumbled on any huge annoyances just yet. I second this. If you are developing a package, library or executable, and it needs to be in an environment with its correct dependencies, imo poetry is the best thing to use right now. It keeps your dependencies nice and tidy and also generates a lock file to reproduce an exact environment, depending on your needs. Just makes it easy to do things properly the modern way. Have separate dev dependencies? It's got you covered. Devs get those, but someone just pip installing the package doesn't.
|
# ? Jan 24, 2024 22:57 |
|
Hadlock posted:I found several bizarre packages in our requirements.txt and installing them every time in a docker container is time consuming. I use https://pypi.org/project/deptry/ with poetry
|
# ? Jan 24, 2024 23:15 |
|
|
|
Yeah, we're using poetry. Can I ask poetry where "rust 0.1.1" is being called from? I thought poetry was just a dependency engine that requires two additional files in the root directory, with poor backwards compatibility. It's (rust) used for parsing, uh, ribosomes, possibly for DNA-tangential research. We are definitely not in the business of doing that, and it seems unlikely. Edit: I'm asking about this from the angle of optimizing the build time/process. I don't own/work on this repo besides config tweaks. Hadlock fucked around with this message at 23:41 on Jan 24, 2024 |
# ? Jan 24, 2024 23:38 |