QuarkJets
Sep 8, 2008

monochromagic posted:

In addition to all the great advice you have already received from other posters, I'll add: don't overthink it and don't over-engineer it.

For most ETL pipelines, Dask/PySpark is shooting flies with shotguns - look into them later if you have scaling issues.

I'd recommend dask or spark if you predict that you'll need a solution that works with more than one CPU core. These frameworks work great on a single computer; in fact, they're probably the easiest way to escape from the GIL, arguably with fewer pitfalls than concurrent.futures. But if you're doing something that takes 5 minutes once per month, then yeah, do something simpler imo
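
Something like this is what I mean by "works great on a single computer" - a rough sketch (file and column names made up) of dask doing pandas-style work across local cores:

Python code:
import dask.dataframe as dd

# Read a directory of CSVs lazily; each file becomes one or more partitions
df = dd.read_csv("events-*.csv")

# Same API shape as pandas, but partitions can be processed in parallel
daily_totals = df.groupby("day")["amount"].sum()

# Nothing runs until .compute() hands the task graph to the scheduler
print(daily_totals.compute())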

Oysters Autobio
Mar 13, 2017

QuarkJets posted:

I'd recommend dask or spark if you predict that you'll need a solution that works with more than one CPU core. These frameworks work great on a single computer; in fact, they're probably the easiest way to escape from the GIL, arguably with fewer pitfalls than concurrent.futures. But if you're doing something that takes 5 minutes once per month, then yeah, do something simpler imo

Yeah, and to be honest, I can't tell you why, but I'm not a big fan of pandas in general and actually prefer pyspark / sparksql. It's very much a subjective thing, but the whole "index" and "axes" abstractions are a bit annoying to my brain since I'm coming from SQL-world. I only really need the distributed processing like 50% of the time, but it's enough that I'd rather just spend my time in one framework rather than splitting between both.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Oysters Autobio posted:

Yeah, and to be honest, I can't tell you why, but I'm not a big fan of pandas in general and actually prefer pyspark / sparksql. It's very much a subjective thing, but the whole "index" and "axes" abstractions are a bit annoying to my brain since I'm coming from SQL-world. I only really need the distributed processing like 50% of the time, but it's enough that I'd rather just spend my time in one framework rather than splitting between both.

Are you me? Because I agree with all of that. Beyond my preference for SQL idioms, it's rare that I need to reference anything by index. Columns are by name, rows are by row number (but only if I need a Window for some reason). If my data were all numbers in a multidimensional ndarray, then sure, give me indexes. But I find it weird and unhelpful when my data is a mix of different types and effectively always tabular. It's just another thing you can screw up if you only use pandas infrequently.
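
For illustration (table and column names invented, not real code), the style I mean looks like this - columns by name, and row numbers only when you explicitly ask for them with a Window:

Python code:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("orders.parquet")

# Columns referenced by name, no index/axes bookkeeping
shipped = df.filter(F.col("status") == "shipped").select("order_id", "customer_id", "amount")

# Row numbers only show up when you ask for them via a Window
w = Window.partitionBy("customer_id").orderBy(F.col("amount").desc())
ranked = shipped.withColumn("row_num", F.row_number().over(w))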

I'm also guilty of running smaller workflows through PySpark, but it's nice to work within one stack. Plus, I don't think anyone ever got in trouble for making their jobs too small :v:

I have been beating the drum of "we shouldn't automatically use PySpark to parallelize everything just because it's there" for years, though. I think it's finally sinking in now that we can track the cost of specific jobs through Databricks. If we're doing some hacky bullshit to call native Python libraries from Pandas UDFs, that's probably a sign we could be doing this in Kubernetes, and now we can finally attribute a dollar amount to it in Databricks license costs.
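
The kind of pattern I mean is roughly this (sketched with a stand-in function, not our actual code): a Pandas UDF whose only job is to smuggle single-node Python work into Spark:

Python code:
import math
import pandas as pd
from pyspark.sql.functions import pandas_udf

def native_style_score(x: float) -> float:
    # Stand-in for a call into some non-Spark Python library
    return math.log1p(abs(x))

@pandas_udf("double")
def score(amounts: pd.Series) -> pd.Series:
    # Each partition arrives as a pandas Series; the real work is plain
    # single-node Python, which is a hint it might not need Spark at all
    return amounts.apply(native_style_score)

# usage: df.withColumn("score", score(df["amount"]))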

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

WHERE MY HAT IS AT posted:

If I can keep making this the ETL thread for a little longer, what would you all suggest for a totally greenfield project?

I do some contracting for my wife’s company, and inherited a mishmash of make.com no-code projects, single-purpose API endpoints in Heroku, and Airtable automations which sync data to Google Sheets.

They’re starting to hit a scale where this is a problem because things are unreliable, go out of sync, certain data exists only in some data stores, etc. The goal is to get them set up with a proper data store that can act as a single source of truth, and an actual ETL platform where I can write real code or use premade connectors to shift data around.

I did a bit of digging and something like Airbyte + Databricks looks nice, but maybe it’s overkill for what they need? Think thousands of rows a day rather than millions, and they only want to be able to do dashboarding and ad-hoc querying in Tableau. Would I regret just doing a managed Airflow and a Postgres instance at this point? I don’t want to have to redo everything in a year or two.

My company has data on a similar scale. I run self-hosted Airflow on a small VM and a managed Postgres instance for the "warehouse". The pipelines are Python, using Polars dataframes for lazy evaluation and to handle the rare larger-than-memory job. Airflow orchestrates those pipelines to sync data from all our sources (production db, various API sources), then executes dbt jobs by tag (depending on the cadence - hourly or daily) to transform the data. Then we hook Tableau up to the derived dbt tables. It works great.
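
A stripped-down sketch of the "Airflow runs dbt by tag" piece (the DAG id, schedule, and project path here are invented, not our actual config):

Python code:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hourly_dbt_models",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # dbt's node selection picks up every model tagged "hourly"
    run_hourly = BashOperator(
        task_id="dbt_run_hourly",
        bash_command="dbt run --select tag:hourly --project-dir /opt/dbt",
    )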

I'll be honest and say that there are two reasons I used Airflow over dagster/prefect/mage:
(1) A good reason: there is better support, in that you can find more articles, wikis, blogs, etc. on Airflow compared to the newer alternatives.
(2) A selfish reason: I wanted deep experience with Airflow on my resume.
If I were contracting, I'd seriously consider just using dagster instead, since all things considered it seems easier to use.

For self-host vs managed, it really depends on the budget. Since it sounds like they won't have anyone full-time on the data engineering side, that'd lean heavily towards managed for everything if it's within the budget. We do largely self-hosted because it's cheaper given that they are paying me anyway.

I'd lean against Spark in this context. Polars can do everything you need when you're dealing with thousands of rows a day. Plus, if you're using dbt, then all the actual computation for derived tables is being done on the database instance anyway.
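
For scale like that, a lazy Polars pipeline is about as much machinery as you need - a sketch with placeholder file and column names:

Python code:
import polars as pl

# A lazy scan builds a query plan instead of loading the file up front
pipeline = (
    pl.scan_csv("raw/events.csv")
    .filter(pl.col("status") == "complete")
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# Nothing is read or computed until collect()
result = pipeline.collect()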

Speaking of the database instance, if you find that the dbt jobs are really slow, you might need a columnar database instead of Postgres. The main issue is that if you want a managed columnar database, rather than self-hosting something like Doris/StarRocks, things get really expensive fast, because you're usually looking at Snowflake, BigQuery, etc.

CompeAnansi fucked around with this message at 04:26 on Dec 28, 2023

monochromagic
Jun 17, 2023

CompeAnansi posted:

Speaking of the database instance, if you find that the dbt jobs are really slow, you might need a columnar database instead of Postgres. The main issue is that if you want a managed columnar database, rather than self-hosting something like Doris/StarRocks, things get really expensive fast, because you're usually looking at Snowflake, BigQuery, etc.

You can also look into managed DuckDB, although I'm not sure how it compares cost-wise: https://motherduck.com/docs/intro/
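
For a taste of what that looks like locally (the parquet path is made up), DuckDB is a columnar engine that will query files in place, and MotherDuck is the managed layer on top:

Python code:
import duckdb

# DuckDB queries parquet files in place, no load step needed
result = duckdb.sql(
    "SELECT customer_id, sum(amount) AS total FROM 'exports/*.parquet' GROUP BY customer_id"
)
print(result.df())  # hand results back as a pandas DataFrame if you want them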

We should maybe consider starting a data engineering thread.

WHERE MY HAT IS AT
Jan 7, 2011
Yeah, they’re probably several years out from needing a full-time engineer, if they ever get that far. So I’m it, and the less time spent on maintenance the better for both sides. I’ll start with Postgres since I have experience scaling that pretty far at work, and plan to move to BigQuery someday if they need it.

A data engineering thread sounds like a good idea, I can work on an OP this week since I’m off on PTO and my nephews gave me the plague or something anyways.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

monochromagic posted:

We should maybe consider starting a data engineering thread.

I would happily contribute to such a thread if someone starts it.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
I would love to have a place to bitch about Azure Data Factory, so I'm all for a data engineering thread.

Seventh Arrow
Jan 26, 2005

We had this exact conversation in the SQL thread a while ago. I refused to start a data engineering thread because I did that a few years ago and there were no replies. So there :colbert:

susan b buffering
Nov 14, 2016

BAD AT STUFF posted:

I would love to have a place to bitch about Azure Data Factory, so I'm all for a data engineering thread.

when i was hired at my current job i started out on an ADF project and jumped to something else as soon as the opportunity presented itself

Oysters Autobio
Mar 13, 2017
Adding my support to the idea of a data engineering thread. I don't really follow many threads right now in SH/SC because the data stuff spans multiple megathreads.

QuarkJets
Sep 8, 2008

The Science subforum has separate threads for data science and numerical analysis; I think data engineering falls under both of those

Hughmoris
Apr 21, 2007
Let's go to the abyss!

BAD AT STUFF posted:

I would love to have a place to bitch about Azure Data Factory, so I'm all for a data engineering thread.

susan b buffering posted:

when i was hired at my current job i started out on an ADF project and jumped to something else as soon as the opportunity presented itself

I've only used ADF for learning and personal projects, and it seems like it could be a powerful tool. What don't you all like about it?

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

QuarkJets posted:

The Science subforum has separate threads for data science and numerical analysis; I think data engineering falls under both of those

Disagree. It is its own thing that touches on both.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Hughmoris posted:

I've only used ADF for learning and personal projects, and it seems like it could be a powerful tool. What don't you all like about it?

A million little things. One of the more recent ones: you can't resume a failed pipeline run. Instead, you can rerun the pipeline from the failed task. However, that does not preserve the variable state of the previous run. So if you've been writing out files to a working directory, say, and you have some variable that is referenced throughout the pipeline for those file names, then you lose that state for the new run. We ended up creating a sub-pipeline that could have all its variables passed into it, and using another pipeline to calculate the variables and pass them in. Then, if you need to restart the sub-pipeline, it will actually use the same params as the original run.

I thought leaving Airflow behind would make our lives easier. Little did I know...

CompeAnansi posted:

Disagree. It is its own thing that touches on both.

Yeah. It likely depends on the workplace, but where I'm at there's a pretty big distinction between data science and data engineering. I've picked up some science knowledge by osmosis, but I'm not going to be the one deciding whether we should be using MAPE or RMSE. I'm going to be making sure that the science team's code scales, is automated, and has tests.

monochromagic
Jun 17, 2023

Data engineering thread: https://forums.somethingawful.com/showthread.php?threadid=4050611

WHERE MY HAT IS AT posted:

A data engineering thread sounds like a good idea, I can work on an OP this week since I’m off on PTO and my nephews gave me the plague or something anyways.

Sorry for the snipe, feel free to add any resources etc I've forgotten!

Hughmoris
Apr 21, 2007
Let's go to the abyss!
For those of you who fiddle with Python at home, what does your dev environment look like? Do you have Python installed directly to your desktop? Utilizing containers, or a remote environment?

QuarkJets
Sep 8, 2008

I use miniconda installed directly on the computer; containers are too fussy and hold no advantage for anything that I'm doing at home

Hed
Mar 31, 2004

Fun Shoe
I just install system Python on Windows or Linux and then one virtualenv for each one.

At the risk of starting a holy war, I just never got into the condas or whatever. Probably because I didn’t come from the data science / analysis world, and so I just do it the way I’m comfortable with.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Hughmoris posted:

For those of you who fiddle with Python at home, what does your dev environment look like? Do you have Python installed directly to your desktop? Utilizing containers, or a remote environment?

System Python and a virtual environment per project is a low-fuss way to live, unless your system Python is too old because you're riding CentOS into the ground for some reason.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

QuarkJets posted:

I use miniconda installed directly on the computer; containers are too fussy and hold no advantage for anything that I'm doing at home

I've gone to miniconda over virtual environments for the ease of switching to new Python versions. I can leave the system Python alone but still get new features regularly now that releases are more frequent.

lazerwolf
Dec 22, 2009

Orange and Black

Hughmoris posted:

For those of you who fiddle with Python at home, what does your dev environment look like? Do you have Python installed directly to your desktop? Utilizing containers, or a remote environment?

Pyenv for versions and poetry for dependencies per project.

12 rats tied together
Sep 7, 2006

pyenv + pyenv-alias plugin for isolated installs per project

QuarkJets
Sep 8, 2008

Hed posted:

I just install system Python on Windows or Linux and then one virtualenv for each one.

At the risk of starting a holy war, I just never got into the condas or whatever. Probably because I didn’t come from the data science / analysis world, and so I just do it the way I’m comfortable with.

They're just another kind of environment creation: it's venv plus the ability to bring in additional libraries, including libraries that aren't Python packages. If you're comfortable with venv or pyenv, then conda will feel very familiar.

ComradePyro
Oct 6, 2009
As a total moron, I found venv easier to use when I got started because I ran into it a lot in documentation for other things I was using. I have yet to understand Condas and would prefer not to, as again I am a total moron and barely understand what I'm doing with venv.

The Fool
Oct 16, 2003


tangentially related, is there a good way to generate a requirements.txt that only has what is being imported in my working directory and not everything in my entire environment

CarForumPoster
Jun 26, 2013

⚡POWER⚡

The Fool posted:

tangentially related, is there a good way to generate a requirements.txt that only has what is being imported in my working directory and not everything in my entire environment

Requirements.txt is usually made by pip freeze or by hand as you install stuff, so you can maintain ordering and versions. What’s being imported doesn’t necessarily have 1:1 naming with the package name on pip.

So maybe, but there are likely going to be a lot of issues with that a year or two from now if the environment needs to get reinstalled.
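
A couple of well-known examples of the mismatch, for illustration:

Python code:
# Import names vs. the PyPI distribution names you'd actually put in requirements.txt
import cv2      # pip install opencv-python
import yaml     # pip install PyYAML
import sklearn  # pip install scikit-learn
import PIL      # pip install Pillow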

necrotic
Aug 2, 2005
I owe my brother big time for this!

The Fool posted:

tangentially related, is there a good way to generate a requirements.txt that only has what is being imported in my working directory and not everything in my entire environment

pip freeze doesn’t look at your imports, just what’s installed. The point is to have _every_ dependency pinned.

The Fool
Oct 16, 2003


necrotic posted:

pip freeze doesn’t look at your imports, just what’s installed. The point is to have _every_ dependency pinned.

I understand that, which is the entire point of my question.

The Fool
Oct 16, 2003


CarForumPoster posted:

Requirements.txt is usually made by pip freeze or by hand as you install stuff, so you can maintain ordering and versions. What’s being imported doesn’t necessarily have 1:1 naming with the package name on pip.

So maybe, but there are likely going to be a lot of issues with that a year or two from now if the environment needs to get reinstalled.

My issue is that my local workstation has a million unrelated things installed and I never did a good job of managing venvs.

So when I write a thing to be deployed and need a requirements.txt I have to do it by hand right now and it's annoying.

Son of Thunderbeast
Sep 21, 2002
I use pipreqs

necrotic
Aug 2, 2005
I owe my brother big time for this!

The Fool posted:

My issue is that my local workstation has a million unrelated things installed and I never did a good job of managing venvs.

So when I write a thing to be deployed and need a requirements.txt I have to do it by hand right now and it's annoying.

The pipreqs suggestion seems like what you want. But I recommend using a proper venv at some point. You could even use that as step one: make a new venv, then pip install, and bang, now you have a venv with just what your project needs.

ComradePyro
Oct 6, 2009
My boss has one giant folder that he opens in VS Code for everything (dozens of subfolders), never uses venv, and complains about it regularly.

I, a complete moron, continue to feel good about spending too much time failing to adequately understand venv. I may not know what I'm doing, but that's all the more reason to try to keep my bullshit contained.

The Iron Rose
May 12, 2012

:minnie: Cat Army :minnie:
I have a Python library for OpenTelemetry that does some specific things for our business - mainly an overridden class and a few static methods for developers.

I’d like to generate documentation for this library, which I intend to keep in the GitLab repo where the library lives. Is there any reason why I shouldn’t just use pdoc in my CI pipeline, or in a git commit hook, to automatically generate the documentation from docstrings?
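
For context, the library is mostly small documented helpers, roughly this shape (invented example, not the real code) - docstrings plus type hints are what I'd want pdoc to render:

Python code:
class SpanEnricher:
    """Adds business-specific attributes to OpenTelemetry spans before export."""

    @staticmethod
    def normalize_service_name(name: str) -> str:
        """Lowercase and strip a service name so dashboards group it consistently."""
        return name.strip().lower()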

Hadlock
Nov 9, 2004

I found several bizarre packages in our requirements.txt, and installing them every time in a Docker container is time-consuming.

It looks like every time someone goes looking for unused packages, they decide to reinvent the wheel and write their own script. I keep stumbling across these when searching.

It's a Django project, not that it matters.

Is there a bog-standard way to do this, or am I really gonna have to grind this out?

Data Graham
Dec 28, 2009

📈📊🍪😋



Grind it out imo. And in the future don't pip install anything without adding it to requirements.txt at the same time.

Use venv and make your requirements.txt as sparse as you can get away with, while pinning everything you do specify. Or use pipenv and do a lockfile to pin everything, including all dependencies that get installed incidentally.

(these are my opinions/habits and it's what works for me, I defer to almost anyone else in this thread if they have a better practice to promote)

rich thick and creamy
May 23, 2005

To whip it, Whip it good
Pillbug

The Fool posted:

My issue is that my local workstation has a million unrelated things installed and I never did a good job of managing venvs.

So when I write a thing to be deployed and need a requirements.txt I have to do it by hand right now and it's annoying.

Have you given Poetry a whirl? I started playing around with it a few months ago. It can build a venv for your project and keeps track of dependencies in a .toml file. I haven't stumbled on any huge annoyances just yet.

OnceIWasAnOstrich
Jul 22, 2006

rich thick and creamy posted:

Have you given Poetry a whirl? I started playing around with it a few months ago. It can build a venv for your project and keeps track of dependencies in a .toml file. I haven't stumbled on any huge annoyances just yet.

I second this. If you are developing a package, library or executable, and it needs to be in an environment with its correct dependencies, imo poetry is the best thing to use right now.

It keeps your dependencies nice and tidy and also generates a lock file to reproduce an exact environment, depending on your needs. Just makes it easy to do things properly the modern way.

Have separate dev dependencies? It's got you covered. Devs get those, but someone just pip installing the package doesn't.

LightRailTycoon
Mar 24, 2017

Hadlock posted:

I found several bizarre packages in our requirements.txt, and installing them every time in a Docker container is time-consuming.

It looks like every time someone goes looking for unused packages, they decide to reinvent the wheel and write their own script. I keep stumbling across these when searching.

It's a Django project, not that it matters.

Is there a bog-standard way to do this, or am I really gonna have to grind this out?

I use https://pypi.org/project/deptry/ with poetry

Hadlock
Nov 9, 2004

Yeah we're using poetry

Can I ask poetry where "rust 0.1.1" is being pulled in from? I thought poetry was just a dependency engine that requires two additional files in the root directory, with poor backwards compatibility.

It (rust) is apparently used for parsing, uh, ribosomes, possibly for DNA-tangential research. We are definitely not in the business of doing that, so it seems unlikely we actually need it.

Edit: I'm coming at this from the angle of optimizing the build time/process. I don't own or work on this repo beyond config tweaks.

Hadlock fucked around with this message at 23:41 on Jan 24, 2024
