CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

pmchem posted:

anaconda 2022.05 was released earlier this week ... native support for m1 macs is the highlight

Hurray! Been waiting for that. Although I wonder how nicely that is going to play with some of the libraries that use executables that haven't been ported over yet (e.g., opencv).

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
I know this isn't really helpful for solving your problem, but why is anything that ends up in a Tableau dashboard starting out in spreadsheets? :negative:

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Necrobama posted:

3 - take the data IN the csv files and INSERT into the created table

...

but as I sit and plan out the implementation of item 3, I'm wondering if there's a better tool for the job than python for taking anywhere from a couple thousand to half a million or more lines.

Since you're already loading the CSVs into memory as a pandas dataframe to create your DDL, you can use that same dataframe to load the data; loading from a dataframe is normally pretty straightforward. I have some high-performance code snippets for this that I can share if you're using postgres (the catch is that each db has its own high-performance loading method).
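
Something like this is the shape of the postgres version, as a minimal sketch assuming psycopg2 and a dataframe whose columns already match the target table (the table name and connection details are placeholders):

code:
import io

import psycopg2


def copy_dataframe(df, table_name, conn_params):
    """Stream a pandas dataframe into Postgres via COPY, which is much
    faster than row-by-row INSERTs once you're past a few thousand rows."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)  # serialize the frame as CSV in memory
    buf.seek(0)
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            # COPY ... FROM STDIN streams the whole buffer in one pass
            cur.copy_expert(
                f"COPY {table_name} FROM STDIN WITH (FORMAT csv)",
                buf,
            )

Other databases have their own bulk-load paths, which is why these snippets don't transfer cleanly between engines.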

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
Yeah, modern query planners are a joy to behold. The lowly database is an often-overlooked modern marvel. I love digging into database internals.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
I know this is the python thread not the sql one, but in case anyone wants to deep dive on how modern dbs work, this is an amazing lecture series: https://www.youtube.com/playlist?list=PLSE8ODhjZXjYzlLMbX3cR0sxWnRM7CLFn

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Oysters Autobio posted:

I'm helping out a data analysis team adopt python and jupyter notebooks for their work rather than Excel or Tableau or whatever.

What's wrong with Tableau for data analysts? I understand wanting to be able to explore the data in python, but visualizations generally look better in Tableau and they're more interactive. Seems like a confusing move unless leadership is trying to move them more into data science type work. If they're just using python to clean data, shouldn't that be the job of the data engineers?

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Cyril Sneer posted:

Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram:

https://ibb.co/Fgy2f26

I'm working on a project to extract data from a bunch of production-related excel files. Individual files consist of two sheets - a cover sheet and a report sheet. The cover sheet has certain fields whose values I extract and the report sheet contains tabular data records. This tabular data gets extracted, possibly cleaned, then merged with the cover fields.

The blocks in the black circles can be considered stable/fixed, meaning the same code works for all file types. The red circles represent places where the code may vary. For example, for some file types, the clean block has to have a few lines of code to deal with merged cells.

We can think of there being 3 file types. FileTypeA and FileTypeB require the same processing steps, with only certain options in a configuration dictionary that need changing (column names, desired fields, that sort of thing). However they are different datasets and should be separately aggregated. A 3rd file type, FileTypeC, requires some different processing in the Clean module.

Normal classes at first pass seem like an obvious solution. I can define standard behaviors for those 5 blocks, and aggregate the results to each class instance. Then, I can subclass the blocks when/if needed (i.e., to handle FileTypeC). The thing that doesn't sit well with me here is that none of these blocks actually require any state information. They can all be standalone functions. This was partially why I explored the singleton approach.

This is a data engineering task. Generally, we don't go nuts with classes or OOP-style approaches when writing data pipelines, especially for a one-off task. The most recent code from QuarkJets using dictionary dispatch is closer to what I'd write for this than the other class-based proposals.

If I were handed this task, I'd write a function for each step and then an overall pipeline function that strings them together, passing data from one step to the next. To give some background, data pipelines generally have three discrete steps: extract, transform, and load. In your case, that means you'd start by extracting the data from the excel files into an in-memory object. Then you'd apply whatever cleaning, transformations, and/or formatting changes are needed for the load step. Finally, you'd load it into the destination, which usually means writing it to a table or loading it into s3.
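
To make that concrete, here's a rough sketch of the function-per-step shape, assuming pandas (plus openpyxl for the excel reads); the sheet names, fields, and paths are made up for illustration rather than taken from your files:

code:
import pandas as pd


def extract(path: str) -> tuple[dict, pd.DataFrame]:
    """Pull the cover fields and the tabular report data out of one workbook."""
    cover = pd.read_excel(path, sheet_name="Cover", header=None)
    report = pd.read_excel(path, sheet_name="Report")
    cover_fields = {"source_file": path, "batch": cover.iat[0, 1]}
    return cover_fields, report


def transform(cover_fields: dict, report: pd.DataFrame) -> pd.DataFrame:
    """Clean the report rows and stamp the cover fields onto every row."""
    cleaned = report.dropna(how="all")
    return cleaned.assign(**cover_fields)


def load(df: pd.DataFrame, destination: str) -> None:
    """Write the result out; in practice this would be a table or s3."""
    df.to_csv(destination, index=False)


def pipeline(path: str, destination: str) -> None:
    cover_fields, report = extract(path)
    load(transform(cover_fields, report), destination)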

Your diagram in this post covers the first two steps, and from prior posts it seems your load step is unioning together the cleaned dataframes from each file into a single dataframe per type. I'm not sure what you're ultimately doing with those giant dataframes, though. If you're loading them into a table, then you don't need to combine the dataframes first; you can just append the rows from each dataframe to the table directly as you process them.

For handling the various transforms you have to do depending on the base data, I think the suggestions so far overcomplicate things unless the different transforms have basically no overlap. If they mostly overlap but, say, some types skip a step while others add one, then I'd just gate those steps within the transform function behind conditionals that test values passed into the function, rather than breaking each transform "type" out into its own separate function.
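
As a sketch of what I mean by gating, with a hypothetical column map and a stand-in for the merged-cell cleanup:

code:
import pandas as pd

# Hypothetical config: which column renames apply to which file type
COLUMN_MAP = {
    "FileTypeA": {"Qty": "quantity"},
    "FileTypeB": {"Quantity": "quantity"},
}


def transform(report: pd.DataFrame, file_type: str) -> pd.DataFrame:
    df = report.dropna(how="all")  # shared step for every file type
    if file_type == "FileTypeC":
        df = df.ffill()  # stand-in for the extra merged-cell cleanup
    if file_type in COLUMN_MAP:
        df = df.rename(columns=COLUMN_MAP[file_type])  # config-driven renames
    return df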

CompeAnansi fucked around with this message at 23:07 on Aug 16, 2023

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

WHERE MY HAT IS AT posted:

If I can keep making this the ETL thread for a little longer, what would you all suggest for a totally greenfield project?

I do some contracting for my wife’s company, and inherited a mishmash of make.com no-code projects, single-purpose API endpoints in Heroku, and Airtable automations which sync data to Google Sheets.

They’re starting to hit a scale where this is a problem because things are unreliable, go out of sync, certain data exists only in some data stores, etc. The goal is to get them set up with a proper data store that can act as a single source of truth, and an actual ETL platform where I can write real code or use premade connectors to shift data around.

I did a bit of digging and something like Airbyte + Databricks looks nice, but maybe it’s overkill for what they need? Think thousands of rows a day rather than millions, and they only want to be able to do dashboarding and ad-hoc querying in Tableau. Would I regret just doing a managed Airflow and a Postgres instance at this point? I don’t want to have to redo everything in a year or two.

My company has data on a similar scale. I run self-hosted Airflow on a small VM and a managed postgres instance as the "warehouse". The pipelines are python, using Polars dataframes for lazy evaluation and for the rare larger-than-memory job. Airflow orchestrates those pipelines to sync data from all our sources (production db, various api sources), then executes dbt jobs by tag (depending on the cadence: hourly jobs or daily jobs) to transform the data. Then we hook Tableau up to the derived dbt tables. It works great.
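
For a sense of what the orchestration looks like, here's a minimal sketch of a DAG that runs a sync script and then dbt by tag (assuming Airflow 2.4+ for the schedule argument; the paths and tag names are placeholders):

code:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hourly_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Pull fresh data from the sources into the warehouse
    sync_sources = BashOperator(
        task_id="sync_sources",
        bash_command="python /opt/pipelines/sync_sources.py",
    )
    # Then build the derived tables tagged as hourly in the dbt project
    run_dbt = BashOperator(
        task_id="run_dbt_hourly",
        bash_command="cd /opt/dbt_project && dbt run --select tag:hourly",
    )
    sync_sources >> run_dbt

The daily cadence is just a second DAG with schedule="@daily" and --select tag:daily.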

I'll be honest and say that there are two reasons I used Airflow over dagster/prefect/mage:
(1) A good reason: there is better support, in that you can find more articles, wikis, blogs, etc. on Airflow than on the newer alternatives.
(2) A selfish reason: I wanted deep experience with Airflow on my resume.
If I were contracting, I'd seriously consider just using dagster instead, since it seems easier to use all things considered.

For self-hosted vs managed, it really depends on the budget. Since it sounds like they won't have anyone full-time on the data engineering side, that leans heavily towards managed for everything if the budget allows. We go largely self-hosted because it's cheaper, given that they're paying me anyway.

I'd lean against spark in this context. Polars can do everything you need when you're dealing with thousands of rows a day. Plus, if you're using dbt, then all the actual computation for derived tables is being done on the database instance anyway.

Speaking of the database instance: if you find that the dbt jobs are really slow, you might need a columnar database instead of postgres. The main issue there is that if you want a managed columnar database, rather than self-hosting something like doris/starrocks, things get really expensive fast, because you're usually looking at snowflake, bigquery, etc.

CompeAnansi fucked around with this message at 04:26 on Dec 28, 2023

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

monochromagic posted:

We should maybe consider starting a data engineering thread.

I would happily contribute to such a thread if someone starts it.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

QuarkJets posted:

The Science subforum has separate threads for data science and numerical analysis, I think data engineering falls under both of those

Disagree. It is its own thing that touches on both.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Oysters Autobio posted:

Is there a particular 'style' of python that leans towards tables and similar relational data rather than slinging everything around in key:value pairs, lists, etc.? I've been finding it hard to get imperative styles to click in my head when I'm so used to how "connected" all the data is within a SQL statement, for example. Like, is there a way to manage data in a python script by basically creating dimensional tables of the data you need and joining them as needed?

(I do tend to lean towards functional styles of python when I'm straight up transforming / cleaning data, but in this case I mean within the context of OOP / imperative styles for local scripts and such).

What you want are dataframes. Pandas/Polars/Spark all offer tabular data structures that mirror database tables, and you can perform all the same kinds of transformations using their API methods (.groupby, .sum, joins, etc.). Polars and Spark even offer a SQL context that lets you just write sql against a dataframe. That said, Spark is almost certainly more overhead than you need. DuckDB, as mentioned above, is also a good option when you want to write sql against a dataframe.
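
For example, here's a quick Polars sketch of both styles; the tables and columns are made up, and the group_by spelling assumes a recent Polars version:

code:
import polars as pl

orders = pl.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})
customers = pl.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Grace"]})

# Relational-style work through the dataframe API: join, then group and aggregate
totals = (
    orders.join(customers, on="customer_id", how="inner")
    .group_by("name")
    .agg(pl.col("amount").sum().alias("total"))
)

# Or register the frames as named tables and just write sql against them
ctx = pl.SQLContext(orders=orders, customers=customers)
totals_sql = ctx.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.name
    """,
    eager=True,
)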
