|
pmchem posted:anaconda 2022.05 was released earlier this week ... native support for m1 macs is the highlight Hurray! Been waiting for that. Although I wonder how nicely that is going to play with some of the libraries that use executables that haven't been ported over yet (e.g., opencv).
|
# ¿ May 16, 2022 03:13 |
|
I know this isn't really helpful for solving your problem, but why is anything that ends up in a Tableau dashboard starting out in spreadsheets?
|
# ¿ Mar 22, 2023 03:09 |
|
Necrobama posted:3 - take the data IN the csv files and INSERT into the created table Since you're already loading the CSVs into memory via a pandas dataframe to create your DDL, you can use that same dataframe to load the data; loading from a dataframe is normally pretty straightforward. I have some high-performance code snippets for this that I can share if you're using postgres (the catch is that each db has its own high-performance loading method). A sketch of the postgres version is below.
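
For reference, the fast postgres path is COPY from an in-memory buffer rather than row-by-row INSERTs. Here's a minimal sketch using psycopg2; the function name, table name, and connection string are placeholders, not anything from the original snippets:

```python
import io

import pandas as pd
import psycopg2


def copy_df_to_postgres(df: pd.DataFrame, table: str, conn) -> None:
    """Bulk-load a dataframe via COPY, which is far faster than executemany INSERTs."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)  # serialize the frame as CSV in memory
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()


# Usage (placeholder DSN and table):
# conn = psycopg2.connect("dbname=mydb user=me")
# copy_df_to_postgres(df, "my_table", conn)
```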
|
# ¿ Mar 30, 2023 22:31 |
|
Yeah, modern query planners are a joy to behold. The lowly database is an often overlooked modern marvel. Love digging into database internals.
|
# ¿ Jul 25, 2023 18:04 |
|
I know this is the python thread not the sql one, but in case anyone wants to deep dive on how modern dbs work, this is an amazing lecture series: https://www.youtube.com/playlist?list=PLSE8ODhjZXjYzlLMbX3cR0sxWnRM7CLFn
|
# ¿ Jul 26, 2023 00:41 |
|
Oysters Autobio posted:I'm helping a data analysis team adopt python and jupyter notebooks for their work rather than Excel or Tableau or whatever. What's wrong with Tableau for data analysts? I understand maybe wanting to be able to explore the data in python, but visualizations generally look better in Tableau and they're more interactive. Seems like a confusing move unless leadership is trying to move them more into data science type work. If they're just using python for cleaning data, shouldn't that be the job of the data engineers?
|
# ¿ Aug 1, 2023 17:27 |
|
Cyril Sneer posted:Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram:

This is a data engineering task. Generally, we don't go nuts with classes or OOP-style approaches when writing data pipelines, especially for a one-off task. The most recent code from QuarkJets using dictionary dispatch is closer to what I'd write for this than the other class-based proposals. If I were handed this task, I'd write a function for each step and then an overall pipeline function that strings them together, passing data from one step to the next.

For some background: data pipelines generally have three discrete steps: extract, transform, and load. For your case, that means you'd start by extracting the data from the excel files into an in-memory object. Then you'd do whatever cleaning procedures, transformations, and/or formatting changes are needed for the load step. Finally, you'd load the result into the destination, which usually means writing it to a table or loading it into s3. Your diagram in this post covers the first two steps, and from prior posts it seems your load step is unioning the cleaned dataframes from each file into a single dataframe per type. Not sure what you're ultimately doing with those giant dataframes though... If you're loading them into a table, you don't need to combine the dataframes first; you can append the rows from each dataframe to the table directly as you process them.

For handling the various kinds of transforms you have to do depending on the base data, I think the suggestions so far overcomplicate things unless the different transforms have basically no overlap. If they overlap but, say, some types skip some steps while others add some, then I'd just gate those steps within the transform function behind conditionals that test values passed into the function, rather than breaking each transform "type" out into its own separate function. A sketch of that shape is below.
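
To make that concrete, here's a minimal sketch of what I mean; the step contents, the file-type flag, the table name, and the DSN are all hypothetical stand-ins for your actual logic:

```python
import pandas as pd
from sqlalchemy import create_engine


def extract(path: str) -> pd.DataFrame:
    # Pull one source workbook into memory.
    return pd.read_excel(path)


def transform(df: pd.DataFrame, file_type: str) -> pd.DataFrame:
    # Steps shared by every file type.
    df = df.dropna(how="all")
    # Type-specific steps gated behind conditionals rather than separate functions.
    if file_type == "type_a":
        df = df.rename(columns=str.lower)
    return df


def load(df: pd.DataFrame, engine, table: str) -> None:
    # Append each file's rows directly to the destination table;
    # no need to union everything into one giant dataframe first.
    df.to_sql(table, engine, if_exists="append", index=False)


def run_pipeline(files: list[tuple[str, str]], engine, table: str) -> None:
    # files is a list of (path, file_type) pairs.
    for path, file_type in files:
        load(transform(extract(path), file_type), engine, table)


# engine = create_engine("postgresql://user:pass@host/db")  # placeholder DSN
# run_pipeline([("report_a.xlsx", "type_a")], engine, "cleaned_data")
```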
# ¿ Aug 15, 2023 22:04 |
|
WHERE MY HAT IS AT posted:If I can keep making this the ETL thread for a little longer, what would you all suggest for a totally greenfield project?

My company has data on a similar scale. I do self-hosted airflow on a small VM and a managed postgres instance for the "warehouse". I use python pipelines, which use polars for dataframes for lazy evaluation and to handle the rare cases where a job is larger than memory. Those pipelines are orchestrated by Airflow to sync data from all our sources (production db, various api sources); then airflow executes dbt jobs by tag (depending on the cadence: hourly jobs or daily jobs) to transform the data. Then we hook Tableau up to the derived dbt tables. It works great. A sketch of the orchestration pattern is below.

I'll be honest and say there are two reasons I chose Airflow over dagster/prefect/mage: (1) a good reason: there is better support, in that you can find more articles, wikis, blogs, etc. on airflow than on the newer alternatives; and (2) a selfish reason: I wanted deep experience with Airflow on my resume. If I were contracting, I'd seriously consider just using dagster instead since it seems easier to use, all things considered.

For self-hosted vs managed, it really depends on the budget. Since it sounds like they won't have anyone full-time on the data engineering side, that leans heavily towards managed for everything, if it's within the budget. We go largely self-hosted because it's cheaper given that they're paying me anyway.

I'd lean against spark in this context. Polars can do everything you need when you're dealing with thousands of rows a day. Plus, if you're using dbt then all the actual computation for derived tables happens on the database instance anyway. Speaking of the database instance, if you find the dbt jobs are really slow, you might need a columnar database instead of postgres. The main issue there is that if you want a managed columnar database, rather than self-hosting something like doris/starrocks, things get really expensive fast because you're usually looking at snowflake, bigquery, etc.
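
For what it's worth, the airflow-runs-dbt-by-tag pattern looks roughly like this. The dag id, schedule, sync script, and project paths are hypothetical, and this assumes Airflow 2.x; it's a sketch of the shape, not my production DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: sync the sources first, then run the dbt models tagged "hourly".
with DAG(
    dag_id="hourly_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # Airflow 2.4+ spelling; older 2.x uses schedule_interval
    catchup=False,
) as dag:
    sync_sources = BashOperator(
        task_id="sync_sources",
        bash_command="python /opt/pipelines/sync_all.py",  # placeholder sync script
    )
    run_dbt_hourly = BashOperator(
        task_id="run_dbt_hourly",
        bash_command="dbt run --select tag:hourly --project-dir /opt/dbt",
    )
    sync_sources >> run_dbt_hourly
```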
# ¿ Dec 28, 2023 04:16 |
|
monochromagic posted:We should maybe consider starting a data engineering thread. I would happily contribute to such a thread if someone starts it.
|
# ¿ Dec 28, 2023 20:02 |
|
QuarkJets posted:The Science subforum has separate threads for data science and numerical analysis, I think data engineering falls under both of those Disagree. It is its own thing that touches on both.
|
# ¿ Dec 29, 2023 05:43 |
|
Oysters Autobio posted:Is there a particular 'style' of python that leans towards tables and similar relational data rather than slinging everything around in key:value pairs, lists, etc.? I've been finding it hard to get imperative styles to click in my head when I'm so used to how "connected" all the data is within a SQL statement, for example. Like, is there a way to manage data in a python script, basically creating dimensional tables of the data you need and joining them as needed?

What you want are dataframes. Pandas/Polars/Spark all offer tabular data formats that mirror database tables. You can perform all the same kinds of transformations using their API methods (.groupby, .sum, joins, etc.). Polars and Spark even offer a SQL context that lets you just write sql against a dataframe. That said, Spark is almost certainly more overhead than you need. DuckDB, as mentioned above, is also a good option when you want to write sql against a dataframe. A quick sketch of both approaches is below.
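
To show what "dimensional tables you join as needed" looks like in practice, here's a minimal sketch with polars and duckdb; the tables and column names are made up for illustration:

```python
import duckdb
import polars as pl

orders = pl.DataFrame({"customer_id": [1, 2, 1], "amount": [10.0, 25.0, 5.0]})
customers = pl.DataFrame({"customer_id": [1, 2], "name": ["alice", "bob"]})

# Dimensional-style join plus aggregation through the dataframe API.
# (Recent polars calls it .group_by; older releases used .groupby.)
totals = (
    orders.join(customers, on="customer_id")
    .group_by("name")
    .agg(pl.col("amount").sum().alias("total"))
)

# Or register the frames with polars' SQLContext and just write sql.
ctx = pl.SQLContext(orders=orders, customers=customers)
totals_sql = ctx.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.name
    """,
    eager=True,
)

# DuckDB can query in-scope dataframes by variable name directly.
totals_duck = duckdb.sql(
    "SELECT name, SUM(amount) AS total "
    "FROM orders JOIN customers ON orders.customer_id = customers.customer_id "
    "GROUP BY name"
).pl()
```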
|
# ¿ Mar 23, 2024 15:34 |