monochromagic
Jun 17, 2023

Following separate discussions in both the Python and SQL threads, here it is - the data engineering thread for people who like just constantly having problems, always, forever.
Let's go.



What is data engineering?
Great question!!! The definitions vary and data engineering may be many things at once depending on context.
Personally I describe it as software engineering with a focus on analytical data pipelines, infrastructure and what have you. That is, no AI/ML/reporting stuff but rather making sure data is available for the smart people who need it, i.e. the data analysts/scientists.
However, depending on company size etc. etc. a data engineer might also be a data scientist or a data analyst might also be a data engineer. Finally, most backend engineers do data engineering to some degree when they design data models and so on.

How do I become one?
Here's a tip - don't! If you insist, there are usually two ways into data engineering - either you are already a data analyst/scientist and want to hone in on the engineering part, or you are a software engineer who wants to hone in on the data part. I fall in the latter category and before you ask, yes, I am in therapy. Currently there also seems to be some interest from educational institutions in offering data engineering specialties. Exciting stuff!

Seventh Arrow posted:

Also if you're going to get into data engineering then it'd probably be a good idea to get a certification for one of the three big cloud platforms - Amazon AWS, Microsoft Azure, or Google GCP.

However don't let job interviewers intimidate you if you have AWS and they're an Azure house. It's all the same stuff with different labels slapped on, and you need to show them that.

Resources
  • The Python and SQL threads mentioned are very good and you should bookmark them
  • Generally I'd say methods over tools, but dbt is pretty ubiquitous in the industry
  • All cloud providers have OLAP databases available, pick your poison
  • If you are not in the cloud DuckDB is very nice and fast and stuff (quick example right below this list)
  • Orchestration of pipelines is very necessary - you will read about Airflow in this thread, but please look at ANY alternative if you're starting a new project
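
As promised in the list above, a quick hypothetical taste of DuckDB - query local files with plain SQL, no server needed (file and table names made up):
code:
# Hypothetical example: turn a local CSV into a queryable DuckDB table.
import duckdb

con = duckdb.connect("local.duckdb")  # or duckdb.connect() for an in-memory database
con.execute("""
    CREATE OR REPLACE TABLE orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")
print(con.execute("SELECT count(*) FROM orders").fetchone())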

Hughmoris posted:

For those interested in checking out the Microsoft flavor of Data Engineering:

You can create a free Microsoft 365 Developer Sandbox. This gives you free access to the Microsoft Fabric trial, which is Microsoft's next push for all things data. So you can spin up data lakes, warehouses, Spark clusters etc... No credit card needed for any of it, so you can play around without fear of racking up bills.


Tutorials/educational

monochromagic fucked around with this message at 20:25 on Dec 29, 2023


Seventh Arrow
Jan 26, 2005

Also if you're going to get into data engineering then it'd probably be a good idea to get a certification for one of the three big cloud platforms - Amazon AWS, Microsoft Azure, or Google GCP.

However don't let job interviewers intimidate you if you have AWS and they're an Azure house. It's all the same stuff with different labels slapped on, and you need to show them that.

pmchem
Jan 22, 2010


what do you recommend if you need to merge partially overlapping data from people each using different dbs (oracle, postgres, excel) and different schema into one set of pandas dataframes

monochromagic
Jun 17, 2023

pmchem posted:

what do you recommend if you need to merge partially overlapping data from people each using different dbs (oracle, postgres, excel) and different schema into one set of pandas dataframes

Generally I'd recommend building data models for each source and then slapping on whatever logic is necessary to convert them to the set of dataframes. I'm assuming Python due to the use of pandas, and for data modelling Pydantic is always my default recommendation. 2.0 has a Rust backend and is supposedly very fast, although I haven't had the chance to play around with it personally yet (stuck with 1.10 due to our managed Airflow :argh:)
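
To sketch the idea with a made-up example (model and field names invented; assumes Pydantic and pandas) - one model per source, then map everything onto a common shape before building the dataframe:
code:
# Rough sketch: one Pydantic model per source, normalised into a common schema.
# Model and field names are made up for illustration.
from datetime import date
from typing import Optional

import pandas as pd
from pydantic import BaseModel

class OracleCustomer(BaseModel):
    CUST_ID: int
    CUST_NAME: str
    SIGNUP_DT: Optional[date] = None

class PostgresCustomer(BaseModel):
    id: int
    name: str
    signed_up: Optional[date] = None

def to_common(rec) -> dict:
    # map each source model onto one agreed-upon schema
    if isinstance(rec, OracleCustomer):
        return {"customer_id": rec.CUST_ID, "name": rec.CUST_NAME, "signup": rec.SIGNUP_DT}
    return {"customer_id": rec.id, "name": rec.name, "signup": rec.signed_up}

oracle_rows = [OracleCustomer(CUST_ID=1, CUST_NAME="Acme", SIGNUP_DT="2023-05-01")]
pg_rows = [PostgresCustomer(id=2, name="Initech", signed_up="2023-06-12")]

df = pd.DataFrame([to_common(r) for r in oracle_rows + pg_rows])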

pmchem
Jan 22, 2010


I'll check it out. imo add tutorial / educational links to the OP for people having to do more work in this area or looking to change fields? also I know a lot is done in the cloud but sometimes cloud is not an option so including both cloud and non-cloud tutorials would be A++

good luck with the new thread!

monochromagic
Jun 17, 2023

That is a great idea - I'll see what I can find and otherwise people are more than welcome to post great tutorials and I'll edit them in.

e: added two, one with a bit of BigQuery involvement and one without cloud tech.

monochromagic fucked around with this message at 15:44 on Dec 29, 2023

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

Seventh Arrow posted:

However don't let job interviewers intimidate you if you have AWS and they're an Azure house. It's all the same stuff with different labels slapped on, and you need to show them that.

That's true even if you're coming from an on-prem background! There the biggest change is infra. It's really nice to be able to allocate your own resources via Terraform rather than depending on another team, for example.

Once you get to the point of actually doing work there's a ton of overlap: k8s is still k8s, Pyspark is still Pyspark.

Seventh Arrow
Jan 26, 2005

If there are two things you need to have a lock on in the DE space, they're Python and SQL. Here are some Python learning resources I posted in another thread:

Our humble SA thread

https://forums.somethingawful.com/showthread.php?threadid=3812541

Reddit

https://www.reddit.com/r/inventwithpython/
https://www.reddit.com/r/learnpython/
https://www.reddit.com/r/Python/
https://www.reddit.com/r/PythonNoobs/

Books

Python Crash Course
Automate the Boring Stuff with Python
Think Python

Youtube

Indently
Tech With Tim
mCoding
ArjanCodes
Python Simplified

Online Learning

Codecademy
Real Python
The Official Guide
DataCamp

Discussion

Official Python IRC
Discord

Brain-busters

LeetCode
Practice Python
HackerRank
Edabit

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.
For folks who work somewhere with a clear distinction between data science and data engineering, what are some strategies that have helped make the handoff of projects/code between roles less painful?

At my company, data science comes up with an approach for something (a new data asset, an ML model, whatever). Then engineering builds that out into a full application, handles automation, sets up CI/CD, makes sure everything scales, etc. We've been struggling with inconsistencies in how that work gets delivered from our data scientists. Some of them are really technical and awesome. They'll put PRs in on the repos and have tests for their new code. Others are sending us links to unversioned notebooks filled with print statements and no guidance for how you actually run these things and in what order.

I'd be interested in hearing about anyone's social or technical solutions to these kinds of problems. We've talked about a few things but haven't gone too far down any particular road yet.

monochromagic
Jun 17, 2023

BAD AT STUFF posted:

For folks who work somewhere with a clear distinction between data science and data engineering, what are some strategies that have helped make the handoff of projects/code between roles less painful?

At my company, data science comes up with an approach for something (a new data asset, an ML model, whatever). Then engineering builds that out into a full application, handles automation, sets up CI/CD, makes sure everything scales, etc. We've been struggling with inconsistencies in how that work gets delivered from our data scientists. Some of them are really technical and awesome. They'll put PRs in on the repos and have tests for their new code. Others are sending us links to unversioned notebooks filled with print statements and no guidance for how you actually run these things and in what order.

I'd be interested in hearing about anyone's social or technical solutions to these kinds of problems. We've talked about a few things but haven't gone too far down any particular road yet.

Some thoughts - for context I have been the sole data engineer at my company (1500+ employees, ecommerce) for a year and just recently got a junior data engineer. We are also migrating from Airflow 1 to Airflow 2 and from Dataform to dbt while implementing Airbyte for EL, and 60% of the data team were hired within the last 6 months. So...

1) Standardise the way people hand over code, for me that would mean stop accepting notebooks. PRs are king and an onboarding goal of ours is that you have your first code merged within your first week.

2) Let the data scientists learn from each other. When our data scientists/analysts have e.g. dbt questions, we take a discussion and they will write some new documentation they can refer to in the future. In your case I would encourage the awesome technical data scientists to be part of standardising handovers.

3) I'm a bit curious - are you implementing custom CI/CD for each new project? This touches a bit on platform engineering but perhaps this could be standardised as well, e.g. prefer docker images or something similar that you then have a unified CI/CD process for. I'm trying to move us towards more docker and using the Kubernetes executor in Airflow for example - that's how we run dbt in prod.
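
To illustrate point 3 - this is not our exact setup, just a minimal hypothetical sketch of running dbt in its own container from Airflow via the KubernetesPodOperator (image, namespace and dbt arguments are placeholders; the import path moved to airflow.providers.cncf.kubernetes.operators.pod in newer provider versions):
code:
# Minimal sketch: run dbt as a containerised Airflow task on Kubernetes.
# Image, namespace and command arguments are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="dbt_run",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        namespace="data",                      # placeholder namespace
        image="ghcr.io/example/dbt:latest",    # placeholder dbt image
        cmds=["dbt"],
        arguments=["run", "--profiles-dir", "/app"],
        get_logs=True,
    )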

Hughmoris
Apr 21, 2007
Let's go to the abyss!
For those interested in checking out the Microsoft flavor of Data Engineering:

You can create a free Microsoft 365 Developer Sandbox. This gives you free access to the Microsoft Fabric trial, which is Microsoft's next push for all things data. So you can spin up data lakes, warehouses, Spark clusters etc... No credit card needed for any of it, so you can play around without fear of racking up bills.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

monochromagic posted:

Some thoughts - for context I have been the sole data engineer at my company (1500+ employees, ecommerce) for a year and just recently got a junior data engineer. We are also migrating from Airflow 1 to Airflow 2 and from Dataform to dbt while implementing Airbyte for EL, and 60% of the data team were hired within the last 6 months. So...

1) Standardise the way people hand over code, for me that would mean stop accepting notebooks. PRs are king and an onboarding goal of ours is that you have your first code merged within your first week.

2) Let the data scientists learn from each other. When our data scientists/analysts have e.g. dbt questions, we take a discussion and they will write some new documentation they can refer to in the future. In your case I would encourage the awesome technical data scientists to be part of standardising handovers.

3) I'm a bit curious - are you implementing custom CI/CD for each new project? This touches a bit on platform engineering but perhaps this could be standardised as well, e.g. prefer docker images or something similar that you then have a unified CI/CD process for. I'm trying to move us towards more docker and using the Kubernetes executor in Airflow for example - that's how we run dbt in prod.

I appreciate the thoughts. For reference, this is a team that typically has about 5 to 10 data engineers and 5 to 10 data scientists. We're a pretty isolated team within a much larger company. One of the reasons this is an issue is that our org structure is based around job function (science reports to science, engineering reports to engineering), rather than area of the business. The level of leadership that could be dictatorial about enforcing standards on everyone is far enough removed that it would cause political headaches if we escalated things. So I'm trying to come up with some ideas that can make things easier for everyone and that we can come to a consensus about.

1) I think I made us sound more dysfunctional than we actually are there, but that's why we can't go for the low hanging fruit of "every change must be a PR". We're doing that on the engineering side, but getting science to buy in is the harder part. We've been working on things like standardizing PR templates this past year. I'd also like to look into more sophisticated linting options for notebooks. Or anything else folks can think of to make better development patterns more appealing and easy to use.

2) We've been talking to some of the more technical data scientists already. It's going to take time to shift things, but I'm hopeful. Most of the time they already do a good job of giving us something usable. There have just been a couple of notable cases lately where what we got was unusually bad.

3) We've recently moved to Github Actions for CI/CD, and as part of that all of our pipelines are using standard components. However, there's a decent amount of variability in what the projects do. Some may just be a Python library. Some may be a library plus notebooks that need to get deployed to Databricks. Some might have a FastAPI service that needs to be deployed.

I think overall we're making our projects too large (often at a solution/product level rather than component level), which makes it harder to have complete CI/CD out of the box from the project template without needing to tweak anything. That's another area I'm looking to improve, but thankfully no, we aren't reinventing the wheel every time.

mortarr
Apr 28, 2005

frozen meat at high speed

monochromagic posted:

How do I become one?
Here's a tip - don't!

Help, I think I have been doing on-prem data engineering for the last like 20+ years without knowing that's what it was called?! Lots of SQL and shifting data between transactional and analytic platforms, with the odd app, report suite or data vis thrown in too.

I do like having problems all the time, I don't really like doing reports, and I've lived in on-prem MS-land since forever and never touched Python or done any cloud-type stuff. I've currently moved sideways into IT/ops consulting but am not enjoying it so much.

Very interested in this thread, my employer just got bought out by a much bigger player with a large data division but with no presence in my geo area, so hopeful I can wrangle something, and looking here for inspiration.

UU TEE EFF
Apr 10, 2005
mate?
I am a newbie to both programming and SQL in general. I can follow a guide and understand what I am doing, but that is about it. But I want to expand my skills a little in managing and analyzing a database. I think I want to do it by creating a local database on my Windows machine and then connecting to it with Power BI. I was able to follow the Python tutorial https://docs.python.org/3/library/sqlite3.html#tutorial to create an SQLite3 database. I then connected to it through Power BI. So that part is good.

Am I on the right track using SQLite with Python to learn this? The number of ways to do this is a little overwhelming for a newbie like myself, so I am not sure if I should change tack.

Next step, I want to create a new database with my CSV data files and start building the table relationships. Anyone know a good tutorial on that? Particularly, I want to connect to a directory of CSV files that are stored on my PC in an organized file structure. I see some guides for loading one file, but am not sure how to load a directory of files.

Seventh Arrow
Jan 26, 2005

The easiest way to do anything with tabular data in Python is to use pandas dataframes. Import your CSV into a dataframe, then export it to SQLite. Boom, done. There are a zillion pandas tutorials out there; it won't be hard if you search for things like "import csv into pandas dataframe" or "pandas dataframe export to database", etc.
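
For the directory part specifically, a rough sketch (paths and the database name are made up; uses pandas plus the standard-library sqlite3 module):
code:
# Rough sketch: load every CSV in a directory into pandas, then write each one
# to a table in a local SQLite database. Paths and names are made up.
import sqlite3
from pathlib import Path

import pandas as pd

csv_dir = Path("C:/data/csvs")            # your directory of CSV files
conn = sqlite3.connect("local_data.db")   # created next to the script if missing

for csv_path in csv_dir.glob("*.csv"):
    df = pd.read_csv(csv_path)
    # one table per file, named after the file
    df.to_sql(csv_path.stem, conn, if_exists="replace", index=False)

conn.close()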

Seventh Arrow
Jan 26, 2005

Also, when I originally had thoughts of creating a data engineering thread, the subtitle was going to be "do you want that in CSV, CSV, or CSV?"

QuarkJets
Sep 8, 2008

Seventh Arrow posted:

Also, when I originally had thoughts of creating a data engineering thread, the subtitle was going to be "do you want that in CSV, CSV, or CSV?"

My preferred format is hdf5 to be honest, I've converted a few groups over from csv because it rules and is way faster/better than csv. Changing isn't always an option of course, but I love encountering projects that use hdf5

Why use hdf5 instead of csv?
1. It's natively supported by both pandas and the excellent h5py and pytables libraries, among others. With h5py datasets can be read directly from a file as numpy arrays. The hdf5 file structure representation in python is basically a dictionary
2. It's a binary format so numerical data is stored more compactly and efficiently
3. Data can be given a structure that's more useful than simple rows/columns. Individual datasets can be any size, you can create what are basically folders with more datasets inside of them, you can assign attributes wherever you need them (e.g. to designate units on a dataset, to leave comments on a group/folder, to complain about the weather in the file header, whatever you want!)
4. Datasets have a defined type, so pandas doesn't need to be told what datatype to expect (or doesn't have to infer a type from the data in each column)
5. Compression is natively supported at the dataset level and is completely transparent. The default built-in algorithm is gzip but you can use whatever compression scheme you want. That column full of one million 0s ballooning the size of your csv compresses down to almost no space at all in a compressed hdf5 dataset
6. Data can be chunked for more efficient reading and writing
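
A small hypothetical example of a few of these points, using pandas and h5py (file, group and attribute names are made up; pandas' HDF5 support needs the pytables package installed):
code:
# Sketch: write a DataFrame to HDF5 with compression, then browse it with h5py.
# File, group and attribute names are made up for illustration.
import h5py
import numpy as np
import pandas as pd

df = pd.DataFrame({"sensor": np.zeros(1_000_000), "t": np.arange(1_000_000)})

# pandas writes HDF5 through pytables; compression is one flag away
df.to_hdf("readings.h5", key="experiment1/readings", complevel=9, complib="zlib")

with h5py.File("readings.h5", "a") as f:
    f["experiment1"].attrs["units"] = "volts"    # attach attributes wherever you like
    f.attrs["comment"] = "weather was terrible"  # file-level header comment
    print(list(f["experiment1"]))                # groups behave like dictionaries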

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

UU TEE EFF posted:

I am a newbie to both programming and SQL in general. I can follow a guide and understand what I am doing, but that is about it. But I want to expand my skills a little in managing and analyzing a database. I think I want to do it by creating a local database on my Windows machine and then connecting to it with Power BI. I was able to follow the Python tutorial https://docs.python.org/3/library/sqlite3.html#tutorial to create an SQLite3 database. I then connected to it through Power BI. So that part is good.

Am I on the right track using SQLite with Python to learn this? The number of ways to do this is a little overwhelming for a newbie like myself, so I am not sure if I should change tack.

Next step, I want to create a new database with my CSV data files and start building the table relationships. Anyone know a good tutorial on that? Particularly, I want to connect to a directory of CSV files that are stored on my PC in an organized file structure. I see some guides for loading one file, but am not sure how to load a directory of files.

It is very rare for anyone to just start as a junior data engineer with no prior experience, even with a CS degree. Most people start as backend engineers and move to DE because they like working with data, or they start as a data analyst and move to DE because their company does what lots of companies do and expect analysts to do both the DE work and the analyst work and they realized they like the DE side better. I'd feel pretty uncomfortable suggesting that someone just dive right in and try to get a job as a DE with no prior data/engineering roles.

Lib and let die
Aug 26, 2004

Maybe you all can help me put a label on the job I've done/job I am looking for, because I seem to be struggling in that department right now in my job search.

My last couple of roles were doing "data services" - updating, suppressing/deleting records, ETL for new client data intake from legacy systems, creating and processing duplicate reports, and creating custom reports to client spec. The actual position title I held, "Data Services Specialist", ends up showing me Data Engineer jobs, which to me makes sense - not being responsible for the design, maintenance, or troubleshooting of the database architecture itself, I always sort of just assumed I fell into the "engineer" bucket rather than the "administrator" bucket, making me a Data Engineer, since the scope of my work was restricted to the data itself. (I'm also competent enough with PHP/HTML to throw together a basic web interface that can query/write to a database, but I haven't seen a ton of overlap in job postings there; most full stack stuff wants experience with Ruby etc.)

From what I've been able to gather from job postings, what I've done in the past is more of a database administrator role, and if that's the case there's some upskilling I need to do on the creation/maintenance/policy admin side (or a lot of resume and interview bluffing to cover that blind spot while I try and actually catch up). But what with my lovely underemployed-after-getting-laid-off situation, if I'm going to spend money on a course I need to make sure it aligns with my expectations of a job role.

BAD AT STUFF
May 10, 2012

We choose to go to the moon in this decade and do the other things, not because they are easy, but because fuck you.

CompeAnansi posted:

It is very rare for anyone to just start as a junior data engineer with no prior experience, even with a CS degree. Most people start as backend engineers and move to DE because they like working with data, or they start as a data analyst and move to DE because their company does what lots of companies do and expect analysts to do both the DE work and the analyst work and they realized they like the DE side better. I'd feel pretty uncomfortable suggesting that someone just dive right in and try to get a job as a DE with no prior data/engineering roles.

I started as a Data Engineer after my undergrad, but my initial work was all porting WPS/SAS code to Pyspark. That only required Python and SQL skills, and the rest I picked up on the job. I agree that it's a hard thing to do coming out of school. We didn't have any classes that covered big data and distributed computing.

I think that you're right about backend being a good entry point. General SQL and data experience are more important than the specific flavor of SQLite/MySQL/Postgres. It's also good to think about the roles at the company if you're looking to make an internal move in the future. Working at a data focused company in a non-DE role is a decent place to start.

Lib and let die posted:

Maybe you all can help me put a label on the job I've done/job I am looking for, because I seem to be struggling in that department right now in my job search.

My last couple of roles were doing "data services" - updating, suppressing/deleting records, ETL for new client data intake from legacy systems, creating and processing duplicate reports, and creating custom reports to client spec. The actual position title I held, "Data Services Specialist", ends up showing me Data Engineer jobs, which to me makes sense - not being responsible for the design, maintenance, or troubleshooting of the database architecture itself, I always sort of just assumed I fell into the "engineer" bucket rather than the "administrator" bucket, making me a Data Engineer, since the scope of my work was restricted to the data itself. (I'm also competent enough with PHP/HTML to throw together a basic web interface that can query/write to a database, but I haven't seen a ton of overlap in job postings there; most full stack stuff wants experience with Ruby etc.)

From what I've been able to gather from job postings, what I've done in the past is more of a database administrator role, and if that's the case there's some upskilling I need to do on the creation/maintenance/policy admin side (or a lot of resume and interview bluffing to cover that blind spot while I try and actually catch up). But what with my lovely underemployed-after-getting-laid-off situation, if I'm going to spend money on a course I need to make sure it aligns with my expectations of a job role.

If your focus was on data quality issues and being informed about the data assets, that sounds like a role I've worked with before called a Data Solutions Analyst. They weren't a data scientist building new models or a data engineer creating new systems. But if we wanted to know about the contents of specific columns, or needed a report to answer a specific question (e.g. how many of X event did we have on the website last quarter?), then we'd go to the DSA.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
For those of you currently working as Data Engineers, what does your day-to-day look like? What technologies are you working with?

Seventh Arrow
Jan 26, 2005

Right now I'm working with the three technologies I hate the most: Salesforce, Informatica, and Oracle. However, the team is great so I don't mind. They're in the process of migrating Informatica to GCP, so I'm currently working on getting my Google cert. I've also let my python and SQL skills lag, so I need to pick that up again in case I have to do interviews when my contract is done.

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Seventh Arrow posted:

Right now I'm working with the three technologies I hate the most: Salesforce, Informatica, and Oracle. However, the team is great so I don't mind. They're in the process of migrating Informatica to GCP, so I'm currently working on getting my Google cert. I've also let my python and SQL skills lag, so I need to pick that up again in case I have to do interviews when my contract is done.

I heard that working with Oracle was an invitation to Planet Money. Is that not the case?

Maybe it was just for Oracle DBA training.

QuarkJets
Sep 8, 2008

Hughmoris posted:

I heard that working with Oracle was an invitation to Planet Money. Is that not the case?

Maybe it was just for Oracle DBA training.

Every single person I know who has spent appreciable time working with Oracle databases seems to seriously hate them, but I don't have any personal experience with Oracle DB so I don't really know why :shrug:

ElehemEare
May 20, 2001
I am an omnipotent penguin.

Hughmoris posted:

I heard that working with Oracle was an invitation to Planet Money. Is that not the case?

Maybe it was just for Oracle DBA training.

I have administered Oracle 11g/12c on RAC, it was not a one-way ticket to Planet Money.

QuarkJets posted:

Every single person I know who has spent appreciable time working with Oracle databases seems to seriously hate them, but I don't have any personal experience with Oracle DB so I don't really know why :shrug:

I think a lot of the justifiable hatred for Oracle comes from obtuse implementation details designed to ensure backwards compatibility, objectively bad administrative tooling relative to paid and open-source competitors, and the relative complexity of PL/SQL when embedding logic within the data layer.

Seventh Arrow
Jan 26, 2005

Currently we're using Oracle as a dumping ground for Salesforce data so fortunately I don't have to interact with it much - do some queries, dump metadata to a spreadsheet, that kind of thing.

What really does drive me up the wall, however, is the query language that Salesforce uses...specifically, how limited it is. You can't do joins, you can't use aliases, you can't do subqueries, no window functions. You can't even do "select *", lol. You have to include each and every field/column by name instead. You can't even do "distinct" on specific fields/columns. Someone wanted me to do some aggregate functions on specific fields with a where clause, so I had to export the whole object to SQL and query it there ಠ_ಠ
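
For anyone stuck doing the same dance, a rough sketch of the export-then-aggregate workaround (assumes the simple_salesforce library and pandas; the object, fields and credentials here are placeholders):
code:
# Rough sketch: pull a Salesforce object, then do the aggregation SOQL won't
# allow by dumping it into pandas/SQLite. Object, fields and credentials are
# placeholders for illustration.
import sqlite3

import pandas as pd
from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...", security_token="...")

records = sf.query_all("SELECT Id, StageName, Amount FROM Opportunity")["records"]
df = pd.DataFrame(records).drop(columns="attributes")

with sqlite3.connect(":memory:") as conn:
    df.to_sql("opportunity", conn, index=False)
    agg = pd.read_sql(
        "SELECT StageName, COUNT(*) AS n, SUM(Amount) AS total "
        "FROM opportunity GROUP BY StageName",
        conn,
    )
print(agg)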

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
For OLTP DBs, the most popular one for startups in the US (and medium-sized newer tech companies) seems to be Postgres, unless they've tied themselves to the Microsoft stack, in which case they're on SQL Server. MySQL is still somewhat popular in the US but is very popular in Asia, especially China. No one uses Oracle if they can help it - basically only large, old, non-tech companies do - which means that there are a ton of jobs that use it but no one likes them.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Hughmoris posted:

For those of you currently working as Data Engineers, what does your day-to-day look like? What technologies are you working with?

Our stack is:

1. Self-hosted Airflow as our orchestrator, which executes two kinds of jobs:
a. Python pipelines that are either replicating tables from our production db or pulling data from APIs
b. DBT jobs that create derived tables from those base-level tables created/updated by the Python pipelines

2. A "warehouse", which is a managed Postgres instance because it is cheap and our data sizes are small enough that we can get away with it for now.

Changes we're considering are to either move to a lakehouse architecture with Iceberg + Trino, or to move us to a proper columnar db, like self-hosted StarRocks (a fork of Apache Doris).

My day-to-day at my startup is:

1. Checking the orchestrator status page - Mostly just rerunning failed jobs and/or fixing code if, e.g., an API change happened that caused the failed job.
2. Fielding tickets for data requests - This is the bulk of my job. If the request requires a brand new source, then it'll require writing a new pipeline job to pull data from an API. But often it just requires new DBT transforms to present the existing data in a different way so we can hook it to Tableau or Geckoboard.
3. Occasional calls - I am blessed to have been able to set hard boundaries (and be supported in this by my boss) regarding the number of meetings I am in since I am still an IC even though I am the data lead.
4. Occasional work to re-architect existing systems - We've always chosen the simplest/cheapest route with the thought that when it breaks as our data volumes grow, then we'll revisit it and choose a more complicated or more expensive setup. Well, our data volumes have continued to grow and things do break as expected. When that happens this becomes my main focus until resolved, putting incoming tickets on hold.

CompeAnansi fucked around with this message at 21:04 on Jan 3, 2024

Hughmoris
Apr 21, 2007
Let's go to the abyss!

QuarkJets posted:

Every single person I know who has spent appreciable time working with Oracle databases seems to seriously hate them, but I don't have any personal experience with Oracle DB so I don't really know why :shrug:

ElehemEare posted:

I have administered Oracle 11g/12c on RAC, it was not a one-way ticket to Planet Money.

I've been sold a bill of lies. :mad:

CompeAnansi posted:

Our stack is:

...

I appreciate the insight. I really need to sit down and spend a little time exploring Airflow.

Do your data engineers also administer your Postgres instance, or do you have a dedicated DBA team for that?

monochromagic
Jun 17, 2023

Hughmoris posted:

For those of you currently working as Data Engineers, what does your day-to-day look like? What technologies are you working with?

I work for an e-commerce company - the backend is MySQL/PHP on Google Cloud, and our data stack is managed Airflow (Cloud Composer), Airbyte on GKE, dbt via Kubernetes operators, BigQuery, Cloud Run, Cloud Notebooks, and some Dataflow.

We are currently refactoring our datalake architecture, moving from one big lake to four zones - ingest, transform, staging and publish, trying to solve the current permission spaghetti we have in the old datalake. At the same time, we are moving ingestion from Rundeck and Airflow over to Airbyte, migrating from Airflow 1 to Airflow 2 (long overdue!!) and migrating from Dataform to dbt.

My days are currently lots of Terraform, k8s shenanigans with the people from platform engineering, putting out fires, helping people understand dbt, looking at pull requests, and onboarding our new junior data engineer. Hopefully when they're up and running I can get some more big brain time for the architectural side of things, making sure our stack is running smoothly etc. I've been a one person army until December and it's been quite stressful.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Hughmoris posted:

Do your data engineers also administer your Postgres instance, or do you have a dedicated DBA team for that?

We're paying for managed Postgres so it is largely handled for us, but anything that needs doing I do myself. No DBA team.

lazerwolf
Dec 22, 2009

Orange and Black
Hello Data Engineering thread! I’m sort of a hybrid Full Stack Engineer / Data Engineer in the biotechnology space, meaning I can build front-facing UIs, backend APIs and data pipelines.

We’re an AWS shop so we use MWAA, S3, Redshift, etc. React for any UI application and Django for REST and GraphQL APIs to serve the frontends and house business logic for data pipelines.

Does anyone have a good approach for syncing external databases into Redshift as-is (i.e. no transformation)?

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

lazerwolf posted:

Hello Data Engineering thread! I’m sort of a hybrid Full Stack Engineer / Data Engineer in the biotechnology space, meaning I can build front-facing UIs, backend APIs and data pipelines.

We’re an AWS shop so we use MWAA, S3, Redshift, etc. React for any UI application and Django for REST and GraphQL APIs to serve the frontends and house business logic for data pipelines.

Does anyone have a good approach for syncing external databases into Redshift as-is (i.e. no transformation)?

Since no one is replying I'll mention what we do for Postgres to Postgres syncs since Redshift is based on Postgres and has the COPY command (although it may be somewhat restricted). Assuming that none of the tables in the DB are larger than memory, you can use:
code:
COPY table TO STDOUT
to pull the table into an in memory buffer, then immediately push it back out to the destination (after setting the buffer back to the start with file.seek(0)) with:
code:
COPY table FROM STDIN
As I said I'm not 100% sure how that'll carry over to Redshift (or what your source DBs are), but it's a potential starting place for high speed duplication.
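
A minimal Python sketch of that buffer round-trip (assuming psycopg2 and a table that fits in memory; connection strings and the table name are placeholders):
code:
# Minimal sketch: stream a table out of one Postgres and straight into another
# via an in-memory buffer. Connection strings and table name are placeholders.
import io

import psycopg2

SRC_DSN = "postgresql://user:pass@source-host/db"
DST_DSN = "postgresql://user:pass@dest-host/db"
TABLE = "my_table"

buf = io.StringIO()

with psycopg2.connect(SRC_DSN) as src, src.cursor() as cur:
    cur.copy_expert(f"COPY {TABLE} TO STDOUT", buf)  # dump the table into the buffer

buf.seek(0)  # rewind before reading it back out

with psycopg2.connect(DST_DSN) as dst, dst.cursor() as cur:
    cur.copy_expert(f"COPY {TABLE} FROM STDIN", buf)  # push the rows to the destination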

EDIT: A cursory google search after posting turned this up, though: https://stackoverflow.com/questions/31603429/redshift-copy-from-stdin-using-java-program. Sounds like you'll need to copy it to S3 first then you can use the
code:
COPY table FROM Amazon S3
command instead, but this is significantly less efficient because you have to store the table in an intermediary stage rather than just temporarily in memory.

CompeAnansi fucked around with this message at 05:19 on Jan 5, 2024

lazerwolf
Dec 22, 2009

Orange and Black

CompeAnansi posted:

Since no one is replying I'll mention what we do for Postgres to Postgres syncs since Redshift is based on Postgres and has the COPY command (although it may be somewhat restricted). Assuming that none of the tables in the DB are larger than memory, you can use:
code:
COPY table TO STDOUT
to pull the table into an in memory buffer, then immediately push it back out to the destination (after setting the buffer back to the start with file.seek(0)) with:
code:
COPY table FROM STDIN
As I said I'm not 100% sure how that'll carry over to Redshift (or what your source DBs are), but it's a potential starting place for high speed duplication.

EDIT: A cursory google search after posting turned this up, though: https://stackoverflow.com/questions/31603429/redshift-copy-from-stdin-using-java-program. Sounds like you'll need to copy it to S3 first then you can use the
code:
COPY table FROM Amazon S3
command instead, but this is significantly less efficient because you have to store the table in an intermediary stage rather than just temporarily in memory.

Some of these databases have 600+ tables and I doubt they would fit into memory. We have been doing the table-to-S3 copy paradigm. I was wondering if there was a better way, but it seems like not really. Especially since we’re being really lazy right now and dropping the replica and doing a full copy each time.

monochromagic
Jun 17, 2023

lazerwolf posted:

Some of these databases have 600+ tables and I doubt they would fit into memory. We have been doing the table-to-S3 copy paradigm. I was wondering if there was a better way, but it seems like not really. Especially since we’re being really lazy right now and dropping the replica and doing a full copy each time.

If the external databases have WALs enabled, CDC could be an option. We're using Airbyte for that purpose - but I'd probably look into whether AWS has CDC for Redshift, because we are not having a good experience running Airbyte in prod.

Oysters Autobio
Mar 13, 2017
So I'm a data analyst who's been trying to learn more DE work. Additionally, our DE team is heavily overworked on projects and just really has no time for things like data warehousing or dimensional models.

I'm interested in dbt, particularly to leverage the macros so I could parameterize some of the repetitive logical transforms.

Basically our situation is that the DE team will output their pipelines and ad hoc ETLs as tables into Impala. They don't do any cleaning or transformation on the data or try to map it to any sort of standard set of schemas.
As a result we have a sprawling catalogue of tables with different schemas, but many contain repeating types of data that could very much be assembled into a few analytical tables. Good news is that there is a metadata API built on top of a metadata SQL database, but very little is tagged (I could crowdsource some tagging if needed).

I have an idea in mind for a dimensional model and what it might look like in terms of fact and dim tables, and I could go table by table to hand-construct the transform code, but I feel like there has to be a better, idiomatic way to parameterize this process so it could be somewhat reusable in the future. Like, "for every table that's ETLd with a column tagged as 'dog names', append data from that column to the master table 'dim_dognames'". Is this something doable just in dbt, or would I have to do this kind of transform further upstream in pyspark? If it's the latter, rather than hardcoding all the code and logic, any tips on managing schema mapping (i.e. something like letting a user pass a YAML file which maps each column name from the source to a common schema in the target table)?

Also, is there a term or concept that describes an ETL pattern like this (i.e. using common metadata tags to map parameterized transform logic into analytical tables)? Is this a common practice? Obviously the "right" answer here is to get the clients who are sending this data to conform it to a standard schema prior to dumping it into our data lake, but bureaucracy, politics and very poor data literacy are hampering that.

I've run this idea by some more experienced folks and they flagged something like Avro or protobuf, but I don't quite understand how you would use serialization libraries for this kind of thing.

Oysters Autobio fucked around with this message at 00:53 on Jan 11, 2024

WHERE MY HAT IS AT
Jan 7, 2011
I mentioned in the Python thread that I'm doing some contracting for my wife's company and moving them from their old no-code pipelines in Make to Dagster (at the thread's suggestion) for flexibility and better reliability. I'm just wrapping up the first rewrite of a nightly job which pulls order data from Shopify and learned that whatever Make does is non-deterministic. I can run the same pipeline with the same variables three times and get three different order numbers reported. It's also consistently 10-15% lower than it should be, which means that all their COGS calculations for the last several years have been wrong.

Needless to say the founders are a) horrified and b) no longer have any reservations about the rewrite project.

monochromagic
Jun 17, 2023

WHERE MY HAT IS AT posted:

I mentioned in the Python thread that I'm doing some contracting for my wife's company and moving them from their old no-code pipelines in Make to Dagster (at the thread's suggestion) for flexibility and better reliability. I'm just wrapping up the first rewrite of a nightly job which pulls order data from Shopify and learned that whatever Make does is non-deterministic. I can run the same pipeline with the same variables three times and get three different order numbers reported. It's also consistently 10-15% lower than it should be, which means that all their COGS calculations for the last several years have been wrong.

Needless to say the founders are a) horrified and b) no longer have any reservations about the rewrite project.

Non-deterministic is like my worst nightmare. Kind of nice that you were able to get founders on board for the rewrite project though! Would love to hear about how you find working with/migrating to Dagster.



It sounds to me like the setup you have is more of an ELT approach, that is, extract, load, transform rather than the other way around. This can actually be beneficial - we do this - but it requires that the architecture follows suit, and it seems like this might not be the case for you. We have four layers in our datalake - ingest, transform, staging, and publish. Ingest is solely data engineering territory, and we work together with DAs to provide sane tables in the transform layer.

With respect to your dimensional modelling approach, I think this should be possible with dbt but you probably need code generation on top. What you are touching upon here is also related to data contracts, a feature dbt Core supports in its newer versions. The reason people are recommending Avro or protobuf is that they are industry standards for describing schemas (or contracts) between systems, and both can be used for code generation.
I think data mesh and data contracts might be interesting concepts for you to research - I hope our datalake will mature enough over the next few years that these concepts are achievable for us.
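
To sketch the parameterized-mapping idea from your post - purely illustrative, not our setup - here's a minimal example assuming PyYAML and pandas, with made-up file, table and column names:
code:
# Sketch: apply a user-supplied YAML column mapping to pull differently-shaped
# source tables into one common target table. All names are made up.
import pandas as pd
import yaml

# mapping.yaml might look like:
#   source_table_a:
#     hund_navn: dog_name
#     reg_dt: registered_at
#   source_table_b:
#     dogName: dog_name
#     created: registered_at
with open("mapping.yaml") as f:
    mapping = yaml.safe_load(f)

frames = []
for table_name, column_map in mapping.items():
    df = pd.read_csv(f"{table_name}.csv")       # or read from Impala, Parquet, etc.
    df = df.rename(columns=column_map)[list(column_map.values())]
    df["source_table"] = table_name             # keep lineage
    frames.append(df)

dim_dognames = pd.concat(frames, ignore_index=True)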

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
Has anyone tried running dbt over Iceberg yet (with whatever query layer you prefer)? I know views aren’t supported on Iceberg yet, but you should in theory be able to materialize as a table. But the process for creating a table in Iceberg is way more complicated than it is in a traditional db, so I’m not sure if the system can pull it off.


Oysters Autobio
Mar 13, 2017

monochromagic posted:

With respect to your dimensional modelling approach, I think this should be possible with dbt but you probably need code generation on top. What you are touching upon here is also related to data contracts, a feature dbt Core supports in its newer versions. The reason people are recommending Avro or protobuf is that they are industry standards for describing schemas (or contracts) between systems, and both can be used for code generation.
I think data mesh and data contracts might be interesting concepts for you to research - I hope our datalake will mature enough over the next few years that these concepts are achievable for us.

Yeah this is interesting and I'd love to learn more practical skills here, but I can't seem to find anything on data contracts and data mesh that isn't SEO-generated VP-bait nonsense bullshit. Is there any actual content that showcases demos, code bases or examples of how this works?

Additionally, I'm a bit skeptical of data mesh actually working in the practical world unless the org is willing to invest in the technical expertise needed to make it happen. Each domain having its own data infrastructure team sounds a bit like a pipe dream unless you're afloat in VC cash or something, and a data mesh without subject matter experts as the main participants sounds like it won't work.

The org I support barely even uses spreadsheets, let alone understands how structured data works or what a schema even is. I tried to ask for a schema change to include a really useful property in a data model used for managing memorandum-style reports, and I got blank looks and a response saying "Why? People can just write whatever they want in the body of the text".
