monochromagic
Jun 17, 2023

Following separate discussions in both the Python and SQL threads, here it is - the data engineering thread for people who like just constantly having problems, always, forever.
Let's go.



What is data engineering?
Great question!!! The definitions vary and data engineering may be many things at once depending on context.
Personally I describe it as software engineering with a focus on analytical data pipelines, infrastructure and what have you. That is, no AI/ML/reporting stuff, but rather making sure data is available for the smart people who need it, i.e. the data analysts/scientists.
However, depending on company size etc. etc. a data engineer might also be a data scientist or a data analyst might also be a data engineer. Finally, most backend engineers do data engineering to some degree when they design data models and so on.

How do I become one?
Here's a tip - don't! If you insist, there are usually two ways into data engineering - either you are already a data analyst/scientist and want to home in on the engineering part, or you are a software engineer who wants to home in on the data part. I fall in the latter category and before you ask, yes, I am in therapy. Currently there also seems to be some interest from educational institutions in offering data engineering specialties. Exciting stuff!

Seventh Arrow posted:

Also if you're going to get into data engineering then it'd probably be a good idea to get a certification for one of the three big cloud platforms - Amazon AWS, Microsoft Azure, or Google Cloud (GCP).

However don't let job interviewers intimidate you if you have AWS and they're an Azure house. It's all the same stuff with different labels slapped on, and you need to show them that.

Resources
  • The Python and SQL threads mentioned are very good and you should bookmark them
  • Generally I'd say methods over tools, but dbt is pretty ubiquitous in the industry
  • All cloud providers have OLAP databases available, pick your poison
  • If you are not in the cloud, DuckDB is very nice and fast - see the quick sketch after this list
  • Orchestration of pipelines is very necessary - you will read about Airflow in this thread, but please look at ANY alternative if you're starting a new project
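
To give a feel for the DuckDB point above, here's a minimal sketch of querying a local Parquet file straight from Python - no server, no cloud warehouse. The file name and columns (orders.parquet, order_date, amount) are made up for illustration.

```python
# Minimal DuckDB sketch: query a local Parquet file with plain SQL.
# The file and column names are invented for illustration.
import duckdb

con = duckdb.connect()  # in-memory database, nothing to set up server-side

# DuckDB can scan Parquet/CSV files directly and hand back a pandas DataFrame
df = con.execute("""
    SELECT order_date, SUM(amount) AS revenue
    FROM 'orders.parquet'
    GROUP BY order_date
    ORDER BY order_date
""").df()

print(df.head())
```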

Hughmoris posted:

For those interested in checking out the Microsoft flavor of Data Engineering:

You can create a free Microsoft 365 Developer Sandbox. This gives you free access to the Microsoft Fabric trial, which is Microsoft's next push for all things data. So you can spin up data lakes, warehouses, Spark clusters etc... No credit card needed for any of it, so you can play around without fear of racking up bills.


Tutorials/educational



monochromagic
Jun 17, 2023

pmchem posted:

what do you recommend if you need to merge partially overlapping data from people each using different dbs (oracle, postgres, excel) and different schema into one set of pandas dataframes

Generally I'd recommend building data models for each source and then slapping on whatever logic is necessary to convert them to the set of dataframes. I'm assuming Python given the use of pandas, and for data modelling Pydantic is always my default recommendation. 2.0 has a Rust backend and is supposedly very fast, although I haven't had the chance to play around with it personally yet (stuck on 1.10 due to our managed Airflow :argh:)
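
As a rough sketch of the "one model per source, then normalise" idea - all entity and field names here are invented, and the mapping logic obviously depends on your actual schemas:

```python
# Sketch: a Pydantic model per source, normalised into one shared model,
# then converted to a pandas DataFrame. Names are invented for illustration.
from datetime import date
from typing import List, Optional

import pandas as pd
from pydantic import BaseModel


class OracleCustomer(BaseModel):
    CUST_ID: int
    CUST_NAME: str
    SIGNUP_DT: date


class PostgresCustomer(BaseModel):
    id: int
    full_name: str
    signed_up: date
    email: Optional[str] = None


class Customer(BaseModel):
    """The unified model every source gets normalised into."""
    customer_id: int
    name: str
    signup_date: date


def from_oracle(row: OracleCustomer) -> Customer:
    return Customer(customer_id=row.CUST_ID, name=row.CUST_NAME, signup_date=row.SIGNUP_DT)


def from_postgres(row: PostgresCustomer) -> Customer:
    return Customer(customer_id=row.id, name=row.full_name, signup_date=row.signed_up)


def to_dataframe(customers: List[Customer]) -> pd.DataFrame:
    # .dict() works on both pydantic 1.x and 2.x (2.x prefers .model_dump())
    return pd.DataFrame([c.dict() for c in customers])
```

The nice part is that Pydantic validates types at the boundary, so schema drift in any one source blows up loudly instead of silently corrupting the merged dataframes.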

monochromagic
Jun 17, 2023

That is a great idea - I'll see what I can find and otherwise people are more than welcome to post great tutorials and I'll edit them in.

e: added two, one with a bit of BigQuery involvement and one without cloud tech.


monochromagic
Jun 17, 2023

BAD AT STUFF posted:

For folks who work somewhere with a clear distinction between data science and data engineering, what are some strategies that have helped make the handoff of projects/code between roles less painful?

At my company, data science comes up with an approach for something (a new data asset, an ML model, whatever). Then engineering builds that out into a full application, handles automation, sets up CI/CD, makes sure everything scales, etc. We've been struggling with inconsistencies in how that work gets delivered from our data scientists. Some of them are really technical and awesome. They'll put PRs in on the repos and have tests for their new code. Others are sending us links to unversioned notebooks filled with print statements and no guidance for how you actually run these things and in what order.

I'd be interested in hearing about anyone's social or technical solutions to these kinds of problems. We've talked about a few things but haven't gone too far down any particular road yet.

Some thoughts - for context, I have been the sole data engineer at my company (1500+ employees, e-commerce) for a year and just recently got a junior data engineer. We are also migrating from Airflow 1 to Airflow 2 and from Dataform to dbt while implementing Airbyte for EL, and 60% of the data team were hired within the last 6 months. So...

1) Standardise the way people hand over code - for me that would mean no longer accepting notebooks. PRs are king, and an onboarding goal of ours is that you have your first code merged within your first week.

2) Let the data scientists learn from each other. When our data scientists/analysts have e.g. dbt questions, we discuss them and they write new documentation they can refer to in the future. In your case I would encourage the awesome technical data scientists to be part of standardising handovers.

3) I'm a bit curious - are you implementing custom CI/CD for each new project? This touches a bit on platform engineering, but perhaps this could be standardised as well, e.g. prefer Docker images or something similar that you then have a unified CI/CD process for. I'm trying to move us towards more Docker and using the Kubernetes executor in Airflow, for example - that's how we run dbt in prod.
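
One common shape for that (not necessarily exactly what we run) is to bake the dbt project into its own image and launch it from Airflow with the KubernetesPodOperator. A minimal sketch, where the image name, namespace and target are placeholders, and the provider import path varies a bit between versions:

```python
# Sketch: run dbt as a container from an Airflow 2 DAG.
# Image, namespace and target are placeholders - adjust to your setup.
# (Newer cncf-kubernetes provider versions import from
#  airflow.providers.cncf.kubernetes.operators.pod instead.)
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="dbt_daily_run",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = KubernetesPodOperator(
        task_id="dbt_run",
        name="dbt-run",
        namespace="data",                              # placeholder namespace
        image="registry.example.com/dbt-project:latest",  # placeholder image
        cmds=["dbt"],
        arguments=["run", "--target", "prod"],
        get_logs=True,
    )
```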

monochromagic
Jun 17, 2023

Hughmoris posted:

For those of you currently working as Data Engineers, what does your day-to-day look like? What technologies are you working with?

I work for an e-commerce company - the backend is MySQL/PHP on Google Cloud; our data stack is managed Airflow (Cloud Composer), Airbyte on GKE, dbt via Kubernetes operators, BigQuery, Cloud Run, Cloud Notebooks, and some Dataflow.

We are currently refactoring our datalake architecture, moving from one big lake to four zones - ingest, transform, staging and publish, trying to solve the current permission spaghetti we have in the old datalake. At the same time, we are moving ingestion from Rundeck and Airflow over to Airbyte, migrating from Airflow 1 to Airflow 2 (long overdue!!) and migrating from Dataform to dbt.

My days are currently lots of Terraform, k8s shenanigans with the people from platform engineering, putting out fires, helping people understand dbt, looking at pull requests, and onboarding our new junior data engineer. Hopefully when they're up and running I can get some more big brain time for the architectural side of things, making sure our stack is running smoothly, etc. I've been a one-person army until December and it's been quite stressful.

monochromagic
Jun 17, 2023

lazerwolf posted:

Some of these databases have 600+ tables and I doubt they would fit into memory. We have been doing the table to s3 copy paradigm. I was wondering if there was a better way but seems like not really. Especially since we’re being really lazy right now with dropping the replica and copying full each time.

If the external databases have WALs (write-ahead logs) enabled, CDC (change data capture) could be an option. We're using Airbyte for that purpose - but I'd probably look into whether AWS has CDC options for Redshift, because we are not having a good experience running Airbyte in prod.

monochromagic
Jun 17, 2023

WHERE MY HAT IS AT posted:

I mentioned in the Python thread that I'm doing some contracting for my wife's company and moving them from their old no-code pipelines in Make to Dagster (at the thread's suggestion) for flexibility and better reliability. I'm just wrapping up the first rewrite of a nightly job which pulls order data from Shopify and learned that whatever Make does is non-deterministic. I can run the same pipeline with the same variables three times and get three different order numbers reported. It's also consistently 10-15% lower than it should be, which means that all their COGS calculations for the last several years have been wrong.

Needless to say the founders are a) horrified and b) no longer have any reservations about the rewrite project.

Non-deterministic is like my worst nightmare. Kind of nice that you were able to get founders on board for the rewrite project though! Would love to hear about how you find working with/migrating to Dagster.



It sounds to me like the setup you have is more of an ELT approach - that is, extract, load, transform rather than the other way around. This can actually be beneficial - we do this - but it requires that the architecture follows suit, and it seems like this might not be the case for you. We have four layers in our datalake - ingest, transform, staging, and publish. Ingest is solely data engineering territory, and we work together with DAs to provide sane tables in the transform layer.

With respect to your dimensional modelling approach, I think this should be possible with dbt, but you probably need code generation on top. What you are touching upon here is also related to data contracts, a feature dbt Core supports in its newer versions. The reason people are recommending Avro or Protobuf is that they are industry standards for describing schemas (or contracts) between systems, and both can be used for code generation.
I think data mesh and data contracts might be interesting concepts for you to research - I hope our datalake will mature enough over the next few years that these concepts are achievable for us.
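
dbt declares its contracts in model YAML; purely as a Python-side illustration of the same idea (table and field names invented, and this is not how dbt enforces contracts internally), the publishing side can pin down a schema and validate rows before anything lands downstream:

```python
# Toy data contract: the publish layer promises a schema, and rows are
# validated against it before they ship. Names are invented for illustration.
from datetime import date

import pandas as pd
from pydantic import BaseModel, ValidationError


class OrdersPublished(BaseModel):
    """Contract for a hypothetical publish-layer `orders` table."""
    order_id: int
    order_date: date
    revenue: float
    currency: str


def enforce_contract(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly if any row violates the contract."""
    errors = []
    for record in df.to_dict(orient="records"):
        try:
            OrdersPublished(**record)
        except ValidationError as exc:
            errors.append((record, exc))
    if errors:
        raise ValueError(f"{len(errors)} rows violate the orders contract")
    return df
```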

monochromagic
Jun 17, 2023

Hadlock posted:

I'm curious what is the dividing line between data engineering, and devops/platform engineering/sre

For example we had an "analytics" department that was maintaining their own spaghetti code of bash and SQL, database creation/modification, as well as a home built kitchen/pentaho/spoon ETL disaster and ftp/sftp stuff

I rewrote most everything but the sql and ported them to a new platform (with an eye towards moving them to airflow as the next step), parallelized a lot of the tasks and got the run time way down

On the other hand I've got a coworker that's building out some data pipelines for Google forms and excel data and piping it into tableau for upper management to do internal business forecasting. Arguably you could farm that out to a devops guy

So I think it's hard to draw dividing lines between these disciplines, as they will often overlap. The last few companies I've been in have for example moved away from talking about DevOps as a specific discipline with specific roles tied to it, and moved on to talk about e.g. platform engineering and developer experience.

Hadlock posted:

I mean, is data engineering a sub-discipline of devops, like devsec ops or platform engineering, or is it somehow independent? Arguably it's a sub discipline of platform engineering

Personally I see data engineering as a subset of software engineering - we are essentially data backenders. Depending on the job and the size/maturity of the project/company you work for, you might also do platform engineering and SRE stuff. At my current place of work we have a mature data platform and a dedicated platform engineering team that helps out with our Kubernetes clusters etc. They don't know the specifics of the data infrastructure but help us easily spin up new clusters, make CI/CD pipelines and so on. However, in smaller companies this could all be on the data engineer or even the analysts.

monochromagic
Jun 17, 2023

Oysters Autobio posted:

Data analyst / BI analyst here who's interested in upskilling into a more direct data engineering role, particularly with a focus on analytics (ie data warehousing, modelling, ETL etc). Currently most of my work is spent running typical adhoc analysis and spreadsheet wrangling.

Maybe you'd be interested in pursuing something akin to analytics engineering. With respect to skills, I'd value personal projects highly in a hiring situation; comp sci degrees are nice but outranked by actual work experience when I look at CVs. I have no experience with bootcamps unfortunately.


monochromagic
Jun 17, 2023

Oysters Autobio posted:

So conceptually, absolutely yeah I would like to do DE work as close to the analytics side as possible.

But only thing is that I haven't been able to find material on the domain that isn't just dbt promotional blogs and such. Are employers actually hiring positions like this? And if so, outside of just maintaining and creating dbt models, what else does the role entail?

Not slagging dbt here, but I haven't seen anything like "analytics engineering with python" or basically anything that constitutes the stack that isn't dbt, so I'm skeptical.

The role might not be called analytics engineer, but focusing on analytics engineering as described by dbt (they coined the term) will definitely give you an edge in both data analyst and data engineering jobs. For example, I'd love a more analytics-focused engineer on our team - whether the official job title is data analyst or engineer is of less importance.

This blog post describes analytics engineering very well in my opinion. Basically, it's about bringing software engineering practices into the analytics space, for example by designing test suites for analyses, CI/CD for the analytics pipelines, etc. This also means that you will definitely need Python in your arsenal, among other tools.
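
To make the "test suites for analyses" part concrete, here's a toy example (function and column names invented) of treating a transformation as a plain function with a pytest-style test pinned to it:

```python
# Toy example of treating an analysis like software: the transformation is a
# plain function, and a test pins down its behaviour. Names are invented.
import pandas as pd


def revenue_per_customer(orders: pd.DataFrame) -> pd.DataFrame:
    """Sum order amounts per customer - one row per customer."""
    return (
        orders.groupby("customer_id", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )


def test_revenue_per_customer():
    orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})
    result = revenue_per_customer(orders)
    assert list(result["revenue"]) == [15.0, 7.5]
    assert len(result) == 2  # one row per customer
```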

dbt is also just data as code - something that can be achieved with tools like SQLMesh, which looks super interesting. My point here is that dbt is just a tool, but the principles behind analytics engineering (and data engineering, and any kind of engineering) can be applied across tool choices.
