Following separate discussions in both the Python and SQL threads, here it is: the data engineering thread, for people who like just constantly having problems, always, forever. Let's go.

What is data engineering? Great question!!! Definitions vary, and data engineering may be many things at once depending on context. Personally I describe it as software engineering with a focus on analytical data pipelines, infrastructure, and what have you. That is, no AI/ML/reporting stuff, but rather making sure data is available for the smart people who need it, i.e. the data analysts/scientists. However, depending on company size etc., a data engineer might also be a data scientist, or a data analyst might also be a data engineer. Finally, most backend engineers do data engineering to some degree when they design data models and so on.

How do I become one? Here's a tip: don't! If you insist, there are usually two ways into data engineering: either you are already a data analyst/scientist and want to home in on the engineering part, or you are a software engineer who wants to home in on the data part. I fall into the latter category, and before you ask: yes, I am in therapy. Currently there also seems to be some interest from educational institutions in offering data engineering specialties. Exciting stuff!

Seventh Arrow posted: Also if you're going to get into data engineering then it'd probably be a good idea to get a certification for one of the three big cloud platforms - Amazon AWS, Microsoft Azure, or Google GCP.

Resources
Hughmoris posted: For those interested in checking out the Microsoft flavor of Data Engineering:

Tutorials/educational

Seventh Arrow posted: If there are two things that you need to have a lock on in the DE space, it's Python and SQL. Here are some Python learning resources I posted in another thread:

monochromagic fucked around with this message at 20:25 on Dec 29, 2023
# Dec 29, 2023 13:19
pmchem posted: what do you recommend if you need to merge partially overlapping data from people each using different dbs (oracle, postgres, excel) and different schema into one set of pandas dataframes

Generally I'd recommend building data models for each source and then slapping on whatever logic is necessary to convert them to the set of dataframes. I'm assuming Python due to the use of pandas, and for data modelling Pydantic is always my default recommendation. Pydantic 2.0 has a Rust backend and is supposedly very fast, although I haven't had the chance to play around with it personally yet (stuck with 1.10 due to our managed Airflow).
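Very rough sketch of what I mean by "a model per source". I'm using stdlib dataclasses here so it runs anywhere - with Pydantic you'd subclass BaseModel and get validation/coercion for free. All the field and column names are made up for illustration:

```python
from dataclasses import dataclass, asdict

# One model per source, each normalising its own quirks into a common shape.
@dataclass
class Customer:
    customer_id: int
    email: str

def from_oracle(row: dict) -> Customer:
    # e.g. the Oracle dump uses upper-case column names
    return Customer(customer_id=int(row["CUSTOMER_ID"]), email=row["EMAIL"].lower())

def from_excel(row: dict) -> Customer:
    # e.g. the Excel export has friendlier headers and stringly-typed ids
    return Customer(customer_id=int(row["Customer ID"]), email=row["E-mail"].lower())

oracle_rows = [{"CUSTOMER_ID": "1", "EMAIL": "A@EXAMPLE.COM"}]
excel_rows = [{"Customer ID": "2", "E-mail": "b@example.com"}]

records = [asdict(from_oracle(r)) for r in oracle_rows] + \
          [asdict(from_excel(r)) for r in excel_rows]

# pandas.DataFrame(records) then gives you one unified frame
print(records)
```

The nice part is that all the source-specific weirdness lives in the `from_*` functions, and everything downstream only ever sees the one model.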
# Dec 29, 2023 15:28
That is a great idea - I'll see what I can find, and otherwise people are more than welcome to post great tutorials and I'll edit them in.

e: added two, one with a bit of BigQuery involvement and one without cloud tech.

monochromagic fucked around with this message at 15:44 on Dec 29, 2023
# Dec 29, 2023 15:36
BAD AT STUFF posted: For folks who work somewhere with a clear distinction between data science and data engineering, what are some strategies that have helped make the handoff of projects/code between roles less painful?

Some thoughts - for context, I have been the sole data engineer at my company (1500+ employees, ecommerce) for a year and just recently got a junior data engineer. We are also migrating from Airflow 1 to Airflow 2 and from Dataform to dbt while implementing Airbyte for EL, and 60% of the data team were hired within the last 6 months. So...

1) Standardise the way people hand over code - for me, that would mean no longer accepting notebooks. PRs are king, and an onboarding goal of ours is that you have your first code merged within your first week.

2) Let the data scientists learn from each other. When our data scientists/analysts have e.g. dbt questions, we have a discussion and they write some new documentation they can refer to in the future. In your case I would encourage the technically strong data scientists to be part of standardising handovers.

3) I'm a bit curious - are you implementing custom CI/CD for each new project? This touches a bit on platform engineering, but perhaps this could be standardised as well, e.g. prefer Docker images or something similar that you then have a unified CI/CD process for. I'm trying to move us towards more Docker and using the Kubernetes executor in Airflow, for example - that's how we run dbt in prod.
# Dec 29, 2023 17:37
Hughmoris posted: For those of you currently working as Data Engineers, what does your day-to-day look like? What technologies are you working with?

I work for an e-commerce company - the backend is MySQL/PHP on Google Cloud, and our data stack is managed Airflow (Cloud Composer), Airbyte on GKE, dbt via Kubernetes operators, BigQuery, Cloud Run, Cloud Notebooks, and some Dataflow.

We are currently refactoring our datalake architecture, moving from one big lake to four zones - ingest, transform, staging, and publish - to solve the permission spaghetti we have in the old datalake. At the same time, we are moving ingestion from Rundeck and Airflow over to Airbyte, migrating from Airflow 1 to Airflow 2 (long overdue!!), and migrating from Dataform to dbt.

My days are currently lots of Terraform, k8s shenanigans with the people from platform engineering, putting out fires, helping people understand dbt, looking at pull requests, and onboarding our new junior data engineer. Hopefully when they're up and running I can get some more big brain time for the architectural side of things, making sure our stack is running smoothly, etc. I've been a one-person army until December and it's been quite stressful.
# Jan 3, 2024 20:50
lazerwolf posted: Some of these databases have 600+ tables and I doubt they would fit into memory. We have been doing the table to s3 copy paradigm. I was wondering if there was a better way but seems like not really. Especially since we're being really lazy right now with dropping the replica and copying full each time.

If the external databases have WALs enabled, CDC (change data capture) could be an option. We're using Airbyte for that purpose - but I'd probably look into whether AWS has CDC for Redshift, because we are not having a good experience running Airbyte in prod.
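To make the CDC idea concrete, here's a toy illustration of why it beats dropping the replica and copying everything: instead of re-loading full tables, you replay a stream of change events against the target. The event shape below is entirely made up - real CDC tooling (Debezium, Airbyte's CDC sources, AWS DMS) emits something much richer from the WAL:

```python
# Toy CDC replay: apply a stream of WAL-style change events to a target
# table instead of re-copying the whole table. Event format is invented
# for illustration only.
target = {}  # primary key -> row

events = [
    {"op": "insert", "id": 1, "row": {"id": 1, "status": "new"}},
    {"op": "insert", "id": 2, "row": {"id": 2, "status": "new"}},
    {"op": "update", "id": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "id": 2, "row": None},
]

for ev in events:
    if ev["op"] == "delete":
        target.pop(ev["id"], None)
    else:
        # inserts and updates both become upserts on the target side
        target[ev["id"]] = ev["row"]

print(target)  # {1: {'id': 1, 'status': 'shipped'}}
```

The work done is proportional to the number of changes, not the size of the table - which is the whole point when you have 600+ tables that don't fit in memory.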
# Jan 6, 2024 11:32
WHERE MY HAT IS AT posted: I mentioned in the Python thread that I'm doing some contracting for my wife's company and moving them from their old no-code pipelines in Make to Dagster (at the thread's suggestion) for flexibility and better reliability. I'm just wrapping up the first rewrite of a nightly job which pulls order data from Shopify and learned that whatever Make does is non-deterministic. I can run the same pipeline with the same variables three times and get three different order numbers reported. It's also consistently 10-15% lower than it should be, which means that all their COGS calculations for the last several years have been wrong.

Non-deterministic pipelines are basically my worst nightmare. Kind of nice that you were able to get the founders on board for the rewrite project, though! Would love to hear how you find working with/migrating to Dagster.

Oysters Autobio posted: words

It sounds to me like the setup you have is more of an ELT approach - extract, load, transform rather than the other way around. This can actually be beneficial (we do this), but it requires that the architecture follows suit, and it seems like this might not be the case for you. We have four layers in our datalake - ingest, transform, staging, and publish. Ingest is solely data engineering territory, and we work together with DAs to provide sane tables in the transform layer.

With respect to your dimensional modelling approach, I think this should be possible with dbt, but you probably need code generation on top. What you are touching upon here is also related to data contracts, a feature dbt Core supports in its newer versions. The reason people are recommending Avro or protobuf is that they are industry standards for describing schemas (or contracts) between systems, and both can be used for code generation.

I think data mesh and data contracts might be interesting concepts for you to research - I hope our datalake will mature enough over the next few years that these concepts are achievable for us.
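In case "data contract" sounds more magical than it is: at its core it's just a declared schema that both producer and consumer check rows against. A minimal hand-rolled sketch below - in practice you'd express the contract as an Avro/protobuf schema or a dbt model contract instead, and the column names here are invented:

```python
# Minimal hand-rolled "data contract": a declared schema that both the
# producing and consuming side can check rows against. Real-world
# versions use Avro/protobuf schemas or dbt model contracts.
CONTRACT = {"order_id": int, "amount": float, "currency": str}

def violations(row: dict) -> list[str]:
    probs = []
    for col, typ in CONTRACT.items():
        if col not in row:
            probs.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            probs.append(f"{col}: expected {typ.__name__}, got {type(row[col]).__name__}")
    return probs

good = {"order_id": 1, "amount": 9.99, "currency": "DKK"}
bad = {"order_id": "1", "amount": 9.99}

print(violations(good))  # []
print(violations(bad))   # ['order_id: expected int, got str', 'missing column: currency']
```

The value is that a producer changing a column type breaks loudly at the contract boundary instead of silently three dashboards downstream.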
# Jan 13, 2024 09:49
Hadlock posted: I'm curious what is the dividing line between data engineering, and devops/platform engineering/sre

So, I think it's hard to draw dividing lines between these disciplines, as they often overlap. The last few companies I've been at have, for example, moved away from talking about DevOps as a specific discipline with specific roles tied to it, and moved on to talking about e.g. platform engineering and developer experience.

Hadlock posted: I mean, is data engineering a sub-discipline of devops, like devsec ops or platform engineering, or is it somehow independent? Arguably it's a sub discipline of platform engineering

Personally I see data engineering as a subset of software engineering - we are essentially data backenders. Depending on the job and the size/maturity of the project/company you work for, you might also do platform engineering and SRE stuff. At my current place of work we have a mature data platform and a dedicated platform engineering team that helps out with our Kubernetes clusters etc. They don't know the specifics of the data infrastructure, but they help us easily spin up new clusters, make CI/CD pipelines, and so on. However, in smaller companies this could all be on the data engineer or even the analysts.
# Mar 2, 2024 10:47
Oysters Autobio posted: Data analyst / BI analyst here who's interested in upskilling into a more direct data engineering role, particularly with a focus on analytics (ie data warehousing, modelling, ETL etc). Currently most of my work is spent running typical adhoc analysis and spreadsheet wrangling.

Maybe you'd be interested in pursuing something akin to analytics engineering. With respect to skills, I'd value personal projects highly in a hiring situation; comp sci degrees are nice, but outranked by actual work experience when I look at CVs. I have no experience with bootcamps, unfortunately.
# Mar 2, 2024 10:52
Oysters Autobio posted: So conceptually, absolutely yeah I would like to do DE work as close to the analytics side as possible.

It might not be called "analytics engineer" everywhere, but focusing on analytics engineering as described by dbt (they coined the term) will definitely give you an edge in both data analyst and data engineering jobs. For example, I'd love a more analytics-focused engineer on our team - whether the official job title is data analyst or data engineer is of less importance. This blog post describes analytics engineering very well in my opinion.

Basically, it's about bringing software engineering practices into the analytics space, for example by designing test suites for analyses, CI/CD for the analytics pipelines, etc. This also means that you will definitely need Python in your arsenal, among other things. dbt is also just data as code - something that can be achieved with tools like SQLMesh, which looks super interesting. My point here is that dbt is just a tool, but the principles behind analytics engineering (and data engineering, and any kind of engineering) can be applied across tool choices.
# Mar 3, 2024 14:07