BAD AT STUFF

Seventh Arrow posted:

However don't let job interviewers intimidate you if you have AWS and they're an Azure house. It's all the same stuff with different labels slapped on, and you need to show them that.

That's true even if you're coming from an on-prem background! There, the biggest change is the infrastructure side. It's really nice to be able to allocate your own resources via Terraform rather than depending on another team, for example.

Once you get to the point of actually doing work there's a ton of overlap: k8s is still k8s, Pyspark is still Pyspark.

BAD AT STUFF

For folks who work somewhere with a clear distinction between data science and data engineering, what are some strategies that have helped make the handoff of projects/code between roles less painful?

At my company, data science comes up with an approach for something (a new data asset, an ML model, whatever). Then engineering builds that out into a full application, handles automation, sets up CI/CD, makes sure everything scales, etc. We've been struggling with inconsistencies in how that work gets delivered from our data scientists. Some of them are really technical and awesome. They'll put PRs in on the repos and have tests for their new code. Others are sending us links to unversioned notebooks filled with print statements and no guidance for how you actually run these things and in what order.

I'd be interested in hearing about anyone's social or technical solutions to these kinds of problems. We've talked about a few things but haven't gone too far down any particular road yet.

BAD AT STUFF

monochromagic posted:

Some thoughts - for context, I have been the sole data engineer at my company (1500+ employees, ecommerce) for a year and just recently got a junior data engineer. We are also migrating from Airflow 1 to Airflow 2 and from Dataform to dbt while implementing Airbyte for EL, and 60% of the data team were hired within the last six months. So...

1) Standardise the way people hand over code; for me that would mean no longer accepting notebooks. PRs are king, and an onboarding goal of ours is that you have your first code merged within your first week.

2) Let the data scientists learn from each other. When our data scientists/analysts have e.g. dbt questions, we have a discussion and they write some new documentation they can refer to in the future. In your case I would encourage the awesome technical data scientists to be part of standardising handovers.

3) I'm a bit curious - are you implementing custom CI/CD for each new project? This touches a bit on platform engineering but perhaps this could be standardised as well, e.g. prefer docker images or something similar that you then have a unified CI/CD process for. I'm trying to move us towards more docker and using the Kubernetes executor in Airflow for example - that's how we run dbt in prod.

I appreciate the thoughts. For reference, this is a team that typically has about 5 to 10 data engineers and 5 to 10 data scientists. We're a pretty isolated team within a much larger company. One of the reasons this is an issue is that our org structure is based around job function (science reports to science, engineering reports to engineering), rather than area of the business. The level of leadership that could be dictatorial about enforcing standards on everyone is far enough removed that it would cause political headaches if we escalated things. So I'm trying to come up with some ideas that can make things easier for everyone and that we can come to a consensus about.

1) I think I made us sound more dysfunctional than we actually are there, but that's why we can't go for the low-hanging fruit of "every change must be a PR". We're doing that on the engineering side, but getting science to buy in is the harder part. We've been working on things like standardizing PR templates this past year. I'd also like to look into more sophisticated linting options for notebooks (there's a sketch of what I mean after this list), or anything else folks can think of that makes better development patterns more appealing and easy to use.

2) We've been talking to some of the more technical data scientists already. It's going to take time to shift things, but I'm hopeful. Most of the time they already do a good job of giving us something usable. There have just been a couple of notable cases lately where what we got was unusually bad.

3) We've recently moved to GitHub Actions for CI/CD, and as part of that all of our pipelines are using standard components. However, there's a decent amount of variability in what the projects do. Some may just be a Python library. Some may be a library plus notebooks that need to get deployed to Databricks. Some might have a FastAPI service that needs to be deployed.

I think overall we're making our projects too large (often at a solution/product level rather than component level), which makes it harder to have complete CI/CD out of the box from the project template without needing to tweak anything. That's another area I'm looking to improve, but thankfully no, we aren't reinventing the wheel every time.
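
On the notebook linting idea from 1): a dumb first pass doesn't need much tooling. Here's a minimal sketch of the kind of check I'm imagining we could wire into CI - nbformat is a real library, but the specific rules and names are just made up for illustration:

```python
# Hypothetical pre-merge check for handed-over notebooks: flag cells that
# were never run, cells executed out of order, and leftover print() debugging.
# The rules here are illustrative, not a vetted policy.
import sys

import nbformat


def lint_notebook(path: str) -> list[str]:
    nb = nbformat.read(path, as_version=4)
    problems = []
    last_count = 0
    for i, cell in enumerate(nb.cells):
        if cell.cell_type != "code":
            continue
        count = cell.get("execution_count")
        if count is None:
            problems.append(f"cell {i}: never executed")
        elif count < last_count:
            problems.append(f"cell {i}: executed out of order")
        else:
            last_count = count
        if "print(" in cell.source:
            problems.append(f"cell {i}: leftover print() call")
    return problems


if __name__ == "__main__":
    for nb_path in sys.argv[1:]:
        for problem in lint_notebook(nb_path):
            print(f"{nb_path}: {problem}")
```

The point is less these particular rules and more that failing fast on obviously unrunnable notebooks is cheap to automate.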

BAD AT STUFF

CompeAnansi posted:

It is very rare for anyone to just start as a junior data engineer with no prior experience, even with a CS degree. Most people start as backend engineers and move to DE because they like working with data, or they start as data analysts and move to DE because their company does what lots of companies do and expects analysts to handle both the DE work and the analyst work, and they realized they like the DE side better. I'd feel pretty uncomfortable suggesting that someone just dive right in and try to get a job as a DE with no prior data/engineering roles.

I started as a Data Engineer after my undergrad, but my initial work was all porting WPS/SAS code to Pyspark. That only required Python and SQL skills, and the rest I picked up on the job. I agree that it's a hard thing to do coming out of school. We didn't have any classes that covered big data and distributed computing.

I think that you're right about backend being a good entry point. General SQL and data experience are more important than the specific flavor of SQLite/MySQL/Postgres. It's also good to think about the roles at the company if you're looking to make an internal move in the future. Working at a data focused company in a non-DE role is a decent place to start.

Lib and let die posted:

Maybe you all can help me put a label on the job I've done/job I am looking for, because I seem to be struggling in that department right now in my job search.

My last couple of roles were doing "data services" - updating, suppressing/deleting records, ETL for new client data intake from legacy systems, creating and processing duplicate reports, and creating custom reports to client spec. Searching on the actual position name I held, "Data Services Specialist", ends up showing me Data Engineer jobs, which to me makes sense - not being responsible for the design, maintenance, or troubleshooting of the database architecture itself, I always sort of just assumed I fell into the "engineer" bucket rather than the "administrator" bucket, making me a Data Engineer, since the scope of my work was restricted to the data itself. (I'm also competent enough with PHP/HTML to throw together a basic web interface that can query/write to a database, but I haven't seen a ton of overlap in job postings there; most full-stack stuff wants experience with Ruby etc.)

From what I've been able to gather from job postings, what I've done in the past is more of a database administrator role, and if that's the case there's some upskilling I need to do on the creation/maintenance/policy admin side (or a lot of resume and interview bluffing to cover that blind spot while I try to actually catch up). But what with my lovely underemployed-after-getting-laid-off situation, if I'm going to spend money on a course I need to make sure it aligns with my expectations of the job role.

If your focus was on data quality issues and knowing the data assets inside and out, that sounds like a role I've worked with before called a Data Solutions Analyst. They weren't data scientists building new models or data engineers creating new systems, but if we wanted to know about the contents of specific columns or needed a report to answer a specific question (e.g. how many of X event did we have on the website last quarter?), we'd go to the DSA.

BAD AT STUFF

Oysters Autobio posted:

How do most organizations deal with ad-hoc data imports? i.e. CSV imports / spreadsheets

Somehow, as a DA team, we've gotten stuck in a place where 80% of our job became cleaning and importing ad-hoc spreadsheets and ad-hoc datasets into our BI tools so they're searchable. Now I'm spending 80% of my day hand-cleaning spreadsheets in pandas or pyspark.

What's the "best practice" way of dealing with this disorganized ad-hoc mess that somehow our team drew the short-straw on? Are there common open source spreadsheet import tools out there we could deploy that let business users map out columns to destination schemas? I feel like I'm missing some really obvious answer here.

What type of cleaning do you have to do? You mentioned mapping columns to destination schemas - is everyone using their own format for these things, and it's up to you to reconcile them?

You could try to get fancy with inferring column contents in code or you could build out a web app to collect some of this metadata for mappings as users submit their spreadsheets. But the obvious answer is to create a standardized template and refuse to accept any files that don't follow the format. Assuming, of course, that your team has the political capital to say "no" to these users.
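
If you do get the template mandate, the enforcement side can be pretty small. A rough sketch of "reject anything that doesn't match" in pandas - the schema and column names here are invented, so swap in whatever your template actually defines:

```python
# Sketch of validating an incoming spreadsheet against a destination schema
# before it gets anywhere near the BI tool. EXPECTED_COLUMNS is a stand-in
# for whatever the standardized template specifies.
import pandas as pd

EXPECTED_COLUMNS = {
    "event_date": "datetime",
    "region": "string",
    "revenue": "float",
}


def validate_upload(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Reject on shape first: any missing or unexpected column fails the file.
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    extra = set(df.columns) - set(EXPECTED_COLUMNS)
    if missing or extra:
        raise ValueError(f"rejected: missing={sorted(missing)}, extra={sorted(extra)}")

    # Then reject on content: any value that won't coerce to the declared
    # type fails the file, rather than silently becoming NaN downstream.
    for col, kind in EXPECTED_COLUMNS.items():
        if kind == "datetime":
            converted = pd.to_datetime(df[col], errors="coerce")
        elif kind == "float":
            converted = pd.to_numeric(df[col], errors="coerce")
        else:
            converted = df[col].astype("string")
        if converted.isna().sum() > df[col].isna().sum():
            raise ValueError(f"rejected: column {col!r} has values that aren't {kind}")
        df[col] = converted
    return df
```

Wrap something like that in a tiny web form and you've got the metadata-collection app version, too.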

BAD AT STUFF

Hughmoris posted:

Thanks for the insight!

Sorry, I forgot to post my answer to your original post.

I'm a senior level data engineer, angling for a role as a tech lead. My team has a lot of legacy processes that still run on our on-prem Hadoop cluster, but we're migrating to Databricks in Azure.

My days are generally split between use case work and capability work. Our company has a pretty sharp distinction between data science and data engineering. The data scientists are developing new statistical and ML approaches to creating data assets. It's my job to make sure that their code is scalable, that it works with our automation (Airflow, Databricks jobs, ADF), and that they're following good engineering standards with their code. "Productionizing" is the term we came up with to encompass all of that. It's PySpark development, but very collaborative.
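
To make "productionizing" a little more concrete, the refactor often looks something like this - a hand-wavy sketch with invented table and function names, not our actual code:

```python
# Illustrative only: turning notebook cells that hardcode paths and inline
# logic into a parameterized, testable function plus an entrypoint that the
# scheduler (Airflow, Databricks jobs, ADF) can call.
import sys

import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession


def score_customers(orders: DataFrame) -> DataFrame:
    """Pure transformation: easy to unit test against a tiny local DataFrame."""
    return (
        orders.groupBy("customer_id")
        .agg(
            F.sum("amount").alias("total_spend"),
            F.countDistinct("order_id").alias("order_count"),
        )
    )


def main(input_table: str, output_table: str) -> None:
    spark = SparkSession.builder.getOrCreate()
    result = score_customers(spark.table(input_table))
    result.write.mode("overwrite").saveAsTable(output_table)


if __name__ == "__main__":
    # Table names arrive as job parameters instead of hardcoded cell values.
    main(sys.argv[1], sys.argv[2])
```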

On the capability side, that's where I have more flexibility about what I'm doing. Developing new patterns, data pipelines to link together our different systems, that kind of thing. With the migration work that we're currently doing, there is more of that than we'd typically have. Still, that's a skill that I think is important for data engineers above junior level. You have a number of different tools to accomplish your goals of moving data between different systems, applying transformations, and producing data products or serving models. A lot of data engineering isn't "how do I do this?", but "what's the best way to do this?".

BAD AT STUFF

If pipelines A and B are specific to what you're doing in C, the easiest solution is to run them all in one larger pipeline with Execute Pipeline activities, like you were saying.

If A and B are separate or are used by other processes, then I'd look at it from a dependency management perspective rather than as directly preceding tasks. If you can programmatically check the status of A and B (look for output files? or use the ADF API to check pipeline runs?), you can begin C with an Until activity that does those checks with a Wait thrown in so you only check every 15 minutes or something. That's pretty much an Airflow Sensor, which I assume is what people were telling you to use.
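
In plain code, Until + Wait is just a bounded polling loop. Something like this sketch, where check_pipeline_succeeded() is a placeholder for whichever check you pick (output files, the ADF REST API, etc.):

```python
# Pseudo-sensor: what an ADF Until activity with a Wait inside boils down to.
# check_pipeline_succeeded() is a placeholder, not a real API call.
import time

POLL_INTERVAL_SECONDS = 15 * 60
TIMEOUT_SECONDS = 6 * 60 * 60  # give up eventually and alert instead of spinning


def check_pipeline_succeeded(name: str) -> bool:
    raise NotImplementedError("look for output files or query pipeline run status")


def wait_for_dependencies(names: list[str]) -> None:
    deadline = time.monotonic() + TIMEOUT_SECONDS
    pending = set(names)
    while pending:
        # Drop any dependency that has finished since the last check.
        pending = {n for n in pending if not check_pipeline_succeeded(n)}
        if not pending:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"dependencies never finished: {sorted(pending)}")
        time.sleep(POLL_INTERVAL_SECONDS)


# wait_for_dependencies(["pipeline_A", "pipeline_B"]), then kick off C.
```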

Good luck with ADF. Our team barely got into it, hated it, put our cloud migration on hold for like 6 months, and then switched to Databricks Workflows.

BAD AT STUFF

Generic Monk posted:

Yeah, the datasets are going to be combined a fair bit eventually, I think, so the latter solution is probably better. At the end of the day everything ends up in the data lake in a relatively consistently named format, so it would probably be possible to check for it. I guess you'd still need to trigger the transform pipeline on a schedule that roughly matches when the extract/load pipelines run, though, right? But these are daily loads so I don't think it really matters.

Yes. If you're checking to see if your upstream dependencies have been completed, you'd need to schedule the pipeline run so those checks can start. If they're daily runs, that's pretty easy.

Generic Monk posted:

I'm right in saying that their proposed solution of Event Grid/Azure Functions seems like an antipattern, right? I've not seen any resource recommending this solution. Event Grid seems to be designed for routing large amounts of streaming data, which this isn't, and Azure Functions seem to be for little atomic operations, which this isn't if the function has to be constantly running to retain state.

An event driven approach can absolutely be a good thing, but I'd be wary about going down that route unless there's a well established best practice to follow. My company has an internal team that created a dependency service. You can register applications and set what dependencies they have. The service will initiate your jobs after the dependencies are met, and then once your job completes you send a message back to the service so that processes that depend on you can do the same thing.

There are a lot of little things that can bite you with event driven jobs, though, which is why they created that service. If your upstream jobs don't run, what kind of alerting do you have to know that you're missing SLOs/SLAs? How do you implement the logic to trigger the job only once both dependencies are done? For a single dependency it's easy: when the message arrives in the queue, kick off your job.

Handling multiple jobs would be harder, and that's the thing I wouldn't want to do without docs to work from (which I didn't see for event grid in a couple minutes of searching). I suppose you could write an Azure function that is triggered on message arrival in queue A and checks to see if there's a corresponding message in queue B. If so, trigger pipeline C. If not, requeue the message from queue A with a delay (you can do this with Service Bus, not sure about Event Grid) so that you're not constantly burning resources with the Azure function.
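
To sketch the shape of that (using azure-servicebus rather than Event Grid, since Service Bus is where I know scheduled/delayed messages exist - the queue names and the trigger helper are invented, so treat this as pseudocode for the pattern, not a vetted implementation):

```python
# Rough shape of the "join two queues" function body. A message on queue-a
# means pipeline A finished; we check queue-b for B's message, and either
# trigger C or reschedule A's message with a delay instead of busy-waiting.
from datetime import datetime, timedelta, timezone

from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<service-bus-connection-string>"  # placeholder


def trigger_pipeline_c() -> None:
    # Invented helper: kick off pipeline C, e.g. via the ADF REST API.
    raise NotImplementedError


def on_message_from_a(body: str) -> None:
    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        with client.get_queue_receiver("queue-b", max_wait_time=5) as receiver:
            matches = receiver.receive_messages(max_message_count=1)
            if matches:
                receiver.complete_message(matches[0])
                trigger_pipeline_c()
                return
        # B hasn't reported yet: put A's message back with a delay so the
        # function isn't constantly burning resources re-checking.
        with client.get_queue_sender("queue-a") as sender:
            sender.schedule_messages(
                ServiceBusMessage(body),
                datetime.now(timezone.utc) + timedelta(minutes=15),
            )
```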

I think that would work, but I wouldn't be comfortable with it unless I had a lot of testing and monitoring in place. That's where I'd push back on the people making the suggestion. If there are docs or usage examples that show other people using the tools this way, then great. If not, that's a big red flag that we're misusing these managed services.
