Generic Monk
Oct 31, 2011

Oysters Autobio posted:


We have zero ownership of any of the systems, and our dataops and data engineer teams have other priorities (like deploying internal LLMs and other VP-bait projects).

The business side thinks being "data centric" means "look at all this data we're getting!".

There's another BI team I'm aware of that runs their Tableau dashboards from manually maintained spreadsheets. They've hired someone to literally go and do data entry into spreadsheets from correspondence and other unstructured corporate records that our bureaucracies continue to generate.

If you can't hook up automated ingestion for this stuff, and it's non-negotiable that users send it in manually, you need to set boundaries: either an automated way to reject submissions that don't conform to the template, or rejecting them yourself. Setting boundaries usually requires some buy-in from management. It's more of a social and political problem than anything else.

Or, if you can't set boundaries, you need to set aside budget for extra resources to do that work, like the other team has. Maybe it's best framed as that kind of dichotomy when you approach management: either users conform to this simple template, or they spend god knows how much of your time or another employee's time making up for the lack of standardisation.

Generic Monk
Oct 31, 2011

Just started a new job. Does anyone here use Azure Data Factory? Is there a way to set up dependencies between pipelines, including pipelines that have multiple dependencies?

I.e. if we have raw ingest pipelines A and B, and a transformation pipeline C that depends on both A and B succeeding, can it be configured so that C only runs once A and B have both completed successfully?

The only trigger that seems to do something like this is the dependent tumbling window trigger, but I'm wary of it since it seems designed for workloads much more frequent than the once-a-day morning runs we're currently doing. Also, setting up all those triggers is going to be kind of clunky.

https://learn.microsoft.com/en-us/azure/data-factory/tumbling-window-trigger-dependency

I'm really surprised this functionality doesn't seem to be fleshed out. I've been handed a bit of work started a year ago where they tried to work around it by sending events to Azure Event Grid, which then kicks off a function that runs continually to check whether the dependencies have been satisfied and then calls the dependent pipelines via the Azure API (I think). This seems like an overengineered solution that punches out of ADF, which is nominally supposed to be the orchestration platform, and it isn't really what Event Grid is designed for anyway.

While writing this, it occurred to me: can you create a new DAG/pipeline that calls other pipelines from within it and set up dependencies between them? But then that whole process would need to be triggered at a certain time (I think), so you'd lose flexibility if you wanted to kick off pipelines A and B at very specific times. IDK. Has anyone else dealt with this? Googled Reddit posts say 'just use Airflow', which isn't really an option.

Generic Monk
Oct 31, 2011

Feels like I've been bamboozled a bit since I'm new to proper data engineering work and orchestration, but it seems like the logical solution to 'we have some tasks that extract and load data, and then another task that transforms the data' is just to create another DAG/pipeline in which the transform task depends on the extract and load tasks.

Honestly this seems like the most logical and standard way to do it. My only concern is that you'd lose granular control of when the extract and load tasks kick off, and that it could get unwieldy as the thing grows in complexity. But probably not that unwieldy, since most of the transformation is sequestered into Databricks notebooks and the ingests are pretty modularised anyway.
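For posterity, here's roughly what I mean, defined through the azure-mgmt-datafactory Python SDK rather than clicking around the portal. All the names here are placeholders and I haven't actually run this against a real factory, so treat it as a sketch:

```python
# Rough sketch: a parent pipeline that runs ingest A and B, then the transform,
# via Execute Pipeline activities. Resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    ExecutePipelineActivity,
    PipelineReference,
    PipelineResource,
)

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
FACTORY_NAME = "<factory-name>"         # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

run_a = ExecutePipelineActivity(
    name="RunIngestA",
    pipeline=PipelineReference(reference_name="ingest_a"),
    wait_on_completion=True,  # parent waits for the child pipeline to finish
)
run_b = ExecutePipelineActivity(
    name="RunIngestB",
    pipeline=PipelineReference(reference_name="ingest_b"),
    wait_on_completion=True,
)
run_c = ExecutePipelineActivity(
    name="RunTransformC",
    pipeline=PipelineReference(reference_name="transform_c"),
    wait_on_completion=True,
    # C only starts once both A and B report Succeeded
    depends_on=[
        ActivityDependency(activity="RunIngestA", dependency_conditions=["Succeeded"]),
        ActivityDependency(activity="RunIngestB", dependency_conditions=["Succeeded"]),
    ],
)

client.pipelines.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "daily_master",
    PipelineResource(activities=[run_a, run_b, run_c]),
)
```

You'd then hang a single daily schedule trigger off daily_master, which is exactly where the 'you lose per-pipeline start times' tradeoff comes from.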

Generic Monk
Oct 31, 2011

BAD AT STUFF posted:

If pipelines A and B are specific to what you're doing in C, the easiest solution is to run them all in a larger pipeline with Execute Pipeline activities, like you were saying.

If A and B are separate or are used by other processes, then I'd look at it from a dependency management perspective rather than as directly preceding tasks. If you can programmatically check the status of A and B (look for output files? or use the ADF API to check pipeline runs?), you can begin C with an Until activity that does those checks with a Wait thrown in so you only check every 15 minutes or something. That's pretty much an Airflow Sensor, which I assume is what people were telling you to use.

Good luck with ADF. Our team barely got into it, hated it, put our cloud migration on hold for like 6 months, and then switched to Databricks Workflows.

Yeah, the datasets are going to be combined a fair bit eventually, I think, so the latter solution is probably better. At the end of the day everything ends up in the data lake in a relatively consistently named format, so it would probably be possible to check for it. I guess you'd still need to trigger the transform pipeline on a schedule that roughly lines up with when the extract/load pipelines run, though, right? But these are daily loads, so I don't think it really matters.

Am I right in saying that their proposed Event Grid/Azure Functions solution seems like an antipattern? I've not seen any resource recommending it. Event Grid seems designed for routing large volumes of streaming events, which this isn't, and Azure Functions seems meant for small atomic operations, which this isn't either if the function has to run constantly to retain state.

Yeah, ADF seems nice and friendly, but you rapidly run up against its limits. They're all in on Azure, though, so I don't think that's changing. At least we have the escape hatch of just calling Databricks notebooks when we run into a wall.
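If we do go the dependency-check route, the check itself could probably just live in one of those notebooks (or the first step of the transform pipeline). Something dumb like this; the container name and path layout are made up and untested, assuming the loads drop date-partitioned files into blob storage:

```python
# Sketch of a "poor man's sensor": poll the lake for today's ingest outputs
# before letting the transform proceed. Container/path layout is hypothetical.
import time
from datetime import date

from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

ACCOUNT_URL = "https://<storage-account>.blob.core.windows.net"  # placeholder
CONTAINER = "datalake"                                           # placeholder

EXPECTED_PREFIXES = [
    f"raw/ingest_a/{date.today():%Y/%m/%d}/",
    f"raw/ingest_b/{date.today():%Y/%m/%d}/",
]

def outputs_present(container: ContainerClient, prefix: str) -> bool:
    """True if at least one blob exists under the given prefix."""
    return any(container.list_blobs(name_starts_with=prefix))

def wait_for_dependencies(timeout_s: int = 4 * 3600, poll_s: int = 15 * 60) -> None:
    container = ContainerClient(ACCOUNT_URL, CONTAINER, credential=DefaultAzureCredential())
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(outputs_present(container, p) for p in EXPECTED_PREFIXES):
            return  # both ingests have landed; let the transform run
        time.sleep(poll_s)  # same idea as ADF's Until + Wait, or an Airflow sensor
    raise TimeoutError("Upstream ingests never landed; failing the transform run")

if __name__ == "__main__":
    wait_for_dependencies()
```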

Generic Monk
Oct 31, 2011

CompeAnansi posted:

Why are they using ADF Pipelines instead of Workflow Orchestration Manager, given that they're hiring and paying data engineers? Feels like running up against the limits of Pipelines means moving to a proper system like Workflow Orchestration Manager (aka Airflow).

No idea, I assume they wanted to keep everything in the Microsoft stack or something. This project is still pretty nascent, so maybe it just hasn't been considered. And I didn't know Azure had a managed Airflow service; that's great! This is exactly the problem Airflow is meant to solve. I'll bring it up.
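The A/B/C thing I was asking about is basically the hello-world of an Airflow DAG. Something like this (task bodies obviously made up), which is presumably why everyone on Reddit keeps saying 'just use Airflow':

```python
# Minimal Airflow DAG: two ingests fan in to one transform.
# Assumes Airflow 2.x; the callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_a():
    ...  # call the real ingest here

def ingest_b():
    ...

def transform_c():
    ...

with DAG(
    dag_id="daily_loads",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    a = PythonOperator(task_id="ingest_a", python_callable=ingest_a)
    b = PythonOperator(task_id="ingest_b", python_callable=ingest_b)
    c = PythonOperator(task_id="transform_c", python_callable=transform_c)

    [a, b] >> c  # C runs only after both A and B succeed
```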

Generic Monk
Oct 31, 2011

BAD AT STUFF posted:

Yes. If you're checking whether your upstream dependencies have completed, you'd need to schedule the pipeline run so those checks can start. If they're daily runs, that's pretty easy.

An event driven approach can absolutely be a good thing, but I'd be wary about going down that route unless there's a well established best practice to follow. My company has an internal team that created a dependency service. You can register applications and set what dependencies they have. The service will initiate your jobs after the dependencies are met, and then once your job completes you send a message back to the service so that processes that depend on you can do the same thing.

There are a lot of little things that can bite you with event driven jobs, though, which is why they created that service. If your upstream jobs don't run, what kind of alerting do you have to know that you're missing SLOs/SLAs? How do you implement the logic to trigger the job only once both dependencies are done? For a single dependency it's easy: when the message arrives in the queue, kick off your job.

Handling multiple jobs would be harder, and that's the thing I wouldn't want to do without docs to work from (which I didn't see for Event Grid in a couple of minutes of searching). I suppose you could write an Azure Function that triggers on message arrival in queue A and checks whether there's a corresponding message in queue B. If so, trigger pipeline C. If not, requeue the message from queue A with a delay (you can do this with Service Bus, not sure about Event Grid) so that you're not constantly burning resources with the Azure Function.

I think that would work, but I wouldn't be comfortable with it unless I had a lot of testing and monitoring in place. That's where I'd push back on the people making the suggestion. If there are docs or usage examples that show other people using the tools this way, then great. If not, that's a big red flag that we're misusing these managed services.

Yeah, the added complexity of this is what skeeves me out. I think they got the idea from the SWE side of the organisation, which is heavily into Kubernetes/microservices from what I can see. We're a small team, so productionising this doesn't seem like the best use of resources when something like managed Airflow would work just as well for our use case, with documentation and a huge community of practice already behind it.
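Just to illustrate the kind of thing we'd be signing up to own: this is roughly the coordination function you're describing (Service Bus rather than Event Grid, all names and settings invented, untested), before you've even added alerting or dedupe:

```python
# Sketch only: an Azure Function (v2 Python model) that fires on a message in
# queue A, checks queue B for the matching signal, and either kicks off the
# transform pipeline or re-schedules the A message for later. Queue names,
# env vars and connection settings are hypothetical.
import os
from datetime import datetime, timedelta, timezone

import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.servicebus import ServiceBusClient, ServiceBusMessage

app = func.FunctionApp()

@app.service_bus_queue_trigger(
    arg_name="msg", queue_name="ingest-a-done", connection="ServiceBusConnection"
)
def coordinate(msg: func.ServiceBusMessage) -> None:
    conn_str = os.environ["ServiceBusConnection"]
    with ServiceBusClient.from_connection_string(conn_str) as sb:
        # Has ingest B announced completion yet?
        with sb.get_queue_receiver("ingest-b-done") as receiver:
            b_done = receiver.peek_messages(max_message_count=1)

        if b_done:
            # Both dependencies satisfied: run the transform pipeline.
            adf = DataFactoryManagementClient(
                DefaultAzureCredential(), os.environ["SUBSCRIPTION_ID"]
            )
            adf.pipelines.create_run(
                os.environ["RESOURCE_GROUP"], os.environ["FACTORY_NAME"], "transform_c"
            )
        else:
            # B hasn't landed yet: put a copy of A's message back with a delay
            # instead of leaving a function running to hold state.
            with sb.get_queue_sender("ingest-a-done") as sender:
                sender.schedule_messages(
                    ServiceBusMessage(msg.get_body()),
                    datetime.now(timezone.utc) + timedelta(minutes=15),
                )
```

Compare that to the [a, b] >> c line in the Airflow sketch above, and that's before handling duplicate 'done' messages or the case where B never arrives at all.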
