|
Oysters Autobio posted:
If you can't hook stuff up to ingest this stuff in an automated way, and it's non-negotiable that users manually send stuff, you need to set boundaries: either have an automated way to reject submissions that don't conform to the template, or reject them yourself. Setting boundaries usually requires some buy-in from management; it's more of a social and political problem than anything else. And if you can't set boundaries, you need to set aside budget for extra resources to do that work, like the other team has. Maybe it's best framed as that kind of dichotomy when you approach management: either users conform to this simple template, or they spend god knows how much of your or another employee's time making up for the lack of standardisation.
|
# ¿ Feb 27, 2024 16:31
|
|
# ¿ May 17, 2024 18:35
|
Just started a new job. Does anyone here use Azure Data Factory? Is there a way to set up dependencies between pipelines that have multiple upstream dependencies? I.e. if we have raw ingest pipelines A and B, and a transformation pipeline C that depends on both A and B succeeding, can it be configured so that C only runs when A and B both complete successfully?

The only trigger that seems to do something like this is the dependent tumbling window trigger, but I'm wary of it: it seems designed for workloads much more frequent than the 'once a day in the morning' runs we're currently doing, and setting up all those triggers would be kind of clunky. https://learn.microsoft.com/en-us/azure/data-factory/tumbling-window-trigger-dependency

I'm really surprised this functionality doesn't seem to be fleshed out. I've been handed a bit of work, started a year ago, that tries to work around it by sending events to Azure Event Grid, which then kicks off a function that runs continually, checks whether the dependencies have been satisfied, and then calls the dependent pipelines via the Azure API (I think). That seems like an overengineered solution that punches out of ADF, which is nominally supposed to be the orchestration platform, and it isn't really what Event Grid is designed for anyway.

While writing this, it occurred to me: can you create a new DAG/pipeline that calls the other pipelines from within it, with dependencies between them? But then that whole parent process would need to be triggered at a certain time (I think), so you'd lose flexibility if you wanted to kick off pipelines A and B at very specific times. IDK. Anyone else dealt with this? Googled Reddit posts say 'just use Airflow', which isn't really an option.
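For reference, the 'pipeline that calls other pipelines' idea maps onto ADF's Execute Pipeline activity, where activity-level `dependsOn` expresses the fan-in. A rough sketch of what that parent pipeline's JSON might look like (all pipeline names here are hypothetical, not from the actual project):

```json
{
  "name": "PL_Daily_Master",
  "properties": {
    "activities": [
      {
        "name": "Run_Ingest_A",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "PL_Ingest_A", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "Run_Ingest_B",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "PL_Ingest_B", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "Run_Transform_C",
        "type": "ExecutePipeline",
        "dependsOn": [
          { "activity": "Run_Ingest_A", "dependencyConditions": [ "Succeeded" ] },
          { "activity": "Run_Ingest_B", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "pipeline": { "referenceName": "PL_Transform_C", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      }
    ]
  }
}
```

A and B have no `dependsOn`, so they start in parallel; C waits for both to report `Succeeded`. The trade-off mentioned above still applies: the schedule trigger lives on the parent, so A and B can't be kicked off at independent times.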
|
# ¿ Apr 25, 2024 12:08
|
Feels like I've been bamboozled a bit since I'm new to proper data engineering work and orchestration, but it seems like the logical solution to 'we have some tasks that extract and load data, and another task that transforms the data' is just to create another DAG that makes the transform tasks dependencies of the extract and load tasks. Honestly that seems like the most logical and standard way to do it. My only concerns are that you'd lose granular control of when the extract and load tasks kick off, and that it could get unwieldy as the thing grows in complexity. Probably not that unwieldy, though, since most of the transformation is sequestered in Databricks notebooks and the ingests are pretty modularised anyway.
|
# ¿ Apr 25, 2024 12:29
|
BAD AT STUFF posted:
If pipelines A and B are specific to what you're doing in C, the easiest solution is to run them all in a larger pipeline with Execute Pipeline activities, like you were saying.

Yeah, the datasets are going to be combined a fair bit eventually, I think, so the latter solution is probably better. At the end of the day everything ends up in the data lake in a relatively consistently named format, so it would probably be possible to check for it. I guess you'd still need to trigger the transform pipeline on a schedule that roughly conforms to when the extract/load pipelines run, though, right? But these are daily loads, so I don't think it really matters.

Am I right in saying that their proposed Event Grid / Azure Functions solution seems like an antipattern? I've not seen any resource recommending it. Event Grid seems to be designed for routing large amounts of streaming data, which this isn't, and Azure Functions seem to be for little atomic operations, which this isn't if the function has to be constantly running to retain state.

Yeah, ADF seems nice and friendly, but you rapidly run up against its limits. They're all in on Azure, though, so I don't think that's changing. At least we have the escape hatch of just calling Databricks notebooks when we run into a wall.
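For what it's worth, the Event Grid / Functions workaround described above boils down to a stateful fan-in gate: accumulate success events until all upstreams have reported in. A minimal sketch of that logic, with hypothetical pipeline names (in the real setup this state is exactly why the function has to keep running, or persist state somewhere durable):

```python
# Hedged sketch of the dependency gate the Event Grid / Functions
# workaround implements. Pipeline names are hypothetical.
REQUIRED_UPSTREAMS = {"ingest_A", "ingest_B"}
succeeded: set[str] = set()

def on_pipeline_event(pipeline: str, status: str) -> bool:
    """Record a completion event; return True once C is clear to run."""
    if status == "Succeeded":
        succeeded.add(pipeline)
    # The gate opens only when every required upstream has succeeded.
    return REQUIRED_UPSTREAMS <= succeeded
```

The gate returns False after the first ingest's success event and True after the second, at which point the function would call the ADF REST API to trigger the transform pipeline. It's a lot of moving parts to reimplement what an orchestrator gives you declaratively.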
|
# ¿ Apr 25, 2024 15:29
|
CompeAnansi posted:
Why are they using ADF Pipelines instead of Workflow Orchestration Manager, given that they're hiring and paying data engineers? Running up against the limits of Pipelines feels like the cue to move to a proper system like Workflow Orchestration Manager (aka Airflow).

No idea; I assume they wanted to keep everything in the Microsoft stack or something. This project is still pretty nascent, so maybe it just hasn't been considered. And I didn't know Azure had a managed Airflow service, that's great! Exactly the problem Airflow is meant to solve. I'll bring it up.
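To see why this is trivial in Airflow terms: the fan-in is just `[ingest_a, ingest_b] >> transform_c` in a DAG file, and the scheduling waves that dependency implies can be sketched with the standard library alone (task names here are hypothetical):

```python
# Sketch of the A,B -> C fan-in using only the stdlib, to show the
# execution order an orchestrator like Airflow would derive.
from graphlib import TopologicalSorter

# Mapping reads: "transform_C depends on ingest_A and ingest_B".
deps = {"transform_C": {"ingest_A", "ingest_B"}}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose upstreams are all done
    waves.append(ready)
    ts.done(*ready)

# waves == [["ingest_A", "ingest_B"], ["transform_C"]]
```

The two ingests land in the first wave (runnable in parallel) and the transform in the second, which is exactly the behaviour the thread has been trying to coax out of ADF triggers.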
|
# ¿ Apr 27, 2024 11:56
|
|
BAD AT STUFF posted:
Yes. If you're checking to see whether your upstream dependencies have completed, you'd need to schedule the pipeline run so those checks can start. If they're daily runs, that's pretty easy.

Yeah, the added complexity of this is what skeeves me out. I think they got the idea from the SWE side of the organisation, which is heavily into Kubernetes/microservices from what I can see. We're a small team, so productionising this does not seem like the best use of resources when something like managed Airflow would work just as well for our use case, with documentation and a huge community of practice already behind it.
|
# ¿ Apr 27, 2024 19:15