Oysters Autobio
Mar 13, 2017
So I'm a data analyst who's been trying to learn more data engineering (DE) work. Meanwhile, our DE team is heavily overworked on projects and really has no time for things like data warehousing or dimensional models.

I'm interested in dbt, particularly in leveraging macros so I can parameterize some of the repetitive transform logic.

Basically our situation is that the DE team outputs their pipelines and ad-hoc ETLs as tables into Impala. They don't do any cleaning or transformation on the data, and they don't try to map it to any sort of standard set of schemas.

As a result we have a sprawling catalogue of tables with different schemas, many of which contain repeating types of data that could very well be assembled into a few analytical tables. The good news is that there's a metadata API built on top of a metadata SQL database, but very little is tagged (I could crowdsource some tagging if needed).

I have an idea in mind for a dimensional model and how it might look in terms of fact and dim tables, and I could go table by table to hand-construct the transform code, but I feel like there has to be a more idiomatic way to parameterize this process so it's somewhat reusable in the future. Something like: "for every table that's been ETL'd with a column tagged as 'dog names', append data from that column to the master table 'dim_dognames'." Is this doable just in dbt, or would I have to do this kind of transform further upstream in PySpark? If it's the latter, rather than hardcoding all the code and logic, any tips on managing schema mapping (i.e. something like letting a user pass a YAML file that maps each column name from the source to a common schema in the target table)?
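To make it concrete, here's a rough sketch of the kind of thing I'm picturing on the PySpark side. The YAML layout, table names, and column names are all made up, so treat it as a sketch of the pattern rather than working code for our setup.

code:

# Hypothetical mappings.yaml:
#
# dim_dognames:
#   target_column: dog_name
#   sources:
#     - source_table: shelter_intake_2023
#       source_column: dog_name
#     - source_table: vet_visits
#       source_column: patient_name

import yaml
from functools import reduce
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

with open("mappings.yaml") as f:
    mappings = yaml.safe_load(f)

for target_table, spec in mappings.items():
    # Pull the tagged column out of each source table under a common name.
    pieces = [
        spark.table(s["source_table"])
             .select(F.col(s["source_column"]).alias(spec["target_column"]),
                     F.lit(s["source_table"]).alias("source_table"))
        for s in spec["sources"]
    ]
    combined = reduce(DataFrame.unionByName, pieces).dropDuplicates([spec["target_column"]])
    combined.write.mode("overwrite").saveAsTable(target_table)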

Also, is there a term or concept that describes an ETL pattern like this (i.e. using common metadata tags to drive parameterized transform logic into analytical tables)? Is this a common practice? Obviously the "right" answer here is to get the clients sending us this data to conform to a standard schema before it's dumped into our data lake, but bureaucracy, politics and very poor data literacy are hampering that.

I've run this idea by some more experienced folks and they flagged something like Avro or Protobuf, but I don't quite understand how you would use serialization libraries for this kind of thing.



Oysters Autobio
Mar 13, 2017

monochromagic posted:

With respect to your dimensional modelling approach, I think this should be possible with dbt but you probably need code generation on top. What you are touching on here is also related to data contracts, a feature dbt Core supports in its newer versions. The reason people are recommending Avro or Protobuf is that they are industry standards for describing schemas (or contracts) between systems, and both can be used for code generation.
I think data mesh and data contracts might be interesting concepts for you to research - I hope our data lake will mature enough over the next few years that these concepts are achievable for us.

Yeah this is interesting and I'd love to learn more practical skills here, but I can't seem to find anything on data contracts and data mesh that isn't SEO-generated VP-bait nonsense. Is there any actual content that showcases demos, code bases or examples of how this works?
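The closest I've gotten to a mental model is that a data contract boils down to an agreed-upon schema (Avro in this sketch) that producers validate against before anything lands in the lake. This is just my rough understanding, and the field names are invented:

code:

# A "contract" as I understand it: both sides agree on a schema, the producer
# validates records against it before publishing, and consumers can generate
# code or table definitions from the same file. Field names are invented.
from fastavro import parse_schema
from fastavro.validation import validate

dim_dogname_contract = parse_schema({
    "type": "record",
    "name": "DogName",
    "namespace": "analytics.dims",
    "fields": [
        {"name": "dog_name", "type": "string"},
        {"name": "source_table", "type": "string"},
        {"name": "loaded_at", "type": ["null", "string"], "default": None},
    ],
})

record = {"dog_name": "Rex", "source_table": "shelter_intake_2023", "loaded_at": None}
validate(record, dim_dogname_contract)  # raises ValidationError if the record breaks the contract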

Additionally, I'm a bit skeptical that data mesh actually works in practice unless the org is willing to invest in the technical expertise needed to make it happen. Each domain having its own data infrastructure team sounds like a pipe dream unless you're afloat in VC cash or something, and a data mesh without subject matter experts as the main participants sounds like it won't work.

The org I support barely even uses spreadsheets, let alone understands how structured data works or what a schema even is. I asked for a schema change to include a really useful property in a data model used for managing memorandum-style reports, and I got blank looks and a response of "Why? People can just write whatever they want in the body of the text".

Oysters Autobio
Mar 13, 2017
How do most organizations deal with ad-hoc data imports, i.e. CSV imports and one-off spreadsheets?

Somehow, as a DA team, we've gotten stuck in a place where 80% of our job is cleaning and importing ad-hoc spreadsheets and datasets into our BI tools so they're searchable. I'm now spending 80% of my day hand-cleaning spreadsheets in pandas or PySpark.

What's the "best practice" way of dealing with this disorganized ad-hoc mess that our team somehow drew the short straw on? Are there common open-source spreadsheet import tools we could deploy that let business users map their columns to destination schemas? I feel like I'm missing some really obvious answer here.
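For context, this is roughly the per-file step I keep rewriting by hand and would love a tool (or the submitters themselves) to handle. The file name, mapping and destination columns are all hypothetical:

code:

# The manual step I keep repeating: take someone's ad-hoc spreadsheet and
# rename their columns to our destination schema. File name, mapping and
# destination column names are hypothetical.
import pandas as pd

COLUMN_MAPPING = {          # "their column" -> "our destination column"
    "Dog Name": "dog_name",
    "Intake Date": "intake_date",
    "Shelter": "source_shelter",
}

df = pd.read_excel("adhoc_submission.xlsx")
df = df.rename(columns=COLUMN_MAPPING)
df = df[list(COLUMN_MAPPING.values())]                                   # keep only mapped columns
df["intake_date"] = pd.to_datetime(df["intake_date"], errors="coerce")   # coerce their date formats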

Oysters Autobio
Mar 13, 2017

BAD AT STUFF posted:

What type of cleaning do you have to do? You mentioned columns to destination schemas, is everyone using their own format for these things and it's up to you to reconcile them?

Yup, everyone is using their own formats and we're responsible for mapping them to destination schemas.

quote:

You could try to get fancy with inferring column contents in code or you could build out a web app to collect some of this metadata for mappings as users submit their spreadsheets. But the obvious answer is to create a standardized template and refuse to accept any files that don't follow the format. Assuming, of course, that your team has the political capital to say "no" to these users.

Unfortunately I've been very vocal about this issue but have been met with shrugs as if there's nothing better we could be doing with our time. Our team lead doesn't even have data analysis experience so they barely understand the issues and are terrible at going to bat for us.

The template is probably the right idea, but my concern is that people simply won't use it, and will just keep not bothering to get that data into an actual warehouse we can use.

What ends up happening is that these spreadsheets get dumped as attachments into our internal content management system (i.e. where people save corporate Word docs, emails, etc.), and when we're actually asked to do real data analysis, that's the data they point to.

I'm just surprised there aren't any existing open-source stacks or apps for this kind of thing, but maybe this stupidity has been so normalized for me that I'm assuming everyone else deals with it too.
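If we did go the template route, I'm picturing the enforcement side being as simple as something like this, so non-conforming files just bounce back to the submitter automatically. The template columns here are made up:

code:

# Minimal "standardized template" enforcement: the template defines the
# expected columns, and any submission that doesn't match gets bounced back.
# Template column names are made up.
import pandas as pd

TEMPLATE_COLUMNS = ["dog_name", "intake_date", "source_shelter"]

def check_submission(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file conforms."""
    df = pd.read_excel(path)
    problems = []
    missing = set(TEMPLATE_COLUMNS) - set(df.columns)
    extra = set(df.columns) - set(TEMPLATE_COLUMNS)
    if missing:
        problems.append(f"missing required columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    if df.empty:
        problems.append("file has no rows")
    return problems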

have you seen my baby posted:

You're doing all of the work associated with organizing and thinking about the data. Do you think your business users are going to willingly start doing the work that you are currently doing for them?

No, I don't think they'd be willing.

Despite there being a big appetite for BI and data analysis, there's zero appetite for building the actual systems to enable it, and this work is seen as something for us as a "support" team to deal with.

We have zero ownership of any of the systems, and our dataops and data engineer teams have other priorities (like deploying internal LLMs and other VP-bait projects).

The business side thinks being "data centric" is "look at all this data we're getting!".

There's another BI team I'm aware of that runs their Tableau dashboards from manually maintained spreadsheets. They've hired someone to literally do data entry into spreadsheets from correspondence and other unstructured corporate records that our bureaucracies keep generating.


Oysters Autobio
Mar 13, 2017
Data analyst / BI analyst here, interested in upskilling into a more direct data engineering role, particularly with a focus on analytics (i.e. data warehousing, modelling, ETL, etc.). Currently most of my work is typical ad-hoc analysis and spreadsheet wrangling.

Aside from taking on work at the office that's more focused on modelling (i.e. Kimball / data warehousing), what are folks' recommendations for tech professionals looking to upskill into more DE-focused roles?

Online MOOCs? Bootcamps? Part-time polytechnic / university?

Particularly on the last two: I've heard that bootcamps have a bad rep outside of web dev / front-end circles. Is that true?

I've previously read takes along the lines of "you couldn't possibly learn from a bootcamp what I learned in my CS or SWE undergrad/graduate program". If people feel that's the case, I'm very open to that insight and to any suggestions (it's not like I can go back in time and change my polytechnic program). Part-time undergraduate programs? Post-grad or post-bac polytechnic programs? Save money and go back to school full-time in the most data-engineering-focused program I can find?

Alternatively, I've heard that doing side projects to build a portfolio is valued, but I'm struggling to come up with a side project that's doable as a portfolio piece while still being genuinely related to data engineering.


Oysters Autobio
Mar 13, 2017

monochromagic posted:

Maybe you'd be interested in pursuing something akin to analytics engineering

So conceptually, absolutely, yeah: I would like to do DE work as close to the analytics side as possible.

The only thing is that I haven't been able to find material on the domain that isn't just dbt promotional blogs and such. Are employers actually hiring for positions like this? And if so, outside of maintaining and creating dbt models, what else does the role entail?

Not slagging dbt here, but I haven't seen anything like "analytics engineering with Python", or really anything describing a stack that isn't dbt, so I'm skeptical.

Oysters Autobio
Mar 13, 2017
Edit: Nevermind



Oysters Autobio
Mar 13, 2017
Anyone have good resources or guides on functional programming styles in PySpark?

I've seen a few style guides for PySpark, but I'd really like some more in-depth walkthroughs with concrete steps and/or practice guides for common data engineering patterns in PySpark. Sort of "idiomatic PySpark", or whatever you want to call it (a rough sketch of what I mean is at the bottom of this post).

Bonus points for:

- some kind of useful data testing guide for data pipelines, or even just ad-hoc / batch ETL

- ETL practices with Jupyter notebooks (we don't have any orchestration or workflow system like Airflow; it's all just notebooks)
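By "functional style" I mean something like the sketch below: small pure functions that each take and return a DataFrame, chained with .transform() instead of one giant script mutating a dataframe in place. Table and column names are invented:

code:

# Small pure functions chained with DataFrame.transform so the pipeline reads
# top to bottom. Table and column names are invented; this is just the shape
# of the style I'm asking about.
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

def standardize_names(df: DataFrame) -> DataFrame:
    return df.withColumn("dog_name", F.initcap(F.trim(F.col("dog_name"))))

def drop_unnamed(df: DataFrame) -> DataFrame:
    return df.filter(F.col("dog_name").isNotNull())

def add_load_metadata(source: str):
    # Returns a transform function, so parameters stay out of the pipeline body.
    def _inner(df: DataFrame) -> DataFrame:
        return (df.withColumn("source_table", F.lit(source))
                  .withColumn("loaded_at", F.current_timestamp()))
    return _inner

clean = (
    spark.table("shelter_intake_2023")
         .transform(standardize_names)
         .transform(drop_unnamed)
         .transform(add_load_metadata("shelter_intake_2023"))
)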
