CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

UU TEE EFF posted:

I am a newbie to both programming and SQL in general. I can follow a guide and understand what I am doing, but that is about it. But I want to expand my skills a little in managing and analyzing a database. I think I want to do it by creating a local database on my Windows machine and then connecting to it with Power BI. I was able to follow the Python tutorial https://docs.python.org/3/library/sqlite3.html#tutorial to create an SQLite3 database. I then connected to it through Power BI. So that part is good.

Am I on the right track using SQLite with Python to learn this? The number of ways to do this is a little overwhelming for a newbie like myself, so I am not sure if I should change tack.

The next step I want to take is to create a new database with my CSV data files and start building the table relationships. Anyone know a good tutorial on that? In particular, I want to connect to a directory of CSV files that are stored on my PC in an organized file structure. I see some guides for loading one file, but I am not sure how to load a directory of files.

It is very rare for anyone to just start as a junior data engineer with no prior experience, even with a CS degree. Most people start as backend engineers and move to DE because they like working with data, or they start as data analysts and move to DE because their company does what lots of companies do and expects analysts to handle both the DE work and the analyst work, and they realize they like the DE side better. I'd feel pretty uncomfortable suggesting that someone just dive right in and try to get a job as a DE with no prior data/engineering roles.
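
As for the actual SQLite/CSV question above: here's a minimal sketch of loading every CSV in a directory into a local SQLite database, assuming pandas is installed and each file has a header row. The folder path, database name, and one-table-per-file convention are just assumptions for illustration.

Python code:
import sqlite3
from pathlib import Path

import pandas as pd

csv_dir = Path(r"C:\data\csvs")   # hypothetical folder of CSV files
db_path = "local_data.db"         # hypothetical SQLite database file

with sqlite3.connect(db_path) as conn:
    # rglob walks subfolders too, matching an "organized file structure"
    for csv_file in sorted(csv_dir.rglob("*.csv")):
        table_name = csv_file.stem  # e.g. sales_2023.csv -> table sales_2023
        df = pd.read_csv(csv_file)
        # "replace" keeps reruns idempotent; use "append" for incremental files
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        print(f"Loaded {len(df)} rows into {table_name}")
Power BI can then point at local_data.db the same way it did for the tutorial database.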


CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
For OLTP DBs, the most popular one for startups in the US (and medium-sized newer tech companies) seems to be Postgres, unless they've tied themselves to the Microsoft stack, in which case they're on SQL Server. MySQL is still somewhat popular in the US but is very popular in Asia, especially China. No one uses Oracle if they can help it; it's basically only large, old, non-tech companies, which means there are a ton of jobs that use it but no one likes them.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Hughmoris posted:

For those of you currently working as Data Engineers, what does your day-to-day look like? What technologies are you working with?

Our stack is:

1. Self-hosted Airflow as our orchestrator, which executes two kinds of jobs:
a. Python pipelines that are either replicating tables from our production db or pulling data from APIs
b. DBT jobs that create derived tables from those base-level tables created/updated by the Python pipelines (a rough sketch of this DAG shape follows the list)

2. A "warehouse", which is a managed Postgres instance because it is cheap and our data sizes are small enough that we can get away with it for now.

Changes we're considering are to either move to a lakehouse architecture with Iceberg + Trino, or to move to a proper columnar db, like self-hosted StarRocks (a fork of Apache Doris).

My day-to-day at my startup is:

1. Checking the orchestrator status page - Mostly just rerunning failed jobs and/or fixing code if, e.g., an API change caused the failure.
2. Fielding tickets for data requests - This is the bulk of my job. If the request requires a brand new source, then it'll require writing a new pipeline job to pull data from an API. But often it just requires new DBT transforms to present the existing data in a different way so we can hook it to Tableau or Geckoboard.
3. Occasional calls - I am blessed to have been able to set hard boundaries (and be supported in this by my boss) regarding the number of meetings I am in, since I am still an IC even though I am the data lead.
4. Occasional work to re-architect existing systems - We've always chosen the simplest/cheapest route, with the thought that when it breaks as our data volumes grow, we'll revisit it and choose a more complicated or more expensive setup. Well, our data volumes have continued to grow and things do break as expected. When that happens, this becomes my main focus until it's resolved, putting incoming tickets on hold.

CompeAnansi fucked around with this message at 21:04 on Jan 3, 2024

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Hughmoris posted:

Do your data engineers also administer your Postgres instance, or do you have a dedicated DBA team for that?

We're paying for managed Postgres so it is largely handled for us, but anything that needs doing I do myself. No DBA team.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

lazerwolf posted:

Hello Data Engineering thread! I'm sort of a hybrid Full Stack Engineer / Data Engineer in the biotechnology space, meaning I can build front-facing UIs, backend APIs, and data pipelines.

We're an AWS shop, so we use MWAA, S3, Redshift, etc. React for any UI application, and Django for REST and GraphQL APIs to serve the frontends and house the business logic for data pipelines.

Does anyone have a great approach for syncing external databases into Redshift as-is (i.e., no transformation)?

Since no one is replying, I'll mention what we do for Postgres-to-Postgres syncs, given that Redshift is based on Postgres and has the COPY command (although it may be somewhat restricted). Assuming that none of the tables in the DB are larger than memory, you can use:
code:
COPY table TO STDOUT
to pull the table into an in-memory buffer, then immediately push it back out to the destination (after rewinding the buffer with file.seek(0)) with:
code:
COPY table FROM STDIN
As I said, I'm not 100% sure how that'll carry over to Redshift (or what your source DBs are), but it's a potential starting point for high-speed duplication.
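
To make that concrete, here's a minimal sketch of the pattern using psycopg2; the connection strings and table name are placeholders, and it assumes the table fits comfortably in memory.

Python code:
import io

import psycopg2

src = psycopg2.connect("postgresql://user:pass@source-host/db")
dst = psycopg2.connect("postgresql://user:pass@dest-host/db")

buf = io.StringIO()
with src.cursor() as cur:
    cur.copy_expert("COPY my_table TO STDOUT WITH (FORMAT csv)", buf)

buf.seek(0)  # rewind before reading the buffer back out
with dst.cursor() as cur:
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", buf)
dst.commit()

src.close()
dst.close()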

EDIT: A cursory Google search after posting turned this up, though: https://stackoverflow.com/questions/31603429/redshift-copy-from-stdin-using-java-program. Sounds like you'll need to copy it to S3 first, then you can use the
code:
COPY table FROM 's3://...'
command instead, but this is significantly less efficient because you have to store the table in an intermediate staging location rather than just temporarily in memory.
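
If you go that route, the shape is roughly the following. This is only a sketch under assumptions: boto3 plus a psycopg2 connection to the cluster, and the bucket, key, IAM role, and table names are all placeholders.

Python code:
import io

import boto3
import psycopg2

# tiny stand-in for the CSV dumped from the source database
buf = io.BytesIO(b"id,name\n1,example\n")

s3 = boto3.client("s3")
buf.seek(0)
s3.upload_fileobj(buf, "my-staging-bucket", "loads/my_table.csv")

rs = psycopg2.connect("postgresql://user:pass@my-cluster:5439/db")
with rs.cursor() as cur:
    cur.execute(
        """
        COPY my_table
        FROM 's3://my-staging-bucket/loads/my_table.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS CSV;
        """
    )
rs.commit()
rs.close()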

CompeAnansi fucked around with this message at 05:19 on Jan 5, 2024

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
Has anyone tried running dbt over Iceberg yet (with whatever query layer you prefer)? I know views aren't supported on Iceberg yet, but you should in theory be able to materialize as a table. But the process for creating a table in Iceberg is way more complicated than it is in a traditional db, so I'm not sure if the system can pull it off.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Hed posted:

Cross-posting since this is more data than Python (thanks WHERE MY HAT IS AT)


Is there a standard tool to look at parquet files?

I'm trying to go through a slog of parquet files in polars and keep getting an exception:

Python code:
Traceback (most recent call last):
  File "log_count.py", line 57, in <module>
    daily_output = result.collect()
                   ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\polars\lazyframe\frame.py", line 1937, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^

polars.exceptions.ComputeError: not implemented: reading parquet type Double to Int64 still not implemented
I know what this means, but I don't have a good way to diagnose which files have the problem, so I end up moving groups of files around until it works, then putting them back in one by one until I find the offender.

I understand that's the tradeoff for the efficiency of polars in lazy mode, but I'd love to know specifically where it blows up to help figure out the problem upstream.

Is there a better place to ask polars / data science questions?

Try DuckDB. Its very first designed use case was to be a quick in-process way of looking at the contents of parquet files. Here's their docs on using it with parquet: https://duckdb.org/docs/data/parquet/overview.html
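
For the specific Double-vs-Int64 hunt, something like this should point at the offending file; the directory glob and column name are placeholders.

Python code:
import duckdb

# parquet_schema() lists the declared type of every column in every file
# matched by the glob, so a DOUBLE-vs-INT64 mismatch shows up immediately
duckdb.sql(
    """
    SELECT file_name, name, type
    FROM parquet_schema('data/*.parquet')
    WHERE name = 'my_column'
    ORDER BY file_name
    """
).show()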

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Hadlock posted:

I'm curious: what is the dividing line between data engineering and devops/platform engineering/SRE?

For example, we had an "analytics" department that was maintaining their own spaghetti code of bash and SQL, database creation/modification, as well as a home-built Kitchen/Pentaho/Spoon ETL disaster and FTP/SFTP stuff.

I rewrote almost everything but the SQL and ported it to a new platform (with an eye towards moving to Airflow as the next step), parallelized a lot of the tasks, and got the run time way down.

On the other hand, I've got a coworker who's building out data pipelines for Google Forms and Excel data and piping it into Tableau for upper management to do internal business forecasting. Arguably you could farm that out to a devops guy.

It depends on the company. If you have a devops/platform team, then generally they do the job of maintaining the tools, and DEs just use them to accomplish data tasks. If you're small enough that you don't have a dedicated team, then DEs generally have to set up, maintain, and use the tools. If you're really small, they might expect an underpaid analyst to do all that and create charts too.

My experience has been that companies only allow specialists when they're forced to. So, if they've been forced to set up devops teams already, then the generalist DE doesn't need to do that bit of work; otherwise it's all on them. And so on.

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Hadlock posted:

I mean, is data engineering a sub-discipline of devops, like devsecops or platform engineering, or is it somehow independent? Arguably it's a sub-discipline of platform engineering.

My view is that Data Engineering is most fundamentally about moving data around (and maybe also transforming it). Whether you have to build your own tools, host open-source tools, or use cloud tools is a decision for the company to make, and the amount of work DEs have to do to maintain those tools will change depending on what decisions are made there. There are software engineers who specialize in building data tools (but not using them), and those are the individuals I'd see as falling into the discipline of platform engineering. But I wouldn't call those software engineers "Data Engineers" unless they were also using the tools they built to move data around.


CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination
Why are they using ADF Pipelines instead of Workflow Orchestration Manager, given that they're hiring and paying data engineers? Feels like running up against the limits of Pipelines means moving to a proper system like Workflow Orchestration Manager (aka Airflow).
