lazerwolf
Dec 22, 2009

Orange and Black
Hello Data Engineering thread! I’m sort of a hybrid Full Stack Engineer / Data Engineer in the biotechnology space, meaning I can build front-facing UIs, backend APIs, and data pipelines.

We’re an AWS shop, so we use MWAA, S3, Redshift, etc. We use React for any UI application and Django for REST and GraphQL APIs that serve the frontends and house the business logic for our data pipelines.

Does anyone have a good approach for syncing external databases into Redshift as-is (i.e., no transformation)?


lazerwolf
Dec 22, 2009

Orange and Black

CompeAnansi posted:

Since no one is replying, I'll mention what we do for Postgres-to-Postgres syncs, since Redshift is based on Postgres and has the COPY command (although it may be somewhat restricted there). Assuming that none of the tables in the DB are larger than memory, you can use:
code:
COPY table TO STDOUT
to pull the table into an in-memory buffer, then immediately push it back out to the destination (after rewinding the buffer to the start with file.seek(0)) with:
code:
COPY table FROM STDIN
As I said, I'm not 100% sure how that'll carry over to Redshift (or what your source DBs are), but it's a potential starting place for high-speed duplication.

EDIT: A cursory Google search after posting turned this up, though: https://stackoverflow.com/questions/31603429/redshift-copy-from-stdin-using-java-program. Sounds like you'll need to copy it to S3 first, and then you can use the
code:
COPY table FROM 's3://...'
command instead, but this is significantly less efficient because you have to store the table in an intermediate stage rather than just holding it temporarily in memory.
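
For reference, the buffer round-trip described in that quote looks roughly like the sketch below, assuming psycopg2 on both ends; the connection strings and table name are placeholder values, not anything from this thread.
code:
# Hypothetical sketch of the COPY round-trip described above: stream a table out of
# one Postgres database with COPY ... TO STDOUT, rewind the buffer, then load it into
# an identically-shaped table elsewhere with COPY ... FROM STDIN.
import io

import psycopg2

def copy_table(source_dsn: str, dest_dsn: str, table: str) -> None:
    buf = io.StringIO()
    with psycopg2.connect(source_dsn) as src, psycopg2.connect(dest_dsn) as dst:
        with src.cursor() as cur:
            cur.copy_expert(f"COPY {table} TO STDOUT", buf)
        buf.seek(0)  # rewind the buffer before reading it back out
        with dst.cursor() as cur:
            cur.copy_expert(f"COPY {table} FROM STDIN", buf)

copy_table(
    "postgresql://user:pass@source-host/db",  # placeholder source connection
    "postgresql://user:pass@dest-host/db",    # placeholder destination connection
    "my_table",                               # placeholder table name
)

The whole table sits in that StringIO buffer at once, which is why the larger-than-memory caveat above matters.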

Some of these databases have 600+ tables and I doubt they would fit in memory. We have been doing the table-to-S3 copy paradigm. I was wondering if there was a better way, but it seems like there really isn't, especially since we're being really lazy right now and dropping the replica and copying everything in full each time.
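
For what it's worth, the full-refresh version of that paradigm looks roughly like the sketch below, assuming the table has already been dumped to a local CSV file; the bucket, IAM role, cluster endpoint, and table name are all made-up placeholders.
code:
# Rough sketch of the full-refresh staging paradigm: upload a CSV dump of the source
# table to S3, then have Redshift bulk-load it with COPY. Everything named here
# (bucket, role ARN, cluster endpoint, table) is a placeholder.
import boto3
import psycopg2

def full_refresh_via_s3(csv_path: str, table: str) -> None:
    bucket = "my-staging-bucket"
    key = f"redshift-staging/{table}.csv"

    # Stage the dump in S3 so Redshift's COPY can reach it
    boto3.client("s3").upload_file(csv_path, bucket, key)

    # Redshift speaks the Postgres protocol, so a plain psycopg2 connection works
    with psycopg2.connect("postgresql://user:pass@my-cluster:5439/db") as conn:
        with conn.cursor() as cur:
            cur.execute(f"TRUNCATE {table}")  # drop-and-reload each run
            cur.execute(
                f"COPY {table} FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
                "FORMAT AS CSV"
            )

full_refresh_via_s3("/tmp/my_table.csv", "my_table")

An incremental alternative would be to COPY into a staging table and merge, but that only pays off once the full reload gets too slow.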

lazerwolf
Dec 22, 2009

Orange and Black
I have a file-based data engineering challenge. We are looking to migrate objects from one S3 bucket to another but modify the object key in the process (i.e., inject path metadata). Does anyone have solutions other than invoking a Lambda function / Step Function?
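
For context, the no-Lambda baseline is just a server-side copy that rewrites each key, roughly like the sketch below; the bucket names and the key-rewrite rule are invented placeholders.
code:
# Minimal sketch of a bucket-to-bucket migration that rewrites each object key in
# flight (here by injecting a made-up path segment). Bucket names and the rewrite
# rule are placeholders. copy_object is a server-side copy, so the object bytes
# never pass through the machine running this script.
import boto3

SRC_BUCKET = "source-bucket"
DST_BUCKET = "dest-bucket"

s3 = boto3.client("s3")

for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = f"ingested/{old_key}"  # placeholder: inject path metadata here
        s3.copy_object(
            Bucket=DST_BUCKET,
            Key=new_key,
            CopySource={"Bucket": SRC_BUCKET, "Key": old_key},
        )

Note that copy_object tops out at 5 GB per object (boto3's managed copy handles bigger ones via multipart), and it's one API call per object, which is the part a Lambda/Step Function setup usually ends up parallelizing.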
