lazerwolf
Dec 22, 2009

Orange and Black
Hello Data Engineering thread! I’m sort of a hybrid Full Stack Engineer / Data Engineer in the biotechnology space, meaning I can build front-facing UIs, backend APIs, and data pipelines.

We’re an AWS shop, so we use MWAA, S3, Redshift, etc. We use React for any UI application and Django for REST and GraphQL APIs that serve the frontends and house the business logic for our data pipelines.

Does anyone have a good approach for syncing external databases into Redshift as-is (i.e., no transformation)?


lazerwolf
Dec 22, 2009

Orange and Black

CompeAnansi posted:

Since no one is replying, I'll mention what we do for Postgres-to-Postgres syncs, since Redshift is based on Postgres and has the COPY command (although it may be somewhat restricted there). Assuming that none of the tables in the DB are larger than memory, you can use:
code:
COPY table TO STDOUT
to pull the table into an in-memory buffer, then immediately push it back out to the destination (after rewinding the buffer to the start with file.seek(0)) with:
code:
COPY table FROM STDIN
As I said, I'm not 100% sure how that'll carry over to Redshift (or what your source DBs are), but it's a potential starting place for high-speed duplication.

EDIT: A cursory Google search after posting turned this up, though: https://stackoverflow.com/questions/31603429/redshift-copy-from-stdin-using-java-program. Sounds like you'll need to copy it to S3 first, and then you can use the
code:
COPY table FROM 's3://...'
command instead, but this is significantly less efficient because you have to store the table in an intermediate stage rather than just holding it temporarily in memory.
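
For reference, the buffer round-trip described in that quote looks roughly like the sketch below, assuming psycopg2 on both ends; the connection strings and table name are placeholder values, not anything from this thread.
code:
# Hypothetical sketch of the COPY round-trip described above: stream a table out of
# one Postgres database with COPY ... TO STDOUT, rewind the buffer, then load it into
# an identically-shaped table elsewhere with COPY ... FROM STDIN.
import io

import psycopg2

def copy_table(source_dsn: str, dest_dsn: str, table: str) -> None:
    buf = io.StringIO()
    with psycopg2.connect(source_dsn) as src, psycopg2.connect(dest_dsn) as dst:
        with src.cursor() as cur:
            cur.copy_expert(f"COPY {table} TO STDOUT", buf)
        buf.seek(0)  # rewind the buffer before reading it back out
        with dst.cursor() as cur:
            cur.copy_expert(f"COPY {table} FROM STDIN", buf)

copy_table(
    "postgresql://user:pass@source-host/db",  # placeholder source connection
    "postgresql://user:pass@dest-host/db",    # placeholder destination connection
    "my_table",                               # placeholder table name
)

The whole table sits in that StringIO buffer at once, which is why the larger-than-memory caveat above matters.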

Some of these databases have 600+ tables and I doubt they would fit in memory. We have been doing the table-to-S3 copy paradigm. I was wondering if there was a better way, but it seems like there really isn't, especially since we're being really lazy right now and dropping the replica and copying everything in full each time.
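
For what it's worth, the full-refresh version of that paradigm looks roughly like the sketch below, assuming the table has already been dumped to a local CSV file; the bucket, IAM role, cluster endpoint, and table name are all made-up placeholders.
code:
# Rough sketch of the full-refresh staging paradigm: upload a CSV dump of the source
# table to S3, then have Redshift bulk-load it with COPY. Everything named here
# (bucket, role ARN, cluster endpoint, table) is a placeholder.
import boto3
import psycopg2

def full_refresh_via_s3(csv_path: str, table: str) -> None:
    bucket = "my-staging-bucket"
    key = f"redshift-staging/{table}.csv"

    # Stage the dump in S3 so Redshift's COPY can reach it
    boto3.client("s3").upload_file(csv_path, bucket, key)

    # Redshift speaks the Postgres protocol, so a plain psycopg2 connection works
    with psycopg2.connect("postgresql://user:pass@my-cluster:5439/db") as conn:
        with conn.cursor() as cur:
            cur.execute(f"TRUNCATE {table}")  # drop-and-reload each run
            cur.execute(
                f"COPY {table} FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
                "FORMAT AS CSV"
            )

full_refresh_via_s3("/tmp/my_table.csv", "my_table")

An incremental alternative would be to COPY into a staging table and merge, but that only pays off once the full reload gets too slow.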

lazerwolf
Dec 22, 2009

Orange and Black
I have a file-based data engineering challenge. We are looking to migrate objects from one S3 bucket to another but modify the object key in the process (i.e., inject path metadata). Does anyone have solutions other than invoking a Lambda function / Step Function?
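
For context, the no-Lambda baseline is just a server-side copy that rewrites each key, roughly like the sketch below; the bucket names and the key-rewrite rule are invented placeholders.
code:
# Minimal sketch of a bucket-to-bucket migration that rewrites each object key in
# flight (here by injecting a made-up path segment). Bucket names and the rewrite
# rule are placeholders. copy_object is a server-side copy, so the object bytes
# never pass through the machine running this script.
import boto3

SRC_BUCKET = "source-bucket"
DST_BUCKET = "dest-bucket"

s3 = boto3.client("s3")

for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = f"ingested/{old_key}"  # placeholder: inject path metadata here
        s3.copy_object(
            Bucket=DST_BUCKET,
            Key=new_key,
            CopySource={"Bucket": SRC_BUCKET, "Key": old_key},
        )

Note that copy_object tops out at 5 GB per object (boto3's managed copy handles bigger ones via multipart), and it's one API call per object, which is the part a Lambda/Step Function setup usually ends up parallelizing.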
