|
UU TEE EFF posted:I am a newbie to both programming and SQL in general. I can follow a guide and understand what I am doing, but that is about it. But I want to expand my skills a little in managing and analyzing a database. I think I want to do it by creating a local database on my Windows machine and then connecting to it with Power BI. I was able to follow the Python tutorial https://docs.python.org/3/library/sqlite3.html#tutorial to create an SQLite3 database. I then connected to it through Power BI. So that part is good.

It is very rare for anyone to just start as a junior data engineer with no prior experience, even with a CS degree. Most people start as backend engineers and move to DE because they like working with data, or they start as data analysts and move to DE because their company does what lots of companies do, expecting analysts to handle both the DE work and the analyst work, and they realize they like the DE side better. I'd feel pretty uncomfortable suggesting that someone just dive right in and try to get a job as a DE with no prior data/engineering roles.
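For anyone following along with the quoted post, the sqlite3 tutorial it links boils down to a few lines of standard-library Python. The file name below is a placeholder, not anything from the tutorial itself; a minimal sketch:

```python
import sqlite3

# Create (or open) a local database file that Power BI can later connect to.
con = sqlite3.connect("tutorial.db")
cur = con.cursor()

# Rebuild the demo table each run so the example stays repeatable.
cur.execute("DROP TABLE IF EXISTS movie")
cur.execute("CREATE TABLE movie(title TEXT, year INTEGER)")
cur.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [("Monty Python and the Holy Grail", 1975),
     ("And Now for Something Completely Different", 1971)],
)
con.commit()

# Sanity check: two rows landed in the table.
print(cur.execute("SELECT count(*) FROM movie").fetchone()[0])  # prints 2
con.close()
```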
|
# ¿ Dec 31, 2023 21:28 |
|
|
# ¿ May 17, 2024 01:16 |
|
For OLTP DBs, the most popular one for startups in the US (and medium-sized newer tech companies) seems to be Postgres, unless they've tied themselves to the Microsoft stack, in which case they're on SQL Server. MySQL is still somewhat popular in the US and very popular in Asia, especially China. No one uses Oracle if they can help it; it's basically only large, old, non-tech companies, which means there are a ton of jobs that use it but no one likes them.
|
# ¿ Jan 3, 2024 19:50 |
|
Hughmoris posted:For those of you currently working as Data Engineers, what does your day-to-day look like? What technologies are you working with?

Our stack is:
1. Self-hosted Airflow as our orchestrator, which executes two kinds of jobs:
   a. Python pipelines that are either replicating tables from our production db or pulling data from APIs
   b. DBT jobs that create derived tables from those base-level tables created/updated by the Python pipelines
2. A "warehouse", which is a managed Postgres instance, because it is cheap and our data sizes are small enough that we can get away with it for now.

Changes we're considering are to either move to a lakehouse architecture with Iceberg + Trino, or to move to a proper columnar db, like self-hosted StarRocks (a fork of Apache Doris).

My day-to-day at my startup is:
1. Checking the orchestrator status page - Mostly just rerunning failed jobs and/or fixing code if, e.g., an API change caused the failure.
2. Fielding tickets for data requests - This is the bulk of my job. If the request requires a brand new source, it'll mean writing a new pipeline job to pull data from an API. But often it just requires new DBT transforms to present the existing data in a different way so we can hook it up to Tableau or Geckoboard.
3. Occasional calls - I am blessed to have been able to set hard boundaries (and be supported in this by my boss) on the number of meetings I attend, since I am still an IC even though I am the data lead.
4. Occasional work to re-architect existing systems - We've always chosen the simplest/cheapest route with the thought that when it breaks as our data volumes grow, we'll revisit it and choose a more complicated or more expensive setup. Well, our data volumes have continued to grow and things do break as expected. When that happens, this becomes my main focus until it's resolved, and incoming tickets go on hold.

CompeAnansi fucked around with this message at 21:04 on Jan 3, 2024 |
# ¿ Jan 3, 2024 20:05 |
|
Hughmoris posted:Do your data engineers also administer your Postgres instance, or do you have a dedicated DBA team for that?

We're paying for managed Postgres, so it's largely handled for us, but anything that needs doing I do myself. No DBA team.
|
# ¿ Jan 3, 2024 20:59 |
|
lazerwolf posted:Hello Data Engineering thread! I’m sort of a hybrid Full Stack Engineer / Data Engineer in the Biotechnology space meaning I can build front facing UIs, backend APIs and Data pipelines.

Since no one is replying, I'll mention what we do for Postgres-to-Postgres syncs, since Redshift is based on Postgres and has the COPY command (although it may be somewhat restricted). Assuming that none of the tables in the DB are larger than memory, you can COPY each table out of the source to STDOUT, buffer it, and then COPY it from STDIN into the target.

EDIT: A cursory google search after posting turned this up, though: https://stackoverflow.com/questions/31603429/redshift-copy-from-stdin-using-java-program. Sounds like Redshift doesn't support COPY FROM STDIN, so you'll need to copy the data to S3 first and then use Redshift's COPY ... FROM 's3://...' form.
CompeAnansi fucked around with this message at 05:19 on Jan 5, 2024 |
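A minimal sketch of the Postgres-to-Postgres COPY sync described above, assuming psycopg2 as the driver; the DSNs, table name, S3 path, and IAM role below are all made-up placeholders, not code from the post:

```python
import io

def copy_out_sql(table):
    # Streams the table out of the source connection as CSV.
    return f"COPY {table} TO STDOUT WITH (FORMAT csv)"

def copy_in_sql(table):
    # Loads that same CSV stream into the target.
    return f"COPY {table} FROM STDIN WITH (FORMAT csv)"

def redshift_copy_sql(table, s3_path, iam_role):
    # Redshift can't COPY FROM STDIN, so it loads from S3 instead.
    return f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}' CSV"

def sync_table(source_dsn, target_dsn, table):
    """Buffer one table in memory, then COPY it into the target Postgres."""
    import psycopg2  # assumed driver; imported here so the helpers above work without it
    buf = io.StringIO()
    with psycopg2.connect(source_dsn) as src, src.cursor() as cur:
        cur.copy_expert(copy_out_sql(table), buf)
    buf.seek(0)
    with psycopg2.connect(target_dsn) as tgt, tgt.cursor() as cur:
        cur.copy_expert(copy_in_sql(table), buf)
```

This buffers the whole table in a StringIO, which is why the "smaller than memory" caveat matters; a chunked or file-backed variant would be needed otherwise.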
# ¿ Jan 5, 2024 01:10 |
|
Has anyone tried running dbt over Iceberg yet (with whatever query layer you prefer)? I know views aren’t supported on Iceberg yet, but you should in theory be able to materialize as a table. But the process for creating a table in Iceberg is way more complicated than it is in a traditional db, so I’m not sure if the system can pull it off.
|
# ¿ Feb 4, 2024 07:33 |
|
Hed posted:Cross-posting since this is more data than Python (thanks WHERE MY HAT IS AT)

Try DuckDB. Its very first designed use case was to be a quick, in-process way of looking at the contents of parquet files. Here's their docs on using it with parquet: https://duckdb.org/docs/data/parquet/overview.html
|
# ¿ Feb 21, 2024 21:45 |
|
Hadlock posted:I'm curious what is the dividing line between data engineering, and devops/platform engineering/sre

It depends on the company. If you have a devops/platform team, then generally they maintain the tools and DEs just use them to accomplish data tasks. If you're small enough that you don't have a dedicated team, then DEs generally have to set up, maintain, and use the tools. If you're really small, they might expect an underpaid analyst to do all that and create charts too. My experience has been that companies only allow specialists when they're forced to. So if they've already been forced to set up a devops team, the generalist DE doesn't need to do that bit of work; otherwise it's all on them. And so on.
|
# ¿ Feb 26, 2024 22:31 |
|
Hadlock posted:I mean, is data engineering a sub-discipline of devops, like devsecops or platform engineering, or is it somehow independent? Arguably it's a sub-discipline of platform engineering

My view is that Data Engineering is most fundamentally about moving data around (and maybe also transforming it). Whether you have to build your own tools, host open-source tools, or use cloud tools is a decision for the company to make, and the amount of work DEs have to do to maintain those tools changes depending on what's decided there. There are software engineers who specialize in building data tools (but not using them), and those are the people I'd put in the discipline of platform engineering. But I wouldn't call those software engineers "Data Engineers" unless they were also using the tools they built to move data around.
|
# ¿ Feb 28, 2024 01:07 |
|
|
|
Why are they using ADF Pipelines instead of Workflow Orchestration Manager, given that they're hiring and paying data engineers? Feels like running up against the limits of Pipelines means it's time to move to a proper system like Workflow Orchestration Manager (aka Airflow).
|
# ¿ Apr 25, 2024 17:55 |