Register a SA Forums Account here!
JOINING THE SA FORUMS WILL REMOVE THIS BIG AD, THE ANNOYING UNDERLINED ADS, AND STUPID INTERSTITIAL ADS!!!

You can: log in, read the tech support FAQ, or request your lost password. This dumb message (and those ads) will appear on every screen until you register! Get rid of this crap by registering your own SA Forums Account and joining roughly 150,000 Goons, for the one-time price of $9.95! We charge money because it costs us money per month for bills, and since we don't believe in showing ads to our users, we try to make the money back through forum registrations.
 
  • Post
  • Reply
ulmont
Sep 15, 2010

IF I EVER MISS VOTING IN AN ELECTION (EVEN AMERICAN IDOL) ,OR HAVE UNPAID PARKING TICKETS, PLEASE TAKE AWAY MY FRANCHISE

credburn posted:

In order to do that, I need to first find out what districts are most common. That should be easy. But then I need to find out what actors most frequently appear in the films being rented by those districts. That's where, conceptually, I get lost. I feel like this is the sort of mundane and typical task that will become second nature in a career setting, but right now I find I'm overwhelmed.

Assuming that you are doing things in multiple SQL statements, you might consider something like this:

1. Find out what districts are most common. You say this is easy, so I'll assume it is.
2. Find out, in those districts, what films are being rented. Since your statement was "where the most movies are rented", I'm going to assume here that duplicate rentals should count twice here.
3. Find out, in those rentals, what actors appear the most.

Adbot
ADBOT LOVES YOU

nielsm
Jun 1, 2009



That ER diagram technically seems to be missing an arrow from inventory to store, but that shouldn't be relevant to the question.

Intermediate questions:
How would you get the actors that appear in the most films?
How would you get the actors that appear in the most films that are in inventory for a specific store?
How would you find the most rented out films for a specific store?
How would you find the single most rented out film for every store?
How would you find the most popular actors for the films rented out for a specific store?

Away all Goats
Jul 5, 2005

Goose's rebellion

I got hired for a DB Admin role and I have no idea what I'm doing. Oh god.

credburn
Jun 22, 2016
A tangled skein of bad opinions, the hottest takes, and the the world's most misinformed nonsense. Do not engage with me, it's useless, and better yet, put me on ignore.

ulmont posted:

Assuming that you are doing things in multiple SQL statements, you might consider something like this:

1. Find out what districts are most common. You say this is easy, so I'll assume it is.
2. Find out, in those districts, what films are being rented. Since your statement was "where the most movies are rented", I'm going to assume here that duplicate rentals should count twice here.
3. Find out, in those rentals, what actors appear the most.


nielsm posted:

That ER diagram technically seems to be missing an arrow from inventory to store, but that shouldn't be relevant to the question.

Intermediate questions:
How would you get the actors that appear in the most films?
How would you get the actors that appear in the most films that are in inventory for a specific store?
How would you find the most rented out films for a specific store?
How would you find the single most rented out film for every store?
How would you find the most popular actors for the films rented out for a specific store?

Thanks so much, gang! This is actually exactly what I was looking for without realizing it!

abelwingnut
Dec 23, 2002


Away all Goats posted:

I got hired for a DB Admin role and I have no idea what I'm doing. Oh god.

if you're dealing with a small oltp, relatively low traffic database, you probably won't have to do much admin-wise. just reindex now and again, least privilege security, backups, etc. just don't be an ioiot/rock its boat, obviously.

cam definitely get hairier so hopefully that's all.

Away all Goats
Jul 5, 2005

Goose's rebellion

abelwingnut posted:

if you're dealing with a small oltp, relatively low traffic database, you probably won't have to do much admin-wise. just reindex now and again, least privilege security, backups, etc. just don't be an ioiot/rock its boat, obviously.

cam definitely get hairier so hopefully that's all.

Thanks, that does make me feel a bit better. It's a small (30~ people) company dealing with healthcare data. But it's my first bigboy job in the industry that isn't an internship/co-op(and they were unrelated to databases) so I'm getting a huge sense of imposter syndrome.

Away all Goats fucked around with this message at 00:59 on Jul 26, 2023

abelwingnut
Dec 23, 2002


i haven't worked in healthcare, but i imagine it has its own hurdles. hipaa, probably lots of auditing, etc.

can't help you there, but it can't be impossible.

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Away all Goats posted:

I got hired for a DB Admin role and I have no idea what I'm doing. Oh god.

You got this!

What type(s) of databases are you going to work with? SQL Server? Oracle?

Away all Goats
Jul 5, 2005

Goose's rebellion

Hughmoris posted:

You got this!

What type(s) of databases are you going to work with? SQL Server? Oracle?

Thanks. They mentioned it was mostly working with PostgreSQL, which I was honest about and said I had no experience in. I mostly worked with Oracle previously. From what I've seen they're pretty similar.

Kuule hain nussivan
Nov 27, 2008

You'll do fine. Everyone is as bad or worse than you are. Even the healthcare side of things, despite regulation, is pretty rotten. I am not a pessimist and don't ask me how I know this.

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Away all Goats posted:

Thanks. They mentioned it was mostly working with PostgreSQL, which I was honest about and said I had no experience in. I mostly worked with Oracle previously. From what I've seen they're pretty similar.

The SQL syntax is basically identical because the Postgres devs follow what Oracle puts on the spec. They're pretty different still in management, but your main things are going to be the obvious a) find where the backups are, test them, confirm they work, b) work out how to tell how your DB is doing and have it yell at you if it goes out of the regular boundaries.

Generic Monk
Oct 31, 2011

For the past couple of jobs now I've been doing ETL stuff with MSSQL. I don't really have any formal training with this stuff so I've pretty much been picking up best practices on the fly and not trying to completely cock stuff up like my colleagues do on the regular.

It's kind of difficult to get people without much knowledge or care for best practices to follow a source control proces without it being braindead simple, so in these jobs I've settled into the routine of 'The create table/procedure/function etc scripts are tracked in Git; if you want to edit an entity in the database you open/edit/run these files so you're not making changes to stuff in the database that then doesn't get recorded in source control'. This works but it's prone to error and feels like a clunky hack.

How do others manage this? It seems like SQL server management studio really isn't remotely interested in the idea of source control or that you would regularly be running external files, but surely you would want to have this stuff in source control. Am I just using the wrong tool? I just found out DataGrip exists which looks really nice, but It's quite possible my workflow is just hosed up.

monochromagic
Jun 17, 2023

Generic Monk posted:

For the past couple of jobs now I've been doing ETL stuff with MSSQL. I don't really have any formal training with this stuff so I've pretty much been picking up best practices on the fly and not trying to completely cock stuff up like my colleagues do on the regular.

It's kind of difficult to get people without much knowledge or care for best practices to follow a source control proces without it being braindead simple, so in these jobs I've settled into the routine of 'The create table/procedure/function etc scripts are tracked in Git; if you want to edit an entity in the database you open/edit/run these files so you're not making changes to stuff in the database that then doesn't get recorded in source control'. This works but it's prone to error and feels like a clunky hack.

How do others manage this? It seems like SQL server management studio really isn't remotely interested in the idea of source control or that you would regularly be running external files, but surely you would want to have this stuff in source control. Am I just using the wrong tool? I just found out DataGrip exists which looks really nice, but It's quite possible my workflow is just hosed up.

For ETL stuff I'd recommend looking into dbt. There are many similar tools out there, and some might be better for your use case, but start with dbt. It is by far the largest and most well-supported, and seems to fit your needs, and you can run it open source. Also hooks nicely into CI/CD processes so you don't actually have to do the changes manually.

(As an aside - maybe also look into doing ELT instead, it's what all the cool kids are doing)

Generic Monk
Oct 31, 2011

monochromagic posted:

For ETL stuff I'd recommend looking into dbt. There are many similar tools out there, and some might be better for your use case, but start with dbt. It is by far the largest and most well-supported, and seems to fit your needs, and you can run it open source. Also hooks nicely into CI/CD processes so you don't actually have to do the changes manually.

(As an aside - maybe also look into doing ELT instead, it's what all the cool kids are doing)

The business is going to move to data warehousing in Redshift IIRC after this project is done, but this project is just a bunch of stored procedures that take data from 3 source systems and spit out a bunch of CSV files to be imported into a third party's CRM, so dbt is a bit out of scope at this stage. I think the moral is that the whole way we're doing things has been a bit hosed, I'm just trying to make the best out of a bad situation. I do need to learn dbt though; it keeps coming up. I do bounce off the marketing jargon in their website a little though.

monochromagic
Jun 17, 2023


:psyduck:

Wrt dbt marketing lingo I agree - I'd simply go straight to the docs. They have a pretty good sample project as well which showcases common usage patterns. Totally understand making the best of a bad situation, and fwiw it sounds like you're on the right track given the circumstances. But yeah, for levelling up the ETL game in the future I'd recommend dbt, Airbyte and Dagster - but I'm just waffling on about data engineering now and it probably isn't the thread for that.

Average Lettuce
Oct 22, 2012


monochromagic posted:

but I'm just waffling on about data engineering now and it probably isn't the thread for that.

Is there a thread for that?

abelwingnut
Dec 23, 2002


we moved our warehose to redshift from snowflake a few months ago. while it is quicker and better structurally, the web ide is loving awful. it'll run queries up to 300k characters, but it won't save queries longer than 30k, which...what the gently caress? and yes, those numbers are correct. it'll run queries it can't save. baffling.

furthermore, it has the dumbest autocomplete/intellisense in the known universe. there's zero in the way of auto-formatting. the lefthand panel containing the schema is a ux nightmare.

oh, and it crashes like every tenth query you run.

don't go to redshift if you can avoid it! at least snowflake wouldn't crash all the time, even if writing in it is hellish.

Hughmoris
Apr 21, 2007
Let's go to the abyss!

abelwingnut posted:

we moved our warehose to redshift from snowflake a few months ago. while it is quicker and better structurally, the web ide is loving awful. it'll run queries up to 300k characters, but it won't save queries longer than 30k, which...what the gently caress? and yes, those numbers are correct. it'll run queries it can't save. baffling.

furthermore, it has the dumbest autocomplete/intellisense in the known universe. there's zero in the way of auto-formatting. the lefthand panel containing the schema is a ux nightmare.

oh, and it crashes like every tenth query you run.

don't go to redshift if you can avoid it! at least snowflake wouldn't crash all the time, even if writing in it is hellish.

Are you an analyst, engineer, or administrator for Redshift? Or a combination of all three?

From the outside looking in, I often hear that Redshift is a platform that can offer a lot of juice but you have to know what you're doing in terms of modeling and writing queries. I've toyed with it on a few personal projects but my job doesn't use it.

Seventh Arrow
Jan 26, 2005

Average Lettuce posted:

Is there a thread for that?

I created one years ago, but there was zero interest. The data engineering subreddit is pretty decent, though.

abelwingnut
Dec 23, 2002


Hughmoris posted:

Are you an analyst, engineer, or administrator for Redshift? Or a combination of all three?

From the outside looking in, I often hear that Redshift is a platform that can offer a lot of juice but you have to know what you're doing in terms of modeling and writing queries. I've toyed with it on a few personal projects but my job doesn't use it.

engineer, but they're trying to shoehorn me into a bunch of analysis and i'm not liking it!

Average Lettuce
Oct 22, 2012


Seventh Arrow posted:

I created one years ago, but there was zero interest. The data engineering subreddit is pretty decent, though.

Yeah, that's my go-to as well.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
I'm all for using this as a general purpose Database/Analyst/Engineer thread. It's not like we are getting so much traffic that we can't keep up with the topic.

Thoughts?

abelwingnut
Dec 23, 2002


i always assumed this thread was just a catch-all for data jobs given there were no other threads at the time? i've got this bookmarked in my control panel and honeslty haven't looked at the subforum's main page since then, which was like...two years ago?, so maybe there are some new threads now.

but yea, one thread for data makes a ton of sense to me. analyst, admin, engineer, visualizer/whatever they're called, scientist, anything.

abelwingnut fucked around with this message at 18:10 on Jul 31, 2023

monochromagic
Jun 17, 2023

I'd be open to create an OP for a data thread and see if there's interest. I love talking about that poo poo and am currently a one person data engineering army, so having some input would be welcome. Ofc, there might be decent overlap with this thread as SQL is still king in this area, but I feel that there's a decent amount of stuff that's out of scope for people mostly interested in DB management (love you guys and everything you do).

Hughmoris
Apr 21, 2007
Let's go to the abyss!

monochromagic posted:

I'd be open to create an OP for a data thread and see if there's interest. I love talking about that poo poo and am currently a one person data engineering army, so having some input would be welcome. Ofc, there might be decent overlap with this thread as SQL is still king in this area, but I feel that there's a decent amount of stuff that's out of scope for people mostly interested in DB management (love you guys and everything you do).

I think that's a great idea!

I've done data analyst work for a while and really want to break into data engineering. Specifically, I (think) I want to become a Databricks engineer so I've been brushing up on that. Any of yous doing Databricks work?

Seventh Arrow
Jan 26, 2005

Hughmoris posted:

I think that's a great idea!

I've done data analyst work for a while and really want to break into data engineering. Specifically, I (think) I want to become a Databricks engineer so I've been brushing up on that. Any of yous doing Databricks work?

If you want to break into data engineering, the things you should prioritize the most are:

Python
SQL
Data Modeling
ETL Principals
One of the big 3 cloud technologies

Databricks has kind of evolved beyond pyspark, but being proficient at pyspark should nevertheless help you get your foot in the door at least.

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Seventh Arrow posted:

If you want to break into data engineering, the things you should prioritize the most are:

Python
SQL
Data Modeling
ETL Principals
One of the big 3 cloud technologies

Databricks has kind of evolved beyond pyspark, but being proficient at pyspark should nevertheless help you get your foot in the door at least.

Thanks for the feedback. I have a good grasp on Python and SQL, with a decent grasp on Data Modeling and ETL. I've picked up an AWS Solutions Architect - Associate cert, and the Azure Data Engineer cert.

My current job has a LOT of downtime, and they foot the bill on some training costs (hence the certs). I'm trying to stay productive until I job hop, which could be later this year.

Some of the DE-related topics I've been looking at diving into:
- Databricks
- Snowflake
- Redshift/Glue/EMR
- Elasticsearch and Kibana (I know the least about these)
- Splunk

Gwaihir
Dec 8, 2009
Hair Elf

Generic Monk posted:

For the past couple of jobs now I've been doing ETL stuff with MSSQL. I don't really have any formal training with this stuff so I've pretty much been picking up best practices on the fly and not trying to completely cock stuff up like my colleagues do on the regular.

It's kind of difficult to get people without much knowledge or care for best practices to follow a source control proces without it being braindead simple, so in these jobs I've settled into the routine of 'The create table/procedure/function etc scripts are tracked in Git; if you want to edit an entity in the database you open/edit/run these files so you're not making changes to stuff in the database that then doesn't get recorded in source control'. This works but it's prone to error and feels like a clunky hack.

How do others manage this? It seems like SQL server management studio really isn't remotely interested in the idea of source control or that you would regularly be running external files, but surely you would want to have this stuff in source control. Am I just using the wrong tool? I just found out DataGrip exists which looks really nice, but It's quite possible my workflow is just hosed up.

We use Liquibase for this:
https://www.liquibase.org/

Every change, update, or insert for our various DBs is written as a liquibase change set and saved in the relevant git repo. When it comes time for a new version deployment, the whole thing gets run and takes care of any changes needed, with history tracking and all that good stuff.

It can take some getting used to for people to write their statements correctly (You have to be good about using IF NOT EXISTS, setting the context for stuff like "run on changes, run always, run in different environments, or you'll get weird poo poo happening), but for any decently sized project it feels like a must.

Gwaihir fucked around with this message at 20:24 on Jul 31, 2023

RandomBlue
Dec 30, 2012

hay guys!


Biscuit Hider

Seventh Arrow posted:

If you want to break into data engineering, the things you should prioritize the most are:

Python
SQL
Data Modeling
ETL Principals
One of the big 3 cloud technologies

Databricks has kind of evolved beyond pyspark, but being proficient at pyspark should nevertheless help you get your foot in the door at least.

python? you mean the money laundering language?

Hughmoris
Apr 21, 2007
Let's go to the abyss!

abelwingnut posted:

engineer, but they're trying to shoehorn me into a bunch of analysis and i'm not liking it!

Do you know why the company switched off Snowflake? It seems the biggest reason I hear about is Snowflake is pretty drat pricey.

abelwingnut
Dec 23, 2002


Hughmoris posted:

Do you know why the company switched off Snowflake? It seems the biggest reason I hear about is Snowflake is pretty drat pricey.

our new cto is a huge amazon fanboy and has switched over our entire infrastructure to aws. it'd be possible to have snowflake as the warehouse while having glue, dms,, et al do their various jobs and write to snowflake and what not. but if you're already moving over dbt and orchestration over to aws, might as well move to redshift as well.

from a usability standpoint, geesh, i don't know which i'd choose. snowflake has its own litany of ui/ux problems. i'd probably choose redshift given aws seems to actually be working on its products. snowflake has barely changed in like 3yr and already feels somewhat over.

never used azure so no idea how it compares. i hear it's better, though.

abelwingnut fucked around with this message at 20:58 on Jul 31, 2023

Seventh Arrow
Jan 26, 2005

RandomBlue posted:

python? you mean the money laundering language?

welcome to coding prison, sucker!

lazerwolf
Dec 22, 2009

Orange and Black

monochromagic posted:

I'd be open to create an OP for a data thread and see if there's interest. I love talking about that poo poo and am currently a one person data engineering army, so having some input would be welcome. Ofc, there might be decent overlap with this thread as SQL is still king in this area, but I feel that there's a decent amount of stuff that's out of scope for people mostly interested in DB management (love you guys and everything you do).

I’d be in for a data thread as well. Transitioning my role from Full stack developer to more of a Data Engineering role so I’m interested in discussing more.

I would blow Dane Cook
Dec 26, 2008
Why is Oracle (the company) unpleasant to deal with?

mortarr
Apr 28, 2005

frozen meat at high speed
A lot of Data Engineering jobs mention python - where should I start? I've done a lot of data related dev stuff, almost all on c#/ms SQL if that helps? Looking at one of the Azure certs mentioned above too.

Generic Monk
Oct 31, 2011


is it bad? the company is all in on amazon everything including the VPN which takes 2 solid minutes to make a connection and doesn’t remember your credentials, and their citrix/terminal server product on which they can’t figure out how to increase the space from 10GB on the user partition if anyone needs it

Star War Sex Parrot
Oct 2, 2003

mortarr posted:

A lot of Data Engineering jobs mention python - where should I start? I've done a lot of data related dev stuff, almost all on c#/ms SQL if that helps? Looking at one of the Azure certs mentioned above too.
"Data engineer" feels like a fairly broad and underspecified title right now, so the exact role and necessary background is gonna vary from place to place. I would expect you to know the current glossary of jargon (i.e., data lake, data warehouse) as well as basic differentiators between the major players (e.g., RedShift, Databricks, Snowflake, and maybe BigQuery). At a minimum, Python knowledge probably encompasses (1) glue to stitch data capture pipelines together, and (2) APIs to prototype analysts' queries.

Star War Sex Parrot fucked around with this message at 19:27 on Aug 1, 2023

credburn
Jun 22, 2016
A tangled skein of bad opinions, the hottest takes, and the the world's most misinformed nonsense. Do not engage with me, it's useless, and better yet, put me on ignore.
Hey gang, I am struggling to join these tables and produce the results I'm looking for. I'm using postgres and this diagram:



This is a slight variation from my earlier goal; now what I am trying to accomplish is, I want to find out what the five most common districts are, and in those I want to produce what genre is most popular in each of the five districts.

I am able to get the five most common districts pretty simply:

code:
SELECT district, COUNT(*) as district_count
FROM address
GROUP BY district
ORDER BY district_count DESC
LIMIT 5;
Now I want to use that data in combination with the genre information all the way on the other side of the diagram using a lot of joins. First, I turn the query into a sub-query...

code:
SELECT top_districts, category.name AS popular_genre, COUNT(*) AS genre_count
FROM (
  SELECT district, COUNT(*) as district_count
  FROM address
  GROUP BY district
  ORDER BY district_count DESC
  LIMIT 5
) AS top_districts
Then I add the joins:

code:
JOIN customer ON top_districts.district = address.district
JOIN rental ON customer.customer_id = rental.customer_id
JOIN inventory ON rental.rental_id = inventory.rental_id
JOIN film ON inventory.film_id = film.film_id
JOIN film_category ON film.film_id = film_category.film_id
JOIN category ON film_category.category_id = category.category_id
GROUP BY top_districts.district, category.name
ORDER BY top_districts.district, genre_count DESC;
But no good. I think maybe the problem is with my sub-query? It comes back with district and count kind of crammed into a single field and that may be causing the issue? I've been working with ChatGPT a little bit and it says this code is golden. I'm trying not to rely on AI so much though, for, well, reasons like this.

Specifically the error I'm getting back when I use this code altogether is:

ERROR: missing FROM-clause entry for table "address"
LINE 10: JOIN customer ON top_district = address.district

Sir Bobert Fishbone
Jan 16, 2006

Beebort
The problem is this join: "JOIN customer ON top_districts.district = address.district". One side of the ON has to include the table you want to join to (customer). Also, because it's within the subquery, the address table is not available to the rest of your query unless you join it back on or expose one of its IDs in the subquery.

Adbot
ADBOT LOVES YOU

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

credburn posted:

Hey gang, I am struggling to join these tables and produce the results I'm looking for. I'm using postgres and this diagram:



This is a slight variation from my earlier goal; now what I am trying to accomplish is, I want to find out what the five most common districts are, and in those I want to produce what genre is most popular in each of the five districts.

I am able to get the five most common districts pretty simply:

code:
SELECT district, COUNT(*) as district_count
FROM address
GROUP BY district
ORDER BY district_count DESC
LIMIT 5;
Now I want to use that data in combination with the genre information all the way on the other side of the diagram using a lot of joins. First, I turn the query into a sub-query...

code:
SELECT top_districts, category.name AS popular_genre, COUNT(*) AS genre_count
FROM (
  SELECT district, COUNT(*) as district_count
  FROM address
  GROUP BY district
  ORDER BY district_count DESC
  LIMIT 5
) AS top_districts
Then I add the joins:

code:
JOIN customer ON top_districts.district = address.district
JOIN rental ON customer.customer_id = rental.customer_id
JOIN inventory ON rental.rental_id = inventory.rental_id
JOIN film ON inventory.film_id = film.film_id
JOIN film_category ON film.film_id = film_category.film_id
JOIN category ON film_category.category_id = category.category_id
GROUP BY top_districts.district, category.name
ORDER BY top_districts.district, genre_count DESC;
But no good. I think maybe the problem is with my sub-query? It comes back with district and count kind of crammed into a single field and that may be causing the issue? I've been working with ChatGPT a little bit and it says this code is golden. I'm trying not to rely on AI so much though, for, well, reasons like this.

Specifically the error I'm getting back when I use this code altogether is:

ERROR: missing FROM-clause entry for table "address"
LINE 10: JOIN customer ON top_district = address.district


a couple things:
  • you cannot SELECT top_districts - top_districts is a table (or more specifically, an inline view) you're selecting from, not a field you're selecting from a table
  • your join to customer references the address table - the address table is not available to join on because it's in a subquery, so you can't join to it
  • your join to customer has no field from the customer table - i would expect a JOIN ON customer.[some field] = top_districts.[some other field]

e: f;b

  • 1
  • 2
  • 3
  • 4
  • 5
  • Post
  • Reply