Register a SA Forums Account here!
JOINING THE SA FORUMS WILL REMOVE THIS BIG AD, THE ANNOYING UNDERLINED ADS, AND STUPID INTERSTITIAL ADS!!!

You can: log in, read the tech support FAQ, or request your lost password. This dumb message (and those ads) will appear on every screen until you register! Get rid of this crap by registering your own SA Forums Account and joining roughly 150,000 Goons, for the one-time price of $9.95! We charge money because it costs us money per month for bills, and since we don't believe in showing ads to our users, we try to make the money back through forum registrations.
 
  • Post
  • Reply
Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Am I not explaining it well enough or did you just wanna shitpost?

Adbot
ADBOT LOVES YOU

nielsm
Jun 1, 2009



Can anyone confirm if this way of securing an application sounds correct?

MS SQL Server 2019
The application has a couple data tables, containing sensitive data, and a table for logging access to the application and data accessed.
A selection of domain users needs access to the application, other users have no access.
Write a set of stored procedures to do all data access required by users, these sprocs validate parameters and write to the access log table. The sprocs are marked as execute as owner (and owned by dbo).
Create an AD group (CONTOSO\AppUsers) and make the users member of this group.
The AD group is added as a login user to the SQL database, with no permissions on its own.
Create a database-level role "app_user" and add the login user "CONTOSO\AppUsers" as member of that role.
Grant the "app_user" role execute the sprocs.
In the sprocs, the ORIGINAL_LOGIN() function can then be used to obtain the user's own account name (CONTOSO\jsmith) for access logging.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


nielsm posted:

Can anyone confirm if this way of securing an application sounds correct?

MS SQL Server 2019
The application has a couple data tables, containing sensitive data, and a table for logging access to the application and data accessed.
A selection of domain users needs access to the application, other users have no access.
Write a set of stored procedures to do all data access required by users, these sprocs validate parameters and write to the access log table. The sprocs are marked as execute as owner (and owned by dbo).
Create an AD group (CONTOSO\AppUsers) and make the users member of this group.
The AD group is added as a login user to the SQL database, with no permissions on its own.
Create a database-level role "app_user" and add the login user "CONTOSO\AppUsers" as member of that role.
Grant the "app_user" role execute the sprocs.
In the sprocs, the ORIGINAL_LOGIN() function can then be used to obtain the user's own account name (CONTOSO\jsmith) for access logging.

That sounds more or less correct. I think we do ours around USER_NAME() and SUSER_NAME() but we are doing more row level security than table level security with access logging.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Why EXECUTE AS OWNER?

You don’t really need that. As long as you can deploy the SP permissions are implicit. If the user can execute the SP they will not be blocked on access to tables they otherwise would not have permission against.

nielsm
Jun 1, 2009



Ruggan posted:

Why EXECUTE AS OWNER?

You don’t really need that. As long as you can deploy the SP permissions are implicit. If the user can execute the SP they will not be blocked on access to tables they otherwise would not have permission against.

I'm not entirely sure I understand what you're arguing here, but I also don't understand the authorization model entirely.

Do you mean that SPs have permissions on tables (etc.) checked at creation time, and then after creation, no checks are made at execution? Other than user having permission to execute the SP itself.

So in theory, I could create an SP that accesses a table I have permissions on, such that I'm the owner of that SP, then have my permissions on that table revoked, and then still be able to execute the SP to access the table in that way?

Are there any disadvantages or risks involved in using EXECUTE AS OWNER?

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


nielsm posted:

I'm not entirely sure I understand what you're arguing here, but I also don't understand the authorization model entirely.

Do you mean that SPs have permissions on tables (etc.) checked at creation time, and then after creation, no checks are made at execution? Other than user having permission to execute the SP itself.

So in theory, I could create an SP that accesses a table I have permissions on, such that I'm the owner of that SP, then have my permissions on that table revoked, and then still be able to execute the SP to access the table in that way?

Are there any disadvantages or risks involved in using EXECUTE AS OWNER?

Yeah - I’m suggesting that you dump EXECUTE AS OWNER.

You are correct. Permissions are checked at creation time, and are not rechecked at runtime - but only for objects that live inside the same database. This allows you to, for example, create an interfaces schema, create stored procedures in that schema that do CRUD ops against tables in dbo, and grant users only access to execute SPs in the interfaces schema. That prevents them from directly accessing tables and allows you the ability to change the underlying schema as you please as long as you maintain the interfaces. Same strategy as interfaces in other languages and very powerful as you are no longer schema locked by downstream dependencies.

Permissions for objects outside of your database are checked at both deploy and execution. Let’s say Bob maintains BobDB on the same server and you join to BobDB.dbo.Stuff. When you deploy your stored procedure, you must have access to query that table. However, any user running the stored procedure must also have access at execution time. If this weren’t true, Bob would have no way of revoking access once your SP was deployed.

One disadvantage to EXECUTE AS OWNER is that the deploying user must have access to execute as owner. This can be a problem if you have contributors that need to make a minor change to a SP but don’t have db_owner. It can also be a problem if you use some form of Continuous Integration / Deployment and a service account is the one deploying. In these cases you may need to give much more security than you want the deploying user to have.

I’m sure there are other implications but in practice I’ve tried to avoid any impersonation as it’s rarely needed. I view it as a code smell that should be resolved some other way (i.e. give the use appropriate access if they need it).

Baronash
Feb 29, 2012

So what do you want to be called?
I'm a complete beginner at SQL, but I'm trying to make a change to some records. We have kids who registered for an event this year, and because of covid we're holding those reservations until next year. As a result, I'm trying to manually increment their grade in school so that they get placed in the right programs/sleeping arrangements.

This is what I tried:
code:
UPDATE customers INNER JOIN SummerRegistration ON customers.CustomerID = SummerRegistration.CustomerID 
SET customers.Grade = Grade+1;
It worked okay, but a single customerID can be tied to multiple records in SummerRegistration if they're coming for multiple events. The way I have it set up, their grade gets incremented for every SummerRegistration record.

My non-SQL brain feels like I should be performing some sort of filtering operation rather than joining.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
code:
UPDATE customers c SET Grade = Grade + 1
WHERE exists (SELECT 1 FROM SummerRegistration sr WHERE c.CustomerID = sr.CustomerID);

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Baronash posted:

I'm a complete beginner at SQL, but I'm trying to make a change to some records. We have kids who registered for an event this year, and because of covid we're holding those reservations until next year. As a result, I'm trying to manually increment their grade in school so that they get placed in the right programs/sleeping arrangements.

This is what I tried:
code:
UPDATE customers INNER JOIN SummerRegistration ON customers.CustomerID = SummerRegistration.CustomerID 
SET customers.Grade = Grade+1;
It worked okay, but a single customerID can be tied to multiple records in SummerRegistration if they're coming for multiple events. The way I have it set up, their grade gets incremented for every SummerRegistration record.

My non-SQL brain feels like I should be performing some sort of filtering operation rather than joining.

Try this:
code:
UPDATE c --customers
SET Grade = Grade + 1
FROM customers AS C
WHERE EXISTS (
  SELECT *
  FROM SummerRegistration AS S
  WHERE S.CustomerID = C.CustomerID
);
Here was my extremely quick and dirty POC:
code:
CREATE TABLE dbo.#Thing (
  ColVal INT,
  FKCol INT
);

CREATE TABLE dbo.#Thing2 (
  Thing2_Key INT IDENTITY,
  ColVal INT
);

INSERT dbo.#Thing (
  ColVal,
  FKCol
)
VALUES
  (1, 1),
  (4, 1),
  (0, 2),
  (0, 0);

INSERT dbo.#Thing2 (
  ColVal
)
VALUES
(1),
(2),
(3)

SELECT *
FROM dbo.#Thing;
/*
     ColVal       FKCol
----------- -----------
          1           1
          4           1
          0           2
          0           0
*/
UPDATE dbo.#Thing
SET ColVal = ColVal+1
WHERE EXISTS (
  SELECT *
  FROM dbo.#Thing2
  WHERE Thing2_Key = FKCol
);

SELECT *
FROM dbo.#Thing;
/*
     ColVal       FKCol
----------- -----------
          2           1
          5           1
          1           2
          0           0
*/
DROP TABLE dbo.#Thing2;

DROP TABLE dbo.#Thing;

Baronash
Feb 29, 2012

So what do you want to be called?

DELETE CASCADE posted:

code:
UPDATE customers c SET Grade = Grade + 1
WHERE exists (SELECT 1 FROM SummerRegistration sr WHERE c.CustomerID = sr.CustomerID);


Nth Doctor posted:

Try this:
code:
UPDATE c --customers
SET Grade = Grade + 1
FROM customers AS C
WHERE EXISTS (
  SELECT *
  FROM SummerRegistration AS S
  WHERE S.CustomerID = C.CustomerID
);
Here was my extremely quick and dirty POC:
code:
CREATE TABLE dbo.#Thing (
  ColVal INT,
  FKCol INT
);

CREATE TABLE dbo.#Thing2 (
  Thing2_Key INT IDENTITY,
  ColVal INT
);

INSERT dbo.#Thing (
  ColVal,
  FKCol
)
VALUES
  (1, 1),
  (4, 1),
  (0, 2),
  (0, 0);

INSERT dbo.#Thing2 (
  ColVal
)
VALUES
(1),
(2),
(3)

SELECT *
FROM dbo.#Thing;
/*
     ColVal       FKCol
----------- -----------
          1           1
          4           1
          0           2
          0           0
*/
UPDATE dbo.#Thing
SET ColVal = ColVal+1
WHERE EXISTS (
  SELECT *
  FROM dbo.#Thing2
  WHERE Thing2_Key = FKCol
);

SELECT *
FROM dbo.#Thing;
/*
     ColVal       FKCol
----------- -----------
          2           1
          5           1
          1           2
          0           0
*/
DROP TABLE dbo.#Thing2;

DROP TABLE dbo.#Thing;

These worked perfectly. Thanks!

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
fyi, i think you are doing better at sql than you think. your instinct that you "should be performing some sort of filtering operation rather than joining" is exactly correct. as you can see, in the solution we both posted, the join moves to the where clause, which is where filtering happens. unlike the inner join, which duplicates the left table row for every matching right table row, the exists() function will immediately return when it finds the first match. although, technically it's just a matter of sql syntax that it looks this way. the operation you are performing in relational db theory is called a semi-join. the opposite, where you use "not exists" instead, is called an anti-join. sql just doesn't have key words for these, like it does for "inner join" and "left join". so if you feel deep down like this filtering we are doing is kind of also a join, then you are also correct!

NPR Journalizard
Feb 14, 2008

Is there a way to export the results from a database schema compare from visual studio, into something that approaches a readable report?

I saw an internet post that asked for this over a decade ago that got a reply that they were working on it, so hopefully its done by now.

Spring Heeled Jack
Feb 25, 2007

If you can read this you can read
I have a question regarding user logins with Azure SQL and a failover group.

We have a failover group setup and I read that in order for the logins (sql users in this case) to be replicated they need to be contained users. As we already had them created as traditional users from logins that existed on the master, I went through and converted them using the following script: https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-migrate-user-to-contained-transact-sql

So now I can see the users showing up on the replicated DBs, but logging in is still not working. I made sure I am specifying the initial database in SSMS, but no go. Is there something else I am missing here? SQL isn’t normally my forte so feel free to give me any suggestions.

jarlywarly
Aug 31, 2018
I have a query that selects rows from a table where the foreign key could be a key in 2 other tables I need to return records where it matches both however the primary key in the 2 tables is linked to one other key which is the key we receive in the request but the 3rd table is only linked to this key through the second table. (I hope this makes sense)

table1 is key|ambiguous f_key|data
table2 is key|common_key|data
table3 is key|table2_fkey|data

My current query looks a bit like this

SELECT * from table1
WHERE table1.fkey IN (SELECT table3.pkey FROM table3 JOIN table2 ON table3.fkey = table2.pkey WHERE table2.commonkey = ?) OR table1.fkey IN (SELECT table2.pkey from table2 where commonkey = ?)
AND some_other_bullshit

This works but seems slow, is there a better way to go about this query? Or have a I messed up in some fundamental way?

Wait wait a min I can just use = and not IN right..

Thanks

jarlywarly fucked around with this message at 17:21 on Dec 9, 2020

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


jarlywarly posted:

I have a query that selects rows from a table where the foreign key could be a key in 2 other tables I need to return records where it matches both however the primary key in the 2 tables is linked to one other key which is the key we receive in the request but the 3rd table is only linked to this key through the second table. (I hope this makes sense)

table1 is key|ambiguous f_key|data
table2 is key|common_key|data
table3 is key|table2_fkey|data

My current query looks a bit like this

SELECT * from table1
WHERE table1.fkey IN (SELECT table3.pkey FROM table3 JOIN table2 ON table3.fkey = table2.pkey WHERE table2.commonkey = ?) OR table1.fkey IN (SELECT table2.pkey from table2 where commonkey = ?)
AND some_other_bullshit

This works but seems slow, is there a better way to go about this query? Or have a I messed up in some fundamental way?

Wait wait a min I can just use = and not IN right..

Thanks

I think your edit came in as I was reading, but yeah I'm intuiting a pair of LEFT OUTER JOINs instead of the INs and in your WHERE you can sort out which LOJ should be treated like an INNER JOIN on a row by row basis if you really need it.

For content from the Coding Horror thread. I'm trying to learn LINQ.

Nth Doctor posted:

I'm doing the Advent of Code this year, and using it as an excuse to cultivate some good habits and build up some experience with stuff I tend not to use very often.

One of the things I'm trying to do is to learn and use LINQ when I can justify it.

Spoilers for solutions to part 2 of yesterday's puzzles

Yesterday's challenge involved figuring out your seat on a sold out flight, when you've forgotten your boarding pass but managed to scan everyone else's passes.

Coming from a deep SQL background, I immediately thought of how I would use LEFT OUTER JOINs to do this and implemented this nonsense in LINQ:


// Looking for an unclaimed seat with neighbors
// Means we're looking for a pair of seats where one does not exist between them.
var missingId = boardingPasses
// self join to boarding passes where the seatID is one higher
.GroupJoin(
boardingPasses,
p1 => p1.SeatId,
p2 => p2.SeatId - 1,
(lowerPass, upperPassEnumerable) => new { lowerPass, upperPassEnumerable }
)

// flatten the upperPass IEnumerable, and default to null if the enumerable is empty
// map into a lower pass and an upper pass
.SelectMany(
pair => pair.upperPassEnumerable.DefaultIfEmpty(),
(lowerPair, upperPass) => new { lowerPair.lowerPass, upperPass }
)

// Preemptively filter out anything where we have a matching pair. We only want ones where the upperPass is not found
.Where(pair => pair.upperPass is null)

// Join again to boarding passes, on passes whose seat IDs are 2 higher than our pair's lower seatId
.GroupJoin(
boardingPasses,
p1 => p1.lowerPass.SeatId,
p2 => p2.SeatId - 2,
(lowerPair, upperPassEnumerable) => new { lowerPair.lowerPass, middlePass = lowerPair.upperPass, upperPassEnumerable }
)

// Once again flatten the upperPassEnumerable and default to null if it is empty
// Map into a trio of passes
.SelectMany(
passTrio => passTrio.upperPassEnumerable.DefaultIfEmpty(),
(triple, upperPass) => new { triple.lowerPass, triple.middlePass, upperPass }
)

// Filter to only cases where we have a non-null set of outer passes, and a null middle pass.
.Where(passTrio => (passTrio.lowerPass != null && passTrio.middlePass is null && passTrio.upperPass != null))

// Derive the missing ID from the upper and lower SeatIds
.Select(p => new { missingId = (p.upperPass.SeatId + p.lowerPass.SeatId) / 2 } )

// Give the first entry in the final set
.FirstOrDefault();

Console.WriteLine(missingId);


I shamefacedly shared this solution to my colleague who told me about Enumerable.Range() and Enumerable.Except():


var seatIds = boardingPasses.Select(pass => pass.SeatId);
var idRange = Enumerable.Range(seatIds.Min(), seatIds.Max());

var missingId = idRange.Except(seatIds).First();
Console.WriteLine(missingId);



It's me, I'm the horror.

jarlywarly
Aug 31, 2018

Nth Doctor posted:

I think your edit came in as I was reading, but yeah I'm intuiting a pair of LEFT OUTER JOINs instead of the INs and in your WHERE you can sort out which LOJ should be treated like an INNER JOIN on a row by row basis if you really need it.

For content from the Coding Horror thread. I'm trying to learn LINQ.

Thanks I'm not sure though how to go about this, the main issue is that the 2 primary keys are different and used in the same foreign key in table1 also I need to join table1 to table2 and table3 but also select from both using another join to get the real key from table2 from a key in table3

Data looks like this

tbl1

p_key|f_key|data
1 23 blah
2 96 blah

tbl2

p_key|f_key|source_key|data
1 23 99999

tbl3

pkey|f_key_to_tbl1|data
96 1 blah

I get 99999 and need to return both rows from tbl1

So I need to look up the f_key from tbl2 and select from tbl1 and also need to look up the p_key from tbl3 based on the f_key from tbl2 and then use that to select from tbl1 also.

Man just writing it out is making my brain whirr.

nielsm
Jun 1, 2009



Try this:

SQL code:
SELECT COALESCE(a1.p_key, a2.p_key) p_key, COALESCE(a1.f_key, a2.f_key) f_key, COALESCE(a1.data, a2.data) data
FROM tbl2 b
LEFT OUTER JOIN tbl1 a1 ON b.fkey=a1.f_key
LEFT OUTER JOIN tbl3 c ON b.p_key=c.f_key_to_tbl1
LEFT OUTER JOIN tbl1 a2 ON c.pkey=a2.f_key
WHERE (a1.p_key IS NULL) <> (a2.p_key IS NULL)
Join either one direction or the other, use a condition that selects only the rows where one or the other direction produces a join, but not where both or neither do.

Actually, the other solution that may be simpler is to just write two separate queries and use UNION ALL between them.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

jarlywarly posted:

Thanks I'm not sure though how to go about this, the main issue is that the 2 primary keys are different and used in the same foreign key in table1 also I need to join table1 to table2 and table3 but also select from both using another join to get the real key from table2 from a key in table3

Data looks like this

tbl1

p_key|f_key|data
1 23 blah
2 96 blah

tbl2

p_key|f_key|source_key|data
1 23 99999

tbl3

pkey|f_key_to_tbl1|data
96 1 blah

I get 99999 and need to return both rows from tbl1

So I need to look up the f_key from tbl2 and select from tbl1 and also need to look up the p_key from tbl3 based on the f_key from tbl2 and then use that to select from tbl1 also.

Man just writing it out is making my brain whirr.

My friend, have you heard the good word of our lord and savior, the semi-join?

SQL code:
select
  tbl1.p_key,
  tbl1.data
from tbl1
where
  exists(
    select 1
    from tbl2
    where
      tbl1.f_key = tbl2.f_key and
      tbl2.source_key = ?
  )
  or exists(
    select 1
    from
      tbl3
      inner join tbl2 on tbl3.f_key_to_tbl2 = tbl2.p_key
    where
      tbl3.pkey = tbl1.f_key and
      tbl2.source_key = ?
  )
But really, fix your database design, a foreign key with no actual constraint is a shameful foreign key.

Also this solution is most likely going to perform worse than this:

nielsm posted:

write two separate queries and use UNION ALL between them.

Left joins are the devil, and joins and OR conditions don't go well together in general. Something deceptively simple like

SQL code:
select distinct sometable.*
from
  sometable
  left join othertable on sometable.othertable_id = othertable.id
where
  sometable.somecol = someparam or
  othertable.othercol = otherparam
is surprisingly hard to optimize and at least in postgres usually leads to a seq scan of both tables even if both columns used in the WHERE clause are indexed. I recently took a query that was only slightly more complex than this (it had another semi-join to another couple tables as well) and rewrote it to

SQL code:
select sometable.*
from
  sometable
where
  sometable.somecol = someparam
union
select sometable.*
from
  sometable
  inner join othertable on sometable.othertable_id = othertable.id
where 
  othertable.othercol = otherparam
and it ran about a factor 1000 faster.

In fact, when I tried these two queries (exactly as simple as written above) side by side on a couple of tables in a production size db right now these two I got these two plans:



Left join on the left, union on the right. The main table ("sometable" in the examples above) has about 6.75m rows; the other table only has less than 10k rows. Query plan visualizer used is the excellent explain.dailibo.com; I used the compact node visualization without table names to protect the innocent.

Friends don't let friends use left joins.

TheFluff fucked around with this message at 10:14 on Dec 11, 2020

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
Suggesting that people should altogether avoid writing outer joins is malpractice.

jarlywarly
Aug 31, 2018

TheFluff posted:

My friend, have you heard the good word of our lord and savior, the semi-join?

SQL code:
select
  tbl1.p_key,
  tbl1.data
from tbl1
where
  exists(
    select 1
    from tbl2
    where
      tbl1.f_key = tbl2.f_key and
      tbl2.source_key = ?
  )
  or exists(
    select 1
    from
      tbl3
      inner join tbl2 on tbl3.f_key_to_tbl2 = tbl2.p_key
    where
      tbl3.pkey = tbl1.f_key and
      tbl2.source_key = ?
  )
But really, fix your database design, a foreign key with no actual constraint is a shameful foreign key.

Also this solution is most likely going to perform worse than this:


Left joins are the devil, and joins and OR conditions don't go well together in general. Something deceptively simple like

SQL code:
select distinct sometable.*
from
  sometable
  left join othertable on sometable.othertable_id = othertable.id
where
  sometable.somecol = someparam or
  othertable.othercol = otherparam
is surprisingly hard to optimize and at least in postgres usually leads to a seq scan of both tables even if both columns used in the WHERE clause are indexed. I recently took a query that was only slightly more complex than this (it had another semi-join to another couple tables as well) and rewrote it to

SQL code:
select sometable.*
from
  sometable
where
  sometable.somecol = someparam
union
select sometable.*
from
  sometable
  inner join othertable on sometable.othertable_id = othertable.id
where 
  othertable.othercol = otherparam
and it ran about a factor 1000 faster.

In fact, when I tried these two queries (exactly as simple as written above) side by side on a couple of tables in a production size db right now these two I got these two plans:



Left join on the left, union on the right. The main table ("sometable" in the examples above) has about 6.75m rows; the other table only has less than 10k rows. Query plan visualizer used is the excellent explain.dailibo.com; I used the compact node visualization without table names to protect the innocent.

Friends don't let friends use left joins.

Yeah it's the same old story this DB is old and it's not getting changed soon.

Thanks though I came to the same conclusion a UNION query is the best performing solution, one minor issue is elegantly passing the parameter twice from the source code (Javascript lol)

One thing that I didnt realise I didn't know was how do you efficiently query when you need to use 2 tables to get your key i.e. in this circumstance I get a secondary key and that key gives me a key from another table where I get the final key to use in the search

ie I get X

tbl3 is

p_key, 2_key
123,X

tbl2 is

p_key,f_key
345,123

tbl1 is
p_key,f_key
678,345

the code I tried gave me values from tbl1 where there was a key collision between tbl3 and tbl2

jarlywarly fucked around with this message at 10:43 on Dec 11, 2020

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Hammerite posted:

Suggesting that people should altogether avoid writing outer joins is malpractice.

If you want the thrill of never knowing if a value is going to be null or not just use a nosql database :colbert:

No but seriously though I almost never use outer joins. There's usually a better way of expressing what you want. Don't use nullable columns either if you can at all avoid it.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

TheFluff posted:

No but seriously though I almost never use outer joins. There's usually a better way of expressing what you want.

This claim you make is not consistent with my experience.

TheFluff posted:

Don't use nullable columns either if you can at all avoid it.

If you need to represent "data is missing" or "column is not applicable", then a nullable column is indicated. If you don't need to represent either of these situations, then a nullable column is not indicated. I've never been in a situation where it's a matter of personal stylistic choice whether to make a column nullable or not, it's always been clear from the context whether it would or wouldn't be correct to make the column nullable. I suppose that to the extent that your observation boils down to "if the column shouldn't be nullable, then don't make it nullable", I agree with it.

nielsm
Jun 1, 2009



Hammerite posted:

If you need to represent "data is missing" or "column is not applicable", then a nullable column is indicated. If you don't need to represent either of these situations, then a nullable column is not indicated. I've never been in a situation where it's a matter of personal stylistic choice whether to make a column nullable or not, it's always been clear from the context whether it would or wouldn't be correct to make the column nullable. I suppose that to the extent that your observation boils down to "if the column shouldn't be nullable, then don't make it nullable", I agree with it.

I believe the idea is that if you have data that can be absent, then those data should live in a separate table that joins with the one you'd otherwise put the nullable column in.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

nielsm posted:

I believe the idea is that if you have data that can be absent, then those data should live in a separate table that joins with the one you'd otherwise put the nullable column in.

You can do that, but then to get at that data you would use... an outer join, which TheFluff does not advise!

and if it's a single column, then you may or may not gain anything by having another table. That is, if it's a single value that can be either present or not present, it is frequently fine just to have a nullable column. If it's multiple columns that are either all null or all not-null, then that data is "trying to be" another table, and you should make it so. The main scenarios where it should be another table even though it's a single column is when you want to enforce that the values belong to an enumerated list, or when values that are the same in different rows ought to be explicitly recognised as being "the same thing" shared between those rows.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Hammerite posted:

This claim you make is not consistent with my experience.
Okay? There are some very specific situations where a left join is appropriate, but in my experience the vast majority of things people actually use left joins for would be better written as something else.

Like, if you're using any conditions at all to filter the query by the outer-joined table then it's no longer an outer join at all, so just write inner join instead. It's also common to see people implement anti-joins using a left join, but I find a NOT EXISTS to be a much more explicit way of writing that. In the case I showed above where the left join was really slow, we weren't even using data from the right hand table so we could have written EXISTS instead, too, and a semi-join can usually be faster than an actual join (in that case it wouldn't have made a difference though, because the root cause of the problem was actually the OR conditional).

The use case "I want data from this other table if it exists, but I don't want to filter by it" is really not that common in practice my experience. I mean, why do you want data that might not exist? In most cases when I'm finding myself using a left join I'm doing some kind of aggregation where I'm counting both how many of a thing exist and how many of these things have another thing associated with them.

Hammerite posted:

If you need to represent "data is missing" or "column is not applicable", then a nullable column is indicated. If you don't need to represent either of these situations, then a nullable column is not indicated. I've never been in a situation where it's a matter of personal stylistic choice whether to make a column nullable or not, it's always been clear from the context whether it would or wouldn't be correct to make the column nullable. I suppose that to the extent that your observation boils down to "if the column shouldn't be nullable, then don't make it nullable", I agree with it.

Things are rarely that clear-cut in reality, in my experience. A lot of the time you have a choice between null and a default like 0/empty string/false, and in that case even when null may be more "semantically correct" or whatever, choosing null has significant complexity costs because now everything needs to handle both nulls and values. Even when it's legitimately important to be able to represent the lack of a value, you should step back and ask yourself why that is. Why is this column in the table in the first place if the data sometimes isn't there? Why isn't it there? Will it be filled in later? Is it something unknown? Do we need to distinguish between "not known" and "known to not exist"? Maybe it'd be more appropriate to factor it out to a separate table instead? If you can avoid a nullable column with a different design, either by changing the database or by changing whatever system interacts with it, it's usually a good idea to do so.

TheFluff fucked around with this message at 12:51 on Dec 11, 2020

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Hammerite posted:

You can do that, but then to get at that data you would use... an outer join, which TheFluff does not advise!
You can, but only if you want to pretend that two things that are actually different are the same. See, the fundamental issue here is that if you have a Foo that can optionally have a Bar and a Baz, then Foos without those two is actually an object of a different type than Foos that have them. In some cases it may be appropriate to treat them the same, but in others it might not. Even if you want to treat them the same you'll probably need a bunch of logic in a bunch of places to handle the fact that they're actually different. What I'm trying to get at there is that there is a real cost to having to do this, so you should consider if it's actually necessary.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
I feel like most everything you say is backed up by appealing to what your past experience has been, and that seems to have been very different to what my past experience has been, and so it's difficult for me to (1) reason about what you say [because I lack the motivating examples that are informing your attitudes] and (2) respond to it.

For example: 'The use case "I want data from this other table if it exists, but I don't want to filter by it" is really not that common in practice my experience.' OK, then all I can say to that is that my experience has been quite different from yours.

Other points you make I can't really make heads or tails of. "Why do you want data that might not exist?" Because I want to know whether it exists and if it does, what it is. Similarly, "Why is this column in the table in the first place if the data sometimes isn't there?" Because sometimes it is there, and I want to keep a record of whether it exists and if it does, what it is.

You make the point that 'Do we need to distinguish between "not known" and "known to not exist"?' This is a weakness of the SQL concept of NULL. If both possibilities have to be accounted for and distinguished from one another, then it is clear that just using a nullable column is not an adequate solution, since NULL is only one state that the column can be in yet two states need to be represented.

In my prior experience, NULL qua "known to be inapplicable" is a great deal more common than NULL qua "value unknown", but others' experience may differ. There is even perhaps a third kind of usage where the NULL means that the application using this database as its backing store regards the data as (definitively) not applicable, but only because it hasn't been entered by the human operators, and their reason for not populating the data is unknown or is outside of the scope of the application.

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

I'm with Hammerite, incomplete data - or more exactly, not yet complete data - is the bread and butter of business application.

I'm guessing maybe TheFluff works mostly with time series, metric, or other analytics data? In which case there are generally fewer levels of nesting and 'partial' objects are rare.

But consider the archetypal CRUD service: an e-commerce that takes orders and ships them. The order, when first created, will have no shipments, and only later it will have 1 or possibly many shipments associated with it. Your dashboards will need to handle both cases, so your queries are going to be some variation of SNELECT (fields) FROM orders LEFT JOIN shipments ON orders.id = shipments.order_id

How would you handle that without left joins? You could run two queries - SNELECT (fields) FROM orders WHERE (conditions); SNELECT (fields) FROM shipments WHERE order_id IN (SELECT order_id FROM orders WHERE (conditions)) - and then perform the join in your application code, with an ORM or manually, but if you want to see a single result in SQL that's not an option.

edit: loving LOL, the forums blocked my posts when I had inline SQL :v: I changed 'select' to 'snelect' to bypass that.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

NihilCredo posted:

But consider the archetypal CRUD service: an e-commerce that takes orders and ships them. The order, when first created, will have no shipments, and only later it will have 1 or possibly many shipments associated with it. Your dashboards will need to handle both cases, so your queries are going to be some variation of SNELECT (fields) FROM orders LEFT JOIN shipments ON orders.id = shipments.order_id

How would you handle that without left joins? You could run two queries - SNELECT (fields) FROM orders WHERE (conditions); SNELECT (fields) FROM shipments WHERE order_id IN (SELECT order_id FROM orders WHERE (conditions)) - and then perform the join in your application code, with an ORM or manually, but if you want to see a single result in SQL that's not an option.

This is a good example of what I'm talking about. An order that has shipped (or partially shipped) is very different from one that hasn't shipped at all. Is it really that common to treat these two identically?

But for the sake of the argument let's say that we have a use case for retreiving many orders regardless of whether they have shipments or not, and we actually want data from both orders and shipments, not just an aggregation of shipments. For this case a lot of ORM's will default to doing just what you describe; do a left join and then deduplicate the resultset on the application side using the primary key of the orders table and construct model instances appropriately. This is sort of okay at small scales but in general this isn't a great pattern because the way the resultset grows with increasing number of combinations (like if you're also involving a products table, for example - then you'll get orders * shipments * products rows in the resultset), and all of the associated duplicated data that has to be transferred. It's very common but it really doesn't scale well. Multiple queries is likely to scale better even if you have to join on the application side.

Left joins can make sense for aggregations in a use case like this though, like for example listing orders together with how many shipments they have, or something like that. If you're doing something where you just want to know if there is any shipment at all though then an exists subquery in the select clause is probably a better idea.

TheFluff fucked around with this message at 15:13 on Dec 11, 2020

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


I'm also in Hammerite and NihilCredo's camp. One-to-zero-or-more relationships are found all over the applications I work with.

Aside from the order-shipment example NihilCredo gives, orders themselves can have an internal LOJ needed from the order header to the order lines.

Or say you're a commercial kitchen working with ingredients, some may need an MSDS included but you don't store that in the record of the container or the ingredient itself. You deduplicate so that you have:

a ingredient table generally describing the stuff you're working with like name, color, taste etc...

a container table FK'd to ingredient describing the stuff you're working with having stuff like: container number, creation/purchase date, quantity, use by date, ingredient key

an MSDS table for any ingredients the business has deemed necessary as needing them, where ingredient has a FK to MSDS

To get from scanning the container number to pulling up the container info possibly including the MSDS you need a LEFT OUTER JOIN from ingredient.

Similarly if you're in charge of stocking the kitchen up on ingredients, querying for containers needs an LOJ because you'd certainly want to know if there are NO remaining containers of something. Milk doesn't cease to exist just because you don't have any left.

I'm really interested, TheFluff about the data set you work with. Care to share?

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Nth Doctor posted:

I'm also in Hammerite and NihilCredo's camp. One-to-zero-or-more relationships are found all over the applications I work with.

Aside from the order-shipment example NihilCredo gives, orders themselves can have an internal LOJ needed from the order header to the order lines.

Or say you're a commercial kitchen working with ingredients, some may need an MSDS included but you don't store that in the record of the container or the ingredient itself. You deduplicate so that you have:

a ingredient table generally describing the stuff you're working with like name, color, taste etc...

a container table FK'd to ingredient describing the stuff you're working with having stuff like: container number, creation/purchase date, quantity, use by date, ingredient key

an MSDS table for any ingredients the business has deemed necessary as needing them, where ingredient has a FK to MSDS

To get from scanning the container number to pulling up the container info possibly including the MSDS you need a LEFT OUTER JOIN from ingredient.

Similarly if you're in charge of stocking the kitchen up on ingredients, querying for containers needs an LOJ because you'd certainly want to know if there are NO remaining containers of something. Milk doesn't cease to exist just because you don't have any left.

I'm really interested, TheFluff about the data set you work with. Care to share?

0-n relationships in themselves are perfectly fine, there are many things that are best modeled using that pattern. In many cases I don't think outer joins are the appropriate tool for traversing them, though. As I mentioned above, in a lot of the cases people are describing, two (or more) queries would make more sense than an outer join. An outer join kinda implies that you're squeezing more than one type of object into the resultset, and there are some good reasons for not doing that.

I work on a completely bog standard education web app, nothing unusual about it at all. Lots of plain old CRUD stuff, some tree data, some statistics stuff, but nothing particularly unusual. Plenty of nullable columns (that I'm gradually trying to get rid of whenever it makes sense) and plenty of 0-n relationships. Our ORM generates plenty of left joins for various reasons but when I'm handwriting queries (when doing statistics reports, for example) I rarely use them.

My opinions about outer joins and nulls in general are of course informed by my experience, but also from reading a lot of people angry about computers (such as Gary Bernhardt, for example).

TheFluff fucked around with this message at 15:40 on Dec 11, 2020

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Throwing my hat in with the left join and nullable column camp. In my many years working with large-scale enterprise data (e.g. project, employee, and software development data at a 10k employee org) left joins and nullable columns were extremely common and often the most simple and efficient way to handle storing and traversing the data.

I’m on my phone or I would write up more, but my perspective informed by experience is very similar to what Hammerite describes.

I totally see what you’re saying - that for example orders that haven’t been shipped and orders that have are intrinsically different and could be separated purposefully - but I see this as a sort of academic “in theory”-style argument. That makes it easy to refute simple examples. In practice the reality is that business data is rarely that simple. I don’t think I would have been able to do my job effectively without left joins and nullable columns.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003

TheFluff posted:

The use case "I want data from this other table if it exists, but I don't want to filter by it" is really not that common in practice my experience. I mean, why do you want data that might not exist?

because I'm generating a report for some idiot to view in excel and I want the blank cells to be blank???

outer joins are absolutely completely fine, they exist for a reason, yes they can be misused and it's not always obvious when you're misusing one, that comes with sql experience. but sometimes you need an outer join and when you're doing it right there is no performance penalty whatsoever

Slimchandi
May 13, 2005
That finger on your temple is the barrel of my raygun
I'm looking on tips for a SQL learning resource (either printed reference or online tutorial or both). I've come from a data analysis background, working predominantly with Pandas, JSON etc. My new role will involve working on BI data pipelines and dashboarding. At my previous org we did a lot of the manipulation work in pandas (as it was the common tool we were most familiar with), but in the new place this will be SQL on AWS Athena.

I understand the basic paradigms of data analysis, joins, filters, groupbys etc and familiar with optimising them in pandas/python, but I'm concerned from reading a bit of this thread that whatever I hack together with SO will be a horrible unperformant mess. Is there something you could recommend beyond the stuff available at sites like sqlzoo that discusses both the how and why of modern performant SQL?

Slimchandi fucked around with this message at 10:40 on Dec 13, 2020

redleader
Aug 18, 2005

Engage according to operational parameters
i didn't know that "left join" was a hot topic in the sql community

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


redleader posted:

i didn't know that "left join" was a hot topic in the sql community

don’t get me started on inners

NPR Journalizard
Feb 14, 2008

Slimchandi posted:

I'm looking on tips for a SQL learning resource (either printed reference or online tutorial or both). I've come from a data analysis background, working predominantly with Pandas, JSON etc. My new role will involve working on BI data pipelines and dashboarding. At my previous org we did a lot of the manipulation work in pandas (as it was the common tool we were most familiar with), but in the new place this will be SQL on AWS Athena.

I understand the basic paradigms of data analysis, joins, filters, groupbys etc and familiar with optimising them in pandas/python, but I'm concerned from reading a bit of this thread that whatever I hack together with SO will be a horrible unperformant mess. Is there something you could recommend beyond the stuff available at sites like sqlzoo that discusses both the how and why of modern performant SQL?

I learnt a bunch from https://www.w3schools.com/sql/default.asp when I started out.

abelwingnut
Dec 23, 2002


i hadn’t heard of athena. it sounds like amazon’s answer to snowflake?

NinpoEspiritoSanto
Oct 22, 2013




Please don't give w3schools traffic

Adbot
ADBOT LOVES YOU

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

redleader posted:

i didn't know that "left join" was a hot topic in the sql community

NULL in SQL is quite controversial and has led to a lot of people being very mad at computers, and outer joins result in nulls, so it shouldn't come as a surprise really.

DELETE CASCADE posted:

because I'm generating a report for some idiot to view in excel and I want the blank cells to be blank???

outer joins are absolutely completely fine, they exist for a reason, yes they can be misused and it's not always obvious when you're misusing one, that comes with sql experience. but sometimes you need an outer join and when you're doing it right there is no performance penalty whatsoever

Well, yeah, obviously, there are use cases where an outer join is exactly the right tool for the job and I've explicitly agreed with that above. My entire point is that a lot of the common usage patterns for outer joins (and nullable columns) are footguns or at least footgun-enablers. SQL has no shortage of that sort of thing but nulls in general and things that lead to nulls is one of the most common.

Ruggan posted:

I don't think I would have been able to do my job effectively without left joins and nullable columns.

I don't think anyone could? Like, when I wrote "friends don't let friends left join" I thought that was obviously hyperbolic enough to come off as a joke.

  • 1
  • 2
  • 3
  • 4
  • 5
  • Post
  • Reply