Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
It's been a while since I did anything with MSSQL, but I think if you put WITH SCHEMABINDING in the function declaration, SQL Server will be smart enough to know your function is deterministic.
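E.g., something like this (a minimal sketch; the function itself is just for illustration):
SQL code:
create function dbo.FullName (@First varchar(50), @Last varchar(50))
returns varchar(101)
with schemabinding
as
begin
    return @First + ' ' + @Last;
end
You can check what the engine decided with OBJECTPROPERTY(OBJECT_ID('dbo.FullName'), 'IsDeterministic').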

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

kumba posted:

I've been working in SQL dbs for the last 2 years, primarily as an application / systems analyst doing troubleshooting and rather simple data analysis. I find myself now in a position where I'm needing to design a database for some quality metrics for one of our departments, and now I'm running into what I think are probably simple questions that I'm having a hard time googling a good answer for.

When setting up a basic structure for users and groups, how do I go about ensuring users can move between groups while maintaining the ability to report on a user's performance over time? I know I need to setup a cross-reference table between users and groups, but I'm wondering what the best way to go about tracking this is.

Currently here's my structure I'm thinking of:

[snip]
This is more or less a Slowly Changing Dimension, Type II. It looks like you've pretty much got the basics.

quote:

So, xr.RepsTeams has RepID and TeamID as well as an effective date as a composite primary key. I just now realized that because of that, I probably don't need the [JoinedTeamOn] field in Reps, since I have that in the xr table but anyway I digress.
Since both [TeamID] and [JoinedTeamOn] are aspects of the Rep-Team relationship, you'd want to keep them in the RepTeam table to stay properly normalized. So get rid of both from Reps, unless you do performance testing that shows that going to the RepTeam table is a huge hit.

quote:

My main question is: what is the best method to actually populate the xr table? My thought is that while I hate triggers in general, this seems to be the best use case for one outside of auditing tables. So,
  • An insert trigger to add a RepID/TeamID/Date combination for a new hire.
  • An update trigger to add a RepID/TeamID/Date combination for an existing employee changing teams (so anytime an update is performed on the TeamID column in dbo.Reps)
You write your app so that it inserts and/or updates the records in the xr table, and then you don't need a trigger. Maybe you write an Insert trigger to set the EndDate on the now-previous active record, but that's it.
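App-side, a team change would look something like this (a sketch, assuming T-SQL and the table names from your post; @RepID and friends are hypothetical parameters):
SQL code:
-- close out the current assignment, if any
update xr.RepsTeams
   set EndDate = @MoveDate
 where RepID = @RepID
   and EndDate is null;

-- open the new assignment
insert into xr.RepsTeams (RepID, TeamID, EffectiveDate)
values (@RepID, @NewTeamID, @MoveDate);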

quote:

  • A delete trigger? Once a rep is in Reps, no one should (in theory) be deleted from it (that's what the Deprecated column is for - an entry with Deprecated = 1 is considered to be Inactive. I don't know if that's standard procedure everywhere but in almost all our databases where I work that is standard). But in the event someone does get deleted from Reps, I guess I need this to avoid orphans?
You'd create the FK to be either CASCADE (to delete orphans) or NO ACTION (to forbid the creation of orphans). The effect is, more or less, the same as a trigger of some sort, but it communicates the intent better and keeps the constraint in one place. The FK on TeamID would definitely be NO ACTION, whereas for the FK on RepID either is sensible.
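A minimal sketch of what that looks like (T-SQL, hypothetical constraint and table names):
SQL code:
-- deleting a Team that is still referenced should fail outright
alter table xr.RepsTeams add constraint FK_RepsTeams_Teams
    foreign key (TeamID) references dbo.Teams (TeamID)
    on delete no action;

-- deleting a Rep could instead sweep away their assignment history
alter table xr.RepsTeams add constraint FK_RepsTeams_Reps
    foreign key (RepID) references dbo.Reps (RepID)
    on delete cascade;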

quote:

As a secondary question, should I have an "EndDate" field in the XR table as well for ease of use or is this overkill? I wouldn't make it part of the composite key, and I'd leave it NULL until a person left a team, [snip]
In general, you'd want either an EndDate or a CurrentFlag to make finding the current record easy, and since you're looking to do a lot of slicing by date, EndDate seems to make the most sense. That way you don't have to look at multiple records to figure out where one record begins and one ends.
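With an EndDate, the common lookups stay simple (hedged sketch, hypothetical names; @AsOf is a hypothetical parameter):
SQL code:
-- current assignment per rep
select RepID, TeamID
  from xr.RepsTeams
 where EndDate is null;

-- assignment as of a given date
select RepID, TeamID
  from xr.RepsTeams
 where @AsOf >= EffectiveDate
   and (EndDate is null or @AsOf < EndDate);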

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

PierreTheMime posted:

It's definitely not a specific offset, unfortunately.

This is for a system where all possible functions are objects. A script to execute is an object, a user is an object, a calendar is an object, etc. The main table contains all default object properties such as "object name" and an identifying number to use to cross reference with other tables for more specific information such as "user first name" which is stored in a separate user table. What I'm trying to do is search for an object type and find what user modified it last. The problem being that the object type and the user both need to get the name from the same table from the same column, just a different row based on the "moduserid" which is a value on the row of the object type. I don't know how to select the same column from the same table twice and get different results based on a value on a secondary column from the first object.

I'll edit in an example query and the result when I get to my desk.
Self join?
code:
select objtypes.objectname object_type_name,
             users.objectname last_modified_by
  from objects objtypes
       inner join objects users on objtypes.moduserid = users.objectid
 where objtypes.type = 'OBJECT_TYPE'
You're using the same table twice for two different things, so you use the table twice as if they were two different tables.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
It's not 100% clear what output you're expecting. Based on your question, there are two options.
code:
OBJECTNAME				MODIFIEDDATE			USERID
MAT.SCH.DAILY_EST.OLD.20170502.195334	2017-05-02 19:45:26.000		999999/CORP
MAT.SCH.DAILY_EST			2017-05-02 19:53:34.000		999999/CORP
MAT.SCH.SLEEP_05			2017-05-04 14:58:05.000		111111/Blah
MAT.SCH.SLEEP_05.OLD.20170504.144048	2017-05-04 14:30:47.000		111111/Blah
R33.SCH.YIEOR.OLD.20170502.190701	2017-05-02 19:07:00.000		222222/Butts
Or
code:
OBJECTNAME				MODIFIEDDATE			USERID
MAT.SCH.DAILY_EST.OLD.20170502.195334	2017-05-02 19:45:26.000		999999/CORP
MAT.SCH.DAILY_EST			2017-05-02 19:53:34.000		999999/CORP
MAT.SCH.SLEEP_05			2017-05-04 14:58:05.000		999999/CORP
MAT.SCH.SLEEP_05.OLD.20170504.144048	2017-05-04 14:30:47.000		999999/CORP
R33.SCH.YIEOR.OLD.20170502.190701	2017-05-02 19:07:00.000		999999/CORP
If it's the first, that's a self-join.
code:
select O.OH_Name as ObjectName,
       O.OH_ModDate as ModifiedDate,
       OU.OH_Name as UserId
  from OH O, USR U, OH OU
 where O.OH_OType = 'JSCH'
   and O.OH_Client = &$CLIENT#
   and O.OH_ModDate >= SYSDATE-7
   and U.USR_OH_Idnr = O.OH_ModUserIdnr
   and U.USR_FirstName not like 'One'
   and OU.OH_IDNR = O.OH_ModUserIdnr
If it's the second, you need the self-join plus some way to determine what the "first" row is, using ranking functions and a CTE or subquery (like kumba suggested).
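Something shaped like this, as a rough sketch of the second output (untested; assuming "first" means the row with the latest OH_ModDate):
SQL code:
;with ranked as (
select OH_Name, OH_ModDate, OH_ModUserIdnr,
       row_number() over (order by OH_ModDate desc) rn
  from OH
 where OH_OType = 'JSCH')
select r.OH_Name as ObjectName,
       r.OH_ModDate as ModifiedDate,
       OU.OH_Name as UserId
  from ranked r
       cross join (select OH_ModUserIdnr from ranked where rn = 1) f
       inner join OH OU on OU.OH_IDNR = f.OH_ModUserIdnr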

Jethro fucked around with this message at 22:01 on May 5, 2017

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Why are you concerned about pairs of records when the dups don't come in pairs? Is there no universal ranking you can use to determine the one winner, so pairwise comparison is necessary? Assuming you can just order by things (as kumba put it):

code:
;with dups as (
select main_id,
       addr_ID,
       row_number() over (partition by main_id order by things) addr_rnk,
       count(*) over (partition by main_id) dup_ct
  from base_table),
dup_win as (
select main_id,
       addr_ID winner_id
  from dups
 where addr_rnk = 1
       and dup_ct > 1)
select main_id,
       winner_id
  from dup_win

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Roundboy posted:

dupes come in pairs because i can only find them with a self join ala:


Edit: I think I got it. I switched to using rank() vs row_number() because I actually wanted it to rank, and adjusted my partition by because it was grabbing too much.

The ultimate result was that there were the same 3 pairs of groups as above, but the ones I wanted were all rnk = 1 .. which I can just select for in the CTE

code:
select 
a.main_id,a.addr_id
,row_number() over(partition by a.main_id, a.address_status, a.address_line1, a.address_line2, a.state, a.zip order by a.addr_rank asc,a.addr_status desc) as rnk
from address a

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
code:
SELECT T.TicketNumber,
       B.*
  FROM Ticket T
        LEFT OUTER JOIN Branch B ON B.BranchNum = SUBSTRING(T.TicketNumber,1,1)

quote:

edit2: There is no branches table to join with which is why I'm resorting to this.
In that case you could probably fake it with a CASE statement, but you'd have to hard-code it all into the query, in which case you aren't gaining anything, so either create the Branch table or don't bother.

E: I suppose you might be able to create a function, but you're still hard-coding stuff.

Jethro fucked around with this message at 16:19 on Jul 17, 2017

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
I'm pretty sure the problem is that the "number of posts in the thread" is materialized in the thread record, and this is incremented every time someone posts. Meanwhile, the unread post count probably looks directly at the post records, so now that 6 posts have been deleted from the post table, the number of read posts is always at least 6 less than the "number of posts" on the thread. So either zdr can fix all threads where the post count doesn't line up, or we can just deal with it and/or recreate the thread.
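If zdr were going to fix it, it'd presumably be something like this (purely a guess at the forum schema; table and column names are hypothetical):
SQL code:
-- recompute the materialized count from the actual post records
update threads t
   set post_count = (select count(*) from posts p where p.thread_id = t.thread_id)
 where t.post_count <> (select count(*) from posts p where p.thread_id = t.thread_id);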

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Withnail posted:

Can anyone help a sql dummy figure out how to get the following hierarchical query to order by the hierarchy? The query below just orders by the rows' order in the table.
code:
WITH PartList AS
( 
  SELECT *,
         part_name as hier
    FROM dbo.Part
   WHERE PartId = '42'
  UNION ALL
  SELECT a.*,
         p.hier+'|'+a.part_name
    FROM dbo.Part a 
         INNER JOIN PartList p ON a.ParentPartId = p.PartId 
) 
SELECT * FROM PartList
order by hier

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

nwiniwn posted:

Two very basic questions coming your way...

I have a Field in a table where the field is 'Firstname, Lastname', so it looks like:

'John, Smith'
'Jane, Doe'

How do I go about sorting this field by last name in alphabetical order, so it displays like:

'Doe, Jane'
'Smith, John'

Also, is there a way to add text into a field's data?

For example, I have a field that lists all of the hire dates for employees that displays as:
'11/11/1996'
'02/13/2008'

But I would like each individual entry to display as:

'Hired in 1996'
'Hired in 2008'

removing the mm/dd and only keeping the year. Any ideas?

Thanks for the help - I'm using Oracle. That function for the hired-in year says it's missing a right parenthesis.

code:
select regexp_replace(name,'^([^,]+),\s+(.*)$','\2, \1') as formatted_name
from table
order by formatted_name
code:
select 'Hired in ' || to_char(hiredate,'YYYY') as hired
from table

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Busy Bee posted:

Thank you for the clarification.

I moved onto the ANY and ALL operators and I'm having a hard time wrapping my head around it while I prepare for this analyst interview. Here is the resource I am using: https://www.w3schools.com/sql/sql_any_all.asp

I don't understand the use of the ALL operator, especially with a data set. I would assume it's not that common for a column of data to ALL be the same such that it returns TRUE?

None of the videos on YouTube or the tutorials really explain the ANY / ALL operators well.

In addition, why would someone want to use a subquery rather than a JOIN? Are they not similar?

Busy Bee posted:

code:
SELECT ProductName
  FROM Products
 WHERE ProductID = ALL (SELECT ProductID FROM Orders WHERE Status = 'Shipped')
This is where I am confused. If you go to this link https://www.w3schools.com/sql/sql_any_all.asp and look at the last example where they use ALL, when you run the query there are no orders with quantity 10 that come up and the result is 0 rows. Rather, if I change "ALL" to "ANY", then all the orders where quantity = 10 pop up. So in your code above, what is the advantage of using "ALL" instead of "ANY"? Would ANY show all the ProductID's where the Status is 'Shipped'?

First note: I have been writing SQL almost every working day for the last 10+ years and I have never used or even thought about using the ANY or ALL operators. If anyone brings them up in a tech interview they're probably smug shitlords you wouldn't want to work with.

Second note: The explanation and examples given by w3schools and Gatac* on ALL are wrong. Here's a better page https://oracle-base.com/articles/misc/all-any-some-comparison-conditions-in-sql. ANY and ALL are about comparing a value to ANY or ALL of the values from a list or subquery. As you surmised, you're almost never going to have a list or subquery return all the same value, so Column =ALL (select value from subquery) is going to be pretty useless. But you can still use other operators, so Column >ALL (select value from subquery) would work just fine, except you'd never do that instead of Column > (select max(value) from subquery), and you'd never use Column =ANY (select value from subquery) instead of Column in (select value from subquery).

*It's not Gatac's fault. ANY and ALL are kinda confusing, not really used all that often, and he was looking at the wrong info from the page you had linked. It is, however, w3schools' fault for giving out bad instruction, as is not uncommon.
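To make the >ALL point concrete, here's the kind of pair I mean (a sketch with hypothetical tables; note they only behave identically when the subquery is non-empty and null-free, since >ALL of an empty set is true for every row while > against a NULL max matches nothing):
SQL code:
-- these return the same rows for a non-empty, null-free subquery:
select * from products
 where price > all (select price from discontinued_products);

select * from products
 where price > (select max(price) from discontinued_products);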

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

LargeHadron posted:

I have a SQL index that uses the INCLUDE clause. My layman's understanding is that this increases performance, as the columns in the INCLUDE clause are available for read in the index itself, removing the need for a lookup in the main table after finding the records I want. I have to reproduce this index in a SQLite database, but SQLite doesn't seem to support the INCLUDE clause. What are my options here?

Either just make an index with the indexed columns and deal with the probably pretty negligible performance impact (since if you cared that much you probably wouldn't be using SQLite), or add the included columns to the index in some order that makes sense for how they're used.
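The second option is just appending the included columns to the key (a sketch with hypothetical names; the original would have been something like ON t (a, b) INCLUDE (c, d)):
SQL code:
-- SQLite has no INCLUDE, so fold the covered columns into the key itself
create index ix_t_covering on t (a, b, c, d);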

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Yeah, you have to have two statements to update two tables, and if you're just doing an update, as opposed to update or insert depending on some logic, then you gain nothing by doing a merge (and in fact the Docs say you can negatively affect performance by doing a merge when just a straight update would do).
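So, something like this rather than a MERGE (a sketch; table and column names are hypothetical, and you'd wrap it in a transaction if the two tables need to stay consistent):
SQL code:
begin transaction;

update dbo.Orders
   set Status = 'Shipped'
 where OrderId = @OrderId;

update dbo.OrderAudit
   set ShippedAt = getdate()
 where OrderId = @OrderId;

commit;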

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Munkeymon posted:

Each one is about 25 lines, but a lot of it is nesting ANDs and ORs. Whole thing is a workaround for unreliable data, but the point here is that, even if it wasn't, I'm not sure how else to express that I want a result set that contains two rows per match, not one row of A.*, B.*
Just to clarify, you have one table that has duplicate records in it, but the exact definition of "duplicate" is complicated?

Depending on the exact way you write the conditions, maybe you could do something like:
SQL code:
Select A.*
  from SomeTable as A
 where exists (select 0 from SomeTable as B
                where A.whatever = B.whatever
                      /*your 25 lines of criteria*/
                      and A.record_id <> B.record_id /*or whatever method you use to only get duplicates instead of matching up a record to itself*/
                      )
But that depends on you being able to write your conditions such that it didn't matter whether a record was in A or B in your original query. Assuming you can do that, then this method does let you only write the criteria once.

If your duplicate criteria was a bit simpler (and your RDBMS supports window functions), you could also do something like:
SQL code:
Select *
  from (
Select A.*,
       count(*) over (partition by key_column_1, key_column_2 /*etc.*/) dup_ct
  from SomeTable as A
) as Duplicates
where dup_ct > 1

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
As much as I love window functions and other fancy stuff, I would think a simple subquery (or CTE if you like) with a group by that gets both min times would be the way to go.
SQL code:
select patient_id,
       min(case when lab_type = 'ekg' then lab_time end) as first_ekg_time,
       min(case when lab_type = 'mri' then lab_time end) as first_mri_time
  from lab_table
 where lab_type in ('ekg','mri')
group by patient_id
That said, if the same person is admitted multiple times, do they get a different patient_id? Or is the combined key for patient and admittance instance available in the lab table? If neither is true, then assuming you want to only show the first ekg and mri for that particular hospital visit, some sort of cross apply would be necessary (but even then I think you could just cross apply one subquery with a group by and two min(case when...) columns).
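Roughly this shape, if it comes to that (T-SQL sketch; patient_admissions and its columns are hypothetical):
SQL code:
select p.patient_id,
       p.admit_id,
       x.first_ekg_time,
       x.first_mri_time
  from patient_admissions p
       cross apply (select min(case when l.lab_type = 'ekg' then l.lab_time end) as first_ekg_time,
                           min(case when l.lab_type = 'mri' then l.lab_time end) as first_mri_time
                      from lab_table l
                     where l.patient_id = p.patient_id
                       and l.lab_type in ('ekg','mri')
                       and l.lab_time >= p.admit_time
                       and (p.discharge_time is null or l.lab_time < p.discharge_time)) x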

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Scaramouche posted:

[Recursive stuff]

PhantomOfTheCopier posted:

Something doesn't logically match the definition of a "row" here. I'm having a hard time seeing a situation where a 'thing' has both a parent and child defined within the row. Which is leading to the confusion that it's... backwards. c.parentsku=t.childsku, in the first recursive step,
finds all new rows having as a child the row from the initialization. Now one could construct supersets, but I'm fairly certain you intended to select the children of the current rows, which is c.childsku=bom.thingyid.
Yeah, you have the join backwards and you never actually use the data from the "child" row anyway.

Try something like this (assuming this is T-SQL):

SQL code:
;WITH CTE AS (
    SELECT DISTINCT
           ParentSKU as TopLevelSKU,
           CAST(null AS VARCHAR(30)) as ParentSKU,
           ParentSKU as ChildSKU,
           CAST(ParentSKU AS VARCHAR(255)) as thing,
           CAST('|' + ParentSKU AS VARCHAR(255)) as bomPath,
           0 as level
      FROM BOM
    UNION ALL
    SELECT P.TopLevelSKU,
           C.ParentSKU,
           C.ChildSKU,
           CAST(SPACE((P.level+1)*2) + '->' + C.ChildSKU AS VARCHAR(255)) as thing,
           CAST(P.bomPath + '|' + C.ChildSKU AS VARCHAR(255)) as bomPath,
           P.level+1 as level
      FROM CTE AS P
           INNER JOIN BOM AS C ON P.ChildSKU = C.ParentSKU
                              AND C.ChildSKU <> C.ParentSKU
                              AND P.bomPath not like '%|' + C.ParentSKU + '|%'
)
SELECT * FROM CTE WHERE TopLevelSKU = 'SIMPLEPRODUCT'
ORDER BY bomPath
Untested, but this should go multiple levels, and it goes in the right direction. Also, I think this will filter out all circular BOMs, not just ones where the parent and child are the same.

A few more notes:

If you're only ever returning the BOM for one top-level product (or a defined list of top-level products) at a time, you may want to put the TopLevelSKU='SIMPLEPRODUCT' filter condition in the base query. The DB is probably smart enough to push the parameter down without you explicitly telling it to, but if it's not, then you're generating the entire BOM for all products and then throwing most of it away.
Fake Edit: Oh, you've already done this.

If you're instead returning the BOM for all products (once you're done with your testing), you should also consider what the definition of "all" is. If you're returning the BOM for ProductA and ProductB, each of which contains Component1, do you also need to return the BOM where Component1 is considered the top-level product? You may need to come up with logic for how to determine what is a top-level product (since if you had a list you wouldn't be reading this paragraph). My suggestion would be something like putting the condition
SQL code:
where ParentSKU not in (select ChildSKU from BOM where ChildSKU<>ParentSKU)
in the base query.
E: Could that be what the ChildSKU=ParentSKU means, or are you sure that this is a data problem?

Fake Edit:

Scaramouche posted:

Sorry dude I was having problems myself posting but only to this topic weirdly enough. ZDR said that in the past some errant sql in this thread was triggering captchas, I wonder if that was happening again, but thanks for your insight.


So it's looking like partly this, but also because the data is kind of crap. I have the CTE working (thanks PhantomOfTheCopier) but it's giving me "vertical" results (e.g. all rows in 1,2 configuration) whereas what I'd prefer is a "horizontal" result (e.g. 1,2,3,4,5,6 to the length of all components with 1 being the first "parent") so it looks like I'm going to have to pivot it out. I also can only do one SKU at a time otherwise I hit recursion limit eventually, even at max setting (2^15). This is where it's at currently:
[snip]
If you want each row to build on the results from previous iterations, you have to actually use the results from previous iterations. Right now you're using the previous rows to determine what additional data to fetch, but then you don't actually use more than one level of the hierarchy at a time.

Jethro fucked around with this message at 19:41 on Jul 30, 2018

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Nth Doctor posted:

[snip]
E: So I've never worked with HiveQL but it looks like it supports windowing functions so having a column like:
SUM() OVER(PARTITION BY group) AS subtotal_column should work. The sum would end up being twice the actual subtotal since we're including the subtotal column in that sum as well but since this is only for ordering and not display it shouldn't be a huge problem unless you're bumping up against max values in your aggregates.
Yeah, if Hive supports windowing then you're done. You don't even need to add any columns (assuming Hive supports putting the window functions in the ORDER BY).
code:
ORDER BY CASE
            when supergroup = 'TOTAL' then 1
            else 2
        END,
        SUM(units_sold) OVER(PARTITION BY supergroup) DESC,
        supergroup,
        CASE
            when group = 'subtotal' then 1
            else 2
        END,
        units_sold DESC
So, use the CASE to put the TOTAL row at the top, sort by (twice) the supergroup subtotal to get the largest supergroups after that, sort by supergroup (to keep each supergroup together in case of ties), use the second CASE to put the subtotals at the top within each supergroup, and then sort by units_sold. You may also want to put group at the end of the ORDER BY if you care about having a particular consistent order in case of ties.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Pollyanna posted:

We have a Postgres table with roughly 10 million to 15 million rows, and some of the columns in that table contain PII. We want to blank out that PII in those columns and I tried to do so via an “update_all” in Rails equivalent to “update <table> set <the columns = fake data>”, but it seems to take a very very long time. Why would this happen? Is there a way to figure out why it would be so slow? Is the time it takes to update an entire column in a table just that long?
A few questions:
Are you testing what happens when you do the update query directly in the DB, or are you only testing through Rails? Are you sure Rails is doing what you are expecting and not something stupid?
Is there a lot of activity happening on that table while you're running the update? Are you sure there's nothing else blocking you?
Maybe my opinions are skewed from being old now, but 10 million rows isn't nothing (though it's certainly not Big Data). Is there some way you could break things up so that you can 1) maybe do things in a more parallel fashion and 2) not have a lot of resources devoted to one big ongoing uncommitted transaction? I don't have any experience with Postgres so I don't know how much issue 2 applies, nor do I know if there is some sort of hint or setting that could enable or force more parallelism.
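The usual low-tech way to break it up is chunking by key range, so each transaction stays small and separate ranges can run in parallel (a sketch; the id column, values, and batch size are all hypothetical):
SQL code:
-- run repeatedly, committing between batches and sliding the range each time
update the_table
   set pii_column_1 = 'REDACTED',
       pii_column_2 = NULL
 where id >= 1 and id < 100001;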

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Does the data that comes in each hour (potentially) contain multiple records per user, or is this just one record per user per hour? If it's just one record per user per hour, do all the new records that come in have the same timestamp, or are all the timestamps different?

Is user_hourly_summary truncated each hour before the data comes in, or does it have all the data that has ever come in? If all the data stays in there, is there a where clause that you didn't show us that prevents you from recalculating all the user activity that you calculated in the previous hours?

If you're seeing multiple timestamps per hour and you're already only calculating based on the new data that came in, then you're probably already as good as you're going to get.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Remora posted:

I suppose what I'm not getting is - and I haven't watched the videos yet, phoneposting before I leave for work - that's structured very similarly to a lot of examples I've seen, and I don't see why you wouldn't just SELECT FROM [Posts] WHERE [ParentThread]=whatever. The threads table doesn't look like it has anything you even want.
Well, in the case of generating a list of posts, you are correct that you might not need anything from the thread if you know the thread id, but there are plenty of times when you do need the information from multiple tables in each row. Or maybe you only have the thread title, not the id:
SQL code:
Select pu.name poster,
       pu.regDate,
       pu.avatar,
       pu.title,
       p.postedDate,
       p.postContents,
       IIf(p.editFlag = 'Y',CONCATENATE(IIf(IsNull(peu.name),'Somebody',peu.name),' hosed around with this message at ',DateFormat(p.editDate,'Mon D, YYYY'),' around ',DateFormat(p.editDate,'HH:MI')),null) editMessage
  From [Threads] as [t]
       Inner Join [Posts] as [p] on t.threadId = p.threadId
       Inner Join [Users] as [pu] on p.postedBy = pu.userId
       Left Outer Join [Users] as [peu] on p.editedBy = peu.userId
 Where [t].Title = 'SELECT * FROM Questions WHERE Type = ''Stupid'''
So, to get the information about the poster and who (if anyone) edited the post, you need to join to the Users table two times.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Paul MaudDib posted:

We seem to be getting much slower performance using an inner join than a theta join using Oracle. i.e.

"SELECT a.uid from TableA a, TableB b WHERE b.parentFk = a.uid AND b.someVal = '1234' order by a.uid asc"

runs much faster than:

SELECT a.uid from TableA a INNER JOIN TableB on b.parentFk = a.uid WHERE b.someVal = '1234' group by a.uid"

Is there some intuitive reason why theta joins should be faster? You would think it would be slower given that theta joins can include arbitrary comparisons...
First, as McGlockenshire said, check the plans to see what the db is doing.
My guess would be that the engine is giving more deference to your table order in the explicit join query than in the other one, and that is causing the difference. See what happens with
SQL code:
SELECT a.uid
  from TableB b
       INNER JOIN TableA a on b.parentFk = a.uid
 WHERE b.someVal = '1234'
group by a.uid
Finally, just a minor point, but those queries are both inner equijoins. One uses explicit (ANSI SQL) join syntax, and the other uses implicit join syntax, but they are otherwise the same join (except for maybe left table vs. right table). A theta join is a join on a comparison operator (=, <, ≤, >, or ≥) that isn't a natural join. An equijoin is just the special = case of a theta join. A theta join could potentially use comparisons other than = (and you could use any comparison you like with implicit or explicit join syntax), but no db is going to perform poorly just because you could have written < instead of =. You wrote = and the db can see that you wrote =, so it's going to be able to take that into account.

Jethro fucked around with this message at 14:36 on Oct 4, 2018

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Love Stole the Day posted:

(This is Postgres) I am using an array of foreign keys and creating a json object based on its corresponding rows to the reference table.

code:
		
    select jsonb_build_object('id', t2.id, 'name', t2.name, 'biography', t2.biography) c1 
    from (
      (
        select distinct unnest(m3.cast_ids) c_id
        from movies m3
        where m3.cast_ids is not null 
          and m3.id = 2                   <----------- issue
        order by c_id asc
      ) t1
      left join people p1 on p1.id = t1.c_id
    ) t2
The problem I'm trying to tackle at the moment is that I need to somehow add a condition check so that this returns not every row of the `people` table but rather just the relevant ones contained in that original `unnest(...) c1` list.

Whenever I try to do this the intuitive way, I run into endless "undefined value" problems due to context. I've been trying to untangle this for several days now and am having trouble getting it done. I wasn't able to find a way to make use of a `left join lateral` and I wasn't able to find a way to shoe-horn in the condition checks necessary anywhere. So I'm thinking I might need to just scrap all of this and start over with a new approach. I'm hoping someone here can point out my blind spot for me.

As you can see where I marked the issue, I am able to manually control which movie's cast data is returned, but I can't seem to figure out how to make it work generally so that each row will yield based on the given `movie` table's `cast_ids.` If I remove the line that I labeled with the above arrow, then for every movie I will get the entire `people` table as if everybody in the entire table was involved for each movie. So naturally I'm trying to figure out how to make this where clause to filter out the non-relevant things.

If all the data you are interested in is from the people table, why is that the outer table in the join? I guess in theory every t1.c_id should have a corresponding p1.id, so it shouldn't make that much of a difference, but it certainly looks funny. You filter out null cast_ids, but are you sure none of the entries in the array are null (I have never used Postgres or array columns, so I don't know if that's even a sensible question)?

If I want "records from one table where the id is in another table", I write a query using an IN

SQL code:
select jsonb_build_object('id', p1.id, 'name', p1.name, 'biography', p1.biography) c1 
  from people p1
 where p1.id in (
        select unnest(m3.cast_ids) c_id
        from movies m3
        where m3.cast_ids is not null 
          and m3.id = 2)
Or am I misunderstanding what you want?

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Protocol7 posted:

I have a coworker who is arguing in favor of the One True Lookup Table implementation, specifically for data-driven ListBox/combo box values, since in our systems the key-value for list/combo boxes is int-string (basically an enum).

I have argued that this is bad DB normalization, adds maintenance headaches and generally does not solve any problems with the DB, since the data tables utilizing the lookup table should have a good FK enforcing the referential integrity. Plus, any developer with a brain can find that relationship in the DB schema, so you don't save any time or effort.

He thinks the time saved in not having to create new tables in the future is worth it. I think it's a bad idea, but I'm not skilled enough of a SQL developer to know more reasons why. Is there actually a case for an OTLT here, or is my coworker just being stubborn and dumping OOP concepts into SQL?

I can think of two other benefits to OTLT:
1) If users can create their own lookup types, it's probably easier to use OTLT than to try to write code to do the DDL. (Though maybe I'm wrong and there are ORMs that can handle that sort of thing well. I don't do a lot of application development, so maybe I have an incorrect conception of the state of tools out there.)
2) If you have some sort of pre-fetch-and-cache of your lookups (or any other kind of process that takes your data from the DB and puts it somewhere else, like some kind of reporting ETL), a process can do that with new lookups as soon as they're added to the table instead of needing to edit the process or remember to add the new table to a list somewhere or make sure you use the correct naming convention to get it picked up automatically or something.

That said, how long does it take to create a new lookup table? You open up the ddl script for the last one you did, change the table name and maybe some column names if you're feeling ambitious, and hit execute. Is that 12 seconds you save not going to be given back instantly by the fact that you now need to manually document which lookups are used where? As you said, with separate tables and foreign keys, the relationships are documented by the db itself, so you don't have to go hunting for whatever other table or spreadsheet or word doc or whatever has the (almost certainly out of date) relationships in it.
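For reference, the per-lookup alternative really is about this much typing (a sketch, hypothetical names):
SQL code:
create table dbo.OrderStatus (
    OrderStatusId int not null primary key,
    Name varchar(50) not null
);

alter table dbo.Orders add constraint FK_Orders_OrderStatus
    foreign key (OrderStatusId) references dbo.OrderStatus (OrderStatusId);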

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Grump posted:

This is a good point. I'd probably be better off just adding a "step text" column or something like that. Thanks for that. I think I'm over-complicating this.

Scaramouche posted:

I realize this is arm-chair architecting so if it comes across as wrong-headed I apologize, but it sounds like you need a different class of "ingredients" that are more like "instructions"?
I would think the problem is that ingredients are not steps. Cocktails are the class of recipes where the correspondence is closest to 1-to-1, but it's still not exact. My suggestion would be to keep the ingredients with the unique many-to-many, but get rid of the action column. Then, have the steps be either a text field on the recipe header or be a detail table with a foreign key to the recipe header. And maybe make the steps optional since there are plenty of cocktails where all you need to know is what goes into them, but it's not all of them.
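The detail-table version would be something like this (a sketch; table and column names are hypothetical):
SQL code:
create table RecipeSteps (
    RecipeId    int not null references Recipes (RecipeId),
    StepNumber  int not null,
    Instruction varchar(4000) not null,
    primary key (RecipeId, StepNumber)
);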

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Like I said, I would remove "step" information from CocktailIngredients and put it in a separate table. An ingredient can be used in multiple steps, and a step can involve multiple ingredients or no ingredients at all. E.g., when you stir a Martini, you're not stirring any particular ingredients (maybe you could say you're stirring all of them).

E: F,b.

E2: If you just want something simple, on the cocktail itself, instead of the "Finish" column, I'd have a varchar(4000) "Instructions" column

Jethro fucked around with this message at 19:49 on May 28, 2019

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Another reason to use a (non-correlated) subquery instead of a join is when the join would return duplicate records.

For example: Return all customers that had an order in the last month:

SQL code:
select c.*
  from customers c
 where c.id in (select customer_id from orders where OrderDate > today()-30)
vs
SQL code:
select distinct c.*
  from customers c
        inner join orders o on c.id = o.customerId
 where o.orderDate > today()-30

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

TheFluff posted:

[snip]
e: I guess in many cases the id in(subquery) ends up with the same query plan as where exists huh, so it doesn't really matter does it
id in (select id from subquery) probably matches the intent of the query writer (see if this id from the main query is in the list of ids returned by the subquery) more often than exists (select 0 from subquery s where s.id = m.id) (see if the subquery returns any results, which we don't actually care about so we put garbage in the select, when we plug the id from the main query into the subquery). RDBMS creators figured out a while ago that the former often makes more intuitive sense than the latter, so they made the plans identical so, at least in this case, you don't have to pick performance over clarity.

That said, I use exists pretty much 100% of the time. Mostly because I got most of my SQL training from someone who did most of their learning when that could make a huge difference.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Yeah, don't complain about your query not returning rows with an elapsed time >= 45 hours when you explicitly write the query to only include rows where the time is < 45 hours. As written, the query will also exclude rows where there is no previous data, defeating the left outer join you presumably used for a reason. Also, why are you using a subquery in the where condition? Either the time stamps are all the same, in which case just use t2.stamp > (t1.stamp - interval '45 hours'), or they're not all the same and the subquery will return multiple values, which your DB will (should?) complain about.

If the previous results are too old, can the previous column be empty (as if there were no previous results)? If so, just put the condition in the join.
SQL code:
select
    t1.stamp as current,
    t2.stamp as previous,
    t1.username,
    t1.team,
    coalesce(t1.score - t2.score, 0) as ppd,
    coalesce(t1.wu - t2.wu, 0) as wupd
from
    public.user_temp_summary_current t1
    left join public.user_temp_summary_previous t2
        on t1.username = t2.username and t1.team = t2.team
           and t2.stamp > (t1.stamp - interval '45 hours')
Otherwise, yeah, you'll have to do the timestamp comparison in a case statement in the ppd and wupd columns and you'll have to "duplicate" your logic. But that makes sense because you just want the logic in those two columns. You don't want to affect the whole query (otherwise you would have put the condition in the join).
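For completeness, the column-level CASE version of the same query would be roughly (same tables as your original; untested):
SQL code:
select
    t1.stamp as current,
    t2.stamp as previous,
    t1.username,
    t1.team,
    case when t2.stamp > (t1.stamp - interval '45 hours')
         then coalesce(t1.score - t2.score, 0) else 0 end as ppd,
    case when t2.stamp > (t1.stamp - interval '45 hours')
         then coalesce(t1.wu - t2.wu, 0) else 0 end as wupd
from
    public.user_temp_summary_current t1
    left join public.user_temp_summary_previous t2
        on t1.username = t2.username and t1.team = t2.team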

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Paul MaudDib posted:

I have a table of data that I'm trying to extract occurrences of features out of. A data row can have one or more features in it. Features are represented by a features_table that contains a match_mask regex and a not_match_mask regex. As the names would suggest, a feature exists if the match_mask regex matches, and the not_match_mask either does not or is null.

This makes it a straightforward (for me) inner join between the data table and the feature table. The join criteria is where the regex matching happens.

The edge case where this falls down is when not_match_mask and match_mask are not completely disjoint - so let's say you have one feature "foo" with not_match_mask "foobar" and another feature that matches "foobar", and a data row can hypothetically contain "foo ... foobar". This should produce one feature foo and feature foobar for that data row.

That's not desirable of course but that's the reality of the feature I need to extract... "foo" is contextually different from "foobar" and there can potentially be both in a data row.

Handling this programmatically seems tricky so what I was thinking is to find rows where more than one match_mask fits and then handle them manually (hopefully there won't be more than a few dozen). So like, do the inner join, then GROUP BY back to the input rows, and then select rows having count > 1 ?

Am I completely barking up the wrong tree here? Is there some analytics thing that will help me out? I'm using Postgres 12.

My first suggestion would be rewrite your match_mask regexes to remove the need for the not_match_mask. So 'foo' and not 'foobar' becomes 'foo(?!bar)'.
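Postgres regexes do support look-ahead constraints, so you can sanity-check that idea directly (a quick sketch):
SQL code:
select 'foo plus foobar' ~ 'foo(?!bar)' as has_bare_foo,   -- true: the first foo isn't followed by bar
       'foobar only'     ~ 'foo(?!bar)' as has_bare_foo2;  -- false: every foo here is part of foobar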

Also, what is the structure of a data_row? Is it just one huge undifferentiated varchar column? Is the data at least delimited?

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

prom candy posted:

The reason I'm returning everything in one row is because that's how the data ends up getting displayed, and we also need to be able to order and paginate the results. Each row in the report shows the total value for the parent account, and then a breakdown by child.

It's entirely possible/extremely probable I'm doing everything all wrong. I've been working with SQL for like 15 years but 99% of that has just been having my hand-held by ORMs doing pretty basic stuff. This is definitely the most complicated report I've had to build. Potentially I could strip out the child fields and then change how the report is displayed so that someone needs to expand a row to see the child breakdown but I'm definitely going to get pushback on that as management wants it just-so.
If you're already dynamically creating the query, then this is a perfect opportunity to do a pivot table:
SQL code:
select p.parent_info1,
       p.parent_info2,
       p.parent_info3,
       p.<additional parent columns>,
       max(case when <filter condition for child 1> then c.stat_a end) child_1_stat_a,
       max(case when <filter condition for child 1> then c.stat_b end) child_1_stat_b,
       max(case when <filter condition for child 2> then c.stat_a end) child_2_stat_a,
       max(case when <filter condition for child 2> then c.stat_b end) child_2_stat_b,
       <repeat for each child>
  from parent_table p
       inner join (<child subquery>) c on c.parent_id = p.id
 where <whatever conditions>
group by p.parent_info1,
       p.parent_info2,
       p.parent_info3,
       p.<additional parent columns>
So, you do one join from the parent table to the child subquery, as if you were going to return one row per child, and then you use a group by and dynamically generated aggregate columns with case statements to squash it all into one row per parent.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Is (UserId,ProductId) unique? If so, why not just make that the composite key? If not, are you sure you're OK with the key being something that could change based on the phase of the moon?

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
I see two problems with your design.

1) As written in your post, your update trigger will set the Index for all rows with the given ProductId, not just the row for that UserId, ProductId combo. So if ProductId 1234 is Index 5 for UserId 123, the trigger will set Index to 5 for ProductId 1234 for all users.

2) Are you sure you want it to have the same ordering as ProductId instead of something like insert order? If UserId 123 has ProductId 1234 and 3456, do you really want the Index on the 3456 row to change when someone inserts ProductId 2345? You said this is the composite key in the app. No offense to your coworkers, but I don't imagine the people who made that design decision were ready to deal with the key for a record changing under them.

If you've got a new enough version of MSSQL, you might be able to hack something together with a UDF that gets values from a set of sequences.
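The sequence half of that idea looks roughly like this (a T-SQL sketch, one sequence per user; all names are hypothetical, and managing a sequence per user is its own headache):
SQL code:
create sequence dbo.Seq_ProductIndex_User123 start with 1 increment by 1;

insert into dbo.Products (UserId, [Index], Name)
values (123, next value for dbo.Seq_ProductIndex_User123, 'whatever');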

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Munkeymon posted:

I think I've miscommunicated something. Users → Products is one to many. One User gets as many entries in the Product table* as they want and ProductId is just int IDENTITY and PK so it is insert order. The Product table has an FK into users, so given a UserId and an Index, you should get exactly one Product row back, by the API design.
Oh, I thought ProductId was a FK to another table where it was an IDENTITY column. In that case, my "bug report" was spurious.

quote:

Thanks for pointing out that I was updating all Product rows for a User - that was not intentional. However, having had time to stress test it, I'm getting duplicate Indexes again, so (╯°□°)╯︵ ┻━┻ I have no flucking clue what to do anymore

*we only have one product we offer to end users, which is why I named it that - sorry for the confusion
I wouldn't think this could happen, given that triggers are part of the same transaction as the triggering event, but maybe the inserts and triggers are running slightly out of order when you get a whole bunch at once. Say Client A and Client B both send inserts for User 123 at almost the same time, and the SQL Server is being real granular with its locking (like is EF putting WITH ROWLOCK on the inserts?), so Insert A happens and gets ProductId 456, then Insert B happens and gets ProductId 457, then the trigger for B runs and sets the index for 457 to 5 (or whatever), then the trigger for A runs and sets the index for 456 to 5 (because it thinks it should have 5 and 457 should get 6).
I'm thinking this is the kind of race condition that you can really only solve with something purpose built like IDENTITY columns or sequences or the like.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Munkeymon posted:

That'd be cool. How do I do that? I don't see anything in the spec to make a trigger per-row instead of per statement, so I did what I had to do to make it work in the general case.
I'm pretty sure (though it's been a few years since I've done anything outside of Oracle), that you can't do a for each trigger in MS SQL. Unless you did something like create a cursor in the trig...Nope, can't finish that thought.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Munkeymon posted:

I'm leaning strongly towards making the application figure out the index and writing some automated tests to guarantee the historical data doesn't come out wrong at this point. At least that doesn't feel like building a layer cake of One Weird Tricks.
That's probably for the best, especially if it has only been a problem in your "way more activity than we've ever seen in prod" testing. Just toss a unique index on there and be done with it.
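I.e. (a sketch, hypothetical names):
SQL code:
create unique index UX_Products_UserId_Index
    on dbo.Products (UserId, [Index]);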

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Nth Doctor posted:

oh my god.


Yeah you want to pivot based on the month and year of table1.created_at

Unless I'm misunderstanding, first he wants to group by month and year and then pivot by year.

E: I misunderstood.

Jethro fucked around with this message at 20:53 on Feb 27, 2020

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

TheFluff posted:

NULL in SQL is quite controversial and has led to a lot of people being very mad at computers, and outer joins result in nulls, so it shouldn't come as a surprise really.


Well, yeah, obviously, there are use cases where an outer join is exactly the right tool for the job and I've explicitly agreed with that above. My entire point is that a lot of the common usage patterns for outer joins (and nullable columns) are footguns or at least footgun-enablers. SQL has no shortage of that sort of thing but nulls in general and things that lead to nulls is one of the most common.


I don't think anyone could? Like, when I wrote "friends don't let friends left join" I thought that was obviously hyperbolic enough to come off as a joke.
I think the thing is, you are saying "only use outer joins when they are the right tool for the job", and most people say "use outer joins unless they are the wrong tool". Like, sure nulls can be weird and sometimes an outer join won't give you what you want, but both are pretty fundamental to how SQL and relational databases work so you shouldn't be afraid to use them as long as you have a handle on the most common gotchas (like filtering on a column from an outer table).

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
I've never seen DISTINCT ON before, but a little research leads me to believe you are selecting the first row for each combination of reference_date and moneyness with no particular order. If you want to get one row per reference_date then only put reference_date in the DISTINCT ON clause. Then do your ORDER BY as you currently are.
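In other words, something shaped like this (a sketch; table and column names are guesses from your description):
SQL code:
-- one row per reference_date: the row with the latest quote_time wins
select distinct on (reference_date)
       reference_date, moneyness, price
  from option_quotes
 order by reference_date, quote_time desc;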
