Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


PhantomOfTheCopier posted:

Agreed, it's definitely doable. I worry about the strange cases; none have sprung to mind, but I haven't given it much thought.

I think the key to the design is that foreign keys have to notify the referring table, and the columns tell it which ones matter.

Thanks for putting up with the walls of text above. I wasn't as descriptive as I could have been initially.

I think what I’m missing is why included columns can’t count towards that requirement. Technically and hypothetically speaking.

Assume table X with columns a, b, c and unique index idxX(a). If idxX(a) also includes b, it is still unique. A change to X.b will still trigger a change to a row in idxX if b is included. Isn’t this sufficient to notify references, making a FK reference to X(a,b) technically viable?
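
To make the question concrete, here it is in PostgreSQL 11 syntax (a minimal sketch; the commented-out foreign key is the part being asked about):

code:
create table X (a int, b int, c int);
create unique index idxX on X (a) include (b);

-- works: (a) is exactly the key of a unique index
create table Y (a int references X (a));

-- refused with "there is no unique constraint matching given keys":
-- the included column doesn't count toward the referenced-column match
-- create table Z (a int, b int, foreign key (a, b) references X (a, b));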

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Ruggan posted:

I think what I’m missing is why included columns can’t count towards that requirement. Technically and hypothetically speaking.

Assume table X with columns a, b, c and unique index idxX(a). If idxX(a) also includes b, it is still unique. A change to X.b will still trigger a change to a row in idxX if b is included. Isn’t this sufficient to notify references, making a FK reference to X(a,b) technically viable?
Short answer: Because indexes are supposed to be faster than table scans. :buddy:

* Historical: "INCLUDE" is new(ish). There's every reason to believe the feature was added without regard to FK triggering, since that can be achieved in other, simpler ways.
* Performance: You shouldn't use INCLUDE anyway, except in cases where you have stats showing that it provides a benefit.
* Purpose: INCLUDE specifically addresses an optimization concern: fetching data using an index-only scan. There's no a priori reason it should be equivalent to an index that covers that field.
* Question: Will it even let you INDEX (a,b,c) INCLUDE (c)? (See the check below.)
* Design: It's likely that all index types supported by the database (i.e., not just btree) must provide the same "relation-to-relation on fields F" support. Depending on the implementation, INCLUDE may be nothing more than raw data, so it may not even have an equality operator. Converting lookups to "relation-to-relation on fields Fa, coupled with various possibly-wide fields scanned for various types of equality" requires more data structures in the index, and would blow performance out of the water.
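
A quick check suggests the answer to the "Question" bullet is no, at least in PostgreSQL 11 (X is the example table from earlier):

code:
create index on X (a, b, c) include (c);
-- ERROR:  included columns must not intersect with key columns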

Suppose you have your setup, idxX(a)+includes(b), and suppose you have two tables referencing it, Y(a) and Z(a,b). If a value of 'b' changes, how shall it determine which to update? If (a,b) are both "in the index", the tuple id in X could cause an indextuple id update, triggering a foreign tuple update in both Y and Z... but it's not needed in Y. And if only (a) is in there, then you get the current behavior. So now you need an extra field in your on-disk data structure to track different column purposes in your index... which is not impossible.
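
To sketch that in DDL (the Z constraint is the hypothetical part; no current engine accepts it):

code:
create table X (a int, b int, c int);
create unique index idxX on X (a) include (b);

create table Y (a int references X (a));  -- depends only on a
-- hypothetical: create table Z (a int, b int,
--     foreign key (a, b) references X (a, b));
-- an update to X.b would have to cascade to Z but not to Y, so the index
-- would need to track which of its columns matter to which referrer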

There are open source databases, so you can certainly go add this functionality. Perhaps the mailing lists have information on the reasons, if any, why they haven't done so in their respective implementations.

That's all the time I have right now.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


PhantomOfTheCopier posted:

Suppose you have your setup, idxX(a)+includes(b), and suppose you have two tables referencing it, Y(a) and Z(a,b). If a value of 'b' changes, how shall it determine which to update? If (a,b) are both "in the index", the tuple id in X could cause an indextuple id update, triggering a foreign tuple update in both Y and Z... but it's not needed in Y. And if only (a) is in there, then you get the current behavior. So now you need an extra field in your on-disk data structure to track different column purposes in your index... which is not impossible.

Thanks, this helps clarify it. I can see from this why requiring a full, exact index match for a FK is sensible. It gets even worse once you add more included columns and allow partial matches: idxX(a) + include(b,c) with FKs A(a), B(a,b), C(a,c), D(a,b,c). Gets ugly.

Thanks for the example.

NihilCredo
Jun 6, 2011

suppress your anger in every possible way:
that one thing will defame you more than many virtues will commend you

Am I the only one who thinks this is stupid? MSSQL:

code:
-- works just fine, even when x is never empty
select * from t order by (case when t.x = '' then 0 else 1 end)
code:
-- ERROR: A constant expression was encountered in the ORDER BY list, position 1
select * from t order by (case when '' = '' then 0 else 1 end)
If it's perfectly fine to sort by a function that happens to always return the same value, why should it be a problem to sort by a constant?

It's no different than filtering by WHERE 1=1, and most likely for the same reason (programmatically-generated code)

mik
Oct 16, 2003
oh
Here's a WHERE Type = 'Stupid' question:

Is there a functional/performance difference between these JOINs in terms of how I restrict Field2 on Table2? In practice they seem to have the same outcome, but I don't know which is 'proper'.

code:
SELECT 
	* 
FROM Table1 A

LEFT JOIN Table2 B 
	ON B.Field1 = A.Field1 
	AND B.Field2 = 'ABC'
and

code:
SELECT 
	* 
FROM Table1 A

LEFT JOIN Table2 B 
	ON B.Field1 = A.Field1 
	
WHERE B.Field2 = 'ABC'

Gatac
Apr 22, 2008

Fifty Cent's next biopic.

NihilCredo posted:

Am I the only one who thinks this is stupid? MSSQL:

code:
-- works just fine, even when x is never empty
select * from t order by (case when t.x = '' then 0 else 1 end)
code:
-- ERROR: A constant expression was encountered in the ORDER BY list, position 1
select * from t order by (case when '' = '' then 0 else 1 end)
If it's perfectly fine to sort by a function that happens to always return the same value, why should it be a problem to sort by a constant?

It's no different than filtering by WHERE 1=1, and most likely for the same reason (programmatically-generated code)

But you're not filtering, you're sorting. What do you expect the DBMS to do when the sorting expression evaluates to the same value for every row?

nielsm
Jun 1, 2009



mik posted:

Here's a WHERE Type = 'Stupid' question:

Is there a functional/performance difference between these JOINs in terms of how I restrict Field2 on Table2? In practice they seem to have the same outcome, but I don't know which is 'proper'.

code:
SELECT 
	* 
FROM Table1 A

LEFT JOIN Table2 B 
	ON B.Field1 = A.Field1 
	AND B.Field2 = 'ABC'
and

code:
SELECT 
	* 
FROM Table1 A

LEFT JOIN Table2 B 
	ON B.Field1 = A.Field1 
	
WHERE B.Field2 = 'ABC'

Yes, there is a difference when you make a left join. If you did an (inner) join they would be equivalent.

The first will match every row in A with all rows in B where Field1 matches and B.Field2 has the value. For a row in A whose Field1 matches no row in B that also passes the Field2 test, you still get a result row, just with no data from B (those columns come back NULL).
The second will match every row in A with all rows in B where Field1 matches. It will then filter out joined rows where B.Field2 does not have the value. This means rows from A with no matching rows from B are left out. (It effectively behaves as a plain inner join.)
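
A tiny self-contained illustration (PostgreSQL-style VALUES lists; the data is invented):

code:
with Table1 (Field1) as (values (1), (2)),
     Table2 (Field1, Field2) as (values (1, 'ABC'), (2, 'XYZ'))
select A.Field1, B.Field1, B.Field2
from Table1 A
left join Table2 B
    on B.Field1 = A.Field1
    and B.Field2 = 'ABC';
-- returns (1, 1, 'ABC') and (2, NULL, NULL)
-- moving the B.Field2 test into WHERE drops the second row entirely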

NihilCredo
Jun 6, 2011

suppress your anger in every possible way:
that one thing will defame you more than many virtues will commend you

Gatac posted:

But you're not filtering, you're sorting. What do you expect the DBMS to do when the sorting expression evaluates to the same value for every row?

The same thing it already does when I ask it to ORDER BY Foo and Foo happens to have the same value in every row: return the rows in whatever order it likes, since any order will satisfy my request.

I cannot think of a single scenario where I would want it to crash instead.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

NihilCredo posted:

The same thing it already does when I ask it to ORDER BY Foo and Foo happens to have the same value in every row: return the rows in whatever order it likes, since any order will satisfy my request.

Yeah, I see no reason why it wouldn't do as you ask rather than try to be clever. It can already recognise that the expression is a constant, so why not just silently disregard that item in the ORDER BY list and come up with the same query plan it would produce if that item wasn't in the list? (In this case, the same query plan as if there were no ORDER BY clause at all, since it's the only list item.)

Gatac
Apr 22, 2008

Fifty Cent's next biopic.
My perspective on this is just the other way around: I'd like the DBMS not to silently disregard stuff like that, but to error out so I can fix my query.

NihilCredo's first example, where an expression might just happen to evaluate to the same value for every row, is accepted as a sort criterion because the DBMS can't determine that it'll be the same value across all rows until it actually executes the query. The second example can only ever evaluate to a constant value of 0, so there's a rule to catch it before executing the query. I don't see this as the system trying to be clever; it's trying to save you from nonsense in your query. Apparently harmless nonsense in this case, granted, but I can see where the people who made it error out were coming from.

Should it be a configurable warning rather than an error? Maybe.

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


It's also a guard against dynamic/constructed SQL with empty values in it, which is generally a sign something's gone terribly wrong, and terminating rather than continuing is the best bet.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


I dunno. I don't think it is the engine's job to enforce coding style, just syntax. If someone hands the optimizer a query with dumb bullshit that is valid SQL, the engine should execute it.

I would expect sorting by a constant to have no impact on the output sort order.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



NihilCredo posted:

Am I the only one who thinks this is stupid? MSSQL:

code:
-- works just fine, even when x is never empty
select * from t order by (case when t.x = '' then 0 else 1 end)
code:
-- ERROR: A constant expression was encountered in the ORDER BY list, position 1
select * from t order by (case when '' = '' then 0 else 1 end)
If it's perfectly fine to sort by a function that happens to always return the same value, why should it be a problem to sort by a constant?

It's no different than filtering by WHERE 1=1, and most likely for the same reason (programmatically-generated code)

I found a line in a text copy of the ANSI standard from 1992 (because they make you pay hundreds of dollars for the current PDFs 🙄) that says

quote:

If a <sort specification> contains an <unsigned integer>, then the <unsigned integer> shall be greater than 0 and not greater than the degree of T.
so maybe the actual error is caused by the engine simplifying your CASE to 0?

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Munkeymon posted:

I found a line in a text copy of the ANSI standard from 1992 (because they make you pay hundreds of dollars for the current PDFs 🙄) that says

so maybe the actual error is caused by the engine simplifying your CASE to 0?

:stare:

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Nope, still fails if you do 1 ELSE 0

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



code:
order by (select case when '' = '' then 0 else 1 end)
Works just fine, though, which is interesting. I guess that makes it an expression rather than a constant?

Credit where credit's due.

NihilCredo
Jun 6, 2011

suppress your anger in every possible way:
that one thing will defame you more than many virtues will commend you

I've got a PostgreSQL 11 table that contains only a 1:N set of lookup codes of various types for a particular entity, like this:

code:
create table things
  ( thing_id uuid primary key
  , some_property text
  , ...
  );

create table thing_lookup_codes
  ( thing_id uuid references things(thing_id)
  , code_source text
  , code_type text
  , code text
  , primary key (thing_id, code_source, code_type)
  );

-- a lot of queries like this
select things.a_bunch_of_properties
from things
join thing_lookup_codes using (thing_id)
where code_source = @code_source and code_type  = @code_type and code = @code
The data distribution is as follows: "code_source" and "code_type" only have a handful of possible values each. "code" values are mostly unique, but occasionally codes of different sources/types will have the same value.

If I want to improve the performance of such queries, what would it be best to add?

- The most obvious: a hash index on (code_source, code_type, code)
- A hash index on (code)
- A hash index on (code) INCLUDE (code_source, code_type)
- Other option?

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


NihilCredo posted:

I've got a PostgreSQL 11 table that contains only a 1:N set of lookup codes of various types for a particular entity, like this:

code:
create table things
  ( thing_id uuid primary key
  , some_property text
  , ...
  );

create table thing_lookup_codes
  ( thing_id uuid references things(thing_id)
  , code_source text
  , code_type text
  , code text
  , primary key (thing_id, code_source, code_type)
  );

-- a lot of queries like this
select things.a_bunch_of_properties
from things
join thing_lookup_codes using (thing_id)
where code_source = @code_source and code_type  = @code_type and code = @code
The data distribution is as follows: "code_source" and "code_type" only have a handful of possible values each. "code" values are mostly unique, but occasionally codes of different sources/types will have the same value.

If I want to improve the performance of such queries, what would it be best to add?

- The most obvious: a hash index on (code_source, code_type, code)
- A hash index on (code)
- A hash index on (code) INCLUDE (code_source, code_type)
- Other option?

Admittedly I'm a T-SQL/SQL Server guy and not well versed in the finer points of the differences between it and PostgreSQL, but the text columns being part of a PK is a smell to me. I would normalize them out to their own table, especially if the code sources and code types are going to be a fairly limited set.

How does PostgreSQL do things like clustering? I saw from the documentation that defining the primary key creates its own btree index, but that appears to be somewhere SQL Server and PostgreSQL diverge.

Thirdly, I think uuids aren't a great fit for PKs compared to something like ints at the CPU level, if only because you can fit two entire ints into the ALU's inputs for comparison in a single instruction vs. 128 bits for a uuid, and it's even worse for string comparisons with the text columns.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

NihilCredo posted:

I've got a PostgreSQL 11 table that contains only a 1:N set of lookup codes of various types for a particular entity, like this:

code:
create table things
  ( thing_id uuid primary key
  , some_property text
  , ...
  );

create table thing_lookup_codes
  ( thing_id uuid references things(thing_id)
  , code_source text
  , code_type text
  , code text
  , primary key (thing_id, code_source, code_type)
  );

-- a lot of queries like this
select things.a_bunch_of_properties
from things
join thing_lookup_codes using (thing_id)
where code_source = @code_source and code_type  = @code_type and code = @code
The data distribution is as follows: "code_source" and "code_type" only have a handful of possible values each. "code" values are mostly unique, but occasionally codes of different sources/types will have the same value.

If I want to improve the performance of such queries, what would it be best to add?

- The most obvious: a hash index on (code_source, code_type, code)
- A hash index on (code)
- A hash index on (code) INCLUDE (code_source, code_type)
- Other option?
If nothing else, I did some checking and found that the old advice of "never use hash indexes" has been corrected. As of PG11, however, it's not clear that write performance has been optimized for that index type. You'll want to use the index that works best for all the activity, not just these queries.

I'll still point out, however, that btree indexes are more robust: anything other than direct equality (regular expressions, LIKE, ranges, inequalities) won't use a hash index.

The best way would be to create a (source,type,code) index, check the resulting size and performance improvement via Explain Analyze, and then compare it to other choices. It's possible a (source,type) index would be sufficient, so...
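
Something like this, to make the experiment concrete (the index name and literal values are made up; the schema is from the question):

code:
create index idx_codes_stc on thing_lookup_codes (code_source, code_type, code);

explain analyze
select things.a_bunch_of_properties
from things
join thing_lookup_codes using (thing_id)
where code_source = 'some_source' and code_type = 'some_type' and code = 'abc123';
-- then build the alternatives and compare plans, timings, and index sizes
-- (pg_relation_size('idx_codes_stc')) before settling on one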

I'd argue that your lookup-codes table is keyed in the wrong order. If instead "this table relates source/types to a set of IDs and their specific code", then the primary key would be source/type/id and the query planner could index scan on the first two fields. A standalone index on the id would help for thingy lookups.
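
Expressed as DDL, that reordering would look something like this (a sketch of the reordered table, not a migration):

code:
create table thing_lookup_codes
  ( code_source text
  , code_type   text
  , thing_id    uuid references things(thing_id)
  , code        text
  , primary key (code_source, code_type, thing_id)
  );
-- the PK's btree now leads with (source, type), so those lookups can index scan;
-- a standalone index covers lookups by thing:
create index on thing_lookup_codes (thing_id);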

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
Postgres question: we have a table in production that has sneaked up to about 132M rows without anybody really noticing. It's a fairly simple table - integer primary key, three integer foreign keys (one of them nullable), two booleans, one integer and a timestamp. Each of the foreign key columns has its own index.

Normally querying this table isn't a problem, but we're starting to notice that postgres sometimes does seq scans on it even though the query filters by the indexed foreign keys. For example, if one particular query I've been looking at uses a semi-join on one of the foreign key columns it'll do a seq scan, but if you take the list of matching IDs and do a WHERE foreign_id IN (long_ass_list_of_ids) it'll use the index in a bitmap index scan. When I look at the query plan for one of the problematic queries, the planner is overestimating rows by a factor of 90, and when I look at the table stats the n_distinct estimate for some of the foreign key columns is off by about a factor of 100 compared to the actual value. I've been doing some reading and I think I want to try raising the statistics target for some of these columns (currently using the default 100), but what's a reasonable value to pick for a table of this size? Is there a downside to just setting it to a large value like 1000?

We're currently on PG 9 but are working on migrating our way to 12.

Pardot
Jul 25, 2001




TheFluff posted:

Postgres question

The tradeoff in increasing the statistics target is that all query planning that uses it will be a little slower. It's probably okay, but be aware of it. However, it may just be that the statistics haven't been updated in a while, especially if you haven't done any autovacuum tuning yet. By default autovacuum is set to be pretty timid so it doesn't use too many system resources, but as a result it doesn't get around to actually working through the whole db, and that's when stats are updated. You can force an update with `vacuum (analyze, verbose) tablename;`. Give that a try first, and see if the queries use the indexes like you think they should, before messing around with the targets.
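
If the vacuum doesn't fix the estimates, the per-column target can be raised like so (the table and column names stand in for the real ones):

code:
alter table big_table alter column foreign_id set statistics 1000;
analyze big_table;  -- rebuilds the stats using the new target
-- sanity-check the estimate afterwards:
select n_distinct from pg_stats
where tablename = 'big_table' and attname = 'foreign_id';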

Chakan
Mar 30, 2011
I'm mostly self-taught, so apologies if this is simple. I have a query I've built, and I'm trying to figure out if there's a way to remove all entries tied to a field value that appears only once in the final result. I can do this easily enough in Excel with conditional formatting to highlight and remove those that appear only once, but I want to be able to scale it up if need be.

So if I wanted to remove records with unique 'name' field information and my data looked like this:
code:
color	name	age

red	sally	12
orange	sally	11
blue	fred	12
green	emily	11
blue	george	12
green	emily	12
Then I'd want the end result to look like this:
code:
color	name	age

red	sally	12
orange	sally	11
green	emily	11
green	emily	12
e: VV I don't think HAVING will work with how I've set it up, because I'm already grouping in a way that makes me think it'll give bunk data if I do that, but that's good to remember & I'll experiment. Thanks!

Sir Bobert Fishbone
Jan 16, 2006

Beebort

Chakan posted:

I'm mostly self-taught, so apologies if this is simple. I have a query I've built, and I'm trying to figure out if there's a way to remove all entries tied to a field value that appears only once in the final result. I can do this easily enough in Excel with conditional formatting to highlight and remove those that appear only once, but I want to be able to scale it up if need be.

So if I wanted to remove records with unique 'name' field information and my data looked like this:
code:
color	name	age

red	sally	12
orange	sally	11
blue	fred	12
green	emily	11
blue	george	12
green	emily	12
Then I'd want the end result to look like this:
code:
color	name	age

red	sally	12
orange	sally	11
green	emily	11
green	emily	12

Check this or similar:
https://www.w3schools.com/sql/sql_having.asp

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

Chakan posted:

e: VV I don't think HAVING will work with how I've set it up, because I'm already grouping in a way that makes me think it'll give bunk data if I do that, but that's good to remember & I'll experiment. Thanks!

it shouldn't matter what you're grouping by, you should be able to exclude names via subselect in your where clause using group by and having and then group the outer data by whatever
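
something like this, with the table name invented to match the sample data:

code:
-- names that appear only once are excluded up front; the outer query can
-- keep whatever grouping it already has
select color, name, age
from people  -- hypothetical table name
where name in
  ( select name
    from people
    group by name
    having count(*) > 1
  );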

abelwingnut
Dec 23, 2002


t-sql here, is it possible to have stored procedures that produce temp tables that are then joined within the scope of a grand stored procedure? like:

--

spOverall
- spProducesTemp1
- spProducesTemp2
- spProducesTemp3

SELECT blah FROM #Temp1 JOIN #Temp2 JOIN #Temp3

<result is update to a non-temp table>

END --spOverall

--

And yes, spProducesTemp2 could use the temp table from spProducesTemp1 to generate #Temp2.

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy
E: nm misunderstood question

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


abelwingnut posted:

t-sql here, is it possible to have stored procedures that produce temp tables that are then joined within the scope of a grand stored procedure? like:

--

spOverall
- spProducesTemp1
- spProducesTemp2
- spProducesTemp3

SELECT blah FROM #Temp1 JOIN #Temp2 JOIN #Temp3

<result is update to a non-temp table>

END --spOverall

--

And yes, spProducesTemp2 could use the temp table from spProducesTemp1 to generate #Temp2.

Yes, you can define a temp table to live in the global scope by using a double hash, a la ##globalTemp.


https://docs.microsoft.com/en-us/sql/relational-databases/tables/tables?view=sql-server-ver15

Scroll down to the 2nd entry under table types.
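
A minimal T-SQL sketch of the pattern (all object names invented for illustration):

code:
create procedure spProducesTemp1
as
begin
    select Id, Amount
    into ##Temp1            -- double hash: global scope, outlives this proc
    from dbo.SourceTable;
end;
go

create procedure spOverall
as
begin
    exec spProducesTemp1;

    update t
    set t.Total = g.Amount
    from dbo.TargetTable t
    join ##Temp1 g on g.Id = t.Id;

    drop table ##Temp1;
end;
One caveat: ##-tables are visible to every session, so two concurrent runs of spOverall would collide. Scope the names or use another sharing approach if that matters.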

redleader
Aug 18, 2005

Engage according to operational parameters
check out How to Share Data between Stored Procedures, by notable sql server greybeard erland sommarskog. you should check out his other articles too, just so you know what topics he's covered in enormous detail

abelwingnut
Dec 23, 2002


thanks, will check out all of this tonight or tomorrow. got a 4k-line single stored procedure that i should probably break down into specific jobs!

Chakan
Mar 30, 2011

kumba posted:

it shouldn't matter what you're grouping by, you should be able to exclude names via subselect in your where clause using group by and having and then group the outer data by whatever

Ah, thank you.

Xae
Jan 19, 2005

Anyone have experience tuning configs for Postgres?

We're moving to an Azure Postgres-as-a-service instance and the MS guys who are supposed to help us with this keep blowing us off.

Right now we're on a 16-core server, but the config sets max_parallel_workers to 8 and max_parallel_workers_per_gather to 2.

That seems to be why the server maxes out at 50% CPU usage under load.


In general the server is just unbearably slow. Non-concurrent index builds for large (50-100gb) tables are taking 6-8 hours.

Star War Sex Parrot
Oct 2, 2003

I'd at least start from a config that PGTune spits out.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
creating an index uses maintenance_work_mem, so if you have a big one to create and your default from the config is not very high (which is generally advisable), you can increase that setting for your session to give it more memory to sort in
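
in concrete terms, for the session doing the build (value and names illustrative):

code:
set maintenance_work_mem = '2GB';              -- session-local override
create index idx_big on big_table (some_col);  -- hypothetical names
reset maintenance_work_mem;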

Xae
Jan 19, 2005

Star War Sex Parrot posted:

I'd at least start from a config that PGTune spits out.

That is the plan.

The pg folks from MS are supposed to get back to us this week. Hopefully we get a technical person and not another drat salesman.

ChickenWing
Jul 22, 2010

:v:

Riddle me this, SQL fans:

I've got a view that joins 8 tables. 7 inners, one left. call these A-H. In production, this view returns 1.5m rows.

I've got a query that left joins table C on this view again, groups by C's ID, and does a bunch of aggregations (sums, mostly). The result set in prod is about 4500 rows.

We discovered that, when run on production data, the query takes about six and a half minutes and was timing out our app requests. However, in staging, it took a significantly saner amount of time (30s) with *almost* identical data (couple items added by our QA team). Bad times, so we start looking into it.

What we found was that there were two columns in table F that were unpopulated (null values all the way down) in production, but had data in staging. This seems to be the exclusive difference. We haven't added any data to prod to test this, but running the query with the removal of those columns dropped the query time back down to 30 seconds.

Why the gently caress is this happening? Explain plan is not helpful, and these columns are strictly data holders with no indices or weird data types or anything.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

ChickenWing posted:

Riddle me this, SQL fans:

I've got a view that joins 8 tables. 7 inners, one left. call these A-H. In production, this view returns 1.5m rows.

I've got a query that left joins table C on this view again, groups by C's ID, and does a bunch of aggregations (sums, mostly). The result set in prod is about 4500 rows.

We discovered that, when run on production data, the query takes about six and a half minutes and was timing out our app requests. However, in staging, it took a significantly saner amount of time (30s) with *almost* identical data (couple items added by our QA team). Bad times, so we start looking into it.

What we found was that there were two columns in table F that were unpopulated (null values all the way down) in production, but had data in staging. This seems to be the exclusive difference. We haven't added any data to prod to test this, but running the query with the removal of those columns dropped the query time back down to 30 seconds.

Why the gently caress is this happening? Explain plan is not helpful, and these columns are strictly data holders with no indices or weird data types or anything.

EXPLAIN ANALYZE

Xae
Jan 19, 2005

Xae posted:

That is the plan.

The pg folks from MS are supposed to get back to us this week. Hopefully we get a technical person and not another drat salesman.

We found our problem: min_parallel_table_scan_size .

The Azure default is messed up. It was set as if the unit were bytes, but the unit is 8kb blocks.

Meaning it wouldn't parallelize a query unless the plan estimated reading 8,388,608 8kb blocks, roughly 70gb.
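
For reference, stock postgres defaults this to 8MB (1,024 8kb blocks), so the fix is one setting (on a managed service it may have to go through the provider's server-parameter settings rather than ALTER SYSTEM):

code:
alter system set min_parallel_table_scan_size = '8MB';
select pg_reload_conf();  -- pick up the change without a restart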

redleader
Aug 18, 2005

Engage according to operational parameters

ChickenWing posted:

Riddle me this, SQL fans:

I've got a view that joins 8 tables. 7 inners, one left. call these A-H. In production, this view returns 1.5m rows.

I've got a query that left joins table C on this view again, groups by C's ID, and does a bunch of aggregations (sums, mostly). The result set in prod is about 4500 rows.

We discovered that, when run on production data, the query takes about six and a half minutes and was timing out our app requests. However, in staging, it took a significantly saner amount of time (30s) with *almost* identical data (couple items added by our QA team). Bad times, so we start looking into it.

What we found was that there were two columns in table F that were unpopulated (null values all the way down) in production, but had data in staging. This seems to be the exclusive difference. We haven't added any data to prod to test this, but running the query with the removal of those columns dropped the query time back down to 30 seconds.

Why the gently caress is this happening? Explain plan is not helpful, and these columns are strictly data holders with no indices or weird data types or anything.

which rdbms? were those columns used in the query and/or view, and if so, how? i assume the result set is equivalent between staging and prod

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


ChickenWing posted:

Riddle me this, SQL fans:

I've got a view that joins 8 tables. 7 inners, one left. call these A-H. In production, this view returns 1.5m rows.

I've got a query that left joins table C on this view again, groups by C's ID, and does a bunch of aggregations (sums, mostly). The result set in prod is about 4500 rows.

We discovered that, when run on production data, the query takes about six and a half minutes and was timing out our app requests. However, in staging, it took a significantly saner amount of time (30s) with *almost* identical data (couple items added by our QA team). Bad times, so we start looking into it.

What we found was that there were two columns in table F that were unpopulated (null values all the way down) in production, but had data in staging. This seems to be the exclusive difference. We haven't added any data to prod to test this, but running the query with the removal of those columns dropped the query time back down to 30 seconds.

Why the gently caress is this happening? Explain plan is not helpful, and these columns are strictly data holders with no indices or weird data types or anything.

Are your offending columns potentially large data types that, when non null, might be causing out-of-row data reads?

In MS SQL Server this includes varchar(max), nvarchar(max), varbinary(max), xml, text, ntext, image.

Could also be contributing to a row not fitting on a single page.

ChickenWing
Jul 22, 2010

:v:

Malcolm XML posted:

EXPLAIN ANALYZE

well that found the errant issue - a join with an expected cost of 9000 and an actual cost of about 7,000,000, with about 3 billion rows filtered out

redleader posted:

which rdbms? were those columns used in the query and/or view, and if so, how? i assume the result set is equivalent between staging and prod

postgres 11, columns are selected into the view, then used in calculations in the query output - the calculation is basically (col1 * (col2 - col3 + col4(!) + col5(!) + col6) / col7) where (!) indicates the fully-null columns

Ruggan posted:

Are your offending columns potentially large data types that, when non null, might be causing out-of-row data reads?

just NUMERICs

  • Reply