PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Olly the Otter posted:

I'd be astonished if it's actually running the subquery multiple times, with row deletion happening in between. Is there a query I could try that might confirm this behavior deterministically?

I tried EXPLAIN, but I'm not very good at reading the output. I can tell that it's basically saying the query's inefficient due to the bad primary key, but not sure what else is helpful here.

code:
mydb=> explain delete from tasks where tid in (select tid from tasks order by tid limit 1 for update) returning dsid;
                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Delete on tasks  (cost=43.94..82.48 rows=1 width=34)
   ->  Hash Semi Join  (cost=43.94..82.48 rows=1 width=34)
         Hash Cond: (tasks.tid = "ANY_subquery".tid)
         ->  Seq Scan on tasks  (cost=0.00..32.60 rows=2260 width=10)
         ->  Hash  (cost=43.92..43.92 rows=1 width=32)
               ->  Subquery Scan on "ANY_subquery"  (cost=43.90..43.92 rows=1 width=32)
                     ->  Limit  (cost=43.90..43.91 rows=1 width=10)
                           ->  LockRows  (cost=43.90..72.15 rows=2260 width=10)
                                 ->  Sort  (cost=43.90..49.55 rows=2260 width=10)
                                       Sort Key: tasks_1.tid
                                       ->  Seq Scan on tasks tasks_1  (cost=0.00..32.60 rows=2260 width=10)

I can read it and it doesn't match your example. There are 2260 rows here, not five, and it plans to delete 1 row, not all five. Note that it's locking all 2260 rows, which is probably not what you want. Meanwhile the subquery scan is left in a form that gives the query planner the freedom to make multiple passes.

Run EXPLAIN ANALYZE to see the actual number of loops executed. Put the subquery in a CTE to materialize it and keep the query planner from second-guessing what you want.
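Something along these lines (an untested sketch against your table) should do it; in older Postgres a CTE is an optimization fence, so the subselect gets materialized and evaluated exactly once (newer versions let you spell that out with AS MATERIALIZED):

code:
WITH doomed AS (
    SELECT tid FROM tasks ORDER BY tid LIMIT 1 FOR UPDATE
)
DELETE FROM tasks
WHERE tid IN (SELECT tid FROM doomed)
RETURNING dsid;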

You get different results in different cases because the optimizer uses updated statistics each time the query runs.


Olly the Otter
Jul 22, 2007
Ok, here's an EXPLAIN ANALYZE, in a case where it's definitely deleting 6 rows:

code:
tadpole=> explain analyze delete from tasks where tid in (select tid from tasks order by tid limit 1 for update) returning dsid;
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Delete on tasks  (cost=1.01..2.05 rows=1 width=34) (actual time=0.089..0.141 rows=6 loops=1)
   ->  Nested Loop Semi Join  (cost=1.01..2.05 rows=1 width=34) (actual time=0.076..0.112 rows=6 loops=1)
         Join Filter: (tasks.tid = "ANY_subquery".tid)
         ->  Seq Scan on tasks  (cost=0.00..1.00 rows=1 width=10) (actual time=0.020..0.023 rows=6 loops=1)
         ->  Subquery Scan on "ANY_subquery"  (cost=1.01..1.03 rows=1 width=32) (actual time=0.013..0.013 rows=1 loops=6)
               ->  Limit  (cost=1.01..1.02 rows=1 width=10) (actual time=0.011..0.011 rows=1 loops=6)
                     ->  LockRows  (cost=1.01..1.02 rows=1 width=10) (actual time=0.011..0.011 rows=1 loops=6)
                           ->  Sort  (cost=1.01..1.01 rows=1 width=10) (actual time=0.004..0.005 rows=4 loops=6)
                                 Sort Key: tasks_1.tid
                                 Sort Method: quicksort  Memory: 25kB
                                 ->  Seq Scan on tasks tasks_1  (cost=0.00..1.00 rows=1 width=10) (actual time=0.006..0.008 rows=6 loops=1)
 Planning time: 0.303 ms
 Execution time: 0.212 ms
(13 rows)

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Loops=6 all over the place.

Look at these two lines:
Seq Scan on tasks (cost=0.00..1.00 rows=1 width=10) (actual time=0.020..0.023 rows=6 loops=1)
-> Subquery Scan on "ANY_subquery" (cost=1.01..1.03 rows=1 width=32) (actual time=0.013..0.013 rows=1 loops=6)

Read bottom up. It did the sub-query scan six times and turned that into 6 rows.


The simplest, and probably slightly incorrect, explanation I can give: it's forced to do the sequential scan to fulfill the IN operator because there's no index, and the stored statistics on the table are such that the query planner runs multiple loops to fulfill that condition. For example, there may actually be six stored data pages for the table (or the database may decide to do some of the evaluation in parallel), so it breaks the conditional validation into multiple pieces to save time and memory, or to avoid locking multiple pages at the same time.

PhantomOfTheCopier fucked around with this message at 23:24 on Jun 27, 2019

Olly the Otter
Jul 22, 2007
Ok, that helps, thanks.

redleader
Aug 18, 2005

Engage according to operational parameters
That almost sounds like a bug, particularly when the same query gives different results based on... something.

McGlockenshire
Dec 16, 2005

GOLLOCKS!

Olly the Otter posted:

Edit2: The EXPLAIN seems to have changed -- the previous EXPLAIN was before the table had become corrupted, so here's one from this apparently corrupted table:

Have you ANALYZEd the table since the corruption?

Dominoes
Sep 20, 2007

Hey dudes. How do you run PgAdmin4 in Ubuntu? In win, it's an installer. The official instructions are old/not good/don't work with new versions of Ubuntu. There's an approach where you can install as a python wheel, but I'm getting permission or package not installed errors even after setting up the wheel (the script attempts to edit something in the protected /etc directory, but if you sudo it, you then lose access to the packages the wheel installed). This is oddly difficult / not user-friendly.

Dominoes fucked around with this message at 12:52 on Jul 14, 2019

SixPabst
Oct 24, 2006

Hi all. I have a date query question that I can't figure out this late in the day.

Say I have a table that looks like

code:
|   PK     |   CompanyId             |   StartDate  |   EndDate     |
---------------------------------------------------------------------------------
|   1      |   1                     |   1/1/2019   |   12/30/2019  |
|   2      |   1                     |   3/1/2019   |    2/29/2020  |
|   3      |   2                     |   2/1/2019   |   12/30/2019  |
|   4      |   2                     |   1/1/2020   |   12/30/2020  |
What I want to do is find all records for any company where start date and end date overlap or intersect. In this example, I'd get rows 1 and 2 back. This seems straightforward by joining the table to itself and doing some date comparison but I'm not getting the results I expect and feel like I'm missing something obvious.

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

SixPabst posted:

Hi all. I have a date query question that I can't figure out this late in the day.

Say I have a table that looks like

code:
|   PK     |   CompanyId             |   StartDate  |   EndDate     |
---------------------------------------------------------------------------------
|   1      |   1                     |   1/1/2019   |   12/30/2019  |
|   2      |   1                     |   3/1/2019   |    2/29/2020  |
|   3      |   2                     |   2/1/2019   |   12/30/2019  |
|   4      |   2                     |   1/1/2020   |   12/30/2020  |
What I want to do is find all records for any company where start date and end date overlap or intersect. In this example, I'd get rows 1 and 2 back. This seems straightforward by joining the table to itself and doing some date comparison but I'm not getting the results I expect and feel like I'm missing something obvious.

code:
select
	*
	from @mytable m1
	join @mytable m2 on m1.CompanyID = m2.CompanyID and m1.ID > m2.ID
	where (m1.StartDate between m2.StartDate and m2.EndDate)
	or (m2.StartDate between m1.StartDate and m1.EndDate)
with more than 4 records in the table you'll probably end up with duplicates from this i think? don't have time to work that out right now but here's a starting point
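the plain interval-overlap test might be cleaner anyway (two ranges overlap iff each one starts on or before the other's end) - untested sketch against the same hypothetical @mytable:

code:
select
	m1.*
	from @mytable m1
	join @mytable m2 on m1.CompanyID = m2.CompanyID and m1.ID <> m2.ID
	where m1.StartDate <= m2.EndDate
	and m2.StartDate <= m1.EndDate
that returns every row that overlaps another row for the same company (rows 1 and 2 in your example); slap a DISTINCT on it if a row can overlap more than one other row.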

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Yeah not entirely clear. If window A overlaps B, and B overlaps C, but C doesn't overlap A, what is the desired result?

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Our resident SQL performance guy shared this brain teaser in our internal forums this week:
(our database is multi-tenant)

Performance Guy posted:

In 5 seconds or less, identify the problem with this query:
code:
SELECT DISTINCT
  O.Tenant_ID,
  O.Order_Key
FROM dbo.Shipment AS S
JOIN dbo.Shipment_Item AS SI
  ON SI.Tenant_ID = S.Tenant_ID 
  AND SI.Shipment_Key = S.Shipment_Key
JOIN dbo.Order_Item AS O
  ON O.Tenant_ID = SI.Tenant_ID
  AND O.Order_Item_Key = SI.Order_Item_Key
WHERE S.Tenant_ID = @Tenant_ID
  AND (@Shipment_Method_Key IS NOT NULL AND @Shipment_Method_Key = S.Shipment_Method_Key) OR (@Shipment_Key IS NOT NULL AND S.Shipment_Key = @Shipment_Key)

The problem being that the OR in the WHERE clause isn't properly wrapped in parentheses, so the WHERE clause is logically:
code:
WHERE
  (
    S.Tenant_ID = @Tenant_ID
    AND (@Shipment_Method_Key IS NOT NULL AND @Shipment_Method_Key = S.Shipment_Method_Key)
  )
  OR (@Shipment_Key IS NOT NULL AND S.Shipment_Key = @Shipment_Key)

When it SHOULD be:
code:
WHERE
  S.Tenant_ID = @Tenant_ID
  AND (
    (@Shipment_Method_Key IS NOT NULL AND @Shipment_Method_Key = S.Shipment_Method_Key)
    OR (@Shipment_Key IS NOT NULL AND S.Shipment_Key = @Shipment_Key)
  )
so that the data returned is all from the passed in tenant instead of from unrelated other customers of ours.

After half a dozen quick and correct analyses by engineers like me, a customer support guy chimes in:

Not a developer guy posted:

Well, to begin with there is no structure to the query. I was taught to focus on structure so that the query is easily readable. Even if your code ran 500 times faster but no structure, you failed.



I would have to give it structure before commenting.



This is an example of one I wrote.

code:
SELECT  distinct
   c.item_number,
   c.revision,
   e.Creation_Step_Codename,
   d.Price_Component,
   max(a.add_date)

from   Item_Price_Component a inner join Item_Price b
         on a.Item_Price = b.Item_Price
       Inner join Item_List c
        on b.Item_Key = c.Item_Key
       inner join Supplier_Item_Price_Component d
        on  a.Supplier_Item_Price_Component_Key = d.Supplier_Item_Price_Component_Key
       inner join Item_Creation_Step f
        on  f.Item_Key  = c.item_key
         and f.Item_Creation_Step_Key = b.Item_Creation_Step_Key       
       inner join Creation_Step e
          on  e.Creation_Step_Key = f.creation_step_key
         
           
         


where  e.material = '1' and b.active = 'True'

Group by c.item_number,c.revision,e.Creation_Step_Codename,d.Price_Component

:fuckoff: with that poo poo, man. The only changes I made were to rename the tables and columns. All formatting, capitalization, and aliasing are as they appeared.

I want to light this dude's keyboard on fire.

Another high up engineer and performance guy had a very diplomatic response:

Performance Guy 2 posted:

Not a developer guy,

Other than the poorly formed OR where condition that was the point of his example, Performance Guy's code correctly follows our SQL Standards and is laid out logically and consistently.



You may wish to review this document if you want to be able to write and format SQL in a way acceptable to our developers: <link to the SQL standards guide>

Tibalt
May 14, 2017

What, drawn, and talk of peace! I hate the word, As I hate hell, all Montagues, and thee

Can someone explain the difference between SSMS, SSDT, SSRS, and SSIS to me like I'm an idiot.

Because I thought I understood the difference but either I don't or my director doesn't, and I'm trying not to look like I'm an rear end in a top hat if I'm the one being the dumbass.

Joda
Apr 24, 2010

When I'm off, I just like to really let go and have fun, y'know?

Fun Shoe
500 times faster? I'd make the most unreadable, unmaintainable piece of poo poo you'd ever seen, involving every bad practice you could think of and multiple fast_sqrt magic-number equivalents, if I could make a central, heavy, and often-used query 500 times faster.

Shy
Mar 20, 2010

Tibalt posted:

Can someone explain the difference between SSMS, SSDT, SSRS, and SSIS to me like I'm an idiot.

Because I thought I understood the difference but either I don't or my director doesn't, and I'm trying not to look like I'm an rear end in a top hat if I'm the one being the dumbass.

SSMS is a gui for the database. SSDT is a set of db development tools that integrate into VS, SSRS is tools for reports, SSIS is programmability/ETL. Which part is confusing? Just google any of those terms, MS docs explains it well enough.

Shy fucked around with this message at 18:42 on Jul 19, 2019

Tibalt
May 14, 2017

What, drawn, and talk of peace! I hate the word, As I hate hell, all Montagues, and thee

Shy posted:

SSMS is a gui for the database. SSDT is a set of db development tools that integrate into VS, SSRS is tools for reports, SSIS is programmability/ETL. Which part is confusing? Just google any of those terms, MS docs explains it well enough.
Well, here's my understanding: SSDT slots into Visual Studio, and SSRS/SSIS slot into SSDT. SSMS is a separate thing, right? You wouldn't use SSMS to create an SSRS project, right? If I'm supposed to be using SSRS/SSIS, I need SSDT and VS installed on my computer, right?

The reason for my frustration is that I've been hired onto this new startup, and things are a bit wacky. I'm being told that I have everything I need, but I only have SSMS installed. Sure, I can write scripts and use the Wizards to get by, but I need SSDT, right? Before I start pushing back on this, I really want to make sure I'm not making an rear end of myself.

Shy
Mar 20, 2010

Tibalt posted:

Well, here's my understanding: SSDT slots into Visual Studio, and SSRS/SSIS slot into SSDT. SSMS is a separate thing, right? You wouldn't use SSMS to create an SSRS project, right? If I'm supposed to be using SSRS/SSIS, I need SSDT and VS installed on my computer, right?

The reason for my frustration is that I've been hired onto this new startup, and things are a bit wacky. I'm being told that I have everything I need, but I only have SSMS installed. Sure, I can write scripts and use the Wizards to get by, but I need SSDT, right? Before I start pushing back on this, I really want to make sure I'm not making an rear end of myself.

Yes you need VS and SSDT (data storage and processing workload in the VS installer) for development, see https://docs.microsoft.com/en-us/sql/integration-services/integration-services-ssis-development-and-management-tools?view=sql-server-2017
SSMS is a management tool, if you're developing you'll need VS

Tibalt
May 14, 2017

What, drawn, and talk of peace! I hate the word, As I hate hell, all Montagues, and thee

Shy posted:

Yes you need VS and SSDT (data storage and processing workload in the VS installer) for development, see https://docs.microsoft.com/en-us/sql/integration-services/integration-services-ssis-development-and-management-tools?view=sql-server-2017
SSMS is a management tool, if you're developing you'll need VS
Update: My director thought SSMS and Visual Studio were the same thing, and requested only SSMS for me. When he was telling me I had everything I needed, he was referring to the completed ticket to install SSMS, I don't think he read my request for VS. Thank you for assuring me I wasn't crazy.

ModeSix
Mar 14, 2009

I am trying to design a database schema and I have a question about the best way to do something.

I want to have a ratings database that will have a few thousand items that can be rated by each user. Each user can only rate an item once. I need to be able to get an average rating for each item. New items will be periodically added to the database which can then be rated.

Which of the following scenarios will work best?

1) Have a separate table of ratings for each item -> store the userid that rated the item and the rating they gave.
ex: item123rating -> userid, rating
item1234rating -> userid, rating

or

2) Have the rating tied to the user -> a field for each item attached to the user table.
user123 -> item123rating, rating
user123 -> item1234rating, rating

or

3) ????? a better way that someone with more experience can recommend

I seem to think the first method is the best way to do it, but I am no database expert.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
You need a table with 2 foreign keys - one to the user table and one to the "rateable items" table. This table will have either 3 or 4 columns: the 2 foreign keys, the rating given by the user, and optionally an artificial primary key. (If you do not use an artificial primary key, then the primary key is a composite key consisting of the two foreign keys in one order or the other.)

Your first idea is incorrect because it proposes having "a table for each item" where the collection of items can be added to as part of ordinary operation of the software. That's always the wrong solution - categorically.

Your second idea similarly is incorrect because it's also wrong to have "a column for each item" where the collection of items can be added to as part of ordinary operation of the software.

Both of them are wrong not only conceptually, but also more practically because your RDBMS isn't designed to work like that, but it is designed to work with foreign key relationships and more-or-less normalised database table designs. Doing it the right way will also mean that you can more easily write code that does the very simple task you wanted to achieve - getting the average for each item.
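A sketch of the table, with hypothetical names for your user and item tables - note that the composite primary key is what enforces "each user rates an item at most once":

code:
CREATE TABLE UserRating (
    UserID INT NOT NULL REFERENCES Users (UserID),
    ItemID INT NOT NULL REFERENCES Items (ItemID),
    Rating INT NOT NULL,
    PRIMARY KEY (UserID, ItemID) -- one rating per user per item
);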

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
Assuming the table is UserRating (UserID, ItemID, Rating)

I want to find the average rating for item 123:

SELECT AVG(Rating) FROM UserRating WHERE ItemID = 123

I want to find the average rating for each item that has at least one rating:

SELECT ItemID, AVG(Rating) FROM UserRating GROUP BY ItemID

ModeSix
Mar 14, 2009

Hammerite posted:

You need a table with 2 foreign keys - one to the user table and one to the "rateable items" table. This table will have either 3 or 4 columns: the 2 foreign keys, the rating given by the user, and optionally an artificial primary key. (If you do not use an artificial primary key, then the primary key is a composite key consisting of the two foreign keys in one order or the other.)

Your first idea is incorrect because it proposes having "a table for each item" where the collection of items can be added to as part of ordinary operation of the software. That's always the wrong solution - categorically.

Your second idea similarly is incorrect because it's also wrong to have "a column for each item" where the collection of items can be added to as part of ordinary operation of the software.

Both of them are wrong not only conceptually, but also more practically because your RDBMS isn't designed to work like that, but it is designed to work with foreign key relationships and more-or-less normalised database table designs. Doing it the right way will also mean that you can more easily write code that does the very simple task you wanted to achieve - getting the average for each item.


Hammerite posted:

Assuming the table is UserRating (UserID, ItemID, Rating)

I want to find the average rating for item 123:

SELECT AVG(Rating) FROM UserRating WHERE ItemID = 123

I want to find the average rating for each item that has at least one rating:

SELECT ItemID, AVG(Rating) FROM UserRating GROUP BY ItemID

Thank you very much. This makes complete sense to me.

Pollyanna
Mar 5, 2005

Milk's on them.


Today I brought down our site for 30 minutes by running a count on a single table, and what I thought killed the query actually didn’t. How the hell would a count on a single table gently caress things up that badly?

Furia
Jul 26, 2015

Grimey Drawer
No idea. Never worked with DBs connected to websites but I would begin with: How big was the table? How did you run the count? How integral is the table to the website? How active was it at time of querying? Etc etc

Pollyanna
Mar 5, 2005

Milk's on them.


Table’s apparently pretty big, several gigs in size. Flat out select count(*) from table query. The table is pretty important to only half the site, most of the site doesn’t use it, but to the ones that do it’s integral. It gets queried whenever a fairly common page is loaded.

:cripes: I hate my life.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Also depends on the default read isolation level*. See if you can give it a hint that you're OK with an inaccurate count. That'd be with(nolock) in TSQL but IDK how it'd look elsewhere.

*might be transaction isolation level?

Pollyanna
Mar 5, 2005

Milk's on them.


I’m not even gonna bother. This DB and app is a train wreck as-is without me bringing it down.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Yeah most likely you read locked the table. At least in MS SQL, reads usually issue shared locks which allow other reads but block writes. It’s to prevent you from reading data in an inconsistent state.

If your table is huge and the site relies on inserting data into it or updating data in it, your long running query probably blocked that. Isolation levels or nolock hints are the way to avoid that with various associated cons.
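In MS SQL those would look something like this (BigTable is a stand-in for your table; the tradeoff is that the count can be wrong while writes are in flight):

code:
-- per-query hint:
SELECT COUNT(*) FROM BigTable WITH (NOLOCK);

-- or for the whole session:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM BigTable;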

nem
Jan 4, 2003

panel.dev
apnscp: cPanel evolved
Bad query created a table-level lock, and dependent queries swamped the active connections as it paged in-memory tables to disk. Happened to me way back when. If you're on MySQL, set a threshold for the maximum join size, e.g. "[mysqld] => max_join_size = 5000000", or however egregious you want your indecency :whip:

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
It very, very much depends on the database you're using. In any case, please use the information schema for row estimates (with the usual caveats).

Edit: Indeed anything higher than read committed is not typical, so while transaction isolation could cause it, if your app is changing the level you have... bigger concerns, let us say.

Edit: A couple of gigs of data is nothing. That query should run without issue until you get to terabytes.

... which means I have a guess of what you're using, and that it's not tuned for cache and worker memory. This would still require some interesting interplay with other traffic, especially for a 30min outage. What's the nominal TPS otherwise?

PhantomOfTheCopier fucked around with this message at 22:08 on Jul 23, 2019

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Oh boy, transaction isolation levels. Good stuff.

If you're on SQL Server, there's a fast way to get a rough rowcount:
code:
USE Your_Database_Here;

SELECT
  I.[rows]
FROM sys.sysindexes AS I
WHERE
  I.id = OBJECT_ID('Your_Table_Name_Here')
  AND I.indid < 2; -- We're aiming for the PK index
YMMV in other servers.
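The newer catalog view version would be roughly this (same caveat: it's an approximation, not a transactionally consistent count):

code:
SELECT
  SUM(PS.row_count)
FROM sys.dm_db_partition_stats AS PS
WHERE
  PS.[object_id] = OBJECT_ID('Your_Table_Name_Here')
  AND PS.index_id < 2; -- heap or clustered index only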

Pollyanna
Mar 5, 2005

Milk's on them.


The data we’re trying to get doesn’t even belong in our database so it can go to hell as far as I’m concerned. Not even my half of the application. I try to debug it and it spits in my face. gently caress it.

Macichne Leainig
Jul 26, 2012

by VG

Pollyanna posted:

Table’s apparently pretty big, several gigs in size. Flat out select count(*) from table query. The table is pretty important to only half the site, most of the site doesn’t use it, but to the ones that do it’s integral. It gets queried whenever a fairly common page is loaded.

:cripes: I hate my life.

I don't know if this works outside of MSSQL but you might be able to do a select count(1) from table, that way it doesn't have to bother retrieving all of the data just to count rows. If I remember right it should just return a '1' for each row in the table.

Been a hot minute since I've touched a real SQL database, so I'm probably wrong. Either way the with (nolock) hints as suggested before should at least stop the query from deadlocking the table.

Furia
Jul 26, 2015

Grimey Drawer

Protocol7 posted:

I don't know if this works outside of MSSQL but you might be able to do a select count(1) from table, that way it doesn't have to bother retrieving all of the data just to count rows. If I remember right it should just return a '1' for each row in the table.

Been a hot minute since I've touched a real SQL database, so I'm probably wrong. Either way the with (nolock) hints as suggested before should at least stop the query from deadlocking the table.

This was my thought as well. It sounds right but I can’t quite commit to it for some reason

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
It depends on the database. Some actually select *, some prefer select(id), etc.

redleader
Aug 18, 2005

Engage according to operational parameters
What RDBMS?

Joda
Apr 24, 2010

When I'm off, I just like to really let go and have fun, y'know?

Fun Shoe
How come binary database interfaces aren't more common with ORMs? It seems a waste to communicate using massive strings if you, for instance, just want to select a tonne of data by IDs.

Tibalt
May 14, 2017

What, drawn, and talk of peace! I hate the word, As I hate hell, all Montagues, and thee

I've inherited a script that I'm updating for the current year, and it takes forever, so I'm trying to optimize it. I've got three questions that I'm not quite sure about.

Would using UNION ALL with two SELECT DISTINCT queries provide a benefit over UNION, if I can assume the two select queries aren't going to have any duplicate records?

I've read that using INNER JOINs is more efficient. Does that rule still apply to UPDATE queries, like this?
code:
 update a
 set
	a.Flag = '1'
from
	TABLE_1 a,
	(select distinct ID from TABLE_1 where Other_Flag = 'Y') b
where
	a.ID = b.ID
versus
code:
 update a
 set
	a.Flag = '1'
from
	TABLE_1 a
INNER JOIN
	(select distinct ID from TABLE_1 where Other_Flag = 'Y') b
ON
	a.ID = b.ID
Would getting rid of the DISTINCT in the SELECT subquery above help performance?

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Those should plan to the same anyway. What could be faster depending on the size of the data would be using an IN clause:

code:
UPDATE
  table
SET
  flag=1
WHERE
  id IN (
  SELECT
    id
  FROM
    table
  WHERE
    otherflag='Y'
)
If id is the primary key then this is total overkill anyway, and could just be
code:
UPDATE
  table
SET
  flag=1
WHERE
  otherflag='Y'
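Or an EXISTS, which sidesteps the DISTINCT entirely since it only has to find one matching row per id - untested sketch in the same shape:

code:
UPDATE
  table
SET
  flag=1
WHERE
  EXISTS (
  SELECT
    1
  FROM
    table t2
  WHERE
    t2.id = table.id
    AND t2.otherflag='Y'
)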

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


The answer to all your questions is “look at the query plan”. But I’ll try to address them.

Tibalt posted:

Would using UNION ALL with two SELECT DISTINCT queries provide a benefit over UNION, if I can assume the two select queries aren't going to have any duplicate records?

UNION ALL is usually faster than UNION, since it skips the duplicate-elimination step (though maybe not here, given the potential transmission of more records). I don't know if your two DISTINCTs will be faster or slower than a single UNION. It's gonna depend on your data and schema.

Tibalt posted:

I've read that using INNER JOINs is more efficient. Does that rule still apply to UPDATE queries, like this?
code:
 update a
 set
	a.Flag = '1'
from
	TABLE_1 a,
	(select distinct ID from TABLE_1 where Other_Flag = 'Y') b
where
	a.ID = b.ID
versus
code:
 update a
 set
	a.Flag = '1'
from
	TABLE_1 a
INNER JOIN
	(select distinct ID from TABLE_1 where Other_Flag = 'Y') b
ON
	a.ID = b.ID
Would getting rid of the DISTINCT in the SELECT subquery above help performance?

INNER JOINs aren’t blanket more efficient. Also the comma operator is functionally equivalent to an inner join.

Hard to say if getting rid of the DISTINCT will improve performance - gotta look at that plan.


Tibalt
May 14, 2017

What, drawn, and talk of peace! I hate the word, As I hate hell, all Montagues, and thee

Unfortunately, no, not a PK. It's more like an invoice number - I want to flag the whole invoice if one of the items meets the criteria.

What size are we talking about for the IN clause? I'm looking at just over a million records here, with a couple hundred thousand ID numbers.

Edit: Thanks for the reply Ruggan as well. I'll take a look at the query plan as well.

Tibalt fucked around with this message at 20:27 on Jul 30, 2019
