|
PhantomOfTheCopier posted:Contrary to that, since outer joins are so special and inner joins are routine, I want outers to look different and will continue to write inners as straightforward queries with all the tables in one place, all the field equalities together, all the restrictions together, and let my query planner do its job. "... and let your query planner do its job"? I don't follow what the significance is meant to be of this flourish at the end of your remark. The query planner will "do its job" equally well regardless of whether join conditions appear in one place or another. The difference is one of readability for a human, and it is for reasons of readability that join conditions belong in the ON clause, as I said. The reason why filtering conditions belong in the WHERE clause, unaccompanied by join conditions, is that this draws the reader's attention to them. They are much more likely to define what the query does. The join conditions are boilerplate - necessary boilerplate, but boilerplate all the same. That is why they should be filed away in the ON clause, where they can be checked if necessary but don't distract from the important filtering business taking place in the WHERE clause.
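A minimal sketch of the style being argued for, with invented table and column names:
code:
```sql
-- Hypothetical tables for illustration only.
SELECT o.order_id, o.total
FROM orders AS o
INNER JOIN customers AS c
    ON c.customer_id = o.customer_id   -- join "boilerplate" filed away in ON
WHERE c.country = 'SE'                 -- the filtering that defines the query
  AND o.total > 100;
```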
|
# ? Dec 2, 2018 01:56 |
|
|
|
kumba posted:There can be a difference, yes. See: Thanks for the links, and the other helpful information.
|
# ? Dec 2, 2018 03:49 |
|
I think you’re both expressing the same thing - that SQL is (mostly) declarative - and that you should generally write queries in the most human-parseable format, as the optimizer is usually pretty good at determining an efficient control flow. Just a quick reminder, though, that a declarative language is only as efficient as its implementation, which, while great for SQL, isn't perfect. Especially for complex queries, syntax and the order of joins can make a difference. Here’s a really good article on the subject: https://explainextended.com/2009/07/12/double-thinking-in-sql/ Back to the original topic, I disagree that all filtering conditions belong in the where clause. Even in a query fully composed of inner joins, I often find the query easier to read when certain filter criteria on table B (in an A inner join B scenario) are written as part of the join condition. This makes it clear exactly which rows from table B are being considered for joining, rather than needing to jump down to the where clause. But I acknowledge that this is all personal opinion.
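For what it's worth, the style described here looks something like this (table and column names invented for illustration):
code:
```sql
-- Restricting B's rows as part of the join itself.
SELECT a.id, b.amount
FROM accounts AS a
INNER JOIN payments AS b
    ON b.account_id = a.id
   AND b.status = 'settled'   -- filter on B kept next to the join it narrows
WHERE a.region = 'EU';        -- filters on A stay in the WHERE clause
```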
|
# ? Dec 2, 2018 04:03 |
|
If we’re going to be selecting from a particular table by a range of the latest date to an earlier date e.g. 2 weeks prior, 8 weeks prior, quarter to date, etc. then the index I want is on the timestamp descending, right?
|
# ? Dec 4, 2018 16:19 |
|
Pollyanna posted:If we’re going to be selecting from a particular table by a range of the latest date to an earlier date e.g. 2 weeks prior, 8 weeks prior, quarter to date, etc. then the index I want is on the timestamp descending, right? It probably doesn't matter from a querying perspective if you do ASC or DESC, but it may be a hassle from a page management perspective on insert where every new row goes to the first entry of the index rather than the last. I'm not explaining it super well but I'd think your b trees would thank you for writing consistently to the tail of the index rather than the head.
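Either declaration typically serves the range query, since B-tree indexes can be scanned in both directions; a sketch (T-SQL flavored, names invented):
code:
```sql
-- Either of these supports "latest N weeks" range predicates;
-- the optimizer can walk the B-tree backwards if it needs to.
CREATE INDEX ix_events_dayof ON events (dayof DESC);
-- or simply:
-- CREATE INDEX ix_events_dayof ON events (dayof);

-- Typical "looking backwards" query the index serves:
SELECT *
FROM events
WHERE dayof >= DATEADD(WEEK, -2, CAST(GETDATE() AS date));
```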
|
# ? Dec 4, 2018 16:28 |
|
Nth Doctor posted:It probably doesn't matter from a querying perspective if you do ASC or DESC, but it may be a hassle from a page management perspective on insert where every new row goes to the first entry of the index rather than the last. I'm not explaining it super well but I'd think your b trees would thank you for writing consistently to the tail of the index rather than the head. So I won’t get an advantage on querying for data if I make a btree (dayof DESC) index? What should I use if I’m going to be regularly “looking backwards” on a date range? We will also be regularly clearing and reinserting a certain number of rows (for a given uuid attribute). Basically we’re planning on dropping the past 13 days of data and re-inserting 14 days worth of data every day at 3am. What can I add to the database to make that easier for us? Pollyanna fucked around with this message at 17:01 on Dec 4, 2018 |
# ? Dec 4, 2018 16:55 |
|
Pollyanna posted:So I won’t get an advantage on querying for data if I make a btree (dayof DESC) index? What should I use if I’m going to be regularly “looking backwards” on a date range? So I have questions about the delete 13 insert 14 process, but leaving those aside:

Using either date asc or date desc should result in the index seeks you want, which will give you the performance boost you're after. Conceptualizing the structure of the index itself is where I'm making my argument for asc, but it presupposed you were going for an append-only type of insertion.

Here's where I'm coming from: think about the low-level memory manipulation that goes on when you allocate a range of addresses and use them to store a list of data. It's cheaper to append to the end of the list, provided you're still within the allocated bounds, than to prepend an entry at the head, because the head of the list sits at the boundary of your memory allocation. Prepending means every entry in the list needs to be moved down one spot; appending means maybe a pointer to the tail gets updated.

I'm still phone posting so I can't do fancy pictures to illustrate my point, but I think adding new entries at the head of an index could have similar performance issues to prepending an entry in a low-level memory-allocated list.
|
# ? Dec 4, 2018 18:05 |
|
Nth Doctor posted:So I have questions about the delete 13 insert 14 process, but leaving those aside: Oh man, trust me: I do too. quote:Using either a date asc or date desc should result in index seeks like you want which will give you the performance boost you're after. Been a while since I had to think about this, but I get what you mean. If we were only ever appending, then having a tail of the most recent just means updating a pointer. Unfortunately we’re also doing some weird windowing stuff so we’re also going back and dropping 13, then appending 14, and I guess that might make it more annoying, especially since we’re also scoping that by another column rather than the whole table.
|
# ? Dec 4, 2018 22:55 |
|
If you reinsert with it ordered properly then that'll help with reducing reads. If you're doing a ton of whole-day fuckery then you might be best served with partitioning if it's available in your db, since that'll make throwing away the old data fast, unless you're actually truncating the whole table. Although if 10 days ago hasn't changed then how does binning it and reinserting help anything other than your storage vendor?
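A hedged sketch of what date partitioning could look like in SQL Server 2016+ (all names and boundary values invented); dropping a day's data then becomes a metadata operation rather than row-by-row deletes:
code:
```sql
-- Hypothetical daily partitioning scheme.
CREATE PARTITION FUNCTION pf_daily (date)
    AS RANGE RIGHT FOR VALUES ('2018-12-01', '2018-12-02', '2018-12-03');

CREATE PARTITION SCHEME ps_daily
    AS PARTITION pf_daily ALL TO ([PRIMARY]);

CREATE TABLE report_data (
    report_uuid uniqueidentifier NOT NULL,
    dayof       date             NOT NULL,
    value       int              NOT NULL
) ON ps_daily (dayof);

-- Throwing away one day's worth of rows is then nearly instant:
TRUNCATE TABLE report_data WITH (PARTITIONS (2));
```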
|
# ? Dec 5, 2018 00:53 |
|
Moonwolf posted:If you reinsert with it ordered properly then that'll help with reducing reads. If you're doing a ton of whole-day fuckery then you might be best served with partitioning if it's available in your db, since that'll make throwing away the old data fast, unless you're actually truncating the whole table. Proper ordering on insert won't matter all that much if there's already the index. Unless the dates are themselves the clustered index, insertion order would only affect insertion time in a substantial way. You might get a tiny bit less fragmentation but that's really all it would do.
|
# ? Dec 5, 2018 01:31 |
|
Has anyone leveraged Change Tracking (MS SQL) on a linked server or through SSIS? Looks like the CT functions can’t be accessed remotely which leaves me with doing some hacky OpenQuery thing or writing it all directly in SSIS. Just wondering if anyone has experience here.
|
# ? Dec 5, 2018 15:25 |
|
Ruggan posted:Has anyone leveraged Change Tracking (MS SQL) on a linked server or through SSIS? Sup. I rolled my own replication leveraging change tracking. E: code:
Nth Doctor fucked around with this message at 16:51 on Dec 5, 2018 |
# ? Dec 5, 2018 16:45 |
|
Moonwolf posted:Although if 10 days ago hasn't changed then how does binning it and reinserting help anything other than your storage vendor? It might change, since the data we get each day may not be complete - part of the data takes time to be completely up to date. It might be finished the next day, so we can’t just get it once. I suggested making two tables for the different data, but it ended up being more complicated than it was worth. It’s super stupid and I hate our reporting system and the less I think about it the saner I remain.
|
# ? Dec 6, 2018 03:13 |
|
Question: What is the difference between creating a table with an autoincrementing primary key plus a secondary index on the relevant columns, versus creating a primary key across the relevant columns? Said another way: given that I have a table (in MS SQL 2017) of five columns, and three of them are the index columns, what's the difference between creating a PK on the three columns versus adding an autoincrementing column, making it the PK, and then creating a secondary index containing the three columns?
|
# ? Dec 7, 2018 04:21 |
|
Agrikk posted:Question: Depending on what the other columns are, it's a hell of a lot easier on the server to compare straight integers rather than long varchars and the like. Our typical PK is a tenant id and an identity column at a minimum. If there are more int columns as candidate keys we sometimes include them in the PK definition for clustered index benefits.
|
# ? Dec 7, 2018 04:33 |
|
Nth Doctor posted:Depending on what the other columns are, it's a hell of a lot easier on the server to compare straight integers rather than long varchars and the like. The three columns are a user ID (varchar(50)), a group ID (varchar(20)) and a timestamp, making the row unique for those three columns. Typical queries include searching for all timestamps for a given user/group combo and searching for all users in a given timestamp. If I were to use an autoincrementing primary key, I'd still have to have an index to make these use cases work properly, meaning every insert also has to update that index. What is a use case where an identity column makes sense?
|
# ? Dec 7, 2018 05:22 |
|
Agrikk posted:The three columns are a user ID (varchar(50)), a group ID (varchar(20)) and a timestamp, making the row unique for those three columns. Typical queries include searching for all timestamps for a given user/group combo and searching for all users in a given timestamp. Well, when you hit the limits, for example:
- primary key: max 32 columns
- index: 900 bytes for a clustered index, 1,700 for a nonclustered index

In general, if you have a bunch of columns that specify the "uniqueness" of a record, just give it a unique key (identity or whatever else) and make appropriate indexes for the most common queries. If you have only 3 columns that determine the uniqueness of a record, then it should be fine to have those as a primary key. Though, putting a timestamp in a unique constraint does seem a bit weird. It depends how it's used, though: does it ever get updated? Who inserts it? Are there possibilities of collision (the same timestamp being set from different clients/applications)? It can be fine though.
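The two layouts under discussion, sketched with the column types Agrikk described (all table, constraint, and extra-column names are made up):
code:
```sql
-- Option 1: natural composite primary key.
CREATE TABLE user_group_event (
    user_id  varchar(50) NOT NULL,
    group_id varchar(20) NOT NULL,
    ts       datetime2   NOT NULL,
    col4     int         NULL,
    col5     int         NULL,
    CONSTRAINT pk_uge PRIMARY KEY (user_id, group_id, ts)
);

-- Option 2: surrogate identity PK plus a unique secondary index.
CREATE TABLE user_group_event2 (
    id       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    user_id  varchar(50) NOT NULL,
    group_id varchar(20) NOT NULL,
    ts       datetime2   NOT NULL,
    col4     int         NULL,
    col5     int         NULL,
    CONSTRAINT uq_uge2 UNIQUE (user_id, group_id, ts)
);
```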
|
# ? Dec 7, 2018 05:49 |
|
Volguus posted:Well, when you hit the limits, for example: Thank you for this explanation. This helps. In my case, the data comes in for each user/group in batches, and the batch never changes. Thus there are never any updates and inserts happen once an hour for all user/group rows. The time of the insert is the timestamp.
|
# ? Dec 7, 2018 05:57 |
|
Agrikk posted:Question: If this data is actually relational and other tables might have foreign keys referencing rows in this table, then you should pretty much always use a surrogate primary key (i.e. an integer that doesn't mean anything). Natural primary keys are tempting but down that road lies madness. Also, keep in mind that primary keys are clustered by default in MSSQL and if you're using a natural primary key then it's quite likely that it won't be a good clustering key. This is a really complex topic but you can check out this for a primer. TheFluff fucked around with this message at 15:21 on Dec 7, 2018 |
# ? Dec 7, 2018 15:12 |
|
Yeah it's tempting to go "oh hey I don't need to waste space/resources on that primary key because I have uniqueness in (x)+(y)+(z)" but it almost never works out that way because of some of the reasons stated here.
|
# ? Dec 7, 2018 18:47 |
|
Oh yes, foreign keys. I forgot that one. A quite strong reason to not have a composite PK.
|
# ? Dec 7, 2018 19:31 |
|
I'd go so far as to put the 'natural' key in its own table and have the rows point to it with an FK. Better normalization, and a unique index wouldn't have to be nearly as big to cover the user+group+timestamp combination.
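That suggestion might look roughly like this (all names hypothetical):
code:
```sql
-- The user/group combination is stored once...
CREATE TABLE user_group (
    user_group_id int IDENTITY(1,1) PRIMARY KEY,
    user_id  varchar(50) NOT NULL,
    group_id varchar(20) NOT NULL,
    CONSTRAINT uq_user_group UNIQUE (user_id, group_id)
);

-- ...and the fact rows carry only a narrow int FK plus the timestamp,
-- so the unique index is 4 bytes + datetime2 instead of two varchars.
CREATE TABLE user_group_event (
    user_group_id int       NOT NULL
        REFERENCES user_group (user_group_id),
    ts            datetime2 NOT NULL,
    CONSTRAINT uq_uge UNIQUE (user_group_id, ts)
);
```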
|
# ? Dec 7, 2018 20:00 |
|
Performance and viability may depend on careful optimizations within the database or the construction of the indexes when you're using tricky primary keys. When you come to realize that AbcSql no longer supports some of your other needs and you want to switch to SqlLmn++ you'll almost certainly get to rebuild your application to use a basic integer key. This problem will be worse if you have text as a key field since behavior is database-dependent.
|
# ? Dec 8, 2018 14:39 |
|
Is there any recommended reading for SQL performance tuning for beginners? I'm using SQL Server, and have a query that takes roughly 2 minutes to run. The query runs overnight and is not mission critical but I'd like to try my hand at speeding it up. I'm not quite sure where to start reading/digging. *I'm not a DBA, just an analyst pulling data.
|
# ? Dec 8, 2018 17:10 |
|
The biggest thing for reads is to EXPLAIN and see if it does any table scans. Another important pitfall to beware of is correlated subqueries.
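(Worth noting that EXPLAIN is the MySQL/Postgres spelling; SQL Server exposes the same information through SSMS's "Include Actual Execution Plan" button or the SHOWPLAN/STATISTICS settings.) A sketch of the T-SQL equivalents:
code:
```sql
-- Show logical reads and CPU/elapsed time alongside results:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
-- Or return the estimated plan without executing the query
-- (must be the only statement in its batch):
SET SHOWPLAN_XML ON;
GO
SELECT ...  -- your slow query goes here
```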
|
# ? Dec 8, 2018 18:08 |
|
Ruggan posted:I think you’re both expressing the same thing - that SQL is (mostly) declarative - and that you should generally write queries in the most human parseable format as the optimizer is usually pretty good at determining an efficient control flow. Thanks for the link, was an interesting read. rt4 posted:The biggest thing for reads is to EXPLAIN and see if it does any table scans. Another important pitfall to beware of is correlated subqueries. I'll take a look at EXPLAIN, thanks!
|
# ? Dec 9, 2018 17:53 |
|
Hughmoris posted:Is there any recommended reading for SQL performance tuning for beginners? There are a couple of red flags like functions in the WHERE clause or in JOIN conditions. Another consideration is the size of columns being used in comparisons, how many joins, the number of columns returned, and the count of rows in each table. There's also analyzing the query execution plan. Redgate and some other companies have tools that make analyzing the execution plan easier that might be helpful as a beginner. If you could anonymize it, then it might be worth posting here or on dba.stackexchange to see if anyone can see some obvious red flags. When I did query analysis ~2 years ago it was always hard to find a quick beginners guide. It always felt a bit like an artform to get decent at identifying problems.
|
# ? Dec 10, 2018 14:07 |
|
Hughmoris posted:Is there any recommended reading for SQL performance tuning for beginners? You're probably going to end up using indexes, so you should understand them https://use-the-index-luke.com/sql/preface SSMS has a plan viewer built in (a GUI for the EXPLAIN output) but it's kinda annoying to use if you've got a complicated query or stored procedure. I like https://www.sentryone.com/plan-explorer better. Both will show you where the server thinks you're missing an index, if it has that advice. Sometimes it's even good advice!
|
# ? Dec 10, 2018 15:16 |
|
Yeah it seems that it always comes down to:
- Slow reads: not enough indexes, unforeseen joins on non-indexed columns / order by, implicit cast/conversion in the WHERE, weird isolation levels set/unset
- Slow updates: too many indexes, unknown triggers, table-wide locks, stalled jobs
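One concrete instance of the implicit cast/conversion red flag, on a hypothetical table with an index on a varchar column:
code:
```sql
-- Non-sargable: the N'' literal is nvarchar, which outranks varchar in
-- type precedence, so the column gets converted and the index seek is lost.
SELECT * FROM user_group_event WHERE user_id = N'alice';

-- Sargable: literal type matches the column, so an index seek is possible.
SELECT * FROM user_group_event WHERE user_id = 'alice';
```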
|
# ? Dec 10, 2018 19:57 |
|
Depending on the DB, complicated queries can produce --- guess --- complicated query execution plans. Sometimes you can learn a lot by playing. Start with the individual pieces of the query, build them up progressively, and see how the performance degrades. When you hit an obvious bottleneck, use EXPLAIN to see what has been added or changed in the plan that's causing the misbehavior. But yeah, there's no guide as such, because queries are built from simpler forms. Once you understand optimizing the basics, the rest will follow.
|
# ? Dec 11, 2018 00:16 |
|
I'd be curious to see your query and the DDL for the tables involved
|
# ? Dec 11, 2018 00:21 |
|
Thanks for all the ideas. This is dealing with healthcare data so I err on the side of caution in regards to posting relevant details but I'll research the topics pointed out. I do have a problem that has turned out to be a stumper for me and two others I work with: I have a table that contains alarms and acknowledgements. An acknowledgement can clear multiple alarms. For a given alarm, when was the next acknowledgment that occurred? Original Table code:
code:
I've started utilizing window functions more but mainly for first and last values. This feels like I could use LEAD but I'm not sure how to use the conditional of AcknowledgeTime > AlarmTime. Hughmoris fucked around with this message at 00:49 on Dec 11, 2018 |
# ? Dec 11, 2018 00:37 |
|
Some of the general guidelines my company has used in performance tuning:

Examine the actual query plan as everyone else has said. Using the handy SSMS visualization is super useful.
- Look for thick arrows indicating a shitload of rows going through an operation that yields thin arrows indicating few rows. Those can probably be redone to filter better.
- Look for unnecessary table and index scans.
- Look for spooling.
- Look for operations with drastic differences between the estimated number of rows and the actual number.

- Though the query optimizer can and will rearrange the join orders on you, make an attempt to help it along by joining from tables with few rows to tables with many
- List join predicates in the same order as a hypothetical index on the table you are joining
- Try to leverage PK / clustered indices in join predicates
- Keep index statistics relatively up to date
- Rebuild the indices during low usage times if they need it
- Avoid subqueries
- Avoid UDFs
- Avoid Outer Apply

If you can't bash the query into submission with those steps, try breaking out bigger guns:
- Use index hints
- Use join type hints
- Option (force order)

These things take flexibility away from the query optimizer, though, which isn't really great for the long term. I can probably come off more sensibly when not phone posting, but hopefully that with everything else people have said will help.
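Syntactically, those "bigger guns" look like this in T-SQL (table and index names invented); they pin the optimizer down, so use them sparingly:
code:
```sql
SELECT o.order_id
FROM orders AS o WITH (INDEX (ix_orders_customer))  -- index hint
INNER LOOP JOIN customers AS c                      -- join type hint
    ON c.customer_id = o.customer_id
OPTION (FORCE ORDER);                               -- keep the written join order
```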
|
# ? Dec 11, 2018 00:49 |
|
Hughmoris posted:Thanks for all the ideas. This is dealing with healthcare data so I err on the side of caution in regards to posting relevant details but I'll research the topics pointed out. Ugh. I don't like schemas like this. What filters are you getting for the query? What indices are on the table? Is it really a patient name or some numerical ID? Same question for event type?
|
# ? Dec 11, 2018 01:04 |
|
Nth Doctor posted:Good stuff... Thanks for this! Nth Doctor posted:Ugh. I don't like schemas like this. I don't know on the first two questions. I need to read up on indices. Its a patient numerical ID, which is unique to the patient encounter. The event type is generic. The only unique things are the patient ID and the (relatively unique) timestamp. The gist of the basic query that returns the original table above is: SQL code:
|
# ? Dec 11, 2018 01:13 |
|
Hughmoris posted:Thanks for all the ideas. This is dealing with healthcare data so I err on the side of caution in regards to posting relevant details but I'll research the topics pointed out. Are you asking what query gets the results in "Ideal Endstate"? If so, you can join two data sets based on timestamps. The first data set is Alarms and the second is Acknowledgements. The join is pretty complex because you need the first row from the Acknowledgement set where the timestamp is greater than the Alarm timestamp. I can't remember exactly what the syntax was, but it's possible. Actually it might be cross apply. downout fucked around with this message at 02:51 on Dec 11, 2018 |
# ? Dec 11, 2018 02:48 |
|
downout posted:Are you asking what query gets the results in "Ideal Endstate"? If so, you can join two data sets based on timestamps. The first data set is Alarms and the second is Acknowledgements. The join is pretty complex because you need the first row from the Acknowledgement set where the timestamp is greater than the Alarm timestamp. I can't remember exactly what the syntax was, but it's possible. Actually it might be cross apply. Correct, looking for a query to get the Ideal EndState. I believe I figured it out using a correlated subquery. The performance is a hot garbage fire but it returns the correct results (that I've validated so far).
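A correlated-subquery version of the "next acknowledgement after each alarm" lookup might look like this (table and column names guessed, not the actual schema):
code:
```sql
-- Guessed schema: EventID = 1 for alarms, 2 for acknowledgements.
SELECT a.PatientID,
       a.EventTime AS AlarmTime,
       (SELECT MIN(e.EventTime)
        FROM clinical_event AS e
        WHERE e.PatientID = a.PatientID
          AND e.EventID   = 2
          AND e.EventTime > a.EventTime) AS AckTime
FROM clinical_event AS a
WHERE a.EventID = 1;
```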
|
# ? Dec 11, 2018 03:03 |
|
I don't think it's possible to get a correct result here without a correlated subquery
|
# ? Dec 11, 2018 04:08 |
|
you could i think........ but it would involve a while loop
|
# ? Dec 11, 2018 04:11 |
|
|
|
I took a derived table approach with a window function and had something close to working:code:
pre:
Distinct_Patients Total_Records
----------------- -------------
10001             1000000

(1 row(s) affected)

Event_ID    Total
----------- -----------
1           600373
2           399627

(2 row(s) affected)

Patient_No  Alarm       Alarm_Date              Acknowledgement Acknowledgement_Date
----------- ----------- ----------------------- --------------- -----------------------
1           1           1970-01-07 22:41:00.000 2               1970-02-25 13:28:00.000
1           1           1970-02-04 17:25:00.000 2               1970-02-25 13:28:00.000
1           1           1970-02-18 14:47:00.000 2               1970-02-25 13:28:00.000
1           1           1970-04-29 01:37:00.000 2               1970-06-09 17:43:00.000
...

This runs in 9 seconds for a million row table, including the time taken to generate the data. Nth Doctor fucked around with this message at 18:45 on Dec 11, 2018 |
# ? Dec 11, 2018 18:36 |