|
Razzled posted:Sure, in SQL Server there are 2 main types of indexes-- clustered and nonclustered indexes. Feel free to look up the specifics of each but basically the difference is that you can only have one clustered index per table but many nonclustered indexes on a single table. In any case though, you can always delete them if you mess up and make new ones. You might want to avoid doing this too much for huge tables though as it is a resource-expensive operation. -S- posted:You could throw your OfficeId=<officeguid> into the/another join since it looks like you want them to be the same: Thank you for the suggestion on the inner join. And well, uh, what you said to make drat sure of, unfortunately at the moment that is exactly how I've set up the db! By default when you create a table in SQL Azure it will create a clustered index on the primary key column, which in the majority of my tables is a guid. So I messed that up from the get-go. So at the moment I think I'm dealing with 2 issues: 1. I need to change the clustered index from the guid primary key to another column in the db. I'm reading up on how to do this, but would appreciate any insight. 2. I need to add some non-clustered indexes for overall performance. I think I need to tackle the first problem before worrying about the second. Seems like that could be a bigger issue and I'll hopefully see some performance gains from fixing that. One suggestion I saw was to use the CreatedDate column as the clustered index on those tables. I'm going to start there; my only concern is that not all of the tables have a CreatedDate column. Thanks for the help guys!
|
# ? Jun 22, 2014 17:12 |
|
|
# ? May 31, 2024 01:41 |
|
Essential posted:Alright I'm finally able to get back on this so thank you for the information Razzled. What I usually do is something like:
Table1.Table1Id: IDENTITY INT, Clustered Index
Table1.Table1GUID: GUID, Non-Clustered Index, Primary Key
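In T-SQL, that pattern comes out roughly like the sketch below (table, column, and constraint names are made up for illustration):

```sql
-- Hypothetical table following the pattern above: a surrogate
-- identity INT carries the clustered index, while the GUID is
-- the primary key with a non-clustered index.
CREATE TABLE Table1 (
    Table1Id   INT IDENTITY(1,1) NOT NULL,
    Table1GUID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    CONSTRAINT PK_Table1 PRIMARY KEY NONCLUSTERED (Table1GUID)
);

-- The clustered index goes on the ever-increasing identity column,
-- so new rows always append at the end of the table instead of
-- splitting pages at random GUID positions.
CREATE UNIQUE CLUSTERED INDEX CIX_Table1_Table1Id ON Table1 (Table1Id);
```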
|
# ? Jun 22, 2014 18:10 |
|
-S- posted:What I usually do is something like: Ints all the way, motherfuckers! Out of curiosity, why did you choose to do PK GUIDs?
|
# ? Jun 22, 2014 18:23 |
|
-S- posted:What I usually do is something like: Got it, thanks! Now, is there a way to add that to my table and have it populate? I can always do it thru code, but if I can do it thru sql that would be awesome. Also, is the IDENTITY column something that you just let auto-increment? I mean, is it something that can be mostly ignored after adding it as it just does its job as the clustered index? Nth Doctor posted:Ints all the way, motherfuckers! Out of curiosity, why did you choose to do PK GUIDs? Each office is assigned a guid and I had always been told to create PKs as guids, except for small lookup tables. I'm realizing that the guys I learned from never worked on dbs of any real size, however they talked like they had. My db isn't particularly large right now, only a few tables have over 1 million rows, but it's going to grow exponentially so now I'm pretty worried about how I designed this. Essential fucked around with this message at 19:04 on Jun 22, 2014 |
# ? Jun 22, 2014 18:59 |
|
Nth Doctor posted:Ints all the way, motherfuckers! Out of curiosity, why did you choose to do PK GUIDs? Because I have data that exists in multiple systems/servers/dbs and across discrete client systems. I use GUIDs and a few natural keys on other items. Order 1 in system 1 may be Order 35644 in system 2, so the GUID is how I know it's the same order. That's a made up example, but you get the point. There are a few other reasons based on organizational requirements that make it the right call, but those are internal company reasons. Also it makes merging to the data warehouse from all of the discrete systems pretty dadgum easy. Essential posted:Got it, thanks! Now, is there a way to add that to my table and have it populate? I can always do it thru code, but if I can do it thru sql that would be awesome. It's really set and forget. It'll just keep on incrementing. When you create it, specify 1,1: IDENTITY(1,1). That means "start at 1, increment by 1". That should explain why you don't use GUIDs with CIs - CIs sort, and how the hell do you quickly sort GUIDs? So, use an identity INT column as the CI on the table. edit: I missed your first question. You can ALTER TABLE, but it'll add the identity column to the end. Ideally you can destroy and rebuild the table from scratch, but I don't know how feasible that will be - usually your foreign key constraints will make this a bitch and a half. From the way you describe how the DB is set up, I'd be surprised if they had actual FK Constraints anywhere, though. If you CAN destroy and rebuild the table, rename the existing table, build your new table, move the data into the new table, drop the old table. big trivia FAIL fucked around with this message at 20:31 on Jun 22, 2014 |
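The rename-and-rebuild flow described here might look something like this sketch (all names are hypothetical, and any FK constraints pointing at the table would need to be dropped and re-created around it):

```sql
-- 1. Get the old table out of the way.
EXEC sp_rename 'Orders', 'Orders_old';

-- 2. Build the replacement with the identity column in place.
CREATE TABLE Orders (
    OrderId     INT IDENTITY(1,1) NOT NULL,
    OrderGUID   UNIQUEIDENTIFIER NOT NULL,
    CreatedDate DATETIME NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderGUID)
);
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderId ON Orders (OrderId);

-- 3. Move the data; the identity column fills itself in.
INSERT INTO Orders (OrderGUID, CreatedDate)
SELECT OrderGUID, CreatedDate
FROM Orders_old;

-- 4. Drop the old copy once the data is verified.
DROP TABLE Orders_old;
```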
# ? Jun 22, 2014 20:07 |
|
But guids are basically random 128bit ints? Clustered indexing on a guid is gonna make inserts bad and is not going to give you any perf benefits but it isn't going to magically make your other indices useless. For real try running whatever Sql Server's equivalent of explain analyze is first to profile your bottlenecks. Without that data it's basically impossible to tell how the planner has mangled your query. e: pretty sure SSMS works on sql azure Malcolm XML fucked around with this message at 20:35 on Jun 22, 2014 |
# ? Jun 22, 2014 20:31 |
|
Malcolm XML posted:But guids are basically random 128bit ints? Clustered indexing on a guid is gonna make inserts bad and is not going to give you any perf benefits but it isn't going to magically make your other indices useless. what? I specifically told him to take the clustered indexes off of his GUIDs for that very reason. Make an identity int your CI, GUID your PK with a non-clustered idx, and add other non clustered idxes where necessary.
|
# ? Jun 22, 2014 20:38 |
|
-S- posted:It's really set and forget. It'll just keep on incrementing. When you create it, specify 1,1: IDENTITY(1,1). That means "start at 1, increment by 1". That should explain why you don't use GUIDs with CIs - CIs sort, and how the hell do you quickly sort GUIDs? So, use an identity INT column as the CI on the table. Thanks so much for the help and advice. I'm going to spend this week rebuilding the tables with your suggested method. I don't see any reason I can't destroy/rebuild the tables at night, as nobody accesses them then.
|
# ? Jun 23, 2014 00:05 |
|
-S- posted:what? I specifically told him to take the clustered indexes off of his GUIDs for that very reason. Make an identity int your CI, GUID your PK with a non-clustered idx, and add other non clustered idxes where necessary. Blame awful.ipa. You already have a pk in the guid; cluster off of something meaningful. Don't add a useless field! But clustered indices don't reduce read perf, and it's a minor insert issue at worst.
|
# ? Jun 27, 2014 16:08 |
|
Malcolm XML posted:Blame awful.ipa This is the most ridiculous thing ever. Run a performance scan against GUIDs vs INTs as the PK clustered index on read and get back to me. Also, I'm saying KEEP THE GUID THE PK. Just don't make it the Clustered Index on the table. edit an hour and a half later: lol @ "minor insert issue" and also edit again: the INT CIdx isn't a useless field it's the goddamn CI on the table lol. you could make it the timestamp field if you're a baby but to call it a "useless" field is literally spouting databases 101 material that you've heard and never practiced. big trivia FAIL fucked around with this message at 01:40 on Jun 28, 2014 |
# ? Jun 27, 2014 23:51 |
|
-S- posted:This is the most ridiculous thing ever. Run a performance scan against GUIDs vs INTs as the PK clustered index on read and get back to me. Chill dude. Use a CI on a field that you will benefit from having the data, well, clustered. You can have non clustered pks as well, so I really don't get why you want to add yet another column that is semantically meaningless; wasting the clustering index on something that won't benefit from the acceleration is dumb as heck. I'm assuming an extremely asymmetric distribution where you're rarely inserting, but that depends on your processing needs (profile!!!). Unless the guy is using a bunch of range lookups on a particular field, I wouldn't even bother with selecting a CI unless you need to, since sql server is totally cool with that. How big is the data and what is the nature of the queries that come in?
|
# ? Jun 28, 2014 03:48 |
|
Malcolm XML posted:Chill dude. That's part of the issue. I don't know his data or what columns in his tables will benefit from a CIdx, so I gave him the quickest fix for his issue. He had queries taking ages to run, and removing the CIdx from the GUID and adding one to an INT ID column was the quickest solution for his issue. If I worked with him I might could give him a better solution, but on the Something Awful Forums to give a dude a little lift without any legitimate insight into his data structure, I gave him the best option available with the information on hand. I think I'm still mad about that dude saying "bbbbbut cursors are just loops!" as if that makes them any goddamn better to run in SQL.
|
# ? Jun 28, 2014 03:55 |
|
But also CIdx on GUID is bad bad bad and is a whole lot worse than "a minor insert issue" duder
|
# ? Jun 28, 2014 03:59 |
|
-S- posted:But also CIdx on GUID is bad bad bad and is a whole lot worse than "a minor insert issue" duder Depends on the volatility of the table and the fill rate, as well as how often you are willing to reorganize/rebuild the thing.
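For what it's worth, the knobs being referred to here are the index fill factor and the rebuild/reorganize maintenance commands; a sketch with hypothetical index and table names:

```sql
-- Leave 20% free space per page so random GUID inserts have
-- somewhere to land before pages have to split.
ALTER INDEX PK_Orders ON Orders
REBUILD WITH (FILLFACTOR = 80);

-- Lighter-weight defragmentation to run between full rebuilds.
ALTER INDEX PK_Orders ON Orders REORGANIZE;
```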
|
# ? Jun 28, 2014 06:35 |
|
-S- posted:But also CIdx on GUID is bad bad bad and is a whole lot worse than "a minor insert issue" duder No, not really; it depends on the usage pattern. But a clustering index always slows down other indices, so having one on a field that is essentially never range-queried is a great way to slow down your queries for no reason! There is no silver bullet to sql perf other than profiling your usage and trying to get index-only scans.
|
# ? Jun 28, 2014 12:51 |
|
Now that the forums are back up I can jump in and let you guys know the situation (also how dumb I am). I'll apologize right now for the confusion. The situation I described wasn't accurate. The only table I have with a guid PK is the Offices table. The other tables (OfficeProcedures etc) have an Int PK and a FK to the Office PK. So -S-, the situation you described with having an auto incrementing int as the PK is how my tables are set up. And since this is SQL Azure, that int column WAS the clustered index by default. The thing I didn't have though was any non-clustered indexes, and that's where you guys helped me tremendously. Once I added a non-clustered index on the officeprocedures.officeid guid, that particular query went from 15 minutes to ~1 second. There was an immediate and incredible speed increase to that query. I then went into each table that has a fk to officeid and added a non-clustered index to each one and saw some serious speed increases to most of my queries, however some actually slowed down. I was then able to spend a few days reading up on indexes and I added some more non-clustered indexes to a few tables and now everything is running much faster than ever before. The sql queries are now the fastest part of my processing. -S- posted:That's part of the issue. I don't know his data or what columns in his tables will benefit from a CIdx, so I gave him the quickest fix for his issue. He had queries taking ages to run, and removing the CIdx from the GUID and adding one to an INT ID column was the quickest solution for his issue. If I worked with him I might could give him a better solution, but on the Something Awful Forums to give a dude a little lift without any legitimate insight into his data structure, I gave him the best option available with the information on hand. I wanted to add that this was exactly the kind of help I was looking for and expected no more than this, so big thank you -S-! 
I wasn't looking for a dissertation or 50 page whitepaper on the proper way to setup sql tables & indexes. I'm aware my db isn't designed in the best way possible and was hoping (and received) a quick way to implement an immediate impact for my customers. This quick fix is allowing me the breathing room to look into the issue in more depth and detail and decide what I can do on my own, or when it is time to hire a real dba to come in and fix what I can't. Essential fucked around with this message at 18:49 on Jun 28, 2014 |
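For the concrete table mentioned above, the fix Essential describes boils down to a single statement along these lines (the index name is illustrative):

```sql
-- Non-clustered index on the foreign-key GUID column, so lookups
-- by office stop scanning the whole OfficeProcedures table.
CREATE NONCLUSTERED INDEX IX_OfficeProcedures_OfficeId
ON OfficeProcedures (OfficeId);
```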
# ? Jun 28, 2014 18:36 |
|
Essential posted:Now that the forums are back up I can jump in and let you guys know the situation (also how dumb I am). I'll apologize right now for the confusion. The situation I described wasn't accurate. The only table I have with a guid PK is the Offices table. The other tables (OfficeProcedures etc) have an Int PK and a FK to the Office PK. So -S-, the situation you described with having an auto incrementing int as the PK is how my tables are setup. And since this is SQL Azure, that int column WAS the clustered index by default. Given those results there is nothing wrong with doing the int pk CI but it's worth noting that by doing that you have reinvented heap tables E: turns out sql azure doesn't support non clustered tables so this is more acceptable I guess. Still it's worth picking a better index to cluster around. Malcolm XML fucked around with this message at 22:38 on Jun 28, 2014 |
# ? Jun 28, 2014 22:32 |
|
Malcolm XML posted:Given those results there is nothing wrong with doing the int pk CI but it's worth noting that by doing that you have reinvented heap tables I'll look into the heap tables thing, but yes, as I've learned all tables in sql azure must have a clustered index and by default, if you do not specify one during table creation, sql azure will assign the pk as the clustered index.
|
# ? Jun 29, 2014 02:46 |
|
Essential posted:I'll look into the heap tables thing, but yes, as I've learned all tables in sql azure must have a clustered index and by default, if you do not specify one during table creation, sql azure will assign the pk as the clustered index. I think what Malcolm XML means is that this kind of Clustered Index is basically useless since (from what I've gathered) it's on a surrogate key column that is essentially meaningless and is not a sensible candidate for range-scans. So, you have a table that is basically just a heap table with extra overhead. A heap table does not force a relationship between the organization of the table on disk and the primary key index. With a clustered index, the idea is that if you select a contiguous slice of the clustered index, the table access should involve a roughly contiguous slice of blocks from the disk; it does this by forcing the actual table data to be sorted on insertion or update. (a heap table can still be well-clustered with respect to one key, it's just not forced to be). E.g. code:
code:
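A minimal sketch of the contiguous-slice access being described, using a hypothetical table and assuming the clustered index sits on a date column:

```sql
-- Cluster the table on a date column that gets range-scanned...
CREATE CLUSTERED INDEX CIX_Orders_OrderDate ON Orders (OrderDate);

-- ...so that a contiguous slice of the index corresponds to a
-- roughly contiguous run of pages on disk.
SELECT *
FROM Orders
WHERE OrderDate >= '2014-06-01' AND OrderDate < '2014-07-01';
```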
As a developer who mostly works with Oracle RDBMS, I'm not sure why Microsoft is so hung up on using CI's for everything. (the default is heap table. the equivalent of a Clustered Index is an Index Organized Table). A CI, it seems to me, is good for one thing, which is an index with a leading column or column set that will be range-scanned a lot. That's why for Clustered Indexes people like to use a candidate key that leads with something like a date column. Also Malcolm brings up a good point, which is that a CI is not free because it adds extra overhead on insert, especially if you start adding other indexes (having multiple indexes is perfectly valid and fine, but they add more overhead than usual if there's a clustered index because rows don't have fixed physical locations). But I guess you have no choice on having one, which seems weird? If you're forced to have one then yes, I guess a useless CI is better than an actively harmful CI. FieryBalrog fucked around with this message at 06:39 on Jun 29, 2014 |
# ? Jun 29, 2014 06:04 |
|
Exactly this. Apparently SQL Azure requires a clustered index to do replication, but ye olde SQL Server 2014 does not and is happy to use heap tables if you state a nonclustered PK.
|
# ? Jun 29, 2014 13:47 |
|
FieryBalrog posted:stuff Thanks for the detailed post. As for my tables & queries it's like this: First, all data is inserted in bulk, however that's going to change soon. Right now data is uploaded at the end of the month, but soon will be switching to daily uploads. Almost everything is queried by date range, so for instance all procedures for a given month, for a particular office. So "Select * From OfficeProcedures Where OfficeID = '<officeguid>' and (ProcDate >= '<date>' And ProcDate <= '<date>')". The OfficeProcedures table has many offices, let's say 10 different offices, with each office having upwards of 300k procedures. The data breaks down so that any particular day all 10 of the offices may have 100's of procedures. My first instinct is that ProcDate may be a good candidate for the CI, however all 10 offices probably have multiple procedures on the same day. I have no idea if that matters or if I'm even thinking about it the right way. Some other candidates could be: PK from originating db (varchar), CreatedDate from originating db. If I made the ProcDate the CI and had a non-CI on the OfficeID, would sql group procedures first by date, then by office for indexing? Should I be trying to find something that will allow office records to be indexed together? I feel like something is going right over my head and I'm confused on things I shouldn't even be thinking about. Essential fucked around with this message at 20:39 on Jun 29, 2014 |
# ? Jun 29, 2014 20:36 |
|
What's the volume of data going in each day, roughly? What's the size of your db in total? For that office procedure table it looks like around 50k rows a day, which is not all that much, and at that volume you can afford to go hog wild with your indices, esp if you can schedule the batch insert at a time when people aren't doing anything else on the db. Sql server is totally cool with non-unique clustered indices, but it adds some tiny overhead. But since your data is already clustered you will have to recreate the table, so don't bother. What you should do is install SQL Server Management Studio and verify your indices are being used by your queries, and if they aren't, figure out why not.
|
# ? Jun 30, 2014 01:15 |
|
Malcolm XML posted:What's the volume of data going in each day roughly ? Total db size is 1.2gigs and 50k rows daily is pretty drat close. If things continue the way they are that number is doubling every month, so could be around 600k-700k rows daily in 1 year's time. I should add that about 1/2 of those rows are deleted and then re-inserted, so the total db size isn't growing by 50k rows daily. In case anyone is wondering why they are deleted and then re-inserted, the reason for that is the originating db doesn't have createddate/modifieddate for almost all of their tables, therefore there is no way to know if a record was added or changed. Deleting and re-inserting was much easier at the time than trying to use the sql row hash or other methods to see if data has changed (I am/was using entity framework for db access and it's horribly slow to do row-by-row data comparisons, so I decided to just delete/re-insert on certain tables).
|
# ? Jun 30, 2014 02:58 |
|
We have an old lovely mysql database that does not have relations. I have been adding relations to our development database and pretty often I get foreign key constraint errors because there is, for example, a vehicle that refers to a user that does not exist. So vehicle.userid = 42 and there is no user with id 42. There must be some simple query I can run to either set the userid to null or outright delete any vehicle rows that refer to nonexistent users. delete from vehicle where id in (some subquery that lists ids of vehicles that refer to nonexistent users)
|
# ? Jun 30, 2014 13:18 |
|
Wheany posted:We have an old lovely mysql database that does not have relations. I have been adding relations to our development database and pretty often I get foreign key constraint errors because there is for example, a vehicle that refers to a user that does not exist. So vehicle.userid = 42 and there is no user with id 42. code:
code:
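A sketch of the two cleanup options, delete the orphans or null out the dangling reference, assuming MySQL and the vehicle/user names from the question:

```sql
-- Option 1: delete vehicles whose userid points at a missing user.
DELETE FROM vehicle
WHERE userid IS NOT NULL
  AND userid NOT IN (SELECT id FROM user);

-- Option 2: keep the rows and just clear the bad reference,
-- so the FK constraint can be added afterwards.
UPDATE vehicle
SET userid = NULL
WHERE userid IS NOT NULL
  AND userid NOT IN (SELECT id FROM user);
```

The `userid IS NOT NULL` guard keeps already-null rows out of the way; it also sidesteps the usual `NOT IN` surprises when NULLs are involved.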
FieryBalrog fucked around with this message at 14:13 on Jun 30, 2014 |
# ? Jun 30, 2014 13:29 |
|
Essential posted:Thanks for the detailed post. As for my tables & queries it's like this: You know you can have a multi-column index, right? We have lots of historical tables. An index might look like this: code:
code:
You would also almost certainly want an alternate index with OfficeId as the leading column, to allow efficient access to date-range data for queries that look at a particular OfficeId or just a few OfficeIds. It looks "wasteful", but the following is actually very useful: code:
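For the table under discussion, the two composite indexes might look something like this (index names are illustrative):

```sql
-- Leading on the date: serves "all offices in a date range".
CREATE NONCLUSTERED INDEX IX_OfficeProcedures_ProcDate_OfficeId
ON OfficeProcedures (ProcDate, OfficeId);

-- Same columns, re-arranged: serves "one office in a date range",
-- i.e. WHERE OfficeId = @office AND ProcDate BETWEEN @from AND @to.
CREATE NONCLUSTERED INDEX IX_OfficeProcedures_OfficeId_ProcDate
ON OfficeProcedures (OfficeId, ProcDate);
```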
TLDR: indexes exist to serve query access paths. So if you have multiple common query access paths, you should probably have multiple indexes to serve them, and it can even be just a re-arrangement of columns of an existing index. FieryBalrog fucked around with this message at 14:09 on Jun 30, 2014 |
# ? Jun 30, 2014 13:45 |
|
Essential posted:Total db size is 1.2gigs and 50k rows daily is pretty drat close. If things continue the way they are that number is doubling every month, so could be around 600k-700k rows daily in 1 years time. I should add that about 1/2 of those rows are deleted and then re-inserted so the total db size isn't growing by 50k rows daily. Oh lol, 1.2GB is tiny. If sql azure weren't crippled in many ways I would look into clustered columnstores, since your use case is more OLAP than OLTP, but your EA contact (you have one and signed a BAA for HIPAA compliance and are implementing all of that right? Right????) should really know more. Also entity framework is hot garbage for anything but RAD; try using Dapper or something less terrible. Malcolm XML fucked around with this message at 14:12 on Jun 30, 2014 |
# ? Jun 30, 2014 14:07 |
|
FieryBalrog posted:
Thanks, exactly what I needed.
|
# ? Jun 30, 2014 14:38 |
|
I have an odd time-related question. I have a table named "Addresses" with a field named "Fees". This field basically represents the condo fees of an address X. I figure that odds are it could change over time. I'm looking for a smart way to keep track of how the fees changed over time. Would the best way to do it be via some sort of Audit table where I would keep track of the old value, new value and timestamp? If this is the way to go, is there some sort of query I could run to get the value of the fee over the course of a calendar year?
|
# ? Jun 30, 2014 15:33 |
|
xenilk posted:I have an odd question time related question. You presumably want an address table (no time dimension, just address dimension) that doesn't have "fees" as a column, and then a table that tracks addresses over time by intersecting that address table vs. time-dependent information like fees. this table can just point to the address table with a foreign key and attach information, with a primary key defined by (address_foreign_key, valid_start_date) or some such. That way you can keep track of other time-dependent information too. Audit table is fine too, but it's slightly less normalized (because you have fee information in two places that needs to be kept in sync and might cause referential issues, or you have to muck around with triggers). Also it will probably get messy if you need to track multiple changing things.
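A sketch of that layout, with hypothetical names and a validity interval per fee row:

```sql
CREATE TABLE AddressFeeHistory (
    AddressId      INT  NOT NULL REFERENCES Addresses (AddressId),
    ValidStartDate DATE NOT NULL,
    ValidEndDate   DATE NULL,          -- NULL = still in effect
    Fees           DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (AddressId, ValidStartDate)
);

-- Every fee that was in effect at some point during 2014
-- for one address: any interval overlapping the year qualifies.
SELECT Fees, ValidStartDate, ValidEndDate
FROM AddressFeeHistory
WHERE AddressId = 42
  AND ValidStartDate <= '2014-12-31'
  AND (ValidEndDate IS NULL OR ValidEndDate >= '2014-01-01');
```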
|
# ? Jun 30, 2014 16:21 |
|
FieryBalrog posted:You know you can have a multi-column index, right? We have lots of historical tables. An index might look like this: Nope, I had no idea on that. Thank you very much for the detailed help. Things are starting to make a lot more sense now. Malcolm XML posted:Oh lol 1.2GB is tiny. If sql azure weren't crippled in many ways I would look into clustered columnstores since your usecase is more OLAP than OLTP but your EA contact (you have one and signed a BAA for HIPAA compliance and are implementing all of that right? Right????) should really know more. Regarding Hipaa compliance, we don't have any identifying information so we don't have to worry about hipaa compliance (although any gray areas are encrypted just to be safe). You are exactly right in that our use is OLAP, being entirely BI focused.
|
# ? Jun 30, 2014 21:20 |
|
Essential posted:Nope, I had no idea on that. Thank you very much for the detailed help. Things are starting to make a lot more sense now. OK, I'm gonna assume your lawyers have given you clearance. If you had an on-premises install I'd suggest using a clustered columnstore index, but I guess the best sql azure can do is multicolumn clustered indices. Rebuild it after the ETL window closes.
|
# ? Jun 30, 2014 23:18 |
|
Malcolm XML posted:If you had an on premise i'd suggest using a clustered columnstore index but i guess the best sql azure can do is multicolumn clustered indices. Rebuild it after the ETL window closes. Had to look up ETL but I got it, thanks! And thank you to everyone for the help, it really is appreciated.
|
# ? Jul 1, 2014 00:08 |
|
Hi, what tools do you use as mySQL clients? Most of our work day is spent writing queries, either live or in stored procedures. We're currently using SQL Explorer, an Eclipse plugin, which works nicely with our shared svn repository of SQL scripts, which we imported as a project in Eclipse using SVNKit. The issue we have is that Eclipse/SQL Explorer crashes a lot whenever we click the wrong button at the wrong time, and it gets really frustrating because usually a good chunk of queries/code is just gone. I've been looking at other tools, but I haven't found one yet that offers such a nice integration of browsing/committing/updating a svn repository. At this point we would also easily consider a commercial solution. Any recommendations here?
|
# ? Jul 1, 2014 11:39 |
|
e: oops i misread that post!
|
# ? Jul 1, 2014 16:40 |
I have a query that's running about 20 seconds. It grabs all the data just fine, but when it tries the ORDER BY it craps itself. It's got about 650k rows that need to be sorted according to 4 columns, which can come in one of 4 orders, e.g. ORDER BY a, b, c, d or ORDER BY b, a, c, d and so on. Here's one instance of the query, which takes 20 seconds to execute and generates 650k rows before the LIMIT. code:
code:
When I ORDER BY only the first table's columns (i.e. d.shortname, d.startdate) it's near-instant. It's the second table that's hosed. EXPLAIN SELECT for the base query: code:
I'm trying a whole bunch of different things, including pre-sorting the different tables, e.g. code:
That final pre-sorted query takes way longer than a minute to execute, such that I just cancel it every time rather than trying to find out the exact time. Here's the explain extended: code:
|
|
# ? Jul 1, 2014 18:59 |
|
Sulla-Marius 88 posted:I have a query that's running about 20 seconds. It grabs all the data just fine but when it tries the ORDER BY it craps itself. It's got about 650k rows that need to be sorted according to 4 columns, which can come in one of 4 ways... The EXPLAIN explains it all about your other table. Those indexes you're making aren't doing the trick for this thing -- SQL chose to use the PK for mdl_user rather than a specific index that includes those columns, and your other index (newindex1) isn't being used either. 99% bet that it's not being used because the column you're using as a join predicate-cum-WHERE clause isn't included in the index. Replace
code:
CREATE INDEX newindex1 ON mdl_course ( shortname, startdate );
with
code:
CREATE INDEX IX_IDStartShortName ON mdl_course ( id, startdate, shortname );
and you should see some improvement. Similarly,
code:
CREATE INDEX IX_IDName ON mdl_user ( id, firstname, lastname );
should help. I'd be surprised if it didn't.
|
# ? Jul 1, 2014 20:35 |
|
How do I format something like this: Select Name FROM List WHERE Gender = 'Man' AND Color = 'Blue' OR Color = 'Red' OR Color = 'Yellow' I want it to choose the first option AND one of the others, how do I make sure that happens? Also, is there any way I can write the second condition where I don't have to write each COLOR every time?
|
# ? Jul 1, 2014 20:50 |
|
the posted:How do I format something like this: I'm an SQL novice but this should work: Select Name FROM List WHERE Gender = 'Man' AND Color IN ('Blue', 'Red', 'Yellow')
|
# ? Jul 1, 2014 20:59 |
|
|
the posted:How do I format something like this: Kumbamontu's response is the better solution, but I just wanted to point out that you weren't too far off. Just need to add some parentheses:
code:
SELECT Name FROM List
WHERE Gender = 'Man'
  AND (Color = 'Blue' OR Color = 'Red' OR Color = 'Yellow')
|
|
# ? Jul 1, 2014 21:36 |