Essential
Aug 14, 2003

Razzled posted:

Sure, in SQL Server there are two main types of indexes: clustered and nonclustered. Feel free to look up the specifics of each, but basically the difference is that you can have only one clustered index per table but many nonclustered indexes on a single table. In any case, you can always delete them if you mess up and make new ones. You might want to avoid doing this too much for huge tables though, as it is a resource-expensive operation.

The reasoning behind this is that clustered indexes deal with the actual order of the rows of data in the table. A nonclustered index is a separate data structure from the table data and contains references to the table's data to help queries find the data faster. Another side note is that clustered indexes consume essentially no additional space, because the leaf level of a clustered index is the table data itself. Nonclustered indexes, on the other hand, require additional disk space and scale in size proportionally to the size of the underlying table.

Keep in mind that indexing enormous tables can be a very taxing operation and can cause not only performance degradation but also prolonged locking. Maybe do it in a test environment if you're worried about breaking stuff.
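For reference, the two kinds look something like this in T-SQL (table, column, and index names here are made up for illustration):

code:
-- only one of these per table; it defines the physical order of the rows
CREATE CLUSTERED INDEX IX_MyTable_Id ON MyTable (Id);

-- as many of these as you want; separate structures that point back at the rows
CREATE NONCLUSTERED INDEX IX_MyTable_CreatedDate ON MyTable (CreatedDate);

-- and if you mess one up, drop it and start over
DROP INDEX IX_MyTable_CreatedDate ON MyTable;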
Alright I'm finally able to get back on this so thank you for the information Razzled.

-S- posted:

You could throw your OfficeId=<officeguid> into the/another join since it looks like you want them to be the same:

code:
Inner Join OfficeProcedures
    On OfficeProcedures.PMPatientID = OfficePatients.PMID
    And OfficeProcedures.OfficeId = OfficePatients.OfficeId
and just do one lookup in the WHERE (OP.OfficeId=<GUID>)

Also, make drat sure you do NOT have a clustered index on that GUID column (non-clustered index is fine). Short of that, as everyone else said, make sure your columns are indexed properly.

Thank you for the suggestion on the inner join. And well, uh, what you said to make drat sure of, unfortunately at the moment that is exactly how I've set up the db! By default, when you create a table in SQL Azure it will create a clustered index on the primary key column, which in the majority of my tables is a guid. So I messed that up from the get-go.

So at the moment I think I'm dealing with 2 issues:

1. I need to change the clustered index from the guid primary key to another column in the db. I'm reading up on how to do this, but would appreciate any insight.
2. I need to add some non-clustered indexes for overall performance.

I think I need to tackle the first problem before worrying about the second. It seems like that could be the bigger issue, and I may hopefully see some performance gains from fixing it. One suggestion I saw was to use the CreatedDate column as the clustered index on those tables. I'm going to start there; my only concern is that not all of the tables have a CreatedDate column.

Thanks for the help guys!


big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

Essential posted:

Alright I'm finally able to get back on this so thank you for the information Razzled.


Thank you for the suggestion on the inner join. And well, uh, what you said to make drat sure of, unfortunately at the moment that is exactly how I've set up the db! By default, when you create a table in SQL Azure it will create a clustered index on the primary key column, which in the majority of my tables is a guid. So I messed that up from the get-go.

So at the moment I think I'm dealing with 2 issues:

1. I need to change the clustered index from the guid primary key to another column in the db. I'm reading up on how to do this, but would appreciate any insight.
2. I need to add some non-clustered indexes for overall performance.

I think I need to tackle the first problem before worrying about the second. It seems like that could be the bigger issue, and I may hopefully see some performance gains from fixing it. One suggestion I saw was to use the CreatedDate column as the clustered index on those tables. I'm going to start there; my only concern is that not all of the tables have a CreatedDate column.

Thanks for the help guys!


What I usually do is something like:

Table1.Table1Id IDENTITY INT Clustered Index
Table1.Table1GUID GUID Non-Clustered Index, Primary Key
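In actual DDL that's something like the following (names are illustrative; the GUID type in T-SQL is UNIQUEIDENTIFIER):

code:
CREATE TABLE Table1 (
    Table1Id   INT IDENTITY(1,1) NOT NULL,
    Table1GUID UNIQUEIDENTIFIER  NOT NULL
        CONSTRAINT PK_Table1 PRIMARY KEY NONCLUSTERED
);

-- the identity column, not the GUID, gets the clustered index
CREATE UNIQUE CLUSTERED INDEX IX_Table1_Table1Id ON Table1 (Table1Id);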

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


-S- posted:

What I usually do is something like:

Table1.Table1Id IDENTITY INT Clustered Index
Table1.Table1GUID GUID Non-Clustered Index, Primary Key

Ints all the way, motherfuckers! Out of curiosity, why did you choose to do PK GUIDs?

Essential
Aug 14, 2003

-S- posted:

What I usually do is something like:

Table1.Table1Id IDENTITY INT Clustered Index
Table1.Table1GUID GUID Non-Clustered Index, Primary Key

Got it, thanks! Now, is there a way to add that to my table and have it populate? I can always do it thru code, but if I can do it thru sql that would be awesome.

Also, is the IDENTITY column something that you just let auto-increment? I mean, is it something that can be mostly ignored after adding it, as it just does its job as the clustered index?

Nth Doctor posted:

Ints all the way, motherfuckers! Out of curiosity, why did you choose to do PK GUIDs?

Each office is assigned a guid, and I had always been told to create PKs as guids, except for small lookup tables. I'm realizing that the guys I learned from never worked on DBs of any real size, though they talked like they had. My db isn't particularly large right now, only a few tables have over 1 million rows, but it's going to grow exponentially, so I'm pretty worried now about how I designed this.

Essential fucked around with this message at 19:04 on Jun 22, 2014

big trivia FAIL
May 9, 2003


Nth Doctor posted:

Ints all the way, motherfuckers! Out of curiosity, why did you choose to do PK GUIDs?

Because I have data that exists in multiple systems/servers/dbs and across discrete client systems. I use GUIDs and a few natural keys on other items.

Order 1 in system 1 may be Order 35644 in system 2, so the GUID is how I know it's the same order. That's a made up example, but you get the point.
There are a few other reasons based on organizational requirements that make it the right call, but those are internal company reasons. Also it makes merging to the data warehouse from all of the discrete systems pretty dadgum easy.

Essential posted:

Got it, thanks! Now, is there a way to add that to my table and have it populate? I can always do it thru code, but if I can do it thru sql that would be awesome.

Also, is the IDENTITY column something that you just let auto-increment? I mean, is it something that can be mostly ignored after adding it, as it just does its job as the clustered index?

It's really set and forget. It'll just keep on incrementing. When you create it, specify 1,1: IDENTITY(1,1). That means "start at 1, increment by 1". That should explain why you don't use GUIDs with CIs - CIs sort, and how the hell do you quickly sort GUIDs? So, use an identity INT column as the CI on the table.

edit: I missed your first question; you can ALTER TABLE but it'll add the identity column to the end. Ideally you can destroy and rebuild the table from scratch, but I don't know how feasible that will be - usually your foreign key constraints will make this a bitch and a half. From the way you describe how the DB is set up, I'd be surprised if there are actual FK constraints anywhere, though.

If you CAN destroy and rebuild the table, rename the existing table, build your new table, move the data into the new table, drop the old table.
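Sketched out, the rename-and-rebuild dance is something like this (table and column names are placeholders; in real life you'd also re-create your other indexes and constraints):

code:
-- 1. rename the existing table out of the way
EXEC sp_rename 'Table1', 'Table1_old';

-- 2. build the new table, this time with the identity column
CREATE TABLE Table1 (
    Table1Id   INT IDENTITY(1,1) NOT NULL,
    Table1GUID UNIQUEIDENTIFIER  NOT NULL
);

-- 3. move the data across (the IDENTITY column fills itself in)
INSERT INTO Table1 (Table1GUID)
SELECT Table1GUID FROM Table1_old;

-- 4. drop the old table once you've verified the copy
DROP TABLE Table1_old;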

big trivia FAIL fucked around with this message at 20:31 on Jun 22, 2014

Malcolm XML
Aug 8, 2009

I always knew it would end like this.



But guids are basically random 128bit ints? Clustered indexing on a guid is gonna make inserts bad and is not going to give you any perf benefits but it isn't going to magically make your other indices useless.

For real try running whatever Sql Server's equivalent of explain analyze is first to profile your bottlenecks. Without that data it's basically impossible to tell how the planner has mangled your query.

e: pretty sure SSMS works on sql azure
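For reference, the closest T-SQL gets to explain analyze is the STATISTICS options (the query here is just a placeholder):

code:
-- per-query runtime stats (I/O and CPU), printed alongside the results
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*) FROM OfficeProcedures;

-- or ask for the actual execution plan as XML
SET STATISTICS XML ON;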

Malcolm XML fucked around with this message at 20:35 on Jun 22, 2014

big trivia FAIL
May 9, 2003


Malcolm XML posted:

But guids are basically random 128bit ints? Clustered indexing on a guid is gonna make inserts bad and is not going to give you any perf benefits but it isn't going to magically make your other indices useless.

For real try running whatever Sql Server's equivalent of explain analyze is first to profile your bottlenecks. Without that data it's basically impossible to tell how the planner has mangled your query.

e: pretty sure SSMS works on sql azure

what? I specifically told him to take the clustered indexes off of his GUIDs for that very reason. Make an identity int your CI, GUID your PK with a non-clustered idx, and add other non clustered idxes where necessary.

Essential
Aug 14, 2003

-S- posted:

It's really set and forget. It'll just keep on incrementing. When you create it, specify 1,1: IDENTITY(1,1). That means "start at 1, increment by 1". That should explain why you don't use GUIDs with CIs - CIs sort, and how the hell do you quickly sort GUIDs? So, use an identity INT column as the CI on the table.

edit: I missed your first question; you can ALTER TABLE but it'll add the identity column to the end. Ideally you can destroy and rebuild the table from scratch, but I don't know how feasible that will be - usually your foreign key constraints will make this a bitch and a half. From the way you describe how the DB is set up, I'd be surprised if there are actual FK constraints anywhere, though.

If you CAN destroy and rebuild the table, rename the existing table, build your new table, move the data into the new table, drop the old table.

Thanks so much for the help and advice. I'm going to spend this week rebuilding the tables with your suggested method. I don't see any reason I can't destroy/rebuild the tables at night, as nobody accesses them at night.

Malcolm XML
Aug 8, 2009


-S- posted:

what? I specifically told him to take the clustered indexes off of his GUIDs for that very reason. Make an identity int your CI, GUID your PK with a non-clustered idx, and add other non clustered idxes where necessary.

Blame awful.ipa

You already have a pk in the guid; cluster off of something meaningful. Don't add a useless field!

But clustered indices don't reduce read perf and it's a minor insert issue at worst

big trivia FAIL
May 9, 2003


Malcolm XML posted:

Blame awful.ipa

You already have a pk in the guid; cluster off of something meaningful. Don't add a useless field!

But clustered indices don't reduce read perf and it's a minor insert issue at worst

This is the most ridiculous thing ever. Run a performance scan against GUIDs vs INTs as the PK clustered index on read and get back to me.

Also, I'm saying KEEP THE GUID THE PK. Just don't make it the Clustered Index on the table.

edit an hour and a half later: lol @ "minor insert issue"
and also edit again: the INT CIdx isn't a useless field it's the goddamn CI on the table lol. you could make it the timestamp field if you're a baby but to call it a "useless" field is literally spouting databases 101 material that you've heard and never practiced.

big trivia FAIL fucked around with this message at 01:40 on Jun 28, 2014

Malcolm XML
Aug 8, 2009


-S- posted:

This is the most ridiculous thing ever. Run a performance scan against GUIDs vs INTs as the PK clustered index on read and get back to me.

Also, I'm saying KEEP THE GUID THE PK. Just don't make it the Clustered Index on the table.

edit an hour and a half later: lol @ "minor insert issue"
and also edit again: the INT CIdx isn't a useless field it's the goddamn CI on the table lol. you could make it the timestamp field if you're a baby but to call it a "useless" field is literally spouting databases 101 material that you've heard and never practiced.

Chill dude.

Use a CI on a field that you will benefit from having the data, well, clustered

You can have nonclustered pks as well, so I really don't get why you want to add yet another column that is semantically meaningless; wasting the clustering index on something that won't benefit from the acceleration is dumb as heck

I'm assuming an extremely asymmetric distribution where you're rarely inserting but that depends on your processing needs (profile !!!)

Unless the guy is using a bunch of range lookups on a particular field then I wouldn't even bother with selecting a CI unless you need to since sql server is totally cool with that

How big is the data and what is the nature of the queries that come in?

big trivia FAIL
May 9, 2003


Malcolm XML posted:

Chill dude.

Use a CI on a field that you will benefit from having the data, well, clustered

You can have nonclustered pks as well, so I really don't get why you want to add yet another column that is semantically meaningless; wasting the clustering index on something that won't benefit from the acceleration is dumb as heck

I'm assuming an extremely asymmetric distribution where you're rarely inserting but that depends on your processing needs (profile !!!)

Unless the guy is using a bunch of range lookups on a particular field then I wouldn't even bother with selecting a CI unless you need to since sql server is totally cool with that

How big is the data and what is the nature of the queries that come in?

That's part of the issue. I don't know his data or what columns in his tables will benefit from a CIdx, so I gave him the quickest fix for his issue. He had queries taking ages to run, and removing the CIdx from the GUID and adding one to an INT ID column was the quickest solution. If I worked with him I might be able to give him a better solution, but on the Something Awful Forums, trying to give a dude a little lift without any legitimate insight into his data structure, I gave him the best option available with the information on hand.

I think I'm still mad about that dude saying "bbbbbut cursors are just loops!" as if that makes them any goddamn better to run in SQL.

big trivia FAIL
May 9, 2003


But also CIdx on GUID is bad bad bad and is a whole lot worse than "a minor insert issue" duder

Xae
Jan 19, 2005

-S- posted:

But also CIdx on GUID is bad bad bad and is a whole lot worse than "a minor insert issue" duder

Depends on the volatility of the table and the fill rate, as well as how often you are willing to reorganize/rebuild the thing.

Malcolm XML
Aug 8, 2009


-S- posted:

But also CIdx on GUID is bad bad bad and is a whole lot worse than "a minor insert issue" duder

No not really it depends on the usage pattern

But a clustering index always slows down other indices, so having one on a field that is essentially never range-queried is a great way to slow down your queries for no reason!


There is no silver bullet to sql perf other than profile your usage and try to get index only scans

Essential
Aug 14, 2003
Now that the forums are back up I can jump in and let you guys know the situation (also how dumb I am). I'll apologize right now for the confusion: the situation I described wasn't accurate. The only table I have with a guid PK is the Offices table. The other tables (OfficeProcedures etc.) have an Int PK and a FK to the Office PK. So -S-, the situation you described with having an auto-incrementing int as the PK is how my tables are set up. And since this is SQL Azure, that int column WAS the clustered index by default.

The thing I didn't have though was any non-clustered indexes and that's where you guys helped me tremendously. Once I added a non-clustered index on the officeprocedures.officeid guid that particular query went from 15 minutes to ~1 second. There was an immediate and incredible speed increase to that query.

I then went into each table that has a FK to OfficeId and added a non-clustered index to each one, and saw some serious speed increases on most of my queries; however, some actually slowed down. I was then able to spend a few days reading up on indexes, and I added some more non-clustered indexes to a few tables, and now everything is running much faster than ever before. The sql queries are now the fastest part of my processing.
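For anyone following along, the index that took that query from 15 minutes to ~1 second boils down to a single statement along these lines (the index name is made up; the column is the OfficeId guid FK):

code:
CREATE NONCLUSTERED INDEX IX_OfficeProcedures_OfficeId
    ON OfficeProcedures (OfficeId);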

-S- posted:

That's part of the issue. I don't know his data or what columns in his tables will benefit from a CIdx, so I gave him the quickest fix for his issue. He had queries taking ages to run, and removing the CIdx from the GUID and adding one to an INT ID column was the quickest solution. If I worked with him I might be able to give him a better solution, but on the Something Awful Forums, trying to give a dude a little lift without any legitimate insight into his data structure, I gave him the best option available with the information on hand.

I wanted to add that this was exactly the kind of help I was looking for, and I expected no more than this, so big thank you -S-! I wasn't looking for a dissertation or a 50-page whitepaper on the proper way to set up sql tables & indexes. I'm aware my db isn't designed in the best way possible and was hoping for (and received) a quick way to make an immediate impact for my customers. This quick fix is giving me the breathing room to look into the issue in more depth and detail and decide what I can do on my own, and when it's time to hire a real dba to come in and fix what I can't.

Essential fucked around with this message at 18:49 on Jun 28, 2014

Malcolm XML
Aug 8, 2009


Essential posted:

Now that the forums are back up I can jump in and let you guys know the situation (also how dumb I am). I'll apologize right now for the confusion: the situation I described wasn't accurate. The only table I have with a guid PK is the Offices table. The other tables (OfficeProcedures etc.) have an Int PK and a FK to the Office PK. So -S-, the situation you described with having an auto-incrementing int as the PK is how my tables are set up. And since this is SQL Azure, that int column WAS the clustered index by default.

The thing I didn't have though was any non-clustered indexes and that's where you guys helped me tremendously. Once I added a non-clustered index on the officeprocedures.officeid guid that particular query went from 15 minutes to ~1 second. There was an immediate and incredible speed increase to that query.

I then went into each table that has a FK to OfficeId and added a non-clustered index to each one, and saw some serious speed increases on most of my queries; however, some actually slowed down. I was then able to spend a few days reading up on indexes, and I added some more non-clustered indexes to a few tables, and now everything is running much faster than ever before. The sql queries are now the fastest part of my processing.


I wanted to add that this was exactly the kind of help I was looking for, and I expected no more than this, so big thank you -S-! I wasn't looking for a dissertation or a 50-page whitepaper on the proper way to set up sql tables & indexes. I'm aware my db isn't designed in the best way possible and was hoping for (and received) a quick way to make an immediate impact for my customers. This quick fix is giving me the breathing room to look into the issue in more depth and detail and decide what I can do on my own, and when it's time to hire a real dba to come in and fix what I can't.

Given those results there is nothing wrong with doing the int pk CI but it's worth noting that by doing that you have reinvented heap tables

E: turns out sql azure doesn't support nonclustered (heap) tables, so this is more acceptable, I guess. Still, it's worth picking a better index to cluster around.

Malcolm XML fucked around with this message at 22:38 on Jun 28, 2014

Essential
Aug 14, 2003

Malcolm XML posted:

Given those results there is nothing wrong with doing the int pk CI but it's worth noting that by doing that you have reinvented heap tables

E: turns out sql azure doesn't support non clustered tables so this is more acceptable I guess. Still it's worth picking a better index to cluster around.

I'll look into the heap tables thing, but yes, as I've learned all tables in sql azure must have a clustered index and by default, if you do not specify one during table creation, sql azure will assign the pk as the clustered index.

FieryBalrog
Apr 7, 2010
Grimey Drawer

Essential posted:

I'll look into the heap tables thing, but yes, as I've learned all tables in sql azure must have a clustered index and by default, if you do not specify one during table creation, sql azure will assign the pk as the clustered index.

I think what Malcolm XML means is that this kind of Clustered Index is basically useless since (from what I've gathered) it's on a surrogate key column that is essentially meaningless and is not a sensible candidate for range-scans. So, you have a table that is basically just a heap table with extra overhead. A heap table does not force a relationship between the organization of the table on disk and the primary key index. With a clustered index, the idea is that if you select a contiguous slice of the clustered index, the table access should involve a roughly contiguous slice of blocks from the disk; it does this by forcing the actual table data to be sorted on insertion or update. (a heap table can still be well-clustered with respect to one key, it's just not forced to be).


E.g.
code:
create unique index pk_example on t_example (some_date, account_number);
If you have an access path like this:
code:
select * from t_example
where some_date = trunc(sysdate)
This is a range scan that might benefit from the table data being forced to mirror the ordering of (some_date, account_number). But an index range scan isn't very useful vis-a-vis an arbitrary surrogate key. (Now, if your insertion pattern already keeps the table well-clustered with respect to whatever range of values you query, the CI on the surrogate key won't do any real harm, but it won't be doing any real benefit either).

As a developer who mostly works with Oracle RDBMS, I'm not sure why Microsoft is so hung up on using CIs for everything (in Oracle the default is a heap table; the equivalent of a Clustered Index is an Index Organized Table). A CI, it seems to me, is good for one thing: an index with a leading column or column set that will be range-scanned a lot. That's why for Clustered Indexes people like to use a candidate key that leads with something like a date column. Also, Malcolm brings up a good point, which is that a CI is not free: it adds extra overhead on insert, especially if you start adding other indexes (having multiple indexes is perfectly valid and fine, but they add more overhead than usual if there's a clustered index, because rows don't have fixed physical locations).
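For comparison, the Oracle counterpart looks something like this (columns borrowed from the earlier example, plus a made-up price column to give the table a payload):

code:
-- an index-organized table: rows stored in primary-key order, like a clustered index
create table t_example (
    some_date       date    not null,
    account_number  number  not null,
    price           number,
    constraint pk_example primary key (some_date, account_number)
) organization index;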

But I guess you have no choice on having one, which seems weird? If you're forced to have one then yes, I guess a useless CI is better than an actively harmful CI.

FieryBalrog fucked around with this message at 06:39 on Jun 29, 2014

Malcolm XML
Aug 8, 2009


Exactly this

Apparently sql azure requires a clustered index to do replication but ye olde sql server 2014 does not and is happy to use heap tables if you state a nonclustered pk

Essential
Aug 14, 2003

Thanks for the detailed post. As for my tables & queries it's like this:

First, all data is inserted in bulk, however that's going to change soon. Right now data is uploaded at the end of the month, but soon we'll be switching to daily uploads. Almost everything is queried by date range, for instance all procedures for a given month for a particular office:

code:
Select * From OfficeProcedures
Where OfficeID = '<officeguid>'
  And (ProcDate >= '<date>' And ProcDate <= '<date>')

The OfficeProcedures table has many offices, let's say 10 different offices, with each office having upwards of 300k procedures. The data breaks down so that on any particular day all 10 of the offices may have hundreds of procedures. My first instinct is that ProcDate may be a good candidate for the CI; however, all 10 offices probably have multiple procedures on the same day. I have no idea if that matters or if I'm even thinking about it the right way. Some other candidates could be: the PK from the originating db (varchar), or CreatedDate from the originating db.

If I made the ProcDate the CI and had a non-CI on the OfficeID, would sql group procedures first by date, then by office for indexing? Should I be trying to find something that will allow office records to be indexed together? I feel like something is going right over my head and I'm confused on things I shouldn't even be thinking about.

Essential fucked around with this message at 20:39 on Jun 29, 2014

Malcolm XML
Aug 8, 2009

What's the volume of data going in each day, roughly?

What's the size of your db in total?

For that office procedure table it looks like around 50k rows a day which is not all that much and at that volume you can afford to go hog wild with your indices esp if you can schedule the batch insert at a time when people aren't doing anything else on the db

Sql server is totally cool with non unique clustered indices but it adds some tiny overhead

But since your data is already clustered you will have to recreate the table so don't bother

What you should bother to do is install sql server management studio and verify your indices are being used by your queries, and if they aren't, figure out why not
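Something like this against the index-usage DMV will show whether your indices are getting seeks or just sitting there (assuming the DMV is exposed on your Azure tier; it is on regular SQL Server):

code:
-- zero seeks/scans/lookups on an index you expected to be hot = investigate
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID();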

Essential
Aug 14, 2003

Malcolm XML posted:

What's the volume of data going in each day, roughly?

What's the size of your db in total?

For that office procedure table it looks like around 50k rows a day which is not all that much and at that volume you can afford to go hog wild with your indices esp if you can schedule the batch insert at a time when people aren't doing anything else on the db

Sql server is totally cool with non unique clustered indices but it adds some tiny overhead

But since your data is already clustered you will have to recreate the table so don't bother

What you should bother to do is install sql server management studio and verify your indices are being used by your queries, and if they aren't, figure out why not

Total db size is 1.2 gigs, and 50k rows daily is pretty drat close. If things continue the way they are, that number is doubling every month, so it could be around 600k-700k rows daily in 1 year's time. I should add that about half of those rows are deleted and then re-inserted, so the total db size isn't growing by 50k rows daily.

In case anyone is wondering why they are deleted and then re-inserted: the originating db doesn't have createddate/modifieddate on almost all of its tables, so there is no way to know if a record was added or changed. Deleting and re-inserting was much easier at the time than trying to use the sql row hash or other methods to see if data has changed (I am/was using entity framework for db access and it's horribly slow to do row-by-row comparisons, so I decided to just delete/re-insert on certain tables).

Wheany
Mar 17, 2006

Spinyahahahahahahahahahahahaha!

Doctor Rope
We have an old lovely mysql database that does not have relations. I have been adding relations to our development database and pretty often I get foreign key constraint errors because there is for example, a vehicle that refers to a user that does not exist. So vehicle.userid = 42 and there is no user with id 42.

There must be some simple query I can run to either set the userid to null or outright delete any vehicle rows that refer to nonexistent users.

delete from vehicle where id in (some subquery that lists ids of vehicles that refer to nonexistent users)

FieryBalrog
Apr 7, 2010

Wheany posted:

We have an old lovely mysql database that does not have relations. I have been adding relations to our development database and pretty often I get foreign key constraint errors because there is for example, a vehicle that refers to a user that does not exist. So vehicle.userid = 42 and there is no user with id 42.

There must be some simple query I can run to either set the userid to null or outright delete any vehicle rows that refer to nonexistent users.

delete from vehicle where id in (some subquery that lists ids of vehicles that refer to nonexistent users)

code:
-- mysql's single-table delete won't take an alias, so use the multi-table form
delete v from vehicle v where not exists (select 1 from users u where u.userid = v.userid)
or

code:
update vehicle v 
set v.userid = null
where not exists (select 1 from users u where u.userid = v.userid)

FieryBalrog fucked around with this message at 14:13 on Jun 30, 2014

FieryBalrog
Apr 7, 2010

Essential posted:

Thanks for the detailed post. As for my tables & queries it's like this:

First, all data is inserted in bulk, however that's going to change soon. Right now data is uploaded at the end of the month, but soon we'll be switching to daily uploads. Almost everything is queried by date range, for instance all procedures for a given month for a particular office:

code:
Select * From OfficeProcedures
Where OfficeID = '<officeguid>'
  And (ProcDate >= '<date>' And ProcDate <= '<date>')

The OfficeProcedures table has many offices, let's say 10 different offices, with each office having upwards of 300k procedures. The data breaks down so that on any particular day all 10 of the offices may have hundreds of procedures. My first instinct is that ProcDate may be a good candidate for the CI; however, all 10 offices probably have multiple procedures on the same day. I have no idea if that matters or if I'm even thinking about it the right way. Some other candidates could be: the PK from the originating db (varchar), or CreatedDate from the originating db.

If I made the ProcDate the CI and had a non-CI on the OfficeID, would sql group procedures first by date, then by office for indexing? Should I be trying to find something that will allow office records to be indexed together? I feel like something is going right over my head and I'm confused on things I shouldn't even be thinking about.
You know you can have a multi-column index, right? We have lots of historical tables. An index might look like this:
code:
alter table t_pricehist
add constraint pk_pricehist primary key (hist_date, ref_id, currency_id, depo_id)
Your instincts are basically on track, except instead of thinking of "a clustered index on date and a nonclustered index on OfficeID", think of a clustered index on all the columns necessary to define your primary key, by which I mean the natural key of the table, not an artificial surrogate key. It might look like this:
code:
alter table OfficeProcedure
add constraint pk_office_procedure primary key (ProcDate, OfficeID, SomeOtherID)
And yeah, you're right: with an index like this, it would be sorted first by date, then by OfficeID, then by whatever other columns you need for your key. An index is always sorted in column order. And also yeah, since most of your query access paths involve a date range scan across multiple entity IDs (OfficeIDs, in this case), you would likely want the leading column of the clustered index to be the date.

You would also almost certainly want an alternate index with OfficeId as the leading column, to allow efficient access to date-range data for queries that look at a particular OfficeId or just a few OfficeIds. It looks "wasteful", but the following is actually very useful:
code:
create index pricehist_indx_2 on t_pricehist (ref_id, hist_date)
When I got to my current job, our historical tables had a problem: the only index they had was like the first one I showed above, with date as the leading column of the table's natural key. The problem is that if you use a lot of date-range scans plus a single Entity ID value (or just a few Entity ID values), and the range-scan column is the leading column, it's quite inefficient, since the DB has to do a full scan of that index's date range. A column in an index that comes after a range-scanned column can't be used to narrow the index slice.

TLDR: indexes exist to serve query access paths. So if you have multiple common query access paths, you should probably have multiple indexes to serve them, and it can even be just a re-arrangement of columns of an existing index.
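To make the access-path idea concrete, here's a minimal runnable sketch using Python's sqlite3 (table, column names, and data are made up for illustration; SQL Server syntax differs a bit, but the column-order principle is the same): one date-leading composite key for range scans across all offices, and one OfficeID-leading alternate index for per-office queries.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE OfficeProcedures (
        ProcDate TEXT,
        OfficeID TEXT,
        ProcID   TEXT,
        PRIMARY KEY (ProcDate, OfficeID, ProcID)  -- date-leading composite key
    );
    -- alternate index with OfficeID leading, for single-office access paths
    CREATE INDEX ix_office_date ON OfficeProcedures (OfficeID, ProcDate);
""")
con.executemany(
    "INSERT INTO OfficeProcedures VALUES (?, ?, ?)",
    [("2014-06-01", "A", "p1"),
     ("2014-06-01", "B", "p2"),
     ("2014-06-02", "A", "p3"),
     ("2014-07-01", "A", "p4")],
)

# date-range scan across all offices: the date-leading key serves this well
june = con.execute(
    "SELECT ProcID FROM OfficeProcedures "
    "WHERE ProcDate >= '2014-06-01' AND ProcDate <= '2014-06-30' "
    "ORDER BY ProcDate, OfficeID"
).fetchall()

# one office plus a date range: the OfficeID-leading index serves this well
office_a = con.execute(
    "SELECT ProcID FROM OfficeProcedures "
    "WHERE OfficeID = 'A' AND ProcDate >= '2014-06-01' AND ProcDate <= '2014-06-30' "
    "ORDER BY ProcDate"
).fetchall()

print(june)      # all offices' June rows, in (date, office) order
print(office_a)  # only office A's June rows
```

Both queries return correct rows either way, of course; the point of keeping both indexes is that each query shape gets an index whose leading column matches its equality/range predicate.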

FieryBalrog fucked around with this message at 14:09 on Jun 30, 2014

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

Essential posted:

Total db size is 1.2gigs and 50k rows daily is pretty drat close. If things continue the way they are, that number is doubling every month, so it could be around 600k-700k rows daily in a year's time. I should add that about half of those rows are deleted and then re-inserted, so the total db size isn't growing by 50k rows daily.

In case anyone is wondering why they are deleted and then re-inserted, the reason is that the originating db doesn't have createddate/modifieddate for almost all of its tables, so there is no way to know if a record was added or changed. Deleting and re-inserting was much easier at the time than trying to use the sql row hash or other methods to see if data has changed (I am/was using Entity Framework for db access and it's horribly slow to do row-by-row data comparisons, so I decided to just delete/re-insert on certain tables).

Oh lol 1.2GB is tiny. If SQL Azure weren't crippled in many ways I would look into clustered columnstores, since your use case is more OLAP than OLTP, but your EA contact (you have one and signed a BAA for HIPAA compliance and are implementing all of that right? Right????) should really know more.

Also, Entity Framework is hot garbage for anything but RAD; try using Dapper or something less terrible.

Malcolm XML fucked around with this message at 14:12 on Jun 30, 2014

Wheany
Mar 17, 2006

Spinyahahahahahahahahahahahaha!

Doctor Rope

FieryBalrog posted:

code:
delete from vehicle v where not exists (select 1 from users u where u.userid = v.userid)
or

code:
update vehicle v 
set v.userid = null
where not exists (select 1 from users u where u.userid = v.userid)

Thanks, exactly what I needed.

xenilk
Apr 17, 2004

ERRYDAY I BE SPLIT-TONING! Honestly, its the only skill I got other than shooting the back of women and calling it "Editorial".
I have an odd time-related question.

I have a table named "Addresses" with a field named "Fees".

This field basically represents the condo fees of a given address. I figure that odds are it could change over time.

I'm looking for a smart way to keep track of how the fees changed over time. Would the best way to do it be via some sort of audit table where I would keep track of the old value, new value, and timestamp?

If this is the way to go, is there some sort of query I could run to get the value of the fee over the course of a calendar year?

FieryBalrog
Apr 7, 2010
Grimey Drawer

xenilk posted:

I have an odd time-related question.

I have a table named "Addresses" with a field named "Fees".

This field basically represents the condo fees of a given address. I figure that odds are it could change over time.

I'm looking for a smart way to keep track of how the fees changed over time. Would the best way to do it be via some sort of audit table where I would keep track of the old value, new value, and timestamp?

If this is the way to go, is there some sort of query I could run to get the value of the fee over the course of a calendar year?

You presumably want an address table (no time dimension, just the address dimension) that doesn't have "fees" as a column, and then a table that tracks addresses over time by intersecting that address table with time-dependent information like fees. This table can just point to the address table with a foreign key and attach information, with a primary key defined by (address_foreign_key, valid_start_date) or some such. That way you can keep track of other time-dependent information too.

An audit table is fine too, but it's slightly less normalized (because you have fee information in two places that needs to be kept in sync and might cause referential issues, or you have to muck around with triggers). It will also probably get messy if you need to track multiple changing things.
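A minimal sketch of that shape using Python's sqlite3 (all table and column names here are made up): addresses hold the time-independent stuff, and a second table keyed by (address_id, valid_from) holds the fee history. The calendar-year query picks up every fee row starting in the year, plus the row already in force on January 1.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE addresses (
        address_id INTEGER PRIMARY KEY,
        street     TEXT
    );
    -- time-dependent facts live here, keyed by (address, validity start)
    CREATE TABLE address_fees (
        address_id INTEGER REFERENCES addresses(address_id),
        valid_from TEXT,
        fee        REAL,
        PRIMARY KEY (address_id, valid_from)
    );
""")
con.execute("INSERT INTO addresses VALUES (1, '123 Condo Way')")
con.executemany(
    "INSERT INTO address_fees VALUES (?, ?, ?)",
    [(1, "2013-11-01", 200.0),
     (1, "2014-03-01", 225.0),
     (1, "2014-09-01", 240.0)],
)

# every fee in effect at some point during calendar year 2014:
# rows starting within 2014, plus the row already in force on Jan 1
history_2014 = con.execute("""
    SELECT valid_from, fee FROM address_fees
    WHERE address_id = 1
      AND valid_from <= '2014-12-31'
      AND valid_from >= COALESCE(
            (SELECT MAX(valid_from) FROM address_fees
             WHERE address_id = 1 AND valid_from <= '2014-01-01'),
            '2014-01-01')
    ORDER BY valid_from
""").fetchall()
print(history_2014)
```

The COALESCE subquery is the one fiddly bit: it finds the most recent fee row on or before January 1 so the year's opening fee isn't dropped.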

Essential
Aug 14, 2003

FieryBalrog posted:

You know you can have a multi-column index, right? We have lots of historical tables. An index might look like this:

Nope, I had no idea on that. Thank you very much for the detailed help. Things are starting to make a lot more sense now.

Malcolm XML posted:

Oh lol 1.2GB is tiny. If sql azure weren't crippled in many ways I would look into clustered columnstores since your usecase is more OLAP than OLTP but your EA contact (you have one and signed a BAA for HIPAA compliance and are implementing all of that right? Right????) should really know more.

Regarding HIPAA compliance, we don't have any identifying information, so we don't have to worry about it (although any gray areas are encrypted just to be safe). You are exactly right that our use is OLAP, being entirely BI focused.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

Essential posted:

Nope, I had no idea on that. Thank you very much for the detailed help. Things are starting to make a lot more sense now.


Regarding HIPAA compliance, we don't have any identifying information, so we don't have to worry about it (although any gray areas are encrypted just to be safe). You are exactly right that our use is OLAP, being entirely BI focused.

OK I'm gonna assume your lawyers have given you clearance.

If you had an on-premises install I'd suggest using a clustered columnstore index, but I guess the best SQL Azure can do is multi-column clustered indices. Rebuild it after the ETL window closes.

Essential
Aug 14, 2003

Malcolm XML posted:

If you had an on-premises install I'd suggest using a clustered columnstore index, but I guess the best SQL Azure can do is multi-column clustered indices. Rebuild it after the ETL window closes.

Had to look up ETL but I got it, thanks!

And thank you to everyone for the help, it really is appreciated.

BabyFur Denny
Mar 18, 2003
Hi, what tools do you use as MySQL clients?
Most of our work day is spent writing queries, either live or in stored procedures. We're currently using SQL Explorer, an Eclipse plugin, which works nicely with our shared SVN repository of SQL scripts, which we imported as a project in Eclipse using SVNKit.
The issue we have is that Eclipse/SQL Explorer crashes a lot whenever we click the wrong button at the wrong time, and it gets really frustrating because quite a few queries/bits of code are usually just gone.

I've been looking at other tools, but I haven't found one yet that offers such a nice integration of browsing/committing/updating a svn repository.
At this point we would also easily consider a commercial solution.

Any recommendations here?

Razzled
Feb 3, 2011

MY HARLEY IS COOL
e: oops i misread that post!

Sulla Faex
May 14, 2010

No man ever did me so much good, or enemy so much harm, but I repaid him with ENDLESS SHITPOSTING
I have a query that takes about 20 seconds to run. It grabs all the data just fine, but when it tries the ORDER BY it craps itself. It's got about 650k rows that need to be sorted according to 4 columns, which can come in one of 4 orders.

E.g. ORDER BY a, b, c, d
or
ORDER BY b, a, c, d
and so on.

Here's one instance of the query, which takes 20 seconds to execute and generates 650k rows before the LIMIT.

code:
SELECT c.id, 
       a.id             AS userid, 
       a.firstname, 
       a.lastname, 
       d.shortname, 
       d.startdate, 
       d.id             AS courseid, 
       cc.timecompleted AS status, 
       cc.timestarted 
FROM   mdl_user a, 
       mdl_enrol b, 
       mdl_user_enrolments c, 
       mdl_course d 
       LEFT JOIN mdl_course_completions cc 
              ON cc.course = d.id 
WHERE  a.id = cc.userid 
       AND a.id = c.userid 
       AND b.id = c.enrolid 
       AND b.courseid = d.id 
ORDER  BY d.shortname, 
          d.startdate, 
          a.lastname, 
          a.firstname 
LIMIT  0, 100; 
I've mucked around with it, and it's the ORDER BY that's adding the 18-second overhead. I've tried putting indexes on every ORDER BY field that didn't already have one (only 'startdate' was missing one), and also multi-column indexes for the two combos for each table, e.g.

code:
CREATE INDEX newindex1 ON mdl_course ( shortname, startdate );
and the reverse (startdate, shortname), but it didn't affect things at all.

When I ORDER BY only the first table's columns (i.e. d.shortname, d.startdate) it's near-instant. It's the second table that's hosed.

EXPLAIN EXTENDED for the base query:

code:
+----+-------------+-------+--------+-------------------------------------------------------------------+---------------------------+---------+-----------------------------------------------------------+------+----------+---------------------------------+
| id | select_type | table | type   | possible_keys                                                     | key                       | key_len | ref                                                       | rows | filtered | Extra                           |
+----+-------------+-------+--------+-------------------------------------------------------------------+---------------------------+---------+-----------------------------------------------------------+------+----------+---------------------------------+
|  1 | SIMPLE      | d     | ALL    | PRIMARY                                                           | NULL                      | NULL    | NULL                                                      |  438 |   100.00 | Using temporary; Using filesort |
|  1 | SIMPLE      | b     | ref    | PRIMARY,mdl_enro_cou_ix                                           | mdl_enro_cou_ix           | 8       | MYDATABASE.d.id                                   |    2 |   100.00 | Using index                     |
|  1 | SIMPLE      | c     | ref    | mdl_userenro_enruse_uix,mdl_userenro_enr_ix,mdl_userenro_use_ix   | mdl_userenro_enruse_uix   | 8       | MYDATABASE.b.id                                   |  225 |   100.00 | Using index                     |
|  1 | SIMPLE      | cc    | ref    | mdl_courcomp_use_ix,mdl_courcomp_cou_ix,mdl_course_completions_i2 | mdl_course_completions_i2 | 16      | MYDATABASE.b.courseid,MYDATABASE.c.userid |    1 |   100.00 | Using where                     |
|  1 | SIMPLE      | a     | eq_ref | PRIMARY                                                           | PRIMARY                   | 8       | MYDATABASE.cc.userid                              |    1 |   100.00 | Using where                     |
+----+-------------+-------+--------+-------------------------------------------------------------------+---------------------------+---------+-----------------------------------------------------------+------+----------+---------------------------------+
I'm trying to think how I can re-structure the query or speed it up or cache it or something, but as far as I can tell it's just a big rear end query and it has to do the grunt work of sorting 650k rows on 4 columns. Are there any tricks for ORDER BY on MySQL that can help me here? I'm a DB rookie so I'm not sure what to try, but all the pages I've googled on optimisation and multi-column indexes don't tend to deal with 4 sorting columns and so many different select fields.

I'm trying a whole bunch of different things, including pre-sorting the different tables, e.g.

code:
SELECT c.id, 
       a.id             AS userid, 
       a.firstname, 
       a.lastname, 
       d.shortname, 
       d.startdate, 
       d.id             AS courseid, 
       cc.timecompleted AS status, 
       cc.timestarted 
FROM   mdl_enrol b, 
       mdl_user_enrolments c, 
       (SELECT id, 
               firstname, 
               lastname 
        FROM   mdl_user 
        ORDER  BY lastname, 
                  firstname) a, 
       (SELECT id, 
               shortname, 
               startdate 
        FROM   mdl_course 
        ORDER  BY shortname, 
                  startdate) d 
       LEFT JOIN mdl_course_completions cc 
              ON cc.course = d.id 
WHERE  a.id = cc.userid 
       AND a.id = c.userid 
       AND b.id = c.enrolid 
       AND b.courseid = d.id 
ORDER  BY d.shortname, 
          d.startdate, 
          a.lastname, 
          a.firstname 
LIMIT  0, 100; 
And while each individual pre-sorted SELECT is only a second or so, that just makes it so much worse, if it even works at all, which is just super confusing to me. I get a lot of "ERROR 1054 (42S22): Unknown column 'd.id' in 'on clause'" errors. When I shuffle the order of the 'FROM' definitions around (i.e. 'd' always has to go at the end for some reason) this error does/doesn't happen.

That final pre-sorted query takes way longer than a minute to execute, such that I just cancel it every time rather than trying to find out the exact time. Here's the explain extended:

code:
+----+-------------+------------+--------+-------------------------------------------------------------------+---------------------------+---------+------------------------------------+-------+----------+---------------------------------+
| id | select_type | table      | type   | possible_keys                                                     | key                       | key_len | ref                                | rows  | filtered | Extra                           |
+----+-------------+------------+--------+-------------------------------------------------------------------+---------------------------+---------+------------------------------------+-------+----------+---------------------------------+
|  1 | PRIMARY     | <derived3> | ALL    | NULL                                                              | NULL                      | NULL    | NULL                               |   443 |   100.00 | Using temporary; Using filesort |
|  1 | PRIMARY     | b          | ref    | PRIMARY,mdl_enro_cou_ix                                           | mdl_enro_cou_ix           | 8       | d.id                               |     2 |   100.00 | Using index                     |
|  1 | PRIMARY     | <derived2> | ALL    | NULL                                                              | NULL                      | NULL    | NULL                               | 41923 |   100.00 | Using join buffer               |
|  1 | PRIMARY     | cc         | ref    | mdl_courcomp_use_ix,mdl_courcomp_cou_ix,mdl_course_completions_i2 | mdl_course_completions_i2 | 16      | MYDATABASE.b.courseid,a.id |     1 |   100.00 | Using where                     |
|  1 | PRIMARY     | c          | eq_ref | mdl_userenro_enruse_uix,mdl_userenro_enr_ix,mdl_userenro_use_ix   | mdl_userenro_enruse_uix   | 16      | MYDATABASE.b.id,a.id       |     1 |   100.00 | Using index                     |
|  3 | DERIVED     | mdl_course | ALL    | NULL                                                              | NULL                      | NULL    | NULL                               |   480 |   100.00 | Using filesort                  |
|  2 | DERIVED     | mdl_user   | ALL    | NULL                                                              | NULL                      | NULL    | NULL                               | 41916 |   100.00 | Using filesort                  |
+----+-------------+------------+--------+-------------------------------------------------------------------+---------------------------+---------+------------------------------------+-------+----------+---------------------------------+
In short, am I hosed? I created a stored procedure with the hope that I could create one for each of the 4 'base' queries (these are the ones with no filters applied, so the scope hasn't narrowed at all) but after running the proc for the second time the execution time actually increased. I have no idea what the cache thinks it's doing there, but it clearly isn't helping.

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Sulla-Marius 88 posted:

I have a query that's running about 20 seconds. It grabs all the data just fine but when it tries the ORDER BY it craps itself. It's got about 650k rows that need to be sorted according to 4 columns, which can come in one of 4 ways...

The EXPLAIN explains it all about your other table. Those indexes you're making aren't doing the trick for this thing -- SQL chose to use the PK for mdl_user rather than a specific index that includes those columns, and your other index (newindex1) isn't being used either. 99% bet that it's not being used because the column you're using as a join predicate-cum-WHERE clause isn't included in the index.

Replace CREATE INDEX newindex1 ON mdl_course ( shortname, startdate ); with CREATE INDEX IX_IDStartShortName ON mdl_course ( id, startdate, shortname ); and you should see some improvement. Similarly, CREATE INDEX IX_IDName ON mdl_user ( id, firstname, lastname ); should help. I'd be surprised if it didn't.

the
Jul 18, 2004

by Cowcaster
How do I format something like this:

Select Name FROM List WHERE Gender = 'Man' AND Color = 'Blue' OR Color = 'Red' OR Color = 'Yellow'


I want it to choose the first option AND one of the others, how do I make sure that happens?

Also, is there any way I can write the second condition where I don't have to write each COLOR every time?

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

the posted:

How do I format something like this:

Select Name FROM List WHERE Gender = 'Man' AND Color = 'Blue' OR Color = 'Red' OR Color = 'Yellow'


I want it to choose the first option AND one of the others, how do I make sure that happens?

Also, is there any way I can write the second condition where I don't have to write each COLOR every time?

I'm an SQL novice but this should work:

Select Name FROM List WHERE Gender = 'Man' AND Color IN ('Blue', 'Red', 'Yellow')

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

the posted:

How do I format something like this:

Select Name FROM List WHERE Gender = 'Man' AND Color = 'Blue' OR Color = 'Red' OR Color = 'Yellow'


I want it to choose the first option AND one of the others, how do I make sure that happens?

Also, is there any way I can write the second condition where I don't have to write each COLOR every time?

Kumbamontu's response is the better solution, but I just wanted to point out that you weren't too far off. Just need to add some parentheses:

code:
Select Name FROM List WHERE Gender = 'Man' AND (Color = 'Blue' OR Color = 'Red' OR Color = 'Yellow')
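For anyone curious why the parentheses matter: AND binds tighter than OR, so the original query actually means (Gender = 'Man' AND Color = 'Blue') OR Color = 'Red' OR Color = 'Yellow', which lets non-'Man' rows through. A quick demonstration via Python's sqlite3, with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE List (Name TEXT, Gender TEXT, Color TEXT)")
con.executemany(
    "INSERT INTO List VALUES (?, ?, ?)",
    [("Al",  "Man",   "Blue"),
     ("Bea", "Woman", "Red"),    # slips through the unparenthesized query
     ("Cal", "Man",   "Red"),
     ("Dee", "Man",   "Green")],
)

# AND binds tighter than OR: (Gender='Man' AND Blue) OR Red OR Yellow
loose = con.execute(
    "SELECT Name FROM List WHERE Gender = 'Man' AND Color = 'Blue' "
    "OR Color = 'Red' OR Color = 'Yellow' ORDER BY Name"
).fetchall()

# IN (or explicit parentheses) forces the intended grouping
strict = con.execute(
    "SELECT Name FROM List WHERE Gender = 'Man' "
    "AND Color IN ('Blue', 'Red', 'Yellow') ORDER BY Name"
).fetchall()

print(loose)   # includes Bea, a non-'Man' row
print(strict)  # only Al and Cal
```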
