big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

Xae posted:

Who said anything about triggers?

And if you want to Process 15 Terabytes of transactions on an app server they end up being sent over the wire if you use an API or have the App server run SQL statements or whatever. The data either has to go from the DB server to the App server, or be processed on the database server.

Cursors are like anything else: A tool. There are times to use them, there are times to not use them. It all depends on what system you are running on and what you are trying to do.

We're going to have to agree to disagree here. IMO, the data should go to the app server as needed (ie, not in a 15 TB chunk!), and you shouldn't ever put cursors or loops into your DB code if you can at all help it

PT6A
Jan 5, 2006

Public school teachers are callous dictators who won't lift a finger to stop children from peeing in my plane
I'm designing the database schema for an app my friend is making, and I'm running up against things that I never encountered in my database design class, particularly with regard to performance versus design purity.

I've got a fairly large relation, which describes a real-life entity with a lot of boolean properties (basically, a list of things that the entity either has or does not have). We're talking approximately 30 different boolean properties, divided into three "categories". Records will need to be searchable by selecting a combination of these attributes. Right now, I'm thinking the most efficient way, in terms of simplifying the queries I'll have to write in the future and probably maximizing search efficiency, is to represent each boolean property as a BOOLEAN column in the table. The other options are to make each of the three categories a SET (which I'm thinking might be less efficient), or to make each category into a table (which I think is going to be poo poo in terms of storage efficiency and search efficiency, but strikes me as the "theoretical" right way to approach something like this).

Still, there's something that just doesn't seem quite right about having a huge relation with a bunch of BOOLEAN columns, even if it strikes me as the most efficient and easiest way to go about the problem, and everything's still in 3NF as far as I can tell. Am I over-thinking this, or if I take what seems like the obvious best way now, am I setting myself up for a world of pain down the road?

big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

PT6A posted:

or to make each category into a table (which I think is going to be poo poo in terms of storage efficiency and search efficiency, but strikes me as the "theoretical" right way to approach something like this).

Why?

Entities
EntityCategoryA
EntityCategoryB
EntityCategoryC

This is how you should do what you want to do.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
It's hard to tell from limited information whether this would be a good approach, but you could also have a Property table and an EntityProperty table. So Property would have columns (ID, Name) and maybe Category, and about 30 rows. EntityProperty would have (Entity, Property) each of which would be a foreign key to the appropriate table.

This arrangement makes it a bit of a pain to do things like "select all entities which have property X and property Y or have at least one of properties A, B and C" though.
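
For what it's worth, the usual way to express "has all of these properties" against a junction table like that is a GROUP BY/HAVING query (relational division); a rough sketch, with the Entity/Property/EntityProperty names assumed from the description above:

code:
-- Entities that have BOTH properties 1 and 2.
SELECT ep.Entity
FROM EntityProperty ep
WHERE ep.Property IN (1, 2)
GROUP BY ep.Entity
HAVING COUNT(DISTINCT ep.Property) = 2;

-- Entities that have AT LEAST ONE of properties 3, 4 and 5.
SELECT DISTINCT ep.Entity
FROM EntityProperty ep
WHERE ep.Property IN (3, 4, 5);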

PT6A
Jan 5, 2006

Public school teachers are callous dictators who won't lift a finger to stop children from peeing in my plane

Hammerite posted:

This arrangement makes it a bit of a pain to do things like "select all entities which have property X and property Y or have at least one of properties A, B and C" though.

This is my major concern. I don't think there will be "at least one of A, B, or C," but there will be lots of "select all entities with properties A, B, D, L, M, K, R..." etc. sort of things, which would seem to be fairly inefficient to search for under your proposed structure. Lots of nested queries, unless I'm overlooking something obvious (which I very well could be, in which case, definitely point it out).

-S- posted:

Why?

Entities
EntityCategoryA
EntityCategoryB
EntityCategoryC

This is how you should do what you want to do.

How would you lay out the EntityCategoryX tables? As Hammerite has suggested, or some other way (either with a SET or a series of BOOLEAN columns)? I agree it's more semantically precise that way, but I'm still wondering if it's worth it. Each main entity will almost certainly have at least 1 property from each category, and I'm guessing (on average) a given entity will have at least half of the available properties, so it would seem to me that it's still more efficient, in terms of space, to store everything in a single table.

BabyFur Denny
Mar 18, 2003
Have you considered using an INT? that's 32 bit, each bit represents a property. You can select all entities with certain properties by using bitwise operators.
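
A rough sketch of how that bitmask idea plays out in a query; the column name and bit assignments are made up for illustration, and note that predicates like this generally can't use an ordinary index:

code:
-- Assume prop_flags is an INT where bit 0 = property A, bit 1 = property B, bit 2 = property C, bit 3 = property D...

-- Entities having ALL of properties A and B (mask 1 + 2 = 3):
SELECT entity_id FROM entities WHERE (prop_flags & 3) = 3;

-- Entities having AT LEAST ONE of properties C or D (mask 4 + 8 = 12):
SELECT entity_id FROM entities WHERE (prop_flags & 12) <> 0;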

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

PT6A posted:

This is my major concern. I don't think there will be "at least one of A, B, or C," but there will be lots of "select all entities with properties A, B, D, L, M, K, R..." etc. sort of things, which would seem to be fairly inefficient to search for under your proposed structure. Lots of nested queries, unless I'm overlooking something obvious (which I very well could be, in which case, definitely point it out).

It can be done as outer joins instead of subqueries, although the query optimiser should probably recognise each of them as being the same thing and come up with the same execution plan.

big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

PT6A posted:

This is my major concern. I don't think there will be "at least one of A, B, or C," but there will be lots of "select all entities with properties A, B, D, L, M, K, R..." etc. sort of things, which would seem to be fairly inefficient to search for under your proposed structure. Lots of nested queries, unless I'm overlooking something obvious (which I very well could be, in which case, definitely point it out).

Why would you have nested subqueries?

SELECT * FROM Entities e
JOIN EntityCategoryA eca ON eca.EntityId=e.EntityId
JOIN EntityCategoryB ecb ON ecb.EntityId=e.EntityId
JOIN EntityCategoryC ecc ON ecc.EntityId=e.EntityId
WHERE
eca.PropA = 1
AND eca.PropB = 1
AND ecb.PropD = 1
AND ecc.PropR = 0

etc. etc.

alternatively, you can do that right in the join (JOIN <TableName> t ON t.Id=e.Id AND t.Prop1=1 AND t.Prop2=1)

quote:

How would you lay out the EntityCategoryX tables? As Hammerite has suggested, or some other way (either with a SET or a series of BOOLEAN columns)? I agree it's more semantically precise that way, but I'm still wondering if it's worth it. Each main entity will almost certainly have at least 1 property from each category, and I'm guessing (on average) a given entity will have at least half of the available properties, so it would seem to me that it's still more efficient, in terms of space, to store everything in a single table.

I would just have BIT columns. A column for each property in a category. You aren't talking about a lot of data, or particularly complex data, at all.
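
A rough sketch of that layout in SQL Server flavour (the column names are invented for illustration):

code:
CREATE TABLE Entities (
    EntityId INT NOT NULL PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL
);

-- One table per category, one BIT column per property in that category.
CREATE TABLE EntityCategoryA (
    EntityId INT NOT NULL PRIMARY KEY REFERENCES Entities (EntityId),
    PropHasWidgets BIT NOT NULL DEFAULT 0,
    PropHasGadgets BIT NOT NULL DEFAULT 0,
    PropHasGizmos  BIT NOT NULL DEFAULT 0
    -- ...the rest of category A's properties...
);
-- EntityCategoryB and EntityCategoryC follow the same pattern.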

PT6A
Jan 5, 2006

Public school teachers are callous dictators who won't lift a finger to stop children from peeing in my plane

BabyFur Denny posted:

Have you considered using an INT? that's 32 bit, each bit represents a property. You can select all entities with certain properties by using bitwise operators.

Isn't this basically the same thing as using the SET column type in MySQL? I think I'd favour that just for greater readability, even if there might be a slight performance hit. I read that FIND_IN_SET() uses bitwise arithmetic when working on a SET column, so the performance hit shouldn't be that severe.
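
For comparison, a minimal MySQL sketch of the SET approach (table and member names invented):

code:
CREATE TABLE things (
    thing_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(50) NOT NULL,
    features SET('prop_a', 'prop_b', 'prop_c') NOT NULL DEFAULT ''
);

INSERT INTO things (name, features)
VALUES ('foo', 'prop_a,prop_c'),
       ('bar', 'prop_b');

-- Things with prop_a; FIND_IN_SET returns the 1-based position, or 0 if absent.
SELECT thing_id, name FROM things WHERE FIND_IN_SET('prop_a', features) > 0;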

Hammerite posted:

It can be done as outer joins instead of subqueries, although the query optimiser should probably recognise each of them as being the same thing and come up with the same execution plan.

I must be having an extended moment of retardation, because I still don't see how this will work.

Let's say, for a simple example, I have an inventory of cars for a dealership:

Car (car_id, name)
------------------------
1, foo
2, bar
3, baz

Attribute (attribute_id, name)
-------------------------
1, air_conditioning
2, power_locks
3, heated_seats

CarAttribute (car_id, attribute_id)
----------------------------
1,1
1,2
1,3
2,2
3,1
3,2

It's trivial, then, to select all the cars that have air conditioning:

SELECT c.car_id FROM Car AS c LEFT JOIN CarAttribute AS ca ON c.car_id=ca.car_id WHERE ca.attribute_id=1

But I don't see how I would select all the cars that have air conditioning and power locks without using a subquery (i.e. using joins alone). It would, again, be easy to select cars that have either power locks or air conditioning:

SELECT DISTINCT c.car_id FROM Car AS c LEFT JOIN CarAttribute AS ca ON c.car_id=ca.car_id WHERE ca.attribute_id=1 OR ca.attribute_id=2

But using AND in place of the OR in that query would produce no results, because no single CarAttribute row has both attribute_id 1 and attribute_id 2. I've got a bit of a cold right now, so maybe I'm just distracted and not seeing something obvious.

PT6A
Jan 5, 2006

Public school teachers are callous dictators who won't lift a finger to stop children from peeing in my plane

-S- posted:

Why would you have nested subqueries?

SELECT * FROM Entities e
JOIN EntityCategoryA eca ON eca.EntityId=e.EntityId
JOIN EntityCategoryB ecb ON ecb.EntityId=e.EntityId
JOIN EntityCategoryC ecc ON ecc.EntityId=e.EntityId
WHERE
eca.PropA = 1
AND eca.PropB = 1
AND ecb.PropD = 1
AND ecc.PropR = 0

etc. etc.

alternatively, you can do that right in the join (JOIN <TableName> t ON t.Id=e.Id AND t.Prop1=1 AND t.Prop2=1)


I would just have BIT columns. A column for each property in a category. You aren't talking about a lot of or complex data at all.

This is basically what I'm thinking of doing (BIT being a better choice than BOOLEAN in this case, thank you for pointing that out). Is there any reason to divide the properties into different tables, beyond readability, assuming each entity is going to have at least one property in each category?

big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

BabyFur Denny posted:

Have you considered using an INT? that's 32 bit, each bit represents a property. You can select all entities with certain properties by using bitwise operators.

You could do some kind of aggregate column on each property table that would be a concatenation of all of the other bit fields in the table, but that's a bit silly

big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

PT6A posted:

This is basically what I'm thinking of doing (BIT being a better choice than BOOLEAN in this case, thank you for pointing that out). Is there any reason to divide the properties into different tables, beyond readability, assuming each entity is going to have at least one property in each category?

The main point of doing so is maintainability and scalability. Basically each table should represent one "Fact" about an entity.

Entity ABC has these Properties from Category 1
Entity ABC has these Properties from Category 2

And each Category table also represents a Fact about that Category (these are the properties in Category 1, these are the properties in Category 2, etc.)

You also have a more readable query, even if you are doing a lot of items in the WHERE clause - it allows someone to see exactly what you are looking for (assuming you name your columns something that makes sense and not EntityCategoryA.Prop1, EntityCategoryA.Prop2, etc.).

big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

PT6A posted:

Isn't this basically the same thing as using the SET column type in MySQL? I think I'd favour that just for greater readability, even if there might be a slight performance hit. I read that FIND_IN_SET() uses bitwise arithmetic when working on a SET column, so the performance hit shouldn't be that severe.


I must be having an extended moment of retardation, because I still don't see how this will work.

Let's say, for a simple example, I have an inventory of cars for a dealership:

Car (car_id, name)
------------------------
1, foo
2, bar
3, baz

Attribute (attribute_id, name)
-------------------------
1, air_conditioning
2, power_locks
3, heated_seats

CarAttribute (car_id, attribute_id)
----------------------------
1,1
1,2
1,3
2,2
3,1
3,2

It's trivial, then, to select all the cars that have air conditioning:

SELECT c.car_id FROM Car AS c LEFT JOIN CarAttribute AS ca ON c.car_id=ca.car_id WHERE ca.attribute_id=1

But I don't see how I would select all the cars that have air conditioning and power locks without using a subquery (i.e. using joins alone). It would, again, be easy to select cars that have either power locks or air conditioning:

SELECT DISTINCT c.car_id FROM Car AS c LEFT JOIN CarAttribute AS ca ON c.car_id=ca.car_id WHERE ca.attribute_id=1 OR ca.attribute_id=2

But using AND in place of the OR in that query would produce no results, because no single CarAttribute row has both attribute_id 1 and attribute_id 2. I've got a bit of a cold right now, so maybe I'm just distracted and not seeing something obvious.

You can join on a table more than once:

SELECT c.name FROM Cars c
JOIN CarAttribute ca ON ca.CarId=c.CarId AND ca.AttributeId=1
JOIN CarAttribute ca2 ON ca2.CarId=c.CarId AND ca2.AttributeId=2

big trivia FAIL fucked around with this message at 20:41 on Jun 15, 2014

the
Jul 18, 2004

by Cowcaster
fletcher, I'm hoping you can be my savior again. I'm not sure how good you are with Python, but you indicated that you use beatbox so I'm hoping you're the perfect person to help me.

I think I've figured out how to query the database thanks to you, but I'm having trouble storing the data. I have code that looks like this:

Python code:
list1 = []

sf = beatbox._tPartnerNS
svc = beatbox.Client()
svc.login("username","password")

query = svc.query("SELECT AccountId, CreatedById, CreatedDate, 
Field, NewValue, OldValue FROM AccountHistory WHERE Account.Industry 
= \'Government\' AND Account.Type = \'Prospect\'")

testdict = query['records']
total = query['size']
queryL = query['queryLocator']

while len(testdict) < int(str(total)):
	query = svc.queryMore(str(queryL))
	queryL = query['queryLocator']
	all_records = [', '.join(str(rec)) for rec in query[sf.records:]]
	list1.append(all_records)
As you can see, I want to just run through and append all of the 37,000 or so records that come up from that query. I originally tried handling them as dict files, but for some reason it was saving them all as instances so I couldn't append them. Now I'm just trying to append the information to a list, but it's currently coming out like this:

code:
In [16]: list1[1][1]
Out[16]: 'A, c, c, o, u, n, t, H, i, s, t, o, r, y, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 6, 
u, U, B, q, A, A, M, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 6, u, 7, g, P, A, A, Q, 2, 0, 1, 2,
 -, 0, 3, -, 2, 7, T, 1, 8, :, 2, 7, :, 0, 6, ., 0, 0, 0, Z, c, r, e, a, t, e, d'
I'm wondering if there's a better way to parse the data coming out of the query.

Xaris
Jul 25, 2006

Lucky there's a family guy
Lucky there's a man who positively can do
All the things that make us
Laugh and cry
e: oops, double.

Xaris fucked around with this message at 18:16 on Jun 17, 2014

Xaris
Jul 25, 2006

Lucky there's a family guy
Lucky there's a man who positively can do
All the things that make us
Laugh and cry
I'm a dumb construction brute who's been working with MS Access to manage construction data, and I've had to learn SQL querying on the fly.

One thing I've been stumped on for days, before finally admitting defeat, is how to group overlapping data where it exists.

For example, here is what an example table might look like where there are multiple drilled stages for a hole.

code:
HoleID            StageStart       StageEnd      CementInjected/FtLength
DRILLEDHOLE#1         0               20              23.1
DRILLEDHOLE#1         20              40              17.5
DRILLEDHOLE#1         40              60              87.5

DRILLEDHOLE#2         0               20              11.1
DRILLEDHOLE#2         10              30              12.4
DRILLEDHOLE#2         30              60              107.0

DRILLEDHOLE#3         0               15              9.1
DRILLEDHOLE#3         10              25              8.4
DRILLEDHOLE#3         25              60              7.0
In the case of #1, that remains unchanged but I'd want overlapping stages in the #2 and #3 holes to combine. The brute-forced output I'd like to see is:

code:
HoleID            StageStart       StageEnd      CementInjected/FtLength
DRILLEDHOLE#1         0               20              23.1
DRILLEDHOLE#1         20              40              17.5
DRILLEDHOLE#1         40              60              87.5

DRILLEDHOLE#2         0               10              11.1
DRILLEDHOLE#2         10              20              12.4+11.1 = 23.5
DRILLEDHOLE#2         20              30              12.4
DRILLEDHOLE#2         30              60              107.0

DRILLEDHOLE#3         0               10              9.1
DRILLEDHOLE#3         10              15              9.1+8.4 = 17.5
DRILLEDHOLE#3         15              25              8.4
DRILLEDHOLE#3         25              60              7.0
Does anyone know a way to achieve this?

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

the posted:

fletcher, I'm hoping you can be my savior again. I'm not sure how good you are with Python, but you indicated that you use beatbox so I'm hoping you're the perfect person to help me.

I think I've figured out how to query the database thanks to you, but I'm having trouble storing the data. I have code that looks like this:

Python code:
list1 = []

sf = beatbox._tPartnerNS
svc = beatbox.Client()
svc.login("username","password")

query = svc.query("SELECT AccountId, CreatedById, CreatedDate, 
Field, NewValue, OldValue FROM AccountHistory WHERE Account.Industry 
= \'Government\' AND Account.Type = \'Prospect\'")

testdict = query['records']
total = query['size']
queryL = query['queryLocator']

while len(testdict) < int(str(total)):
	query = svc.queryMore(str(queryL))
	queryL = query['queryLocator']
	all_records = [', '.join(str(rec)) for rec in query[sf.records:]]
	list1.append(all_records)
As you can see, I want to just run through and append all of the 37,000 or so records that come up from that query. I originally tried handling them as dict files, but for some reason it was saving them all as instances so I couldn't append them. Now I'm just trying to append the information to a list, but it's currently coming out like this:

code:
In [16]: list1[1][1]
Out[16]: 'A, c, c, o, u, n, t, H, i, s, t, o, r, y, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 6, 
u, U, B, q, A, A, M, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 6, u, 7, g, P, A, A, Q, 2, 0, 1, 2,
 -, 0, 3, -, 2, 7, T, 1, 8, :, 2, 7, :, 0, 6, ., 0, 0, 0, Z, c, r, e, a, t, e, d'
I'm wondering if there's a better way to parse the data coming out of the query.

I've never used beatbox before, the bulk of my experience is using the Salesforce SOAP API via Java code.

It looks like you're actually using beatboxxx rather than beatbox? I couldn't get the test code to run with the latter.

What do you plan on doing with these records? Typically you don't want to load them all into memory at once, you process them in chunks and then use queryMore to get the next chunk from Salesforce. This keeps your memory footprint low.

At any rate, if you truly do need an all_records:

Python code:
query = svc.query("SELECT Id, Name FROM Account")

all_records = []

if query['size'] > 0:
    done = False
    while not done:
        all_records = all_records + query['records']
        if query['queryLocator']:
            query = svc.queryMore(query['queryLocator'])
        else:
            done = True

for record in all_records:
    print('%s, %s, %s' % (record.type, record.Id, record.Name))

fletcher fucked around with this message at 18:30 on Jun 17, 2014

the
Jul 18, 2004

by Cowcaster
I want to do data analysis with them.

What I'm specifically doing right now is trying to produce meaningful relationships between the Opportunity Lead Source and the Account History. I want to know whether the Lead Source has any effect on whether an Account goes from a Suspect to a Prospect.

So, what I wanted to do was grab all the Account Histories from the market I'm looking at, and then also grab the Opportunities from a specific lead source (just to start with). Then I'd cross-reference based on Account ID (which both have), and hopefully end up with one list of all the Account Histories that are from that specific lead source. Then I wanted to look at all the histories where it was originally a Suspect that went to a Prospect, and see the time difference between those two entries.

There's probably a much simpler way to do this, but I'm just learning for the first time. I'm guessing you know some way that I can do this with *just one SOQL query*, right?

the fucked around with this message at 19:58 on Jun 17, 2014

Razzled
Feb 3, 2011

MY HARLEY IS COOL
I don't know if this has been mentioned yet but I thought it worth mentioning at least. JetBrains' database IDE '0xDBE' is in its early access phase (essentially a beta). I just signed up and instantly got a download link so I think it's open to everyone now.

http://www.jetbrains.com/dbe/eap/ is the link.

My thoughts: It's pretty drat convenient to be able to query multiple different DBs within a single program, but I'm not convinced its feature set is deep enough to replace the specialized IDEs. For example: SSMS for SQL Server gives you access to SQL Server Agent jobs etc., and Aginity for Redshift gives you graphical information on dist keys as well as S3 explorer integration.

It makes it kind of a niche program in my book, unfortunately, since I can just have the DB-specific programs open and have access to all their features. It seems like the JetBrains tool is purely for code, so I find it hard to distinguish it from, say, Notepad++ in terms of purpose.

I'm pretty interested to see how it pans out though

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Xaris posted:

I'm a dumb construction brute who's been working with MS Access to manage construction data, and I've had to learn SQL querying on the fly.

One thing I've been stumped on for days, before finally admitting defeat, is how to group overlapping data where it exists.

For example, here is what an example table might look like where there are multiple drilled stages for a hole.

code:
HoleID            StageStart       StageEnd      CementInjected/FtLength
DRILLEDHOLE#1         0               20              23.1
DRILLEDHOLE#1         20              40              17.5
DRILLEDHOLE#1         40              60              87.5

DRILLEDHOLE#2         0               20              11.1
DRILLEDHOLE#2         10              30              12.4
DRILLEDHOLE#2         30              60              107.0

DRILLEDHOLE#3         0               15              9.1
DRILLEDHOLE#3         10              25              8.4
DRILLEDHOLE#3         25              60              7.0
In the case of #1, that remains unchanged but I'd want overlapping stages in the #2 and #3 holes to combine. The brute-forced output I'd like to see is:

code:
HoleID            StageStart       StageEnd      CementInjected/FtLength
DRILLEDHOLE#1         0               20              23.1
DRILLEDHOLE#1         20              40              17.5
DRILLEDHOLE#1         40              60              87.5

DRILLEDHOLE#2         0               10              11.1
DRILLEDHOLE#2         10              20              12.4+11.1 = 23.5
DRILLEDHOLE#2         20              30              12.4
DRILLEDHOLE#2         30              60              107.0

DRILLEDHOLE#3         0               10              9.1
DRILLEDHOLE#3         10              15              9.1+8.4 = 17.5
DRILLEDHOLE#3         15              25              8.4
DRILLEDHOLE#3         25              60              7.0
Does anyone know a way to achieve this?

There most likely does exist a way to do this in whatever flavour of SQL you are using, but my inclination would be to pull the data out of the database into whatever is your favourite scripting language and treat it there, one group/hole at a time.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

the posted:

I want to do data analysis with them.

What I'm specifically doing right now is trying to produce meaningful relationships between the Opportunity Lead Source and the Account History. I want to know whether the Lead Source has any effect on whether an Account goes from a Suspect to a Prospect.

So, what I wanted to do was grab all the Account Histories from the market I'm looking at, and then also grab the Opportunities from a specific lead source (just to start with). Then I'd cross-reference based on Account ID (which both have), and hopefully end up with one list of all the Account Histories that are from that specific lead source. Then I wanted to look at all the histories where it was originally a Suspect that went to a Prospect, and see the time difference between those two entries.

There's probably a much simpler way to do this, but I'm just learning for the first time. I'm guessing you know some way that I can do this with *just one SOQL query*, right?

Not sure if there is a simple way to do that. You may have to go further down the route you are currently on and query/process the data manually yourself. Just be sure to leverage the built in relationships as much as you can to avoid extra queries.

And to be honest, I only have experience with Salesforce from the developer perspective. When it comes to actually using it like an end user things get a little fuzzy, I've never done anything with those Lead & Opportunity objects.

Could you do something with the delta between the CreatedDate & ConvertedDate?

code:
SELECT CreatedDate, ConvertedDate, ConvertedAccount.Name FROM Lead WHERE LeadSource = 'Web' AND IsConverted = TRUE

the
Jul 18, 2004

by Cowcaster
Well, what I'm currently doing is grabbing the history list from a certain area, filtering it for all entries that match a certain lead source, then filtering it again into two lists: history for the created date, and history for when it was upgraded from Suspect to Prospect. That leaves me with two matching lists; in the example I did the "Advertisement" lead source and ended up with 4 entries in each list with matching IDs. So I just subtract one time entry from the other for each item, and that gives me the time each sales ID took to go from created to upgraded to Prospect.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

One Swell Foop posted:

OK, here's some pseudocode, I didn't test it but I think it should be pretty close to working:

code:
SELECT SUBSTRING(
	Pairs,
	PATINDEX ('%name:%|nextkey:%', Pairs)  + LEN('name:'), -- start of name's value
	-- get end position by finding start position of reversed pattern
	SELECT (LEN(Pairs) - (PATINDEX('%:yektxen|%' , REVERSE(Pairs)) + LEN(':yektxen')) 
		-- then subtract start position to get length of substring
		- (PATINDEX('%name:%|nextkey:%', Pairs)  + LEN('name:'))
	)
This only works if your keys are in the same order each time.

E: or install Master Data Services on your SQL Server instance and use a regex.

Thanks for taking a whack at it. I decided to go down the road of option 2, 'do something with XML'. I found out the max count of pairs (13) and just whacked this out:
code:
With ValueSplitter As
(
Select 
T.ContactID,
T.pairs,
CONVERT(XML,'<pairs><pair>' + replace(replace(T.pairs,'|','</pair><pair>'),'&','') + '</pair></pairs>') as xmlpairs
From Contacts As T
where Pairs like '%source:myass'
)
Select
xmlpairs,
contactid,
replace(xmlpairs.value('/pairs[1]/pair[1]','varchar(100)'),'name:','') AS pair1, -- first name
xmlpairs.value('/pairs[1]/pair[2]','varchar(100)') AS pair2,
xmlpairs.value('/pairs[1]/pair[3]','varchar(100)') AS pair3,
xmlpairs.value('/pairs[1]/pair[4]','varchar(100)') AS pair4,
xmlpairs.value('/pairs[1]/pair[5]','varchar(100)') AS pair5,
xmlpairs.value('/pairs[1]/pair[6]','varchar(100)') AS pair6,
xmlpairs.value('/pairs[1]/pair[7]','varchar(100)') AS pair7,
xmlpairs.value('/pairs[1]/pair[8]','varchar(100)') AS pair8,
xmlpairs.value('/pairs[1]/pair[9]','varchar(100)') AS pair9,
xmlpairs.value('/pairs[1]/pair[10]','varchar(100)') AS pair10,
replace(xmlpairs.value('/pairs[1]/pair[11]','varchar(100)'),'email:','') AS pair11, -- this one is email
xmlpairs.value('/pairs[1]/pair[12]','varchar(100)') AS pair12,
xmlpairs.value('/pairs[1]/pair[13]','varchar(100)') AS pair13
From ValueSplitter
where
replace(xmlpairs.value('/pairs[1]/pair[11]','varchar(100)'),'email:','')<>''

RedQueen
Apr 21, 2007

It takes all the running you can do just to stay in the same place.
I am a MYSQL newbie and I need some basic help with my database design and the planning behind my website. The site lets people browse all the imaginary weapon skins in the game Counter-Strike and see their prices on the Steam market. My background: I started teaching myself web development in January (knowing nothing) and am taking it more seriously now that the site gets a lot of traffic.

What I have:
A local MS Access database file with all the "skins". Each skin has a unique key (autoincrement) and ~20 fields of data about it which I enter (using some Access autolookup stuff to help speed things up). Each skin also has ~20 blank fields for various prices. I convert the Access file to a *.sql file with the Bullzip MS Access to MySQL converter and import it in phpMyAdmin on my site. As new skins are added to the game I update the file, upload it, and the new content or changes are instantly available to users.

On my webserver I also have some php scripts that gradually update the price fields (which were originally blank on my local file) every couple hours, and those prices display for users as they browse the skins.

My problem:
When I make changes to the Access file, convert it, and import the SQL, it discards my existing price data on the website and replaces it with the blank fields from my local database. Users lose price data for a few hours until things repopulate from the continuously running scripts. The Bullzip Access-to-MySQL converter is limited in that I can't make it update existing records with the available options, and editing the *.sql file manually isn't really an option because I'm uploading a couple dozen times when new skins are added to the game.


I want to be able to change data or add new records in my local file and upload it without losing the price data I've built up online. So I'm thinking I need to separate the two. I've googled into foreign keys, merges, and joins enough that I'm just plain lost about what the "best" way to do this is.

Do I make a separate price table with a common primary key and merge it into the main content table every time prices update (a few every ~10 minutes)?
Do I make a separate price table and adjust all my queries to pull from it and the main content table when building pages for users? Some sort of join (never done before)?
If I add new records to the content table and upload it, how can I ensure those new primary keys are also added to the price table?

I know this is super amateur compared to other stuff in the thread, but I'm stuck. Thanks guys!

RedQueen fucked around with this message at 05:45 on Jun 19, 2014

Begby
Apr 7, 2005

Light saber? Check. Black boots? Check. Codpiece? Check. He's more machine than kid now.

RedQueen posted:

I know this is super amateur compared to other stuff in the thread, but I'm stuck. Thanks guys!

Start over, completely ditch Access, and do everything in one database. If you really need the forms, you can set up Access with an ODBC driver to talk to your MySQL database. However, there are a lot of GUIs you can use to modify the records in a MySQL database sorta like Excel, which is not always advisable, but might be OK to just start learning with as long as you keep things backed up. You can always code a simple PHP app to show a form that you can use to update your DB.

You could also export the changes as a .csv file, then create a PHP script that will read the CSV file and do an update on each record on your server if you really need to keep using Access.
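
If it helps, most of that update step can also live in MySQL itself; a rough sketch assuming a staging table, and with the table/column names (skin, skin_staging, etc.) and the CSV path invented for illustration. The prices survive because their columns are never listed in the upsert:

code:
CREATE TABLE skin_staging (
    skin_id INT NOT NULL PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    rarity  VARCHAR(50)  NOT NULL
);

LOAD DATA LOCAL INFILE '/path/to/skins.csv'
INTO TABLE skin_staging
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

-- Insert new skins and update the content fields of existing ones,
-- without ever touching the price columns.
INSERT INTO skin (skin_id, name, rarity)
SELECT skin_id, name, rarity FROM skin_staging
ON DUPLICATE KEY UPDATE
    name   = VALUES(name),
    rarity = VALUES(rarity);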

Secondly, read up on third normal form, which is basically set theory. Try hard not to violate it. Basically, third normal form means that every field in a table depends on the key, the whole key, and nothing but the key. If there is one set of prices for each skin, and that price set belongs to that skin alone, then splitting that into a separate table is a violation of third normal form. Sometimes it's OK to violate that if you really need to, but what you are doing is throwing some paint on a really nasty hack that is just going to create more and more headaches.

Learning joins isn't all that hard: get a beginner's guide to SQL, create a few tables, try selecting some records, deleting records, trying some joins, etc.
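
And just so "join" isn't mysterious, a tiny example in the same spirit as what's being discussed (all the names here are invented):

code:
-- One row per skin together with its current price from a separate price table.
SELECT s.skin_id, s.name, p.steam_price
FROM skin AS s
JOIN skin_price AS p ON p.skin_id = s.skin_id
WHERE s.rarity = 'Covert';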

I have seen a lot of people trying to set up tables and just totally loving it up, which works fine from the start, but then you get totally screwed later on. There is a 10-million-dollar-a-year company that machines parts for industrial loaders in my town, and they were running all the shop jobs on an Access database that used the client phone number as the primary key. It had all these tables joined like you want to do, where the addresses were in their own table, and there was no table for state names, so you had to type in both the two-digit state abbreviation and the full state name, etc. They were totally hosed whenever a customer got a new phone number, and when they tried to determine the number of sales by state, it was all messed up because everyone spelled the states differently and got the abbreviations wrong.

Begby fucked around with this message at 18:21 on Jun 19, 2014

Essential
Aug 14, 2003
Hey guys, I'm at my wits end here. I have a query that 90% of the time runs extremely fast and returns results in 2-3 seconds. However, certain offices take 10's of minutes to run. I have no experience with sql optimization so I can't even begin to understand where to start. This is also running on SQL Azure, so my tools for profiling etc. may be limited.

I can't find a direct correlation to the number of records it's trying to process and the time it takes. For instance, I have an office with 157k procedures and 30k patients and the query is done in 2 seconds. I have another office with 47k procedures and 3k patients and it takes 15 minutes for that one to run.

It is always the same offices that are slow and always the same that are fast. It's not hit or miss, it's always the same ones.

The query is as follows:

code:
Select Sum(OfficeProcedures.Amount) As Total, OfficePatients.PMInsCarrierID From OfficePatients
Inner Join OfficeProcedures On OfficeProcedures.PMPatientID = OfficePatients.PMID
Where OfficeProcedures.OfficeID = '<officeguid>'
And OfficeProcedures.ChartStatus = 102
And OfficePatients.OfficeID = '<officeguid>'
And OfficePatients.PatStatus = 1
Group By OfficePatients.PMInsCarrierID
Order By Total DESC
The result looks like:
code:
5804981.00	-1
263679.00	51
231703.00	25
219761.00	557
175758.00	2707
114431.00	2275
...
Any help or direction is greatly appreciated.

Xae
Jan 19, 2005

Essential posted:

Hey guys, I'm at my wits end here. I have a query that 90% of the time runs extremely fast and returns results in 2-3 seconds. However, certain offices take 10's of minutes to run. I have no experience with sql optimization so I can't even begin to understand where to start. This is also running on SQL Azure, so my tools for profiling etc. may be limited.

I can't find a direct correlation to the number of records it's trying to process and the time it takes. For instance, I have an office with 157k procedures and 30k patients and the query is done in 2 seconds. I have another office with 47k procedures and 3k patients and it takes 15 minutes for that one to run.

It is always the same offices that are slow and always the same that are fast. It's not hit or miss, it's always the same ones.

The query is as follows:

code:
Select Sum(OfficeProcedures.Amount) As Total, OfficePatients.PMInsCarrierID From OfficePatients
Inner Join OfficeProcedures On OfficeProcedures.PMPatientID = OfficePatients.PMID
Where OfficeProcedures.OfficeID = '<officeguid>'
And OfficeProcedures.ChartStatus = 102
And OfficePatients.OfficeID = '<officeguid>'
And OfficePatients.PatStatus = 1
Group By OfficePatients.PMInsCarrierID
Order By Total DESC
The result looks like:
code:
5804981.00	-1
263679.00	51
231703.00	25
219761.00	557
175758.00	2707
114431.00	2275
...
Any help or direction is greatly appreciated.

Take a look at the execution plan, SSMS has a "Display Estimated Execution Plan" button. Check and see if the plans are different. If they are, find out why.

Razzled
Feb 3, 2011

MY HARLEY IS COOL

Essential posted:

Hey guys, I'm at my wits end here. I have a query that 90% of the time runs extremely fast and returns results in 2-3 seconds. However, certain offices take 10's of minutes to run. I have no experience with sql optimization so I can't even begin to understand where to start. This is also running on SQL Azure, so my tools for profiling etc. may be limited.

I can't find a direct correlation to the number of records it's trying to process and the time it takes. For instance, I have an office with 157k procedures and 30k patients and the query is done in 2 seconds. I have another office with 47k procedures and 3k patients and it takes 15 minutes for that one to run.

It is always the same offices that are slow and always the same that are fast. It's not hit or miss, it's always the same ones.

The query is as follows:
stuff
Any help or direction is greatly appreciated.

The first step in optimizing a query is usually (for me) to check the query execution plan. This will show you where the relative costs of your query are and also will highlight any obvious issues (table scans against giant tables etc). I'm on SQL Server 2008R2 myself so I'm not familiar with Azure firsthand but from what I've seen it operates on more or less the same principles. Basically just take a look at the execution plan to see how the query optimizer is running your query and make sure your query is written in a way that is taking advantage of indexes etc.
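
On top of the plan, it can also be worth flipping on the timing/IO statistics, running the query once with a fast office's GUID and once with a slow one, and comparing the logical reads and elapsed times in the Messages tab; a small sketch using the query from the post above (these SET options should also be available on SQL Azure):

code:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

Select Sum(OfficeProcedures.Amount) As Total, OfficePatients.PMInsCarrierID From OfficePatients
Inner Join OfficeProcedures On OfficeProcedures.PMPatientID = OfficePatients.PMID
Where OfficeProcedures.OfficeID = '<officeguid>'   -- run once with a fast office, once with a slow one
And OfficeProcedures.ChartStatus = 102
And OfficePatients.OfficeID = '<officeguid>'
And OfficePatients.PatStatus = 1
Group By OfficePatients.PMInsCarrierID
Order By Total DESC

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;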

the
Jul 18, 2004

by Cowcaster
I've got a LeadSource stored as "I'm an existing client", but when I try to query it it says that it's an invalid query. Is it stored somehow without the apostrophe?

I have it in python as \'I\'m an existing client\' to account for the apostrophe in the name.

edit: Specifically the error is this:

SoapFaultError: 'MALFORMED_QUERY' "MALFORMED_QUERY: Opportunity WHERE LeadSource = 'I'm an existing client' ^ ERROR at Row:1:Column:69 unexpected token: 'm'"

Essential
Aug 14, 2003

Xae posted:

Take a look at the execution plan, SSMS has a "Display Estimated Execution Plan" button. Check and see if the plans are different. If they are, find out why.

Razzled posted:

The first step in optimizing a query is usually (for me) to check the query execution plan. This will show you where the relative costs of your query are and also will highlight any obvious issues (table scans against giant tables etc). I'm on SQL Server 2008R2 myself so I'm not familiar with Azure firsthand but from what I've seen it operates on more or less the same principles. Basically just take a look at the execution plan to see how the query optimizer is running your query and make sure your query is written in a way that is taking advantage of indexes etc.

Thanks for the suggestion guys.

I did the Estimated Execution Plan and I can see that there is a difference between the quick query and the slow query. The quick query looks like this: select->sort->computescaler->hashmatch->hashmatch->clusteredindexscan->clusteredindexscan.

The slow query looks like this: select->sort->computescaler->streamaggregate->nestedloops->sort->clusteredindexscan->clusteredindexscan.

The first clusteredindexscan on both are in the 80% range (81% for fast, 86% for slow). Another interesting thing is the fast query has estimated rows of various thousands to hundred thousand. The slow query only has 1 estimated row for each step.

The slow query has Estimated CPU Costs of 1.1E-06 for the ComputeScaler and StreamAggregate operations and 4.18E-06 for the NestedLoop operation. The fast query has CPU Costs of .4, .4, .02, .9, .1 etc. nothing that seems high.

Does that provide any meaningful information?

Razzled
Feb 3, 2011

MY HARLEY IS COOL

Essential posted:

Thanks for the suggestion guys.

I did the Estimated Execution Plan and I can see that there is a difference between the quick query and the slow query. The quick query looks like this: select->sort->computescaler->hashmatch->hashmatch->clusteredindexscan->clusteredindexscan.

The slow query looks like this: select->sort->computescaler->streamaggregate->nestedloops->sort->clusteredindexscan->clusteredindexscan.

The first clusteredindexscan on both are in the 80% range (81% for fast, 86% for slow). Another interesting thing is the fast query has estimated rows of various thousands to hundred thousand. The slow query only has 1 estimated row for each step.

The slow query has Estimated CPU Costs of 1.1E-06 for the ComputeScaler and StreamAggregate operations and 4.18E-06 for the NestedLoop operation. The fast query has CPU Costs of .4, .4, .02, .9, .1 etc. nothing that seems high.

Does that provide any meaningful information?

Estimated CPU costs numbers and such are kind of useless info, but from what you've said I think that you should investigate further into the JOIN. Optimally, your query will utilize index seeks rather than scans to provide the fastest route to the data you are looking for. However, if your query is requesting columns that are either not indexed or only partially indexed(ie only some of the columns are indexed) you'll end up with scans instead. Scans can still be fast, it depends on the data set but a seek will be even faster and is best in pretty much all cases.

You probably want to make sure the tables involved are indexed and that you are utilizing those indexes correctly.

Essential
Aug 14, 2003

Razzled posted:

You probably want to make sure the tables involved are indexed and that you are utilizing those indexes correctly.

Thanks for the help. This is the problem/solution I'm sure. By default you have to have indexed tables when using SQL Azure, which you get automatically via the primary key. However, I've done no indexing outside of that.

I'm looking for a quick fix for this, so before I do anything I can't undo, am I able to remove indexes that I add? I'd like to see if I can get this working before learning a bunch about indexing. This feels like something I can easily screw up, but it would be really great if I could solve this before learning all about indexing.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

the posted:

I've got a LeadSource stored as "I'm an existing client", but when I try to query it it says that it's an invalid query. Is it stored somehow without the apostrophe?

I have it in python as \'I\'m an existing client\' to account for the apostrophe in the name.

edit: Specifically the error is this:

SoapFaultError: 'MALFORMED_QUERY' "MALFORMED_QUERY: Opportunity WHERE LeadSource = 'I'm an existing client' ^ ERROR at Row:1:Column:69 unexpected token: 'm'"

I think you need some more backslashes. You want the query to end up being:
code:
SELECT something FROM somewhere WHERE thingy = 'I\'m a real boy!'
I think something like this (this is output from the python console)
code:
>>> query = 'SELECT something FROM somewhere WHERE thingy = \'I\\\'m a real boy!\''
>>> query
"SELECT something FROM somewhere WHERE thingy = 'I\\'m a real boy!'"
>>> print query
SELECT something FROM somewhere WHERE thingy = 'I\'m a real boy!'

Razzled
Feb 3, 2011

MY HARLEY IS COOL

Essential posted:

Thanks for the help. This is the problem/solution I'm sure. By default you have to have indexed tables when using SQL Azure, which you get automatically via the primary key. However, I've done no indexing outside of that.

I'm looking for a quick fix for this, so before I do anything I can't undo, am I able to remove indexes that I add? I'd like to see if I can get this working before learning a bunch about indexing. This feels like something I can easily screw up, but it would be really great if I could solve this before learning all about indexing.

Sure, in SQL Server there are two main types of indexes: clustered and nonclustered. Feel free to look up the specifics of each, but basically the difference is that you can only have one clustered index per table but many nonclustered indexes on a single table. In any case, you can always delete them if you mess up and make new ones. You might want to avoid doing this too much for huge tables, though, as it is a resource-expensive operation.

The reasoning behind this is because clustered indexes deal with the actual order of the rows of data in the table. A nonclustered index is a separate data structure from the table data and contains references to the table's data to assist queries in finding the data faster. Another side note is that clustered indexes consume no additional space because they are logical constructs that deal with the order of the table data itself. Nonclustered indexes on the other hand require additional disk space and will scale in size proportional to the size of the underlying table.

Be aware that indexing enormous tables can be a very taxing operation and can cause not only performance degradation but also prolonged locking. Maybe do it in a test environment if you're worried about breaking stuff.
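
To make that concrete for the query upthread, a rough sketch of the kind of nonclustered indexes you might try; the index names are made up, the column choice is a guess from the WHERE/JOIN clauses, and they can be dropped again without losing any data:

code:
CREATE NONCLUSTERED INDEX IX_OfficeProcedures_Office_Status
    ON OfficeProcedures (OfficeID, ChartStatus, PMPatientID)
    INCLUDE (Amount);

CREATE NONCLUSTERED INDEX IX_OfficePatients_Office_Status
    ON OfficePatients (OfficeID, PatStatus, PMID)
    INCLUDE (PMInsCarrierID);

-- If one turns out not to help:
DROP INDEX IX_OfficeProcedures_Office_Status ON OfficeProcedures;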

Razzled fucked around with this message at 23:01 on Jun 19, 2014

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

fletcher posted:

I think you need some more backslashes. You want the query to end up being:
code:
SELECT something FROM somewhere WHERE thingy = 'I\'m a real boy!'
I think something like this (this is output from the python console)
code:
>>> query = 'SELECT something FROM somewhere WHERE thingy = \'I\\\'m a real boy!\''
>>> query
"SELECT something FROM somewhere WHERE thingy = 'I\\'m a real boy!'"
>>> print query
SELECT something FROM somewhere WHERE thingy = 'I\'m a real boy!'

Raw strings are great for just this kind of situation. Also, don't forget to make it easier on yourself by using double- and single-quoted strings as appropriate.

code:
>>> query = r"SELECT something FROM somewhere WHERE thingy = 'I\'m a real boy!'"

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

the posted:

I've got a LeadSource stored as "I'm an existing client", but when I try to query it it says that it's an invalid query. Is it stored somehow without the apostrophe?

I have it in python as \'I\'m an existing client\' to account for the apostrophe in the name.

edit: Specifically the error is this:

SoapFaultError: 'MALFORMED_QUERY' "MALFORMED_QUERY: Opportunity WHERE LeadSource = 'I'm an existing client' ^ ERROR at Row:1:Column:69 unexpected token: 'm'"

Wait, you're using Python and passing a literal string into a database? Have you heard of SQL injection? There must be seventy thousand modules for data handling in Python by now.

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

PhantomOfTheCopier posted:

Wait, you're using Python and passing a literal string into a database? Have you heard of SQL injection? There must be seventy thousand modules for data handling in Python by now.

It's Salesforce, you pretty much have to use string literals. At any rate, SOQL is read only. You can't CREATE/UPDATE/DELETE, those use a separate API. And there's no risk of querying for data you don't have access to, that is all handled by the Salesforce user security model.

FieryBalrog
Apr 7, 2010
Grimey Drawer

-S- posted:

We're going to have to agree to disagree here. IMO, the data should go to the app server as needed (ie, not in a 15 TB chunk!), and you shouldn't ever put cursors or loops into your DB code if you can at all help it

Most of our business logic is in PL/SQL stored procs called from Java routes, and most of it involves cursors and loops =/

To be fair in my time here I've been working on updating that to more set based operations. BUT a lot of business logic simply has to be done in cursors and loops because of the need to perform the calculation in a particular order (e.g. calculate row A's result and update the impact before calculating row B's result, etc.) OR because the logic for each individual row is too complicated OR because a new bulk process relies on modular use of existing processes instead of re-writing the logic from scratch.
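
For anyone following along, the cursor-vs-set-based conversion being described is basically this (a schematic T-SQL-style sketch with an invented orders table, not anyone's actual code):

code:
-- Row-by-row version: loop over a cursor and update one row per iteration.
DECLARE @id INT;
DECLARE order_cursor CURSOR FOR
    SELECT order_id FROM orders WHERE status = 'NEW';
OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE orders SET status = 'PROCESSED' WHERE order_id = @id;
    FETCH NEXT FROM order_cursor INTO @id;
END;
CLOSE order_cursor;
DEALLOCATE order_cursor;

-- Equivalent set-based version: one statement, no loop.
UPDATE orders SET status = 'PROCESSED' WHERE status = 'NEW';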

However, I've heard (somewhere) that Oracle optimizes cursors much better than SQL Server does, and apparently a lot of the stigma against cursors/loops comes from bad experiences with SQL Server. I don't actually know much about this though, so this could be wrong.

Our CEO has been pushing for a lot of the business logic to be re-written in Java and avoiding the stored procs, but that's a large project that will take forever so no one's doing it yet.

FieryBalrog fucked around with this message at 15:54 on Jun 20, 2014

FieryBalrog
Apr 7, 2010
Grimey Drawer

Razzled posted:

Estimated CPU costs numbers and such are kind of useless info, but from what you've said I think that you should investigate further into the JOIN. Optimally, your query will utilize index seeks rather than scans to provide the fastest route to the data you are looking for. However, if your query is requesting columns that are either not indexed or only partially indexed(ie only some of the columns are indexed) you'll end up with scans instead. Scans can still be fast, it depends on the data set but a seek will be even faster and is best in pretty much all cases.

You probably want to make sure the tables involved are indexed and that you are utilizing those indexes correctly.

Well, I think this depends on the volume of data that actually needs to be accessed, so I don't agree with the bolded portion. If you only need to access 5 rows out of 100,000 but your index structure is such that the optimizer does an index range scan on 5,000 rows to find the rows you want, then enabling index access by rowID is better (not considering the indexing overhead, just query speed). But if you need to access 5,000 rows, then the range scan is better. In fact, I think it's the case that for any nontrivial volume of data from a table, a range scan/hash join will be faster than rowID access/nested loops; however, the data isn't always clustered in a neat indexable range.

big trivia FAIL
May 9, 2003

"Jorge wants to be hardcore,
but his mom won't let him"

Essential posted:

Hey guys, I'm at my wits end here. I have a query that 90% of the time runs extremely fast and returns results in 2-3 seconds. However, certain offices take 10's of minutes to run. I have no experience with sql optimization so I can't even begin to understand where to start. This is also running on SQL Azure, so my tools for profiling etc. may be limited.

I can't find a direct correlation to the number of records it's trying to process and the time it takes. For instance, I have an office with 157k procedures and 30k patients and the query is done in 2 seconds. I have another office with 47k procedures and 3k patients and it takes 15 minutes for that one to run.

It is always the same offices that are slow and always the same that are fast. It's not hit or miss, it's always the same ones.

The query is as follows:

code:
Select Sum(OfficeProcedures.Amount) As Total, OfficePatients.PMInsCarrierID From OfficePatients
Inner Join OfficeProcedures On OfficeProcedures.PMPatientID = OfficePatients.PMID
Where OfficeProcedures.OfficeID = '<officeguid>'
And OfficeProcedures.ChartStatus = 102
And OfficePatients.OfficeID = '<officeguid>'
And OfficePatients.PatStatus = 1
Group By OfficePatients.PMInsCarrierID
Order By Total DESC
The result looks like:
code:
5804981.00	-1
263679.00	51
231703.00	25
219761.00	557
175758.00	2707
114431.00	2275
...
Any help or direction is greatly appreciated.

You could throw your OfficeId=<officeguid> into the/another join since it looks like you want them to be the same:

Inner Join OfficeProcedures On OfficeProcedures.PMPatientID = OfficePatients.PMID AND OfficeProcedures.OfficeId = OfficePatients.OfficeId and just do one lookup in the WHERE (OP.OfficeId=<GUID>)

Also, make drat sure you do NOT have a clustered index on that GUID column (non-clustered index is fine). Short of that, as everyone else said, make sure your columns are indexed properly.
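
Put together, a sketch of the rewritten query (same placeholder GUID as in the original post):

code:
Select Sum(OfficeProcedures.Amount) As Total, OfficePatients.PMInsCarrierID From OfficePatients
Inner Join OfficeProcedures On OfficeProcedures.PMPatientID = OfficePatients.PMID
    And OfficeProcedures.OfficeID = OfficePatients.OfficeID
Where OfficePatients.OfficeID = '<officeguid>'
And OfficeProcedures.ChartStatus = 102
And OfficePatients.PatStatus = 1
Group By OfficePatients.PMInsCarrierID
Order By Total DESC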

big trivia FAIL fucked around with this message at 13:38 on Jun 21, 2014
