thefncrow
Mar 14, 2001
OK, I've got an XQuery question on SQL Server 2005 here. I'm working with an XML structure that we store in a SQL Server 2005 database. The stored XML is large, and prohibitively expensive to display all at once via an XSLT transformation, which is the current display method. I'm looking to implement filtering, ordering, and paging on the XML data.

I've got a basic XML structure that's something like this (though I've changed certain details):
code:
<BILL BillID="12345678">
  <Item ID="1">
    <N1 type="Sender">Shipper Name</N1>
    <N2 type="Sender">Shipper Street Address</N2>
    <N3 type="Sender">
      <City>ShipperCity</City>
      <State>SH</State>
      <Postal>12345</Postal>
    </N3>
    <N1 type="Receiver">Receiver Name</N1>
    <N2 type="Receiver">Receiver Street Address</N2>
    <N3 type="Receiver">
      <City>ReceiverCity</City>
      <State>RC</State>
      <Postal>23456</Postal>
    </N3>
    <ItemCode>8675309</ItemCode>
    <Date>01/02/2003</Date>
    <Weight type="lb">3</Weight>
    <Price currency="USD">72.29</Price>
  </Item>
  <Item ID="2">
  ....
   and so on
  ....
</BILL>
Now then, I've written a series of scalar-valued functions that return XML to my application, containing a filtered subset of the original XML, an ordered subset, or just a count-limited subset (say, the items to appear on page 11 when we use 25 items per page). Performance is going to be an issue, since these things could all potentially be chained together (i.e., give me page 11, at 25 items a page, of all records coming out of Dallas, TX, sorted by price in descending order). The paging function is really very speedy considering the size of the input, and doesn't present that big of a problem; a sketch of it appears below. I haven't even gotten to testing the filtering, though, because the ordering functions are a really nasty problem.
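
(For illustration, the paging amounts to a position() filter over the Item sequence. This is a sketch, not the actual function: page 11 at 25 items per page works out to items (11-1)*25+1 = 251 through 11*25 = 275.)
code:
<BILL BillID="{/BILL/@BillID}">
{
  /BILL/Item[position() ge 251 and position() le 275]
}
</BILL>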

For performance testing, I ran my functions against a rather large XML structure in our database, which contains over 25,000 (25,057, to be exact) of those Item tags. This is an especially large structure, but it's not out of line with the largest structures in production. My paging function can fetch this large structure and return, say, items 776-800 in about 25 seconds. However, if I need those same items returned with the data sorted by Weight, the OrderByWeight function takes six and a half minutes to execute the following code:

code:
<BILL BillID="{/BILL/@BillID}"> 
{
 for $item in /BILL/Item 
 order by $item/Weight[1] ascending 
 return $item
}
</BILL>
(The header and footer tags are just there so that I can maintain my basic XML structure from function call to function call and chain the functions as appropriate, i.e., Page(Order(Filter(OriginalXML))). Once again, this is an approximation of the code I have, but the FLWOR expression is almost exactly what I'm using, apart from the renamed tags.)
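
(In T-SQL terms the chaining comes out as nested scalar-function calls, roughly like this; the function names and signatures here are made up for illustration:)
code:
-- Sketch: the Page(Order(Filter(...))) chain as nested scalar-function calls.
DECLARE @BillXml XML;  -- the original <BILL> document

SELECT dbo.PageItems(
           dbo.OrderByPrice(
               dbo.FilterByOrigin(@BillXml, 'Dallas', 'TX'),
               'DESC'),
           11, 25);  -- page 11, 25 items per page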

Worse yet, I can apply other orderings that have been requested, like sorting by "<City>, <State>" on the shipper end. I tried applying that sort to my large XML structure before I went to lunch, figuring I'd check the execution time when I returned. Thirty minutes later it was still executing, at which point I cancelled it. That code was very similar to the above, just with this order by clause instead:

code:
<BILL BillID="{/BILL/@BillID}">
{
  for $item in /BILL/Item
  order by $item/N3[@type="Sender"][1]/City[1] ascending, $item/N3[@type="Sender"][1]/State[1] ascending
  return $item
}
</BILL>
So, clearly, using the order by clause is going to be prohibitively expensive. Is there anything else I can use, in XQuery or otherwise, for ordering within XML that might be faster? This page is going to end up being customer-facing, and asking a customer to hold on for 20 minutes while we retrieve page 3 of their data, sorted and filtered by some criteria, is not a realistic option, not to mention that this whole effort was prompted by the sluggish handling of these large XML structures in the first place.

thefncrow
Mar 14, 2001
I'd like to ask here to see if any of you have ideas on how I might speed up an operation I'm working on.

In SQL Server 2005, I've got a table with a large amount of data, millions of records. The table has a primary key on a single ID column of type uniqueidentifier. Those ID values have been filled, to this point, using newid(). The data I'm really interested in, though, is inside an untyped XML column.

The column I have has data something like this:
code:
<ElementA>
  <ElementB>
    <DataToBeRetainedA>147875233</DataToBeRetainedA>
    <ElementToBeWipedOut>12345678</ElementToBeWipedOut>
    <DataToBeRetainedB>1234589</DataToBeRetainedB>
  </ElementB>
</ElementA>
The operation I'm trying to do is to scrub out some data that exists inside the XML column after a given period of time passes. In a generalized form, my query is something like:
code:
UPDATE LargeTable
SET XmlColumn.modify('replace value of (/ElementA/ElementB/ElementToBeWipedOut/text())[1] with ""')
WHERE Date <= @CutoffDate AND XmlColumn.exists('/ElementA/ElementB/ElementToBeWipedOut') = 1
That .exists() check in the WHERE clause is there because the non-XML columns in the row can't tell me which records lack this element, and I need it to avoid raising an error when modify() runs against a record whose XML has no /ElementA/ElementB/ElementToBeWipedOut element.

Now, that's not exactly what I have (I'm running the operation in batches, I'm also doing a lookup on two other, less significant columns, and I just rewrote that XQuery off the top of my head), but that's the gist of it; a sketch of the batching loop follows below. I've got an index on the Date column, so that's not really the big issue.
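
(The batching loop looks roughly like this; a sketch, with the real WHERE conditions simplified and an illustrative cutoff date:)
code:
-- Sketch of the batched scrub described above.
DECLARE @CutoffDate DATETIME;
SET @CutoffDate = DATEADD(month, -6, GETDATE());  -- illustrative cutoff

WHILE 1 = 1
BEGIN
	-- Checking for a text() node (rather than the bare element) should keep
	-- already-scrubbed rows out of later batches, since the emptied element
	-- has no text node left after the replace.
	UPDATE TOP (10000) LargeTable
	SET XmlColumn.modify('replace value of (/ElementA/ElementB/ElementToBeWipedOut/text())[1] with ""')
	WHERE Date <= @CutoffDate
	  AND XmlColumn.exists('/ElementA/ElementB/ElementToBeWipedOut/text()') = 1;

	IF @@ROWCOUNT = 0 BREAK;  -- nothing left to scrub
END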

The query as it stands is just too slow. I'm batching in groups of 10,000 records, and one batch of 10,000 takes about 15 minutes to run. That isn't acceptable, because this code has to run alongside several other routine maintenance and cleanup operations inside a short window of scheduled downtime. Outside that window, the table requires high availability, and I can't tie it up scrubbing large numbers of rows.

I've placed an XML index on the table to try to speed up the .exists() check, but to create it I had to change the existing primary key from a non-clustered index to a clustered one (an XML index requires a clustered primary key), and I've been reading that clustered indexes on GUID columns are a bad idea when the GUIDs come from newid(). I can move to newsequentialid() going forward, but it'll be several years before the old random values are all gone. Still, if an XML index is necessary for any sort of decent speed, it seems like I'm stuck with this.
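
(For reference, that's a primary XML index, along these lines; the index name is made up:)
code:
-- Sketch: the primary XML index described above.
-- This is what requires the table's primary key to be clustered.
CREATE PRIMARY XML INDEX PXML_LargeTable_XmlColumn
ON LargeTable (XmlColumn);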

I'm out of ideas for what I can do to speed this up. It's unacceptably slow now, but if this is as fast as it's going to get without significantly restructuring the table somehow, then it's all I can do. I'm just hoping there's some avenue I've missed that would increase performance.

I do have some options for changing the behavior to something less costly, but I'd first like to be sure I haven't missed a way to make the current approach fast enough, especially since all of those other options mean a system-impact analysis to see how the other parts of the application riding on this database would react.

Any thoughts?

thefncrow
Mar 14, 2001

Jargon posted:

Ok this worked:

SELECT scenes.id, count(*) as popularity FROM "scenes" JOIN line_items ON line_items.scene_id = scenes.id GROUP BY scenes.id ORDER BY popularity DESC LIMIT 10

It returned a list of scene IDs and the popularity. However, I'd like to return scenes.* if possible since I have to get that information anyway and I'd like to avoid doing two queries. I tried this:

SELECT scenes.*, count(*) as popularity FROM "scenes" JOIN line_items ON line_items.scene_id = scenes.id GROUP BY scenes.id ORDER BY popularity DESC LIMIT 10

And I got this error:

ERROR: column "scenes.filename_base" must appear in the GROUP BY clause or be used in an aggregate function


So it does seem like every field has to be either in the GROUP BY or in an aggregate function. I'm not really sure how I would do that, though.

I don't know Postgres, and I'm coming at this from a SQL Server 2005 background, so I can't claim to know that this works, but it's what I'd suggest.

Do the GROUP BY on the line_items table only, and then LEFT OUTER JOIN scenes to that grouped result. Something like:
code:
SELECT Scenes.*, ISNULL(SC.ItemCount, 0) AS 'SceneCount'
FROM Scenes
LEFT OUTER JOIN 
(
  SELECT scene_id, count(*) AS ItemCount
  FROM line_items 
  GROUP BY scene_id
) SC ON SC.scene_id = Scenes.ID
The wrinkle is that the inner query only includes scenes that actually appear in line_items, so every row it produces has a count of at least 1. That's why it's a LEFT OUTER JOIN, which keeps all scenes, and why you want your Postgres equivalent of ISNULL() to turn the NULL into a 0 count for scenes with no line_items rows.
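
(I believe COALESCE is the standard-SQL equivalent of ISNULL, so the Postgres-flavored version would be something like this sketch:)
code:
-- Sketch: same idea in Postgres syntax; COALESCE replaces ISNULL.
SELECT scenes.*, COALESCE(sc.item_count, 0) AS popularity
FROM scenes
LEFT OUTER JOIN
(
  SELECT scene_id, count(*) AS item_count
  FROM line_items
  GROUP BY scene_id
) sc ON sc.scene_id = scenes.id
ORDER BY popularity DESC
LIMIT 10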

I know SQL Server is also picky like this, where a GROUP BY query requires that every column in the select list be either in the GROUP BY or inside an aggregate.

Your other option, since the grouping here is really on one unique key (I'm assuming you don't have multiple scenes with the same ID), is to defeat the restriction with aggregate functions: select MAX(scenes.filename_base), for example, which satisfies the rule because it's an aggregate, and since each group holds only one distinct filename_base value (just repeated across the joined rows), MAX() hands that value straight back.
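
(Applied to your query, that looks something like this sketch, with one MAX() per extra column you want back:)
code:
-- Sketch: the MAX() trick on the original query; add a MAX() per additional column.
SELECT scenes.id, MAX(scenes.filename_base) AS filename_base, count(*) AS popularity
FROM scenes
JOIN line_items ON line_items.scene_id = scenes.id
GROUP BY scenes.id
ORDER BY popularity DESC
LIMIT 10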

Again, though, I haven't used Postgres, so my suggestions are probably best taken with a strong dose of documentation.

thefncrow fucked around with this message at 18:24 on Sep 22, 2008

thefncrow
Mar 14, 2001

Begby posted:

Secondly, is @Roles a string? If so, can't you do AND USER_ROLE_ID IN ( @Roles )? That alone should save some processing time, you would think. I'm no SQL Server expert, but I would assume that parse function gets run for every record.

SQL Server will let you write "column IN ( @StringVar )", but when you do, it's equivalent to "column = @StringVar". If you have a comma-delimited list, you have to write something that parses out each item individually, which is what he's doing there. A typical approach is a table-valued split function, sketched below.
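
(A sketch of that kind of function; the names are made up, not his actual code:)
code:
-- Sketch: a typical table-valued split function for a comma-delimited list.
CREATE FUNCTION dbo.SplitList (@List VARCHAR(MAX), @Delim CHAR(1))
RETURNS @Items TABLE (Item VARCHAR(255))
AS
BEGIN
	DECLARE @Pos INT;
	WHILE LEN(@List) > 0
	BEGIN
		SET @Pos = CHARINDEX(@Delim, @List);
		IF @Pos = 0
		BEGIN
			-- No delimiter left: the remainder is the last item.
			INSERT INTO @Items (Item) VALUES (LTRIM(RTRIM(@List)));
			SET @List = '';
		END
		ELSE
		BEGIN
			INSERT INTO @Items (Item) VALUES (LTRIM(RTRIM(LEFT(@List, @Pos - 1))));
			SET @List = SUBSTRING(@List, @Pos + 1, LEN(@List));
		END
	END
	RETURN;
END
Used like "AND USER_ROLE_ID IN (SELECT Item FROM dbo.SplitList(@Roles, ','))".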

thefncrow
Mar 14, 2001
I'm working in a SQL Server 2005 environment here.

I have a large data table with a three-part primary key, something like (CustomerNumber, BillID, LineItem). Changes to this table are difficult at best, because it's large and partitioned.

I've got another table now where I'm working with the line items for a given (CustomerNumber, BillID) combination. Basically, this table has an identity int column, CustomerNumber, BillID, and some XML data about those line items plus some associated data for each. What I want is a constraint on this table so that, for any inserted row, the (CustomerNumber, BillID) combination matches a record in the larger table.

I can't do it as a foreign key because of the third part of the key (and the PK without the third part is not unique per record). How can I enforce this? It seems like I'm missing something obvious, but I can't see it.

thefncrow
Mar 14, 2001

Sergeant Hobo posted:

I think you're looking for a T-SQL trigger. Basically, you would run a query on the 3-key table to search for the specific CustomerNumber and BillID combo coming into the database: if it exists in the 3-key table, you go ahead with the INSERT, and if not, you throw back some kind of error message. Sorry, but I'm not familiar with T-SQL; the limited experience I have with this stuff has been through Oracle's PL/SQL.

Yeah, I was just thinking about a trigger, and I think that's the way I'm going to go at this point. It just didn't occur to me at first because I kept thinking there was some sort of constraint I could apply to the table that would do the trick, but it'll just have to be a trigger, something like the sketch below.
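
(A sketch of what I have in mind; the table names are stand-ins:)
code:
-- Sketch: reject inserts/updates whose (CustomerNumber, BillID) has no match
-- in the large table. LineItemXml and BigTable are made-up names.
CREATE TRIGGER trg_LineItemXml_CheckBill
ON LineItemXml
AFTER INSERT, UPDATE
AS
BEGIN
	IF EXISTS (
		SELECT 1
		FROM inserted i
		WHERE NOT EXISTS (
			SELECT 1
			FROM BigTable b
			WHERE b.CustomerNumber = i.CustomerNumber
			  AND b.BillID = i.BillID
		)
	)
	BEGIN
		RAISERROR('No matching (CustomerNumber, BillID) in the bill table.', 16, 1);
		ROLLBACK TRANSACTION;
	END
END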

Thanks.

thefncrow
Mar 14, 2001
I'm really at a loss as to how something like this might be done without serious overhead, so I thought I'd lob it out here and see if anyone has a solid solution.

In SQL Server 2005, I'm working with a table that has a primary key of sequential GUIDs called ID and a ParentID column holding the key GUID of the item a given row refers to; ParentID is NULL for root items. There's also a DateProcessed datetime recording when the record was added to the table. These parent references chain, so that, for example, one root item might have two children, nodes 2 and 3, and then nodes 4 and 5 as children of nodes 2 and 3, respectively. The table is very large, 200,000+ records, and each grouping is relatively small (most groups don't reach 5 records, though they could), giving me a huge number of these groups.

Given this, I want to return just the most recently inserted leaf node of each such grouping in the entire table, capped at the 1000 most recent records.

The best I've been able to come up with relies on the fact that, with newsequentialid(), items inserted more recently have greater GUID values than earlier ones. Based on that, I came up with this:

code:
WITH ItemTree AS 
(
	SELECT ID, ROW_NUMBER() OVER (ORDER BY ID ASC) AS ItemGroup 
	FROM Items 
	WHERE ParentID IS NULL 

		UNION ALL

	SELECT I1.ID, I2.ItemGroup 
	FROM Items I1
	INNER JOIN ItemTree I2 ON I1.ParentID = I2.ID
)
SELECT TOP 1000 * 
FROM Items 
WHERE ID IN 
(
	-- MAX() won't take a uniqueidentifier directly, hence the BINARY(16) conversion
	SELECT MAX(CONVERT(BINARY(16), ID))
	FROM ItemTree 
	GROUP BY ItemGroup
)
ORDER BY DateProcessed DESC
This gives me the results I want, but the query takes six minutes to execute, and it's supposed to feed an ASP.NET web application, so that's not a usable solution.

If I move the TOP 1000 inside the CTE, onto the root (anchor) expression, I get incredible performance gains, as you might expect (sketched below). However, the next step is to add some additional WHERE clauses to the outer query to provide the ability to search for certain items, and that filtering needs to happen outside the CTE; otherwise I'm just searching inside that TOP 1000 set instead of searching the whole set and placing a 1000-record limit on the results.
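
(The fast-but-wrong variant, for reference; a sketch of moving the TOP into the anchor member:)
code:
-- Sketch: TOP 1000 moved into the anchor member. Fast, but any WHERE clause
-- on the outer query now only searches within these 1000 newest groups.
WITH ItemTree AS 
(
	SELECT ID, ItemGroup
	FROM (
		SELECT TOP 1000 ID, ROW_NUMBER() OVER (ORDER BY ID ASC) AS ItemGroup
		FROM Items
		WHERE ParentID IS NULL
		ORDER BY ID DESC  -- newest roots first, leaning on the sequential GUIDs
	) Roots

		UNION ALL

	SELECT I1.ID, I2.ItemGroup 
	FROM Items I1
	INNER JOIN ItemTree I2 ON I1.ParentID = I2.ID
)
SELECT TOP 1000 * 
FROM Items 
WHERE ID IN 
(
	SELECT MAX(CONVERT(BINARY(16), ID)) 
	FROM ItemTree 
	GROUP BY ItemGroup
)
ORDER BY DateProcessed DESC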

Is there an easier way to do this that I'm just not seeing?

thefncrow
Mar 14, 2001

Saltin posted:

I have a simple one. I'm not much of a SQL fellow, as you can probably guess from this question!

I have a view with a list of companies. Some companies have parent companies.

dbo.company_view

dbo.company_view.company_id = a numeric ID (unique)
dbo.company_view.company_name = the common name of the company
dbo.company_view.parent_company_id = the numeric ID of the parent (this keys off company_id)

All I want to do is take the parent_company_id, look up the company_name, and spit that out in a new column named "parent_company_name_proper" or something like that.

Any help would be greatly appreciated!

If you just want the name of the parent company, then what I'd suggest is to do something like this:

code:
SELECT CV1.company_id, CV1.company_name, CV2.company_name AS parent_company_name
FROM dbo.company_view CV1
INNER JOIN dbo.company_view CV2 ON CV1.parent_company_id = CV2.company_id
Slap any WHERE clauses you might be using onto the end of that.
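
(One thing to watch, depending on the data: with the INNER JOIN, companies whose parent_company_id is NULL drop out entirely. If you want those rows too, with a NULL parent name, a LEFT JOIN variant does it:)
code:
-- Sketch: LEFT JOIN variant that keeps companies with no parent.
SELECT CV1.company_id, CV1.company_name, CV2.company_name AS parent_company_name
FROM dbo.company_view CV1
LEFT OUTER JOIN dbo.company_view CV2 ON CV1.parent_company_id = CV2.company_id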

thefncrow
Mar 14, 2001
I actually got this solved late on Friday, but I appreciate the help.

Nurbs posted:

I'm not up on performance, but couldn't you do something like this?

SELECT TOP 1000 * FROM Items WHERE ParentId IS NOT NULL AND Id NOT IN (SELECT ParentId FROM Items) ORDER BY DateProcessed

code:
WITH ParentIds AS
(
  SELECT DISTINCT ParentId FROM Items ORDER BY ParentId --Maybe order it descending?
)
SELECT TOP 1000 * FROM Items WHERE ParentId IS NOT NULL AND Id NOT IN ParentIds
Following your use of the WITH command...

That would pull back all leaf nodes in the tree. Imagine Root -> Child -> (Grandchild1, Grandchild2): both grandchildren have ID values that never appear as ParentID values, so both would be returned, instead of just the more recent of the two.

Jethro posted:

I don't know if this will be any faster, but why not have the ItemGroup in your CTE just be the ID of the top parent item? That way you don't have to do the analytic function.

This was actually part of my eventual solution.

I went with the group item as the ID value of the root node and, sloppy as it is, I flattened out the CTE to rid myself of it entirely. That meant losing recursion and going to a fixed number of levels, but based on what the Items table does, things would already get weird if we ever went beyond 3 levels of tree depth. So I compromised, went with 3 levels of depth instead of building it recursively, and ended up with something like
code:
SELECT TOP 1000 *
FROM Items
WHERE ID IN 
(
	-- Deepest node present in each root's chain; MAX() needs the BINARY(16)
	-- conversion since it won't take a uniqueidentifier directly.
	SELECT MAX(CONVERT(BINARY(16), ISNULL(I3.ID, ISNULL(I2.ID, I1.ID))))
	FROM Items I1
	LEFT OUTER JOIN Items I2 ON I2.ParentID = I1.ID
	LEFT OUTER JOIN Items I3 ON I3.ParentID = I2.ID
	WHERE I1.ParentID IS NULL
	GROUP BY I1.ID
)
That actually turns out to be a near-instant query.

thefncrow
Mar 14, 2001
A question for you guys. I've been trying to Google this, but I keep getting results that don't match what I'm looking for.

I'm on SQL Server 2005 with an insert where I need to get back the new row's ID. I know about SCOPE_IDENTITY(), but the ID column on this table is a GUID filled in by a default of newid(), and there is no identity column on the table. My first attempt was a query like this:

code:
INSERT INTO table (col1, col2, col3)
OUTPUT Inserted.ID
VALUES (@col1, @col2, @col3)
and stuff that returned value into a variable. I know you can do "OUTPUT Inserted.ID INTO @tableVariable", but I wasn't looking to use a table variable, just a scalar variable. And yet it doesn't seem to be possible: "SELECT @scalarVar = (insert here)" doesn't work, and neither does "OUTPUT @scalarVar = Inserted.ID".

Is it just not possible to stuff the output values into a scalar variable directly, without an intermediate table variable? I'll use the table variable if I have to (see below), but it seems like there should be a way to do this without it.
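
(For the record, the table-variable version that does work is along these lines; a sketch using the placeholder names from above:)
code:
-- Sketch: OUTPUT INTO a table variable, then pull the scalar out of it.
-- col1..col3 and @col1..@col3 are the placeholders from the query above.
DECLARE @NewIds TABLE (ID UNIQUEIDENTIFIER);
DECLARE @NewId UNIQUEIDENTIFIER;

INSERT INTO [table] (col1, col2, col3)
OUTPUT Inserted.ID INTO @NewIds
VALUES (@col1, @col2, @col3);

SELECT @NewId = ID FROM @NewIds;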

thefncrow fucked around with this message at 20:43 on Aug 30, 2010
