Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Ruggan posted:

ETL / Data Warehousing Stuff

Hey Ruggan, I ran your post past my company's main ETL guy and this was his take:

Dude one office over posted:

I'm not sure off hand about the performance of merge vs truncate / bulk copy.

I can provide a brief comparison to our process though. Our ETL processes all dimensions before facts. We have work tables in the data warehouse that we bulk copy to nightly.

A second step contains a few parts. It first updates the work table with dimension keys, which is accomplished by joining to the dimension tables on the business (natural) keys. Then it deletes existing data in the fact table by business (natural) key. Finally it inserts the work table data into the fact tables.

A third step truncates the work tables in preparation for the following night's ETL.

It seems that our system might be subject to all the negatives of both approaches he proposed. There is certainly significant room for improvement once we have the infrastructure in place to measure performance & record metrics over time.
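
For anyone following along, a minimal T-SQL sketch of the work-table pattern described above, with made-up table and column names (WorkSales, DimCustomer, FactSales are placeholders, not the actual schema):

SQL code:
-- 1. Resolve dimension keys on the work table by joining on the business (natural) keys
UPDATE w
SET w.CustomerKey = d.CustomerKey
FROM dbo.WorkSales AS w
JOIN dbo.DimCustomer AS d ON d.CustomerBusinessKey = w.CustomerBusinessKey;

-- 2. Delete existing fact rows for the business keys being reloaded
DELETE f
FROM dbo.FactSales AS f
WHERE EXISTS (SELECT 1 FROM dbo.WorkSales AS w WHERE w.SalesBusinessKey = f.SalesBusinessKey);

-- 3. Insert the staged rows into the fact table
INSERT INTO dbo.FactSales (SalesBusinessKey, CustomerKey, Amount)
SELECT SalesBusinessKey, CustomerKey, Amount
FROM dbo.WorkSales;

-- 4. Truncate the work table ready for the following night's load
TRUNCATE TABLE dbo.WorkSales;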


TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
I saw a RIGHT JOIN in the wild for the first time today :pwn:

Pardot
Jul 25, 2001





Oh that reminds me, I'm having to do a small project with mssql and also it's my first time using that and Windows since like XP.

I don’t really understand how localhost connections work. If I run this ruby from cmd or powershell it works fine and connects with the default user or whatever

code:
require "sequel"
require "tiny_tds"
DB = Sequel.connect(adapter: 'tinytds', host: 'localhost', database: '<mydbname>')
but if I run it from WSL, I get TinyTds::Error: Adaptive Server connection failed (localhost:1433) (Sequel::DatabaseConnectionError). I haven't been able to figure that out, and the workaround of just doing powershell.exe ruby mything.rb is okay enough, but if you or anyone with windows/mssql experience can point me in the right direction, it'd be nicer to be able to stay in WSL.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Pardot posted:

Oh that reminds me, I'm having to do a small project with mssql and also it's my first time using that and Windows since like XP.

I don’t really understand how localhost connections work. If I run this ruby from cmd or powershell it works fine and connects with the default user or whatever

code:
require "sequel"
require "tiny_tds"
DB = Sequel.connect(adapter: 'tinytds', host: 'localhost', database: '<mydbname>')
but if I run it from WSL, I get TinyTds::Error: Adaptive Server connection failed (localhost:1433) (Sequel::DatabaseConnectionError). I haven't been able to figure that out, and the workaround of just doing powershell.exe ruby mything.rb is okay enough, but if you or anyone with windows/mssql experience can point me in the right direction, it'd be nicer to be able to stay in WSL.

Shooting from the hip, but the WSL instance's network adapter might not be in the local subnet and therefore counts as an external connection? Localhost auth free connection is locked into local subnets I'm pretty sure.

downout
Jul 6, 2009

Pardot posted:

Oh that reminds me, I'm having to do a small project with mssql and also it's my first time using that and Windows since like XP.

I don’t really understand how localhost connections work. If I run this ruby from cmd or powershell it works fine and connects with the default user or whatever

code:
require "sequel"
require "tiny_tds"
DB = Sequel.connect(adapter: 'tinytds', host: 'localhost', database: '<mydbname>')
but if I run it from WSL, I get TinyTds::Error: Adaptive Server connection failed (localhost:1433) (Sequel::DatabaseConnectionError). I haven't been able to figure that out, and the workaround of just doing powershell.exe ruby mything.rb is okay enough, but if you or anyone with windows/mssql experience can point me in the right direction, it'd be nicer to be able to stay in WSL.

The latest fun tool I learned for testing mssql db connectivity is using a UDL file. It's a pretty neat trick instead of installing management studio. Not sure if it will help, but you can at least quickly try host names to see what is/isn't working.

edit: google it for details. On windows you just create a file with the .udl file extension, click on it, and it opens a connect to database menu.

edit2: depending on the ODBC data source or whatever is connecting, I've seen the same error caused by the wrong ODBC drivers, insufficient permissions for the account, wrong account credentials, the wrong host name, etc., so lots of different failures all get reported as the same error. I usually start by checking if the port responds and go from there, checking each item required to make a successful connection.
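
Another cheap check, run from the connection that does work (cmd/powershell): ask SQL Server what transport that session is actually using. If it isn't coming in over TCP, it's worth confirming the TCP protocol is even enabled, since WSL can only reach the server over TCP. A quick sketch:

SQL code:
-- Requires VIEW SERVER STATE; shows how the current session is connected
SELECT net_transport, local_net_address, local_tcp_port, client_net_address
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;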

downout fucked around with this message at 04:08 on Jan 12, 2019

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Not really sure where SSIS package questions go but here goes nothing:

I have an SSIS package that I am executing from a command line. Its purpose is to load approximately 700,000 lines of data in 17,000 files into an AWS RDS MSSQL database.

When I run it in verbose logging mode, it loads all of the lines successfully in 5100 seconds (~85 minutes).

When I run it with logging disabled, it fails after about 80 seconds with 50,000 lines of data loaded, producing the following error:

code:
Description: SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80004005.
An OLE DB record is available.  Source: "Microsoft SQL Server Native Client 11.0"  Hresult: 0x80004005  Description: "Login timeout expired".
An OLE DB record is available.  Source: "Microsoft SQL Server Native Client 11.0"  Hresult: 0x80004005  Description: "A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.".
An OLE DB record is available.  Source: "Microsoft SQL Server Native Client 11.0"  Hresult: 0x80004005  Description: "Named Pipes Provider: Could not open a connection to SQL Server [53]. ".
End Error
Since it manages to load some data in no-logging mode (MSDTE /EW) and all of the data in verbose mode (/V), it isn't the connection.

This _feels_ like some kind of connection draining or throttling issue with the database engine.

Because the CMD window scrolls all the verbose text, the SSIS package job runs slowly, allowing the database engine to process it successfully. When I turn off logging, the load rate is much higher and it seems like it is saturating something, or perhaps the rate of connects/disconnects is exhausting the maximum number of connections.

Or something.

Does anyone have any ideas on how to handle the higher throughput and allow the SSIS package to run in /EW mode?

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


What kind of task are you running? Are you literally looping through all the files in your directory individually?

I would also guess you’re getting throttled. If each file is a new connection, it might just be going too fast for your server to keep up. Simple math says that you’re averaging 40 rows per file - in verbose mode you’re doing about 200 files per minute and in no-logging about 1,000 files per minute (50k loaded / 40 per file / 1.33 minutes = approx 1k files per min).

One way to test this would be to use python or something to concatenate all your files and do a single bulk copy. It may end up being way more efficient: if each file is a separate connection, that's a lot of overhead across 17k files.
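
If you go the concatenation route, the load itself can then be one statement. Rough sketch only (path and table name are made up; note that on RDS you can't point BULK INSERT at a client-side path, so the file would need to be somewhere the instance can read, otherwise the same single-file idea works as one bcp run or one SSIS data flow from the client):

SQL code:
-- One connection, one statement, all rows from the combined file
BULK INSERT dbo.StagingLines
FROM 'D:\loads\combined.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 50000,  -- commit in chunks rather than one giant transaction
    TABLOCK             -- enables minimal logging under the right recovery model
);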

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Ruggan posted:

What kind of task are you running? Are you literally looping through all the files in your directory individually?

I would also guess you’re getting throttled. If each file is a new connection, it might just be going too fast for your server to keep up. Simple math says that you’re averaging 40 rows per file - in verbose mode you’re doing about 200 files per minute and in no-logging about 1,000 files per minute (50k loaded / 40 per file / 1.33 minutes = approx 1k files per min).

One way to test this would be to use python or something to concatenate all your files and do a single bulk copy. It may end up being way more efficient: if each file is a separate connection, that's a lot of overhead across 17k files.

That is correct. It's a loop-over-files task over literally thousands of files per day, each with approximately 1-500 lines.

I ended up putting in a step in the data flow that adds a 350msec wait to each loop and the job ran just fine. Another data point leading towards some kind of throttle.

But concatenating all of the files into a single file is a great idea to try. I'll report back.


edit: it worked like a charm. Looping over seven files is clearly more efficient than looping over thousands.


But now I am curious: What is the mechanism that is breaking and how can I fix it to boost performance?

Agrikk fucked around with this message at 07:22 on Jan 13, 2019

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Agrikk posted:

That is correct. It's a loop-over-files task over literally thousands of files per day, each with approximately 1-500 lines.

I ended up putting in a step in the data flow that adds a 350msec wait to each loop and the job ran just fine. Another data point leading towards some kind of throttle.

But concatenating all of the files into a single file is a great idea to try. I'll report back.


edit: it worked like a charm. Looping over seven files is clearly more efficient than looping over thousands.


But now I am curious: What is the mechanism that is breaking and how can I fix it to boost performance?

Nothing is breaking, you’re just treating each file import as its own sandbox. For 17k files that means you are opening and disposing of 17k separate connections. When you stuff all the data into 7 files you’re reducing that to 7. Obviously the number of rows you end up importing remains unchanged.

I don’t think it would be a simple fix in SSIS. Assuming you’re using a loop container and a nested data flow task, each file import is functionally separate and you must open a connection and close the connection each time. You need to batch it such that you’re opening a single connection and not closing it until you import all the data. Couple ways of doing this:

- feed it a single file with all the data by joining it all in code, like you’re doing
- not writing a billion files in the first place, for example by writing to the tail of a log as append only
- write your own C# script task that does this without opening a million connections and still loops through all your files (but don’t do this because it isn’t worth the technical debt)

There might be some specialized task in SSIS or some open source one you can find that would do this, but I’m not personally aware of any.

Bobcats
Aug 5, 2004
Oh
Are correlated subqueries bad practice? I’m trying to pull across a ton of tables (last events for whatever into their own column) and I see no other way to accomplish what I want to do.

Bobcats fucked around with this message at 18:58 on Jan 13, 2019

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Bobcats posted:

Are correlated subqueries bad practice? I’m trying to pull across a ton of tables (last events for whatever into their own column) and I see no other way to accomplish what I want to do.

Other people can speak to practice but they've been dirt slow when I've had to use them. I replaced a correlated subquery with a window function and dropped the query time from 8 minutes to 11 seconds.
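
For illustration, the shape of that rewrite against a hypothetical Events table (a "latest event per entity" style problem):

SQL code:
-- Correlated subquery: the inner query is logically re-evaluated for every outer row
SELECT e.EntityId, e.EventTime, e.Payload
FROM dbo.Events AS e
WHERE e.EventTime = (SELECT MAX(e2.EventTime)
                     FROM dbo.Events AS e2
                     WHERE e2.EntityId = e.EntityId);

-- Window function: rank rows per entity in one pass, then keep the top row
SELECT EntityId, EventTime, Payload
FROM (
    SELECT EntityId, EventTime, Payload,
           ROW_NUMBER() OVER (PARTITION BY EntityId ORDER BY EventTime DESC) AS rn
    FROM dbo.Events
) AS ranked
WHERE rn = 1;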

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Bobcats posted:

Are correlated subqueries bad practice? I’m trying to pull across a ton of tables (last events for whatever into their own column) and I see no other way to accomplish what I want to do.

Correlated subqueries are a good way to ensure poor performance.

Use windowed functions or aggregate subqueries instead. If you post your simplified schema and a description of what you are trying to do, I can probably give you more specific guidance.
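
The aggregate-subquery flavor of the same idea, sticking with the hypothetical Events table from the sketch above:

SQL code:
-- Aggregate once in a derived table, then join back to pick out the matching rows
SELECT e.EntityId, e.EventTime, e.Payload
FROM dbo.Events AS e
JOIN (
    SELECT EntityId, MAX(EventTime) AS LastEventTime
    FROM dbo.Events
    GROUP BY EntityId
) AS latest
    ON latest.EntityId = e.EntityId
   AND latest.LastEventTime = e.EventTime;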

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
I need help creating a SQL query that will return a cocktail from a database, given that I have supplied all the ingredients that go into that cocktail

So for example, I want the "Gin and Tonic" row to return only if I have supplied the correct ids for Gin (id is 1) and Tonic (id is 2). If I only supply "Tonic," I shouldn't get back the row

I'm using SQLAlchemy and Flask, but I'm still having trouble wrapping my head around how the query will work altogether

This is what my table structure looks like

code:
+-------------------+
| Tables_in_my_database |
+-------------------+
| cocktails         |
| ing_in_cocktails  |
| ingredients       |
+-------------------+
This is my cocktails table

code:
+----+----------------+-------+---------+
| id | name           | glass | finish  |
+----+----------------+-------+---------+
|  1 | white russisan | rocks | stirred |
|  2 | gin and tonic  | rocks | stirred |
+----+----------------+-------+---------+
This is my ingredients table

code:
+----+---------+----------+
| id | name    | ing_type |
+----+---------+----------+
|  1 | vodka   | fruit    |
|  2 | kahluha | fruit    |
|  3 | gin     | fruit    |
|  4 | tonic   | fruit    |
+----+---------+----------+
and this is my relational table

code:
+----+-------------+--------+
| id | cocktail_id | ing_id |
+----+-------------+--------+
|  1 |           1 |      1 |
|  2 |           1 |      2 |
|  3 |           2 |      3 |
|  4 |           2 |      4 |
+----+-------------+--------+
Here are the corresponding SQLAlchemy models

code:
class Cocktail(db.Model):
    __tablename__ = 'cocktails'
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80), nullable=False)
    glass = db.Column(db.String(20), nullable=False)
    finish = db.Column(db.String(20), nullable=True)
    ingredients = db.relationship(
        'Ingredient',
        secondary=ing_in_cocktails,
        backref=db.backref('cocktails', lazy='dynamic')
    )

class Ingredient(db.Model):
    __tablename__ = 'ingredients'
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80), nullable=False)
    ing_type = db.Column(db.String(20), nullable=False)


ing_in_cocktails = db.Table(
    'ing_in_cocktails',
    db.Column('id', db.Integer, primary_key=True),
    db.Column('cocktail_id', db.Integer, db.ForeignKey('cocktails.id')),
    db.Column('ing_id', db.Integer, db.ForeignKey('ingredients.id'))
)
This query got me most of the way there, but the issue here is that it's returning "Gin and Tonic" if I provide any of the ingredients in the cocktail

code:
# problematic because this returns "Gin and Tonic," despite not passing
# all the ingredients
Cocktail.query.join(ing_in_cocktails).filter(ing_in_cocktails.columns.ing_id.in_([3]))
And the query above translates to this SQL

code:
SELECT cocktails.id AS cocktails_id, cocktails.name AS cocktails_name, cocktails.glass AS cocktails_glass, cocktails.finish AS cocktails_finish
FROM cocktails INNER JOIN ing_in_cocktails ON cocktails.id = ing_in_cocktails.cocktail_id
WHERE ing_in_cocktails.ing_id IN (%(ing_id_1)s)

teen phone cutie fucked around with this message at 21:20 on Jan 13, 2019

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Ruggan posted:

Nothing is breaking, you’re just treating each file import as its own sandbox. For 17k files that means you are opening and disposing of 17k separate connections. When you stuff all the data into 7 files you’re reducing that to 7. Obviously the number of rows you end up importing remains unchanged.

I don’t think it would be a simple fix in SSIS. Assuming you’re using a loop container and a nested data flow task, each file import is functionally separate and you must open a connection and close the connection each time. You need to batch it such that you’re opening a single connection and not closing it until you import all the data. Couple ways of doing this:

- feed it a single file with all the data by joining it all in code, like you’re doing
- not writing a billion files in the first place, for example by writing to the tail of a log as append only
- write your own C# script task that does this without opening a million connections and still loops through all your files (but don’t do this because it isn’t worth the technical debt)

There might be some specialized task in SSIS or some open source one you can find that would do this, but I’m not personally aware of any.

I understand that each loop creates its own connection but what I don't understand is what limit is being hit that is causing the pool of "connection slots" to drain to zero.

I get that the number of connections to a service is finite, but what is that number and is it adjustable?

spiritual bypass
Feb 19, 2008

Grimey Drawer

TheFluff posted:

I saw a RIGHT JOIN in the wild for the first time today :pwn:

what the gently caress

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Agrikk posted:

I understand that each loop creates its own connection but what I don't understand is what limit is being hit that is causing the pool of "connection slots" to drain to zero.

I get that the number of connections to a service is finite, but what is that number and is it adjustable?

https://docs.microsoft.com/en-us/sq...sql-server-2017

But not sure that’s the exact answer. I would expect the connection to close before the next opens due to the nature of the loop.
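
If you want to see what the ceiling actually is on the instance, a rough sketch (on RDS, server-level settings like this are generally managed through parameter groups rather than sp_configure):

SQL code:
-- Hard upper limit for the instance
SELECT @@MAX_CONNECTIONS AS max_connections;

-- Configured 'user connections' (0 means "use the maximum"); it's an advanced option
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'user connections';

-- Connections currently open (requires VIEW SERVER STATE)
SELECT COUNT(*) AS current_connections FROM sys.dm_exec_connections;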

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Grump posted:

I need help creating a SQL query that will return a cocktail from a database, given that I have supplied all the ingredients that go into that cocktail

So for example, I want the "Gin and Tonic" row to return only if I have supplied the correct ids for Gin (id is 1) and Tonic (id is 2). If I only supply "Tonic," I shouldn't get back the row

I'm using SQLAlchemy and Flask, but I'm still having trouble wrapping my head around how the query will work altogether

<snip>

Are you only looking for EXACT matches, or are you looking for any recipes you can make using your list?

For example, should an input of Gin, Whiskey, and Tonic return Gin & Tonic because you could make it? Or no because that isn’t the exact recipe?

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Grump posted:

I need help creating a SQL query that will return a cocktail from a database, given that I have supplied all the ingredients that go into that cocktail

So for example, I want the "Gin and Tonic" row to return only if I have supplied the correct ids for Gin (id is 1) and Tonic (id is 2). If I only supply "Tonic," I shouldn't get back the row

I'm using SQLAlchemy and Flask, but I'm still having trouble wrapping my head around how the query will work altogether
<snip>
[/code]

Ruggan's question's a good one. This is a really hard problem with a dynamic number of parameters though. What you need to get is:
a) Any cocktail that you have at least one ingredient for;
b) All their ingredients lists;
c) Compare your input list to those lists.

Auto-generated SQL doesn't stand a chance.

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself

Ruggan posted:

Are you only looking for EXACT matches, or are you looking for any recipes you can make using your list?

For example, should an input of Gin, Whiskey, and Tonic return Gin & Tonic because you could make it? Or no because that isn’t the exact recipe?

Yes - Gin and Tonic should be returned if you included Whiskey


Moonwolf posted:

Ruggan's question's a good one. This is a really hard problem with a dynamic number of parameters though. What you need to get is:
a) Any cocktail that you have at least one ingredient for;
b) All their ingredients lists;
c) Compare your input list to those lists.

Auto-generated SQL doesn't stand a chance.

So you're saying I could do something like

1. Client asks for all liquors including Gin and Tonic
2. SQL statement returns "Gin and Tonic" and "Gin and Tonic and Lime"
3. Python compares the client's list of IDs with the IDs of each Cocktail.ingredients
4. If all the client IDs appear in each Cocktail.ingredients.id, add that cocktail to a result array
5. Return the result array to the client

???

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Let I be ingredients, R be recipes, and D be drinks.

Let @X be a table valued parameter you passed which contains your list of ingredients.

code:
-- Get count of input ingredients
DECLARE @XCt as int = (SELECT COUNT(*) FROM @X)

SELECT D.Name 
FROM D 
INNER JOIN (
    SELECT R.DrinkId, 
        Ingredients = COUNT(R.IngredientId),
        Matched = COUNT(X.IngredientId)
    FROM R
    LEFT JOIN @X X
    ON R.IngredientId = X.IngredientId
    GROUP BY R.DrinkId
    ) LIST
ON D.DrinkId = LIST.DrinkId
WHERE LIST.Ingredients = LIST.Matched
AND LIST.Matched = @XCt
Subquery gets you the count of ingredients in your drink and how many you matched. You compare those to make sure you matched all the necessary ingredients in the first where clause.

If you want to exclude recipes that you passed extra ingredients to, the second where clause statement checks to make sure you matched exactly as many as were passed. Based on your last answer you don’t want this, so comment that out.

The traditional method of solving this is to concatenate your ingredient ids with STRING_AGG and string match the aggregated lists... but that always felt like a garbage solution to me, so try what I posted above - probably more efficient. Make sure you have indexes on the appropriate columns and you have a unique key constraint on Recipes for ingredient to drink.

Phone posting, so sorry if formatting is garbo.
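
For reference, the string-matching approach mentioned above would look roughly like this (SQL Server 2017+ for STRING_AGG, same hypothetical Recipes/@X naming as above). Note it only finds exact matches, which is part of why it's a clunky fit here:

SQL code:
-- Build one ordered ingredient list per drink and one for the input, then compare them
SELECT R.DrinkId
FROM Recipes R
GROUP BY R.DrinkId
HAVING STRING_AGG(R.IngredientId, ',') WITHIN GROUP (ORDER BY R.IngredientId)
     = (SELECT STRING_AGG(X.IngredientId, ',') WITHIN GROUP (ORDER BY X.IngredientId) FROM @X X);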

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Grump posted:

Yes - Gin and Tonic should be returned if you included Whiskey


So you're saying I could do something like

1. Client asks for all liquors including Gin and Tonic
2. SQL statement returns "Gin and Tonic" and "Gin and Tonic and Lime"
3. Python compares the client's list of IDs with the IDs of each Cocktail.ingredients
4. If all the client IDs appear in each Cocktail.ingredients.id, add that cocktail to a result array
5. Return the result array to the client

???

Forums ate my query, will edit it in once I work out how to get it to let me post it.

How do people get Cloudflare to let you post SQL? It just claims it's an attack.

Moonwolf fucked around with this message at 00:14 on Jan 14, 2019

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

Moonwolf posted:

Forums ate my query, will edit it in once I work out how to get it to let me post it.

How do people get Cloudflare to let you post SQL? It just claims it's an attack.

what do you mean by "posting" sql- you mean in the post of a web request, perchance?

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Bruegels Fuckbooks posted:

what do you mean by "posting" sql- you mean in the post of a web request, perchance?

Yeah, just replying with it in a code block, like everyone else is. So that I can reply with the example query to Grump's post. Although now it works, for no obvious reason. This one should be db neutral.

code:
SELECT
	name
FROM
	cocktails JOIN
	(
	SELECT
		potential_cocktails.cocktail_id 
	FROM
		(SELECT DISTINCT
			cocktail_id
		FROM
			ing_in_cocktails
		WHERE
			ing_in_cocktails.ing_id IN (%(ing_id_1)s)) potential_cocktails JOIN
		ing_in_cocktails needed ON needed.cocktail_id = potential_cocktails.cocktail_id LEFT JOIN
		ing_in_cocktails present ON present.ing_id = needed.ing_id
	WHERE
		present.ing_id IN (%(ing_id_1)s) OR
		present.ing_id IS NULL
	GROUP BY
		potential_cocktails.cocktail_id
	HAVING
		count(present.ing_id) = count(needed.ing_id)) complete_cocktails ON complete_cocktails.cocktail_id = cocktails.id

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Simplified if you can use the in expression.

code:
SELECT R.DrinkId FROM Recipes R GROUP BY R.DrinkId HAVING COUNT(*) = COUNT(CASE WHEN R.IngredientId in (@list) THEN 1 END)

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
If you have a halfway decent database it will have set operators (INTERSECT/EXCEPT), which can make this kind of thing far easier to read:

SQL code:
SELECT *
FROM cocktails
WHERE NOT EXISTS (
    SELECT ing_in_cocktails.ingredient_id
    FROM ing_in_cocktails
    WHERE ing_in_cocktails.cocktail_id = cocktails.id
    EXCEPT
    SELECT ingredients.id
    FROM ingredients
    WHERE ingredients.id IN (%parameter_list_of_ingredient_ids_goes_here%)
)
Rather than trying to find cocktails we can make by counting the number of necessary ingredients, this works by excluding cocktails we can't make because there's at least one ingredient missing. In the NOT EXISTS subquery we take the set of necessary ingredients for a given cocktail and then subtract the set of ingredients we have available (the EXCEPT operator returns rows that are output by the left query but not by the right query). If the result is the empty set, we know we have all ingredients we need for this cocktail, and so we exclude cocktails where this set is not empty.

I use the set operators exceedingly rarely, but sometimes they're just the perfect tool for the job.

e: I have not actually looked at the execution plan for this, but I doubt it's any slower than joining on an aggregating subquery

e2: the last bit after the EXCEPT (SELECT ingredients.id FROM ingredients...) doesn't even need to be a real query, you can use a table-valued parameter instead, or a VALUES list or w/e.

TheFluff fucked around with this message at 20:53 on Jan 14, 2019

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


TheFluff posted:

If you have a halfway decent database it will have set operators (INTERSECT/EXCEPT), which can make this kind of thing far easier to read:

SQL code:
SELECT *
FROM cocktails
WHERE NOT EXISTS (
    SELECT ing_in_cocktails.ingredient_id
    FROM ing_in_cocktails
    WHERE ing_in_cocktails.cocktail_id = cocktails.id
    EXCEPT
    SELECT ingredients.id
    FROM ingredients
    WHERE ingredients.id IN (%parameter_list_of_ingredient_ids_goes_here%)
)
Rather than trying to find cocktails we can make by counting the number of necessary ingredients, this works by excluding cocktails we can't make because there's at least one ingredient missing. In the NOT EXISTS subquery we take the set of necessary ingredients for a given cocktail and then subtract the set of ingredients we have available (the EXCEPT operator returns rows that are output by the left query but not by the right query). If the result is the empty set, we know we have all ingredients we need for this cocktail, and so we exclude cocktails where this set is not empty.

I use the set operators exceedingly rarely, but sometimes they're just the perfect tool for the job.

e: I have not actually looked at the execution plan for this, but I doubt it's any slower than joining on an aggregating subquery

e2: the last bit after the EXCEPT (SELECT ingredients.id FROM ingredients...) doesn't even need to be a real query, you can use a table-valued parameter instead, or a VALUES list or w/e.

You can make that even simpler, you don't need the EXCEPT at all:
SQL code:
SELECT *
FROM cocktails
WHERE NOT EXISTS (
    SELECT ing_in_cocktails.ingredient_id
    FROM ing_in_cocktails
    WHERE ing_in_cocktails.cocktail_id = cocktails.id AND
    ing_id NOT IN (%parameter_list_of_ingredient_ids_goes_here%)
)
Of course, the actual performance of any of these would vary based on the size of any of these tables, but is largely going to be similar at any conventionally sized cocktails list.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Moonwolf posted:

You can make that even simpler, you don't need the EXCEPT at all:
SQL code:
SELECT *
FROM cocktails
WHERE NOT EXISTS (
    SELECT ing_in_cocktails.ingredient_id
    FROM ing_in_cocktails
    WHERE ing_in_cocktails.cocktail_id = cocktails.id AND
    ing_id NOT IN (%parameter_list_of_ingredient_ids_goes_here%)
)
Of course, the actual performance of any of these would vary based on the size of any of these tables, but is largely going to be similar at any conventionally sized cocktails list.

Dang, you're right. And here I was feeling all clever because I had finally found a use case for the set operators :smith:

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


I've only ever found them to be useful for reconciling data coming from two different sources that's meant to be the same, or checking that a new query/ETL flow spits out the same data as the old one when you're refactoring stuff for performance.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Moonwolf posted:

I've only ever found them to be useful for reconciling data coming from two different sources that's meant to be the same, or checking that a new query/ETL flow spits out the same data as the old one when you're refactoring stuff for performance.

Well, yeah, they're really just humble variants of (anti-)semi-joins, so they're normally not that useful since you'd just use an EXISTS() instead. Dunno why I didn't realize it wasn't necessary in this case.

More generally, problems like this one (find rows where a set of related rows all match a criteria) are usually far easier to approach if you restate them in the negative - that is, find rows where there are no related rows that fail to match the criteria.

e: maybe what we actually want is relational division :pwn:

TheFluff fucked around with this message at 21:44 on Jan 14, 2019

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself

Moonwolf posted:

You can make that even simpler, you don't need the EXCEPT at all:
SQL code:
SELECT *
FROM cocktails
WHERE NOT EXISTS (
    SELECT ing_in_cocktails.ingredient_id
    FROM ing_in_cocktails
    WHERE ing_in_cocktails.cocktail_id = cocktails.id AND
    ing_id NOT IN (%parameter_list_of_ingredient_ids_goes_here%)
)
Of course, the actual performance of any of these would vary based on the size of any of these tables, but is largely going to be similar at any conventionally sized cocktails list.

I’ll try this and some of the others when I get home thanks!

Additionally, I’m still not great at converting SQL to SQLAlchemy so any help there would be great as well

e: looks like this one did the trick!

welp here's the SQLAlchemy after lots of trial and error

code:
      query = db.session.query(Cocktail).filter(
          ~exists().where(
              and_(
                  CocktailIngredient.cocktail_id == Cocktail.id,
                  ~CocktailIngredient.ing_id.in_([15, 17])
              )
          )
      )
which translates to

code:
SELECT cocktails.id AS cocktails_id, cocktails.name AS cocktails_name, cocktails.glass AS cocktails_glass, cocktails.finish AS cocktails_finish
FROM cocktails
WHERE NOT (EXISTS (SELECT *
FROM ings_in_cocktail
WHERE ings_in_cocktail.cocktail_id = cocktails.id AND ings_in_cocktail.ing_id NOT IN (%(ing_id_1)s, %(ing_id_2)s)))
Not sure what I have to do to prevent the SELECT * FROM

teen phone cutie fucked around with this message at 03:17 on Jan 15, 2019

NinpoEspiritoSanto
Oct 22, 2013




Most DBAs will tell you ORMs suck and fast become a hindrance rather than a help. I'm one of them: Laravel's Eloquent ORM produces some utterly horrific garbage queries once relational databases are actually useful and in some degree of Normal Form.

SQLAlchemy DOES let you use things like text() or execute() in these cases, or there's https://docs.sqlalchemy.org/en/latest/core/

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
Lol wow. I wish I'd known this days ago

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Be careful with your cocktail queries because it's really easy to fall into an O(m*n*o*p*...) full Cartesian product when you're building automated ingredient selection and filtering. At the very least, make sure you test it with a few obvious and broad choices to ensure it doesn't hang for three hours. (Vodka+gin+rum-menthe-cacao) should return a non empty set, most likely.

(Twas many years ago, but if I can dig up the recipe DB performance issue I'll drop it here.)

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Grump posted:

I’ll try this and some of the others when I get home thanks!

<snip>
Not sure what I have to do to prevent the SELECT * FROM

This looks perfectly fine. What you have in the select list of an EXISTS subquery doesn't matter - it can be *, a constant, any column - because all the database is checking for is if the query will return at least one row. The contents of that row are never actually retrieved. Selecting 1 or * is common. Just don't select an aggregate (count(), etc) because those always return a row.

When people tell you ORMs are bad they usually mean the ones that have some kind of more or less magical query autogeneration syntax that usually maps rather awkwardly to SQL for all but the simplest queries - Django is one example. SQLAlchemy on the other hand belongs to a different category that could perhaps be called "query builders". jOOQ (Java) is another example from this category (probably the best one, in fact). These are more like metaprogramming tools for SQL and their syntax maps to SQL in a quite straightforward way. They are extremely useful, since SQL itself doesn't readily lend itself to modularity and reusability. SQLAlchemy is great, stick to it.

PhantomOfTheCopier posted:

Be careful with your cocktail queries because it's really easy to fall into an O(m*n*o*p*...) full Cartesian product when you're building automated ingredient selection and filtering. At the very least, make sure you test it with a few obvious and broad choices to ensure it doesn't hang for three hours. (Vodka+gin+rum-menthe-cacao) should return a non empty set, most likely.

(Twas many years ago, but if I can dig up the recipe DB performance issue I'll drop it here.)

Unless you have tens of thousands of cocktails and ingredients I don't really think there's much reason for concern.

TheFluff fucked around with this message at 18:02 on Jan 15, 2019

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
(Here be dragons...)

Well, there is, in fact. It looks like I solved this exact problem in October 2007, and recall it had been a non-starter issue prior to that because the database server wasn't powerful enough at the time to run the query. It probably lacked sufficient memory to cache all the intermediate results, but the query runtime exceeded 10min. On the newer server, it apparently started to work. This was presumably ~2000 recipes and ~150 ingredients. Direct copy/paste from the comments at the time, no proofreading; I take no responsibility :buddy::

code:
-- COST 22 B ---------------------------------------------------------

-- This is the most immediately-obvious query.
-- Notice that everything is found in turn by finding
-- items that are not within a particular sub-select
-- statement.  There are three `in' clauses.

--    4.97s real     0.01s user     0.01s system

explain analyze
select * from recipe where id in (
        select recipeid from components where recipeid not in (
                select distinct recipeid from components where ingredientid in (
                        select id from ingredient where name~*'.*vodka.*'
))) order by id;
Look familiar? While this was apparently there to support explicit "NOT x" in the search tool, the pattern is rather similar.

Here starts a rather lengthy analysis...



code:
-- COST 22 M ---------------------------------------------------------

-- This query utilizes and except clause for one of
-- the where-ins, which makes sense since both the
-- (level-2) selection and the (level-3) sub-select
-- are from the components table.

explain analyze
select * from recipe where id in (
	select recipeid from components except 
	select distinct recipeid from components where ingredientid in (
		select id from ingredient where name~*'.*vodka.*'
)) order by id;




-- COST 22 k ---------------------------------------------------------

-- This introduces a big change where the `bad'
-- recipeids are a from_item subselect, thus allowing
-- the use of a standard where conditional, namely =,
-- instead of relying on the `in' construct.  There
-- is still an internal `in' for a field in
-- components where _another_ field is in a subselect.

explain analyze
select * from recipe, (
	select recipeid from components except 
	select distinct recipeid from components where ingredientid in (
		select id from ingredient where name~*'.*vodka.*' )
) as ss
where recipe.id=ss.recipeid ;




-- COST 177  ---------------------------------------------------------

-- The internal `in' statement is changed to a
-- cartesian selection with a limiting where clause
-- (namely a components<->ingredient reference).
-- There are no `in' clauses here anymore.

explain analyze
select recipe.id, name, instructions from recipe, (
  select id from recipe except
	select distinct recipeid from components, ingredient
	  where ingredientid=ingredient.id and name~*'.*vodka.*'
) as ss
where recipe.id=ss.id order by recipe.id ;




-- COST 152  ---------------------------------------------------------

-- This removes the from_item subselect entirely,
-- instead opting to do a grand cartesian product on
-- every internal row and a single exception outside.
-- I would think this would be quite a bit slower
-- since there are two fields inside that have
-- nothing to do with limiting the data set, but it
-- is indeed a bit faster.

-- For 50 of these queries:
--     3.28s real     0.49s user     0.03s system
-- Hence for 1 of them:
--   0.0656s real     0.01s user     0.00s system
-- This is 76 times faster.
--
-- Incidentally, doing a prepared query doesn't make it much faster.

explain analyze
select id, name, instructions from recipe as r1 except
select distinct r2.id, r2.name, r2.instructions from recipe as r2, components, ingredient
	where ingredient.name~*'.*vodka.*'
	and ingredient.id=components.ingredientid
	and components.recipeid=r2.id ;
Soooo... if you're building queries at runtime and trying to convert "standard Boolean searches" into SQL, be careful how you do it. More specifically, make sure you add test cases for "a OR b", "a OR b OR c", "a AND b", "a AND b AND c", "NOT a", and so forth, before you're convinced you have support for the feature.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

PhantomOfTheCopier posted:

(Here be dragons...)

Well, there is, in fact. It looks like I solved this exact problem in October 2007, and recall it had been a non-starter issue prior to that because the database server wasn't powerful enough at the time to run the query. It probably lacked sufficient memory to cache all the intermediate results, but the query runtime exceeded 10min. On the newer server, it apparently started to work. This was presumably ~2000 recipes and ~150 ingredients. Direct copy/paste from the comments at the time, no proofreading; I take no responsibility :buddy::

<snip>

This probably isn't much help to 2007 you, but since that seemed a bit out there, I took a pair of tables with a few hundred thousand rows each in one of my test databases (running on Heroku's cheapest paid standard-postgres instance with 4GB RAM) and wrote a very similar query, and sure enough it takes 20 seconds to run with a ~140k rows result set and the plan cost is in the millions. However, when I tried rewriting it with EXISTS instead of using tbl.id IN(subquery) like a barbarian, the cost and the run time were reduced by about two orders of magnitude. For simple queries, the db will usually be able to optimize a id IN(subquery) into the same plan as a semi-join, but not always.

e: actually it might be the NOT IN that's the culprit: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/
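
For what it's worth, the EXISTS-style rewrite being alluded to, applied to the 2007 vodka example, would be something like this (untested sketch, keeping the original's "must have at least one component" behaviour):

SQL code:
explain analyze
select *
from recipe r
where exists (select 1 from components c0 where c0.recipeid = r.id)
  and not exists (
        select 1
        from components c
        join ingredient i on i.id = c.ingredientid
        where c.recipeid = r.id
          and i.name ~* '.*vodka.*')
order by id;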

TheFluff fucked around with this message at 20:44 on Jan 15, 2019

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Agreed, that NOT IN optimization issue is something I had forgotten about and it definitely was a thing; it's also possible/probable that there were query planner improvements that made the 5sec version start working.

I recommend careful testing when things like EXCEPT start appearing, because their performance varies a bit between DBs and even versions.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
If I have a table with two columns called Character and Integer, where each Character can have multiple Integers and pairings can repeat within the table:

code:
A 1
A 2
A 2
A 3
A 4
B 1
B 1
B 2
How can I return a set that shows, for each Character, the most frequent Integer?


code:
A  2
B  1

Pardot
Jul 25, 2001




Try distinct on

SQL code:
select distinct on (chr) chr, num
from (select *, count(*) from farts group by chr, num) butts 
order by chr, count desc;
https://www.db-fiddle.com/f/avV6svsDdSYp2zTTB7z87P/0
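
DISTINCT ON is Postgres-only, so if this needs to run on the MSSQL side as well, the same idea with a window function (sketch, reusing the farts/chr/num names from the fiddle):

SQL code:
-- Count each (chr, num) pair, rank within each chr by frequency, keep the most frequent
SELECT chr, num
FROM (
    SELECT chr, num,
           ROW_NUMBER() OVER (PARTITION BY chr ORDER BY COUNT(*) DESC) AS rn
    FROM farts
    GROUP BY chr, num
) AS ranked
WHERE rn = 1;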


Vegetable
Oct 22, 2010

How do you guys tend to use WITH? I used to have giant nested subqueries. Now I’m moving towards having no nested subqueries at all and just doing:

WITH
a as (select xx from table)
,b as (select xx from a)
,c as (select xx from table)
SELECT b.xx, c.xx from b LEFT JOIN c

Does it make sense to break up the query into a and b like that? Or should I still use *some* nesting?
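
For what it's worth, a tiny concrete version of that CTE-chain pattern with the join condition spelled out (tables are made up):

SQL code:
WITH
orders_2018 AS (
    SELECT customer_id, order_total FROM orders WHERE order_date >= '2018-01-01'
),
totals AS (
    SELECT customer_id, SUM(order_total) AS spend FROM orders_2018 GROUP BY customer_id
),
names AS (
    SELECT customer_id, customer_name FROM customers
)
SELECT n.customer_name, t.spend
FROM totals t
LEFT JOIN names n ON n.customer_id = t.customer_id;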
