|
Ruggan posted:ETL / Data Warehousing Stuff Hey Ruggan, I ran your post past my company's main ETL guy and this was his take: Dude one office over posted:I'm not sure offhand about the performance of merge vs truncate / bulk copy.
|
# ? Jan 11, 2019 16:52 |
|
I saw a RIGHT JOIN in the wild for the first time today
|
# ? Jan 11, 2019 17:41 |
|
Ruggan posted:MS SQL Oh that reminds me, I'm having to do a small project with mssql and also it's my first time using that and Windows since like XP. I don’t really understand how localhost connections work. If I run this ruby from cmd or powershell it works fine and connects with the default user or whatever code:
|
# ? Jan 11, 2019 23:23 |
|
Pardot posted:Oh that reminds me, I'm having to do a small project with mssql and also it's my first time using that and Windows since like XP. Shooting from the hip, but the WSL instance's network adapter might not be in the local subnet and therefore counts as an external connection? Localhost auth free connection is locked into local subnets I'm pretty sure.
|
# ? Jan 11, 2019 23:39 |
|
Pardot posted:Oh that reminds me, I'm having to do a small project with mssql and also it's my first time using that and Windows since like XP. The latest fun tool I learned for testing mssql db connectivity is using a UDL file. It's a pretty neat trick instead of installing management studio. Not sure if it will help, but you can at least quickly try host names to see what is/isn't working. edit: google it for details. On windows you just create a file with the .udl file extension, click on it, and it opens a connect to database menu. edit2: depending on ODBC data source or whatever is connecting I've seen the same error due to the wrong ODBC drivers, insufficient permission for the account, account credentials wrong, wrong host name, etc, so lots of different failures all reported as the same error. I usually start with checking if the port responds and go from there checking each item required to make a successful connection. downout fucked around with this message at 04:08 on Jan 12, 2019 |
# ? Jan 12, 2019 03:54 |
|
Not really sure where SSIS package questions go but here goes nothing: I have an SSIS package that I am executing from a command line. Its purpose is to load approximately 700,000 lines of data in 17,000 files into an AWS RDS MSSQL database. When I run it in verbose logging mode, it loads all of the lines successfully in 5100 seconds (~85 minutes). When I run it with logging disabled, it fails after about 80 seconds with 50,000 lines of data loaded, producing the following error: code:
This _feels_ like some kind of connection draining or throttling issue with the database engine. Because the CMD window scrolls all the verbose text, the SSIS package job runs slowly, allowing the database engine to process it successfully. When I turn off logging, the load rate is much higher and it seems like it is saturating something or perhaps the rate of connects/disconnects is killing the maximum number of connections. Or something. Does anyone have any ideas on how to allow throughput and allow the SSIS package to run in /EW mode?
|
# ? Jan 12, 2019 05:29 |
|
What kind of task are you running? Are you literally looping through all the files in your directory individually? I would also guess you’re getting throttled. If each file is a new connection, it might just be going too fast for your server to keep up. Simple math says that you’re averaging 40 rows per file - in verbose mode you’re doing about 200 files per minute and in no-logging about 1,000 files per minute (50k loaded / 40 per file / 1.33 minutes = approx 1k files per min). One way to test this would be to use python or something to concatenate all your files and do a single bulk copy. May end up being way more efficient as if each file is a separate connection, that ends up being a lot of overhead across 17k files.
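The concatenate-then-single-bulk-copy idea can be sketched in plain Python, with SQLite standing in for the RDS target (file names, the `staging` table, and the CSV layout are all invented for illustration):

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical stand-in for the thousands of flat files: three tiny CSVs.
tmp = Path(tempfile.mkdtemp())
for i in range(3):
    (tmp / f"extract_{i}.csv").write_text(f"id,val\n{i},{i * 10}\n")

# Concatenate every file's rows in memory first, then load them over a
# single connection with one executemany() instead of one connection per file.
rows = []
for path in sorted(tmp.glob("*.csv")):
    with path.open(newline="") as f:
        rows.extend((r["id"], r["val"]) for r in csv.DictReader(f))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, val INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(count)  # 3 rows, loaded in one batch over one connection
```

The same shape works with any DB-API driver: build the row list once, open one connection, insert once.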
|
# ? Jan 12, 2019 08:08 |
|
Ruggan posted:What kind of task are you running? Are you literally looping through all the files in your directory individually? That is correct. It's a loop-over-files task over literally thousands of files per day, each with approximately 1-500 lines per file. I ended up putting in a step in the data flow that adds a 350msec wait to each loop and the job ran just fine. Another data point leading towards some kind of throttle. But concatenating all of the files into a single file is a great idea to try. I'll report back. edit: it worked like a charm. Looping over seven files is clearly more efficient than looping over thousands. But now I am curious: What is the mechanism that is breaking and how can I fix it to boost performance? Agrikk fucked around with this message at 07:22 on Jan 13, 2019 |
# ? Jan 13, 2019 05:39 |
|
Agrikk posted:That is correct. It's a loop-over-files task over literally thousands of files per day, each with approximately 1-500 lines per file. Nothing is breaking, you’re just treating each file import as its own sandbox. For 17k files that means you are opening and disposing of 17k separate connections. When you stuff all the data into 7 files you’re reducing that to 7. Obviously the number of rows you end up importing remains unchanged. I don’t think it would be a simple fix in SSIS. Assuming you’re using a loop container and a nested data flow task, each file import is functionally separate and you must open a connection and close the connection each time. You need to batch it such that you’re opening a single connection and not closing it until you import all the data. Couple ways of doing this: - feed it a single file with all the data by joining it all in code, like you’re doing - not writing a billion files in the first place, for example by writing to the tail of a log as append only - write your own C# script task that does this without opening a million connections and still loops through all your files (but don’t do this because it isn’t worth the technical debt) There might be some specialized task in SSIS or some open source one you can find that would do this, but I’m not personally aware of any.
|
# ? Jan 13, 2019 18:39 |
|
Are correlated sub queries bad practice? I’m trying to pull across a ton of tables(last events for whatever into their own column) and I see no other way to accomplish what I want to do.
Bobcats fucked around with this message at 18:58 on Jan 13, 2019 |
# ? Jan 13, 2019 18:55 |
|
Bobcats posted:Are correlated sub queries bad practice? I’m trying to pull across a ton of tables(last events for whatever into their own column) and I see no other way to accomplish what I want to do. Other people can speak to practice but they've been dirt slow when I've had to use them. I replaced a correlated subquery with a window function and dropped the query time from 8 minutes to 11 seconds.
|
# ? Jan 13, 2019 20:11 |
|
Bobcats posted:Are correlated sub queries bad practice? I’m trying to pull across a ton of tables(last events for whatever into their own column) and I see no other way to accomplish what I want to do. Correlated subqueries are a good way to ensure poor performance. Use windowed functions or aggregate subqueries instead. If you post your simplified schema and a description of what you are trying to do, I can probably give you more specific guidance.
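For anyone curious what that swap looks like, here's a minimal self-contained sketch of the "last event per thing" problem in SQLite (the `events` table and its columns are made up):

```python
import sqlite3

# Toy events table: find the most recent payload per entity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (entity TEXT, ts INTEGER, payload TEXT);
    INSERT INTO events VALUES
        ('a', 1, 'old'), ('a', 2, 'new'),
        ('b', 5, 'only');
""")

# Correlated subquery version: the inner MAX() is re-evaluated per outer row.
correlated = conn.execute("""
    SELECT e.entity, e.payload
    FROM events e
    WHERE e.ts = (SELECT MAX(ts) FROM events x WHERE x.entity = e.entity)
    ORDER BY e.entity
""").fetchall()

# Window-function version: one pass, numbering rows within each partition.
windowed = conn.execute("""
    SELECT entity, payload FROM (
        SELECT entity, payload,
               ROW_NUMBER() OVER (PARTITION BY entity ORDER BY ts DESC) AS rn
        FROM events
    ) WHERE rn = 1
    ORDER BY entity
""").fetchall()

print(correlated == windowed)  # same answer, very different plans
```

Same result either way; the window version just avoids re-running the subquery once per row.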
|
# ? Jan 13, 2019 20:11 |
|
I need help creating a SQL query that will return a cocktail from a database, given that I have supplied all the ingredients that go into that cocktail So for example, I want the "Gin and Tonic" row to return only if I have supplied the correct ids for Gin (id is 1) and Tonic (id is 2). If I only supply "Tonic," I shouldn't get back the row I'm using SQLAlchemy and Flask, but I'm still having trouble wrapping my head around how the query will work altogether This is what my table structure looks like code:
code:
code:
code:
code:
code:
code:
teen phone cutie fucked around with this message at 21:20 on Jan 13, 2019 |
# ? Jan 13, 2019 21:06 |
|
Ruggan posted:Nothing is breaking, you’re just treating each file import as it’s own sandbox. For 17k files that means you are opening and disposing of 17k separate connections. When you stuff all the data into 7 files you’re reducing that to 7. Obviously the number of rows you end up importing remains unchanged. I understand that each loop creates its own connection but what I don't understand is what limit is being hit that is causing the pool of "connection slots" to drain to zero. I get that the number of connections to a service is finite, but what is that number and is it adjustable?
|
# ? Jan 13, 2019 21:36 |
|
TheFluff posted:I saw a RIGHT JOIN in the wild for the first time today what the gently caress
|
# ? Jan 13, 2019 22:55 |
|
Agrikk posted:I understand that each loop creates its own connection but what I don't understand is what limit is being hit that is causing the pool of "connection slots" to drain to zero. https://docs.microsoft.com/en-us/sq...sql-server-2017 But not sure that’s the exact answer. I would expect the connection to close before the next opens due to the nature of the loop.
|
# ? Jan 13, 2019 22:57 |
|
Grump posted:I need help creating a SQL query that will return a cocktail from a database, given that I have supplied all the ingredients that go into that cocktail Are you only looking for EXACT matches, or are you looking for any recipes you can make using your list? For example, should an input of Gin, Whiskey, and Tonic return Gin & Tonic because you could make it? Or no because that isn’t the exact recipe?
|
# ? Jan 13, 2019 23:19 |
|
Grump posted:I need help creating a SQL query that will return a cocktail from a database, given that I have supplied all the ingredients that go into that cocktail Ruggan's question's a good one. This is a really hard problem with a dynamic amount of parameters though. What you need to get is: a) Any cocktail that you have at least one ingredient for; b) All their ingredients lists; c) Compare your input list to those lists. Auto-generated SQL doesn't stand a chance.
|
# ? Jan 13, 2019 23:25 |
|
Ruggan posted:Are you only looking for EXACT matches, or are you looking for any recipes you can make using your list? Yes - Gin and Tonic should be returned if you included Whiskey Moonwolf posted:Ruggan's question's a good one. This is a really hard problem with a dynamic amount of parameters though. What you need to get is: So you're saying I could do something like 1. Client asks for all liquors including Gin and Tonic 2. SQL statement returns "Gin and Tonic" and "Gin and Tonic and Lime" 3. Python compares the client's list of IDs with the IDs of each Cocktail.ingredients 4. If all the client IDs appear in each Cocktail.ingredients.id, add that cocktail to a result array 5. Return the result array to the client ???
|
# ? Jan 13, 2019 23:46 |
|
Let I be ingredients, R be recipes, and D be drinks. Let @X be a table valued parameter you passed which contains your list of ingredients. code:
If you want to exclude recipes that you passed extra ingredients to, the second where clause statement checks to make sure you matched exactly as many as were passed. Based on your last answer you don’t want this, so comment that out. The traditional method of solving this is to concatenate your ingredient ids with STRING_AGG and string match the aggregated lists... but that always felt like a garbage solution to me so try what I posted above - probably more efficient. Make sure you have indexes on the appropriate columns and you have a unique key constraint on Recipes for ingredient to drink. Phone posting, so sorry if formatting is garbo.
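A runnable sketch of this group-and-count approach, transposed to SQLite (which has no table-valued parameters, so a temp table stands in for @X; all table names and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drinks (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE recipes (drink_id INTEGER, ingredient_id INTEGER,
                          UNIQUE (drink_id, ingredient_id));
    INSERT INTO drinks VALUES (1, 'Gin and Tonic'), (2, 'Gin Tonic Lime');
    INSERT INTO recipes VALUES (1, 1), (1, 2), (2, 1), (2, 2), (2, 3);
    -- No TVPs in SQLite, so a temp table plays the role of @X.
    CREATE TEMP TABLE supplied (ingredient_id INTEGER);
    INSERT INTO supplied VALUES (1), (2), (4);  -- gin, tonic, whiskey
""")

# A drink qualifies when every one of its recipe rows finds a match in the
# supplied list: matched count == total ingredient count for that drink.
makeable = conn.execute("""
    SELECT d.name
    FROM drinks d
    JOIN recipes r ON r.drink_id = d.id
    LEFT JOIN supplied s ON s.ingredient_id = r.ingredient_id
    GROUP BY d.id, d.name
    HAVING COUNT(s.ingredient_id) = COUNT(*)
""").fetchall()
print(makeable)  # [('Gin and Tonic',)] - the lime drink misses one
```

Adding `AND COUNT(*) = (SELECT COUNT(*) FROM supplied)` to the HAVING clause would give the exact-match variant mentioned above.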
|
# ? Jan 14, 2019 00:00 |
|
Grump posted:Yes - Gin and Tonic should be returned if you included Whiskey Forums ate my query, will edit it in once I work out how to get it to let me post it. How do people get Cloudflare to let you post SQL? It just claims it's an attack. Moonwolf fucked around with this message at 00:14 on Jan 14, 2019 |
# ? Jan 14, 2019 00:04 |
|
Moonwolf posted:Forums ate my query, will edit it in once I work out how to get it to let me post it. what do you mean by "posting" sql- you mean in the post of a web request, perchance?
|
# ? Jan 14, 2019 00:26 |
|
Bruegels Fuckbooks posted:what do you mean by "posting" sql- you mean in the post of a web request, perchance? Yeah, just replying with it in a code block, like everyone else is. So that I can reply with the example query to Grump's post. Although now it works, for no obvious reason. This one should be db neutral. code:
|
# ? Jan 14, 2019 00:49 |
|
Simplified if you can use the IN expression. code:
|
# ? Jan 14, 2019 00:58 |
|
If you have a halfway decent database it will have set operators (INTERSECT/EXCEPT), which can make this kind of thing far easier to read:SQL code:
I use the set operators exceedingly rarely, but sometimes they're just the perfect tool for the job. e: I have not actually looked at the execution plan for this, but I doubt it's any slower than joining on an aggregating subquery e2: the last bit after the EXCEPT (SELECT ingredients.id FROM ingredients...) doesn't even need to be a real query, you can use a table-valued parameter instead, or a VALUES list or w/e. TheFluff fucked around with this message at 20:53 on Jan 14, 2019 |
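Since the forums ate the actual query, here's a toy version of the EXCEPT formulation, runnable in SQLite with invented tables: a drink matches when (its recipe ingredients EXCEPT your supplied ingredients) comes back empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drinks (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE recipes (drink_id INTEGER, ingredient_id INTEGER);
    CREATE TEMP TABLE supplied (ingredient_id INTEGER);
    INSERT INTO drinks VALUES (1, 'Gin and Tonic'), (2, 'Gin Tonic Lime');
    INSERT INTO recipes VALUES (1, 1), (1, 2), (2, 1), (2, 2), (2, 3);
    INSERT INTO supplied VALUES (1), (2), (4);
""")

# Keep drinks where nothing in the recipe is missing from what's on hand.
rows = conn.execute("""
    SELECT d.name
    FROM drinks d
    WHERE NOT EXISTS (
        SELECT r.ingredient_id FROM recipes r WHERE r.drink_id = d.id
        EXCEPT
        SELECT s.ingredient_id FROM supplied s
    )
""").fetchall()
print(rows)  # [('Gin and Tonic',)]
```

The correlated compound subquery reads almost exactly like the set-theory statement of the problem, which is the whole appeal.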
# ? Jan 14, 2019 20:25 |
|
TheFluff posted:If you have a halfway decent database it will have set operators (INTERSECT/EXCEPT), which can make this kind of thing far easier to read: You can make that even simpler, you don't need the EXCEPT at all: SQL code:
|
# ? Jan 14, 2019 21:01 |
|
Moonwolf posted:You can make that even simpler, you don't need the EXCEPT at all: Dang, you're right. And here I was feeling all clever because I had finally found a use case for the set operators
|
# ? Jan 14, 2019 21:10 |
|
I've only ever found them to be useful for reconciling data coming from two different sources that's meant to be the same, or checking that a new query/ETL flow spits out the same data as the old one when you're refactoring stuff for performance.
|
# ? Jan 14, 2019 21:17 |
|
Moonwolf posted:I've only ever found them to be useful for reconciling data coming from two different sources that's meant to be the same, or checking that a new query/ETL flow spits out the same data as the old one when you're refactoring stuff for performance. Well, yeah, they're really just humble variants of (anti-)semi-joins, so they're normally not that useful since you'd just use an EXISTS() instead. Dunno why I didn't realize why it wasn't necessary in this case. More generally, problems like this one (find rows where a set of related rows all match a criteria) are usually far easier to approach if you restate them in the negative - that is, find rows where there are no related rows that fail to match the criteria. e: maybe what we actually want is relational division TheFluff fucked around with this message at 21:44 on Jan 14, 2019 |
# ? Jan 14, 2019 21:35 |
|
Moonwolf posted:You can make that even simpler, you don't need the EXCEPT at all: I’ll try this and some of the others when I get home thanks! Additionally, I’m still not great at converting SQL to SQLAlchemy so any help there would be great as well e: looks like this one did the trick! welp here's the SQLAlchemy after lots of trial and error code:
code:
teen phone cutie fucked around with this message at 03:17 on Jan 15, 2019 |
# ? Jan 14, 2019 22:27 |
Most DBAs will tell you ORMs suck and fast become a hindrance rather than a help. I'm one of them: Laravel's Eloquent ORM produces some utterly horrific garbage queries once relational databases are actually useful and in some degree of Normal Form. SQLAlchemy DOES let you use things like text() or execute() in these cases, or there's https://docs.sqlalchemy.org/en/latest/core/
|
|
# ? Jan 15, 2019 13:24 |
|
Lol wow. I wish I knew this days ago
|
# ? Jan 15, 2019 13:29 |
|
Be careful with your cocktail queries because it's really easy to fall into an O(m*n*o*p*...) full Cartesian product when you're building automated ingredient selection and filtering. At the very least, make sure you test it with a few obvious and broad choices to ensure it doesn't hang for three hours. (Vodka+gin+rum-menthe-cacao) should return a non empty set, most likely. (Twas many years ago, but if I can dig up the recipe DB performance issue I'll drop it here.)
|
# ? Jan 15, 2019 15:40 |
|
Grump posted:I’ll try this and some of the others when I get home thanks! This looks perfectly fine. What you have in the select list of an EXISTS subquery doesn't matter - it can be *, a constant, any column - because all the database is checking for is if the query will return at least one row. The contents of that row are never actually retrieved. Selecting 1 or * is common. Just don't select an aggregate (count(), etc) because those always return a row. When people tell you ORMs are bad they usually mean the ones that have some kind of more or less magical query autogeneration syntax that usually maps rather awkwardly to SQL for all but the simplest queries - Django is one example. SQLAlchemy on the other hand belongs to a different category that could perhaps be called "query builders". jOOQ (Java) is another example from this category (probably the best one, in fact). These are more like metaprogramming tools for SQL and their syntax maps to SQL in a quite straightforward way. They are extremely useful, since SQL itself doesn't readily lend itself to modularity and reusability. SQLAlchemy is great, stick to it. PhantomOfTheCopier posted:Be careful with your cocktail queries because it's really easy to fall into an O(m*n*o*p*...) full Cartesian product when you're building automated ingredient selection and filtering. At the very least, make sure you test it with a few obvious and broad choices to ensure it doesn't hang for three hours. (Vodka+gin+rum-menthe-cacao) should return a non empty set, most likely. Unless you have tens of thousands of cocktails and ingredients I don't really think there's much reason for concern. TheFluff fucked around with this message at 18:02 on Jan 15, 2019 |
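The "never select an aggregate inside EXISTS" warning is easy to demo (SQLite, throwaway table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")  # deliberately left empty

# EXISTS only asks "is there at least one row?", so the select list is
# never actually evaluated - 1, *, and a column all behave identically.
for sel in ("1", "*", "x"):
    result = conn.execute(f"SELECT EXISTS (SELECT {sel} FROM t)").fetchone()[0]
    assert result == 0  # empty table, so EXISTS is false every time

# ...but an aggregate always produces exactly one row (count = 0 here),
# so EXISTS around it is always true, even on an empty table.
always = conn.execute("SELECT EXISTS (SELECT COUNT(*) FROM t)").fetchone()[0]
print(always)  # 1, despite t having no rows
```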
# ? Jan 15, 2019 17:55 |
|
(Here be dragons...) Well, there is, in fact. It looks like I solved this exact problem in October 2007, and recall it had been a non-starter issue prior to that because the database server wasn't powerful enough at the time to run the query. It probably lacked sufficient memory to cache all the intermediate results, but the query runtime exceeded 10min. On the newer server, it apparently started to work. This was presumably ~2000 recipes and ~150 ingredients. Direct copy/paste from the comments at the time, no proofreading; I take no responsibility : code:
Here starts a rather lengthy analysis... code:
|
# ? Jan 15, 2019 18:45 |
|
PhantomOfTheCopier posted:(Here be dragons...) This probably isn't much help to 2007 you, but since that seemed a bit out there, I took a pair of tables with a few hundred thousand rows each in one of my test databases (running on Heroku's cheapest paid standard-postgres instance with 4GB RAM) and wrote a very similar query, and sure enough it takes 20 seconds to run with a ~140k rows result set and the plan cost is in the millions. However, when I tried rewriting it with EXISTS instead of using tbl.id IN(subquery) like a barbarian, the cost and the run time were reduced by about two orders of magnitude. For simple queries, the db will usually be able to optimize a id IN(subquery) into the same plan as a semi-join, but not always. e: actually it might be the NOT IN that's the culprit: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/ TheFluff fucked around with this message at 20:44 on Jan 15, 2019 |
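Alongside the performance angle in that article, NOT IN has a correctness trap worth a quick demo: a single NULL in the subquery result and it matches nothing at all. A SQLite sketch with invented tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (NULL);
""")

# NOT IN against a set containing NULL can never be true: "2 <> NULL" is
# unknown, so rows 2 and 3 silently vanish from the result.
not_in = conn.execute(
    "SELECT id FROM a WHERE id NOT IN (SELECT id FROM b) ORDER BY id"
).fetchall()

# NOT EXISTS is the anti-semi-join you (and the planner) actually want.
not_exists = conn.execute("""
    SELECT id FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
    ORDER BY id
""").fetchall()

print(not_in, not_exists)  # [] vs [(2,), (3,)]
```

Same query shape, wildly different answers once a NULL sneaks into the inner set, which is one more reason to default to NOT EXISTS.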
# ? Jan 15, 2019 20:26 |
|
Agreed, that NOT IN optimization issue is something I had forgotten about and it definitely was a thing; it's also possible/probable that there were query planner improvements that made the 5sec version start working. I recommend careful testing when things like EXCEPT start appearing, because their performance varies a bit between DBs and even versions.
|
# ? Jan 15, 2019 21:25 |
|
If I have a table with two columns called Character and Integer, and for each Character there can be multiple Integers and pairings can repeat within the table:code:
code:
|
# ? Jan 17, 2019 02:00 |
|
Try distinct on. SQL code:
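The quoted code blocks got eaten, but assuming the goal is one row per Character: DISTINCT ON is Postgres-only, and a portable GROUP BY gets the same effect whenever "which row to keep" can be phrased as an aggregate. A SQLite sketch with made-up data (keeping the smallest Integer per Character):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pairs ("Character" TEXT, "Integer" INTEGER);
    INSERT INTO pairs VALUES ('a', 3), ('a', 3), ('a', 1), ('b', 7);
""")

# Postgres: SELECT DISTINCT ON ("Character") ... ORDER BY "Character", "Integer"
# Portable equivalent: collapse each Character to one row with an aggregate.
rows = conn.execute("""
    SELECT "Character", MIN("Integer") AS "Integer"
    FROM pairs
    GROUP BY "Character"
    ORDER BY "Character"
""").fetchall()
print(rows)  # [('a', 1), ('b', 7)]
```

If the row you want can't be expressed as MIN/MAX of one column, a ROW_NUMBER() window (as in the last-event example earlier in the thread) is the portable fallback.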
|
# ? Jan 17, 2019 02:57 |
|
How do you guys tend to use WITH? I used to have giant nested subqueries. Now I’m moving towards having no nested subqueries at all and just doing: WITH a as (select xx from table) ,b as (select xx from a) ,c as (select xx from table) SELECT b.xx, c.xx from b LEFT JOIN c Does it make sense to break up the query into a and b like that? Or should I still use *some* nesting?
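That flat chain of CTEs, made concrete in SQLite (toy table and expressions are invented, and note that the LEFT JOIN needs an ON clause, which the schematic version in the post leaves out):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (xx INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# No nesting: each CTE is one pipeline stage, and later CTEs can
# reference earlier ones by name (b reads from a).
rows = conn.execute("""
    WITH a AS (SELECT xx FROM t),
         b AS (SELECT xx * 10 AS xx FROM a),
         c AS (SELECT xx FROM t WHERE xx > 1)
    SELECT b.xx, c.xx
    FROM b
    LEFT JOIN c ON c.xx * 10 = b.xx
    ORDER BY b.xx
""").fetchall()
print(rows)  # [(10, None), (20, 2), (30, 3)]
```

Reading top to bottom beats unwinding subqueries from the inside out, which is most of the argument for the flat style.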
|
# ? Jan 17, 2019 03:02 |