|
NihilCredo posted:c) <query that I want> Maybe you've already addressed this, but doesn't this (as described) make some matching rows more likely to be selected than others? A row that is considered to fall immediately after a long sequence of rows disqualified by the filter is much more likely to be selected than a row that falls immediately after another matching row. I would have thought that this isn't what you want - that you want each matching row to have an equal probability of being selected.
|
# ? Mar 20, 2020 11:04 |
|
|
# ? Jun 7, 2024 22:24 |
|
Hammerite posted:Maybe you've already addressed this, but doesn't this (as described) make some matching rows more likely to be selected than others? A row that is considered to fall immediately after a long sequence of rows disqualified by the filter is much more likely to be selected than a row that falls immediately after another matching row. I would have thought that this isn't what you want - that you want each matching row to have an equal probability of being selected. You're correct, it would be an uneven distribution, but that ought to be acceptable for my use case. The table is a sort of work queue (SELECT FOR UPDATE SKIP LOCKED style), so any gaps should 'smooth out' over time.
|
# ? Mar 20, 2020 11:23 |
Use a TABLESAMPLE query but repeat it until you don't get zero results.
|
|
# ? Mar 20, 2020 11:30 |
|
NihilCredo posted:You're correct, it would be an uneven distribution, but that ought to be acceptable for my use case. The table is a sort of work queue (SELECT FOR UPDATE SKIP LOCKED style), so any gaps should 'smooth out' over time. In that case, you could use a TABLESAMPLE query to pick a random row (any row, don't put your condition in yet), then find the first row that matches your condition and has id >= your randomly selected row's id. If you get no results, then you looped, and can grab the first one in the table (with id < your randomly selected row's id, if you want to keep it to just one table scan in the case where no rows meet your condition).
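A sketch of the approach described above, with the wraparound fallback; table and column names here are assumptions, not from the original post:

```sql
-- Sketch only: assumes a table "jobs" with an indexed primary key "id"
-- and a filter "status = 'pending'". TABLESAMPLE may itself return zero
-- rows for a tiny sample; if so, just loop as suggested earlier.
WITH anchor AS (
    SELECT id FROM jobs TABLESAMPLE SYSTEM (1) LIMIT 1
)
-- First matching row at or after the random anchor...
SELECT j.id
FROM jobs AS j, anchor AS a
WHERE j.status = 'pending' AND j.id >= a.id
ORDER BY j.id
LIMIT 1;

-- ...and if that returns nothing, wrap around from the start:
-- SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1;
```

The wraparound query can also restrict to `id < a.id` to keep the worst case to one scan of the table, as described above.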
|
# ? Mar 20, 2020 15:52 |
|
NihilCredo posted:
Not unless you use something other than the built-in sequence nextval, since there can be gaps in that. quote:Because nextval and setval calls are never rolled back, sequence objects cannot be used if "gapless" assignment of sequence numbers is needed. It is possible to build gapless assignment by using exclusive locking of a table containing a counter; but this solution is much more expensive than sequence objects, especially if many transactions need sequence numbers concurrently. Just pointing this out because I get the sense it's not widely known that sequences can have gaps in general.
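The "exclusive lock on a counter table" pattern the Postgres docs allude to looks roughly like this (table and column names invented for illustration):

```sql
-- One-row counter table; the row lock taken by UPDATE serializes
-- concurrent transactions, and a rollback undoes the increment
-- (unlike nextval, which is never rolled back).
CREATE TABLE invoice_counter (last_value bigint NOT NULL);
INSERT INTO invoice_counter VALUES (0);

-- Inside each transaction that needs a gapless number:
UPDATE invoice_counter
SET last_value = last_value + 1
RETURNING last_value;
```

The cost is that every numbered transaction queues behind this one lock, which is exactly the contention the docs warn about.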
|
# ? Mar 20, 2020 17:41 |
|
NihilCredo posted:I've got a large table in Postgres with a UUID primary key.

So what's a "uuid"? You're going to get clustering over the full uuid in any case, and even the low order of the timestamp may have seasonality clustering. You probably don't want to trust the uuid generators for uniformity, so you don't want to base your selection on the uuid.

Thinking about the problem as just "random row from large table", the benefit of random ordering is that you're guaranteed a result, limited appropriately. Obviously calculated fields for each row are expensive, particularly when 0% of the results will be used. So you want something less expensive overall but still random. There's nothing in SQL that will do all of this for you, so you're looking for a stored procedure, application stuff (or maybe a recursive CTE can manage it).

You have one tool that allows you to select some rows without the full scan: a cursor. So then, set up a cursor to do the scan and then you can keep requesting rows until random < 1/numrows. If you don't know the exact row count, the table stats will be close and you can shave off 10%, or as previously suggested just loop.

Otherwise you can balance speed and uniformity using certain types of clustering (not the DB keyword). This would almost certainly require a second table for cluster counting, but there may be a tricky approach with OFFSET. You also want to read https://www.postgresql.org/docs/9.6/indexes-index-only-scans.html very carefully. If you have access to the database you can probably just pull a random location from the table pages (on disk).

Edit: You mentioned a jobs/workflow type system, so don't do what you're trying to do. In the stable state, if you have N rows removed/added in time T, you'd want to process the first N submitted between 0T and 1T, the next N from 1T to 2T, etc. If you randomly select, p=1/N is a geometric distribution. The expected average time to hit row N is still N*T, but the standard deviation (for large N) is also N*T. Your customers will wonder why their job "hasn't started in three days". After you build this they'll be asking you to implement priority queueing and such anyway, so you'll be right back to "FIFO plus a few priority levels plus maybe an 'expected runtime' delay".

PhantomOfTheCopier fucked around with this message at 23:56 on Mar 20, 2020 |
# ? Mar 20, 2020 23:24 |
|
PhantomOfTheCopier posted:Edit: You mentioned a jobs/workflow type system, so don't do what you're trying to do. In the stable state, if you have N rows removed/added in time T, you'd want to process the first N submitted between 0T and 1T, the next N from 1T to 2T, etc. If you randomly select, p=1/N is a geometric distribution. The expected average time to hit row N is still N*T, but the standard deviation (for large N) is also N*T. Your customers will wonder why their job "hasn't started in three days". After you build this they'll be asking you to implement priority queueing and such anyway, so you'll be right back to "FIFO plus a few priority levels plus maybe an 'expected runtime' delay".

I think you're on to something. My line of thought was something like 'ok, I don't need to introduce a priority system right now, as long as the retries of old messages don't outright stop the sending of new ones it's gonna be ok', but it seems defining a sufficiently random order of execution isn't much less work than defining a proper priority system. Just brainstorming, but most messages fall into one of three cases:

1) Works on the first try (the majority).
2) Fails due to a network issue or some other problem that will go away on its own fairly quickly. Should be aggressively retried until it goes through.
3) Fails due to application-level errors that won't get fixed until a human looks at it. Should not be retried too often, so as not to waste workers' time in case there is a long list of bottled-up messages.

So perhaps I could just improve the filter, which right now considers the messages that have either never been sent, or were unsuccessfully sent more than $RETRY_INTERVAL ago. Instead of: code:
code:
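One way to express those three cases in the filter is an exponential backoff keyed on the attempt count, so transient failures retry quickly while persistent ones fade into the background. A sketch only; the column names (attempt_count, last_attempt_at, sent_at) are assumptions, not the actual schema:

```sql
-- Hypothetical backoff filter: the retry interval doubles with each
-- failed attempt. Fresh messages (attempt_count = 0) are always eligible;
-- a message that has failed 10 times waits ~17 hours between retries.
SELECT id
FROM messages
WHERE sent_at IS NULL
  AND (last_attempt_at IS NULL
       OR last_attempt_at < now()
            - interval '1 minute' * power(2, attempt_count))
ORDER BY last_attempt_at NULLS FIRST
FOR UPDATE SKIP LOCKED
LIMIT 1;
```

This keeps the existing SKIP LOCKED work-queue shape while acting as a crude priority system, which is roughly where the discussion above was headed anyway.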
|
# ? Mar 21, 2020 19:06 |
|
What's the MySQL setting that controls how strict MySQL is about case sensitivity? On local MySQL installs case sensitivity is not a problem, but the DigitalOcean managed DB cluster we have powering dev/prod seems to be strict about it. Not sure if I have the power to control the DigitalOcean side of things, so maybe we need to enforce case sensitivity on local dev environments.
|
# ? Mar 24, 2020 16:06 |
Protocol7 posted:What's the MySQL setting that controls how strict MySQL is about case sensitivity? Case sensitivity of database and table names is OS dependent. If the OS doesn't care (e.g. Windows), neither does MySQL; on Linux it does. Obligatory: lol
|
|
# ? Mar 24, 2020 16:22 |
Check the collation set for database and tables. Perhaps also default collation for connection? I don't remember if that's a thing. Collations with _ci suffix are case insensitive, suffixes _cs and _bin should be case sensitive. Edit: beaten and lol holy poo poo mysql how insane can you be
|
|
# ? Mar 24, 2020 16:23 |
|
Bundy posted:Case sensitivity on database and table names is OS dependent. If the os doesn't care e.g. Windows neither does MySQL. If on Linux it will. That makes sense. Unfortunately I inherited this DB setup from an offshore team, so there are, how do I put this lightly... some, uh, procedural changes to make.
|
# ? Mar 24, 2020 16:26 |
|
lower_case_table_names=1 Or roll the dice, live dangerously, and if you're on ext4 with Linux 5.2+, that'll allow case-insensitive naming per directory
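For reference, a sketch of where that setting lives, assuming you control the server's config file at all (a managed cluster may not expose it):

```ini
# my.cnf / mysqld.cnf fragment. Note: on MySQL 8.0 this can only be set
# when the data directory is first initialized; changing it on an
# existing server is unsupported.
[mysqld]
lower_case_table_names=1
```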
|
# ? Mar 24, 2020 19:11 |
|
I dunno if it's possible to change that on a DigitalOcean managed DB cluster, so I guess the solution is to teach the offshore guys a better way to generate the DB scripts, which should be interesting. Mostly because I'm not convinced I know enough about SQL stuff to do so effectively...
|
# ? Mar 24, 2020 20:53 |
|
Trying to do a fairly straightforward trivia database where I can pull the results into a leaderboard. The results table is date, team, round, score; to future-proof a little, I don't want to assume x amount of rounds in each quiz. Ideally I'd then like to display the leaderboard as Position / Team / Round 1 / Round 2 / Round 3 / Round etc / Total. I'm trying to think of a way to bring that out in a single pull. If not, I'm sure I can do it in multiple queries; so far I have code:
Otherwise I can get the entry script to enter the total as a 'round' and just pull from there, but I might be setting myself up for further issues like that and that doesn't scream good design to me.
|
# ? Mar 26, 2020 22:46 |
|
Got a couple rows of sample data to play with? e: if each round is on a separate row, you'll probably have to pivot to get it displayed the way you want kloa fucked around with this message at 23:14 on Mar 26, 2020 |
# ? Mar 26, 2020 23:06 |
|
Yeah, this is a textbook example of a pivot query, look that up. In standard SQL you have to explicitly declare your pivot columns so you can only support a fixed number of rounds (without doing nasty hacks with dynamic SQL). You can do something like declare 100 rounds and then, at the application level, drop any column that doesn't have non-null values. Your specific SQL dialect may offer some extra help, for example Postgres has a "crosstab" function that will perform a pivot on every value in the selected rows.
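To make the "explicitly declare your pivot columns" idea concrete, here's a sketch using plain conditional aggregation, which works in most dialects including MySQL. The schema (results: quiz_date, team, round, score) is assumed from the description above:

```sql
-- Standard-SQL pivot via conditional aggregation: one explicit column
-- per round. Extend the CASE list as far as you want to support;
-- unused rounds just come back as NULL columns to drop in the app.
SELECT team,
       SUM(CASE WHEN round = 1 THEN score END) AS round_1,
       SUM(CASE WHEN round = 2 THEN score END) AS round_2,
       SUM(CASE WHEN round = 3 THEN score END) AS round_3,
       SUM(score) AS total
FROM results
WHERE quiz_date = '2020-03-26'
GROUP BY team
ORDER BY total DESC;
```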
|
# ? Mar 26, 2020 23:38 |
|
Cool, I'm used to making pivot tables in Excel so that alone should put me in the right direction, thanks both! e: just realized I didn't mention it's MySQL, which at first glance doesn't seem to support pivoting by default, but it looks like you can write queries to do the same thing at least.
|
# ? Mar 26, 2020 23:46 |
|
NihilCredo posted:Your specific SQL dialect may offer some extra help, for example Postgres has a "crosstab" function that will perform a pivot on every value in the selected rows. oh man, it's so unfair that something like this hasn't been standardised
|
# ? Mar 27, 2020 02:01 |
|
Don’t pivot. God no. A fixed pivot or dynamic SQL is the opposite of future-proofing, and both will be a pain to work with in JS (I’m assuming your front end is a website). One query. Serialize to JSON. Date, total, array of rounds. If you must, two queries: one summary and one with a rowset of rounds for the summary record. That’s how I’d approach it.
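Brainstorming what that one-query JSON shape might look like in MySQL (the asker's dialect). A sketch only, with assumed table and column names, using MySQL's JSON functions (5.7.22+):

```sql
-- One row per team, with the rounds nested as a JSON array instead of
-- pivoted into columns; the front end iterates the array, so any number
-- of rounds works without schema changes.
SELECT team,
       SUM(score) AS total,
       JSON_ARRAYAGG(JSON_OBJECT('round', round, 'score', score)) AS rounds
FROM results
WHERE quiz_date = '2020-03-26'
GROUP BY team
ORDER BY total DESC;
```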
|
# ? Mar 27, 2020 03:13 |
|
NihilCredo posted:Yeah, this is a textbook example of a pivot query, look that up. Crosstab is ok actually
|
# ? Mar 27, 2020 03:31 |
|
MS SQL normalized (mostly) straight to JSON code:
code:
code:
|
# ? Mar 27, 2020 03:54 |
|
Ruggan I learn something new every time you post
|
# ? Mar 27, 2020 14:33 |
|
quote:drop table if exists #games good lord i have been writing IF OBJECT_ID('tempdb..#games',N'U') IS NOT NULL BEGIN DROP TABLE #games END for loving ever, so thanks for that
|
# ? Mar 27, 2020 15:42 |
|
Ruggan posted:MS SQL normalized (mostly) straight to JSON Oh lord, if there's one thing I've never been able to get my head around for some reason it's JavaScript and its various offshoots. My brain and OO just don't mix for some reason, but I really appreciate the effort here, and maybe I'll use that as a jumping-off point to finally get past the mental block I'm at and not just recycle the same essential elements I get away with for every dumb project. I'm also glad the poster above was able to get something out of it too! Luckily this thing doesn't have to be good; hell, it probably doesn't even need to work, as it's most likely just going to be something for me to work on and keep my brain occupied during crummy times. My scripting can generally be described as 'it works, but at what cost?'.
|
# ? Mar 27, 2020 16:01 |
|
kumba posted:good lord i have been writing Same, good lord.
|
# ? Mar 27, 2020 16:38 |
|
Bundy posted:Case sensitivity on database and table names is OS dependent. If the os doesn't care e.g. Windows neither does MySQL. If on Linux it will.
|
# ? Mar 27, 2020 21:48 |
|
kumba posted:good lord i have been writing Nth Doctor posted:Same, good lord. The optional IF EXISTS was introduced in MS SQL Server 2016; if you're on some legacy MS SQL database, you still need to use the awful syntax kumba wrote. Also huge: CREATE OR ALTER [PROCEDURE / VIEW / FUNCTION]
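For illustration, the two modern forms mentioned above side by side (the procedure name and body are made up):

```sql
-- SQL Server 2016+: one line replaces the OBJECT_ID dance.
DROP TABLE IF EXISTS #games;
GO

-- SQL Server 2016 SP1+: no more "IF EXISTS ... ALTER ELSE CREATE"
-- boilerplate for procedures, views, and functions.
CREATE OR ALTER PROCEDURE dbo.usp_example
AS
BEGIN
    SELECT 1 AS ok;
END;
```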
|
# ? Mar 27, 2020 22:16 |
|
Not really a question, but goddamn, this is cool. https://twitter.com/thedouglane/status/1243644982889775106?s=19
|
# ? Mar 27, 2020 22:18 |
|
I'm a doofus who just spent 30 minutes trying to figure out why a query was returning 65,535 rows before figuring out it was a coincidence. Don't be like me.
|
# ? Mar 31, 2020 17:16 |
I have a large-ish table (8 million records) in SQL Server and I need to do a simple update via a translation table. I currently have: SQL code:
If I was implementing this in a flat file store in a C program, I'd scan LargeTable, look up the matching mapping row, then update. That only performs a single scan of the large table and doesn't really need an index at all. Is there any way to force SQL Server to do the update that way instead? I really only need to run this update once ever, except that the process I'm working on has a bunch of development cycles, so an extra 40 minutes of execution time isn't that much of an issue in the big picture. Mostly asking for curiosity's sake.
|
|
# ? Apr 1, 2020 11:32 |
|
nielsm posted:I have a large-ish table (8 million records) in SQL Server and I need to do a simple update via a translation table. For this, you probably want to give your table variable a clustered index on A and look for a MERGE JOIN leveraging the fact that both tables can be sorted on A/FieldA. Idk what your rdbms is, but I know with SQL Server you can force the algorithm and index usage on the query: SQL code:
Usually I don't do either of the HINTS that I used because I want the query optimizer to have discretion, but since you said this was a one time thing, I was more controlling. Lastly you could do OPTION(FORCE ORDER) at the end to force joining from @@mapping to LargeTable like you mentioned you would do. Note how I flipped the JOIN order to set that up.
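A sketch of what that hinted shape could look like; the original statement wasn't shown, so the table and column names here are invented:

```sql
-- Table variable with a PRIMARY KEY, which gives it the clustered
-- index needed for a merge join on A.
DECLARE @mapping TABLE (A int PRIMARY KEY, B int NOT NULL);

-- INNER MERGE JOIN forces the join algorithm; OPTION (FORCE ORDER)
-- makes SQL Server join from the mapping side first, as described.
UPDATE lt
SET lt.FieldB = m.B
FROM @mapping AS m
INNER MERGE JOIN dbo.LargeTable AS lt
    ON lt.FieldA = m.A
OPTION (FORCE ORDER);
```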
|
# ? Apr 1, 2020 12:25 |
Thanks, I'll try out those methods. Right now I get a Nested Loops (Inner Join) using Hash Match (Aggregate), with a table scan of the mapping table at top and an index seek on the nonclustered index of the large table at the bottom. This has an actual cost of 48% on the index seek, 19% on the nested loops, 31% on the hash match, and 1% on the final clustered index update.
|
|
# ? Apr 1, 2020 13:25 |
Okay I experimented a bit, two things made a difference: First, using a tempdb table ##mapping instead of a table var @@mapping. Second, creating the clustered index on ##mapping.A allowed the execution plan to prefer a nested loops join with a table scan of LargeTable at the top, doing a stream aggregate match on the mapping table. I didn't need any index hints or changing the join type on the update statement after those two changes. The actual cost moved to 85% table update, 9% table scan on LargeTable, 6% index seek on ##mapping, zero on the join. Execution time improved by a factor of 16, from about 12 minutes down to 45 seconds with a LargeTable dataset of 800k rows. Edit: And yes, those numbers are much closer to what I had expected the costs for the operation to be. nielsm fucked around with this message at 14:22 on Apr 1, 2020 |
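Reconstructing the shape that worked, per the description above; names are guesses based on the thread:

```sql
-- Global temp table instead of a table variable, so the optimizer has
-- real statistics to work with, plus a clustered index on the join key.
CREATE TABLE ##mapping (A int NOT NULL, B int NOT NULL);
CREATE CLUSTERED INDEX IX_mapping_A ON ##mapping (A);
-- ...populate ##mapping here...

-- With those in place, the plan settled on a scan of LargeTable driving
-- seeks into ##mapping, with no hints needed.
UPDATE lt
SET lt.FieldB = m.B
FROM dbo.LargeTable AS lt
JOIN ##mapping AS m
    ON m.A = lt.FieldA;
```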
|
# ? Apr 1, 2020 13:49 |
|
I have a TSQL LAG question: Given a table: code:
code:
|
# ? Apr 8, 2020 17:00 |
|
Agrikk posted:I have a TSQL LAG question: This sounds like a case for CASE and another LAG instead of a WHERE clause, but I haven't really used LAG very much. Let me bodge something together real quick and check my intuition. Fake edit: code:
|
# ? Apr 8, 2020 17:23 |
|
Agrikk posted:I have a TSQL LAG question: i think you can do this with a calendar table and a left join with ISNULL() around the whole score & lag calculation to return 0 for days not in table? code:
e: f,b
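The calendar-table shape described above might look roughly like this; it assumes a pre-built dbo.Calendar table with one row per day, and the Scores column names are guesses:

```sql
-- Left join from the calendar so every day appears, then wrap the whole
-- score-minus-lag calculation in ISNULL so days with no row come out 0.
SELECT c.TheDate,
       ISNULL(s.Score - LAG(s.Score) OVER (ORDER BY c.TheDate), 0) AS DailyGain
FROM dbo.Calendar AS c
LEFT JOIN Scores AS s
    ON s.ScoreDate = c.TheDate
WHERE c.TheDate BETWEEN '2020-01-01' AND '2020-12-31';
```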
|
# ? Apr 8, 2020 17:25 |
|
Worst case you could try a CTE, depending on what you are trying to accomplish. Above seems good though.
|
# ? Apr 8, 2020 18:49 |
|
Nth Doctor posted:This sounds like a case for CASE and another LAG instead of a WHERE clause, but I haven't really used LAG very much. Let me bodge something together real quick and check my intuition. You should be able to factor out the window frame (partition by etc) in the last column. You can do the case over the window frame, but I’d just do IIF(datediff(...) <24, score-lag(Score,1),0)
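A sketch of the IIF version being suggested; the schema is assumed (Scores: Player, ScoreDate, Score, one row per recorded score):

```sql
-- If the previous row for this player is more than 24 hours old (or
-- doesn't exist, in which case DATEDIFF is NULL and the condition is
-- false), report 0 instead of the score difference.
SELECT Player,
       ScoreDate,
       Score,
       IIF(DATEDIFF(HOUR,
                    LAG(ScoreDate) OVER (PARTITION BY Player ORDER BY ScoreDate),
                    ScoreDate) < 24,
           Score - LAG(Score) OVER (PARTITION BY Player ORDER BY ScoreDate),
           0) AS DailyGain
FROM Scores;
```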
|
# ? Apr 9, 2020 06:54 |
|
Another question: Given a table like so: code:
code:
code:
|
# ? Apr 17, 2020 20:31 |
|
|
|
As far as I know, there is no dialect of SQL which can do that in a single operation. Although the better way to solve that would be to have a table of: code:
|
# ? Apr 17, 2020 20:47 |