Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

NihilCredo posted:

c) <query that I want>

Scans the whole table starting from a random initial point (rolling over at the end); upon meeting a matching record, it stops early and returns it.

Maybe you've already addressed this, but doesn't this (as described) make some matching rows more likely to be selected than others? A row that is considered to fall immediately after a long sequence of rows disqualified by the filter is much more likely to be selected than a row that falls immediately after another matching row. I would have thought that this isn't what you want - that you want each matching row to have an equal probability of being selected.


NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Hammerite posted:

Maybe you've already addressed this, but doesn't this (as described) make some matching rows more likely to be selected than others? A row that is considered to fall immediately after a long sequence of rows disqualified by the filter is much more likely to be selected than a row that falls immediately after another matching row. I would have thought that this isn't what you want - that you want each matching row to have an equal probability of being selected.

You're correct, it would be an uneven distribution, but that ought to be acceptable for my use case. The table is a sort of work queue (SELECT FOR UPDATE SKIP LOCKED style), so any gaps should 'smooth out' over time.
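For reference, a bare-bones sketch of that work-queue pattern (Postgres flavour; the table and column names here are placeholders, not the real schema):

code:
begin;

select *
from message_queue
where status = 'pending'
order by registration_date
limit 1
for update skip locked;

-- process the row, mark it sent/failed, then
commit;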

nielsm
Jun 1, 2009



Use a TABLESAMPLE query, but repeat it until you get a non-empty result.
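Something like this, roughly (Postgres; the 1% sample rate and the status filter are placeholders):

code:
select *
from message_queue tablesample system (1)
where status = 'pending'
limit 1;
-- zero rows back just means the sampled pages had no matches: run it again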

Jabor
Jul 16, 2010

#1 Loser at SpaceChem

NihilCredo posted:

You're correct, it would be an uneven distribution, but that ought to be acceptable for my use case. The table is a sort of work queue (SELECT FOR UPDATE SKIP LOCKED style), so any gaps should 'smooth out' over time.

In that case, you could use a TABLESAMPLE query to pick a random row (any row, don't put your condition in yet), then find the first row that matches your condition and has id >= your randomly selected row's id. If you get no results, then you looped, and can grab the first one in the table (with id < your randomly selected row's id, if you want to keep it to just one table scan in the case where no rows meet your condition).
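Rough sketch of that (Postgres-ish; the filter and the 'id' column are placeholders, and note UUIDs compare fine with >=):

code:
with anchor as (
    select id from message_queue tablesample system (1) limit 1
)
select t.*
from message_queue t, anchor a
where t.status = 'pending'
  and t.id >= a.id
order by t.id
limit 1;
-- no rows? wrap around: rerun with t.id < a.id (or just take the first match in the table)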

Pardot
Jul 25, 2001




NihilCredo posted:


If I had a sequential indexed column in there, I could just pick a random number between 1 and MAX(index_column).

Not unless you use something other than the built-in sequence nextval, since there can be gaps in that

quote:

Because nextval and setval calls are never rolled back, sequence objects cannot be used if "gapless" assignment of sequence numbers is needed. It is possible to build gapless assignment by using exclusive locking of a table containing a counter; but this solution is much more expensive than sequence objects, especially if many transactions need sequence numbers concurrently.

Just pointing this out because I get the sense it's not widely known that sequences can have gaps in general.
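For example (sequence name made up):

code:
begin;
select nextval('jobs_id_seq');   -- say this returns 42
rollback;                        -- the nextval is not undone
select nextval('jobs_id_seq');   -- 43: the 42 is now a permanent gap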

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

NihilCredo posted:

I've got a large table in Postgres with a UUID primary key.
Problem found.

So what's a "uuid"? You're going to get clustering over the full uuid in any case, and even the low order of the timestamp may have seasonality clustering. You probably don't want to trust the uuid generators for uniformity, so you don't want to base your selection on the uuid.

Thinking about the problem as just "random row from large table", the benefit of random ordering is that you're guaranteed a result, limited appropriately. Obviously calculated fields for each row are expensive, particularly when 0% of the results will be used. So you want something less expensive overall but still random.

There's nothing in SQL that will do all of this for you, so you're looking for a stored procedure, application stuff, (or maybe a recursive cte can manage it). You have one tool that allows you to select some rows without the full scan: A cursor. So then, set up a cursor to do the scan and then you can keep requesting rows until random<1/numrows. If you don't know the exact row count, the table stats will be close and you can shave off 10%, or as previously suggested just loop.
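Very rough sketch of the cursor idea (plpgsql; the table name, the 'id' column, the filter, and reading the row estimate from pg_class are all assumptions):

code:
do $$
declare
    r       message_queue%rowtype;
    numrows bigint;
begin
    select greatest(reltuples::bigint, 1) into numrows
    from pg_class where relname = 'message_queue';

    -- the FOR loop is an implicit cursor: fetch matching rows until the coin flip says stop
    for r in select * from message_queue where status = 'pending' loop
        exit when random() < 1.0 / numrows;
    end loop;

    raise notice 'picked %', r.id;
end $$;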

Otherwise you can balance speed and uniformity using certain types of clustering (not the DB keyword). This would almost certainly require a second table for cluster counting, but there may be a tricky approach with OFFSET.

You also want to read https://www.postgresql.org/docs/9.6/indexes-index-only-scans.html very carefully.

If you have access to the database you can probably just pull a random location from the table pages (on disk). :buddy:



Edit: You mentioned a jobs/workflow type system, so don't do what you're trying to do. In the stable state, if you have N rows removed/added in time T, you'd want to process the first N submitted between 0T and 1T, the next N from 1T to 2T, etc. If you randomly select, p=1/N is a geometric distribution. The expected average time to hit row N is still N*T, but the standard deviation (for large N) is also N*T. Your customers will wonder why their job "hasn't started in three days". After you build this they'll be asking you to implement priority queueing and such anyway, so you'll be right back to "FIFO plus a few priority levels plus maybe an 'expected runtime' delay".

PhantomOfTheCopier fucked around with this message at 23:56 on Mar 20, 2020

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

PhantomOfTheCopier posted:

Edit: You mentioned a jobs/workflow type system, so don't do what you're trying to do. In the stable state, if you have N rows removed/added in time T, you'd want to process the first N submitted between 0T and 1T, the next N from 1T to 2T, etc. If you randomly select, p=1/N is a geometric distribution. The expected average time to hit row N is still N*T, but the standard deviation (for large N) is also N*T. Your customers will wonder why their job "hasn't started in three days". After you build this they'll be asking you to implement priority queueing and such anyway, so you'll be right back to "FIFO plus a few priority levels plus maybe an 'expected runtime' delay".

I think you're on to something.

My line of thought was something like 'OK, I don't need to introduce a priority system right now; as long as the retries of old messages don't outright stop the sending of new ones, it's gonna be OK', but it seems defining a sufficiently random order of execution isn't much less work than defining a proper priority system.

Just brainstorming, but most messages fall into one of three cases:

1) Works on the first try (the majority)
2) Fails due to a network issue or some other problem that will go away on its own fairly quickly. Should be aggressively retried until it goes through.
3) Fails due to application-level errors that won't get fixed until a human looks at it. Should not be retried too often so as not to waste workers' time in case there is a long list of bottled up messages.

So perhaps I could just improve the filter, which right now considers the messages that have either never been sent, or have been unsuccessfully sent more than $RETRY_INTERVAL ago. Instead of:

code:
  where last_attempt.attempt_id is null 
     or current_timestamp > last_attempt.start_date + @retry_interval
I could do something like:

code:
  where last_attempt.attempt_id is null 
     or current_timestamp > last_attempt.start_date + @retry_interval + coalesce(@delay_factor * (current_timestamp - message.registration_date), 'empty')		
so the retry intervals grow progressively longer until the message expires. If a human fixes a persistent error, they can always manually force the messages to be attempted immediately... though perhaps it would be better if I had a maximum interval, so that an engineer can say 'OK, I fixed the bug, I know that all the processes will be resumed by tomorrow at the latest' or something like that (if it took days to fix the bug and it would be annoying to go force the messages through manually, it can probably wait one business day).
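e.g. keeping the same parameter style (@max_retry_interval being a new, made-up knob):

code:
  where last_attempt.attempt_id is null
     or current_timestamp > last_attempt.start_date
        + least(@retry_interval + @delay_factor * (current_timestamp - message.registration_date),
                @max_retry_interval)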

Macichne Leainig
Jul 26, 2012

by VG
What's the MySQL setting that controls how strict MySQL is about case sensitivity?

On local MySQL installs the case sensitivity is not a problem, but on the DigitalOcean managed DB cluster we have powering dev/prod, it seems to be strict about case sensitivity there.

Not sure if I have the power to control the DigitalOcean side of things so maybe we need to enforce case sensitivity on local dev environments.

NinpoEspiritoSanto
Oct 22, 2013




Protocol7 posted:

What's the MySQL setting that controls how strict MySQL is about case sensitivity?

On local MySQL installs the case sensitivity is not a problem, but on the DigitalOcean managed DB cluster we have powering dev/prod, it seems to be strict about case sensitivity there.

Not sure if I have the power to control the DigitalOcean side of things so maybe we need to enforce case sensitivity on local dev environments.

Case sensitivity of database and table names is OS-dependent. If the OS doesn't care (e.g. Windows), neither does MySQL; on Linux it will.

Obligatory: lol

nielsm
Jun 1, 2009



Check the collation set for the database and tables. Perhaps also the default collation for the connection? I don't remember if that's a thing.
Collations with a _ci suffix are case-insensitive; the _cs and _bin suffixes should be case-sensitive.
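Quick way to check, if it helps (MySQL; 'mydb' / 'mytable' are placeholders):

code:
select @@collation_database, @@collation_connection;
show table status from mydb like 'mytable';   -- look at the Collation column: *_ci = case insensitive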


Edit: beaten and lol holy poo poo mysql how insane can you be

Macichne Leainig
Jul 26, 2012

by VG

Bundy posted:

Case sensitivity of database and table names is OS-dependent. If the OS doesn't care (e.g. Windows), neither does MySQL; on Linux it will.

Obligatory: lol

That makes sense. Unfortunately I inherited this DB setup from an offshore team, so there are, how do I put this lightly... some, uh, procedural changes to make.

nem
Jan 4, 2003

panel.dev
apnscp: cPanel evolved
lower_case_table_names=1

Or roll the dice, live dangerously, and if you're on ext4 with Linux 5.2+ that'll allow case-insensitive naming per directory :coolfish:
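Worth checking what the managed cluster is actually running first, since on MySQL 8.0 lower_case_table_names can only be set when the data directory is initialised, so a managed service may not let you change it at all:

code:
show variables like 'lower_case_table_names';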

Macichne Leainig
Jul 26, 2012

by VG
I dunno if it's possible to change that on a DigitalOcean managed DB cluster, so I guess the solution is to teach the offshore guys a better way to generate the DB scripts, which should be interesting. Mostly because I'm not convinced I know enough about SQL stuff to do so effectively...

EL BROMANCE
Jun 10, 2006

COWABUNGA DUDES!
🥷🐢😬



Trying to do a fairly straightforward trivia database where I can pull the results into a leaderboard.

Results table is

date
team
round
score

So as to future-proof a little, I don't want to assume a fixed number of rounds in each quiz. Ideally I'd then like to display the leaderboard as

Position / Team / Round 1 / Round 2 / Round 3 / Round etc / Total

I'm trying to think of a way to bring that out in a single pull. If not, I'm sure I can do it in multiple queries; so far I have

code:
SELECT *, sum(results.score) AS total FROM `results`,`teams` WHERE results.team = teams.id GROUP BY teams.id
which gets me the first round and the total for each team, but I can't think of a way to pull the other rounds into that line. Possible and easy? (I haven't put the sort by total in yet either, but I'm assuming the usual sort will work fine)

Otherwise I can get the entry script to enter the total as a 'round' and just pull from there, but I might be setting myself up for further issues like that and that doesn't scream good design to me.

kloa
Feb 14, 2007


Got a couple rows of sample data to play with?

e: if each round is on a separate row, you'll probably have to pivot to get it displayed the way you want

kloa fucked around with this message at 23:14 on Mar 26, 2020

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Yeah, this is a textbook example of a pivot query, look that up.

In standard SQL you have to explicitly declare your pivot columns so you can only support a fixed number of rounds (without doing nasty hacks with dynamic SQL). You can do something like declare 100 rounds and then, at the application level, drop any column that doesn't have non-null values.

Your specific SQL dialect may offer some extra help, for example Postgres has a "crosstab" function that will perform a pivot on every value in the selected rows.
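To illustrate, a fixed-column version in plain SQL against your results/teams tables (conditional aggregation, no vendor-specific PIVOT; the date literal is a placeholder, and you'd extend the CASE list to however many rounds you declare):

code:
select t.id as team,
       sum(case when r.round = 1 then r.score end) as round1,
       sum(case when r.round = 2 then r.score end) as round2,
       sum(case when r.round = 3 then r.score end) as round3,
       sum(r.score) as total
from results r
join teams t on r.team = t.id
where r.date = '2020-03-26'
group by t.id
order by total desc;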

EL BROMANCE
Jun 10, 2006

COWABUNGA DUDES!
🥷🐢😬



Cool, I'm used to making pivot tables in Excel so that alone should put me in the right direction, thanks both!

e: just realized I didn't mention it's mySQL, which at first glance doesn't seem to support pivoting by default but looks like you can write queries to do the same thing at least.

redleader
Aug 18, 2005

Engage according to operational parameters

NihilCredo posted:

Your specific SQL dialect may offer some extra help, for example Postgres has a "crosstab" function that will perform a pivot on every value in the selected rows.

oh man, it's so unfair that something like this hasn't been standardised

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Don’t pivot. God no. Fixed pivot or dynamic sql is the opposite of future proofing and both will be a pain to work with in JS (I’m assuming your front end is a website)

One query. Serialize to json. Date, total, array of rounds.

If you must, two queries. One summary and one with a rowset of rounds for the summary record.

That’s how I’d approach it.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


NihilCredo posted:

Yeah, this is a textbook example of a pivot query, look that up.

In standard SQL you have to explicitly declare your pivot columns so you can only support a fixed number of rounds (without doing nasty hacks with dynamic SQL). You can do something like declare 100 rounds and then, at the application level, drop any column that doesn't have non-null values.

Your specific SQL dialect may offer some extra help, for example Postgres has a "crosstab" function that will perform a pivot on every value in the selected rows.

Crosstab is ok actually

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


MS SQL normalized (mostly) straight to JSON

code:
drop table if exists #games
drop table if exists #teams
drop table if exists #gameTeams
drop table if exists #results

-- Create a table for games
create table #games (gameId int, gameDate date)
insert into #games values
    (1, '1/1/2020'),
    (2, '2/1/2020')

-- Create a table for teams
create table #teams (teamId int, teamName varchar(50))
insert into #teams values
    (1, 'Aged Applesauce'),
    (2, 'Big Bad Boyz'),
    (3, 'Creepy Crabs')

-- Create a table for what teams participated in which games    
create table #gameTeams (gameId int, teamId int)
insert into #gameTeams values
    (1, 1),
    (1, 2),
    (1, 3),
    (2, 1),
    (2, 2),
    (2, 3)

-- Create a table for results
create table #results (gameId int, teamId int, gameRound int, score int)
insert into #results values
    --Game 1 round 1
    (1, 1, 1, 10),
    (1, 2, 1, 12),
    (1, 3, 1, 8 ),
    --Game 1 round 2
    (1, 1, 2, 11),
    (1, 2, 2, 14),
    (1, 3, 2, 13),
    
    --Game 2 round 1
    (2, 1, 1, 10),
    (2, 2, 1, 12),
    (2, 3, 1, 8 ),
    --Game 2 round 2
    (2, 1, 2, 11),
    (2, 2, 2, 14),
    (2, 3, 2, 13),
    --Game 2 round 3
    (2, 1, 3, 11),
    (2, 2, 3, 14),
    (2, 3, 3, 13)

-- Get our data
select      gameId      =   G.gameId,
            gameDate    =   G.gameDate,
            teams       =   JSON_QUERY((
                                select      teamId      =   T.teamId,
                                            teamName    =   T.teamName,
                                            rounds      =   JSON_QUERY((
                                                                SELECT      R.gameRound,
                                                                            R.score
                                                                FROM        #results R
                                                                WHERE       R.gameId = GT.gameId
                                                                AND         R.teamId = GT.teamId
                                                                FOR JSON PATH
                                                            ))
                                FROM        #gameTeams GT
                                INNER JOIN  #teams T
                                ON          T.teamId = GT.teamId
                                WHERE       G.gameId = GT.gameId
                                FOR JSON PATH
                            ))
from        #games G
FOR JSON PATH
Output
code:
[
  {
    "gameId": 1,
    "gameDate": "2020-01-01",
    "teams": [
      {
        "teamId": 1,
        "teamName": "Aged Applesauce",
        "rounds": [
          {
            "gameRound": 1,
            "score": 10
          },
          {
            "gameRound": 2,
            "score": 11
          }
        ]
      },
      {
        "teamId": 2,
        "teamName": "Big Bad Boyz",
        "rounds": [
          {
            "gameRound": 1,
            "score": 12
          },
          {
            "gameRound": 2,
            "score": 14
          }
        ]
      },
      {
        "teamId": 3,
        "teamName": "Creepy Crabs",
        "rounds": [
          {
            "gameRound": 1,
            "score": 8
          },
          {
            "gameRound": 2,
            "score": 13
          }
        ]
      }
    ]
  },
  {
    "gameId": 2,
    "gameDate": "2020-02-01",
    "teams": [
      {
        "teamId": 1,
        "teamName": "Aged Applesauce",
        "rounds": [
          {
            "gameRound": 1,
            "score": 10
          },
          {
            "gameRound": 2,
            "score": 11
          },
          {
            "gameRound": 3,
            "score": 11
          }
        ]
      },
      {
        "teamId": 2,
        "teamName": "Big Bad Boyz",
        "rounds": [
          {
            "gameRound": 1,
            "score": 12
          },
          {
            "gameRound": 2,
            "score": 14
          },
          {
            "gameRound": 3,
            "score": 14
          }
        ]
      },
      {
        "teamId": 3,
        "teamName": "Creepy Crabs",
        "rounds": [
          {
            "gameRound": 1,
            "score": 8
          },
          {
            "gameRound": 2,
            "score": 13
          },
          {
            "gameRound": 3,
            "score": 13
          }
        ]
      }
    ]
  }
]
React JS to make it into a thing (untested but likely works in create-react-app)

code:
const Scoreboard = ({
    games, // let's assume this is the json you just got
}) => (
    <div>
        {games.map((g) => <Game key={g.gameId} game={g} />)}
    </div>
);

const Game = ({
    game,
}) => (
    <div>
        <div>{game.gameDate}</div>
        <div>{game.teams.sort((a, b) => a.rounds.reduce((prev, curr) => prev + curr.score, 0) - b.rounds.reduce((prev, curr) => prev + curr.score, 0)).map((t) => <Team key={t.teamId} team={t} />)}</div>
    </div>
);

const Team = ({
    team,
}) => (
    <div>
        <div>{team.teamName}</div>
        {team.rounds.sort((a, b) => a.gameRound - b.gameRound).map((r) => <div key={r.gameRound}>{`Round ${r.gameRound} - ${r.score}`}</div>)}
        <div>{`Total: ${team.rounds.reduce((prev, curr) => prev + curr.score, 0)}`}</div>
    </div>
);
idk you could do a lot to refactor, simplify, or better normalize the data, but that would take more time than I feel like spending this very moment - hope this helps you think about alternate approaches to the forbidden dynamic-column layout at least

Sir Bobert Fishbone
Jan 16, 2006

Beebort
Ruggan I learn something new every time you post

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

quote:

drop table if exists #games
drop table if exists #teams
drop table if exists #gameTeams
drop table if exists #results

good lord i have been writing

IF OBJECT_ID('tempdb..#games',N'U') IS NOT NULL
BEGIN DROP TABLE #games END

for loving ever, so thanks for that

EL BROMANCE
Jun 10, 2006

COWABUNGA DUDES!
🥷🐢😬



Ruggan posted:

MS SQL normalized (mostly) straight to JSON

idk you could do a lot to refactor, simplify, or better normalize the data but that would take more time that I don't feel like spending this very moment - hope this helps you think about alternate approaches to the forbidden dynamic-column layout at least

Oh lord, if there's one thing I've never been able to get my head around for some reason it's JavaScript and its various offshoots. My brain and OO just doesn't mix for some reason, but I really appreciate the effort here and maybe I'll use that as a jumping-off point to finally get past the mental block I'm at and not just recycle the same essential elements I get away with for every dumb project. I'm also glad the poster above was able to get something out of it too!

Luckily this thing doesn't have to be good, hell it probably doesn't even need to work as it's most likely just going to be something for me to work on and keep my brain occupied during crummy times. My scripting can generally be described as 'it works, but at what cost?'.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


kumba posted:

good lord i have been writing

IF OBJECT_ID('tempdb..#games',N'U') IS NOT NULL
BEGIN DROP TABLE #games END

for loving ever, so thanks for that

Same, good lord.

Chilled Milk
Jun 22, 2003

No one here is alone,
satellites in every home

Bundy posted:

Case sensitivity of database and table names is OS-dependent. If the OS doesn't care (e.g. Windows), neither does MySQL; on Linux it will.

Obligatory: lol

:psyduck:

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


kumba posted:

good lord i have been writing

IF OBJECT_ID('tempdb..#games',N'U') IS NOT NULL
BEGIN DROP TABLE #games END

for loving ever, so thanks for that

Nth Doctor posted:

Same, good lord.

the optional IF EXISTS was introduced in MS SQL Server 2016

if you're on some legacy MS SQL database you still need to use the awful syntax kumba wrote

also huge: CREATE OR ALTER [PROCEDURE / VIEW / FUNCTION]
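e.g. (dbo.usp_example is a made-up name):

code:
drop table if exists #games;                 -- SQL Server 2016+
go

create or alter procedure dbo.usp_example    -- SQL Server 2016 SP1+
as
begin
    select 1;
end;
go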

nullfunction
Jan 24, 2005

Nap Ghost
Not really a question, but goddamn, this is cool.

https://twitter.com/thedouglane/status/1243644982889775106?s=19

Just-In-Timeberlake
Aug 18, 2003
I'm a doofus who just spent 30 minutes trying to figure out why a query was returning 65,535 rows before figuring out it was a coincidence.

Don't be like me.

nielsm
Jun 1, 2009



I have a large-ish table (8 million records) in SQL Server and I need to do a simple update via a translation table.

I currently have:
SQL code:
declare @@mapping as table (A nvarchar(50), B nvarchar(50), C nvarchar(50));
insert into @@mapping values
('abc', '123', 'xxx'),
('def', '234', 'xxx'),
('ghi', '987', 'yyy');
-- about 12 rows in @@mapping

create index LargeTable_FieldA on LargeTable (FieldA);

update t set FieldA = m.B, FieldB = m.C, FieldX = m.A
from LargeTable t join @@mapping m on t.FieldA = m.A;
The execution plan I get is scanning @@mapping and for each row find all rows in LargeTable that match, then update those rows. The index helps this, but it still means seeking around LargeTable a whole bunch, and creating the index just for this update takes a good while.

If I was implementing this in a flat file store in a C program I'd be scanning LargeTable and look up the matching mapping row then update. This only performs a single scan of the large table and doesn't really need an index at all. Is there any way to force SQL Server to do the update that way instead?

I really only need to run this update once ever, except that the process I'm working on has a bunch of development cycles, so an extra 40 minutes of execution time isn't that much of an issue in the big picture. Mostly asking for curiosity's sake.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


nielsm posted:

I have a large-ish table (8 million records) in SQL Server and I need to do a simple update via a translation table.

I currently have:
SQL code:
declare @@mapping as table (A nvarchar(50), B nvarchar(50), C nvarchar(50));
insert into @@mapping values
('abc', '123', 'xxx'),
('def', '234', 'xxx'),
('ghi', '987', 'yyy');
-- about 12 rows in @@mapping

create index LargeTable_FieldA on LargeTable (FieldA);

update t set FieldA = m.B, FieldB = m.C, FieldX = m.A
from LargeTable t join @@mapping m on t.FieldA = m.A;
The execution plan I get is scanning @@mapping and for each row find all rows in LargeTable that match, then update those rows. The index helps this, but it still means seeking around LargeTable a whole bunch, and creating the index just for this update takes a good while.

If I was implementing this in a flat file store in a C program I'd be scanning LargeTable and look up the matching mapping row then update. This only performs a single scan of the large table and doesn't really need an index at all. Is there any way to force SQL Server to do the update that way instead?

I really only need to run this update once ever, except that the process I'm working on has a bunch of development cycles, so an extra 40 minutes of execution time isn't that much of an issue in the big picture. Mostly asking for curiosity's sake.
What JOIN method does your execution plan tell you is being used?

For this, you probably want to give your table variable a clustered index on A and look for a MERGE JOIN leveraging the fact that both tables can be sorted on A/FieldA. Idk what your rdbms is, but I know with SQL Server you can force the algorithm and index usage on the query:


SQL code:
declare
  @@mapping as table (
    A nvarchar(50) INDEX mapping_A CLUSTERED,
    B nvarchar(50),
    C nvarchar(50)
  );

insert into @@mapping values
('abc', '123', 'xxx'),
('def', '234', 'xxx'),
('ghi', '987', 'yyy');
-- about 12 rows in @@mapping

create index LargeTable_FieldA on LargeTable (FieldA);

update t
set
  FieldA = m.B,
  FieldB = m.C,
  FieldX = m.A
from @@mapping m WITH (INDEX(mapping_A))
MERGE join LargeTable t
WITH (INDEX(LargeTable_FieldA))
  on t.FieldA = m.A;

Usually I don't do either of the HINTS that I used because I want the query optimizer to have discretion, but since you said this was a one time thing, I was more controlling.

Lastly you could do OPTION(FORCE ORDER) at the end to force joining from @@mapping to LargeTable like you mentioned you would do. Note how I flipped the JOIN order to set that up.

nielsm
Jun 1, 2009



Thanks, I'll try out those methods.

Right now I get a Nested Loops (Inner Join) using Hash Match (Aggregate), with a table scan of the mapping table at top and an index seek on the nonclustered index of the large table at the bottom. This has an actual cost of 48% on the index seek, 19% on the nested loops, 31% on the hash match, and 1% on the final clustered index update.

nielsm
Jun 1, 2009



Okay I experimented a bit, two things made a difference:

First, using a tempdb table ##mapping instead of a table var @@mapping.
Second, creating the clustered index on ##mapping.A allowed the execution plan to prefer a nested loops join with a table scan of LargeTable at the top, doing a stream aggregate match on the mapping table.

I didn't need any index hints or changing the join type on the update statement after those two changes.

The actual cost moved to 85% table update, 9% table scan on LargeTable, 6% index seek on ##mapping, zero on the join. Execution time improved by a factor 16, from about 12 minutes down to 45 seconds with a LargeTable dataset of 800k rows.
Edit: And yes those numbers are much closer to what I had expected the costs for the operation to be.
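Roughly, the shape of that version (reusing the columns from the earlier example):

code:
create table ##mapping (A nvarchar(50), B nvarchar(50), C nvarchar(50));
create clustered index mapping_A on ##mapping (A);

insert into ##mapping values
('abc', '123', 'xxx'),
('def', '234', 'xxx'),
('ghi', '987', 'yyy');

update t set FieldA = m.B, FieldB = m.C, FieldX = m.A
from LargeTable t
join ##mapping m on t.FieldA = m.A;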

nielsm fucked around with this message at 14:22 on Apr 1, 2020

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
I have a TSQL LAG question:

Given a table:
code:
UserID   timestamp    score
UserA   2020-04-02    10000
UserA   2020-04-01     9000
UserA   2020-03-31    8000
UserA   2020-02-29    5000
I am using a LAG function to determine the difference in scores between days like so:

code:
SELECT	[UserID],
	[timestamp],
	[score],
	score - LAG(score, 1,0) over (ORDER BY (STAMP)) as PointsPerDay
  FROM table
but I cannot figure out the WHERE clause so that if the time difference between two entries is greater than 25 hours return a zero for PointsPerDay for that entry.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Agrikk posted:

I have a TSQL LAG question:

Given a table:
code:
UserID   timestamp    score
UserA   2020-04-02    10000
UserA   2020-04-01     9000
UserA   2020-03-31    8000
UserA   2020-02-29    5000
I am using a LAG function to determine the difference in scores between days like so:

code:
SELECT	[UserID],
	[timestamp],
	[score],
	score - LAG(score, 1,0) over (ORDER BY (STAMP)) as PointsPerDay
  FROM table
but I cannot figure out the WHERE clause so that if the time difference between two entries is greater than 25 hours return a zero for PointsPerDay for that entry.

This sounds like a case for CASE and another LAG instead of a WHERE clause, but I haven't really used LAG very much. Let me bodge something together real quick and check my intuition.

Fake edit:
code:
CREATE TABLE dbo.#Temp (
  UserID VARCHAR(25) NOT NULL,
  Stamp DATETIME NOT NULL,
  Score INT
);

INSERT dbo.#Temp (
  UserID,
  Stamp,
  Score
)
VALUES
('UserA',   '2020-04-02',    10000),
('UserA',   '2020-04-01',     9000),
('UserA',   '2020-03-31',    8000),
('UserA',   '2020-02-29',    5000),
('UserB',   '2020-04-02',    5),
('UserB',   '2020-04-01',     1)

SELECT
  T.UserID,
  T.Stamp,
	T.Score,
	T.Score - LAG(T.Score, 1,0) OVER (ORDER BY (T.Stamp)) as PointsPerDay,
  
  -- Begin messing around
  LAG(T.Score, 1,0)  OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)) AS LaggingScore,
  LAG(T.Stamp, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)) AS LaggingDate,
  DATEDIFF(hh, LAG(T.Stamp, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)), T.Stamp) AS Hours_Diff,

  CASE WHEN DATEDIFF(hh, LAG(T.Stamp, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)), T.Stamp) > 24 THEN 0
  ELSE T.Score - LAG(T.Score, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp))
  END AS NewPointsPerDay
FROM dbo.#Temp AS T;
GO
DROP TABLE dbo.#Temp;
If you don't care about resetting points/day by user, remove the PARTITION BY but I thought maybe you'd eventually be caring about that.

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

Agrikk posted:

I have a TSQL LAG question:

Given a table:
code:
UserID   timestamp    score
UserA   2020-04-02    10000
UserA   2020-04-01     9000
UserA   2020-03-31    8000
UserA   2020-02-29    5000
I am using a LAG function to determine the difference in scores between days like so:

code:
SELECT	[UserID],
	[timestamp],
	[score],
	score - LAG(score, 1,0) over (ORDER BY (STAMP)) as PointsPerDay
  FROM table
but I cannot figure out the WHERE clause so that if the time difference between two entries is greater than 25 hours return a zero for PointsPerDay for that entry.

i think you can do this with a calendar table and a left join with ISNULL() around the whole score & lag calculation to return 0 for days not in table?

code:
SELECT	[UserID],
	[timestamp],
	[score],
	ISNULL(score - LAG(score, 1,0) over (ORDER BY (STAMP)),0) as PointsPerDay
  FROM calendar c
  LEFT JOIN  table t on c.theDate = t.timestamp  
i haven't used LAG much, does this kind of thing work?

e: f,b

Furia
Jul 26, 2015

Grimey Drawer
Worst case you could try a CTE, depending on what you are trying to accomplish. Above seems good though.

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

Nth Doctor posted:

This sounds like a case for CASE and another LAG instead of a WHERE clause, but I haven't really used LAG very much. Let me bodge something together real quick and check my intuition.

Fake edit:
code:
CREATE TABLE dbo.#Temp (
  UserID VARCHAR(25) NOT NULL,
  Stamp DATETIME NOT NULL,
  Score INT
);

INSERT dbo.#Temp (
  UserID,
  Stamp,
  Score
)
VALUES
('UserA',   '2020-04-02',    10000),
('UserA',   '2020-04-01',     9000),
('UserA',   '2020-03-31',    8000),
('UserA',   '2020-02-29',    5000),
('UserB',   '2020-04-02',    5),
('UserB',   '2020-04-01',     1)

SELECT
  T.UserID,
  T.Stamp,
	T.Score,
	T.Score - LAG(T.Score, 1,0) OVER (ORDER BY (T.Stamp)) as PointsPerDay,
  
  -- Begin messing around
  LAG(T.Score, 1,0)  OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)) AS LaggingScore,
  LAG(T.Stamp, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)) AS LaggingDate,
  DATEDIFF(hh, LAG(T.Stamp, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)), T.Stamp) AS Hours_Diff,

  CASE WHEN DATEDIFF(hh, LAG(T.Stamp, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp)), T.Stamp) > 24 THEN 0
  ELSE T.Score - LAG(T.Score, 1,0) OVER (PARTITION BY T.UserID ORDER BY (T.Stamp))
  END AS NewPointsPerDay
FROM dbo.#Temp AS T;
GO
DROP TABLE dbo.#Temp;
If you don't care about resetting points/day by user, remove the PARTITION BY but I thought maybe you'd eventually be caring about that.

You should be able to factor out the window frame (partition by etc) in the last column. You can do the case over the window frame, but I’d just do IIF(datediff(...) <24, score-lag(Score,1),0)
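i.e. something like this as a drop-in for the NewPointsPerDay column above (same names and 24-hour threshold):

code:
IIF(DATEDIFF(hh, LAG(T.Stamp, 1, 0) OVER (PARTITION BY T.UserID ORDER BY T.Stamp), T.Stamp) > 24,
    0,
    T.Score - LAG(T.Score, 1, 0) OVER (PARTITION BY T.UserID ORDER BY T.Stamp)) AS NewPointsPerDay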

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Another question:

Given a table like so:
code:
TransactionID   Item1   Item1Count   Item2   Item2Count   Item3   Item3Count   Item4   Item4Count
abc123          4       10           5       20           6       100          7       50
with an item table like so
code:
ItemID   ItemName
4        foo
5        bar
6        bizz
7        buzz
Is it possible to create output like so:
code:
abc123 foo 10 bar 20 bizz 100 buzz 50
Without doing four joins? Is there a lookup table function or technique that'll look up all the values in the whole row from one table at once?


Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


As far as I know there is no dialect of SQL that can do that in a single operation. The better way to solve it, though, would be to have a table of:
code:
transactionId  itemId   count
4              5        30
4              3        112
5              5        2
where you do one join to decorate with the item name from id and then pivot into your transaction + items format.
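With that shape, the name lookup is a single join (table names here are illustrative), and getting back to one line per transaction is then the same fixed-column pivot situation as the trivia question earlier:

code:
select ti.transactionId,
       i.ItemName,
       ti.[count] as itemCount
from transactionItems ti
join items i on i.ItemID = ti.itemId;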

  • Reply