|
NihilCredo posted:c) <query that I want> Maybe you've already addressed this, but doesn't this (as described) make some matching rows more likely to be selected than others? A row that is considered to fall immediately after a long sequence of rows disqualified by the filter is much more likely to be selected than a row that falls immediately after another matching row. I would have thought that this isn't what you want - that you want each matching row to have an equal probability of being selected.
|
# ? Mar 20, 2020 11:04 |
|
|
# ? Jun 7, 2024 22:24 |
|
Hammerite posted:Maybe you've already addressed this, but doesn't this (as described) make some matching rows more likely to be selected than others? A row that is considered to fall immediately after a long sequence of rows disqualified by the filter is much more likely to be selected than a row that falls immediately after another matching row. I would have thought that this isn't what you want - that you want each matching row to have an equal probability of being selected. You're correct, it would be an uneven distribution, but that ought to be acceptable for my use case. The table is a sort of work queue (SELECT FOR UPDATE SKIP LOCKED style), so any gaps should 'smooth out' over time.
|
# ? Mar 20, 2020 11:23 |
Use a TABLESAMPLE query but repeat it until you don't get zero results.
|
|
# ? Mar 20, 2020 11:30 |
|
NihilCredo posted:You're correct, it would be an uneven distribution, but that ought to be acceptable for my use case. The table is a sort of work queue (SELECT FOR UPDATE SKIP LOCKED style), so any gaps should 'smooth out' over time. In that case, you could use a TABLESAMPLE query to pick a random row (any row, don't put your condition in yet), then find the first row that matches your condition and has id >= your randomly selected row's id. If you get no results, then you looped, and can grab the first one in the table (with id < your randomly selected row's id, if you want to keep it to just one table scan in the case where no rows meet your condition).
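A sketch of the approach described above, with the wraparound fallback; table and column names here are assumptions, not from the original post:

```sql
-- Sketch only: assumes a table "jobs" with an indexed primary key "id"
-- and a filter "status = 'pending'". TABLESAMPLE may itself return zero
-- rows for a tiny sample; if so, just loop as suggested earlier.
WITH anchor AS (
    SELECT id FROM jobs TABLESAMPLE SYSTEM (1) LIMIT 1
)
-- First matching row at or after the random anchor...
SELECT j.id
FROM jobs AS j, anchor AS a
WHERE j.status = 'pending' AND j.id >= a.id
ORDER BY j.id
LIMIT 1;

-- ...and if that returns nothing, wrap around from the start:
-- SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1;
```

The wraparound query can also restrict to `id < a.id` to keep the worst case to one scan of the table, as described above.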
|
# ? Mar 20, 2020 15:52 |
|
NihilCredo posted:
Not unless you use something other than the built-in sequence nextval, since there can be gaps in that. quote:Because nextval and setval calls are never rolled back, sequence objects cannot be used if "gapless" assignment of sequence numbers is needed. It is possible to build gapless assignment by using exclusive locking of a table containing a counter; but this solution is much more expensive than sequence objects, especially if many transactions need sequence numbers concurrently. Just pointing this out because I get the sense it's not widely known that sequences can have gaps in general.
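The "exclusive lock on a counter table" pattern the Postgres docs allude to looks roughly like this (table and column names invented for illustration):

```sql
-- One-row counter table; the row lock taken by UPDATE serializes
-- concurrent transactions, and a rollback undoes the increment
-- (unlike nextval, which is never rolled back).
CREATE TABLE invoice_counter (last_value bigint NOT NULL);
INSERT INTO invoice_counter VALUES (0);

-- Inside each transaction that needs a gapless number:
UPDATE invoice_counter
SET last_value = last_value + 1
RETURNING last_value;
```

The cost is that every numbered transaction queues behind this one lock, which is exactly the contention the docs warn about.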
|
# ? Mar 20, 2020 17:41 |
|
NihilCredo posted:I've got a large table in Postgres with a UUID primary key.

So what's a "uuid"? You're going to get clustering over the full uuid in any case, and even the low order of the timestamp may have seasonality clustering. You probably don't want to trust the uuid generators for uniformity, so you don't want to base your selection on the uuid.

Thinking about the problem as just "random row from large table", the benefit of random ordering is that you're guaranteed a result, limited appropriately. Obviously calculated fields for each row are expensive, particularly when 0% of the results will be used. So you want something less expensive overall but still random. There's nothing in SQL that will do all of this for you, so you're looking for a stored procedure, application stuff (or maybe a recursive CTE can manage it).

You have one tool that allows you to select some rows without the full scan: a cursor. So then, set up a cursor to do the scan and then you can keep requesting rows until random < 1/numrows. If you don't know the exact row count, the table stats will be close and you can shave off 10%, or as previously suggested just loop.

Otherwise you can balance speed and uniformity using certain types of clustering (not the DB keyword). This would almost certainly require a second table for cluster counting, but there may be a tricky approach with OFFSET. You also want to read https://www.postgresql.org/docs/9.6/indexes-index-only-scans.html very carefully. If you have access to the database you can probably just pull a random location from the table pages (on disk).

Edit: You mentioned a jobs/workflow type system, so don't do what you're trying to do. In the stable state, if you have N rows removed/added in time T, you'd want to process the first N submitted between 0T and 1T, the next N from 1T to 2T, etc. If you randomly select, p=1/N is a geometric distribution. The expected average time to hit row N is still N*T, but the standard deviation (for large N) is also N*T. Your customers will wonder why their job "hasn't started in three days". After you build this they'll be asking you to implement priority queueing and such anyway, so you'll be right back to "FIFO plus a few priority levels plus maybe an 'expected runtime' delay".

PhantomOfTheCopier fucked around with this message at 23:56 on Mar 20, 2020 |
# ? Mar 20, 2020 23:24 |
|
PhantomOfTheCopier posted:Edit: You mentioned a jobs/workflow type system, so don't do what you're trying to do. In the stable state, if you have N rows removed/added in time T, you'd want to process the first N submitted between 0T and 1T, the next N from 1T to 2T, etc. If you randomly select, p=1/N is a geometric distribution. The expected average time to hit row N is still N*T, but the standard deviation (for large N) is also N*T. Your customers will wonder why their job "hasn't started in three days". After you build this they'll be asking you to implement priority queueing and such anyway, so you'll be right back to "FIFO plus a few priority levels plus maybe an 'expected runtime' delay".

I think you're on to something. My line of thought was something like 'ok, I don't need to introduce a priority system right now, as long as the retries of old messages don't outright stop the sending of new ones it's gonna be ok', but it seems defining a sufficiently random order of execution isn't much less work than defining a proper priority system. Just brainstorming, but most messages fall into one of three cases:

1) Works on the first try (the majority).
2) Fails due to a network issue or some other problem that will go away on its own fairly quickly. Should be aggressively retried until it goes through.
3) Fails due to application-level errors that won't get fixed until a human looks at it. Should not be retried too often, so as not to waste workers' time in case there is a long list of bottled-up messages.

So perhaps I could just improve the filter, which right now considers the messages that have either never been sent, or were unsuccessfully sent more than $RETRY_INTERVAL ago. Instead of: code:
code:
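One way to express those three cases in the filter is an exponential backoff keyed on the attempt count, so transient failures retry quickly while persistent ones fade into the background. A sketch only; the column names (attempt_count, last_attempt_at, sent_at) are assumptions, not the actual schema:

```sql
-- Hypothetical backoff filter: the retry interval doubles with each
-- failed attempt. Fresh messages (attempt_count = 0) are always eligible;
-- a message that has failed 10 times waits ~17 hours between retries.
SELECT id
FROM messages
WHERE sent_at IS NULL
  AND (last_attempt_at IS NULL
       OR last_attempt_at < now()
            - interval '1 minute' * power(2, attempt_count))
ORDER BY last_attempt_at NULLS FIRST
FOR UPDATE SKIP LOCKED
LIMIT 1;
```

This keeps the existing SKIP LOCKED work-queue shape while acting as a crude priority system, which is roughly where the discussion above was headed anyway.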
|
# ? Mar 21, 2020 19:06 |
|
What's the MySQL setting that controls how strict MySQL is about case sensitivity? On local MySQL installs case sensitivity is not a problem, but the DigitalOcean managed DB cluster we have powering dev/prod seems to be strict about it. Not sure if I have the power to control the DigitalOcean side of things, so maybe we need to enforce case sensitivity on local dev environments.
|
# ? Mar 24, 2020 16:06 |
Protocol7 posted:What's the MySQL setting that controls how strict MySQL is about case sensitivity? Case sensitivity of database and table names is OS dependent. If the OS doesn't care (e.g. Windows), neither does MySQL; on Linux it does. Obligatory: lol
|
|
# ? Mar 24, 2020 16:22 |
Check the collation set for database and tables. Perhaps also default collation for connection? I don't remember if that's a thing. Collations with _ci suffix are case insensitive, suffixes _cs and _bin should be case sensitive. Edit: beaten and lol holy poo poo mysql how insane can you be
|
|
# ? Mar 24, 2020 16:23 |
|
Bundy posted:Case sensitivity on database and table names is OS dependent. If the os doesn't care e.g. Windows neither does MySQL. If on Linux it will. That makes sense. Unfortunately I inherited this DB setup from an offshore team, so there are, how do I put this lightly... some, uh, procedural changes to make.
|
# ? Mar 24, 2020 16:26 |
|
lower_case_table_names=1 Or roll the dice, live dangerously, and if you're on ext4 with Linux 5.2+, that'll allow case-insensitive naming per directory
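For reference, a sketch of where that setting lives, assuming you control the server's config file at all (a managed cluster may not expose it):

```ini
# my.cnf / mysqld.cnf fragment. Note: on MySQL 8.0 this can only be set
# when the data directory is first initialized; changing it on an
# existing server is unsupported.
[mysqld]
lower_case_table_names=1
```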
|
# ? Mar 24, 2020 19:11 |
|
I dunno if it's possible to change that on a DigitalOcean managed DB cluster, so I guess the solution is to teach the offshore guys a better way to generate the DB scripts, which should be interesting. Mostly because I'm not convinced I know enough about SQL stuff to do so effectively...
|
# ? Mar 24, 2020 20:53 |
|
Trying to do a fairly straightforward trivia database where I can pull the results into a leaderboard. The results table is date, team, round, score; to future-proof a little, I don't want to assume x amount of rounds in each quiz. Ideally I'd then like to display the leaderboard as Position / Team / Round 1 / Round 2 / Round 3 / Round etc / Total. I'm trying to think of a way to bring that out in a single pull. If not, I'm sure I can do it in multiple queries; so far I have code:
Otherwise I can get the entry script to enter the total as a 'round' and just pull from there, but I might be setting myself up for further issues like that and that doesn't scream good design to me.
|
# ? Mar 26, 2020 22:46 |
|
Got a couple rows of sample data to play with? e: if each round is on a separate row, you'll probably have to pivot to get it displayed the way you want kloa fucked around with this message at 23:14 on Mar 26, 2020 |
# ? Mar 26, 2020 23:06 |
|
Yeah, this is a textbook example of a pivot query, look that up. In standard SQL you have to explicitly declare your pivot columns so you can only support a fixed number of rounds (without doing nasty hacks with dynamic SQL). You can do something like declare 100 rounds and then, at the application level, drop any column that doesn't have non-null values. Your specific SQL dialect may offer some extra help, for example Postgres has a "crosstab" function that will perform a pivot on every value in the selected rows.
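To make the "explicitly declare your pivot columns" idea concrete, here's a sketch using plain conditional aggregation, which works in most dialects including MySQL. The schema (results: quiz_date, team, round, score) is assumed from the description above:

```sql
-- Standard-SQL pivot via conditional aggregation: one explicit column
-- per round. Extend the CASE list as far as you want to support;
-- unused rounds just come back as NULL columns to drop in the app.
SELECT team,
       SUM(CASE WHEN round = 1 THEN score END) AS round_1,
       SUM(CASE WHEN round = 2 THEN score END) AS round_2,
       SUM(CASE WHEN round = 3 THEN score END) AS round_3,
       SUM(score) AS total
FROM results
WHERE quiz_date = '2020-03-26'
GROUP BY team
ORDER BY total DESC;
```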
|
# ? Mar 26, 2020 23:38 |
|
Cool, I'm used to making pivot tables in Excel so that alone should put me in the right direction, thanks both! e: just realized I didn't mention it's MySQL, which at first glance doesn't seem to support pivoting by default, but it looks like you can write queries to do the same thing at least.
|
# ? Mar 26, 2020 23:46 |
|
NihilCredo posted:Your specific SQL dialect may offer some extra help, for example Postgres has a "crosstab" function that will perform a pivot on every value in the selected rows. oh man, it's so unfair that something like this hasn't been standardised
|
# ? Mar 27, 2020 02:01 |
|
Don’t pivot. God no. A fixed pivot or dynamic SQL is the opposite of future-proofing, and both will be a pain to work with in JS (I’m assuming your front end is a website). One query. Serialize to JSON. Date, total, array of rounds. If you must, two queries: one summary and one with a rowset of rounds for the summary record. That’s how I’d approach it.
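Brainstorming what that one-query JSON shape might look like in MySQL (the asker's dialect). A sketch only, with assumed table and column names, using MySQL's JSON functions (5.7.22+):

```sql
-- One row per team, with the rounds nested as a JSON array instead of
-- pivoted into columns; the front end iterates the array, so any number
-- of rounds works without schema changes.
SELECT team,
       SUM(score) AS total,
       JSON_ARRAYAGG(JSON_OBJECT('round', round, 'score', score)) AS rounds
FROM results
WHERE quiz_date = '2020-03-26'
GROUP BY team
ORDER BY total DESC;
```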
|
# ? Mar 27, 2020 03:13 |
|
NihilCredo posted:Yeah, this is a textbook example of a pivot query, look that up. Crosstab is ok actually
|
# ? Mar 27, 2020 03:31 |
|
MS SQL normalized (mostly) straight to JSON code:
code:
code:
|
# ? Mar 27, 2020 03:54 |
|
Ruggan I learn something new every time you post
|
# ? Mar 27, 2020 14:33 |
|
quote:drop table if exists #games good lord i have been writing IF OBJECT_ID('tempdb..#games',N'U') IS NOT NULL BEGIN DROP TABLE #games END for loving ever, so thanks for that
|
# ? Mar 27, 2020 15:42 |
|
Ruggan posted:MS SQL normalized (mostly) straight to JSON Oh lord, if there's one thing I've never been able to get my head around for some reason it's JavaScript and its various offshoots. My brain and OO just don't mix for some reason, but I really appreciate the effort here, and maybe I'll use that as a jumping-off point to finally get past the mental block I'm at and not just recycle the same essential elements I get away with for every dumb project. I'm also glad the poster above was able to get something out of it too! Luckily this thing doesn't have to be good; hell, it probably doesn't even need to work, as it's most likely just going to be something for me to work on and keep my brain occupied during crummy times. My scripting can generally be described as 'it works, but at what cost?'.
|
# ? Mar 27, 2020 16:01 |
|
kumba posted:good lord i have been writing Same, good lord.
|
# ? Mar 27, 2020 16:38 |
|
Bundy posted:Case sensitivity on database and table names is OS dependent. If the os doesn't care e.g. Windows neither does MySQL. If on Linux it will.
|
# ? Mar 27, 2020 21:48 |
|
kumba posted:good lord i have been writing Nth Doctor posted:Same, good lord. The optional IF EXISTS was introduced in MS SQL Server 2016; if you're on some legacy MS SQL database, you still need to use the awful syntax kumba wrote. Also huge: CREATE OR ALTER [PROCEDURE / VIEW / FUNCTION]
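For illustration, the two modern forms mentioned above side by side (the procedure name and body are made up):

```sql
-- SQL Server 2016+: one line replaces the OBJECT_ID dance.
DROP TABLE IF EXISTS #games;
GO

-- SQL Server 2016 SP1+: no more "IF EXISTS ... ALTER ELSE CREATE"
-- boilerplate for procedures, views, and functions.
CREATE OR ALTER PROCEDURE dbo.usp_example
AS
BEGIN
    SELECT 1 AS ok;
END;
```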
|
# ? Mar 27, 2020 22:16 |
|
Not really a question, but goddamn, this is cool. https://twitter.com/thedouglane/status/1243644982889775106?s=19
|
# ? Mar 27, 2020 22:18 |
|
I'm a doofus who just spent 30 minutes trying to figure out why a query was returning 65,535 rows before figuring out it was a coincidence. Don't be like me.
|
# ? Mar 31, 2020 17:16 |
I have a large-ish table (8 million records) in SQL Server and I need to do a simple update via a translation table. I currently have: SQL code:
If I was implementing this in a flat file store in a C program, I'd scan LargeTable, look up the matching mapping row, then update. That only performs a single scan of the large table and doesn't really need an index at all. Is there any way to force SQL Server to do the update that way instead? I really only need to run this update once ever, except that the process I'm working on has a bunch of development cycles, so an extra 40 minutes of execution time isn't that much of an issue in the big picture. Mostly asking for curiosity's sake.
|
|
# ? Apr 1, 2020 11:32 |
|
nielsm posted:I have a large-ish table (8 million records) in SQL Server and I need to do a simple update via a translation table. For this, you probably want to give your table variable a clustered index on A and look for a MERGE JOIN leveraging the fact that both tables can be sorted on A/FieldA. Idk what your rdbms is, but I know with SQL Server you can force the algorithm and index usage on the query: SQL code:
Usually I don't do either of the HINTS that I used because I want the query optimizer to have discretion, but since you said this was a one time thing, I was more controlling. Lastly you could do OPTION(FORCE ORDER) at the end to force joining from @@mapping to LargeTable like you mentioned you would do. Note how I flipped the JOIN order to set that up.
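A sketch of what that hinted shape could look like; the original statement wasn't shown, so the table and column names here are invented:

```sql
-- Table variable with a PRIMARY KEY, which gives it the clustered
-- index needed for a merge join on A.
DECLARE @mapping TABLE (A int PRIMARY KEY, B int NOT NULL);

-- INNER MERGE JOIN forces the join algorithm; OPTION (FORCE ORDER)
-- makes SQL Server join from the mapping side first, as described.
UPDATE lt
SET lt.FieldB = m.B
FROM @mapping AS m
INNER MERGE JOIN dbo.LargeTable AS lt
    ON lt.FieldA = m.A
OPTION (FORCE ORDER);
```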
|
# ? Apr 1, 2020 12:25 |
Thanks, I'll try out those methods. Right now I get a Nested Loops (Inner Join) using Hash Match (Aggregate), with a table scan of the mapping table at top and an index seek on the nonclustered index of the large table at the bottom. This has an actual cost of 48% on the index seek, 19% on the nested loops, 31% on the hash match, and 1% on the final clustered index update.
|
|
# ? Apr 1, 2020 13:25 |
Okay I experimented a bit, two things made a difference: First, using a tempdb table ##mapping instead of a table var @@mapping. Second, creating the clustered index on ##mapping.A allowed the execution plan to prefer a nested loops join with a table scan of LargeTable at the top, doing a stream aggregate match on the mapping table. I didn't need any index hints or changing the join type on the update statement after those two changes. The actual cost moved to 85% table update, 9% table scan on LargeTable, 6% index seek on ##mapping, zero on the join. Execution time improved by a factor of 16, from about 12 minutes down to 45 seconds with a LargeTable dataset of 800k rows. Edit: And yes, those numbers are much closer to what I had expected the costs for the operation to be. nielsm fucked around with this message at 14:22 on Apr 1, 2020 |
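Reconstructing the shape that worked, per the description above; names are guesses based on the thread:

```sql
-- Global temp table instead of a table variable, so the optimizer has
-- real statistics to work with, plus a clustered index on the join key.
CREATE TABLE ##mapping (A int NOT NULL, B int NOT NULL);
CREATE CLUSTERED INDEX IX_mapping_A ON ##mapping (A);
-- ...populate ##mapping here...

-- With those in place, the plan settled on a scan of LargeTable driving
-- seeks into ##mapping, with no hints needed.
UPDATE lt
SET lt.FieldB = m.B
FROM dbo.LargeTable AS lt
JOIN ##mapping AS m
    ON m.A = lt.FieldA;
```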
|
# ? Apr 1, 2020 13:49 |
|
I have a TSQL LAG question: Given a table: code:
code:
|
# ? Apr 8, 2020 17:00 |
|
Agrikk posted:I have a TSQL LAG question: This sounds like a case for CASE and another LAG instead of a WHERE clause, but I haven't really used LAG very much. Let me bodge something together real quick and check my intuition. Fake edit: code:
|
# ? Apr 8, 2020 17:23 |
|
Agrikk posted:I have a TSQL LAG question: i think you can do this with a calendar table and a left join with ISNULL() around the whole score & lag calculation to return 0 for days not in table? code:
e: f,b
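The calendar-table shape described above might look roughly like this; it assumes a pre-built dbo.Calendar table with one row per day, and the Scores column names are guesses:

```sql
-- Left join from the calendar so every day appears, then wrap the whole
-- score-minus-lag calculation in ISNULL so days with no row come out 0.
SELECT c.TheDate,
       ISNULL(s.Score - LAG(s.Score) OVER (ORDER BY c.TheDate), 0) AS DailyGain
FROM dbo.Calendar AS c
LEFT JOIN Scores AS s
    ON s.ScoreDate = c.TheDate
WHERE c.TheDate BETWEEN '2020-01-01' AND '2020-12-31';
```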
|
# ? Apr 8, 2020 17:25 |
|
Worst case you could try a CTE, depending on what you are trying to accomplish. Above seems good though.
|
# ? Apr 8, 2020 18:49 |
|
Nth Doctor posted:This sounds like a case for CASE and another LAG instead of a WHERE clause, but I haven't really used LAG very much. Let me bodge something together real quick and check my intuition. You should be able to factor out the window frame (partition by etc) in the last column. You can do the case over the window frame, but I’d just do IIF(datediff(...) <24, score-lag(Score,1),0)
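A sketch of the IIF version being suggested; the schema is assumed (Scores: Player, ScoreDate, Score, one row per recorded score):

```sql
-- If the previous row for this player is more than 24 hours old (or
-- doesn't exist, in which case DATEDIFF is NULL and the condition is
-- false), report 0 instead of the score difference.
SELECT Player,
       ScoreDate,
       Score,
       IIF(DATEDIFF(HOUR,
                    LAG(ScoreDate) OVER (PARTITION BY Player ORDER BY ScoreDate),
                    ScoreDate) < 24,
           Score - LAG(Score) OVER (PARTITION BY Player ORDER BY ScoreDate),
           0) AS DailyGain
FROM Scores;
```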
|
# ? Apr 9, 2020 06:54 |
|
Another question: Given a table like so: code:
code:
code:
|
# ? Apr 17, 2020 20:31 |
|
|
|
As far as I know, there is no dialect of SQL which can do that in a single operation. Although the better way to solve that would be to have a table of: code:
|
# ? Apr 17, 2020 20:47 |