Defenestrategy
Oct 24, 2010

Thanks for the ideas guys!

PhantomOfTheCopier posted:

Shrug, start small.

Luckily the asks are very small and I don't have to abide by any sort of legal requirements here, otherwise I would have told them to bite the bullet and buy an actual solution. I'm gonna have a front-end website where dudes can check in by picking their name from a drop-down box and hitting submit, with some amount of sanity checking to make sure they aren't double-submitting by accident. Then I'm gonna create an "admin" console for the owner to do basic viewing and editing of the tables.

A question: Why do you suggest having a setup like member_id and member_info tables instead of just one members table?

Defenestrategy fucked around with this message at 01:35 on Feb 3, 2023

Adbot
ADBOT LOVES YOU

Ranzear
Jul 25, 2013

Defenestrategy posted:

A question: Why do you suggest having a setup like member_id and member_info tables instead of just one members table?

Frontloading: it bakes in generally good assumptions about first normal form.

If a member can be in an arbitrary number of classes and needs data tracked independently for each one, that can't be done in a single table without breaking the nicely flattened, aka normalized, structure. Hence a second table that can carry the same member_id multiple times, one row per class they're taking, holding their data for that class. Classes would be defined in their own table, but class occurrences go in a table that references a class_id multiple times too. Then attendance of a member at a specific timeslot of a class goes in yet another table, referencing both of the others.

The other relevant part of this 'one here, many there' is that the many-entries table is foreign-keyed to the single-entry table. This is also a case where your primary key should just be an integer you forget about; it's the foreign key constraint that really drives the table. That's one of my big db tips: constraints, constraints, constraints.

So your schema might look like:

code:
classes
	pid, (unique) classid
class_times
	pid, (incrementing) timeslotid, (foreign key) classes.classid, date_reference, timeslot_reference, etc
members
	pid, (unique) memberid, fullname
member_registrations
	pid, (incrementing) registrationid, (foreign key) members.memberid, (foreign key) classes.classid
member_attendance
	pid, (foreign key) members.memberid, (foreign key) class_times.timeslotid, attendance_enum    (plus add a unique constraint on memberid and timeslotid)
Yes, some tables might end up having a single column aside from a primary key. That's fine and expected; don't try to maximize column usage like you're gonna run out of tables. Also, IMO, primary keys are just a 'default' index and storage ordering, so I leave them as autoincrement ints and focus on making my own uniqueness and foreign key constraints. That gives me a little more flexibility down the line, especially with (re)naming fields. Altering a primary key gives me the heebie-jeebies, plus I've done some Galera stuff, which has to over-increment primary keys, so I handle IDs at the application layer instead.

Notice this allows attendance to be marked for classes the member isn't registered in. Maybe someone needs to make up a class and you want that; otherwise add a column with a foreign key requiring a registrationid.

Building constraints into the schema like this lets you avoid bricking the db with small coding errors; the db will just tell your app to hecc off. Handling those errors is also easier on the app side than trying to cover every case up front.
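To make that concrete, here's a rough DDL sketch of the same schema (Postgres-flavored and untested; the column types and the date/timeslot/attendance columns are guesses, the names are just the ones from the outline above):

SQL code:
CREATE TABLE classes (
    pid      SERIAL PRIMARY KEY,
    classid  TEXT NOT NULL UNIQUE
);

CREATE TABLE class_times (
    pid         SERIAL PRIMARY KEY,
    timeslotid  INTEGER NOT NULL UNIQUE,
    classid     TEXT NOT NULL REFERENCES classes (classid),
    class_date  DATE NOT NULL,
    timeslot    TEXT NOT NULL
);

CREATE TABLE members (
    pid       SERIAL PRIMARY KEY,
    memberid  INTEGER NOT NULL UNIQUE,
    fullname  TEXT NOT NULL
);

CREATE TABLE member_registrations (
    pid             SERIAL PRIMARY KEY,
    registrationid  INTEGER NOT NULL UNIQUE,
    memberid        INTEGER NOT NULL REFERENCES members (memberid),
    classid         TEXT NOT NULL REFERENCES classes (classid)
);

CREATE TABLE member_attendance (
    pid         SERIAL PRIMARY KEY,
    memberid    INTEGER NOT NULL REFERENCES members (memberid),
    timeslotid  INTEGER NOT NULL REFERENCES class_times (timeslotid),
    attendance  TEXT NOT NULL,            -- stands in for attendance_enum
    UNIQUE (memberid, timeslotid)         -- the no-double-marking constraint
);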

Ranzear fucked around with this message at 05:21 on Feb 3, 2023

Just-In-Timeberlake
Aug 18, 2003

Defenestrategy posted:

Thanks for the ideas guys!

Luckily the asks are very small and I don't have to abide by any sort of legal requirements here, otherwise I would have told them to bite the bullet and buy an actual solution. I'm gonna have a front-end website where dudes can check in by picking their name from a drop-down box and hitting submit, with some amount of sanity checking to make sure they aren't double-submitting by accident. Then I'm gonna create an "admin" console for the owner to do basic viewing and editing of the tables.

A question: Why do you suggest having a setup like member_id and member_info tables instead of just one members table?

Get some cheap UPC key fobs and a barcode scanner (also hella cheap, $30 with a stand).

Someone enrolls, assign them a fob, they scan in when they get there, no mistakes.

Alternately, don't do any of this, because you're going to be expected to provide tech support for something you didn't get paid for, while your buddy acts like you did.

Just-In-Timeberlake fucked around with this message at 14:30 on Feb 3, 2023

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Stupid analytics-101 type question, and a basic SQL Fiddle with examples: http://sqlfiddle.com/#!17/41a4b/14

I've got people, notes, and codes; a minimal model looks like this:

SQL code:
CREATE TABLE notes(note_id TEXT, person_id TEXT, note_text TEXT);
CREATE TABLE codes(code_id TEXT, person_id TEXT, code TEXT);
CREATE TABLE person(person_id TEXT);
What's the sanest way to get a count of notes and codes per person, knowing that there are some people with zero for one or more of those and I do want those to appear?

So I want something like:

pre:
person_id  notes  codes
1          4      5
2          8      0
3          0      16
Would you rather see a COUNT DISTINCT like

SQL code:
SELECT
    person_id,
    COUNT(DISTINCT note_id) as note_count,
    COUNT(DISTINCT code_id) as code_count
FROM
    person
    LEFT JOIN notes USING(person_id)
    LEFT JOIN codes USING(person_id)
GROUP BY
    person_id;
or one built from two CTEs, each shaped like
SQL code:
    SELECT
        person_id,
        COUNT(*) as note_count
    FROM
        person
        JOIN notes USING(person_id)
    GROUP BY
        person_id
or is there something I'm completely missing that's simpler than any of this? Also, in reality I'm dealing with Spark, and codes doesn't even have a primary key, so I'd have to do DISTINCT on something like a datetime field. Blech. I think the one with multiple CTEs might make for a better execution plan on most platforms, but I'm not sure.

Ranzear
Jul 25, 2013

You want COALESCE(COUNT(foo.bar), 0) to get zeroes instead of nulls; then the joins and grouping are slightly less of a clusterfuck:

SQL code:
SELECT
  p.person_id,
  (SELECT COALESCE(COUNT(person_id), 0) FROM notes n WHERE p.person_id = n.person_id) AS note_count,
  (SELECT COALESCE(COUNT(person_id), 0) FROM codes c WHERE p.person_id = c.person_id) AS code_count
FROM person p
GROUP BY p.person_id;
code:
person_id  notes  codes
1          2      1
2          0      1
3          0      0
Complete even with zero-zero cases.

COUNT(*) is an antipattern. Put what you want to count in it.

Double edit: gently caress gently caress gently caress. Everything is wrong. Hang on...

Triple edit: Subqueries it is. This works fine until you try to do more than one join, is all:
SQL code:
SELECT
  p.person_id,
  COALESCE(COUNT(n.person_id), 0) AS note_count
FROM person p
LEFT JOIN notes n ON p.person_id = n.person_id
GROUP BY p.person_id;

Ranzear fucked around with this message at 04:17 on Feb 4, 2023

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
I feel like there's something really fundamental I'm missing. Playing around in SQL Fiddle, the execution plan does look better with the two CTEs - one to count each type of record - and then a JOIN on person_id at the end.

I hacked through this by making two separate queries and then smashing the results together in Python, because no matter which way I tried to get this all out in one query I was going OOM and dying with 360GB of RAM across the Spark cluster, and I'd have to bother IT to get a bigger cluster. Notes is >1 billion records, there's a couple million people, and I didn't bother to count codes, but it's somewhere between millions and a billion.

Ranzear
Jul 25, 2013

Fixed mine with subqueries with WHERE clauses to keep the counts separate. When counting just one other table the join works, but the count spills over with two or more.
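That spill-over is classic join fan-out, for what it's worth: with two LEFT JOINs on person_id, a person with 2 notes and 3 codes produces 2 x 3 = 6 joined rows, so a plain COUNT sees 6 for both. A toy demonstration against the fiddle schema above:

SQL code:
-- Suppose person 1 has 2 notes and 3 codes: the double LEFT JOIN
-- yields 2 x 3 = 6 rows for them, so both counts come back as 6
-- unless you use COUNT(DISTINCT ...).
SELECT
    p.person_id,
    COUNT(n.note_id) AS inflated_note_count,
    COUNT(c.code_id) AS inflated_code_count
FROM person p
LEFT JOIN notes n ON p.person_id = n.person_id
LEFT JOIN codes c ON p.person_id = c.person_id
GROUP BY p.person_id;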

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Why does each cte also need the person table?

... It won't let me post...
Something like SELECT person_id, COUNT(note_id) FROM notes GROUP BY person_id should work just fine in a CTE.


Ten million people is 10M x 2 integers for the id+count tables and another 10M for the results. It should only be 60M x 64 bits, or 480MB. Even at 4x for overhead, storage is less than 2GB? Likely more if everything is actually a text field.

Start with WHERE person_id < 500 in your ctes to see if the results are even sensible or if it's doing something dumb.

Ranzear
Jul 25, 2013

Ranzear posted:

COUNT(*) is an antipattern. Put what you want to count in it.

What the gently caress, Postgre? COUNT(*) actually no-ops the compare and is faster? So much for my "it costs nothing to have more clarity here" ideal. The speed difference shows up in the Fiddle even when the plan is the same.

Edit: I thought it might be the table aliases in the counts making it 10x slower for some reason, but it's just being weird all around. I'd still prefer to filter out null person_ids unless I had a constraint.

Ranzear fucked around with this message at 04:24 on Feb 4, 2023

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

PhantomOfTheCopier posted:

Why does each cte also need the person table?

... It won't let me post...
Something like SELECT person_id, COUNT(note_id) FROM notes GROUP BY person_id should work just fine in a CTE.


Ten million people is 10M x 2 integers for the id+count tables and another 10M for the results. It should only be 60M x 64 bits, or 480MB. Even at 4x for overhead, storage is less than 2GB? Likely more if everything is actually a text field.

Start with WHERE person_id < 500 in your ctes to see if the results are even sensible or if it's doing something dumb.

I don't need the person table in each CTE, and that solves the memory usage in Spark! Good call. It's super fast and simple now with the two CTEs.

As for resource usage: Spark does really unpredictable things compared to any relational database I've used, because it's not a relational database; it's just pretending to be one over semi-structured data. I hate Spark, but it seems to be the dominant tool and what gets provided as output from whatever data processing pipeline companies have now.
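For reference, the final shape is presumably something like this (untested sketch: each CTE aggregates its own table, person only comes in for the final join, and COALESCE fills in the zero cases):

SQL code:
WITH note_counts AS (
    SELECT person_id, COUNT(*) AS note_count
    FROM notes
    GROUP BY person_id
), code_counts AS (
    SELECT person_id, COUNT(*) AS code_count
    FROM codes
    GROUP BY person_id
)
SELECT
    p.person_id,
    COALESCE(n.note_count, 0) AS note_count,
    COALESCE(c.code_count, 0) AS code_count
FROM person p
LEFT JOIN note_counts n ON p.person_id = n.person_id
LEFT JOIN code_counts c ON p.person_id = c.person_id;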

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Twerk from Home posted:

I don't need the person table in each CTE, and that solves the memory usage in Spark! Good call. It's super fast and simple now with the two CTEs.

As for resource usage: Spark does really unpredictable things compared to any relational database I've used, because it's not a relational database; it's just pretending to be one over semi-structured data. I hate Spark, but it seems to be the dominant tool and what gets provided as output from whatever data processing pipeline companies have now.

:cheers:

I certainly don't know, but it's probably better to think of it as map-reduce. Joins are going to be trouble, but materialized data (CTEs and some subselects) can be handled in separate subprocesses. It's easy, therefore, for it to build tons of small reference "tables" of intermediate results and do one big join at the end, but doing joins for every row is just a big Cartesian mess. In fact, with Spark you should probably just imagine that each table is stored somewhere else in the world.

But I don't know Spark so :downsgun:

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Ranzear posted:

COUNT(*) is an antipattern.

:thunk:

what the heck is this take?! count(*) literally just counts the number of rows in the resultset; it's a perfectly legitimate thing to want to do and is perfectly clear in intent. putting an expression in there means something different - it counts the rows where that expression isn't null - but if you don't actually care about asserting that, then it's not just confusing: as you discovered, it also forces the db to do an entirely avoidable null check!
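A toy illustration of the difference, assuming a scratch table t with a single nullable column:

SQL code:
-- Given t(col) containing the values (1), (NULL), (3):
SELECT
    COUNT(*)   AS all_rows,   -- 3: counts every row
    COUNT(col) AS non_null    -- 2: skips the NULL
FROM t;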

TheFluff fucked around with this message at 15:41 on Feb 4, 2023

Condimentalist
Jun 13, 2007

TheFluff posted:

:thunk:

what the heck is this take?! count(*) literally just counts the number of rows in the resultset; it's a perfectly legitimate thing to want to do and is perfectly clear in intent. putting an expression in there means something different - it counts the rows where that expression isn't null - but if you don't actually care about asserting that, then it's not just confusing: as you discovered, it also forces the db to do an entirely avoidable null check!

Maybe Count(1) will make them happy lol

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
regardless of anything else, COUNT(*) is also idiomatic SQL, and RDBMS vendors are likely to have taken the time to make sure it's performant

Condimentalist
Jun 13, 2007
I mean, I would never deploy production code with a count(*) or select *, partly due to reliability concerns (i.e. if you return all columns and the column count changes, you can easily break middle-tier code, etc.) and partly due to strict database development standards, but for one-off queries or non-deployed code… I do it all the time and it’s super handy.

Kind of like order by 1,2,3,4 instead of spelling out the columns. Yes, it should work in production, but just be explicit!
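To make the ordinal point concrete (table and column names here are made up): ORDER BY 2 sorts by whatever happens to be second in the select list, so an innocent reordering of the columns silently changes the sort.

SQL code:
-- Ordinal: sorts by the 2nd select-list item (name, today).
-- Reorder the select list later and the sort silently changes.
SELECT id, name FROM users ORDER BY 2;

-- Explicit: survives edits to the select list.
SELECT id, name FROM users ORDER BY name;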

Kuule hain nussivan
Nov 27, 2008

Condimentalist posted:

I mean, I would never deploy production code with a count(*) or select *, partly due to reliability concerns (i.e. if you return all columns and the column count changes, you can easily break middle-tier code, etc.) and partly due to strict database development standards, but for one-off queries or non-deployed code… I do it all the time and it’s super handy.

Kind of like order by 1,2,3,4 instead of spelling out the columns. Yes, it should work in production, but just be explicit!

But... count(*) doesn't care about the actual columns?

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
COUNT(*) and SELECT * are entirely different

Just-In-Timeberlake
Aug 18, 2003

Condimentalist posted:

I mean, I would never deploy production code with a count(*) or select *, partly due to reliability concerns (i.e. if you return all columns and the column count changes, you can easily break middle-tier code, etc.) and partly due to strict database development standards, but for one-off queries or non-deployed code… I do it all the time and it’s super handy.

Kind of like order by 1,2,3,4 instead of spelling out the columns. Yes, it should work in production, but just be explicit!

COUNT(*) has nothing to do with columns

Ranzear
Jul 25, 2013

TheFluff posted:

:thunk:

what the heck is this take?!

Maybe 80% poo poo take, 20% misremembered MariaDB quirks: something about always using a NOT NULL column in count(), which probably results in the exact same no-op as Postgre with *, plus there's a level of explicitness and clarity that comes from using the primary key in there without the performance penalty.

What probably matters more is picking something and sticking to it. It's also a safe bet I read that five years ago, specifically about the Aria storage engine, which still isn't done.

Ranzear fucked around with this message at 01:48 on Feb 5, 2023

Condimentalist
Jun 13, 2007

Just-In-Timeberlake posted:

COUNT(*) has nothing to do with columns

I don’t think I said it did?
I am making two separate statements about the asterisk in SQL statements; I should have been clearer in my post.

Neither of those asterisk-related SQL statements should be deployed to production, was my point.

RandomBlue
Dec 30, 2012

hay guys!


Biscuit Hider

Condimentalist posted:

I don’t think I said it did?
I am making two separate statements about the asterisk in SQL statements; I should have been clearer in my post.

Neither of those asterisk-related SQL statements should be deployed to production, was my point.

You are misunderstanding how COUNT(*) works vs COUNT(col).

https://learnsql.com/blog/difference-between-count-distinct/

RandomBlue fucked around with this message at 05:06 on Feb 5, 2023

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Condimentalist posted:

I don’t think I said it did?
I am making two separate statements about the asterisk in SQL statements; I should have been clearer in my post.

Neither of those asterisk-related SQL statements should be deployed to production, was my point.

Ok I’m sorry, what? What do you think the potential issues of SELECT COUNT(*) are exactly?

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Ruggan posted:

Ok I’m sorry, what? What do you think the potential issues of SELECT COUNT(*) are exactly?
Ten years ago, PostgreSQL had troubles with the performance of count(*) that were notably improved in 9.2, when general support for index-only scans meant count(*) could be optimized.

That provides a hint, however: count(key) may force more appropriate index usage, particularly for conditional counts... but in practice I suspect this is a difficult case to hit these days.

It's weird because I "know to avoid" count(*) on PG but, beyond some memory of it being slow, there may be little remaining reason other than a desire to have something explicit and "ugh, that's a mysql thing".

Pardot
Jul 25, 2001




One other thing with count is that it gets you the exact correct count, and has to wait for locks, scan things, and just do work in general. If you really need an exact count, that's fine. But a lot of the time people take a count just to display something like the total number of pages, and it doesn't really matter if it's exact. In that case, grab the estimate from one of the stats tables.
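In Postgres, for instance, that estimate lives in pg_class (the table name below is hypothetical, and the figure is only as fresh as the last VACUUM/ANALYZE):

SQL code:
-- Approximate row count from planner statistics:
-- no locks, no scan, but only as accurate as the
-- last VACUUM / ANALYZE.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'purchases';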

I would blow Dane Cook
Dec 26, 2008
Weird question, but in your work as a database administrator have you ever been confronted with an ethical quandary, or been asked to do something with data that you weren't comfortable with or didn't feel was appropriate? How did you handle it?

Kuule hain nussivan
Nov 27, 2008

I would blow Dane Cook posted:

Weird question, but in your work as a database administrator have you ever been confronted with an ethical quandary, or been asked to do something with data that you weren't comfortable with or didn't feel was appropriate? How did you handle it?

While not a DB admin per se, I've made an anonymous complaint to a regulatory authority regarding the use of sensitive information.

mortarr
Apr 28, 2005

frozen meat at high speed

I would blow Dane Cook posted:

Weird question, but in your work as a database administrator have you ever been confronted with an ethical quandary, or been asked to do something with data that you weren't comfortable with or didn't feel was appropriate? How did you handle it?

Not a DBA either, but 20 or so years ago I was responsible for a medical report that was included in a "discharge summary" - a report sent to a patient's GP after an inpatient stay. The report included a summarised set of medical tests: test name, date, value, and normal/expected range. I was responsible for re-writing/modernising the report, basically turning it from raw data into a pivoted set to save pages when faxed (lol). During the re-write I found a logic error that transposed data for a small subset of tests, for all patients, going back to when the report was first written - multiple years. The resulting error would show both false positives and false negatives, depending on which data was transposed.

I thought it was a huge deal; my boss thought we should just fix it and do nothing more. I was just starting out, so not all that confident, but given the scale of the issue I was thinking I might need to go to the media or something, as I wasn't getting any traction on it within my team. I don't recall how it happened now, but I ended up speaking to the medical director of the hospital, and he explained that the specific tests likely to be problematic had a shelf life - they were relevant only for around two or three months once administered - and also that if they were abnormal there would have been a text note included in the report; abnormal tests would have triggered a further round of tests that would have been reported differently. So the risk of having non-obvious wrong data on a patient's permanent record was acceptable.

I still think about the whole thing. I don't know if the outcome was right, but ultimately I didn't know enough about the medico-legal side then and had to trust that any fallout would land on the director at that stage, as his instructions to me were clear. I think I would do things differently if the same scenario happened to me now - in particular I would get a lot more in writing - but also my technical and business/people skills are way better now, so I would be able to investigate and advocate my position more effectively.

I really did come close to going to the media, and I think the only thing that stopped me was recognising that doing so would most likely have tanked my career just as it was getting started.

I've come across a few ethically dodgy situations at work since but nothing anywhere near the scope of this one, and none to do with data I've been responsible for. I think going through this was a good learning experience - being very careful with sensitive data is certainly something I factor into my work practice now.

Wipfmetz
Oct 12, 2007

If one or more Wipfs are sitting in a mine cart, they can be seen peeking over the rim of the cart.
Does anybody have experience with postgres' indices on expressions, like substr(somevarchar, 3, 4)?

Basically, our customer is migrating a large database from DB2/zOS to postgres. Our system runs a few queries on that database, including joins, filters, and aggregations.
Since it's an automotive enterprise with partially decades-old tables and records, it also includes the odd char(n) field where you have to filter on the nth character and such.

We are now in the situation where we may request a list of indices.

I know that postgres supports indices on expressions, but I've got no idea if they're actually used by the query planner aside from in a few corner cases.

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Wipfmetz posted:

Does anybody have experience with postgres' indices on expressions, like substr(somevarchar, 3, 4)?

Basically, our customer is migrating a large database from DB2/zOS to postgres. Our system runs a few queries on that database, including joins, filters, and aggregations.
Since it's an automotive enterprise with partially decades-old tables and records, it also includes the odd char(n) field where you have to filter on the nth character and such.

We are now in the situation where we may request a list of indices.

I know that postgres supports indices on expressions, but I've got no idea if they're actually used by the query planner aside from in a few corner cases.

You'll usually need exactly the same expression in the query for a functional index to be used, in the same way you need an accurate reference for partition pruning to actually prune. If the query expression matches the indexed function call, though, it should be used, unless the stats say that another path onto the table is going to be more selective anyway.
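As a small illustration (table and column names here are made up):

SQL code:
-- Expression index: the planner will generally only consider it
-- when the query repeats the exact same expression.
CREATE INDEX idx_parts_substr ON parts (substr(part_code, 3, 4));

-- Can use the index:
SELECT * FROM parts WHERE substr(part_code, 3, 4) = 'A123';

-- Won't use it (different expression):
SELECT * FROM parts WHERE substr(part_code, 4, 4) = 'A123';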

Wipfmetz
Oct 12, 2007

If one or more Wipfs are sitting in a mine cart, they can be seen peeking over the rim of the cart.
Cool. "same expression" sounds like a very doable requirement.
Seems to be worth a try.

double nine
Aug 8, 2013

I'm struggling with a beginner SQL question.

I have two tables; basically they've got the following columns

- customer ID code
- purchase date
- store ID code
- Product ID
- invoice number

These are split into two tables: one with all purchases in the last 12 months, and one for the older purchases.

I need to get a list of all rows for product A (say ID 123456) purchased since 20220101, for customers who haven't purchased product B (e.g. 987654).

What makes it extra fun is that these purchases could be in either table, and I'm not getting there.
Can someone help me build a basic structure for this?

(e: specifically the filter "show if NOT PURCHASED 987654" is giving me trouble, all the rest is obvious)

double nine fucked around with this message at 20:12 on Feb 16, 2023

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

double nine posted:

I'm struggling with a beginner SQL question.

I have two tables; basically they've got the following columns

- customer ID code
- purchase date
- store ID code
- Product ID
- invoice number

These are split into two tables: one with all purchases in the last 12 months, and one for the older purchases.

I need to get a list of all rows for product A (say ID 123456) purchased since 20220101, for customers who haven't purchased product B (e.g. 987654).

What makes it extra fun is that these purchases could be in either table, and I'm not getting there.
Can someone help me build a basic structure for this?

(e: specifically the filter "show if NOT PURCHASED 987654" is giving me trouble, all the rest is obvious)

code:
SELECT
    WantedTable.*
FROM
    (SELECT * FROM Last12MonthsTable UNION ALL SELECT * FROM OlderTable) AS WantedTable
    LEFT JOIN Last12MonthsTable AS Check1 ON
        Check1.CustomerID = WantedTable.CustomerID AND
        Check1.ProductID = 987654
    LEFT JOIN OlderTable AS Check2 ON
        Check2.CustomerID = WantedTable.CustomerID AND
        Check2.ProductID = 987654
WHERE
    Check1.CustomerID IS NULL AND
    Check2.CustomerID IS NULL AND
    WantedTable.ProductID = 123456 AND
    WantedTable.PurchaseDate >= '20220101'
this is not tested and I can't make any guarantee about correctness or performance

Hammerite fucked around with this message at 20:56 on Feb 16, 2023

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
you could potentially do the union on the outside instead, and maybe the query engine would be more likely to do it efficiently? Idk

code:
SELECT
    WantedTable1.*
FROM
    Last12MonthsTable AS WantedTable1
    LEFT JOIN Last12MonthsTable AS Check11 ON
        Check11.CustomerID = WantedTable1.CustomerID AND
        Check11.ProductID = 987654
    LEFT JOIN OlderTable AS Check12 ON
        Check12.CustomerID = WantedTable1.CustomerID AND
        Check12.ProductID = 987654
WHERE
    Check11.CustomerID IS NULL AND
    Check12.CustomerID IS NULL AND
    WantedTable1.ProductID = 123456
UNION ALL
SELECT
    WantedTable2.*
FROM
    OlderTable AS WantedTable2
    LEFT JOIN Last12MonthsTable AS Check21 ON
        Check21.CustomerID = WantedTable2.CustomerID AND
        Check21.ProductID = 987654
    LEFT JOIN OlderTable AS Check22 ON
        Check22.CustomerID = WantedTable2.CustomerID AND
        Check22.ProductID = 987654
WHERE
    Check21.CustomerID IS NULL AND
    Check22.CustomerID IS NULL AND
    WantedTable2.ProductID = 123456 AND
    WantedTable2.PurchaseDate >= '20220101'
no date condition on the first select because that table contains only rows that are guaranteed to satisfy the date condition

Just-In-Timeberlake
Aug 18, 2003

double nine posted:

I'm struggling with a beginner SQL question.

I have two tables; basically they've got the following columns

- customer ID code
- purchase date
- store ID code
- Product ID
- invoice number

These are split into two tables: one with all purchases in the last 12 months, and one for the older purchases.

I need to get a list of all rows for product A (say ID 123456) purchased since 20220101, for customers who haven't purchased product B (e.g. 987654).

What makes it extra fun is that these purchases could be in either table, and I'm not getting there.
Can someone help me build a basic structure for this?

(e: specifically the filter "show if NOT PURCHASED 987654" is giving me trouble, all the rest is obvious)

This should work but it's ugly as hell

code:
DECLARE @pId1        INT = 123456
      , @pId2        INT = 987654
      , @customerId  INT = 1
      , @dateCutoff  DATETIME = '1/1/2022';

;WITH cte AS (
    SELECT  [productId],
            [customerId],
            [purchaseDate],
            [storeId],
            [invoiceNumber]
    FROM    @salesNew
    WHERE   [customerId] = @customerId
        AND [productId] = @pId1
        AND [purchaseDate] >= @dateCutoff
    UNION
    SELECT  [productId],
            [customerId],
            [purchaseDate],
            [storeId],
            [invoiceNumber]
    FROM    @salesOld
    WHERE   [customerId] = @customerId
        AND [productId] = @pId1
        AND [purchaseDate] >= @dateCutoff
)
SELECT  [productId],
        [customerId],
        [purchaseDate],
        [storeId],
        [invoiceNumber]
FROM    [cte]
WHERE   (
    SELECT  COUNT(*)
    FROM    (
        SELECT  [productId],
                [customerId],
                [purchaseDate],
                [storeId],
                [invoiceNumber]
        FROM    @salesNew
        WHERE   [customerId] = @customerId
            AND [productId] = @pId2
        UNION
        SELECT  [productId],
                [customerId],
                [purchaseDate],
                [storeId],
                [invoiceNumber]
        FROM    @salesOld
        WHERE   [customerId] = @customerId
            AND [productId] = @pId2
    ) AS [t]
) = 0

Just-In-Timeberlake fucked around with this message at 21:07 on Feb 16, 2023

double nine
Aug 8, 2013

thanks for the suggestions, I'll try them tomorrow.
I think it'll work

mortarr
Apr 28, 2005

frozen meat at high speed

double nine posted:

I'm struggling with a beginner SQL question...



I threw your whole post into ChatGPT as-is and it came up with a pretty similar answer to Hammerite's 2nd one (the SQL comments are mine):

code:
SELECT *
FROM purchases_last_12_months
WHERE ProductID = '123456'
AND PurchaseDate >= '20220101'
AND CustomerID NOT IN (
    -- ignore customers that have purchased product B in the last 12 months
    SELECT CustomerID
    FROM purchases_last_12_months
    WHERE ProductID = '987654'
)
UNION
SELECT *
FROM older_purchases
WHERE ProductID = '123456'
AND PurchaseDate >= '20220101'
AND CustomerID NOT IN (
    -- ignore customers that have purchased product B *prior* to the last 12 months
    SELECT CustomerID
    FROM older_purchases
    WHERE ProductID = '987654'
)
Dunno if I'd recommend it for advanced analytic queries, but it seems pretty good for getting a starting point if you're already fairly proficient at SQL. I tried to add my own preferred style, which is building stuff up with CTEs, but it's getting blocked by SA's DDoS filter or something, sorry... here's my pseudo-code for it:

code:
/*
probably not performant on large datasets:

with cte purchases for all cols from (purchases_last_12_months unioned with older_purchases) where ProductID = '123456' AND PurchaseDate >= '20220101'
then with cte customers_to_ignore as CustomerID from (purchases_last_12_months unioned with older_purchases) where ProductID = '987654'
get all data from purchases where CustomerID not in customers_to_ignore 
*/

mortarr fucked around with this message at 22:29 on Feb 16, 2023

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
That chatgpt answer is really cool because it looks plausible and probably works on a couple of test cases, but it's also completely wrong and doesn't meet the specification at all: each NOT IN subquery only checks one of the two tables, so a customer whose product-B purchase lives in the other table slips right through.

Please do not use chatgpt to answer questions, it does not actually know the answers.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

double nine posted:

I'm struggling with a beginner SQL question.

I have two tables; basically they've got the following columns

- customer ID code
- purchase date
- store ID code
- Product ID
- invoice number

These are split into two tables: one with all purchases in the last 12 months, and one for the older purchases.

I need to get a list of all rows for product A (say ID 123456) purchased since 20220101, for customers who haven't purchased product B (e.g. 987654).

What makes it extra fun is that these purchases could be in either table, and I'm not getting there.
Can someone help me build a basic structure for this?

(e: specifically the filter "show if NOT PURCHASED 987654" is giving me trouble, all the rest is obvious)

If you're thinking "I want to filter a query using a condition based on rows not existing", what you want is an operation called an anti-join - it's basically the opposite of an inner join. There are a couple of different ways to do it; probably the most common one is what Hammerite suggested initially (a left join with WHERE some_not_nullable_col IS NULL as a filter).

My preferred way, though, is WHERE NOT EXISTS(<subquery>), for a couple of reasons. First, I think it's the most obvious way to express the intent of the operation - a left join can be used for a ton of things, and in a complex query it can be easy to miss the WHERE some_col IS NULL bit since it can be pretty far away from the left join itself, but NOT EXISTS is always an anti-join. Second, at least in Postgres, although the query planner is pretty good at identifying the most common ways to write an anti-join, the NOT EXISTS form is the one least likely to lead to a bizarrely bad query plan (the NOT IN(<subquery>) form is particularly prone to bad query plans, so avoid that one).

With all that being said, here's my take (postgres typecasting syntax for the datetimes):
SQL code:
with all_purchases as (
  select * from purchases
  union all
  select * from purchases_old
)
select *
from all_purchases
where
    all_purchases.product_id = 123456
    and all_purchases.purchased_at >= '2022-01-01'::timestamptz
    -- exclude rows where a purchase of product 987654 exists for this customer
    and not exists(
      select 1
      from all_purchases as all_purchases_inner
      where
      	all_purchases_inner.customer_id = all_purchases.customer_id
      	and all_purchases_inner.product_id = 987654
    )
If you're on a database that always materializes CTEs, that WITH clause at the top might be problematic, but in postgres versions from the last couple of years it should be no problem, and you can always just convert it to a subquery - I just prefer the CTE for readability. In fact, that entire query is written mostly for readability and I haven't even looked at the query plan, so no guarantees.

Fiddle: https://www.db-fiddle.com/f/rTHPAVU6M2aD78BHpuf7UB/0

TheFluff fucked around with this message at 03:15 on Feb 17, 2023

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Why materialize all the rows, join all the things multiple times, and delay all the filtering logic? This seems like "all of these except for these customers".

(And the chances of SA letting me paste this, which sqlfiddled the formatting...)

code:
with naup as (
    (select customer_id from purchases where product_id = 987654)
    union
    (select customer_id from purchases_old where product_id = 987654)
),
yeap as (
    (select * from purchases
     where product_id = 123456 and purchased_at > '2022-01-01')
    union
    (select * from purchases_old
     where product_id = 123456 and purchased_at > '2022-01-01')
)
select * from yeap
where customer_id not in (
    select customer_id from naup where naup.customer_id = customer_id
)
Won't be able to check hypothetical query plans for a few days.


Macichne Leainig
Jul 26, 2012

by VG

Jabor posted:

That chatgpt answer is really cool because it looks plausible and probably works on a couple of test cases, but it's also completely wrong and doesn't meet the specification at all: each NOT IN subquery only checks one of the two tables, so a customer whose product-B purchase lives in the other table slips right through.

Please do not use chatgpt to answer questions, it does not actually know the answers.

I asked it to give me a null cipher once, and I poo poo you not, it basically told me "the first letter of each word from the sentence 'the dog ran away' spells a message. for example, 'the,' dog', and 'ran' spells "HELP" "

:wtf:

I really wish I had the confidence chatgpt has
