PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
What's special about lot-1 and lot-2? Why not lot-0 and lot-2? Is this really the problem you're trying to solve, or is this how you've tried to simplify the actual situation? (My magic8 ball is twitching near "solving the wrong problem".)


Spikes32
Jul 25, 2013

Happy trees

PhantomOfTheCopier posted:

What's special about lot-1 and lot-2? Why not lot-0 and lot-2? Is this really the problem you're trying to solve, or is this how you've tried to simplify the actual situation? (My magic8 ball is twitching near "solving the wrong problem".)

This is actually the problem I'm trying to solve. -0 is the start of manufacturing, -1 designates one middle part of the manufacturing process, and -2 is the final product. The date associated with -1 is determined to be the production date, which needs to match the date on the final product. Some of the final product dates have been set to the -0 date, and I need to get a list of them. X, Y, and Z will be a six-character lot number starting with the year, e.g. 21-1234.

And you might ask why it was done this way instead of just using another column, which the program is set up to do easily and natively. And as always, the response is: $@#$# previous admin hacked it together.

Spikes32 fucked around with this message at 00:33 on Jan 29, 2021

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Spikes32 posted:

This is actually the problem I'm trying to solve. -0 is the start of manufacturing, -1 designates one middle part of the manufacturing process, and -2 is the final product. The date associated with -1 is determined to be the production date, which needs to match the date on the final product. Some of the final product dates have been set to the -0 date, and I need to get a list of them. X, Y, and Z will be a six-character lot number starting with the year, e.g. 21-1234.

And you might ask why it was done this way instead of just using another column, which the program is set up to do easily and natively. And as always, the response is: $@#$# previous admin hacked it together.
:( Previous admins argh.

There's no way to "do" this in SQL with a simple row filter; you have to write a query that first reshapes the data. In practice that means a sort of "self join", and at that point you're basically using the database as a big clunky calculator, so it will be mostly equivalent to dumping all the data and doing the comparison in an external programming language.

I'd start by finding all the tag-2 dates specifically: SELECT left(lot,-2) l2,d d2 FROM table WHERE right(lot,2)='-2';

Next find all the tag-1 dates that differ.

code:

SELECT left(lot,-2) lot, d
FROM table, (SELECT left(lot,-2) l2, d d2 FROM table WHERE right(lot,2)='-2') AS table2
WHERE right(lot,2)='-1' AND left(lot,-2)=l2 AND d!=d2;
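
If your database supports it, a grouped comparison can also do this in one pass. Rough sketch, untested; I've called the table "lots" here since you'll have a real name there:

code:

SELECT left(lot,-2) AS lot_group
FROM lots
WHERE right(lot,2) IN ('-1','-2')
GROUP BY left(lot,-2)
HAVING count(DISTINCT d) > 1;

That returns each lot group whose -1 and -2 dates disagree.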

Spikes32
Jul 25, 2013

Happy trees
Appreciate it, I'll finagle this into working for me tomorrow, or do something similar via Excel.

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Spikes32 posted:

I have to figure out a query for something and I'm hoping you can help. I'm pretty sure I need to do a join but am having trouble visualizing it. I have the following table and I need a way to find the lots where the -1 and -2 dates do not match. So in this example I would want the query to return Y-1 and/or Y-2, and Z-1 and/or Z-2. X, Y, and Z are variable, but each lot will be replicated two or three times with -0, -1, or -2 after them.

Lot   Date
X-0   1
X-1   2
X-2   2
Y-0   3
Y-1   4
Y-2   5
Z-0   6
Z-1   7
Z-2   6

You could also use subquery factoring if you have it:
code:
WITH split_lots AS
(SELECT
 SUBSTR(lot, 1, length(lot)-2) lot_group,
 SUBSTR(lot, -1, 1) lot_number,
 date
FROM
 table)
SELECT
 block1.lot_group
FROM
 split_lots block1 JOIN
 split_lots block2 ON 
 block1.lot_group = block2.lot_group AND 
 block1.date != block2.date 
WHERE
 block1.lot_number = 1 AND
 block2.lot_number = 2

Scarf
Jun 24, 2005

On sight
I know this should be pretty simple, but I don't have a way of testing my query at the moment... Assume I'm working in Oracle SQL Developer and I need to count the number of cases where the number of days between an application date and a start date is within 7 days.

Would I be able to do something as simple as:

count (CASE WHEN start_date - app_date <= 7 THEN 1 END)

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Scarf posted:

I know this should be pretty simple, but I don't have a way of testing my query at the moment... Assume I'm working in Oracle SQL Developer and I need to count the number of cases where the number of days between an application date and a start date is within 7 days.

Would I be able to do something as simple as:

count (CASE WHEN start_date - app_date <= 7 THEN 1 END)

Yes, as long as they're actually the DATE type; straight subtraction gives you the number of days between them as a decimal.
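
So something like this should work (the table name is just a placeholder):

code:
SELECT COUNT(CASE WHEN start_date - app_date <= 7 THEN 1 END) AS within_week
FROM applications;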

Scarf
Jun 24, 2005

On sight

Moonwolf posted:

Yes, as long as they're actually the DATE type; straight subtraction gives you the number of days between them as a decimal.

Cool, much appreciated!

ModeSix
Mar 14, 2009

So here I am with my stupid question of the day.

I have 3 tables, a user table, a contest entry table, and a votes table: user, entries, votes.



The contest runs on a monthly basis and requires a new contest entry each month to be counted. Each user only gets to vote once per gender in any given month, so they can vote for one male and one female.

I have the following query:
code:
SELECT be.year, 
       be.month, 
       be.user_id, 
       be.gender, 
       be.image, 
       u.username, 
       COUNT(bv.votefor_id) as votes 
FROM   datingsite_beauty_entries be 
       INNER JOIN datingsite_beauty_votes bv 
               ON be.user_id = bv.votefor_id 
       INNER JOIN datingsite_users u 
               ON be.user_id = u.id 
WHERE  be.year = bv.year 
       AND be.month = bv.month 
GROUP  BY bv.votefor_id, 
          be.year, 
          be.month 
Which gives me a nice result output of exactly what I want, (mostly).


How can I have it only display the top X results (based on votes) for each unique year/month/gender combination?

ModeSix fucked around with this message at 21:56 on Feb 8, 2021

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


ModeSix posted:

So here I am with my stupid question of the day.

I have 3 tables, a user table, a contest entry table, and a votes table: user, entries, votes.



The contest runs on a monthly basis and requires a new contest entry each month to be counted. Each user only gets to vote once per gender in any given month, so they can vote for one male and one female.

I have the following query:
code:
SELECT be.year, 
       be.month, 
       be.user_id, 
       be.gender, 
       be.image, 
       u.username, 
       COUNT(bv.votefor_id) as votes 
FROM   datingsite_beauty_entries be 
       INNER JOIN datingsite_beauty_votes bv 
               ON be.user_id = bv.votefor_id 
       INNER JOIN datingsite_users u 
               ON be.user_id = u.id 
WHERE  be.year = bv.year 
       AND be.month = bv.month 
GROUP  BY bv.votefor_id, 
          be.year, 
          be.month 
Which gives me a nice result output of exactly what I want, (mostly).


How can I have it only display the top X results (based on votes) for each unique year/month/gender combination?
What's your rdbms?

This sounds like a job for the ROW_NUMBER() window function, partitioned by year, month, and gender and ordered by COUNT() DESC, then filtering the outer query down to ROW_NUMBER() <= the number of top entries you want to get.
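
Roughly something like this, untested, assuming MySQL 8+ or anything else with window functions (the 3 is just an example cutoff):

code:
SELECT *
FROM   (SELECT be.year,
               be.month,
               be.gender,
               be.user_id,
               u.username,
               COUNT(bv.votefor_id) AS votes,
               ROW_NUMBER() OVER (
                   PARTITION BY be.year, be.month, be.gender
                   ORDER BY COUNT(bv.votefor_id) DESC) AS rn
        FROM   datingsite_beauty_entries be
               INNER JOIN datingsite_beauty_votes bv
                       ON be.user_id = bv.votefor_id
                      AND be.year = bv.year
                      AND be.month = bv.month
               INNER JOIN datingsite_users u
                       ON be.user_id = u.id
        GROUP  BY be.year, be.month, be.gender, be.user_id, u.username) ranked
WHERE  rn <= 3;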

ModeSix
Mar 14, 2009

Nth Doctor posted:

What's your rdbms?

This sounds like a job for the ROW_NUMBER() window function, partitioned by year, month, and gender and ordered by COUNT() DESC, then filtering the outer query down to ROW_NUMBER() <= the number of top entries you want to get.

This is on a MySQL database.

So, ORDER BY count DESC GROUP BY year, month, gender? I am not sure how to partition it. If we can do that then just grab the first row of each grouping, that's exactly what I want to do.

edit: Wow, there's a PARTITION BY clause in SQL. Thanks for pointing me in the direction I need to go on this!

Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
I'm a lazy person and have my JavaScript program doing an insert statement and then running a separate DELETE statement right after that deletes all null entries to clean up the database… is there any reason NOT to do this?

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Empress Brosephine posted:

I'm a lazy person and have my JavaScript program doing an insert statement and then running a separate DELETE statement right after that deletes all null entries to clean up the database… is there any reason NOT to do this?

I mean... it seems really goofy to me to go through the hassle of having the DB do the insert only to purge a portion of it right afterward.

Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
Yeah, it's pretty goofy. I don't want to write a JavaScript check for whether a field is null; I wish SQL had a way to be like "insert into table(value) where value is not null", but that last part doesn't work on insert statements.

Impotence
Nov 8, 2010
Lipstick Apathy
you could always just make that column explicitly NOT NULL so any null inserts get rejected?

Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
Wow, you guys are so much more helpful than Stack Overflow; they told me to get hosed. That's a great idea Bioware. Is there a way in SQL to have it skip the null rows and insert the rest?


Here’s my original question from stackoverflow:

quote:

I feel like I'm missing something obvious, but I'm trying to run this command in pg-promise for node/express:

code:
const condition = pgp.as.format(" WHERE value IS NOT null", values);
const query = pgp.helpers.insert(values, cs) + condition;
Which translates to this:

code:
INSERT INTO "formquestionresponse"("attrib_id","value","response_id") VALUES(63,'Philadelphia Flyers','135'),(64,'Test','135'),(68,null,'135'),(77,null,'135') WHERE value IS NOT null
But I am getting a SQL error 16:03:46 error: syntax error at or near "WHERE"

But I can't for the life of me figure out how to fix that...am I missing something obvious?

Thanks for the help!

Impotence
Nov 8, 2010
Lipstick Apathy
Your application should not be inserting null values if you don't want them in the first place; it's quite literally one line to check that the string isn't empty, so I'm not sure why we're heading down this route. But it's a column constraint, not something per-query; in Postgres: ALTER TABLE formquestionresponse ALTER COLUMN value SET NOT NULL;

This will also fail if you have null values, so you'll have to delete all null values first. Any attempts to insert a null there afterward should hardfail. You'll need to try/catch it.

Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
Ill do that, thank you for the help BioWare

Gatac
Apr 22, 2008

Fifty Cent's next biopic.
I'm mostly with Biowarfare on this. The exception I'd give is if you're inserting multiple rows with an INSERT SELECT and the values are coming from another database object, it might make sense to filter it in the SELECT clause and not bother your application, but for single-row inserts, yeah, do your logic in the application.

Or write a stored procedure in the database that takes the parameters your application provides and does the check for NULL and building the INSERT internally. Stored procedures + bind variables help with keeping the fiddly SQL details out of your application and ensure that you have a standard interface to make grants for, rather than letting your application access every table and having to touch application code whenever you modify your data model.
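
For instance, since the question is pg-promise/Postgres, something along these lines (rough sketch, untested; everything except formquestionresponse and its columns is made up) keeps the null check out of the application entirely:

code:
-- Postgres 11+ for CREATE PROCEDURE
CREATE OR REPLACE PROCEDURE add_response(p_attrib_id int, p_value text, p_response_id text)
LANGUAGE plpgsql AS $$
BEGIN
    -- skip the insert entirely when there's no value
    IF p_value IS NOT NULL THEN
        INSERT INTO formquestionresponse (attrib_id, value, response_id)
        VALUES (p_attrib_id, p_value, p_response_id);
    END IF;
END;
$$;

-- from the application: CALL add_response(63, 'Philadelphia Flyers', '135');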

Empress Brosephine
Mar 31, 2012

by Jeffrey of YOSPOS
Yeah I'm inserting about 100 rows at once so that's why I asked. I just need to not be lazy and write the code in the app lol.

redleader
Aug 18, 2005

Engage according to operational parameters
in mssql you could do something like the following. not sure if postgres or whatever supports this syntax though, but iirc this is standard sql syntax

code:
INSERT INTO "formquestionresponse" ("attrib_id", "value", "response_id")
SELECT attrib_id, value, response_id
FROM (
    VALUES
    (63, 'Philadelphia Flyers', '135'),
    (64, 'Test', '135'),
    (68, null, '135'),
    (77, null, '135')
) vals (attrib_id, value, response_id)
WHERE value IS NOT null

basically just putting the VALUES clause into a subquery and filtering that

i agree with the previous posters that value should be set as non-nullable (in addition to any code/query changes), if having nulls in value doesn't make logical sense in your schema

redleader fucked around with this message at 22:44 on Feb 10, 2021

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Nullable values in columns should really be an opt-in rather than an opt-out, but it's about 50 years late for that.

galenanorth
May 19, 2016

What is the advantage of using a dynamic query in a stored procedure or function, compared to writing it some other way? I'm looking at PostgreSQL's PL/pgSQL's EXECUTE statement, while the Open Web Application Security Project recommends against using dynamic SQL in a stored procedure. For a dynamic query where the only variables are for data (literals), I'm thinking you don't need a dynamic query because you can pass the procedure's arguments straight into a query written normally. If so, that'd leave dynamic queries involving variables for tables and columns.

Is there any equivalent way to write a procedure without a dynamic query using a variable for a table's identifier or a column's identifier? PL/pgSQL doesn't have table variables, so I'm guessing there'd be a workaround involving transforming a table into an array of rows, passing the array as an argument, and then turning it back into a table.

galenanorth fucked around with this message at 23:06 on Feb 12, 2021

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
there is no advantage to using a dynamic query, it's just sometimes you have to do it that way, specifically in your case where you want to pass in a variable for a table/column name. the disadvantages are, the query planner can't cache the plan for your dynamic query, and sql injection like owasp is worried about. postgres has functions to quote your variables safely to avoid sql injection, toward the end of the execute section in the documentation. the easiest way is to use the format function as they describe. as long as you're careful, it's safe.
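
e.g. something like this, just to show the format()/%I pattern (the function itself is made up):

code:
create or replace function count_rows(p_table text)
returns bigint
language plpgsql as $$
declare
    result bigint;
begin
    -- %I quotes the identifier, so a hostile table name can't inject anything
    execute format('select count(*) from %I', p_table) into result;
    return result;
end;
$$;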

i do not suggest doing anything with row arrays to simulate polymorphism. i have attempted this in the past, just to explore how generic i could write things, and that way lies madness. postgres has only a few specialized polymorphic types (anyelement, anyarray, anynonarray, anyenum), and while occasionally you can get away with "ehh here's an anonymous row type, figure it out" in queries, with functions you will eventually have to write down the table and column names. like, if you try to return something generic, it looks like you got away with it, but in fact the caller has to specify the type it expects.

Gatac
Apr 22, 2008

Fifty Cent's next biopic.
Pretty much anything can be done with dynamic SQL, the question is always "What do you need it for?"

Heed well my tale of woe: I do some ETL stuff in Oracle where, depending on various factors, I want to use one of several objects as a source for the data entry and decide which I'm using at runtime. Dynamic SQL seemed like it would save me a lot of "If this object then this bigass SELECT, if this object then this bigass SELECT with all the same columns and a slightly different FROM-clause"-ness. That might have been ugly but it would've been far faster to actually code than getting the dynamic SQL working correctly, chiefly made difficult by not being able to get useful errors at compile time and only erroring out at runtime, where I then needed good logging to figure things out. In the end, I hacked it together with some ref cursor jank. It works and performs reasonably well, I just wouldn't recommend anyone else go this way. Unless your data model is bad (realtalk, it probably is in one way or another but let's assume there's no wanton abuse of the common table), a stored function that does the same sorts of transformation on an object tbd with a parameter is probably overly broad and prone to all sorts of subtle errors.

I recommend a look at this Steven Feuerstein post. It's brief, the comments have some further discussion on the topic and if you still come away from it thinking your use case requires it, then by all means have at it, just ask yourself first if it really is the best way to solve your concrete issue.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


D-SQL has its place, and while OWASP has legitimate concerns about injection, in SQL Server's implementation, at least, it can even be parameterized, and even have plans be stored and reused.
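
For example, a parameterized call looks like this (the table and filter are made up, just to show the shape of sp_executesql):

code:
DECLARE @sql nvarchar(max) =
    N'SELECT * FROM dbo.Orders WHERE CustomerId = @cust';

-- the parameter is passed separately, so the plan can be cached and nothing gets injected
EXEC sp_executesql @sql, N'@cust int', @cust = 42;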

My company has used D-SQL to let extremely complex queries chop off entire sets of expensive joins that weren't needed based on the tenant's configuration.

I also rolled my own replication using Change Tracking and triply nested D-SQL which SUCKED to write but was flexible enough to support keeping > 100 tables in over a dozen databases synced across 8 shards. I handled the new insert case and my brain basically leaked out of my ears.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Yeah I wouldn’t completely denounce dynamic SQL. It has its place, but typically if you see someone jumping to it and they aren’t experienced it’s probably not a good use case.

In my personal experience it has been extremely valuable for reducing the complexity of kitchen sink queries, which I believe I’ve posted about in this thread a few times.

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Ooh, I've hit for once a classic SQL quiz-style problem.

We've got a Players table, a Seasons table, and a Players-Seasons table with each player's score. We need to show a join of all three tables, but the catch is that the currently active Season(s) must always be included, even if the player did not play in them (and so the Players-Seasons record is missing and the Score column is null).

Here's the SQL Fiddle if you want to try.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

NihilCredo posted:

Ooh, I've hit for once a classic SQL quiz-style problem.

We've got a Players table, a Seasons table, and a Players-Seasons table with each player's score. We need to show a join of all three tables, but the catch is that the currently active Season(s) must always be included, even if the player did not play in them (and so the Players-Seasons record is missing and the Score column is null).

Here's the SQL Fiddle if you want to try.

http://sqlfiddle.com/#!18/b0e95/10

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
i made all your columns not null btw because i assumed that was an oversight

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
code:
select p.playerid, s.seasonid, coalesce(ps.score, 0)
from players p cross join seasons s
left join players_seasons ps on ps.playerid = p.playerid and ps.seasonid = s.seasonid
where
  ps.score is not null
  or s.active = 1
Kind of a brute force approach, cross join players and seasons and filter down to the ones that are relevant. The trick is that you can't be joining into the seasons table based on players_seasons, since you're not going to have that link on some of the things that you need it for. You can drop the coalescing if you want an explicit null there ofc.

Alternatively you can union two subqueries (one for the active season, one for older seasons) if you need ~blazing fast performance~ and your players_seasons are super sparse or something.
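
Something like this, off the top of my head (untested):

code:
select ps.playerid, ps.seasonid, ps.score
from players_seasons ps
union all
select p.playerid, s.seasonid, null
from players p
cross join seasons s
where s.active = 1
  and not exists (
    select 1 from players_seasons ps2
    where ps2.playerid = p.playerid and ps2.seasonid = s.seasonid
  )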

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Thanks y'all :) Cross joins are something I only reach for when I'm running diagnostics and need to do some gnarly one-time aggregation; I forgot that they have "legitimate" uses as well.

I'll check if it still uses indexes this way though, if it doesn't i'll probably refactor to the UNION approach.

Hammerite posted:

i made all your columns not null btw because i assumed that was an oversight

Don't worry, it was just a minimal repro, actual tables are totally different.

NihilCredo fucked around with this message at 15:39 on Feb 15, 2021

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

NihilCredo posted:

Don't worry, it was just a minimal repro, actual tables are totally different.

I meant more that I wasn't willing to go to the trouble of considering whether and how things might be different if all those columns really were nullable. I was offering a solution to the problem I thought you actually wanted a solution to, rather than a solution to the original problem per se.

galenanorth
May 19, 2016

edit: nevermind

galenanorth fucked around with this message at 04:15 on Feb 16, 2021

Tea Bone
Feb 18, 2011

I'm going for gasps.
I have MySQL server 8.0.20 installed from the oracle installer on my Mac (Catalina).

Intermittently after a reboot MySQL won't start and gets caught up in a start-up loop. I have to edit /etc/my.cnf and add:
code:
innodb_force_recovery=3
to it to get MySQL to start. If I then run
code:
mysqlcheck --all-databases
it returns telling me all tables are OK.
If I switch back to innodb_force_recovery=0 at this point, the problem persists. To fix it I have to back everything up, then drop the offending table (since mysqlcheck doesn't show up anything, I have to work out which table is the bad one through trial and error), then I can go back to innodb_force_recovery=0, start up MySQL and import my backup.

It's not after every restart, but about 50% of the time. Does anyone have any idea what's causing this?

galenanorth
May 19, 2016

I have two questions about PostgreSQL. The first is about jsonpath.

https://blog.sql-workbench.eu/post/json-path/
https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE

This blog implies that jsonb_path_exists() and jsonb_path_match() will have slower performance than @? and @@ because they cannot be indexed, so the index on the whole jsonb column will be used. The documentation on jsonb_path_exists_tz() etc., on the other hand, says "Their counterparts are immutable, and so can be used in indexes". Is the first source correct? There appears to be a contradiction, but maybe there is some resolution to it such as functions being used in indexes being different from functions being indexed.

The second is about tsvector and tsquery. What is the difference between an anchored text search and a non-anchored text search? My PostgreSQL textbook mentions "PostgreSQL provides a full-text search capability, which is used to overcome SQL pattern matching operators, including LIKE and ILIKE, boosting the performance of the text search. For example, even though an index on text using the text_pattern_op class is supported, this index can't be used to match a non-anchored text search." I don't really get what they mean, and this information is hard to find in a search.

galenanorth fucked around with this message at 20:34 on Feb 23, 2021

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

galenanorth posted:

I have a question about PostgreSQL's jsonpath.

https://blog.sql-workbench.eu/post/json-path/
https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE

This blog implies that jsonb_path_exists() and jsonb_path_match() will have slower performance than @? and @@ because they cannot be indexed, so the index on the whole jsonb column will be used. The documentation on jsonb_path_exists_tz() etc., on the other hand, says "Their counterparts are immutable, and so can be used in indexes". Is the first source correct? There appears to be a contradiction, but maybe there is some resolution to it such as functions being used in indexes being different from functions being indexed.

I haven't dug deep into JSONB indexing in postgres but just looking at section 8.14.4 of the official docs I think the first link is just using confusing terminology and talking about something entirely different. When he says "the biggest drawback of the new JSON path functions is, that they can’t be indexed" I think what he actually means is that the GIN index does not support queries using those operators (or rather, not-operators). Postgres doesn't try to parse the path expression to determine if it's going to result in a condition that could be converted into one of the operators the GIN index does support. It would be less confusing to say that these functions "cannot use the index we just created", IMO.

On the other hand, when the postgres docs you linked say "their counterparts are immutable, and so can be used in indexes", what they're referring to is that you can create an index on an expression involving one of those functions. I don't know off the top of my head why or when you'd actually want to do that, but what the note is trying to tell you is that the return value of the _tz variants depend on a server setting that could change at any time and as such you can't create an index on an expression involving them (because if you did and then changed the server timezone the index would contain different results).
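
To make it concrete (made-up table "docs" with a jsonb column "payload"):

code:
CREATE INDEX docs_payload_gin ON docs USING gin (payload jsonb_path_ops);

-- @? is one of the operators the GIN index supports, so this query can use it
SELECT * FROM docs WHERE payload @? '$.items[*] ? (@.qty > 2)';

-- same condition via the function form: valid SQL, but the planner won't use that index for it
SELECT * FROM docs WHERE jsonb_path_exists(payload, '$.items[*] ? (@.qty > 2)');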

TheFluff fucked around with this message at 20:23 on Feb 23, 2021

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
Double posting because you edited to add a second question:

galenanorth posted:

The second is about tsvector and tsquery. What is the difference between an anchored text search and a non-anchored text search? My PostgreSQL textbook mentions "PostgreSQL provides a full-text search capability, which is used to overcome SQL pattern matching operators, including LIKE and ILIKE, boosting the performance of the text search. For example, even though an index on text using the text_pattern_op class is supported, this index can't be used to match a non-anchored text search." I don't really get what they mean, and this information is hard to find in a search.

Anchoring here is used in the regular expression sense, as in anchoring a pattern to the start or end of a string. If you write WHERE someCol LIKE '%something', the pattern is anchored at the end of the string (there's no wildcard there; the pattern is fixed - anchored - to the end of the string and can only match there), but not at the start of the string (anything can go there and it can be arbitrarily long). Since b-tree indexes rely on sorting to work and strings are sorted based on what's at the start of the string, postgres can't use b-tree indexes for LIKE conditions or regular expressions that start with a wildcard - that is, patterns that are not anchored at the start of the string - because it wouldn't know where in the index to start looking. There's some additional caveats as well though, and in general LIKE and regular expressions are very crude tools, so fulltext indexes are usually what you actually want. For the full details see the manual, section 11.2, about index types and what operators they support.

e: just to be clear, b-tree indexes are the default in postgres, what you get if you just say CREATE INDEX without specifying an index type.
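
A quick illustration (made-up table "articles" with a text column "name"):

code:
CREATE INDEX articles_name_idx ON articles (name text_pattern_ops);

SELECT * FROM articles WHERE name LIKE 'postgres%';  -- anchored at the start, can use the index
SELECT * FROM articles WHERE name LIKE '%postgres';  -- not anchored at the start, has to scan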

TheFluff fucked around with this message at 00:18 on Feb 24, 2021

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Anybody wants to hear about a schema design I'm mulling over?

Context: Postgres 11 (soon 12, hopefully). The data is, for the purpose of discussion, a simple key-value store. In 99% of cases the same mapping applies to all users; however, users _may_:

1) have no value for a particular key
2) have a different value for a particular key
3) have an additional key-value that only exists for them

There are hundreds of thousands of keys, but only a few hundred users. In the first and second case, the different value will usually apply to a significant subset of the users. In the third case, it will usually apply to individual users only.

We need to efficiently fetch the key-values for the current user.

For the sake of readability, assume all columns are NOT NULL unless I explicitly wrote NULL.

Classic design:

SQL code:
create table codes (code_id int primary key, key text unique, value text);
create table codes_users (code_user_id int primary key, key text, user_id int, value text NULL, unique(key, user_id));

insert into codes values (1,'apples','MMMM'),(2, 'pears','PPPP'),(3,'bananas', 'BBBB');
insert into codes_users values (1,'apples',40,null),(2, 'melons',40,'MMMM'),(3,'pears', 40,'PPP222');

select  
*,
coalesce(codes_users.key, codes.key) as selected_key,
case when code_user_id is not null then codes_users.value else codes.value end as selected_value
from codes
full join codes_users on codes_users.user_id = 40 and codes.key = codes_users.key;

Single-table design:

code:
create table codes (code_id int primary key, key text, user_id int NULL, value text NULL, unique (key, user_id));

insert into codes values (1,'apples',null,'MMMM'),(2, 'pears',null,'PPPP'),(3,'bananas', null,'BBBB');
insert into codes values (4,'apples',40,null),(5, 'melons',40,'MMMM'),(6,'pears', 40,'PPP222');
insert into codes values (7,'apples',100,null),(8, 'pears', 100,'PPP222');


select * from codes where code_id in (
	select max(code_id order by (codes.user_id is not null))
	from codes
	where codes.user_id is null or 40 = codes.user_id
	group by key
);

Both these designs result in a large number of duplicate values, eg. if ~200 of the users share the same 'overrides'.

The former doesn't require a subquery, but the latter has a much simpler SELECT part ("value" is actually multiple columns that would all need coalescing).


Alternative design using Postgres array columns:

SQL code:
create table codes (code_id int primary key, key text, user_id int[] NULL, value text NULL);

insert into codes values (1,'apples',null,'MMMM'),(2, 'pears',null,'PPPP'),(3,'bananas', null,'BBBB');
insert into codes values (4,'apples','{40,100}',null),(5, 'melons','{40}','MMMM'),(6,'pears', '{40,100,130}','PPP222');

select * from codes where code_id in (
	select max(code_id order by (codes.user_id is not null))
	from codes
	where codes.user_id is null or 40 = any(codes.user_id)
	group by key
);
This version avoids duplicating rows. Efficient indexing is also possible via the intarray extension and a GIN index with the gin__int_ops operator class.
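
(The index setup I have in mind is roughly this, untested:)

SQL code:
create extension if not exists intarray;
create index codes_user_id_gin on codes using gin (user_id gin__int_ops);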

However, enforcing uniqueness becomes pretty tricky. It's also not very extensible - I will not be able to add any user-specific metadata beyond the user_id.


To avoid duplicating rows without resorting to array columns, I'd need to keep the values in one column and specify overrides by referring to the record's primary key

code:
-- i've specified nullability here since it's important
create table codes (code_id int primary key, key text, value text NULL);
create table codes_users (code_user_id int primary key, key text, user_id int NULL, code_id int NULL, unique(key, user_id));

insert into codes values (1,'apples','MMMM'),(2, 'pears','PPPP'),(3,'bananas', 'BBBB');
insert into codes values (4,'apples',null),(5, 'melons','MMMM'),(6,'pears', 'PPP222');
insert into codes_users values (1,'apples',40,4),(2, 'melons',40,5),(3,'pears',40,6);


-- doesn't quite work, it's missing bananas:
select * from codes where code_id in (
	select max(code_id order by (codes_users.user_id is not null))
	from codes_users
	where codes_users.user_id is null or 40 = codes_users.user_id
	group by key
);

But I'm not sure how to represent the 'default' case here (same key-value for every user) short of adding millions of entries to codes_users for all combinations.


Gatac
Apr 22, 2008

Fifty Cent's next biopic.
My immediate thought would be to use a separate lookup table for the keys with a NULLable column for a default value. I'd probably have a lookup table anyway foreign keyed to CODES.KEY and use a numeric value for that, rather than straight text. That way you have a definitive list of all possible keys and can rename them down the line without loving with existing rows in CODES.
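
Rough sketch of what I mean (names invented):

code:
create table code_keys (key_id int primary key, key_name text unique, default_value text NULL);
create table codes (code_id int primary key, key_id int references code_keys (key_id), user_id int, value text NULL, unique (key_id, user_id));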
