|
Thanks for the ideas guys! PhantomOfTheCopier posted:Shrug, start small. Luckily the asks they want out of it are very small, and I don't have to abide by any sort of legal requirements here, otherwise I would have told them to bite the bullet and go buy an actual solution. I'm gonna have a front end website where dudes can check in by selecting their name from a drop-down box and hitting submit, of course with some amount of sanity checking to make sure they aren't double-submitting by accident. Then I'm gonna create an "admin" console for the owner to do basic viewing and editing of the tables. A question: Why do you suggest having a setup like member_id and member_info tables instead of just one members table? Defenestrategy fucked around with this message at 01:35 on Feb 3, 2023 |
# ? Feb 3, 2023 01:33 |
|
|
# ? Jun 10, 2024 12:05 |
|
Defenestrategy posted:A question: Why do you suggest having a setup like member_id and member_info tables instead of just one members table? Frontloaded and generally good assumptions about first normal form. If a member can be in an arbitrary number of classes and needs data tracked independently for each, that can't be done without breaking the nicely flattened, aka normalized, structure. Hence a second table that can repeat the same member_id, one row per class they're taking, holding the data relevant to that class. Classes are defined in their own table, and class occurrences live in another table that references a class_id, again multiple times. Then attendance of a member at a specific timeslot of a class goes in yet another table, referencing both of the others. The other relevant part of this 'one here, many there' is that the many-entries table is foreign keyed to the single-entry table. This is also a case where your primary key should just be an integer you forget about; it's the foreign key constraint that really drives the table. That's one of my big db tips: constraints, constraints, constraints. So your schema might look like: code:
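Very roughly, the layout described above might look like this. A minimal sketch using sqlite (Python stdlib) purely for illustration; the table and column names are my own guesses, not anything actually posted in the thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # sqlite needs FK enforcement opted in
cur = con.cursor()

cur.executescript("""
CREATE TABLE members (
    member_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE classes (
    class_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
-- one row per class a member is taking
CREATE TABLE registrations (
    registration_id INTEGER PRIMARY KEY,
    member_id INTEGER NOT NULL REFERENCES members(member_id),
    class_id  INTEGER NOT NULL REFERENCES classes(class_id)
);
-- one row per scheduled occurrence of a class
CREATE TABLE class_occurrences (
    occurrence_id INTEGER PRIMARY KEY,
    class_id INTEGER NOT NULL REFERENCES classes(class_id),
    starts_at TEXT NOT NULL
);
-- attendance references both a member and an occurrence
CREATE TABLE attendance (
    attendance_id INTEGER PRIMARY KEY,
    member_id INTEGER NOT NULL REFERENCES members(member_id),
    occurrence_id INTEGER NOT NULL REFERENCES class_occurrences(occurrence_id),
    UNIQUE (member_id, occurrence_id)  -- blocks accidental double check-ins
);
""")

cur.execute("INSERT INTO members (name) VALUES ('alice')")
cur.execute("INSERT INTO classes (title) VALUES ('bjj')")
cur.execute("INSERT INTO class_occurrences (class_id, starts_at) VALUES (1, '2023-02-03 18:00')")

# a valid check-in goes through
cur.execute("INSERT INTO attendance (member_id, occurrence_id) VALUES (1, 1)")

# a bogus member_id is rejected by the FK constraint instead of corrupting data
ok = True
try:
    cur.execute("INSERT INTO attendance (member_id, occurrence_id) VALUES (999, 1)")
except sqlite3.IntegrityError as e:
    ok = False
    print("rejected:", e)
```

The UNIQUE constraint here also doubles as the double-submit guard mentioned for the check-in page: the db itself refuses the duplicate and the app just handles the error.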
Notice this allows attendance to be marked for classes the member isn't registered in. Maybe someone needs to make up a class and you want that; otherwise add a column with a foreign key requiring a registration_id. Building constraints into the schema like this lets you avoid bricking the db with small coding errors, it'll just tell your app to hecc off. Handling those errors instead of trying to cover every case on the app side is also easier. Ranzear fucked around with this message at 05:21 on Feb 3, 2023 |
# ? Feb 3, 2023 05:17 |
|
Defenestrategy posted:Thanks for the ideas guys! Get some cheap UPC key fobs and a barcode scanner (also hella cheap, $30 with a stand). Someone enrolls, assign them a fob, they scan in when they get there, no mistakes. Alternately, don't do any of this, because you're going to be expected to be tech support for something you didn't get paid for, but your buddy is going to act like you did. Just-In-Timeberlake fucked around with this message at 14:30 on Feb 3, 2023 |
# ? Feb 3, 2023 14:25 |
|
Stupid analytics-101 type question, and a basic SQL Fiddle with examples: http://sqlfiddle.com/#!17/41a4b/14 I've got people, notes, and codes, a minimal model looks like this: SQL code:
So I want something like: pre:
person_id  notes  codes
1          4      5
2          8      0
3          0      16
SQL code:
SQL code:
|
# ? Feb 3, 2023 23:37 |
|
You want COALESCE(count(foo.bar), 0) to get zeroes instead of null, then it's just SQL code:
code:
COUNT(*) is an antipattern. Put what you want to count in it. Double edit: gently caress gently caress gently caress. Everything is wrong. Hang on... Triple edit: Subqueries it is. This works fine until you try to do more than one join is all: SQL code:
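For reference, here's a runnable sketch of both the join that goes wrong and the subquery form that fixes it; sqlite via Python rather than the Postgres fiddle, with the person/notes/codes schema reconstructed from the post (column names are my guesses):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE notes  (note_id INTEGER PRIMARY KEY, person_id INTEGER);
CREATE TABLE codes  (code_id INTEGER PRIMARY KEY, person_id INTEGER);
INSERT INTO person VALUES (1), (2), (3);
""")
cur.executemany("INSERT INTO notes (person_id) VALUES (?)", [(1,)] * 4 + [(2,)] * 8)
cur.executemany("INSERT INTO codes (person_id) VALUES (?)", [(1,)] * 5 + [(3,)] * 16)

# joining both child tables at once fans out:
# person 1's 4 note rows x 5 code rows = 20 rows, so both counts come back as 20
wrong = cur.execute("""
    SELECT p.person_id, COUNT(n.note_id), COUNT(c.code_id)
    FROM person p
    LEFT JOIN notes n ON n.person_id = p.person_id
    LEFT JOIN codes c ON c.person_id = p.person_id
    GROUP BY p.person_id
    ORDER BY p.person_id
""").fetchall()

# correlated subqueries keep each count independent; a scalar COUNT(*)
# already returns 0 when there are no rows, so no COALESCE is needed here
right = cur.execute("""
    SELECT p.person_id,
           (SELECT COUNT(*) FROM notes n WHERE n.person_id = p.person_id) AS notes,
           (SELECT COUNT(*) FROM codes c WHERE c.person_id = p.person_id) AS codes
    FROM person p
    ORDER BY p.person_id
""").fetchall()
print(wrong)
print(right)
```

Only person 1 is mangled by the double join (everyone with rows in just one table happens to come out right), which is exactly why this bug is easy to miss on small test data.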
Ranzear fucked around with this message at 04:17 on Feb 4, 2023 |
# ? Feb 4, 2023 02:20 |
|
I feel like there's something really fundamental I'm missing. Playing around in SQL Fiddle, the execution plan does look better with the two CTEs: one to count each type of record, and then a JOIN on person_id at the end. I hacked through this by making two separate queries and then smashing the results together in python, because no matter which way I tried to get this all out in one query I was going OOM and dying with 360GB of RAM across the spark cluster, and I'd have to bother IT to get a bigger cluster. Notes is >1 billion records, there's a couple million people, and I didn't bother to count codes but it's somewhere between millions and a billion.
|
# ? Feb 4, 2023 03:05 |
|
Fixed mine with subqueries with where clauses to keep the counts separate. When just counting one other table the join works but the count spills over with two or more.
|
# ? Feb 4, 2023 03:30 |
|
Why does each cte also need the person table? ... It won't let me post... select person_id, count(note) from notes group by person_id should work just fine in a CTE. Ten million people is 10M x 2 integers for the id+count tables and another 10M for the results. It should only be 60M x 64bit, or 480MB. Even at 4x for overhead, storage is less than 2GB? Likely more if everything is actually a text field. Start with WHERE person_id less than 500 in your ctes to see if the results are even sensible or if it's doing something dumb.
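A sketch of the two-CTE shape being suggested, again in sqlite via Python with the reconstructed person/notes/codes schema (names are my guesses): each CTE aggregates only its own table, and the person table appears just once in the final join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE notes  (note_id INTEGER PRIMARY KEY, person_id INTEGER);
CREATE TABLE codes  (code_id INTEGER PRIMARY KEY, person_id INTEGER);
INSERT INTO person VALUES (1), (2), (3);
""")
cur.executemany("INSERT INTO notes (person_id) VALUES (?)", [(1,)] * 4 + [(2,)] * 8)
cur.executemany("INSERT INTO codes (person_id) VALUES (?)", [(1,)] * 5 + [(3,)] * 16)

# each CTE produces a small (person_id, count) table on its own;
# the joins at the end are one row per person, so nothing fans out
rows = cur.execute("""
    WITH note_counts AS (
        SELECT person_id, COUNT(*) AS notes FROM notes GROUP BY person_id
    ),
    code_counts AS (
        SELECT person_id, COUNT(*) AS codes FROM codes GROUP BY person_id
    )
    SELECT p.person_id,
           COALESCE(n.notes, 0),
           COALESCE(c.codes, 0)
    FROM person p
    LEFT JOIN note_counts n ON n.person_id = p.person_id
    LEFT JOIN code_counts c ON c.person_id = p.person_id
    ORDER BY p.person_id
""").fetchall()
print(rows)
```

Here the COALESCE is actually needed, since a person with no rows in a child table gets NULL from the LEFT JOIN rather than 0.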
|
# ? Feb 4, 2023 03:42 |
|
Ranzear posted:COUNT(*) is an antipattern. Put what you want to count in it. What the gently caress Postgre? COUNT(*) actually no-ops the compare and is faster? So much for my "it costs nothing to have more clarity here" ideal. The speed difference shows up in the Fiddle even when the plan is the same. Edit: I thought it might be the table aliases in the counts making it 10x slower for some reason, but it's just being weird all around. I'd still prefer to filter out null person_ids unless I had a constraint. Ranzear fucked around with this message at 04:24 on Feb 4, 2023 |
# ? Feb 4, 2023 04:02 |
|
PhantomOfTheCopier posted:Why does each cte also need the person table? I don't need the person table in each CTE, and that solves the memory usage in spark! Good call. It's super fast and simple now with the two CTEs. As for resource usage: Spark does really unpredictable things compared to any relational database that I've used, because it's not a relational database, it is just pretending to create one over semi-structured data. I hate Spark, but it seems to be the dominant tool and what gets provided as output from whatever data processing pipeline companies have now.
|
# ? Feb 4, 2023 06:34 |
|
Twerk from Home posted:I don't need the person table in each CTE, and that solves the memory usage in spark! Good call. It's super fast and simple now with the two CTEs. I certainly don't know but it's probably better to think of it as map reduce. Joins are going to be trouble but materialized data (cte and some subselects) can be handled in separate subprocesses. It's easy, therefore, for it to build tons of small reference "tables" of intermediate results and do one big join at the end, but doing joins for every row is just a big Cartesian mess. In fact with Spark you should probably just imagine that each table is stored somewhere else in the world. But I don't know Spark so
|
# ? Feb 4, 2023 06:49 |
|
Ranzear posted:COUNT(*) is an antipattern. what the heck is this take?! count(*) literally just counts the number of rows in the resultset; it's a perfectly legitimate thing to want to do and is perfectly clear in intent. putting an expression in there means something different: it counts the rows where that expression isn't null. if you don't actually care about asserting that, then it's not just confusing; as you discovered, it also forces the db to do an entirely avoidable null check! TheFluff fucked around with this message at 15:41 on Feb 4, 2023 |
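The semantic difference is easy to demo; a minimal sqlite example (the table and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT);
INSERT INTO t (val) VALUES ('a'), ('b'), (NULL);
""")
# COUNT(*) counts rows in the group;
# COUNT(val) counts only rows where val IS NOT NULL
star, by_col = cur.execute("SELECT COUNT(*), COUNT(val) FROM t").fetchone()
print(star, by_col)  # 3 2
```

So COUNT(col) is the right tool when the null filter is the point (e.g. counting matched rows after a LEFT JOIN), and COUNT(*) is the right tool when you just want row counts.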
# ? Feb 4, 2023 15:30 |
|
TheFluff posted:
Maybe Count(1) will make them happy lol
|
# ? Feb 4, 2023 15:55 |
|
regardless of anything else, COUNT(*) is also idiomatic SQL and RDBMS are likely to take the time to make sure it's performant
|
# ? Feb 4, 2023 17:02 |
|
I mean, I would never deploy production code with a count(*) or select *, partly due to reliability concerns (i.e. if you return all columns and the column count changes, you can easily break middle-tier code, etc.) and partly due to strict database development standards, but for one-off queries or non-deployed code… I do it all the time and it's super handy. Kind of like order by 1,2,3,4 instead of spelling out the columns. Yes it should work in production, but just be explicit!
|
# ? Feb 4, 2023 18:30 |
|
Condimentalist posted:I mean, I would never deploy production code with a count(*) or select (*), partly due to reliability concerns (I.e. if you return all columns and the column count changes, you can easily break middle tier code, etc) and partly due to strict database development standards, but for one off queries or non deployed code… I do it all the time and it's super handy. But... count(*) doesn't care about the actual columns?
|
# ? Feb 4, 2023 18:34 |
|
COUNT(*) and SELECT * are entirely different
|
# ? Feb 4, 2023 18:50 |
|
Condimentalist posted:I mean, I would never deploy production code with a count(*) or select (*), partly due to reliability concerns (I.e. if you return all columns and the column count changes, you can easily break middle tier code, etc) and partly due to strict database development standards, but for one off queries or non deployed code… I do it all the time and it’s super handy. COUNT(*) has nothing to do with columns
|
# ? Feb 4, 2023 20:05 |
|
TheFluff posted:
Maybe 80% poo poo take, 20% misremembered MariaDB quirks: Something about always using a not null column in count() which probably results in the exact same no-op as Postgre with *, and then there's a level of explicitness and clarity that comes of using the primary key in there without the performance penalty. What probably matters more is picking something and sticking to it. Also a safe bet I read that five years ago specifically about Aria Storage Engine which still isn't done. Ranzear fucked around with this message at 01:48 on Feb 5, 2023 |
# ? Feb 5, 2023 01:37 |
|
Just-In-Timeberlake posted:COUNT(*) has nothing to do with columns I don't think I said it did? I was making two separate statements about the asterisk in SQL; I should have been clearer in my post. Neither of those asterisk-related statements should be deployed to production, was my point.
|
# ? Feb 5, 2023 02:17 |
|
Condimentalist posted:I don’t think I said it did? You are misunderstanding how COUNT(*) works vs COUNT(col). https://learnsql.com/blog/difference-between-count-distinct/ RandomBlue fucked around with this message at 05:06 on Feb 5, 2023 |
# ? Feb 5, 2023 05:02 |
|
Condimentalist posted:I don’t think I said it did? Ok I’m sorry, what? What do you think the potential issues of SELECT COUNT(*) are exactly?
|
# ? Feb 5, 2023 07:05 |
|
Ruggan posted:Ok I’m sorry, what? What do you think the potential issues of SELECT COUNT(*) are exactly? That provides a hint, however: count(key) may force more appropriate index usage particularly for conditional counts... But in practice I suspect this is a difficult case to hit these days. It's weird because I "know to avoid" count* on PG but, beyond some memory of it being slow, there may be little remaining reason beyond a desire to have something explicit and "ugh that's a mysql thing".
|
# ? Feb 6, 2023 08:08 |
|
One other thing with count is that it gets you the exact, correct count, which means it has to wait for locks, scan things, and generally do work. If you really need an exact count, that's fine. But a lot of the time people take a count just to display something like the total number of pages, where it doesn't really matter if it's exact. In that case grab the estimate from one of the stats tables.
|
# ? Feb 6, 2023 09:37 |
|
Weird question, but in your work as a database administrator have you ever been confronted with an ethical quandary, or been asked to do something with data that you weren't comfortable with or didn't feel was appropriate, and how did you handle it?
|
# ? Feb 11, 2023 10:40 |
|
I would blow Dane Cook posted:Weird question, but in your work as a database administrator have you ever been confronted with an ethical quandary or asked to do something with data that you weren't comfortable with/felt appropriate and how did you handle it? While not a DB-admin per se, I've made an anonymous complaint to a regulating authority regarding use of sensitive information.
|
# ? Feb 11, 2023 11:34 |
|
I would blow Dane Cook posted:Weird question, but in your work as a database administrator have you ever been confronted with an ethical quandary or asked to do something with data that you weren't comfortable with/felt appropriate and how did you handle it? Not a dba either, but 20 or so years ago I was responsible for a medical report that was included in a "discharge summary" report - a report sent to a patient's GP after an inpatient stay. The report included a summarised set of medical tests: test name, date, value, and normal/expected range. I was responsible for re-writing/modernising the report, basically turning it from raw data into a pivoted set to save pages when faxed (lol). During the re-write I found a logic error transposing data for a small subset of tests, for all patients since the report was first written - going back multiple years. The resulting error would show both false positives and false negatives, depending on which data was transposed. I thought it was a huge deal; my boss thought we should just fix it and do nothing. I was just starting out, so not all that confident, but given the scale of the issue I was thinking I might need to go to the media or something, as I wasn't getting any traction on it within my team. I don't recall how it happened now, but I ended up speaking to the medical director of the hospital, and he explained that the specific tests likely to be problematic had a shelf life - they were relevant only for around two or three months once administered. Also, if they were abnormal there would have been a text note included in the report, and abnormal tests would have triggered a further round of tests reported differently - so the risk of having non-obvious wrong data on a patient's permanent record was acceptable.
I still think about the whole thing. I don't know if the outcome was right, but ultimately I didn't know enough about the medico-legal side then and had to trust that any fallout would land on the director at that stage - as his instructions to me were clear. I think I would do things differently if the same scenario happened to me now; in particular I would get a lot more in writing, but also my technical and business/people skills are way better, so I would be able to investigate and advocate my position more effectively. I really did come close to going to the media, and I think the only thing that stopped me was recognising that doing so would most likely have tanked my career just as I was starting out. I've come across a few ethically dodgy situations at work since, but nothing anywhere near the scope of this one, and none to do with data I've been responsible for. I think going through this was a good learning experience - being very careful with sensitive data is certainly something I factor into my work practice now.
|
# ? Feb 12, 2023 23:29 |
|
Does anybody have experience with postgres' indices on expressions, like substr(somevarchar, 3, 4)? Basically, our customer is migrating a large database from DB2/zOS to postgres. Our system runs a few queries on that database, including joins, filtering and aggregations. Since it's an automotive enterprise with partially decades-old tables and records, it also includes the odd char(n) field where you have to filter on the nth character and suchlike. We are now in the situation where we may request a list of indices. I know that postgres supports indices on expressions, but I've got no idea whether they're actually used by the query planner outside of a few corner cases.
|
# ? Feb 14, 2023 17:14 |
|
Wipfmetz posted:Does anybody have experience with postgres' indices on expressions, like substr(somevarchar, 3 ,4)? You'll usually need the exact same expression in the query for a functional index to be used, much as you need the partitioning expression to appear verbatim for partition pruning to kick in. If the query matches the index's expression, though, it should get used unless the stats say that another access path to the table is going to be more selective anyway.
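sqlite happens to support expression indexes with the same exact-match rule, so here's a quick illustration of the "same expression" requirement (Python stdlib, made-up table; Postgres behaves analogously):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, somevarchar TEXT)")
# index on an expression rather than a plain column
cur.execute("CREATE INDEX idx_sub ON parts (substr(somevarchar, 3, 4))")

# identical expression in the WHERE clause: the planner can use the index
hit = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM parts WHERE substr(somevarchar, 3, 4) = 'abcd'"
).fetchall()

# different arguments: no longer the indexed expression, falls back to a scan
miss = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM parts WHERE substr(somevarchar, 2, 4) = 'abcd'"
).fetchall()
print(hit)
print(miss)
```

In Postgres you'd check the same thing with EXPLAIN, and the same caveat applies: rewrite the query's expression even slightly and the functional index silently stops being a candidate.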
|
# ? Feb 14, 2023 17:38 |
|
Cool. "same expression" sounds like a very doable requirement. Seems to be worth a try.
|
# ? Feb 14, 2023 18:10 |
|
I'm struggling with a beginner SQL question. I have two tables with the following columns:
- customer ID code
- purchase date
- store ID code
- product ID
- invoice number
These are split into two tables: one with all purchases in the last 12 months, and one for the older purchases. I need to get a list of all rows for a product A (say ID 123456) purchased since 20220101, where the customer hasn't purchased product B (e.g. 987654). What makes it extra fun is that these could be found in either table, and I'm not getting there. Can someone help build a basic structure for this? (e: specifically the filter "show if NOT PURCHASED 987654" is giving me trouble, all the rest is obvious) double nine fucked around with this message at 20:12 on Feb 16, 2023 |
# ? Feb 16, 2023 19:58 |
|
double nine posted:I'm struggling with a beginner SQL question. code:
Hammerite fucked around with this message at 20:56 on Feb 16, 2023 |
# ? Feb 16, 2023 20:52 |
|
you could potentially do the union on the outside instead, and maybe the query engine would be more likely to do it efficiently? Idkcode:
|
# ? Feb 16, 2023 21:01 |
|
double nine posted:I'm struggling with a beginner SQL question. This should work but it's ugly as hell code:
Just-In-Timeberlake fucked around with this message at 21:07 on Feb 16, 2023 |
# ? Feb 16, 2023 21:04 |
|
thanks for the suggestions, I'll try them tomorrow. I think it'll work
|
# ? Feb 16, 2023 21:30 |
|
double nine posted:I'm struggling with a beginner SQL question... I threw your whole post into chat gpt as-is and it came up with a pretty similar answer to hammerites' 2nd one (sql comments are mine): code:
code:
mortarr fucked around with this message at 22:29 on Feb 16, 2023 |
# ? Feb 16, 2023 22:21 |
|
That chatgpt answer is really cool because it looks plausible and probably works on a couple of test cases but is also completely wrong and doesn't meet the specification at all. Please do not use chatgpt to answer questions, it does not actually know the answers.
|
# ? Feb 16, 2023 23:02 |
|
double nine posted:I'm struggling with a beginner SQL question. If you're thinking "I want to filter a query using a condition based on rows not existing", what you want is an operation called an anti-join - it's basically the opposite of an inner join. There are a couple of different ways to do it; probably the most common one is what Hammerite suggested initially (left join with WHERE some_not_nullable_col IS NULL as a filter). My preferred way though is to use WHERE NOT EXISTS(<subquery>), for a couple of reasons. First off I think it's the most obvious way to express the intent of the operation - a left join can be used for a ton of things and in a complex query it can be easy to miss the WHERE some_col IS NULL bit since it can be pretty far away from the left join itself, but NOT EXISTS is always an anti-join. Second, at least in Postgres, although the query planner is pretty good at identifying the most common ways to write an anti-join, the NOT EXISTS form is the one least likely to lead to it choosing a bizarrely bad query plan (the NOT IN(<subquery>) form is particularly prone to bad query plans, so avoid that one). With all that being said, here's my take (postgres typecasting syntax for the datetimes): SQL code:
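A runnable sketch of that NOT EXISTS anti-join over the two unioned tables; sqlite via Python, with table names and sample rows of my own invention rather than the actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE purchases_recent (customer_id INT, purchase_date TEXT, product_id INT);
CREATE TABLE purchases_old    (customer_id INT, purchase_date TEXT, product_id INT);
INSERT INTO purchases_recent VALUES
    (1, '2022-03-01', 123456),  -- bought A, never B: keep
    (2, '2022-04-01', 123456),  -- bought A...
    (2, '2022-05-01', 987654),  -- ...but also B: drop
    (4, '2022-06-01', 123456);  -- bought A recently...
INSERT INTO purchases_old VALUES
    (4, '2021-05-01', 987654),  -- ...but B sits in the archive table: drop
    (3, '2021-02-01', 123456);  -- bought A before the cutoff: not selected
""")

# union both tables once in a CTE, then anti-join against the same CTE
rows = cur.execute("""
    WITH purchases AS (
        SELECT * FROM purchases_recent
        UNION ALL
        SELECT * FROM purchases_old
    )
    SELECT DISTINCT p.customer_id
    FROM purchases p
    WHERE p.product_id = 123456
      AND p.purchase_date >= '2022-01-01'
      AND NOT EXISTS (
          SELECT 1 FROM purchases b
          WHERE b.customer_id = p.customer_id
            AND b.product_id = 987654
      )
    ORDER BY p.customer_id
""").fetchall()
print(rows)  # [(1,)]
```

Customer 4 is the case that trips up single-table solutions: their product-B purchase only exists in the archive, which is why the anti-join has to run against the union and not just the recent table.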
Fiddle: https://www.db-fiddle.com/f/rTHPAVU6M2aD78BHpuf7UB/0 TheFluff fucked around with this message at 03:15 on Feb 17, 2023 |
# ? Feb 17, 2023 03:08 |
|
Why materialize all the rows, join all the things multiple times, and delay all the filtering logic? This seems like "all of these except for these customers". (And the chances of SA letting me paste this, which sqlfiddled the formatting...) code:
Won't be able to check hypothetical query plans for a few days.
|
# ? Feb 18, 2023 00:54 |
|
|
Jabor posted:That chatgpt answer is really cool because it looks plausible and probably works on a couple of test cases but is also completely wrong and doesn't meet the specification at all. I asked it to give me a null cipher once, and I poo poo you not, it basically told me "the first letter of each word from the sentence 'the dog ran away' spells a message. for example, 'the,' dog', and 'ran' spells "HELP" " I really wish I had the confidence chatgpt has
|
# ? Feb 18, 2023 01:35 |