|
Macichne Leainig posted:I asked it to give me a null cipher once, and I poo poo you not, it basically told me "the first letter of each word from the sentence 'the dog ran away' spells a message. for example, 'the', 'dog', and 'ran' spells 'HELP'" "If you can't dazzle them with brilliance, baffle them with bullshit" is surprisingly human
|
# ? Feb 18, 2023 01:37 |
|
|
PhantomOfTheCopier posted:Why materialize all the rows, join all the things multiple times, and delay all the filtering logic? This seems like "all of these except for these customers". At least in postgres you won't get much interesting out of a query plan unless you stuff a lot of data into the tables, because with toy examples that have like tens of rows per table it'll just seq scan everything and completely ignore indexes because there's no point in using them. CTEs can be inlined now too btw, they aren't always materialized anymore. If I was code reviewing I'd insist on union all to avoid the distinct operation and also not exists() instead of not in(). Just don't write anti-joins with not in(), it's a loaded footgun just waiting to go off. It'll seem fine and have the same query plan as not exists() until one day it suddenly does something completely braindead instead, because the expression returned from the subquery happens to be nullable (see here or here for details; they're old articles but nothing has changed).
|
# ? Feb 18, 2023 04:02 |
|
PostgreSQL definitely had significant issues optimizing NOT IN way back in 7.4. I had a recipe database where ingredient exclusion was analyzing in the quintillions and failed to complete in 25min over something like 5k recipes. I probably have the full analysis and comparison documentation but don't remember which route I took. When 8 appeared it ran in 35ms.
|
# ? Feb 18, 2023 13:45 |
|
PhantomOfTheCopier posted:Why materialize all the rows, join all the things multiple times, and delay all the filtering logic? This seems like "all of these except for these customers". this is the definition of bikeshedding. You might as well ask, why union multiple times? Surely whether a particular approach "materializes all the rows" is RDBMS-dependent. The OP didn't say which RDBMS they are using unless I missed it.
|
# ? Feb 18, 2023 15:40 |
|
hi i'm diving into the Redis docs for the first time ever and outside of understanding caching as a concept, i'm going in with not much experience with a goal to cache some of my database queries, and while i'm sure i'm going to need more help in the future, i have one initial question about using Redis in general: should i be caching data that is subject to change a lot? for example, if i am making a query to get "page 1" of a bunch of data but some stuff might get moved to "page 2" in a short time, is that even worth caching, even if i bust the cache when new data is created?
|
# ? Feb 19, 2023 10:26 |
Caching search results like that can make sense from a user experience point of view. Query the entire result set, save in the cache with a key for the specific user and query, so the results don't move when the user browses the pages.
|
|
# ? Feb 19, 2023 11:12 |
|
Does the NOT IN vs NOT EXISTS anti-join query-plan forcing apply to MS SQL as well as postgres, and is it always true that NOT IN is sub-optimal vs NOT EXISTS, or is it just true for larger sets? I've always done NOT IN for a values list and also for tables or ctes with small numbers of values (like 100 or 1000 max maybe?) but swap over to NOT EXISTS for larger sets than that. Jabor posted:That chatgpt answer is really cool because it looks plausible and probably works on a couple of test cases but is also completely wrong and doesn't meet the specification at all. Sorry about that chatgpt answer - not sure why, when I look at it now, but it really did look legit at the time. 20-some years writing sql doesn't count for much when you're in a rush to post, I guess mortarr fucked around with this message at 21:49 on Feb 19, 2023 |
# ? Feb 19, 2023 21:40 |
|
nielsm posted:Caching search results like that can make sense from a user experience point of view. Query the entire result set, save in the cache with a key for the specific user and query, so the results don't move when the user browses the pages. but what if there's hundreds of pages? what's the best way to cache a large amount of paginated data?
|
# ? Feb 19, 2023 23:25 |
|
mortarr posted:Does the NOT IN vs NOT EXISTS anti-join query-plan forcing apply to MS SQL as well as postgres, and is it always true that NOT IN is sub-optimal vs NOT EXISTS, or is it just true for larger sets? It is true for MSSQL too. If you can guarantee that whatever you're putting into the NOT IN() expression isn't nullable, NOT IN() and NOT EXISTS() should be equivalent, but since you need to convince the query planner of this guarantee too it can get a bit sketchy and difficult to tell what it'll do if the expression is a subquery. The reason for this is that IN() does simple equality comparison and null of course isn't equal to anything, not even itself, so if even a single null appears in a NOT IN(), the entire operation will evaluate to false (or technically null/unknown I guess). Because of this it's not semantically equivalent to NOT EXISTS() - it can give you wrong/unexpected results too - and cannot use the same query plan if the expression is nullable. For small sets it probably doesn't matter too much but the NOT IN plan is usually significantly worse. TheFluff fucked around with this message at 03:08 on Feb 20, 2023 |
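To make the footgun concrete, here's a runnable sketch (Python's sqlite3, with made-up table names) showing NOT IN silently returning nothing once a single NULL shows up in the subquery, while NOT EXISTS keeps working:

```python
import sqlite3

# Hypothetical tables, just to illustrate the shape of the problem.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE blocked (customer_id INTEGER);  -- nullable on purpose
    INSERT INTO customers (id) VALUES (1), (2), (3);
    INSERT INTO blocked (customer_id) VALUES (2), (NULL);
""")

# NOT IN: the NULL makes every comparison UNKNOWN, so the
# anti-join silently returns no rows at all.
not_in = conn.execute(
    "SELECT id FROM customers "
    "WHERE id NOT IN (SELECT customer_id FROM blocked)"
).fetchall()

# NOT EXISTS: NULLs in the subquery are harmless.
not_exists = conn.execute(
    "SELECT id FROM customers c "
    "WHERE NOT EXISTS (SELECT 1 FROM blocked b WHERE b.customer_id = c.id)"
).fetchall()

print(not_in)      # []
print(not_exists)  # [(1,), (3,)]
```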
# ? Feb 20, 2023 02:58 |
|
never use NOT IN except as a convenience when you are manually querying the database and you have a short list of literal values to provide. short as in, you typed it yourself without copy-paste
|
# ? Feb 22, 2023 05:33 |
|
Dear friends I am extracting information from a database. The information is in a varchar, but I am interested in the entries where a specific bit (not byte!) is one. I am using the following command SELECT ReportID, SUBSTRING(data, 21, 12) FROM RecData2 WHERE ( ( CAST(SUBSTRING(data, 32, 1) AS INTEGER) % 2 ) <> 0 ) This command works exactly as I intend (AFAICT!), however it produces A LOT of warnings. Orders of magnitude above the number of results. Is that normal and expected when I am using complex instructions like this? Should I reformulate? I imagine a common criticism would be "you should not try to select too finely, extract all the packages and examine them with python or whatever you are using", but that would mean extracting a million packages and discarding all but 6k. I think it would save me a lot of time and processing power to discriminate them directly at the SQL level. Any thoughts or advice? Thanks!!
|
# ? Feb 22, 2023 12:27 |
|
Have you read the warning messages, or did you only look at how many there were?
|
# ? Feb 22, 2023 12:33 |
|
I admit I am a massive newbie and I don't know how to display them. I'll start there. Sorry!
|
# ? Feb 22, 2023 12:41 |
Also be sure to look at the data piecewise. It can help to limit the number of result rows from the query too, when testing. How to do it depends on the database, but for MySQL, PostgreSQL and SQLite you'd put "LIMIT 10" at the end of the query, for Microsoft SQL you'd write "SELECT TOP 10" at the start, and for Oracle there is a "FETCH FIRST 10 ROWS ONLY" clause you can add at the end. For example, you can try to get the parts of your WHERE clause individually: SQL code:
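A runnable sketch of that idea (Python's sqlite3 here, hence LIMIT syntax; the table is a made-up stand-in):

```python
import sqlite3

# Made-up table, just to have something to sample from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recdata (report_id INTEGER, data TEXT)")
conn.executemany("INSERT INTO recdata VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(100)])

# Test one piece of the query at a time, on a handful of rows,
# instead of running the full thing against the whole table.
sample = conn.execute(
    "SELECT report_id, SUBSTR(data, 1, 3) FROM recdata LIMIT 10"
).fetchall()
print(sample[0])  # (0, 'row')
```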
|
|
# ? Feb 22, 2023 13:09 |
|
hi again i'm hoping someone can help me out with a really tricky mysql statement. i'm trying to get the latest Offense from an Offense table that is tied to a user, but the trick is offenses are tied to posts and private messages, not users, so i first need to join together Posts and PrivateMessages before I can filter down on the user level i have 4 tables: Posts, PrivateMessagePosts, Users, and Offenses. Offenses are tied to either a `Post.id` or `PrivateMessagePost.id` but not a user. they look like this: so far I have a SQLAlchemy query which will get a list of posts given a thread ID (that part is irrelevant), and return them with all of the offenses tied to them. it looks like this: pay no mind to the current offense join! that is to get a CSV of all offenses tied to each post we're querying. what i want additionally is the latest offense that is active on the user Python code:
SQL code:
|
# ? Feb 22, 2023 23:23 |
|
also i think i've posted about using ORMs before here and was told they suck, and now i'm finally understanding it. they generate some rear end queries, which is why i'm trying to manually write this query without letting the ORM get the relational data.
|
# ? Feb 22, 2023 23:25 |
|
If you have this pattern where you have a one-to-many relationship (one user has 0 or more offenses) and you want to get only the first related row by some criteria in a query that already has a bunch of other stuff going on, one way to do it in a fairly efficient way is with something called a lateral join, and gently caress it, it's Wednesday and I've had a couple of drinks, let's write some overly complicated SQL. peep this:SQL code:
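A sketch of that shape in MySQL 8 syntax — the table and column names are simplified stand-ins, and a real version would also need to union in PrivateMessagePosts:

```sql
-- latest offense per user, via a lateral derived table
SELECT u.id AS user_id, latest.*
FROM Users u
LEFT JOIN LATERAL (
    SELECT o.*
    FROM Offenses o
    JOIN Posts p ON p.id = o.post_id
    WHERE p.user_id = u.id
    ORDER BY o.created_at DESC
    LIMIT 1
) AS latest ON TRUE;
```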
performance is... probably not great? this is more "fun to write" than actually good. on postgres I think the planner would be capable of working with this but I have no idea about mysql. also if you're on mysql 5 you're hosed, you can do absolutely nothing interesting. TheFluff fucked around with this message at 03:03 on Feb 23, 2023 |
# ? Feb 23, 2023 02:54 |
|
man i literally got there right before you posted that. thanks!!!! i have been ready to smash my keyboard looking for this solution
|
# ? Feb 23, 2023 03:56 |
|
nielsm posted:Also be sure to look at the data piecewise. Hey Nielsm and Jabor! I am still a stupid noob but at least now I know how to display the warnings, and that that is a thing. However, now I have a working query and I thought I'd share it in case it helps. code:
Second, I got warnings coming from "LEFT" for a similar reason. Data is a string (a varchar) and it had no trouble parsing decimal integers, but gave out warnings when some hex numbers appeared. The commas fixed that. Thank you again!
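For reference, the bit-test idea itself can be sketched like this (SQLite flavour via Python; the table name and offsets are stand-ins, not the real schema):

```python
import sqlite3

# Toy stand-in for the real table; names and offsets are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recdata (report_id INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO recdata VALUES (?, ?)",
    [(1, "x" * 31 + "3"),   # char 32 is '3' -> odd, low bit set
     (2, "x" * 31 + "4")],  # char 32 is '4' -> even, low bit clear
)

# Keep only rows where the low bit of the digit at position 32 is set.
rows = conn.execute(
    "SELECT report_id FROM recdata "
    "WHERE CAST(SUBSTR(data, 32, 1) AS INTEGER) % 2 <> 0"
).fetchall()
print(rows)  # [(1,)]
```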
|
# ? Feb 23, 2023 10:19 |
|
Does anyone have particular favorite solutions for data anonymization/scrubbing in postgres? I'm looking at methods to port our production data volume over to some of our test environments to help with load testing/dev type stuff, but we're currently on RDS and the official https://postgresql-anonymizer.readthedocs.io/en/stable/ isn't available as a supported extension there. (And no, there's no way we're setting up standalone EC2 instances just to run this, for better or worse).
|
# ? Mar 10, 2023 20:14 |
|
Gwaihir posted:Does anyone have particular favorite solutions for data anonymization/scrubbing in postgres? I'm looking at methods to port our production data volume over to some of our test environments to help with load testing/dev type stuff, but we're currently on RDS and the official https://postgresql-anonymizer.readthedocs.io/en/stable/ isn't available as a supported extension there. (And no, there's no way we're setting up standalone EC2 instances just to run this, for better or worse). e: If Anonymizer looks like a fit but you can't/don't want to use it in standalone mode on your RDS instance, one option for running it might be to set up a dummy PostgreSQL server in ECS+Fargate and access your data on RDS as foreign data Vulture Culture fucked around with this message at 16:26 on Mar 14, 2023 |
# ? Mar 14, 2023 16:20 |
|
TheFluff posted:*a solution* I've only been able to play with sql today, but I just want you to know that it works perfectly, thank you for solving the logic puzzle.
|
# ? Mar 14, 2023 18:42 |
|
Vulture Culture posted:Is the sensitive data stored in specific tables/columns, or is it interspersed with unstructured data in text/JSON columns? Yeah, luckily the data is all segregated out into its own columns (classical stuff you'd want to anonymize, names, address, ssn, phone, etc) and not mixed into JSON columns or blobs or anything else. We currently have a process to generate scrubbed copies of our production database for dev use, but it's done using home grown scripts for the moment that devs can kick off themselves in jenkins. It... functions, but it's slow and I think there's some room for improvement. A standalone FDW connected EC2 server isn't out of the realm of possibility, but I think the advantages would have to be pretty significant for us to want to take on manually janitoring a postgres install there while the rest of our stuff is on RDS.
|
# ? Mar 14, 2023 19:24 |
|
i previously posted ITT about wanting to implement Redis for some paginated SQL queries and one suggestion was to request all the data and just store it as one object and just bust the cache when an item is updated or added (items cannot be removed at this time) But i was thinking of another possible solution and i'd like to be sanity checked here. What I was thinking is:
1. create a factory function that will take a page number and return a SELECT statement for a page of items
2. when calling my SELECT statements in whatever GET request, run the factory function with the requested page and store the returned query as the Redis key (if the Redis key/value doesn't already exist. if it does, just read from memory)
3. when adding/updating an item, run another SELECT query to get what page this item is going to be on and then execute the factory function again with the calculated page number to get what the SELECT statement is going to be and then nullify that Redis key/value
does this sound like it could work? am i overengineering this? i'm just really not sure how to properly use Redis with paginated data that can always be added to or updated teen phone cutie fucked around with this message at 23:11 on Mar 14, 2023 |
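The three steps being described can be sketched like this — a plain dict stands in for Redis so the sketch runs anywhere, but with redis-py the get/set/delete calls map one-to-one:

```python
import json

# A dict stands in for a Redis client here; with redis-py you'd call
# r.get(key), r.set(key, value), r.delete(key) in the same spots.
cache = {}

PAGE_SIZE = 20

def page_query(page: int) -> str:
    """Step 1: factory that builds the SELECT for a given page."""
    offset = (page - 1) * PAGE_SIZE
    return f"SELECT * FROM items ORDER BY id LIMIT {PAGE_SIZE} OFFSET {offset}"

def get_page(page: int, run_query):
    """Step 2: the SQL text itself is the cache key."""
    key = page_query(page)
    if key in cache:
        return json.loads(cache[key])
    rows = run_query(key)
    cache[key] = json.dumps(rows)
    return rows

def invalidate_page(page: int) -> None:
    """Step 3: after an add/update lands on a page, drop that page's entry."""
    cache.pop(page_query(page), None)
```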
# ? Mar 14, 2023 20:29 |
I think you're thinking of a suggestion I made, but what I was thinking of was actually to use the cache to keep the result set constant while the user was browsing it. That is, nothing would invalidate or update the cached result set, other than a timeout or other indication that the user was done using that result set. In that model, you don't attempt to update the cached result set at all, but the user also won't see new items unless they explicitly invoke a new search or request a reload of the result set. In return, the user gets fully predictable paging with no results jumping around or getting skipped when switching page.
|
|
# ? Mar 14, 2023 21:38 |
|
nielsm posted:I think you're thinking of a suggestion I made, but what I was thinking of was actually to use the cache to keep the result set constant while the user was browsing it. That is, nothing would invalidate or update the cached result set, other than a timeout or other indication that the user was done using that result set. In that model, you don't attempt to update the cached result set at all, but the user also won't see new items unless they explicitly invoke a new search or request a reload of the result set. In return, the user gets fully predictable paging with no results jumping around or getting skipped when switching page. gotcha yeah i figured i was bastardizing the explanation a bit. though this solution still doesn't seem optimal for my use-case. the paginated items are comments in a thread, so the page results are going to be consistent other than a comment being edited or a new comment being added, which i would like to be instant with adds/updates busting the cache.
|
# ? Mar 14, 2023 21:53 |
|
I'm getting murdered in data engineering interviews because I've let whatever SQL skills I had atrophy over time. I've noticed that when I look at a problem my brain tends to freeze at first and then I tend to overthink everything. I start thinking things like "maybe I'll need to filter, so a subquery might be required..." or "maybe I'll need a CTE instead" and then it kind of snowballs from there. I got whammied by a "top x of y" problem (i.e. "top earners in the department") because I wasn't sure what kind of tool to approach it with. Should I just be grinding StrataScratch, or is there a more strategic way to approach this?
|
# ? Mar 29, 2023 20:50 |
|
how much transactional data can postgres hold before it blows up? could you use the WAL as a "copy on write database overlay" where you have some core dataset where you want to try out speculative changes on a per-cursor basis and have them reflected for that thread in its reads/updates, but only update seldomly, as a component of an "eventually-consistent" thing with some concurrent properties? (I'm obviously ignoring the write-contention problem here, just curious how many rows/blocksize per cursor (or WAL rows/size in total?) you could have open before it would explode horribly just on the basis of log replay/etc.) Paul MaudDib fucked around with this message at 01:34 on Apr 5, 2023 |
# ? Apr 5, 2023 01:27 |
|
Seventh Arrow posted:I'm getting murdered in data engineering interviews because I've let whatever SQL skills I had atrophy over time. I've noticed that when I look at a problem my brain tends to freeze at first and then I tend to overthink everything. I start thinking things like "maybe I'll need to filter, so a subquery might be required..." or "maybe I'll need a CTE instead" and then it kind of snowballs from there. I got whammied by a "top x of y" problem (i.e. "top earners in the department") because I wasn't sure what kind of tool to approach it with. Should I just be grinding StrataScratch, or is there a more strategic way to approach this? select * from (... rank() over (partition by department order by earnings desc) ...) where rank <= x common interview question because some databases have special syntax for this and the interviewer wants to show off that he knows it. definitely a realistic problem that does show up in real work, tho Paul MaudDib posted:how much transactional data can postgres hold before it blows up? i don't consider myself qualified to answer this question, but most of the wal settings are runtime configurable, i bet you could tune postgres to perform quite well in this scenario. but the default settings are not optimized for long-running transactions that sit around without committing or rolling back, i know that much
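Fleshed out as a runnable sketch (SQLite via Python, toy data) — RANK() numbers the earners within each department and the outer query keeps the top two:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, earnings INTEGER);
    INSERT INTO employees VALUES
        ('a', 'eng', 300), ('b', 'eng', 200), ('c', 'eng', 100),
        ('d', 'ops', 500), ('e', 'ops', 400);
""")

# Top 2 earners per department via a ranking window function.
top2 = conn.execute("""
    SELECT name, department, earnings FROM (
        SELECT name, department, earnings,
               RANK() OVER (PARTITION BY department
                            ORDER BY earnings DESC) AS rnk
        FROM employees
    ) WHERE rnk <= 2
    ORDER BY department, rnk
""").fetchall()
print(top2)
```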
|
# ? Apr 8, 2023 04:49 |
|
I have a best practices question. I rarely work with SQL directly for setting up table schema. I mainly use a framework like Django to define my models and handle schema migrations. How is this typically handled without the use of a framework?
|
# ? Apr 8, 2023 13:57 |
|
lazerwolf posted:I have a best practices question. I rarely work with SQL directly for setting up table schema. I mainly use a framework like Django to define my models and handle schema migrations. How is this typically handled without the use of a framework? I can't speak for Python, but with .net when we don't want to use Entity Framework migrations we often use SQL Server Data Tools (SSDT) data projects. You define schema using scripts and/or GUIs (and pre/post deploy scripts manually) and the tool generates migrations from that. This is greatly simplifying all the things it can do for you, but the migration generation is largely hands-off. It's also SQL Server only so probably not what you want if you're working with Python. RedGate has a similar tool that is not SQL Server specific but I have never used it so I can't comment on it, but I know it supports a LOT of databases.
|
# ? Apr 8, 2023 14:07 |
|
DELETE CASCADE posted:select * from (... rank() over (partition by department order by earnings desc) ...) where rank >= x Thanks - really, I think I need to just keep practicing and doing challenges that will let me start to recognize patterns and their solutions. This article helped somewhat.
|
# ? Apr 8, 2023 14:28 |
|
lazerwolf posted:I have a best practices question. I rarely work with SQL directly for setting up table schema. I mainly use a framework like Django to define my models and handle schema migrations. How is this typically handled without the use of a framework? You can use a migration tool (like Flyway; I guess it's now called Redgate) and raw SQL. You write your ALTER TABLE statements and whatnot and put them in a migration file and the migration tool keeps track of which migrations have been applied. On a side note it's also relatively common to do things in the opposite way compared to Django. That is, in Django you start from your model classes and let the tooling generate SQL schema changes for you, but some other frameworks and tools start from the schema and generate model structs or classes from that.
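A minimal sketch of what such a migration tool does under the hood — raw SQL migrations applied in order, with a bookkeeping table so each one runs exactly once (names made up for illustration):

```python
import sqlite3

# Each migration is a version number plus a raw SQL statement.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn):
    # Bookkeeping table tracks which migrations have been applied.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_version")}
    for version, sql in MIGRATIONS:
        if version not in applied:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied migrations are skipped
cols = [r[1] for r in conn.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name', 'email']
```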
|
# ? Apr 8, 2023 15:27 |
|
I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears. This application ingests and stores data from experiments run on various samples. Each sample has a unique ID. Each experiment has its own table in the database, with fields for the sample ID and various things measured in the experiment. The end user wants all the data in one big table, so a batch job joins all the experiment tables into one big display table. Currently the batch job takes an hour to run. There are about 20 experiment tables, with about 500 measurement fields in total. There have been about 60,000 samples. There are a total of about 2 million measured values, so about 14 out of every 15 values in the final table are empty. I have a hard time believing this is state of the art, but I don't know much about databases and am curious whether there is a deep reason for this design that I don't know about. I feel like storing the data in a "skinny" format with fields for just sample ID, field ID, and value (and then have other tables with details for each field, like name and originating experiment) could be the way to go, but I wonder if that would fail to scale for some reason I don't know about. I don't know much about how this database is implemented (other than "in Java") or how the AWS instance that runs all this is provisioned, unfortunately.
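The skinny layout described above, sketched out (SQLite via Python; table and field names are hypothetical), including how you'd pivot it back into the wide display table:

```python
import sqlite3

# One row per measured value instead of one 500-column row per sample.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fields (id INTEGER PRIMARY KEY, name TEXT, experiment TEXT);
    CREATE TABLE measurements (
        sample_id INTEGER, field_id INTEGER, value REAL,
        PRIMARY KEY (sample_id, field_id)
    );
    INSERT INTO fields VALUES (1, 'ph', 'titration'), (2, 'mass', 'weighing');
    INSERT INTO measurements VALUES (42, 1, 7.2), (42, 2, 0.5);
""")

# Pivot back to "one wide row per sample" for the display table.
wide = conn.execute("""
    SELECT sample_id,
           MAX(CASE WHEN field_id = 1 THEN value END) AS ph,
           MAX(CASE WHEN field_id = 2 THEN value END) AS mass
    FROM measurements GROUP BY sample_id
""").fetchall()
print(wide)  # [(42, 7.2, 0.5)]
```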
|
# ? Apr 10, 2023 16:00 |
|
CKinbote posted:I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears. It doesn't sound particularly normalized, and without knowing further details and going just by your description I can't think of any reason it should take that long to just essentially collate the results of X tables into a final view. The most probable deep reason for this design is that the one person who knew what was going on no longer works there.
|
# ? Apr 10, 2023 16:56 |
|
CKinbote posted:I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears. The "one big table" / "end user wants this format" constraint probably means it's the kind of dumb that you don't want to gently caress with. That sounds like tooling is based around the output of this report.
|
# ? Apr 10, 2023 19:54 |
|
Bruegels Fuckbooks posted:The "one big table" / "end user wants this format" constraint probably means its the kind of dumb that you don't want to gently caress with. That sounds like tooling is based around the output of this report. (Un)fortunately, I maintain the downstream tooling, so changes to this report are my problem. In the end, I started messing around with it yesterday and cost myself an afternoon fixing what I broke. Oh well, can't say I wasn't warned.
|
# ? Apr 11, 2023 16:39 |
|
Recently we (or rather the guy who does most of the DB/sysadmin type stuff) changed our DB over to InnoDB, and a bunch of stuff became really slow. While this is fortunately (for me) outside of my responsibilities (at least beyond "defining the problem"), I'm pretty amazed at how hosed up some of these queries are. One query that used to take just a few seconds (definitely no more than 10 seconds) now takes a whopping 2 hours. I was expecting it to be slow, but not that slow! I'm guessing there must be missing indices or something, but I definitely wasn't expecting such a dramatic difference in speed. Hopefully the guy in question addresses this soon, because I've been getting a lot of e-mails about how the searches are suddenly slow or timing out. Fortunately our website is just a scientific website that services a relatively small community of scientists, so it's not as huge of a deal as it might be if it were used for an actual business or something.
|
# ? Apr 18, 2023 23:03 |
Changed to InnoDB, from what? I don't think anyone has used MyISAM for at least 15 years. But yes, check your indexes and see if any of the queries need to be jiggled around to make the optimizer generate a better plan. If you have the permissions on the server, learn how to issue and read EXPLAIN statements.
|
|
# ? Apr 19, 2023 06:48 |
|
|
nielsm posted:Changed to InnoDB, from what? I don't think anyone have used MyISAM for at least 15 years. It was MyISAM, yes (though I know very little about this stuff and had to double-check my e-mail history to verify this). The database has been around since...I think the late 90s. Possibly as early as 1994. It's a big source of a lot of problems we've had, though we're gradually moving towards getting some stuff out of it that never should have been in it to begin with (a lot of our metadata is being moved to an RDF DB for example, I think Virtuoso?). I generally don't directly mess with the DB myself, with my interaction being limited to writing stuff like search queries that don't actually change/modify anything. But checking the more problematic queries with EXPLAIN sounds like a good idea (I wasn't even aware of this before), and I do have permissions. I found this link that seems to cover a lot of what might be causing the issue - https://dba.stackexchange.com/questions/75091/why-are-simple-selects-on-innodb-100x-slower-than-on-myisam It sounds like the indexes might need to be recreated differently.
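For illustration, here's what the plan output looks like in SQLite (MySQL's EXPLAIN formats differently, but you're asking the same question: index search or full table scan?):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_t_name ON t (name)")

# EXPLAIN QUERY PLAN shows whether the query can use the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE name = 'x'"
).fetchall()
detail = plan[0][-1]
print(detail)  # e.g. "SEARCH t USING ... INDEX idx_t_name (name=?)"
```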
|
# ? Apr 19, 2023 23:51 |