Lib and let die
Aug 26, 2004

Macichne Leainig posted:

I asked it to give me a null cipher once, and I poo poo you not, it basically told me "the first letter of each word from the sentence 'the dog ran away' spells a message. for example, 'the,' dog', and 'ran' spells "HELP" "

:wtf:

I really wish I had the confidence chatgpt has

"If you can't dazzle them with brilliance, baffle them with bullshit," is surprisingly human

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

PhantomOfTheCopier posted:

Why materialize all the rows, join all the things multiple times, and delay all the filtering logic? This seems like "all of these except for these customers".

(And the chances of SA letting me paste this, which sqlfiddled the formatting...)

code:
with naup as (
  (select customer_id from purchases where product_id=987654)
  union
  (select customer_id from purchases_old where product_id=987654)
),
yeap as (
  (select * from purchases where product_id=123456
   and purchased_at>'2022-01-01')
  union
  (select * from purchases_old where product_id=123456
   and purchased_at>'2022-01-01')
)
select * from yeap
where
  customer_id not in (select customer_id from naup where naup.customer_id=customer_id)
Won't be able to check hypothetical query plans for a few days.

At least in postgres you won't get much of interest out of a query plan unless you stuff a lot of data into the tables, because with toy examples that have like tens of rows per table it'll just seq scan everything and completely ignore indexes because there's no point in using them. CTEs can be inlined now too btw, they aren't always materialized anymore.
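(In case it's useful: since postgres 12 you can control that per CTE with a keyword. Quick sketch, reusing the table names from the quoted query above.)
SQL code:
-- postgres 12+: MATERIALIZED forces the old behaviour, NOT MATERIALIZED allows inlining
with recent as materialized (
  select * from purchases where purchased_at > '2022-01-01'
)
select * from recent where product_id = 123456;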

If I were code reviewing I'd insist on union all to avoid the distinct operation and also not exists() instead of not in(). Just don't write anti-joins with not in(), it's a loaded footgun just waiting to go off. It'll seem fine and have the same query plan as not exists() until one day it suddenly does something completely braindead instead, because the expression returned from the subquery happens to be nullable (see here or here for details; they're old articles but nothing has changed).
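Roughly what I mean, assuming the two purchase tables don't contain duplicate rows (untested, obviously):
SQL code:
with naup as (
  select customer_id from purchases where product_id=987654
  union all
  select customer_id from purchases_old where product_id=987654
),
yeap as (
  select * from purchases where product_id=123456
    and purchased_at>'2022-01-01'
  union all
  select * from purchases_old where product_id=123456
    and purchased_at>'2022-01-01'
)
select * from yeap
where not exists (
  select 1 from naup where naup.customer_id = yeap.customer_id
);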

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
PostgreSQL definitely had significant issues optimizing NOT IN way back in 7.4. I had a recipe database where ingredient exclusion was analyzing in the quintillions and failed to complete in 25min over something like 5k recipes. :allears: I probably have the full analysis and comparison documentation but don't remember which route I took.

When 8 appeared it ran in 35ms. :downs:

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

PhantomOfTheCopier posted:

Why materialize all the rows, join all the things multiple times, and delay all the filtering logic? This seems like "all of these except for these customers".

this is the definition of bikeshedding. You might as well ask, why union multiple times? Surely whether a particular approach "materializes all the rows" is RDBMS-dependent. The OP didn't say which RDBMS they are using unless I missed it.

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
hi i'm diving into the Redis docs for the first time ever and outside of understanding caching as a concept, i'm going in with not much experience, with the goal of caching some of my database queries, and while i'm sure i'm going to need more help in the future, i have one initial question about using Redis in general:

should i be caching data that is subject to change a lot? for example, if i am making a query to get "page 1" of a bunch of data but some stuff might get moved to "page 2" in a short time, is that even worth caching, even if i bust the cache when new data is created?

nielsm
Jun 1, 2009



Caching search results like that can make sense from a user experience point of view. Query the entire result set, save in the cache with a key for the specific user and query, so the results don't move when the user browses the pages.

mortarr
Apr 28, 2005

frozen meat at high speed
Does the NOT IN vs NOT EXISTS anti-join query-plan forcing apply to MS SQL as well as postgres, and is it always true that NOT IN is sub-optimal vs NOT EXISTS, or is it just true for larger sets?

I've always done NOT IN for a values list and also for tables or ctes with small numbers of values (like 100 or 1000 max maybe?) but swap over to NOT EXISTS for larger sets than that.

Jabor posted:

That chatgpt answer is really cool because it looks plausible and probably works on a couple of test cases but is also completely wrong and doesn't meet the specification at all.

Please do not use chatgpt to answer questions, it does not actually know the answers.

Sorry about that chatgpt answer - not sure why, looking at it now, but it really did look legit at the time. 20-some years writing sql doesn't count for much when you're in a rush to post, I guess

mortarr fucked around with this message at 21:49 on Feb 19, 2023

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself

nielsm posted:

Caching search results like that can make sense from a user experience point of view. Query the entire result set, save in the cache with a key for the specific user and query, so the results don't move when the user browses the pages.

but what if there's hundreds of pages? what's the best way to cache a large amount of paginated data?

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

mortarr posted:

Does the NOT IN vs NOT EXISTS anti-join query-plan forcing apply to MS SQL as well as postgres, and is it always true that NOT IN is sub-optimal vs NOT EXISTS, or is it just true for larger sets?

I've always done NOT IN for a values list and also for tables or ctes with small numbers of values (like 100 or 1000 max maybe?) but swap over to NOT EXISTS for larger sets than that.

It is true for MSSQL too. If you can guarantee that whatever you're putting into the NOT IN() expression isn't nullable, NOT IN() and NOT EXISTS() should be equivalent, but since you need to convince the query planner of this guarantee too it can get a bit sketchy and difficult to tell what it'll do if the expression is a subquery.

The reason for this is that IN() does simple equality comparison and null of course isn't equal to anything, not even itself, so if even a single null appears in a NOT IN(), the entire operation will evaluate to false (or technically null/unknown I guess). Because of this it's not semantically equivalent to NOT EXISTS() - it can give you wrong/unexpected results too - and cannot use the same query plan if the expression is nullable. For small sets it probably doesn't matter too much but the NOT IN plan is usually significantly worse.
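Toy demonstration of the null thing, if anyone wants to see it bite (made-up tables):
SQL code:
create table a (id int);
create table b (a_id int);           -- nullable, which is the whole problem
insert into a values (1), (2);
insert into b values (1), (null);

-- intent: rows in a with no match in b, i.e. id = 2
select * from a where id not in (select a_id from b);
-- returns nothing: 2 NOT IN (1, NULL) evaluates to unknown, so the row is filtered out

select * from a where not exists (select 1 from b where b.a_id = a.id);
-- returns id = 2, as intended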

TheFluff fucked around with this message at 03:08 on Feb 20, 2023

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
never use NOT IN

except as a convenience when you are manually querying the database and you have a short list of literal values to provide. short as in, you typed it yourself without copy-paste
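i.e. the only acceptable form is something like this (made-up example):
SQL code:
-- a short, hand-typed list of literals: no nulls, no subquery, no surprises
select * from orders where status not in ('cancelled', 'refunded');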

Dawncloack
Nov 26, 2007
ECKS DEE!
Nap Ghost
Dear friends

I am extracting information from a database. The information is in a varchar, but I am interested in the entries where a specific bit (not byte!) is one.

I am using the following command

SELECT
  ReportID,
  SUBSTRING(data, 21, 12)
FROM
  RecData2
WHERE
  (
    (
      CAST(SUBSTRING(data, 32, 1) AS INTEGER) % 2
    ) <> 0
  )


This command works exactly as I intend (AFAICT!), however it produces A LOT of warnings. Orders of magnitude more than the number of results.

Is that normal and expected when I am using complex instructions like this? Should I reformulate?

I imagine a common criticism would be "you should not try to select too finely, extract all the packages and examine them with python or whatever you are using", but that would mean extracting a million packages and discarding all but 6k. I think it would save me a lot of time and processing power to discriminate them directly at the SQL level.

Any thoughts or advice?

Thanks!!

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
Have you read the warning messages, or did you only look at how many there were?

Dawncloack
Nov 26, 2007
ECKS DEE!
Nap Ghost
I admit I am a massive newbie and I don't know how to display them. I'll start there. Sorry!

nielsm
Jun 1, 2009



Also be sure to look at the data piecewise.

It can help to limit the number of result rows from the query too, when testing. How to do it depends on the database, but for MySQL, PostgreSQL and SQLite you'd put "LIMIT 10" at the end of the query, for Microsoft SQL you'd write "SELECT TOP 10" at the start, and for Oracle there is a "FETCH FIRST 10 ROWS ONLY" clause you can add at the end.

For example, you can try to get the parts of your WHERE clause individually:
SQL code:
SELECT
  ReportID,
  SUBSTRING(data, 32, 1)  AS byte32,
  CAST(SUBSTRING(data, 32, 1) AS INTEGER) AS byte32int,
  CAST(SUBSTRING(data, 32, 1) AS INTEGER) % 2 AS byte32intmod2,
  SUBSTRING(data, 21, 12) AS WantedData
FROM
  RecData2
LIMIT 100
That way you can verify that each part of it has the desired result.

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
hi again i'm hoping someone can help me out with a really tricky mysql statement. i'm trying to get the latest Offense from an Offense table that is tied to a user, but the trick is offenses are tied to posts and private messages, not users, so i first need to join together Posts and PrivateMessages before I can filter down on the user level

i have 4 tables: Posts, PrivateMessagePosts, Users, and Offenses. Offenses are tied to either a `Post.id` or `PrivateMessagePost.id` but not a user. they look like this:



so far I have a SQLAlchemy query which will get a list of posts given a thread ID (that part is irrelevant), and return them with all of the offenses tied to them. it looks like this:

pay no mind to the current offense join! that is to get a CSV of all offenses tied to each post we're querying. what i want additionally is the latest offense that is active on the user

Python code:
 db.session.query(
      Post,
      Thread.title,
      User.username,
      User.byline,
      User.avatar_url,
      User.registered_on,
      User.id,
      Forum.id,
      Forum.title,
      func.group_concat(offense.id),
      func.group_concat(offense.is_ban),
      func.group_concat(func.coalesce(offense.end_time, "None")),
      func.group_concat(offense.post_id),
      active_offense_one.id,
  )
  .join(Thread, base_post.thread_id == Thread.id)
  .join(Forum, Forum.id == Thread.forum_id)
  .join(some_user, some_user.id == base_post.author_id)
  .join(
      offense,
      and_(
          offense.post_id == base_post.id,
          offense.approved_user_id > 0,
          or_(offense.end_time > current_time, offense.is_ban == True),
      ),
      isouter=True,
  )  # this join ties all the offenses to the post
  .group_by(
      base_post.id,
      active_offense_one.id
  )
which translates to this SQL:

SQL code:
SELECT 
  posts_1.id AS posts_1_id, 
  posts_1.body AS posts_1_body, 
  posts_1.posted_at AS posts_1_posted_at, 
  posts_1.updated_at AS posts_1_updated_at, 
  posts_1.ip_address AS posts_1_ip_address, 
  posts_1.thread_id AS posts_1_thread_id, 
  posts_1.author_id AS posts_1_author_id, 
  threads.title AS threads_title, 
  users_1.username AS users_1_username, 
  users_1.byline AS users_1_byline, 
  users_1.avatar_url AS users_1_avatar_url, 
  users_1.registered_on AS users_1_registered_on, 
  forums.id AS forums_id, 
  forums.title AS forums_title, 
  """
  using group_concat to get a CSV of all the offenses tied to this specific post
  """
  group_concat(offenses_1.id) AS group_concat_1, 
  group_concat(offenses_1.is_ban) AS group_concat_2, 
  group_concat(coalesce(offenses_1.end_time, %(coalesce_1)s)) AS group_concat_3, 
  group_concat(offenses_1.post_id) AS group_concat_4, 
  offenses_2.id AS offenses_2_id, users_1.id AS users_1_id
FROM offenses AS offenses_2, posts AS posts_1 
INNER JOIN threads ON posts_1.thread_id = threads.id 
INNER JOIN forums ON forums.id = threads.forum_id 
INNER JOIN users AS users_1 ON users_1.id = posts_1.author_id 
LEFT OUTER JOIN offenses AS offenses_1 
  ON offenses_1.post_id = posts_1.id 
  AND offenses_1.approved_user_id > %(approved_user_id_1)s 
  AND (offenses_1.end_time > %(end_time_1)s OR offenses_1.is_ban = true) 
GROUP BY posts_1.id, offenses_2.id
That's all great and working fine, BUT I can't figure out a good way to get the latest active offense the user has. i've been bashing my head against the wall for 2 days on this

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
also i think i've posted about using ORMs before here and was told they suck, and now i'm finally understanding it. they generate some rear end queries, which is why i'm trying to manually write this query without letting the ORM get the relational data.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
If you have this pattern where you have a one-to-many relationship (one user has 0 or more offenses) and you want to get only the first related row by some criteria in a query that already has a bunch of other stuff going on, one fairly efficient way to do it is with something called a lateral join, and gently caress it, it's Wednesday and I've had a couple of drinks, let's write some overly complicated SQL. peep this:

SQL code:
with offenses_by_user as (
  select offenses.*, posts.author_id as offender_id
  from
    offenses
    inner join posts on posts.id = offenses.post_id
  union all
  select offenses.*, private_message_posts.author_id as offender_id
  from
    offenses
    inner join private_message_posts
      on private_message_posts.id = offenses.private_message_post_id
)
select
    posts.*,
    users.*,
    last_offenses.*
from
    posts
    inner join users on posts.author_id = users.id
    left join lateral (
      select offenses_by_user.*
      from offenses_by_user
      where offenses_by_user.offender_id = users.id
      order by offenses_by_user.end_time desc
      limit 1
    ) as last_offenses on true
order by posts.id;
irrelevant stuff omitted for brevity. fiddle: https://www.db-fiddle.com/f/8MBoro6e5ydXqErVEJQk4F/0

performance is... probably not great? this is more "fun to write" than actually good. on postgres I think the planner would be capable of working with this but I have no idea about mysql. also if you're on mysql 5 you're hosed, you can do absolutely nothing interesting.

TheFluff fucked around with this message at 03:03 on Feb 23, 2023

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
man i literally got there right before you posted that. thanks!!!! i have been ready to smash my keyboard looking for this solution

Dawncloack
Nov 26, 2007
ECKS DEE!
Nap Ghost

nielsm posted:

Also be sure to look at the data piecewise.

It can help to limit the number of result rows from the query too, when testing. How to do it depends on the database, but for MySQL, PostgreSQL and SQLite you'd put "LIMIT 10" at the end of the query, for Microsoft SQL you'd write "SELECT TOP 10" at the start, and for Oracle there is a "FETCH FIRST 10 ROWS ONLY" clause you can add at the end.

For example, you can try to get the parts of your WHERE clause individually:
SQL code:
SELECT
  ReportID,
  SUBSTRING(data, 32, 1)  AS byte32,
  CAST(SUBSTRING(data, 32, 1) AS INTEGER) AS byte32int,
  CAST(SUBSTRING(data, 32, 1) AS INTEGER) % 2 AS byte32intmod2,
  SUBSTRING(data, 21, 12) AS WantedData
FROM
  RecData2
LIMIT 100
That way you can verify that each part of it has the desired result.

Hey Nielsm and Jabor!

I am still a stupid noob but at least now I know how to see the warnings, and that that is a thing.

However, now I have a working query and I thought I'd share it in case it helps.

code:
SELECT ReportID, SUBSTRING(data, 21, 12) FROM RecData2 WHERE ( ( ( CONV( SUBSTRING(data, 32, 1), 16, 10) ) % 2 ) <> 0 ) AND ( LEFT(data, 2) = '56' )
Two things: Changing "cast" to "conv" did the trick. Apparently it's fine to feed it a string, as long as it can be understood as hex, which is the case. Cast was complaining about it, despite giving me the right result.

Second, I got warnings coming from "LEFT" for a similar reason. Data is a string (a varchar) and it had no trouble parsing decimal integers, but gave out warnings when some hex numbers appeared. The quotes fixed that.
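For anyone else reading along, the CONV trick is just base conversion; a quick sanity check with made-up values:
SQL code:
-- CONV(value, from_base, to_base): parses the string as hex, returns decimal
SELECT CONV('1F', 16, 10);      -- '31'
SELECT CONV('1F', 16, 10) % 2;  -- 31 is odd, so 1: the low bit is set
SELECT CONV('2A', 16, 10) % 2;  -- 42 is even, so 0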

Thank you again!

Gwaihir
Dec 8, 2009
Hair Elf
Does anyone have particular favorite solutions for data anonymization/scrubbing in postgres? I'm looking at methods to port our production data volume over to some of our test environments to help with load testing/dev type stuff, but we're currently on RDS and the official https://postgresql-anonymizer.readthedocs.io/en/stable/ isn't available as a supported extension there. (And no, there's no way we're setting up standalone EC2 instances just to run this, for better or worse).

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Gwaihir posted:

Does anyone have particular favorite solutions for data anonymization/scrubbing in postgres? I'm looking at methods to port our production data volume over to some of our test environments to help with load testing/dev type stuff, but we're currently on RDS and the official https://postgresql-anonymizer.readthedocs.io/en/stable/ isn't available as a supported extension there. (And no, there's no way we're setting up standalone EC2 instances just to run this, for better or worse).
Is the sensitive data stored in specific tables/columns, or is it interspersed with unstructured data in text/JSON columns?

e: If Anonymizer looks like a fit but you can't/don't want to use it in standalone mode on your RDS instance, one option for running it might be to set up a dummy PostgreSQL server in ECS+Fargate and access your data on RDS as foreign data

Vulture Culture fucked around with this message at 16:26 on Mar 14, 2023

double nine
Aug 8, 2013

TheFluff posted:

*a solution*

I've only been able to play with sql today, but I just want you to know that it works perfectly, thank you for solving the logic puzzle.

Gwaihir
Dec 8, 2009
Hair Elf

Vulture Culture posted:

Is the sensitive data stored in specific tables/columns, or is it interspersed with unstructured data in text/JSON columns?

e: If Anonymizer looks like a fit but you can't/don't want to use it in standalone mode on your RDS instance, one option for running it might be to set up a dummy PostgreSQL server in ECS+Fargate and access your data on RDS as foreign data

Yeah, luckily the data is all segregated out into its own columns (classic stuff you'd want to anonymize: names, address, ssn, phone, etc) and not mixed into JSON columns or blobs or anything else. We currently have a process to generate scrubbed copies of our production database for dev use, but it's done using home-grown scripts for the moment that devs can kick off themselves in jenkins. It... functions, but it's slow and I think there's some room for improvement. A standalone FDW connected EC2 server isn't out of the realm of possibility, but I think the advantages would have to be pretty significant for us to want to take on manually janitoring a postgres install there while the rest of our stuff is on RDS.
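For what it's worth, the home-grown variety usually boils down to a pile of statements like this run against the restored copy (column names are made up, and the replacement values are just placeholders for whatever masking you actually need):
SQL code:
-- run against the scrubbed copy, never production
update users
set
  full_name      = concat('user_', id),
  email          = concat('user_', id, '@example.invalid'),
  ssn            = null,
  phone_number   = null,
  street_address = 'redacted';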

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself
i previously posted ITT about wanting to implement Redis for some paginated SQL queries and one suggestion was to request all the data and just store it as one object and just bust the cache when an item is updated or added (items cannot be removed at this time)

But i was thinking of another possible solution and i'd like to be sanity checked here. What I was thinking is:

1. create a factory function that will take page number and return a SELECT statement for a page of items
2. when calling my SELECT statements in whatever GET request, run the factory function with the requested page and store the returned query as the Redis key (if the Redis key/value doesn't already exist. if it does, just read from memory)
3. when adding/updating an item, run another SELECT query to get what page this item is going to be on and then execute the factory function again with the calculated page number to get what the SELECT statement is going to be and then nullify that Redis key/value

does this sound like it could work? am i overengineering this? i'm just really not sure how to properly use Redis with paginated data that can always be added to or updated

teen phone cutie fucked around with this message at 23:11 on Mar 14, 2023

nielsm
Jun 1, 2009



I think you're thinking of a suggestion I made, but what I was thinking of was actually to use the cache to keep the result set constant while the user was browsing it. That is, nothing would invalidate or update the cached result set, other than a timeout or other indication that the user was done using that result set. In that model, you don't attempt to update the cached result set at all, but the user also won't see new items unless they explicitly invoke a new search or request a reload of the result set. In return, the user gets fully predictable paging with no results jumping around or getting skipped when switching page.

teen phone cutie
Jun 18, 2012

last year i rewrote something awful from scratch because i hate myself

nielsm posted:

I think you're thinking of a suggestion I made, but what I was thinking of was actually to use the cache to keep the result set constant while the user was browsing it. That is, nothing would invalidate or update the cached result set, other than a timeout or other indication that the user was done using that result set. In that model, you don't attempt to update the cached result set at all, but the user also won't see new items unless they explicitly invoke a new search or request a reload of the result set. In return, the user gets fully predictable paging with no results jumping around or getting skipped when switching page.

gotcha yeah i figured i was bastardizing the explanation a bit. though this solution still doesn't seem optimal for my use-case. the paginated items are comments in a thread, so the page results are going to be consistent other than a comment being edited or a new comment being added, which i would like to be instant with adds/updates busting the cache.

Seventh Arrow
Jan 26, 2005

I'm getting murdered in data engineering interviews because I've let whatever SQL skills I had atrophy over time. I've noticed that when I look at a problem my brain tends to freeze at first and then I tend to overthink everything. I start thinking things like "maybe I'll need to filter, so a subquery might be required..." or "maybe I'll need a CTE instead" and then it kind of snowballs from there. I got whammied by a "top x of y" problem (i.e. "top earners in the department") because I wasn't sure what kind of tool to approach it with. Should I just be grinding StrataScratch, or is there a more strategic way to approach this?

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
how much transactional data can postgres hold before it blows up?

could you use the WAL as a "copy on write database overlay" where you have some core dataset where you want to try out speculative changes on a per-cursor basis and have them reflected for that thread in its reads/updates, but only update seldomly, as a component of an "eventually-consistent" thing with some concurrent properties?

(I'm obviously ignoring the write-contention problem here, just curious how many rows/blocksize per cursor (or WAL rows/size in total?) you could have open before it would explode horribly just on the basis of log replay/etc.)

Paul MaudDib fucked around with this message at 01:34 on Apr 5, 2023

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003

Seventh Arrow posted:

I'm getting murdered in data engineering interviews because I've let whatever SQL skills I had atrophy over time. I've noticed that when I look at a problem my brain tends to freeze at first and then I tend to overthink everything. I start thinking things like "maybe I'll need to filter, so a subquery might be required..." or "maybe I'll need a CTE instead" and then it kind of snowballs from there. I got whammied by a "top x of y" problem (i.e. "top earners in the department") because I wasn't sure what kind of tool to approach it with. Should I just be grinding StrataScratch, or is there a more strategic way to approach this?

select * from (... rank() over (partition by department order by earnings desc) ...) where rank <= x

common interview question because some databases have special syntax for this and the interviewer wants to show off that he knows it. definitely a realistic problem that does show up in real work, tho
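spelled out a bit more, with made-up table and column names:
SQL code:
-- top 3 earners per department; the window function is computed in the derived table
select *
from (
  select
    e.*,
    rank() over (partition by e.department order by e.earnings desc) as earnings_rank
  from employees e
) ranked
where earnings_rank <= 3
order by department, earnings_rank;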

Paul MaudDib posted:

how much transactional data can postgres hold before it blows up?

could you use the WAL as a "copy on write database overlay" where you have some core dataset where you want to try out speculative changes on a per-cursor basis and have them reflected for that thread in its reads/updates, but only update seldomly, as a component of an "eventually-consistent" thing with some concurrent properties?

(I'm obviously ignoring the write-contention problem here, just curious how many rows/blocksize per cursor (or WAL rows/size in total?) you could have open before it would explode horribly just on the basis of log replay/etc.)

i don't consider myself qualified to answer this question, but most of the wal settings are runtime configurable, i bet you could tune postgres to perform quite well in this scenario. but the default settings are not optimized for long-running transactions that sit around without committing or rolling back, i know that much

lazerwolf
Dec 22, 2009

Orange and Black
I have a best practices question. I rarely work with SQL directly for setting up table schema. I mainly use a framework like Django to define my models and handle schema migrations. How is this typically handled without the use of a framework?

Cold on a Cob
Feb 6, 2006

i've seen so much, i'm going blind
and i'm brain dead virtually

College Slice

lazerwolf posted:

I have a best practices question. I rarely work with SQL directly for setting up table schema. I mainly use a framework like Django to define my models and handle schema migrations. How is this typically handled without the use of a framework?

I can't speak for the Python side, but with .net when we don't want to use Entity Framework migrations we often use SQL Server Data Tools (SSDT) database projects. You define the schema using scripts and/or GUIs (and pre/post deploy scripts manually) and the tool generates migrations from that. This is greatly simplifying all the things it can do for you, but it's largely hands-on. It's also SQL Server only so probably not what you want if you're working with Python.

RedGate has a similar tool that is not SQL Server specific but I have never used it so I can't comment on it but I know it supports a LOT of databases.

Seventh Arrow
Jan 26, 2005

DELETE CASCADE posted:

select * from (... rank() over (partition by department order by earnings desc) ...) where rank <= x

common interview question because some databases have special syntax for this and the interviewer wants to show off that he knows it. definitely a realistic problem that does show up in real work, tho


Thanks - really, I think I need to just keep practicing and doing challenges that will let me start to recognize patterns and their solutions. This article helped somewhat.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

lazerwolf posted:

I have a best practices question. I rarely work with SQL directly for setting up table schema. I mainly use a framework like Django to define my models and handle schema migrations. How is this typically handled without the use of a framework?

You can use a migration tool (like Flyway; I guess it's now called Redgate) and raw SQL. You write your ALTER TABLE statements and whatnot and put them in a migration file and the migration tool keeps track of which migrations have been applied.

On a side note it's also relatively common to do things in the opposite way compared to Django. That is, in Django you start from your model classes and let the tooling generate SQL schema changes for you, but some other frameworks and tools start from the schema and generate model structs or classes from that.
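A migration in that style is just a plain SQL file with a versioned name that the tool records in its history table; hypothetical example in Flyway's naming convention:
SQL code:
-- V7__add_avatar_url_to_users.sql
ALTER TABLE users ADD COLUMN avatar_url varchar(255);
CREATE INDEX idx_users_avatar_url ON users (avatar_url);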

CKinbote
Dec 16, 2022
I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears.

This application ingests and stores data from experiments run on various samples. Each sample has a unique ID. Each experiment has its own table in the database, with fields for the sample ID and various things measured in the experiment. The end user wants all the data in one big table, so a batch job joins all the experiment tables into one big display table.

Currently the batch job takes an hour to run. There are about 20 experiment tables, with a total of about 500 total measurement fields. There have been about 60,000 samples. There are a total of about 2 million measured values, so about 14 out of every 15 values in the final table are empty.

I have a hard time believing this is state of the art, but I don't know much about databases and am curious whether there is a deep reason for this design that I don't know about. I feel like storing the data in a "skinny" format with fields for just sample ID, field ID, and value (and then have other tables with details for each field, like name and originating experiment) could be the way to go, but I wonder if that would fail to scale for some reason I don't know about.
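Something like this is what I'm imagining, with made-up names:
SQL code:
-- one row per measured value, instead of one very wide and mostly-empty row per sample
create table measurements (
  sample_id bigint not null,
  field_id  int    not null,   -- points at a fields table with the name and source experiment
  value     double precision,
  primary key (sample_id, field_id)
);

create table fields (
  field_id      int primary key,
  name          varchar(100),
  experiment_id int
);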

I don't know much about how this database is implemented (other than "in Java") or how the AWS instance that runs all this is provisioned, unfortunately.

Just-In-Timeberlake
Aug 18, 2003

CKinbote posted:

I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears.

This application ingests and stores data from experiments run on various samples. Each sample has a unique ID. Each experiment has its own table in the database, with fields for the sample ID and various things measured in the experiment. The end user wants all the data in one big table, so a batch job joins all the experiment tables into one big display table.

Currently the batch job takes an hour to run. There are about 20 experiment tables, with a total of about 500 total measurement fields. There have been about 60,000 samples. There are a total of about 2 million measured values, so about 14 out of every 15 values in the final table are empty.

I have a hard time believing this is state of the art, but I don't know much about databases and am curious whether there is a deep reason for this design that I don't know about. I feel like storing the data in a "skinny" format with fields for just sample ID, field ID, and value (and then have other tables with details for each field, like name and originating experiment) could be the way to go, but I wonder if that would fail to scale for some reason I don't know about.

I don't know much about how this database is implemented (other than "in Java") or how the AWS instance that runs all this is provisioned, unfortunately.

It doesn't sound particularly normalized, and without knowing further details and going just by your description I can't think of any reason it should take that long to just essentially collate the results of X tables into a final view.

The most probable deep reason for this design is that the one person who knew what was going on no longer works there.

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

CKinbote posted:

I'm struggling at work with a (third-party) application built around a database; I only ever interact with databases through an ORM, so I'd love a more informed opinion as to whether this thing is as dumb as it appears.

This application ingests and stores data from experiments run on various samples. Each sample has a unique ID. Each experiment has its own table in the database, with fields for the sample ID and various things measured in the experiment. The end user wants all the data in one big table, so a batch job joins all the experiment tables into one big display table.

Currently the batch job takes an hour to run. There are about 20 experiment tables, with a total of about 500 total measurement fields. There have been about 60,000 samples. There are a total of about 2 million measured values, so about 14 out of every 15 values in the final table are empty.

I have a hard time believing this is state of the art, but I don't know much about databases and am curious whether there is a deep reason for this design that I don't know about. I feel like storing the data in a "skinny" format with fields for just sample ID, field ID, and value (and then have other tables with details for each field, like name and originating experiment) could be the way to go, but I wonder if that would fail to scale for some reason I don't know about.

I don't know much about how this database is implemented (other than "in Java") or how the AWS instance that runs all this is provisioned, unfortunately.

The "one big table" / "end user wants this format" constraint probably means its the kind of dumb that you don't want to gently caress with. That sounds like tooling is based around the output of this report.

CKinbote
Dec 16, 2022

Bruegels Fuckbooks posted:

The "one big table" / "end user wants this format" constraint probably means its the kind of dumb that you don't want to gently caress with. That sounds like tooling is based around the output of this report.

(Un)fortunately, I maintain the downstream tooling, so changes to this report are my problem. In the end, I started messing around with it yesterday and cost myself an afternoon fixing what I broke. Oh well, can't say I wasn't warned.

Ytlaya
Nov 13, 2005

Recently we (or rather the guy who does most of the DB/sysadmin type stuff) changed our DB to InnoDB, and a bunch of stuff became really slow. While this is fortunately (for me) outside of my responsibilities (at least beyond "defining the problem"), I'm pretty amazed at how hosed up some of these queries are. One query that used to take just a few seconds (definitely no more than 10 seconds) now takes a whopping 2 hours. I was expecting it to be slow, but not that slow!

I'm guessing there must be missing indices or something, but I definitely wasn't expecting such a dramatic difference in speed. Hopefully the guy in question addresses this soon, because I've been getting a lot of e-mails about how the searches are suddenly slow or timing out. Fortunately our website is just a scientific website that services a relatively small community of scientists, so it's not as huge of a deal as it might be if it were used for an actual business or something.

nielsm
Jun 1, 2009



Changed to InnoDB, from what? I don't think anyone has used MyISAM for at least 15 years.
But yes, check your indexes and see if any of the queries need to be jiggled around to make the optimizer generate a better plan. If you have the permissions on the server, learn how to issue and read EXPLAIN statements.
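In MySQL/MariaDB that's literally just prefixing the query (table name is a placeholder):
SQL code:
-- shows which indexes are considered, the join order, and estimated row counts
EXPLAIN SELECT * FROM results WHERE experiment_id = 42;

-- newer MySQL (8.0.18+) can also execute the query and report actual timings per step
EXPLAIN ANALYZE SELECT * FROM results WHERE experiment_id = 42;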

Ytlaya
Nov 13, 2005

nielsm posted:

Changed to InnoDB, from what? I don't think anyone has used MyISAM for at least 15 years.
But yes, check your indexes and see if any of the queries need to be jiggled around to make the optimizer generate a better plan. If you have the permissions on the server, learn how to issue and read EXPLAIN statements.

It was MyISAM, yes (though I know very little about this stuff and had to double-check my e-mail history to verify this). The database has been around since...I think the late 90s. Possibly as early as 1994. It's a big source of a lot of problems we've had, though we're gradually moving towards getting some stuff out of it that never should have been in it to begin with (a lot of our metadata is being moved to an RDF DB for example, I think Virtuoso?).

I generally don't directly mess with the DB myself, with my interaction being limited to writing stuff like search queries that don't actually change/modify anything. But checking the more problematic queries with EXPLAIN sounds like a good idea (I wasn't even aware of this before), and I do have permissions.

I found this link that seems to cover a lot of what might be causing the issue - https://dba.stackexchange.com/questions/75091/why-are-simple-selects-on-innodb-100x-slower-than-on-myisam It sounds like the indexes might need to be recreated differently.
