Xae
Jan 19, 2005

Xae posted:

That is the plan.

The pg folks from MS are supposed to get back to us this week. Hopefully we get a technical person and not another drat salesman.

We found our problem: min_parallel_table_scan_size.

The Azure default is messed up: whoever set it assumed the value is in bytes, but the setting is measured in 8KB blocks.

Meaning it was set not to parallelize a query unless the plan estimated it would read 8,388,608 8KB blocks, roughly 70GB.
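
For anyone else on Azure hitting the same thing, checking and overriding it looks roughly like this (a sketch; the setting names are real, but on a managed service you may have to change server parameters through the portal instead of ALTER SYSTEM):
code:
-- current value; without an explicit unit this setting is interpreted in 8KB blocks
SHOW min_parallel_table_scan_size;

-- stock Postgres default is 8MB; attaching a unit avoids the bytes-vs-blocks confusion
ALTER SYSTEM SET min_parallel_table_scan_size = '8MB';
SELECT pg_reload_conf();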

Xae
Jan 19, 2005

ChickenWing posted:

Yeah I've been trying to reveal as much as I can without actually showing much in the way of schema details because proprietary code, etc.

Based on experimentation so far it looks like the answer is that when the query hits a certain size it just rolls over and dies. It has nothing to do with the columns I called out specifically, we've found other bits and bobs that, when removed, result in a sane query plan.

Check any subqueries or CTEs you have. Postgres is pretty iffy about pushing filter criteria down into them.
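
Rough illustration of what I mean, with made-up names; on Postgres 11 and earlier a CTE is an optimization fence, so the outer filter never reaches the index inside it:
code:
-- filter outside the CTE: the CTE gets materialized in full first (PG 11 and earlier)
WITH everything AS (
    SELECT * FROM big_table
)
SELECT * FROM everything WHERE created_at > now() - interval '1 day';

-- filter written directly against the table: can use an index on created_at
SELECT * FROM big_table WHERE created_at > now() - interval '1 day';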

Also, sorting 4 billion records in under 500KB of memory sounds real fishy.

Xae fucked around with this message at 03:33 on Jun 8, 2020

Xae
Jan 19, 2005

I've got two tables that are throwing problems at me.

They're the same table, but we have to maintain US and EU data centers with different data in them.

They're partitioned tables. In the US, queries on the table use the indexes correctly except for one partition. Which sucks, because a seq scan can add 5-10 seconds.

In the EU it will never use the index on any partition. Which means our update process goes from minutes to ten or so hours as it full-table-scans a billion rows a few times.

I verified that the indexes show as valid.
I've VACUUM ANALYZE'ed the "bad" partitions.
I've VACUUM ANALYZE'ed the whole tables.
I've REINDEXed and then VACUUM ANALYZE'ed the tables AND partitions.

What's next? Drop the indexes on the partition and try again?
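
In case it helps anyone spot something, this is roughly what I've been running against a single "bad" partition (names changed):
code:
-- reindex and re-analyze the partition
REINDEX TABLE events_us_p2020_05;
VACUUM (ANALYZE) events_us_p2020_05;

-- see whether the planner will use the index when the partition is hit directly
EXPLAIN
SELECT * FROM events_us_p2020_05 WHERE event_ts > now() - interval '1 day';

-- sanity-check the row/page estimates the planner is working from
SELECT relname, reltuples, relpages
  FROM pg_class
 WHERE relname = 'events_us_p2020_05';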

Xae
Jan 19, 2005

I've got a three-day weekend to solve this problem in Postgres.

code:
SELECT *
  FROM TableA
 WHERE TimeStampA > (SELECT max(TimestampB) FROM TableB)
This runs in 8 hours because TableA is partitioned and Postgres will not use the index on TimestampA on most of the partitions.

But if I do this:
code:
SELECT *
  FROM TableA
 WHERE TimeStampA > ${CONSTANT}
Uses the index and runs in seconds.

As a bonus, if I do this:
code:
SELECT *
  FROM TableA_PartitionThatNormallyWontUseAnIndex
 WHERE TimeStampA > (SELECT max(TimestampB) FROM TableB)
It uses the index and finishes instantly.

We've got an implementation team at MS who is supposed to be helping with this poo poo, but because of the holiday they're kinda MIA. We've only gotten an acknowledgement that "That... shouldn't be happening".

I've messed around and done more REINDEX, VACUUM and ANALYZE commands than I can count. Hell, I even detached the partitions, clustered on the index on TimeStampA and reattached. Still does a sequential scan.
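
For reference, the comparison I keep running boils down to the two plans below (timestamp literal made up):
code:
-- subquery version: most partitions end up as sequential scans
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM TableA WHERE TimeStampA > (SELECT max(TimestampB) FROM TableB);

-- constant version: index scans everywhere
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM TableA WHERE TimeStampA > '2020-06-01 00:00:00';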

Xae
Jan 19, 2005

Pardot posted:

Maybe if you pull the select max up into a cte?

No change.

But that did get me thinking about encapsulating the subquery in a function, which seems to work. There is one sticky partition that still won't use the index, but 5 minutes beats a few hours.
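
For anyone curious, the wrapper is along these lines (a sketch with made-up names; plpgsql so the planner can't just inline it back into the query, and STABLE since the max won't change mid-statement):
code:
CREATE OR REPLACE FUNCTION max_timestamp_b() RETURNS timestamp
LANGUAGE plpgsql STABLE AS $$
BEGIN
    RETURN (SELECT max(TimestampB) FROM TableB);
END;
$$;

SELECT *
  FROM TableA
 WHERE TimeStampA > max_timestamp_b();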

Thanks!

Xae
Jan 19, 2005

abelwingnut posted:

is there a good free db or table for all postal/zip/mail codes in the world?

i'm also not sure if this stuff changes? i'm american and really only know this system. it feels like it never changes? maybe the british do?

Addresses are the second biggest clusterfuck in computer science, right after time.


Even if a specific country doesn't change often, there are nearly 200 countries, so as a whole they change all the time.

Postal codes in many countries are unique or near-unique identifiers for individual addresses, so new ones appear and old ones change constantly.

Ruggan posted:

Not a Postgres expert (I do mostly MS SQL), but maybe do something like this?

code:
DECLARE @maxDt as datetime = (SELECT max(TimestampB) FROM TableB)

SELECT *
  FROM TableA
 WHERE TimeStampA > @maxDt


The query runs through an ETL tool that does not support that. It has to be a single self-contained query.
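
The closest I can get to that suggestion in one self-contained statement is something like the following; it may well plan the same as the subquery version, but it keeps everything in one query:
code:
SELECT a.*
  FROM TableA a
 CROSS JOIN (SELECT max(TimestampB) AS max_ts FROM TableB) m
 WHERE a.TimeStampA > m.max_ts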

Xae
Jan 19, 2005

CarForumPoster posted:

I have a novice DB choice question. I'd like to choose something that will work for the next 2-3 years.

Let me know if there's a better spot for this.

I'm working on a project that scrapes (or FOIA requests) data on civil lawsuits and stores the data in a Postgres DB (RDS) with one lawsuit per row. I access this data in several ways, searching/filtering and returning results via a paginated REST API by attorney, type of case, location, court, case number, UUID, date, etc., which are each their own column. There are about 60 columns that contain a fairly small amount of data and one "everything else" column which is a JSON string (mostly ~5KB, sometimes >100KB, rarely >1MB) for data that doesn't go in any other column, which I often return but don't ever search by.

A major data set I'm adding will make this ~7M rows long and add 5,000 rows/day (1.8M rows/year) on average. I also update the rows as things change in the cases, updating ~1,000 rows/day. I serve the content to a web app that gets very little traffic as it's behind a login.

Should I be doing something different at this point? I'm worried that it'll take several seconds to run even well-filtered queries and that something NoSQL like DynamoDB won't allow me to do the flexible searching that is key to this project.

Look at moving the big ol' json string to a different table and having a key to join them. You might see performance gains, you might not.

Can't there be multiple attorneys per case? How are you handling that in one row?

I would also look at creating partitions for the table based on the filing date, or some other date that never changes, then default to a time-limited search. This will split up your indexes by partition and avoid searching the entire history when most of the time people should know roughly when a case was filed.
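
A rough sketch of both ideas, with made-up names and types, just to show the shape:
code:
-- big JSON blob split off so the hot table stays narrow
CREATE TABLE case_blob (
    case_id bigint PRIMARY KEY,
    payload jsonb
);

-- searchable columns partitioned by a date that never changes
CREATE TABLE cases (
    case_id     bigint NOT NULL,
    filing_date date   NOT NULL,
    court       text,
    case_number text,
    -- ...the other searchable columns...
    PRIMARY KEY (case_id, filing_date)
) PARTITION BY RANGE (filing_date);

CREATE TABLE cases_2020 PARTITION OF cases
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');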

Xae
Jan 19, 2005

CarForumPoster posted:

Yes there can. Right now I have a bad system where I separate into a "lead attorney" and a list of other attorneys. This works for the cases I started this with, but with the subsequent data it's not a great system. I'm tempted to avoid having a bunch of tables that reference each other, though, if only because it makes the performance worse and the data analysis harder.

It's indeterminate how many attorneys and locations there are, so I'm not sure I could offer predefined values in the form of a reference table. If they aren't NULL, they're always encoded as strings which are human-readable and in a pretty consistent format.

I'll def check the fuzzy string matching link out.

EDIT: That fuzzy string matching link is great and def a problem I will have. I do my best to standardize names when possible but obviously that's... impossible for human names acquired through scraping.

It depends on how you access things, but lookup tables might improve your performance.

Names as strings are big. Ints are small. The associative table between Case and Attorney will scan much faster than the Case table.

TimestampTZ (64-bit FilingDate) + Int32 CaseID + Int32 AttorneyID + Int32 AttorneyType = 160 bits = 20 bytes per row.

Postgres is decent at generating plans, so it should do a scan on Attorney, then an index lookup on the Case/Attorney association table, then a partition/PK lookup on Case.

In that case the only table scan is on the Attorney table, which can't possibly be that big. It'll probably be in memory most of the time. From there it uses indexes to avoid scanning the big table.

It seems dumb and extra effort, but warehouses are the one place where byte sizes and all that jazz still matter, because a hundred bytes here or there times a few million or a billion rows adds up fast.

When you're running that fuzzy string matching it's better to run it on an Attorney table with 10,000 rows than the Case table with 10,000,000.
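
Sketch of what I mean, names made up; the association table is tiny per row, so scanning or indexing it is cheap:
code:
CREATE TABLE attorney (
    attorney_id serial PRIMARY KEY,
    name        text NOT NULL
);

CREATE TABLE case_attorney (
    filing_date   timestamptz NOT NULL,
    case_id       integer     NOT NULL,
    attorney_id   integer     NOT NULL REFERENCES attorney,
    attorney_type integer     NOT NULL,  -- e.g. 1 = lead, 2 = co-counsel
    PRIMARY KEY (case_id, attorney_id)
);

-- lets the "find all of attorney X's cases" lookup hit an index
CREATE INDEX ON case_attorney (attorney_id);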

Xae
Jan 19, 2005

CarForumPoster posted:

I'm worried that was the argument that led to this nearly unusable API layout: https://www.courtlistener.com/api/rest-info/ Take a look at that diagram and try to figure out how to find all of "Bob Loblaw's" cases in

I love courtlistener and talk with the owner every now and then so I should pick his brain on this, but man their API is an absolute BITCH to search flexibly.

An attorney table will likely end up on the order of 1M rows for a 10M row case table. This is because of challenges with attorney attribution between courts, states, etc.

For reference, the count of CL's "attorneys" endpoint is 6.4M for a 17M-count "dockets" endpoint.

Yeah, that is a bit over-designed for your purposes.

Build the data model out a few ways. Use a data generator to fill out the tables and then start querying it.

A lot of these questions can only be answered by experimenting.
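
Something like generate_series makes it cheap to fake the volume; a rough sketch against the hypothetical attorney/case_attorney tables from my last post:
code:
-- a million fake attorneys
INSERT INTO attorney (name)
SELECT 'Attorney ' || g
  FROM generate_series(1, 1000000) AS g;

-- ten million case/attorney links spread over five years
INSERT INTO case_attorney (filing_date, case_id, attorney_id, attorney_type)
SELECT now() - random() * interval '5 years',
       g,
       1 + floor(random() * 1000000)::int,
       1 + floor(random() * 3)::int
  FROM generate_series(1, 10000000) AS g;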

Xae
Jan 19, 2005

Mox Hombre posted:

yo what's up YOSPOS long time no see.

I have to get some data out of a few weirdly designed tables in postgres. We process tips for hairdressers and whatnot. We have a Ledger table where we track positive and negative additions to someone's balance. There's also a Payout table, and a payout is us sending money out, and a payout covers several tips, for several people. Once someone has been paid out, we add negative balance to their Ledger and mark it with a payoutId to know which payout resulted in that person's balance going down.

So far so good. I'm writing some analytics for that 'cause the db guy is not currently available. I have to get the average number of tips per payout for one person. Now you would imagine that there would be an M2M table or something that tells you which tips were paid out in which payout. But there isn't. Apparently, the way to get all the Ledger entries that were paid out in one payout (say payoutId = 99) is to find the Ledger entry with payoutId = 98 and then just go up by ids until we get to a Ledger entry with payoutId = 99.

so I have
code:
Ledger or whatever
id    reason    amount    payoutId
9     tip       $4
8     payout    -$32      9
7     tip       $4
6     tip       $4
5     tip       $4
4     payout    -$52      8
3     tip       $4
2     payout    $3        7
And I need to get a result that tells me that payout 9 had 3 tips, payout 7 had 1 tip and so on for each payout. Or even better, to annotate it with a column that says in which payout the tip was paid out; then I can do grouping and aggregates. Getting this would be pretty easy if the data were more normalized or if I were doing this in application code, but with sql this is just hosed
code:
-- #input = the payout you care about; its tips sit between the previous payout marker and its own marker
SELECT reason, SUM(amount), AVG(amount), COUNT(amount)
  FROM Ledger
 WHERE id > (SELECT id FROM Ledger WHERE payoutId = #input - 1)
   AND id < (SELECT id FROM Ledger WHERE payoutId = #input)
 GROUP BY reason
Alternatively, gently caress around with window functions or write some stored procedures.
https://www.postgresql.org/docs/11/tutorial-window.html
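
If it were me I'd try the window-function route; a rough, untested sketch against your Ledger example (it assumes payout ids increase with ledger ids):
code:
-- tag every row with the next payout marker at or after it, then aggregate the tips per payout
WITH tagged AS (
    SELECT id, reason, amount,
           MIN(CASE WHEN reason = 'payout' THEN payoutId END)
               OVER (ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS paid_in
      FROM Ledger
)
SELECT paid_in AS payout,
       COUNT(*)    AS tips,
       SUM(amount) AS tip_total
  FROM tagged
 WHERE reason = 'tip'
 GROUP BY paid_in
 ORDER BY paid_in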
