Twerk from Home
Jan 17, 2009

How big is too big for Postgres indexes? I'm trying to drop into a struggling application that I know little about and improve its performance.

I've got a 15GB, ~12.5 million row table with no relations. The hottest index is >2.5GB, and individual queries using that index are both slow (>1.5s) and driving high CPU usage. The problem query is definitely doing only an index scan, with very low planning time but high execution time. The index is a 4-column multi-column index; we're querying on 3 of the columns in the right order, so the index is being used. The columns being queried on are text fields with highly duplicated values.

2 of the 3 columns we're querying on have only 6 distinct values each, so my immediate thinking is that we can reduce the index size by normalizing those out into their own table, so that instead of duplicated text making the index huge we just have numeric foreign keys. Any suggestions for different kinds of indexes that would perform better with such duplicated values? It's just a normal B-Tree right now.

Edit: Actually, I would bet that putting the least-duplicated column first in the multi-column index helps... let me try that.
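Concretely, the thing I'm about to try looks something like this (table and column names made up, since I can't share the real schema):

SQL code:
-- Sketch: lead the composite index with the most selective (least duplicated)
-- of the three queried columns, then compare plans against the current index.
CREATE INDEX CONCURRENTLY idx_events_selective_first
    ON events (request_id, status, region);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE request_id = 'abc123' AND status = 'done' AND region = 'us-east';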

Twerk from Home fucked around with this message at 03:49 on Sep 15, 2020


Twerk from Home
Jan 17, 2009


Pardot posted:

Large indexes are fine as long as (and I’m assuming this is oltp) you have enough ram to keep cache hit rates at 99%


Look into gin or gist indexes. They are inverted so will be much smaller and more efficient for duplicated values. See this talk for details on how it works under the hood

https://youtu.be/2l-nCkPQVuQ


If you can get away with it, try the “C” collation. The downside is non-ascii characters won’t sort to the expected place, but it might not matter (in almost all my apps it almost never does). Even when I have non-ascii text, I’m not relying on how Postgres sorts it, since it gets sorted again in my app or it just doesn’t matter for whatever reason. If this is the case here, and the collation is C, you can take advantage of the “abbreviated keys” optimization that went in a few releases ago.

I really appreciate this, thank you. I'm going to start by hitting the docs about GIN indexes.
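If it helps anyone else later, my current reading is that plain scalar columns go into a GIN index via the btree_gin extension, so the experiment probably looks something like this (made-up table/column names, not my real schema):

SQL code:
-- Sketch with made-up names. btree_gin provides GIN operator classes for
-- ordinary scalar types like text, so the duplicated columns can share one
-- compact inverted index.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX idx_events_gin ON events USING gin (status, region, category);

-- And separately, Pardot's "C" collation idea on the text column we filter/sort by:
CREATE INDEX idx_events_ref_c ON events (request_id COLLATE "C");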

Twerk from Home
Jan 17, 2009


Moonwolf posted:

On top of what the others have said, MySQL can only do nested loop joins, which are simple, and the fastest way to do small results sets, but will always suck no matter what for 10k+ results sets, and be terrible for 100k+. Trimming unused joins will certainly help though.

Postgres and MSSQL can perform much better when you're asking them to do herculean things, right? My understanding is that the less database-friendly your workload is, the further MySQL falls behind.

Twerk from Home
Jan 17, 2009


Moonwolf posted:

At my work we have 2 OLTP databases which are clocking in at ~70TB and 55TB, and it still drat works, although large table operations are getting hard so the business has finally really decided to invest in archiving/purging out the drat things.

Are those Oracle, SQL Server, or did you do that on something else? That's pretty huge, and I'm curious whether you have a ton of tables and multiple applications sharing them, or whether your core data model is really that big.

Either way it's impressive, do your biggest indexes still fit in memory?

Twerk from Home
Jan 17, 2009


D34THROW posted:

Isn't NoSQL non relational? How the gently caress is a push to a non relational system supposed to work when an overwhelming majority of enterprise data is relational?

As linked above, all of the major cloud vendors offer a much more complicated solution where you can effectively represent relational data in NoSQL databases that are basically just key/value stores with CQRS, as seen in Azure's docs here.

I've seen teams move applications to this. Engineers love it because it's a ton of new things to learn and concepts to reason about, and the cloud vendors love it because it moves data storage into platform-specific solutions like DynamoDB, Lambdas, and all sorts of entirely proprietary things, which means the application has vendor-specific code all the way to the core. Also, once you've done it, you can scale nearly infinitely at the cost of a higher cloud bill.

Done right, it really is more resilient, at the cost of doing something like 20x the total compute work and complexity, but all of that additional compute work is horizontally scalable, unlike a single master database node.

Edit: The biggest practical advantage I've seen from NoSQL is that you can let every team have isolated database resources, preventing one team from pooping in the pool and impacting other teams. I spent 5 years working in a huge, many-team, microservices world, and a given data model would sprawl across teams, but they didn't want any teams sharing a database, so it's not like SQL databases could be used effectively either.

Twerk from Home
Jan 17, 2009


lazerwolf posted:

Can you expound on this? Why are ORMs bad?

This post is more than 15 years old now, but still pretty relevant: https://blog.codinghorror.com/object-relational-mapping-is-the-vietnam-of-computer-science/.

ORMs are an extremely leaky abstraction. To efficiently use an ORM, you have to also understand the basics of how the ORM translates its own API to queries, how its generated tables will actually work on the database, and what effective queries will look like for your data.

That said, I start with an ORM most of the time on most applications, and will defend that decision. Every sane ORM has the ability to let you hand-write your own queries where you need to, and ORMs result in less boilerplate code than a query builder for all of the basic operations.

ORMs aren't bad; they're a trade-off that lets you move faster by accepting more complexity. Usually that's a trade-off I personally am willing to make. They also let more people write data access code without a confident SQL understanding, which is fine as long as other people who do understand the ORM and SQL are reviewing the code.

Twerk from Home
Jan 17, 2009

I feel like there must be a better way to handle filters in a HAVING clause. I can't use window functions there, and it stinks.

I've got a common pattern dealing with aggregates where I want to group everything that is not a significant % of the total into an "other" category. To get concrete, I threw something together where I have colors and years, and I want the count of every color that makes up more than 20% of the total, with everything at or below 20% grouped into an "other" category:

SQL code:
WITH colors AS (
SELECT 1 as id, 'blue' as color, 1993 as year
UNION SELECT 2, 'blue', 1994
UNION SELECT 3, 'yellow', 1994 
UNION SELECT 4, 'yellow', 1994
UNION SELECT 5, 'yellow', 1994 
UNION SELECT 6, 'yellow', 1994
UNION SELECT 7, 'blue', 1994
UNION SELECT 8, 'blue', 1994
UNION SELECT 9, 'blue', 1994
UNION SELECT 10, 'red', 1994
UNION SELECT 11, 'red', 1993
UNION SELECT 12, 'red', 1993
UNION SELECT 13, 'red', 1993
UNION SELECT 14, 'red', 1993
UNION SELECT 15, 'red', 1993
UNION SELECT 16, 'green', 1993
UNION SELECT 17, 'orange', 1993
),  topper AS (
  SELECT 
    color 
  FROM 
    colors 
  GROUP BY 
    color 
  HAVING 
    CAST(
      COUNT(*) AS FLOAT
    ) / (
      SELECT 
        COUNT(*) 
      FROM 
        colors
    ) > 0.2
)
SELECT 
  CASE WHEN color IN (
    SELECT 
      color 
    FROM 
      topper
  ) THEN color ELSE 'other' END as case_color, 
  count(*) 
FROM 
  colors 
group by 
  case_color
order by 
  count(*) DESC;
code:
red|6
blue|5
yellow|4
other|2
I feel like I'm bludgeoning my way through this and there have got to be several better ways to do it. The biggest problem I see is that once I want to filter it down to just a single year, I have to duplicate the WHERE into both the main query and the topper CTE, because I can't use a window function inside a HAVING. For example, if I add "WHERE YEAR = '1994'", then I need it in both places, otherwise I get "red" in the query result where I should have "other".

I realize there's a completely different way I could get to the "other" amount, by subtracting the summed counts of all the colors that passed the >20% bar from the total of all matching rows, but I can't figure out how to use that.

Twerk from Home
Jan 17, 2009


Ninja.Bob posted:

Can you use a subquery?

Yes, and your solution is a much better approach to the whole space, thanks!
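For the archives, the shape I ended up with was roughly this (a sketch against the toy colors data above, not necessarily Ninja.Bob's exact query). The share of the total comes from a window function inside a subquery, so a year filter only has to appear in one place:

SQL code:
SELECT
  CASE WHEN cnt > 0.2 * total THEN color ELSE 'other' END AS case_color,
  SUM(cnt) AS n
FROM (
  SELECT
    color,
    COUNT(*) AS cnt,
    SUM(COUNT(*)) OVER () AS total
  FROM colors
  -- WHERE year = 1994   -- the filter only needs to live here
  GROUP BY color
) per_color
GROUP BY case_color
ORDER BY n DESC;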

Twerk from Home
Jan 17, 2009


Hughmoris posted:

Rookie question:

I have a StackOverflow demo DB in SQL Server that I would like to replicate in PostgreSQL to practice on. The DB has 9 tables. What is the easiest way to get the data out of SQL Server and into PostgreSQL? Figure out a way to export the tables as CSVs and copy into Postgres?

First I'd check if https://pgloader.io/ supports your version of MS SQL, and use that if it does.

Also, if anybody here is an experienced Lisper, please update pgloader to work with MySQL's new auth; it's the most in-demand issue on GitHub.

Twerk from Home
Jan 17, 2009

Stupid analytics-101 type question, and a basic SQL Fiddle with examples: http://sqlfiddle.com/#!17/41a4b/14

I've got people, notes, and codes; a minimal model looks like this:

SQL code:
CREATE TABLE notes(note_id TEXT, person_id TEXT, note_text TEXT);
CREATE TABLE codes(code_id TEXT, person_id TEXT, code TEXT);
CREATE TABLE person(person_id TEXT);
What's the sanest way to get a count of notes and codes per person, knowing that some people have zero of one or both, and I do want those people to appear?

So I want something like:

pre:
person_id  notes  codes
1          4      5
2          8      0
3          0      16
Would you rather see a COUNT DISTINCT like

SQL code:
SELECT
    person_id,
    COUNT(DISTINCT note_id) as note_count,
    COUNT(DISTINCT codes) as code_count
FROM
    person
    LEFT JOIN notes USING(person_id)
    LEFT JOIN codes USING(person_id)
GROUP BY
    person_id;
or one with two CTEs like
SQL code:
    SELECT
        person_id,
        COUNT(*) as note_count
    FROM
        person
        JOIN notes USING(person_id)
    GROUP BY
        person_id
or is there something I'm completely missing that's simpler than any of this? Also, in reality I'm dealing with Spark, and codes doesn't even have a primary key, so I'd have to do DISTINCT on something like a datetime field. Blech. I think the one with multiple CTEs might make for a better execution plan on most platforms, but I'm not sure.
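Spelled out, the two-CTE version I have in mind looks roughly like this (sketch against the toy schema above; the real tables are far bigger):

SQL code:
WITH note_counts AS (
    SELECT person_id, COUNT(*) AS note_count
    FROM person
        JOIN notes USING(person_id)
    GROUP BY person_id
), code_counts AS (
    SELECT person_id, COUNT(*) AS code_count
    FROM person
        JOIN codes USING(person_id)
    GROUP BY person_id
)
SELECT
    p.person_id,
    COALESCE(n.note_count, 0) AS notes,
    COALESCE(c.code_count, 0) AS codes
FROM
    person p
    LEFT JOIN note_counts n USING(person_id)
    LEFT JOIN code_counts c USING(person_id);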

Twerk from Home
Jan 17, 2009

I feel like there's something really fundamental I'm missing. Playing around in SQL Fiddle, the execution plan does look better with the two CTEs: one to count each type of record, then a JOIN on person_id at the end.

I hacked through this by making two separate queries and then smashing the results together in Python, because no matter which way I tried to get this all out in one query, I was going OOM and dying with 360GB of RAM across the Spark cluster, and I'd have to bother IT to get a bigger cluster. Notes is >1 billion records, there are a couple million people, and I didn't bother to count codes but it's somewhere between millions and a billion.

Twerk from Home
Jan 17, 2009


PhantomOfTheCopier posted:

Why does each cte also need the person table?

... It won't let me post...
Select id count note from notes group by notes
Should work just fine in a CTE.


Ten million people is 10M x 2 integers for the id+count tables and another 10M for the results. It should only be 60M x 64bit or 480MB. Even at 4x for stuff, storage is less than 2GB? Likely more if everything is actually a text field.

Start with WHERE personid less than 500 in your ctes to see if the results are even sensible or if it's doing something dumb.

I don't need the person table in each CTE, and that solves the memory usage in Spark! Good call. It's super fast and simple now with the two CTEs.
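i.e. the only change from the version I posted above is that the CTEs read straight off notes and codes (sketch; the real thing is Spark SQL):

SQL code:
WITH note_counts AS (
    SELECT person_id, COUNT(*) AS note_count
    FROM notes
    GROUP BY person_id
), code_counts AS (
    SELECT person_id, COUNT(*) AS code_count
    FROM codes
    GROUP BY person_id
)
SELECT
    p.person_id,
    COALESCE(n.note_count, 0) AS notes,
    COALESCE(c.code_count, 0) AS codes
FROM
    person p
    LEFT JOIN note_counts n ON n.person_id = p.person_id
    LEFT JOIN code_counts c ON c.person_id = p.person_id;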

As for resource usage: Spark does really unpredictable things compared to any relational database that I've used, because it's not a relational database; it's just pretending to be one over semi-structured data. I hate Spark, but it seems to be the dominant tool and what gets provided as output from whatever data processing pipeline companies have now.

Twerk from Home
Jan 17, 2009

Not really a SQL question, but I figure y'all have the lay of the land for analytics tools in here. Cross posting:

Twerk from Home posted:

I want a single-node, simple text search for exploring datasets on the order of a couple terabytes of text that's medium-compressible. I want to KISS because this fits on a single computer, but we still want chunking so that it's reasonably fast, efficient, and uses several cores. Dataset sizes are around 1-5TB uncompressed and 200-300GB compressed.

We've got lots of these datasets and we don't need HA uptime and have no streaming ingest. What tools should I look at before gluing something together myself? Single node elasticsearch? DuckDB? Shell scripts calling split, zstd, ripgrep, cat and a dozen other unix tools?

Twerk from Home
Jan 17, 2009

Can somebody give me a crash course in effective MySQL bulk loading? I could also use a recommendation for a logical backup tool that will work well with tables in the 50GB-300GB range: are people happier with mydumper or mysqlpump?

I've never touched MySQL before, though I have time in the trenches with Postgres and Spark SQL, and I have some big tables that are desperately in need of optimization. Right now they're entirely denormalized and have no indexes of any kind; I'm hoping to both normalize to save some space and add indexes to make querying them sane. This is genomics research data, and the researchers have previously been suffering through the fact that any query takes 3 minutes to come back, so an application that makes a lot of queries takes 5 days to run.

I've got a dozen questions and nobody to talk to, so if anybody sees obvious problems with any aspect of this I'd appreciate advice. This is an example of what I'm dealing with:

SQL code:
CREATE TABLE `BIG_TABLE_1` (
    `tissue` text,
    `gene` text,
    `phecode` text,
    `ensID` text,
    `beta` double DEFAULT NULL,
    `pval` double DEFAULT NULL,
    `zscore` double DEFAULT NULL,
    `description` text,
    `phe_group` text
) ENGINE = InnoDB DEFAULT CHARSET = latin1;
We've got lots of these; some have a few more columns, but they're all in the 50GB-300GB range with really low cardinality in the text fields, so there's tons of duplication.

My plan is to create new, normalized tables from this on a replica of the DB that I have exclusively to myself, so I can do full table locks, turn off indexing while inserting, and do any number of deferred processing tricks while I'm loading the data. I've been reading and have thrown a bunch of different stuff at it in hopes of faster inserts, but I'm not sure which of these are useful and which are just cargo cult bullshit:

SQL code:
USE pxcan_normalized;
BEGIN;

-- Configure for bulk inserts
SET autocommit = 0; -- I don't need this because I'm trying to do it all in one transaction, right? Should I batch it into smaller transactions?
SET unique_checks = 0;
SET foreign_key_checks = 0;
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; -- I had to set this because apparently InnoDB was creating row locks even though I have whole-table locks?

-- Does this even do anything for InnoDB?
LOCK TABLES tbl1 WRITE,
tbl2 WRITE; -- and so on
Anyway, my general plan was to pull the very-low-cardinality columns out into their own tables and then create normalized copies of the big tables using those new child tables. For example, we've got fewer than 50 distinct tissue, sample_group, and phe_group values, and fewer than 50,000 genes and 10,000 phecodes (with associated descriptions), so there's room to deduplicate a good amount of data.
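Sketched out, each child table is just an id plus the distinct values (names and types here are illustrative, not final):

SQL code:
-- Sketch of one lookup table; gene, phecode, phe_group, etc. follow the same
-- pattern. Column sizes are guesses, to be checked against the real data.
CREATE TABLE tissue (
    tissue_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    tissue VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_tissue (tissue)
) ENGINE = InnoDB;

INSERT INTO tissue (tissue)
SELECT DISTINCT tissue FROM BIG_TABLE_1 WHERE tissue IS NOT NULL;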

In the past, these huge tables have been made somewhat usable by adding indexes (on the big duplicated text fields) and by making additional, partially filtered copies of the table, for example:

SQL code:
CREATE TABLE bigtbl_pvalue_05 AS
SELECT
    *
FROM
    bigtbl
WHERE
    pval <.05;
This sucks. I'd much rather have indexes work for us. If I were in Postgres I'd partition by pval at the common cutoffs people use, but in MySQL I guess I'd have to make pval the first column in the primary key to affect how InnoDB lays it out on disk? The natural key here is (tissue, gene, phecode), which will become (tissue_id, gene_id, phecode_id) in my normalized schema. Would it be horrible to throw pval on the front of the primary key, and will that cause MySQL to cluster by pval such that queries with a specific pval range are almost as fast as the smaller copy of the table? Once I've got this working well, I've got dozens of these DBs to fix up.
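For my own notes, the two layouts I'm weighing look roughly like this. Sketch only: I haven't benchmarked either, and the column types are guesses:

SQL code:
-- Option A: natural key clusters the rows; pval range queries go through a
-- secondary index (which in InnoDB carries the PK columns as its row pointer).
CREATE TABLE big_tbl1_norm (
    tissue_id    SMALLINT UNSIGNED NOT NULL,
    gene_id      INT UNSIGNED NOT NULL,
    phecode_id   INT UNSIGNED NOT NULL,
    phe_group_id SMALLINT UNSIGNED NOT NULL,
    beta   FLOAT,
    pval   FLOAT,
    zscore FLOAT,
    PRIMARY KEY (tissue_id, gene_id, phecode_id),
    KEY idx_pval (pval)
) ENGINE = InnoDB;

-- Option B: lead the PK with pval so InnoDB physically clusters by it, at the
-- cost of a fatter, stranger key; uniqueness still comes from the trailing
-- natural-key columns.
-- PRIMARY KEY (pval, tissue_id, gene_id, phecode_id)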

Twerk from Home
Jan 17, 2009


Boris Galerkin posted:

I'm very familiar with Python, and I know the "basics" of SQL enough that I can look at the posts here ITT and generally understand, but I want to learn SQL more. How would you guys recommend learning SQL in 2024?

I think you might get better advice if you specify whether you're more interested in SQL as an application developer who wants fast transactions, or for data analysis and transformation, because those are pretty different worlds with different skill sets once you get past the basics.

I'm clueless in the OLAP world and doing my best to learn quickly. There are also so many kinds of tools at this point; everything uses SQL.


Twerk from Home
Jan 17, 2009


Jabor posted:

instead of spending brainpower worrying about clustering and how to optimally set up a primary key, i would just do the normalization and then spec out a server that fits the resulting dataset in ram.

it's likely to be both cheaper and give better outcomes than spending a bunch of your expensive time on optimization.

This is the advice that I'm here for, but I am still seeking a faster way to do the normalization / transformation in the first place.

MySQL is pretty slow chewing through 1.5 billion rows in a single transaction; any suggestions on good ways to batch this down (or parallelize it, or anything)? I've got lots of databases to do this to, and I've run into some edge-case failures on the data that I didn't encounter when I tested with a tiny subset of rows to make sure everything worked.

SQL code:
INSERT INTO
    big_tbl1_norm (
        beta,
        pval,
        zscore,
        tissue_id,
        gene_id,
        phecode_id,
        phe_group_id
    )
SELECT
    CAST(orig.beta AS FLOAT),
    CAST(orig.pval AS FLOAT),
    CAST(orig.zscore AS FLOAT),
    t.tissue_id,
    g.gene_id,
    p.phecode_id,
    pg.phe_group_id
FROM
    src_schema.big_tbl1 orig
    JOIN src_schema.tissue t ON orig.tissue = t.tissue
    JOIN src_schema.gene g ON orig.gene = g.gene
    JOIN src_schema.phecode p ON orig.phecode = p.phecode
    JOIN src_schema.phe_group pg ON orig.phe_group = pg.phe_group;
So far the MySQL-isms I've tripped on are that text indexes are case-insensitive by default, meaning you can't store somestring and SomeString in a column with a unique index, and that InnoDB will run out of row locks and die with "The total number of locks exceeds the lock table size" on big operations like this. I thought locking the whole table would prevent row locks from being taken, but I had to set READ UNCOMMITTED.
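If it helps anyone later, the batching direction I'm leaning toward is slicing on one of the low-cardinality columns and committing per slice, roughly like this (sketch; 'Whole_Blood' is just a hypothetical tissue value, looped over from a small driver script):

SQL code:
-- Sketch: one transaction per tissue value instead of one giant transaction;
-- tissue has fewer than 50 values, so the loop is short.
INSERT INTO
    big_tbl1_norm (beta, pval, zscore, tissue_id, gene_id, phecode_id, phe_group_id)
SELECT
    CAST(orig.beta AS FLOAT),
    CAST(orig.pval AS FLOAT),
    CAST(orig.zscore AS FLOAT),
    t.tissue_id, g.gene_id, p.phecode_id, pg.phe_group_id
FROM
    src_schema.big_tbl1 orig
    JOIN src_schema.tissue t ON orig.tissue = t.tissue
    JOIN src_schema.gene g ON orig.gene = g.gene
    JOIN src_schema.phecode p ON orig.phecode = p.phecode
    JOIN src_schema.phe_group pg ON orig.phe_group = pg.phe_group
WHERE
    orig.tissue = 'Whole_Blood'; -- hypothetical value; iterate over SELECT DISTINCT tissue
COMMIT;

Whether that actually beats one big transaction with the settings above, I don't know yet.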
