How big is too big for Postgres indexes? I'm trying to parachute into a struggling application that I know little about and improve performance. I've got a 15GB, ~12.5-million-row table with no relations. The hottest index is >2.5GB, and individual queries using that index are both slow (>1.5s) and driving high CPU usage. The problem query is definitely doing only an index scan with very low planning time, but execution time is high.

The index is a four-column multi-column index, and we're querying on three of the columns in the right order, so the index is being used. The columns being queried on are highly duplicated, and they're also text fields. Two of the three columns we're querying with have only 6 distinct values each, so my immediate thinking is that we can reduce the index size by normalizing those out into their own tables, so that instead of duplicated text making the index huge we just have small numeric foreign keys.

Any suggestions for different kinds of indexes that would perform better with such duplicated values in the index? It's just a normal B-Tree right now.

Edit: Actually, I would bet that putting the least-duplicated column first in the multi-column index helps... let me try that.

Twerk from Home fucked around with this message at 03:49 on Sep 15, 2020
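If it helps to picture the normalization idea, here's a sketch (illustrative table and column names, not our real schema):

SQL code:
-- Pull each low-cardinality text column into a lookup table and index
-- the small integer FK instead. "events", "status", etc. are made up.
CREATE TABLE status_lookup (
    id   smallint PRIMARY KEY,
    name text NOT NULL UNIQUE
);

ALTER TABLE events ADD COLUMN status_id smallint REFERENCES status_lookup;

UPDATE events e
SET status_id = s.id
FROM status_lookup s
WHERE s.name = e.status;

-- Rebuild the hot composite index over the narrow columns, most
-- selective column first:
CREATE INDEX idx_events_hot ON events (account_ref, status_id, region_id);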
# Sep 15, 2020 03:40
Pardot posted:
Large indexes are fine as long as (and I'm assuming this is OLTP) you have enough RAM to keep cache hit rates at 99%.

I really appreciate this, thank you. I'm going to start by hitting the docs about GIN indexes.
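For anyone following along, this is the kind of thing I'm about to read up on (placeholder names; plain scalar columns need the btree_gin extension before GIN will accept them):

SQL code:
-- GIN stores each distinct key once, with a compressed posting list of
-- matching rows, so heavily duplicated columns cost far less than in a
-- B-Tree. Table and column names are placeholders.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX idx_events_gin ON events USING gin (status, region, kind);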
# Sep 23, 2020 14:45
Moonwolf posted:
On top of what the others have said, MySQL can only do nested loop joins, which are simple and the fastest way to do small result sets, but will always suck no matter what for 10k+ result sets, and be terrible for 100k+. Trimming unused joins will certainly help though.

Postgres and MSSQL can perform much better when you're asking them to do herculean things, right? My understanding is that the less database-friendly your workload is, the further MySQL falls behind.
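The comparison I have in mind, on the Postgres side (toy tables, purely illustrative):

SQL code:
-- Watch the planner pick a hash join for a big result set, then force
-- it down to nested loops to see what a loop-only engine is stuck with.
EXPLAIN ANALYZE
SELECT *
FROM orders o
JOIN line_items li ON li.order_id = o.id;

SET enable_hashjoin = off;
SET enable_mergejoin = off;

EXPLAIN ANALYZE
SELECT *
FROM orders o
JOIN line_items li ON li.order_id = o.id;  -- nested loop now, much slower at 100k+ rows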
# Jan 11, 2021 21:05
Moonwolf posted:
At my work we have 2 OLTP databases which are clocking in at ~70TB and 55TB, and it still drat works, although large table operations are getting hard so the business has finally really decided to invest in archiving/purging out the drat things.

Are those Oracle or SQL Server, or did you build that on something else? That's pretty huge, and I'm curious whether you have a ton of tables and multiple applications sharing them, or whether your core data model is really that big. Either way it's impressive. Do your biggest indexes still fit in memory?
# Jun 14, 2021 12:48
D34THROW posted:
Isn't NoSQL non relational? How the gently caress is a push to a non relational system supposed to work when an overwhelming majority of enterprise data is relational?

As linked above, all of the major cloud vendors offer a much-more-complicated approach where you effectively represent relational data in NoSQL databases that are basically just key/value stores, plus CQRS, as covered in Azure's docs. I've seen teams move applications to this. Engineers love it because it's a ton of new things to learn and concepts to reason about, and the cloud vendors love it because it moves data storage into platform-specific solutions like DynamoDB, Lambdas, and all sorts of entirely proprietary things, which means the application carries vendor-specific code all the way to its core. Once you've done it, you can scale up near-infinitely at the cost of a higher cloud bill. Done right, it really is more resilient, at the cost of something like 20x the total compute work and complexity, but all of that additional compute is horizontally scalable, unlike a single master database node.

Edit: The biggest practical advantage I've seen from NoSQL is that you can give every team isolated database resources, preventing one team from pooping in the pool and impacting other teams. I spent 5 years working in a huge, many-team microservices world where a given data model would sprawl across teams, but nobody wanted teams sharing a database, so it's not like SQL databases could be used effectively either.
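If you've never seen the pattern, this is the flavor of it, written as SQL for familiarity (a hand-rolled sketch, not any vendor's actual schema):

SQL code:
-- Everything becomes one partition/sort keyed table; every access path
-- has to be pre-baked into the keys. Key formats here are made up.
CREATE TABLE kv_store (
    partition_key text NOT NULL,   -- e.g. 'CUSTOMER#42'
    sort_key      text NOT NULL,   -- e.g. 'ORDER#2022-03-07#1001'
    payload       jsonb NOT NULL,  -- denormalized copy of the entity
    PRIMARY KEY (partition_key, sort_key)
);

SELECT payload
FROM kv_store
WHERE partition_key = 'CUSTOMER#42'
  AND sort_key LIKE 'ORDER#%';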
# Mar 7, 2022 16:32
lazerwolf posted:
Can you expound on this? Why are ORMs bad?

This post is more than 15 years old now but still pretty relevant: https://blog.codinghorror.com/object-relational-mapping-is-the-vietnam-of-computer-science/. ORMs are an extremely leaky abstraction. To use one efficiently, you also have to understand the basics of how it translates its own API into queries, how its generated tables will actually behave on the database, and what effective queries look like for your data.

That said, I start with an ORM most of the time on most applications, and I'll defend that decision. Every sane ORM lets you hand-write queries where you need to, and an ORM means less boilerplate than a query builder for all of the basic operations. ORMs aren't bad; they're a trade-off that lets you move faster by accepting more complexity, and it's usually one I'm personally willing to make. They also let more people write data-access code without a confident grasp of SQL, which is fine as long as other people who do understand the ORM and SQL are reviewing the code.
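The classic leak, sketched with made-up tables: a lazy-loading ORM loop quietly issues one query per row where anyone writing SQL by hand would use a single join.

SQL code:
-- What a naive lazy-loading loop actually sends (one query per order):
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM line_items WHERE order_id = 1001;
SELECT * FROM line_items WHERE order_id = 1002;
-- ...and so on, N+1 round trips in total.

-- What you'd write by hand, or get by telling the ORM to eager-load:
SELECT o.id, o.placed_at, li.sku, li.qty
FROM orders o
JOIN line_items li ON li.order_id = o.id
WHERE o.customer_id = 42;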
# Aug 17, 2022 19:21
I feel like there must be a better way to handle filters in a HAVING clause. I can't use window functions there, and it stinks. I've got a common pattern with aggregates where I want to group everything that isn't a significant % of the total into an "other" category. To get concrete, I threw something together where I have colors and years, and I want the count of every color that makes up at least 20% of the total, with everything under that bar grouped into an "other" category.
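What I hacked together is shaped like this (a sketch, assuming a cars(color, year) table; names are illustrative):

SQL code:
-- Count per color, compute the total in a one-row CTE, and split on
-- the 20% bar without needing a window function in HAVING.
WITH counted AS (
    SELECT color, count(*) AS n
    FROM cars
    GROUP BY color
), total AS (
    SELECT sum(n) AS total FROM counted
)
SELECT c.color, c.n
FROM counted c CROSS JOIN total t
WHERE c.n >= 0.2 * t.total
UNION ALL
SELECT 'other', sum(c.n)
FROM counted c CROSS JOIN total t
WHERE c.n < 0.2 * t.total;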
I realize there's a completely different way I could get to the "other" amount, by subtracting the summed counts of all the colors that passed the >=20% bar from the total of all matching rows, but I can't figure out how to use that.
# Oct 22, 2022 02:24
Ninja.Bob posted:
Can you use a subquery?

Yes, and your solution is a much better approach to the whole space, thanks!
# Oct 22, 2022 16:55
Hughmoris posted:
Rookie question:

First I'd check whether https://pgloader.io/ supports your version of MS SQL, and use it if it does. Also, if anybody here is an experienced Lisper, please update pgloader to work with MySQL's new auth; it's the most in-demand issue on GitHub.
# Nov 28, 2022 15:34
Stupid analytics-101 type question, with a basic SQL Fiddle of examples: http://sqlfiddle.com/#!17/41a4b/14. I've got people, notes, and codes.
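A minimal model looks like this (a sketch; the fiddle's exact column names are assumptions here):

SQL code:
-- Assumed minimal schema: one person row, many notes/codes per person.
CREATE TABLE person (id int PRIMARY KEY);
CREATE TABLE note (
    id        int PRIMARY KEY,
    person_id int REFERENCES person (id)
);
CREATE TABLE code (
    id        int PRIMARY KEY,
    person_id int REFERENCES person (id)
);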
So I want something like:

pre:
person_id  notes  codes
1          4      5
2          8      0
3          0      16
# Feb 3, 2023 23:37
I feel like there's something really fundamental I'm missing. Playing around in SQL Fiddle, the execution plan does look better with the two CTEs, one to count each type of record, and then a JOIN on person_id at the end. I hacked through this by making two separate queries and smashing the results together in Python, because no matter which way I tried to get it all out in one query, I was going OOM and dying with 360GB of RAM across the Spark cluster, and I'd have to bother IT to get a bigger one. Notes is >1 billion records, there are a couple million people, and I didn't bother to count codes, but it's somewhere between millions and a billion.
# Feb 4, 2023 03:05
PhantomOfTheCopier posted:
Why does each cte also need the person table?

I don't need the person table in each CTE, and that solves the memory usage in Spark! Good call. It's super fast and simple now with the two CTEs.

As for resource usage: Spark does really unpredictable things compared to any relational database I've used, because it isn't a relational database; it's just pretending to be one over semi-structured data. I hate Spark, but it seems to be the dominant tool and what gets handed over as output from whatever data processing pipeline companies run now.
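For posterity, the shape that ended up working (a sketch; table and column names assumed from the fiddle):

SQL code:
-- Count each record type independently, then join the small aggregates;
-- person appears exactly once, at the end.
WITH note_counts AS (
    SELECT person_id, count(*) AS notes
    FROM note
    GROUP BY person_id
), code_counts AS (
    SELECT person_id, count(*) AS codes
    FROM code
    GROUP BY person_id
)
SELECT p.id AS person_id,
       coalesce(n.notes, 0) AS notes,
       coalesce(c.codes, 0) AS codes
FROM person p
LEFT JOIN note_counts n ON n.person_id = p.id
LEFT JOIN code_counts c ON c.person_id = p.id;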
# Feb 4, 2023 06:34
Not really a SQL question, but I figure y'all have the lay of the land for analytics tools in here. Cross-posting:

Twerk from Home posted:
I'm wanting a single-node simple text search for exploring datasets on the order of a couple terabytes of text that's medium-compressible. I want to KISS because this fits on a single computer, but we still want chunking so that it's reasonably fast and efficient and uses several cores. Dataset sizes are around 1-5TB uncompressed and 200-300GB compressed.
# May 25, 2023 23:11
Can somebody give me a crash course in effective MySQL bulk loading? I could also use a recommendation for a logical backup tool that'll work well with tables in the 50GB-300GB range; are people happier with mydumper or mysqlpump? I've never touched MySQL before, though I have time in the trenches with Postgres and Spark SQL, and I've got some big tables that are desperately in need of optimization. Right now they're entirely denormalized and have no indexes of any kind; I'm hoping to both normalize to save space and add indexes to make querying them sane. This is genomics research data, and the researchers have been suffering through the fact that any query takes 3 minutes to come back, so an application that makes a lot of queries takes 5 days to run. I've got a dozen questions and nobody to talk to, so if anybody sees obvious problems with any aspect of this, I'd appreciate advice.
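This is an example of what I'm dealing with (an illustrative stand-in rather than the real schema):

SQL code:
-- Fully denormalized, big duplicated text columns, no indexes at all.
-- Column names are made up; the real tables have the same shape.
CREATE TABLE variant_annotation (
    chrom       text,    -- ~25 distinct values repeated ~1.5 billion times
    position    bigint,
    ref         text,
    alt         text,
    gene_name   text,    -- heavily duplicated
    consequence text,    -- heavily duplicated
    sample_id   text
);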
My plan is to create new, normalized tables from this on a replica of the DB that I have exclusively to myself, so I can take full table locks, turn off indexing while inserting, and use any number of deferred-processing tricks while loading the data. I've been reading and have thrown a bunch of different settings at it in hopes of faster inserts, but I'm not sure which of them are useful and which are just cargo-cult bullshit.
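The pile of settings in question, roughly (a sketch of the usual suspects; treat the exact list and values as assumptions):

SQL code:
-- Session-level InnoDB fast-load settings, commonly recommended for
-- exclusive-access bulk loads like this one.
SET unique_checks = 0;        -- skip unique-key checks during the load
SET foreign_key_checks = 0;   -- skip FK validation during the load
SET autocommit = 0;           -- batch many inserts per commit
SET sql_log_bin = 0;          -- skip binlog writes (needs privileges; replica only)
-- ...bulk INSERT ... SELECT work goes here...
COMMIT;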
In the past, these huge tables have been made somewhat usable by adding indexes (to the big duplicated text fields) and also by making additional copies of the table that have been somewhat filtered.
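For example (made-up names, but the same pattern):

SQL code:
-- A somewhat-filtered copy of the big table, plus a prefix index on
-- one of the big duplicated text fields.
CREATE TABLE variant_annotation_missense AS
SELECT *
FROM variant_annotation
WHERE consequence = 'missense_variant';

CREATE INDEX idx_missense_gene ON variant_annotation_missense (gene_name(32));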
# Mar 2, 2024 06:09
Boris Galerkin posted:
I'm very familiar with Python, and I know the "basics" of SQL enough that I can look at the posts here ITT and generally understand, but I want to learn SQL more. How would you guys recommend learning SQL in 2024?

I think you might get better advice if you specify whether you're more interested in SQL as an application developer who wants fast transactions, or for data analysis and transformation, because those are pretty different worlds with different skill sets once you get past the basics. I'm clueless in the OLAP world and doing my best to learn quickly. There are also so many kinds of tools at this point; everything uses SQL.
# Mar 2, 2024 21:18
Jabor posted:
instead of spending brainpower worrying about clustering and how to optimally set up a primary key, i would just do the normalization and then spec out a server that fits the resulting dataset in ram.

This is the advice that I'm here for, but I'm still looking for a faster way to do the normalization/transformation in the first place. MySQL is pretty slow chewing through 1.5 billion rows in a single transaction; any suggestions on good ways to batch this down (or parallelize it, or anything)? I've got lots of databases to do this to, and I've run into some edge-case failures in the data that I didn't hit when I tested everything with a tiny subset of rows.
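For concreteness, the transformation is shaped like this, cut down here to a single batch (illustrative names; the real statement is wider):

SQL code:
-- One batch of the INSERT ... SELECT, keyed on chromosome + position so
-- a driver script can walk ranges instead of one giant transaction.
-- Lookup tables (chromosome, gene, consequence) are assumed to exist.
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

INSERT INTO variant_norm (chrom_id, position, gene_id, consequence_id)
SELECT ch.id, v.position, g.id, c.id
FROM variant_annotation v
JOIN chromosome ch ON ch.name = v.chrom
JOIN gene g        ON g.name  = v.gene_name
JOIN consequence c ON c.name  = v.consequence
WHERE v.chrom = '1'
  AND v.position BETWEEN 1 AND 10000000;  -- advance the window each batch

COMMIT;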
# Mar 4, 2024 19:04