|
Just-In-Timeberlake posted:boy I sure do love trying to unfuck a query based on a dozen views that are all like I did that last year! Collapsing it all into one view helped performance quite a bit, then we could drop all the sub-views because nothing used them. We didn't, but we could have.
|
# ? Mar 4, 2022 18:50 |
|
For those in the field, which way do you see the wind blowing for DBAs over the next few years? From the outside looking in, it looks like the cloud is being pushed more and more. I was watching a Microsoft video where a prominent leader of Azure SQL was talking about the future of DBAs, and how the role isn't going away but will likely morph into something closer to a solution architect.
|
# ? Mar 6, 2022 05:24 |
|
I don't buy it. They've pushed for a solid decade to get everyone to go NoSQL and it hasn't happened. Things need to be really, really large, or particularly susceptible to scalability outages, to be worth the hassle and expense of making it work. When I was at AWS I saw lots of complaints about relational databases, even principal engineer mandates to avoid them, but there was zero guidance on how to make it happen. As a result, software developers were left to invent ways to store data, which usually meant ORM or WordPress-site-equivalent designs; ie, the flattest, most inefficient models you can imagine.

Meanwhile, hardware is cheap and MTBF is relatively high. Relational databases are backed by a ton of theory, continue to evolve with increases in memory and CPU power, and are often less expensive than a bunch of custom software. Even with cloud solutions like Aurora, database performance requires attention, knowledge, experience, investigation, adaptation. Sure, the cloud knows how much memory to give to buffers based on hardware settings; it's not telling you how to lay out your tables or optimize queries.

It also seems like cloud data tooling is in its infancy. Companies don't think about ETL and reporting automation until someone decides "they need ML". More simply, engineers don't know those tools. College graduates are more likely to get basic relational design. With more more more data these days, companies will increasingly encounter customer requests to extend data models and volume, so performance optimization and investigation will still be important.

Disclaimer: the world generally does the exact opposite of what (I think) makes sense.
|
# ? Mar 6, 2022 09:34 |
|
Isn't NoSQL non relational? How the gently caress is a push to a non relational system supposed to work when an overwhelming majority of enterprise data is relational?
|
# ? Mar 6, 2022 14:44 |
|
D34THROW posted:Isn't NoSQL non relational? How the gently caress is a push to a non relational system supposed to work when an overwhelming majority of enterprise data is relational? it makes zero sense. the people who have pushed this obviously have never heard of a JOIN.
|
# ? Mar 6, 2022 19:11 |
|
My team spoke with an AWS technical person about some trouble we were having wiring their managed GraphQL service up to a relational database (disclaimer: this was a prototype and we were exploring whether it made sense knowing that it might not) and the answer we got was “Why aren’t you using DynamoDB?” He did not seem at all prepared for our answer, which was “Because we have relational requirements on our data and nowhere near enough of it to feel any performance constraints from a relational system.”
|
# ? Mar 6, 2022 19:13 |
|
Yeap it's basically all those things. While solutions architects and cloud consultants won't universally bring up NoSQL, you'll hear an increasing number telling you that it's more horizontally scalable, vertically scalable, available, shardable, redundant, etcetera. While some of that is correct, that may be the extent of their reasoning; ie, they aren't solving your product's problem statement. Take a look at this. It's not an awful design, but read between the lines: https://aws.amazon.com/blogs/database/options-for-legacy-application-modernization-with-amazon-aurora-and-amazon-dynamodb/. What is the author actually telling you about your relational database? Or try the article from Azure.
|
# ? Mar 6, 2022 20:42 |
|
Back in ye olde days of Access and SharePoint lists, the non-relational structure of the latter is specifically what prevented me from using it. What kind of datasets are going to bog down a relational DB, and what lack of indexing is occurring there?
|
# ? Mar 6, 2022 23:17 |
|
D34THROW posted:Back in ye olde days of Access and SharePoint lists, the non relational structure of the latter is specifically what prevented me from using it. What kind of datasets are going to bog down a relational DB and what lack of indexing is occurring there Lots of idiots don't do indexes. That's really what nosql abstracts away for them: index maintenance. Make your information a bit more hierarchical and store it as whole serialized objects with no referential integrity to any other object. It feels hella-unnatural, and the only use case I've really found for it was storing an enterprise cloud app's weblogs and poo poo: shittons of long text records that you want better free-text searching within.
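A minimal sketch of that pattern, using Python's stdlib sqlite3 as a stand-in document store (table name and records made up for illustration): whole records serialized as opaque JSON blobs, findable only by free-text matching, with nothing enforcing referential integrity.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One table, one blob column: every record is a whole serialized object.
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, doc TEXT NOT NULL)")

record = {"service": "web", "level": "error", "message": "disk full on /var"}
conn.execute("INSERT INTO logs (doc) VALUES (?)", (json.dumps(record),))

# Free-text search is just a LIKE over the serialized blob; nothing
# checks that "service" refers to a real row anywhere else.
rows = conn.execute(
    "SELECT doc FROM logs WHERE doc LIKE ?", ("%disk full%",)
).fetchall()
print(json.loads(rows[0][0])["level"])  # error
```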
|
# ? Mar 7, 2022 07:02 |
|
Does "index maintenance" include the definition of primary keys? I'm always very suspicious of people who don't give their primary keys the love they deserve. They're not a technical detail; "What's the PK here?" is one of the first questions anybody should ask themselves when they start working with any form of dataset.
|
# ? Mar 7, 2022 11:28 |
|
Perhaps it's a bit Hungarian notation in practice but my PKs and FKs are always blatantly obvious, named either "id" or "<foreign table>_id". Even when there is a field that should be unique between rows I create an integer PK to avoid potential issues. But i am babby at this so.
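That convention in a toy sqlite3 sketch (the table names here are hypothetical, not from the post): the PK is always "id", every FK is "<foreign table>_id", and both are plain integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,   -- surrogate integer PK
        name TEXT NOT NULL
    );
    CREATE TABLE invoice (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),  -- "<foreign table>_id"
        total       REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customer (name) VALUES ('ACME')")
conn.execute("INSERT INTO invoice (customer_id, total) VALUES (1, 9.99)")

# Joins read almost mechanically: invoice.customer_id = customer.id
row = conn.execute("""
    SELECT customer.name, invoice.total
    FROM invoice JOIN customer ON invoice.customer_id = customer.id
""").fetchone()
print(row)  # ('ACME', 9.99)
```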
|
# ? Mar 7, 2022 12:56 |
|
Wipfmetz posted:Does "index maintenance" include the definition of primary keys? I'm always very suspicious of people who don't give their primary keys the love they deserve. Defining indices on stupid columns, not maintaining indices, not having supporting indices on FK columns/commonly queried columns, all that poo poo here. D34THROW posted:Perhaps it's a bit Hungarian notation in practice but my PKs and FKs are always blatantly obvious, named either "id" or "<foreign table>_id". Even when there is a field that should be unique between rows I create an integer PK to avoid potential issues. This is 100% good practice from start to finish. Comparing ints is a single step operation for a CPU and yet some folks try to PK on a VARCHAR instead.
|
# ? Mar 7, 2022 13:53 |
|
Nth Doctor posted:This is 100% good practice from start to finish. Comparing ints is a single step operation for a CPU and yet some folks try to PK on a VARCHAR instead. And if you have a usage scenario where such performance benefits are required. I'm usually working with engineering data so all that data has cute little tables where the usually handled dataset size rarely enters the millions. I'll happily put PKs on VARCHARs whenever it reduces _my_ time, e.g. because I'll get obvious uniqueness guarantees onto which i can build my queries on. I'm usually also adding an autoincrement field to direct my FKs to. No idea if that is stupid or not, performancewise. (I guess it would be better to have the PK on the autoincrement and ensuring those "obvious uniqueness guarantees" by additonal unique indices. That should make INSERTs easier. ) Wipfmetz fucked around with this message at 14:33 on Mar 7, 2022 |
# ? Mar 7, 2022 14:29 |
|
Nth Doctor posted:This is 100% good practice from start to finish. Comparing ints is a single step operation for a CPU and yet some folks try to PK on a VARCHAR instead. Case in point: in the application I'm (re)working on now, every material report is tied to a job. That job has a unique identifier in the format JJ######. I could either A) store the job number directly as an integer with all the formatting headaches that brings (adding leading zeroes if necessary, adding the JJ prefix), or B) store the job number as a VARCHAR(8) and not have to gently caress with formatting. A year ago, I would have tried to brute-force the more complicated solution as a chest-beating programming-man moment.
|
# ? Mar 7, 2022 14:32 |
|
D34THROW posted:Isn't NoSQL non relational? How the gently caress is a push to a non relational system supposed to work when an overwhelming majority of enterprise data is relational? As linked above, all of the major cloud vendors offer a much-more-complicated solution where you can effectively represent relational data in NoSQL databases that are basically just key/value stores with CQRS, as seen in Azure's docs here.

I've seen teams move applications to this. Engineers love it because it's a ton of new things to learn and study and concepts to reason about, and the cloud vendors love it because it moves the data storage into platform-specific solutions like DynamoDB, Lambdas, and all sorts of entirely proprietary things that mean the application has vendor-specific code all the way to the core. Also, once you've done it, you can scale up near infinitely at the cost of a higher cloud bill. Done right, it really is more resilient, at the cost of doing something like 20x the total compute work and complexity, but all of that additional compute work is horizontally scalable, unlike a single master database node.

Edit: The biggest practical advantage I've seen from NoSQL is that you can let every team have isolated database resources, preventing one team from pooping in the pool and impacting other teams. I spent 5 years working in a huge, many-team, microservices world, and a given data model would sprawl across teams, but they didn't want any teams sharing a database, so it's not like SQL databases could be used effectively either.
|
# ? Mar 7, 2022 16:32 |
|
D34THROW posted:Case in point in the application I'm (re)working on now, every material report is tied to a job. That job has a unique identifier in the format JJ######. I could either A) store the job number directly as an integer with all the formatting headaches that brings (adding leading zeroes if necessary, adding the JJ prefix), or store the job number as a VARCHAR(8) and not have to gently caress with formatting. You could even store the job number as a CHAR(8) if you knew it would never ever change. Storing as a VARCHAR(8) locks you into that lurking "what if we need a seventh numerical digit?" landmine just as easily. I'm a big advocate for identity columns even when there's already a uniquely valued column in the table, if the column isn't comparable as a single CPU step. There's no reason you should spend multiple cycles comparing J, J, 4, 2, 1, 2, 6, 3 to J, J, 4, 2, 1, 2, 6, 0 one by one when you can just as easily see 34920 vs. 88382381 isn't a match with a single step.
|
# ? Mar 7, 2022 18:05 |
Nth Doctor posted:I'm a big advocate for identity columns even when there's already a uniquely valued column in the table, if the column isn't comparable as a single CPU step. There's no reason you should spend multiple cycles comparing J, J, 4, 2, 1, 2, 6, 3 to J, J, 4, 2, 1, 2, 6, 0 one by one when you can just as easily see 34920 vs. 88382381 isn't a match with a single step. On the other hand, 8 bytes is the size of an int64, and your CPU can already compare two int64 values in a single instruction. Compilers for C++ and other languages can translate a comparison of an array of 8 bytes into a comparison of a single 64-bit value, depending on exact context, so a database engine might also be able to.
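The 8-bytes-as-one-int64 point can be checked directly in Python; `as_u64` below is a hypothetical helper name, and the big-endian unpack is what keeps the integer ordering in line with byte order.

```python
import struct

def as_u64(s: bytes) -> int:
    # Reinterpret exactly 8 bytes as one unsigned 64-bit integer.
    # Big-endian, so numeric order matches lexicographic byte order.
    return struct.unpack(">Q", s)[0]

a, b = b"JJ421263", b"JJ421260"
# One integer comparison instead of walking the strings byte by byte;
# for equal-length 8-byte keys the answers agree.
print(as_u64(a) == as_u64(b))              # False
print((as_u64(a) < as_u64(b)) == (a < b))  # True
```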
|
|
# ? Mar 7, 2022 18:17 |
|
nielsm posted:On the other hand, 8 bytes is the size of an int64 and your CPU can already compare two int64 values in a single instruction. Compilers for C++ and other languages can translation a comparison of an array of 8 bytes into a comparison of a single 64 bit value, depending on exact context, so a database engine might also be able to. You raise a decent point in the case where the strings are <= 8 characters long, that's true. I would also bet modern dbms engines are smart enough to use that trivial escape hatch. You also have the collation layer to get through, though. Does the server/logical database/column have a collation set for case sensitivity vs. case insensitivity? Because that can mean extra work that the engine wouldn't necessarily know about a priori.
|
# ? Mar 7, 2022 18:50 |
|
pertinent to this conversation https://www.youtube.com/watch?v=b2F-DItXtZs
|
# ? Mar 7, 2022 22:06 |
|
Just-In-Timeberlake posted:pertinent to this conversation 12 years later and it only gets worse
|
# ? Mar 8, 2022 01:04 |
|
Rookie question: Can someone explain, or point me towards a reference on, how log speed/size affects transaction speeds? I saw a blog post from Brent Ozar where he points out that a $22k/mo Azure SQL Database is woefully slow at loading data. https://www.brentozar.com/archive/2019/02/how-fast-can-a-21468-mo-azure-sql-db-load-data/ The way I understood it is that the log file speed limit was woefully low. I'm starting my SQL Server journey (beyond writing reports) and still not sure how the log controls the speed.
|
# ? Mar 8, 2022 01:14 |
|
Wipfmetz posted:Does "index maintenance" include the definition of primary keys? I'm always very suspicious of people who don't give their primary keys the love they deserve. if your pk ain't a surrogate it's a shameful pk
|
# ? Mar 8, 2022 03:14 |
|
Hughmoris posted:Rookie question: Can someone explain, or point me towards a reference, of how log speed/size affects transaction speeds? I think you should start reading here. The MSDN pages about MSSQL are usually very good. In brief though, the way transactional databases are typically implemented is that when you're changing data you first write a log entry for what you're going to do to a transaction log, and once that's been persisted then you can go off and update the actual tables and indexes, usually asynchronously (you don't flush to disk after every single transaction commit). That way if you crash or lose power while you're updating the big table and index files (or before the changes have been flushed to disk) you can recover from the transaction log, which is cheaper to write to. In postgres this is called the write-ahead log (WAL); in MSSQL it's just the transaction log. Apparently what's going on with Azure SQL is that while you can get a shitton of disk space and very fast I/O for the actual tables and indexes, that apparently doesn't change the I/O speed for the transaction log, so if you have a lot of small transactions you're bottlenecked by the log write speed. TheFluff fucked around with this message at 03:30 on Mar 8, 2022 |
# ? Mar 8, 2022 03:27 |
|
TheFluff posted:I think you should start reading here. The MSDN pages about MSSQL are usually very good. That makes sense, thanks for the explanation! *Very timely video that YT recommended: Data Loading Best Practices on Azure SQL DB https://www.youtube.com/watch?v=WP-bqtVdJg4 Hughmoris fucked around with this message at 15:31 on Mar 8, 2022 |
# ? Mar 8, 2022 15:17 |
|
TheFluff posted:if your pk ain't a surrogate it's a shameful pk
|
# ? Mar 8, 2022 20:31 |
|
TheFluff posted:if your pk ain't a surrogate it's a shameful pk I don't think I've ever used anything but a surrogate PK. Just always made sense to me to let the DBMS do it.
|
# ? Mar 8, 2022 20:44 |
|
Nth Doctor posted:Defining indices on stupid columns, not maintaining indices, not having supporting indices on FK columns/commonly queried columns, all that poo poo here. I wouldn't use a varchar but a uniqueidentifier or equivalent is a good idea if you think the keys might make it to the front-end because then they shouldn't be guessable. I've seen it used as a way to make slamming a bunch of data into multiple tables a little easier because you can make the PK/FK in code but I'm not sure if that was necessarily a good idea rather than an expedient one.
|
# ? Mar 8, 2022 21:23 |
|
turns out it actually doesn't matter for performance whether your primary key is an int or a varchar or a uuid. what matters is poo poo like n+1 query patterns and horrible joins produced by hibernate or equivalent garbage. you'll need someone who can read and interpret the output of explain long before you'll need to care about the number of cpu cycles per equality comparison
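The N+1 shape, sketched with sqlite3 and made-up tables: same result set, but three queries (one per author plus one for the list) instead of a single JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bo');
    INSERT INTO book VALUES (1, 1, 'X'), (2, 1, 'Y'), (3, 2, 'Z');
""")

# N+1: one query for the parents, then one more query per parent row.
# This is the access pattern ORMs generate when lazy-loading relations.
queries = 0
titles_n_plus_1 = []
authors = conn.execute("SELECT id FROM author").fetchall()
queries += 1
for (author_id,) in authors:
    rows = conn.execute(
        "SELECT title FROM book WHERE author_id = ?", (author_id,)
    ).fetchall()
    queries += 1
    titles_n_plus_1 += [t for (t,) in rows]

# Same result, one round trip.
titles_join = [t for (t,) in conn.execute(
    "SELECT book.title FROM book JOIN author ON book.author_id = author.id"
)]
print(queries, sorted(titles_n_plus_1) == sorted(titles_join))  # 3 True
```

With 2 authors that's 3 queries; with 10,000 parents it's 10,001 round trips, which swamps any int-vs-varchar comparison cost.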
|
# ? Mar 8, 2022 22:03 |
|
i'm probably the horror, but i've never seen the point in having special guids or whatever so that ids can't be guessed on the front end. you should have authorization checks in place to prevent users accessing resources they're not allowed to regardless, so even if they changed id=69 to id=420 they should just get a 401 or 404 or whatever you like maybe knowing the current id and growth rate could leak a bit of information? seems a bit of a niche problem i guess it could be defence in depth if your authn/z checks are buggy? plus copying/typing numeric ids is just so dang convenient
|
# ? Mar 9, 2022 05:55 |
|
There are use cases for "anyone with the url is presumed to be allowed access", but you still don't want everything to be publicly browsable.
|
# ? Mar 9, 2022 07:11 |
|
redleader posted:i'm probably the horror, but i've never seen the point in having special guids or whatever so that ids can't be guessed on the front end. you should have authorization checks in place to prevent users accessing resources they're not allowed to regardless, so even if they changed id=69 to id=420 they should just get a 401 or 404 or whatever you like If you have an API that can authorize an id+password, it will be much easier to break in with a scan of [0-65563]/password123 than trying accounts individually. If IDs are sequential, it's fairly fast to find a reasonable upper limit and then try to hide the scan as normal traffic. For a random ID space, most results will be visible as an invalid id metric, which is much more detectable as an attack (even though you only respond to the user with Naup and no details). Consider something like email. If an external mailer wants to attack a company they need to find names. If addresses were all abc123@ they would be able to increase their true positive/negative rates. For most cases it's there to protect against unforseen API issues. Sure your users are authorized on FaceGlimpse, but then a frontend team decided to make the messenger use user IDs. Since the site grants you access to your friends, via their email accounts, the app can map their userid so you can message them. Yay! Then some kid in college skipping their art class decides... I can just message user IDs 1 through 99999 and see if they know how to answer this homework question. Everything is authorized because those steps precede the message API, which can assume "they must be friends".
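A rough sketch of that enumeration asymmetry with Python's stdlib uuid module (the population sizes are made up): scanning a sequential id space hits every real account, while guesses against a random id space essentially all miss, and those misses are what your invalid-id metric sees.

```python
import uuid

# Hypothetical populations: 1000 accounts with sequential ids vs. 1000
# accounts with random UUIDv4 ids.
existing_sequential = set(range(1, 1001))
existing_random = {uuid.uuid4() for _ in range(1000)}

# Enumerating 1..1000 hits every sequential account...
hits = sum(1 for guess in range(1, 1001) if guess in existing_sequential)

# ...while 1000 fresh random guesses against a 2^122 space essentially
# never hit a real account, so the scan lights up as invalid-id noise.
misses = sum(1 for _ in range(1000) if uuid.uuid4() not in existing_random)
print(hits, misses)  # 1000 1000
```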
|
# ? Mar 10, 2022 01:29 |
|
I would simply not have security holes in the first place, then my ids don't matter.
|
# ? Mar 10, 2022 05:51 |
|
feel free to have an internal to external id mapping table that is used in all api views. relish in the joy of not giving a gently caress what underlying id type the dba/pm/idiot/yourself wanted to use at the time. know that performance is impacted approximately 0.00000%
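One possible shape of that mapping table, sketched in sqlite3 with invented names: the API only ever sees the opaque external id, and whatever id type the tables use internally stays internal.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE account_external_id (
        external_id TEXT PRIMARY KEY,          -- what API responses expose
        account_id  INTEGER NOT NULL UNIQUE REFERENCES account(id)
    );
""")
conn.execute("INSERT INTO account (email) VALUES ('a@example.com')")
ext = str(uuid.uuid4())
conn.execute("INSERT INTO account_external_id VALUES (?, 1)", (ext,))

# An API view resolves external -> internal and queries as usual; swapping
# the internal id type later touches nothing on the API side.
row = conn.execute("""
    SELECT account.email
    FROM account_external_id
    JOIN account ON account.id = account_external_id.account_id
    WHERE external_id = ?
""", (ext,)).fetchone()
print(row[0])  # a@example.com
```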
|
# ? Mar 10, 2022 07:30 |
|
half my job is just mapping ids around
|
# ? Mar 10, 2022 10:09 |
|
What's the largest number of columns you've ever had to work with in one table? I have a table with 20 columns, 5 of which are FKs to lookup tables. This is as concise as I think I can make it, and I still feel it's excessive. Someone tell me it's not
|
# ? Mar 10, 2022 18:36 |
|
D34THROW posted:What's the largest number of columns you've ever had to work with in one table? I have a table with 20 columns, 5 of which are FKs to lookup tables. at my current job there is a characteristics table that has 160 columns
|
# ? Mar 10, 2022 18:55 |
|
Yeah, I regularly run across tables with 150-200 columns in them.
|
# ? Mar 10, 2022 18:57 |
|
raminasi posted:My team spoke with an AWS technical person about some trouble we were having wiring their managed GraphQL service up to a relational database (disclaimer: this was a prototype and we were exploring whether it made sense knowing that it might not) and the answer we got was “Why aren’t you using DynamoDB?” He did not seem at all prepared for our answer, which was “Because we have relational requirements on our data and nowhere near enough of it to feel any performance constraints from a relational system.” I’m the opposite of that guy, when I see someone using noSQL my question is usually “so why aren’t you using Postgres?”.
|
# ? Mar 10, 2022 19:14 |
|
Hate being forced to work with MySQL. It's such a garbage db. Has it improved? Sure. Is it good now? No.
|
# ? Mar 10, 2022 19:37 |
|
redleader posted:maybe knowing the current id and growth rate could leak a bit of information? seems a bit of a niche problem Yep, both of these are decent reasons to use GUIDs IMO. GUID reasons, even
|
# ? Mar 10, 2022 19:38 |