|
Just-In-Timeberlake posted:man that product name is loving on the nose yes, it loving sucks. i did a project for this same group that took like 4x the time it should actually take. like, i had to create a separate loving server, install cron on there, make sure that the js involved with cron didn't interfere with the formatting of the sql, move all of the code into separate files, then, because you can't send more than one sql statement via their api, had to make sure that things triggered successively as they should. instead of just using a loving stored procedure. i won't even get into the lack of true indexing and the nebulous hell that is clustering on snowflake. abelwingnut fucked around with this message at 22:02 on Oct 23, 2019
# ? Oct 23, 2019 21:55 |
|
|
|
Ruggan posted:Creating a whole range of rows for every combo sounds very messy though. You could instead take your example of 1.2.3.0/24 and represent it with three columns: the ranges are variable. they range from /7 to /32. yes, it's wild. that one /7 would take up 33MM rows alone. but yea, therefore, i'd have to create a column for each octet. also not sure how leading zeroes would work in the join in that scenario? also, what do you mean by persisted calculated columns? i can definitely separate the octets, either by a function i found in there or by mathing it down. i have the equations to convert these to decimal, and can also extract that from the object that is created from PARSE_IP. and yes, creating a whole table of those ranges does sound messy...in any other system. but this platform is built for massive tables. there's one in here, this is not a table that i'm working with by the way, that's literally 228 995 504 699 rows. that table alone is 24.6TB. and this platform actually is somewhat fast at dealing with it, admittedly. abelwingnut fucked around with this message at 22:10 on Oct 23, 2019 |
# ? Oct 23, 2019 22:00 |
If you can modify the schema and get everyone to agree on it, handle the addresses as 32 bit integers and just store the first and last address of each range, and do a simple between check.
|
|
# ? Oct 23, 2019 22:18 |
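nielsm's integer-range scheme is mechanical to set up; here's a sketch in Python using the stdlib ipaddress module (the SQL in the comment is illustrative, the table/column names are made up):

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Dotted-quad IPv4 string -> 32-bit integer."""
    return int(ipaddress.ip_address(ip))

def cidr_to_range(cidr: str) -> tuple[int, int]:
    """CIDR string -> (first, last) addresses of the range as 32-bit integers."""
    net = ipaddress.ip_network(cidr, strict=False)
    return int(net.network_address), int(net.broadcast_address)

# store (first_ip, last_ip) per range; the lookup then becomes, roughly:
#   SELECT ... FROM a JOIN b ON a.ip_int BETWEEN b.first_ip AND b.last_ip
first, last = cidr_to_range("1.2.3.0/24")
```

Works for the non-byte-aligned masks too, since the math never assumes octet boundaries.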
|
My first two thoughts: Use PostgreSQL foreign data wrappers and the built-in network address type and be done with it. Roops, custom service for data, yeah that won't work. Next, use regular expressions. Netmasks are straightforward to represent as patterns. The issue here is that pattern matching is not a standard column collation, so it'll be slow. Next thought: the issue is that speed will suck because the join condition is "inside out", i.e. you're trying to find a value between two calculated values, and that'll be slow, maybe even O(mn) depending on the database. So, flip the problem around. What if you wrote a big florping case statement that checked all possible netmasks for a given ip? Then you could do a direct equality match with B. But yeah, if you're doing range checks in a database without geometry or custom types, you really want integers, coupled with some indexes optimized for greater/less-than queries. Have you tried contacting snowflake?
|
# ? Oct 23, 2019 22:21 |
|
nielsm posted:If you can modify the schema and get everyone to agree on it, handle the addresses as 32 bit integers and just store the first and last address of each range, and do a simple between check. yea, i'm fairly sure that is a no-go. PhantomOfTheCopier posted:My first two thoughts: and yea, i have to use snowflake. i have told them how awful it is, and that's just not happening. contacting snowflake is my next step, but wanted to make sure i wasn't missing something.
|
# ? Oct 23, 2019 22:24 |
|
Ruggan posted:Creating a whole range of rows for every combo sounds very messy though. You could instead take your example of 1.2.3.0/24 and represent it with three columns: That's not how IP address ranges work. The /24 at the end refers to the prefix, the 24 most significant bits of the address (so 1.2.3 in this example). The range consists of all the possible addresses formed by the remaining 8 bits, so 1.2.3.0 to 1.2.3.255. You can mask off any arbitrary number of bits you want, it doesn't have to be byte-aligned, so for example 1.2.3.0/25 would be the range 1.2.3.0 to 1.2.3.127. Given that abelwingnut mentions that all sorts of wacky masks that aren't byte-aligned are in use in his data, treating the addresses as strings with regular expressions doesn't seem feasible to me either. nielsm's idea is a reasonable alternative to just using bitmasks directly though - perhaps even better than using bitmasks. TheFluff fucked around with this message at 00:11 on Oct 24, 2019 |
# ? Oct 23, 2019 23:58 |
|
TheFluff posted:That's not how IP address ranges work. The /24 at the end refers to the prefix, the 24 most significant bits of the address (so 1.2.3 in this example). The range consists of all the possible addresses formed by the remaining 8 bits, so 1.2.3.0 to 1.2.3.255. You can mask off any arbitrary number of bits you want, it doesn't have to be byte-aligned, so for example 1.2.3.0/25 would be the range 1.2.3.0 to 1.2.3.127. Didn’t realize that - thanks for clarifying.
|
# ? Oct 24, 2019 00:59 |
|
Holy poo poo! I’ve been diving into MySQL docs to troubleshoot a data load performance issue and I found this gem: quote:
https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html Does this really indicate a performance limit on the size of a MySQL table and knock it right out of contention for any database of significant size?
|
# ? Oct 24, 2019 06:06 |
|
That’s normal: the insert performance is limited by the B-tree's O(log n) insertion time. Any DB that uses B-trees will have that as a lower bound on insertion cost.
|
# ? Oct 24, 2019 06:12 |
PhantomOfTheCopier posted:Next thought, the issue is there speed will suck because the join condition is "inside out". IE, you're trying to find a value between two calculated values, and that'll be slow, maybe even O(mn) depending on the database. So, flip the problem around. What if you wrote a big florping case statement that checked all possible netmasks for a given ip? Then you could do a direct equality match with B. This is not too bad an idea, if you're stuck with strings: On the client performing the query, generate all possible netmasks the IP could match, this should be limited to around 25 possibilities. Then query for an exact string match for each of those. E.g. if your address to search for is 15.56.204.85, then the possible netmasks would be: 15.56.204.85/32, 15.56.204.84/31, 15.56.204.84/30, 15.56.204.80/29, 15.56.204.80/28, 15.56.204.64/27, ... 15.56.204.0/22, 15.56.200.0/21, 15.56.192.0/20, ..., 15.0.0.0/8, 14.0.0.0/7 If any of those exist in the database, you have a match.
|
|
# ? Oct 24, 2019 07:12 |
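nielsm's client-side enumeration is easy to generate programmatically — a sketch, assuming IPv4 and the /7-to-/32 spread abelwingnut described:

```python
import ipaddress

def candidate_cidrs(ip: str, min_prefix: int = 7, max_prefix: int = 32) -> list[str]:
    """Every CIDR from /32 down to /7 that could contain `ip`."""
    addr = int(ipaddress.ip_address(ip))
    out = []
    for prefix in range(max_prefix, min_prefix - 1, -1):
        # mask off the host bits for this prefix length
        mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
        out.append(f"{ipaddress.ip_address(addr & mask)}/{prefix}")
    return out

cands = candidate_cidrs("15.56.204.85")
# then query table B for exact string matches against `cands`;
# any hit means the address falls in a known range
```

26 candidates per address, so the query stays a bounded IN-list of equality checks instead of an inside-out range join.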
Agrikk posted:Yeah. I’m loading 2200 CSV files [each with 1.5 million rows of data consisting of five columns with a total character width of no more than 86] into a single table with an auto-incrementing rowID and an index on three of the five columns. Just to get the obvious questions out of the way: You are doing the inserts inside a transaction, right? I'm pretty sure if you open a transaction the indexes will only be updated when you commit, instead of after every insert.
|
|
# ? Oct 24, 2019 07:15 |
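The transaction point is easy to demonstrate with sqlite3 standing in for MySQL (a toy, not Agrikk's actual loader): batch the whole file's rows into one transaction instead of committing per statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT, c TEXT)")
conn.execute("CREATE INDEX idx_t ON t (a, b)")

rows = [(i, f"user{i}", f"data{i}") for i in range(10_000)]

# One transaction for the whole batch: the journal/indexes are flushed
# once at COMMIT instead of once per INSERT.
with conn:  # opens a transaction, commits on exit
    conn.executemany("INSERT INTO t (a, b, c) VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```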
|
nielsm posted:This is not too bad an idea, if you're stuck with strings: Yepz. Assuming that the ranges in B are non-overlapping. Also remember to start with the larger ranges (/7) and work toward the smaller (/32), first match wins.
|
# ? Oct 24, 2019 14:12 |
|
yea, i like that. i'm not sure how i could possibly code that, though. would have to be a series of CASE statements, each computing the binary form, then converting it to the string. or something like that. i'll have to think it through a bit.
|
# ? Oct 24, 2019 15:07 |
|
What is everyone’s editor preference when working with Oracle (mainly writing queries for reports)? So far I’ve only been working with SQL and using SSMS or visual studio and am decidedly not impressed with the Oracle SQL Developer app.
|
# ? Oct 24, 2019 16:11 |
|
general sql question: let's say i have: code:
my guess is it has held the result of udfBlah(a.x) in memory so that it doesn't have to reevaluate, and can then move on to row 2 in tableB, evaluate it, then row 3 in tableB, and so forth until it reaches the end of the table B. at that point it then reevaluates udfBlah(a.x) for row 2, then cycles through table B again while holding the new value of udfBlah(a.x) in memory. do i have that correct, or does it reevaluate udfBlah(a.x) with each evaluation?
|
# ? Oct 24, 2019 17:22 |
|
abelwingnut posted:general sql question: So caveat that you'll want to check the query plan. Others can probably answer this better than I can but I'll try. I'm using Microsoft SQL Server, and what it looks like it does is compute the scalar value for all of tableA's rows and hold those in memory to join. It then picks the most performant join operation it can: if it's joining against a sorted key from tableB it'll sort the scalar values and merge join them, otherwise it does a hash join or something similar. I checked the query plan for self joining a huge table (100M+ rows) and it still looked like it was going to try that method, but I'm not sure if the memory would need to spill out to tempdb. Our server has a shitton of memory so maybe it could manage it, but the memory grant would likely be huge if this is what it did. It probably will also try to intelligently pipeline the data forward so that it does not need to wait for all of the scalar values to be computed in order to start the join - but that would only be the case if the optimizer was deciding to hash-match rather than sort/merge-join the data. Maybe someone else with more knowledge around this could chime in.
|
# ? Oct 24, 2019 18:07 |
|
abelwingnut posted:does it reevaluate udfBlah(a.x) with each evaluation? Star War Sex Parrot fucked around with this message at 21:18 on Oct 24, 2019 |
# ? Oct 24, 2019 21:16 |
|
From now on, I'm gonna refer to abelwingnut's database as the Snowflake Special. just thought you needed to know that also it's my Friday this week.
|
# ? Oct 24, 2019 22:57 |
|
abelwingnut posted:general sql question: So there was some mention of join types, but pretty much any db is always going to keep the results of your udfBlah(a.x) for evaluation, because anything else takes more cpu. Even at a nested loop it will evaluate that once, and then examine b.y for equality a row at a time. Merge join I'm not certain because of how it'd need to sort b.y to match the ordering of udfBlah(a.x); hash join would just make the hashes of all the udfBlah(a.x) for every a row and then join to the matching ones on b. The only place where this wouldn't hold is if udfBlah is non-deterministic, so if you've got a random or time function in there so the output can change, then you'd need to really examine what it was doing, it could either evaluate it per row coming in or per join.
|
# ? Oct 24, 2019 23:34 |
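The evaluate-once behavior described above can be mimicked with a toy hash join — the names (udf_blah, table_a, table_b) mirror the question and are hypothetical, and a call counter shows the UDF runs once per row of A, not once per comparison:

```python
calls = 0

def udf_blah(x: int) -> int:
    """Stand-in for a deterministic scalar UDF."""
    global calls
    calls += 1
    return x * 2

table_a = [{"x": i} for i in range(5)]
table_b = [{"y": i} for i in range(10)]

# Build phase: evaluate the UDF exactly once per row of A, hash the result.
build = {}
for a in table_a:
    build.setdefault(udf_blah(a["x"]), []).append(a)

# Probe phase: scan B once, matching against the precomputed keys --
# no re-evaluation of the UDF per comparison.
joined = [(a, b) for b in table_b for a in build.get(b["y"], [])]
```

A non-deterministic UDF breaks this, since the build-phase cache would no longer be valid at probe time.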
|
Does anyone here know of a good resource for translating relational schema constraints into mysql ddl statements? My professor is making zero effort on this front.
|
# ? Oct 25, 2019 00:08 |
|
Indexes on expressions

abelwingnut posted:
The "best" case is a function that always gives the same result for equivalent inputs, such as a mathematical function. To understand this, look at EXPLAIN. Different engines use different query caching mechanisms, so lookups in temporary tables and caching will differ, along with actual performance relative to the current database configuration. It also depends on table statistics. Look at these: code:
code:
(amend) Contrast with this volatile: code:
Academic interest is appreciated, but I'm forced to ask, Why should you care? Rather, is this the right question? Assuming this is a continuation of the above, I see where you're coming from, but I suspect the important pieces are hiding behind the API. You might be able to give rough runtime comparisons between queries, but you need EXPLAIN results to really see what's happening, when the table is being scanned/hashjoined/cached/&c. Nevertheless, you're on the right path. Stage One is to go through a bunch of approaches, record their runtime performance, pick the best. That gives you enough info to go to the hopefully-responsive database owners for direct assistance. PhantomOfTheCopier fucked around with this message at 11:50 on Oct 25, 2019 |
# ? Oct 25, 2019 11:29 |
|
nielsm posted:Just to get the obvious questions out of the way: You are doing the inserts inside a transaction, right? I'm pretty sure if you open a transaction the indexes will only be updated when you commit, instead of after every insert. Yes, it is a single statement to load all the lines of a file in. I’m not looping over the file inserting one at a time.
|
# ? Oct 25, 2019 14:49 |
|
Can someone check my thinking about partitioning? Given a table with [row_id][username][team][timestamp][data1][data2], with primary key on [row_id] and a composite index over [username][team][timestamp]: A good way to boost read performance for the most common type of query, which is SELECT * FROM [table] WHERE [username] = $username AND [team] = $team, would be to partition by a calculated field consisting of [username][team]. Right? The principle here is that the retrieval of data only has to go to the specific partition and search that, rather than the entire dataset. This partition scheme will do nothing to speed INSERTs if data comes in batches where every user/team combo gets an additional [timestamp] with associated data. How did I do?
|
# ? Oct 27, 2019 00:40 |
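The pruning principle in that post, as a toy sketch (pure Python, not real partition DDL): bucket rows by the (username, team) key, so a lookup only scans its own bucket rather than the whole dataset.

```python
from collections import defaultdict

# Toy "partitioned table": one bucket per (username, team) combo.
partitions = defaultdict(list)

def insert(row: dict) -> None:
    partitions[(row["username"], row["team"])].append(row)

def select(username: str, team: int) -> list:
    # Partition pruning: only the matching bucket is touched,
    # never the other partitions.
    return partitions[(username, team)]

for i in range(100):
    insert({"username": f"u{i % 10}", "team": i % 3, "ts": i})

hits = select("u1", 1)
```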
|
Quick stupid question for something I haven't done before. Let's say I have two servers running the same version of postgresql. Something happens to server A which means I can't access it properly anymore, but I need to take a dump of its database. Instead of setting up a new server to run the database from server A, could I back up the data folder from server B, copy the data folder from server A to server B and then restart server B? Will this crash spectacularly, or just give me a working server with the former contents of server A visible?
|
# ? Nov 1, 2019 11:44 |
|
if server A is not running and messing with the data while you're copying it, that should work, since it's pretty much what pg_basebackup does. all the data necessary to make the db work is in the pgdata directory. however, if there's something wrong with the data or configuration in that directory that is rendering the db inaccessible, you'll have the same problem on server B. it also has to be the same version
|
# ? Nov 1, 2019 19:13 |
|
DELETE CASCADE posted:if server A is not running and messing with the data while you're copying it, that should work, since it's pretty much what pg_basebackup does. all the data necessary to make the db work is in the pgdata directory. however, if there's something wrong with the data or configuration in that directory that is rendering the db inaccessible, you'll have the same problem on server B
|
# ? Nov 1, 2019 19:22 |
|
How do I add columns dynamically to a table based on incoming data? I'm envisioning a table that has columns [date][serviceAspend][serviceBspend]...[serviceXspend] and every week I get a report that contains a date column and some combination of serviceSpend columns that I won't know ahead of time. How can I build a script that loads the data from an incoming file and if a new service comes in on a report that hasn't been "seen" before, the new service gets added as a new column to the existing table? and data loaded?
|
# ? Nov 1, 2019 21:03 |
|
I sure hope you're running a system that can efficiently perform schema changes.
|
# ? Nov 1, 2019 21:04 |
|
does your database support json or some other type of arbitrary key-value format as a column type? if so, put your dynamic stuff in there, as one column. alternatively if the values all have the same type (like just one number or something) then you can have two columns (type, data) where type points to a lookup table, and if you need a new type just add a new row in the lookup if you do it the way you're suggesting, then you'll probably have lots of nulls in your sparse dynamic columns. and since a sql query can't refer to your new columns without knowing their names ahead of time, even if the data updates automatically, you'll still have to make manual changes all over the place to do anything with your new data
|
# ? Nov 1, 2019 21:12 |
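The one-JSON-column idea, sketched with sqlite3 (table and key names invented): a service that's never been seen before just becomes a new key in the column, not a new column in the table.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (report_date TEXT, spend TEXT)")

# Whatever services show up this week are just keys in one JSON blob;
# no ALTER TABLE needed when serviceC appears for the first time.
week1 = {"serviceA": 100.0, "serviceB": 25.5}
week2 = {"serviceA": 90.0, "serviceC": 12.0}
for day, spend in [("2019-11-01", week1), ("2019-11-08", week2)]:
    conn.execute("INSERT INTO report VALUES (?, ?)", (day, json.dumps(spend)))

rows = {d: json.loads(s)
        for d, s in conn.execute("SELECT report_date, spend FROM report")}
```

The trade-off DELETE CASCADE notes still applies: you can't index or reference the inner keys as plainly as real columns.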
Agrikk posted:How do I add columns dynamically to a table based on incoming data? Normalize that poo poo. code:
|
|
# ? Nov 1, 2019 22:36 |
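nielsm's code block didn't survive the quote, but the normalized shape is presumably along these lines (sqlite3 sketch, column names guessed): one row per (date, service) pair, with services in a lookup table, so a new service is a new row instead of a new column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service (
    service_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE spend (
    report_date TEXT NOT NULL,
    service_id INTEGER NOT NULL REFERENCES service(service_id),
    amount REAL NOT NULL,
    PRIMARY KEY (report_date, service_id)
);
""")

def load(report_date: str, spends: dict) -> None:
    # A never-before-seen service just becomes a new lookup row.
    for name, amount in spends.items():
        conn.execute("INSERT OR IGNORE INTO service (name) VALUES (?)", (name,))
        sid = conn.execute("SELECT service_id FROM service WHERE name = ?",
                           (name,)).fetchone()[0]
        conn.execute("INSERT INTO spend VALUES (?, ?, ?)",
                     (report_date, sid, amount))

load("2019-11-01", {"serviceA": 100.0})
load("2019-11-08", {"serviceA": 90.0, "serviceB": 25.5})
total = conn.execute("SELECT SUM(amount) FROM spend").fetchone()[0]
```

Reporting per service is then a GROUP BY or pivot, and no query ever has to know column names ahead of time.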
|
nielsm posted:Normalize that poo poo. This is beautiful. Thanks for this.
|
# ? Nov 1, 2019 23:02 |
|
nielsm posted:Normalize that poo poo. I know you already accepted this answer, but +1 for normalization.
|
# ? Nov 2, 2019 03:13 |
|
Nth Doctor posted:I know you already accepted this answer, but +1 for normalization. It's funny. My head is all up in performance poo poo like disk layout, filegroups, indexes and partitions and I'd completely forgotten about database design 101.
|
# ? Nov 2, 2019 22:18 |
|
I have a table of data that I'm trying to extract occurrences of features out of. A data row can have one or more features in it. Features are represented by a features_table that contains a match_mask regex and a not_match_mask regex. As the names would suggest, a feature exists if the match_mask regex matches, and the not_match_mask either does not or is null. This makes it a straightforward (for me) inner join between the data table and the feature table. The join criteria is where the regex matching happens. The edge case where this falls down is when not_match_mask and match_mask are not completely disjoint - so let's say you have one feature "foo" with not_match_mask "foobar" and another feature that matches "foobar", and a data row can hypothetically contain "foo ... foobar". This should produce one feature foo and feature foobar for that data row. That's not desirable of course but that's the reality of the feature I need to extract... "foo" is contextually different from "foobar" and there can potentially be both in a data row. Handling this programmatically seems tricky so what I was thinking is to find rows where more than one match_mask fits and then handle them manually (hopefully there won't be more than a few dozen). So like, do the inner join, then GROUP BY back to the input rows, and then select rows having count > 1 ? Am I completely barking up the wrong tree here? Is there some analytics thing that will help me out? I'm using Postgres 12. Paul MaudDib fucked around with this message at 07:13 on Nov 6, 2019 |
# ? Nov 6, 2019 06:55 |
|
Kuule hain nussivan posted:Quick stupid question for something I haven't done before. if you have a baseline snapshot and a copy of the write-ahead-log (WAL) then you can load the snapshot and play back the WAL onto it, and restore the state up to where things came off the rails. I have never done this either but that is how PG is supposed to work. Bear in mind that you have to have the entire WAL, of course... if they were ever purged or overwritten you're toast. Good luck! https://www.postgresql.org/docs/9.4/wal-configuration.html edit: quote:There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files. Each segment file is normally 16 MB (though this size can be altered when building the server). You can use this to estimate space requirements for WAL. Ordinarily, when old log segment files are no longer needed, they are recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of log output rate, there are more than 3 * checkpoint_segments + 1 segment files, the unneeded segment files will be deleted instead of recycled until the system gets back under this limit. and https://www.postgresql.org/docs/10/continuous-archiving.html https://www.opsdash.com/blog/postgresql-wal-archiving-backup.html Paul MaudDib fucked around with this message at 07:17 on Nov 6, 2019 |
# ? Nov 6, 2019 07:08 |
|
Paul MaudDib posted:I have a table of data that I'm trying to extract occurrences of features out of. A data row can have one or more features in it. Features are represented by a features_table that contains a match_mask regex and a not_match_mask regex. As the names would suggest, a feature exists if the match_mask regex matches, and the not_match_mask either does not or is null. My first suggestion would be rewrite your match_mask regexes to remove the need for the not_match_mask. So 'foo' and not 'foobar' becomes 'foo(?!bar)'. Also, what is the structure of a data_row? Is it just one huge undifferentiated varchar column? Is the data at least delimited?
|
# ? Nov 6, 2019 15:27 |
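Jethro's lookahead trick in miniature, using Python's re (Postgres's regexes support the same (?!...) negative lookahead construct):

```python
import re

# match 'foo' only when it is NOT immediately followed by 'bar'
pattern = re.compile(r"foo(?!bar)")

hit = pattern.search("data row with foo and foobar in it")
miss = pattern.search("only foobar here")
```

So the one pattern cleanly distinguishes a standalone "foo" from "foobar" even when both appear in the same data row, and the separate not_match_mask column can stay NULL.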
|
I do mostly traditional sysadmin stuff for a fintech company and work is willing to pay for some SQL classes for me. We use MS SQL server in house and are looking at Aurora MySQL, although I'm at select * from foo level, so there shouldn't be anything system specific in whatever it is that I learn. Can anyone recommend some places to look for training?
|
# ? Nov 7, 2019 02:11 |
|
Jethro posted:My first suggestion would be rewrite your match_mask regexes to remove the need for the not_match_mask. So 'foo' and not 'foobar' becomes 'foo(?!bar)'. cool, that works great, thanks! I just learned a new thing you could do with regexes. I'm still gonna keep not_match_mask in the script because it's already there and might be useful down the line, but if it's null then it's ignored anyway. Yeah, more or less freeform user text, it's not like CSS or HL7 where you could key off the delimiter. Obviously it would be nicer not to have to clean the data up myself but is there some other reason it's bad?
|
# ? Nov 7, 2019 02:15 |
|
Happiness Commando posted:I do mostly traditional sysadmin stuff for a fintech company and work is willing to pay for some SQL classes for me. We use MS SQL server in house and are looking at Aurora MySQL, although I'm at select * from foo level, so there shouldn't be anything system specific in whatever it is that I learn. For a newcomer to SQL, I think this is good to get a solid foundation: https://www.executeprogram.com/ (scroll down to the bottom, there is a SQL course)
|
# ? Nov 7, 2019 03:08 |
|
|
|
Paul MaudDib posted:if you have a baseline snapshot and a copy of the write-ahead-log (WAL) then you can load the snapshot and play back the WAL onto it, and restore the state up to where things came off the rails. I have never done this either but that is how PG is supposed to work. If the server dropped for unrelated issues, it should "just work". If there's corruption from a replay transaction, you may be able to restore via PITR (but it's not as straightforward as the intro docs suggest). If the data tables are corrupted, it could take a lot of work. People using modern defaults of streaming slave servers without reconfiguring WAL slow backups are suffering the "MySQL replication fallacy" (eg, that this is "the way replication is done"). Dedicated base backups to a separate server can often be expensive to performance compared to WAL archiving (for emergency restore purposes).
|
# ? Nov 7, 2019 08:08 |