abelwingnut
Dec 23, 2002


Just-In-Timeberlake posted:

man that product name is loving on the nose

yes, it loving sucks. i did a project for this same group that should have taken like 1/4 of the time it actually took. like, i had to create a separate loving server, install cron on there, make sure that the js involved with cron didn't interfere with the formatting of the sql, move all of the code into separate files, then, because you can't send more than one sql statement via their api, had to make sure that things triggered successively as they should.

instead of just using a loving stored procedure.

i won't even get into the lack of true indexing and the nebulous hell that is clustering on snowflake.

abelwingnut fucked around with this message at 22:02 on Oct 23, 2019

abelwingnut
Dec 23, 2002


Ruggan posted:

Creating a whole range of rows for every combo sounds very messy though. You could instead take your example of 1.2.3.0/24 and represent it with three columns:

FirstThreeOctets: 1.2.3
FourthOctetFrom: 0
FourthOctetTo: 24

Can you create persisted calculated columns in your tables that represent it in this way? Or even fully separate the octets? This way you can continue to store it the way you do today but facilitate better joins.

the ranges are variable. they range from /7 to /32. yes, it's wild. that one /7 would take up 33MM rows alone. but yea, therefore, i'd have to create a column for each octet. also not sure how leading zeroes would work in the join in that scenario?

also, what do you mean by persisted calculated columns? i can definitely separate the octets, either by a function i found in there or by mathing it down. i have the equations to convert these to decimal, and can also extract that from the object that is created from PARSE_IP.

and yes, creating a whole table of those ranges does sound messy...in any other system. but this platform is built for massive tables. there's one in here, this is not a table that i'm working with by the way, that's literally 228,995,504,699 rows. that table alone is 24.6TB. and this platform actually is somewhat fast at dealing with it, admittedly.

abelwingnut fucked around with this message at 22:10 on Oct 23, 2019

nielsm
Jun 1, 2009



If you can modify the schema and get everyone to agree on it, handle the addresses as 32 bit integers and just store the first and last address of each range, and do a simple between check.
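
A minimal sketch of that layout in generic SQL (table and column names here are invented, not from the thread): store each range's first and last address as integers, convert the lookup address the same way, and the join collapses to a BETWEEN.

code:
-- 1.2.3.0/24 stored as (16909056, 16909311), i.e. 1.2.3.0 .. 1.2.3.255
CREATE TABLE ip_ranges (
    range_start BIGINT NOT NULL,  -- first address in the block, as a 32-bit value
    range_end   BIGINT NOT NULL,  -- last address in the block
    label       VARCHAR(100)
);

SELECT e.*, r.label
FROM events e
JOIN ip_ranges r
  ON e.ip_int BETWEEN r.range_start AND r.range_end;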

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
My first two thoughts:

Use PostgreSQL foreign data wrappers and the built in network address type and be done with it. Roops custom service for data, yeah that won't work.

Next, use regular expressions. Netmasks are straightforward to represent as patterns. The issue here is that pattern matching is not a standard column collation, so it'll be slow.

Next thought, the issue is that speed will suck because the join condition is "inside out". i.e., you're trying to find a value between two calculated values, and that'll be slow, maybe even O(mn) depending on the database. So, flip the problem around. What if you wrote a big florping case statement that checked all possible netmasks for a given ip? Then you could do a direct equality match with B.

But yeah, if you're doing range checks in a database without geometry or custom types, you really want integers, coupled with some indexes optimized for greater/less-than queries.

Have you tried contacting snowflake? :cop:

abelwingnut
Dec 23, 2002


nielsm posted:

If you can modify the schema and get everyone to agree on it, handle the addresses as 32 bit integers and just store the first and last address of each range, and do a simple between check.

yea, i'm fairly sure that is a no-go.

PhantomOfTheCopier posted:

My first two thoughts:

Use PostgreSQL foreign data wrappers and the built in network address type and be done with it. Roops custom service for data, yeah that won't work.

Next, use regular expressions. Netmasks are straightforward to represent as patterns. The issue here is that pattern matching is not a standard column collation, so it'll be slow.

Next thought, the issue is that speed will suck because the join condition is "inside out". i.e., you're trying to find a value between two calculated values, and that'll be slow, maybe even O(mn) depending on the database. So, flip the problem around. What if you wrote a big florping case statement that checked all possible netmasks for a given ip? Then you could do a direct equality match with B.

But yeah, if you're doing range checks in a database without geometry or custom types, you really want integers, coupled with some indexes optimized for greater/less-than queries.

Have you tried contacting snowflake? :cop:

and yea, i have to use snowflake. i have told them how awful it is, and that's just not happening.

contacting snowflake is my next step, but wanted to make sure i wasn't missing something.

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Ruggan posted:

Creating a whole range of rows for every combo sounds very messy though. You could instead take your example of 1.2.3.0/24 and represent it with three columns:

FirstThreeOctets: 1.2.3
FourthOctetFrom: 0
FourthOctetTo: 24

Can you create persisted calculated columns in your tables that represent it in this way? Or even fully separate the octets? This way you can continue to store it the way you do today but facilitate better joins.

That's not how IP address ranges work. The /24 at the end refers to the prefix, the 24 most significant bits of the address (so 1.2.3 in this example). The range consists of all the possible addresses formed by the remaining 8 bits, so 1.2.3.0 to 1.2.3.255. You can mask off any arbitrary number of bits you want, it doesn't have to be byte-aligned, so for example 1.2.3.0/25 would be the range 1.2.3.0 to 1.2.3.127.

Given that abelwingnut mentions that all sorts of wacky masks that aren't byte-aligned are in use in his data, treating the addresses as strings with regular expressions doesn't seem feasible to me either. nielsm's idea is a reasonable alternative to just using bitmasks directly though - perhaps even better than using bitmasks.

TheFluff fucked around with this message at 00:11 on Oct 24, 2019
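
To make that arithmetic concrete, a worked example of my own (runnable in most SQL dialects): the /25 case above keeps the top 25 bits, leaving 2^7 = 128 addresses in the range.

code:
-- 1.2.3.0/25 as 32-bit integers
SELECT 1 * 16777216 + 2 * 65536 + 3 * 256 + 0        AS range_start,  -- 16909056 = 1.2.3.0
       1 * 16777216 + 2 * 65536 + 3 * 256 + 0 + 127  AS range_end;    -- 16909183 = 1.2.3.127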

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


TheFluff posted:

That's not how IP address ranges work. The /24 at the end refers to the prefix, the 24 most significant bits of the address (so 1.2.3 in this example). The range consists of all the possible addresses formed by the remaining 8 bits, so 1.2.3.0 to 1.2.3.255. You can mask off any arbitrary number of bits you want, it doesn't have to be byte-aligned, so for example 1.2.3.0/25 would be the range 1.2.3.0 to 1.2.3.127.

Given that abelwingnut mentions that all sorts of wacky masks that aren't byte-aligned are in use in his data, treating the addresses as strings with regular expressions doesn't seem feasible to me either. nielsm's idea is a reasonable alternative to just using bitmasks directly though - perhaps even better than using bitmasks.

Didn’t realize that - thanks for clarifying.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Holy poo poo! I’ve been diving into MySQL docs to troubleshoot a data load performance issue and I found this gem:

quote:


The time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions:

Connecting: (3)

Sending query to server: (2)

Parsing query: (2)

Inserting row: (1 × size of row)

Inserting indexes: (1 × number of indexes)

Closing: (1)

This does not take into consideration the initial overhead to open tables, which is done once for each concurrently running query.

The size of the table slows down the insertion of indexes by log N, assuming B-tree indexes.

https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html


Does this really indicate a performance limit on the size of a MySQL table and knock it right out of contention for any database of significant size?

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe
It's normal that insert performance would be limited by the B-tree's big-O insertion time, which is log n. Any DB that uses B-trees will have that as the lower bound on insertion performance.
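
For a rough sense of scale (my own arithmetic, not from the MySQL docs), that log N factor grows very slowly, which is why B-tree tables stay usable at very large row counts:

code:
SELECT LOG2(1000000)      AS approx_comparisons_at_1M_rows,   -- ~20
       LOG2(1000000000)   AS approx_comparisons_at_1B_rows,   -- ~30
       LOG2(228995504699) AS approx_comparisons_at_229B_rows; -- ~38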

nielsm
Jun 1, 2009



PhantomOfTheCopier posted:

Next thought, the issue is that speed will suck because the join condition is "inside out". i.e., you're trying to find a value between two calculated values, and that'll be slow, maybe even O(mn) depending on the database. So, flip the problem around. What if you wrote a big florping case statement that checked all possible netmasks for a given ip? Then you could do a direct equality match with B.

This is not too bad an idea, if you're stuck with strings:
On the client performing the query, generate all possible netmasks the IP could match, this should be limited to around 25 possibilities. Then query for an exact string match for each of those.

E.g. if your address to search for is 15.56.204.85, then the possible netmasks would be:
15.56.204.85/32, 15.56.204.84/31, 15.56.204.84/30, 15.56.204.80/29, 15.56.204.80/28, 15.56.204.64/27, ... 15.56.204.0/22, 15.56.200.0/21, 15.56.192.0/20, ..., 15.0.0.0/8, 14.0.0.0/7
If any of those exist in the database, you have a match.
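
A hedged sketch of that enumeration done in-database rather than on the client, written in Postgres syntax since the later EXPLAIN examples in this thread are Postgres (table and column names are invented, and B is assumed to store each range as a 'network/len' string):

code:
-- build every /7 .. /32 candidate for 15.56.204.85, then equality-match against B
WITH ip AS (
    SELECT (15::bigint << 24) + (56 << 16) + (204 << 8) + 85 AS addr
),
candidates AS (
    SELECT len,
           (ip.addr >> (32 - len)) << (32 - len) AS network
    FROM ip, generate_series(7, 32) AS len
)
SELECT b.*
FROM candidates c
JOIN table_b b
  ON b.cidr = ((c.network >> 24) & 255) || '.' ||
              ((c.network >> 16) & 255) || '.' ||
              ((c.network >>  8) & 255) || '.' ||
              ( c.network        & 255) || '/' || c.len;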

nielsm
Jun 1, 2009



Agrikk posted:

Yeah. I’m loading 2200 CSV files [each with 1.5 million rows of data consisting of five columns with a total character width of no more than 86] into a single table with an auto-incrementing rowID and an index on three of the five columns.

I recognize that I could drop the index and recreate it after, but that won’t help me in the long run since I’ll be adding data to this table every night and the table has to be useable as fresh data is added.

Just to get the obvious questions out of the way: You are doing the inserts inside a transaction, right? I'm pretty sure if you open a transaction the indexes will only be updated when you commit, instead of after every insert.
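
A hedged illustration of the difference in MySQL syntax (hypothetical table, trivial values): autocommitted row-by-row inserts pay the send/parse/commit overhead from the quoted list once per row, while one transaction per batch, or a single multi-row statement, amortises it.

code:
-- slow: each statement is parsed and committed on its own
INSERT INTO spend_history VALUES (1, 'a', 'b', 'c', 'd');
INSERT INTO spend_history VALUES (2, 'a', 'b', 'c', 'd');

-- better: one commit for the whole batch
START TRANSACTION;
INSERT INTO spend_history VALUES
    (1, 'a', 'b', 'c', 'd'),
    (2, 'a', 'b', 'c', 'd');
COMMIT;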

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

nielsm posted:

This is not too bad an idea, if you're stuck with strings:
On the client performing the query, generate all possible netmasks the IP could match, this should be limited to around 25 possibilities. Then query for an exact string match for each of those.

E.g. if your address to search for is 15.56.204.85, then the possible netmasks would be:
15.56.204.85/32, 15.56.204.84/31, 15.56.204.84/30, 15.56.204.80/29, 15.56.204.80/28, 15.56.204.64/27, ... 15.56.204.0/22, 15.56.200.0/21, 15.56.192.0/20, ..., 15.0.0.0/8, 14.0.0.0/7
If any of those exist in the database, you have a match.

Yepz. Assuming that the ranges in B are non-overlapping. Also remember to start with the larger ranges (/7) and work toward the smaller (/32), first match wins.

abelwingnut
Dec 23, 2002


yea, i like that. i'm not sure how i could possibly code that, though. would have to be a series of CASE statements, each computing the binary form, then converting it to the string. or something like that. i'll have to think it through a bit.

duffmensch
Feb 20, 2004

Duffman is thrusting in the direction of the problem!
What is everyone’s editor preference when working with Oracle (mainly writing queries for reports)? So far I’ve only been working with SQL and using SSMS or visual studio and am decidedly not impressed with the Oracle SQL Developer app.

abelwingnut
Dec 23, 2002


general sql question:

let's say i have:

code:
SELECT
	*
FROM
	tableA a
	JOIN tableB b ON udfBlah(a.x) = b.y
so in my mind, the way the engine works, it looks at the first row in tableA, evaluates a.x, then goes to tableB, then looks for b.y in the first row. whatever that result may be, true or false, what happens next?

my guess is it has held the result of udfBlah(a.x) in memory so that it doesn't have to reevaluate, and can then move on to row 2 in tableB, evaluate it, then row 3 in tableB, and so forth until it reaches the end of the table B. at that point it then reevaluates udfBlah(a.x) for row 2, then cycles through table B again while holding the new value of udfBlah(a.x) in memory.

do i have that correct, or does it reevaluate udfBlah(a.x) with each evaluation?

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


abelwingnut posted:

general sql question:

let's say i have:

code:
SELECT
	*
FROM
	tableA a
	JOIN tableB b ON udfBlah(a.x) = b.y
so in my mind, the way the engine works, it looks at the first row in tableA, evaluates a.x, then goes to tableB, then looks for b.y in the first row. whatever that result may be, true or false, what happens next?

my guess is it has held the result of udfBlah(a.x) in memory so that it doesn't have to reevaluate, and can then move on to row 2 in tableB, evaluate it, then row 3 in tableB, and so forth until it reaches the end of the table B. at that point it then reevaluates udfBlah(a.x) for row 2, then cycles through table B again while holding the new value of udfBlah(a.x) in memory.

do i have that correct, or does it reevaluate udfBlah(a.x) with each evaluation?

So caveat that you'll want to check the query plan. Others can probably answer this better than I can but I'll try.

I'm using Microsoft SQL Server, and what it looks like it does is compute the scalar value for all of tableA's rows and hold those in memory to join. It then picks the most performant join operation it can: if it's joining against a sorted key from tableB it'll sort the scalar values and merge join them, otherwise it does a hash join or something similar.

I checked the query plan for self joining a huge table (100M+ rows) and it still looked like it was going to try that method, but I'm not sure if the memory would need to spill out to tempdb. Our server has a shitton of memory so maybe it could manage it, but the memory grant would likely be huge if this is what it did.

It probably will also try to intelligently pipeline the data forward so that it does not need to wait for all of the scalar values to be computed in order to start the join - but that would only be the case if the optimizer was deciding to hash-match rather than sort/merge-join the data.

Maybe someone else with more knowledge around this could chime in.
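
A hedged way to check this yourself in SQL Server, reusing the hypothetical tableA/tableB/udfBlah names from the question: capture the actual plan and see whether the Compute Scalar feeds a Hash Match or a Sort plus Merge Join.

code:
SET STATISTICS XML ON;   -- or toggle "Include Actual Execution Plan" in SSMS

SELECT *
FROM tableA a
JOIN tableB b ON dbo.udfBlah(a.x) = b.y;

SET STATISTICS XML OFF;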

Star War Sex Parrot
Oct 2, 2003

abelwingnut posted:

does it reevaluate udfBlah(a.x) with each evaluation?
It depends on the system and the UDF. SQL Server is smart enough to identify inlineable scalar UDFs under certain criteria (see the requirements section).

Star War Sex Parrot fucked around with this message at 21:18 on Oct 24, 2019
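
For reference, a hedged example of the kind of simple, deterministic, data-access-free scalar UDF that feature targets; whether a given build actually inlines it depends on the requirements list mentioned above.

code:
CREATE FUNCTION dbo.udfBlah (@x int)
RETURNS int
AS
BEGIN
    -- pure arithmetic: no data access, no GETDATE(), no side effects
    RETURN (@x * @x) % 500;
END;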

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
From now on, I'm gonna refer to abelwingnut's database as the Snowflake Special.



:yayclod: just thought you needed to know that also it's my Friday this week.

Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


abelwingnut posted:

general sql question:

let's say i have:

code:
SELECT
	*
FROM
	tableA a
	JOIN tableB b ON udfBlah(a.x) = b.y
so in my mind, the way the engine works, it looks at the first row in tableA, evaluates a.x, then goes to tableB, then looks for b.y in the first row. whatever that result may be, true or false, what happens next?

my guess is it has held the result of udfBlah(a.x) in memory so that it doesn't have to reevaluate, and can then move on to row 2 in tableB, evaluate it, then row 3 in tableB, and so forth until it reaches the end of the table B. at that point it then reevaluates udfBlah(a.x) for row 2, then cycles through table B again while holding the new value of udfBlah(a.x) in memory.

do i have that correct, or does it reevaluate udfBlah(a.x) with each evaluation?

So there was some mention of join types, but pretty much any db is always going to keep the results of your udfBlah(a.x) for evaluation, because anything else takes more cpu. Even at a nested loop it will evaluate that once, and then examine b.y for equality a row at a time. Merge join I'm not certain about, because of how it'd need to sort b.y to match the ordering of udfBlah(a.x); hash join would just make the hashes of all the udfBlah(a.x) for every a row and then join to the matching ones on b.

The only place where this wouldn't hold is if udfBlah is non-deterministic: if you've got a random or time function in there so the output can change, then you'd need to really examine what it's doing, since it could evaluate it either per incoming row or per join.

Cyber Sandwich
Nov 16, 2011

Now, Digital!
Does anyone here know of a good resource for translating relational schema constraints into mysql ddl statements? My professor is making zero effort on this front.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
:viggo: Indexes on expressions

abelwingnut posted:

code:
SELECT
	*
FROM
	tableA a
	JOIN tableB b ON udfBlah(a.x) = b.y
I admit I'd have to look, but no I don't think you have quite the right picture. The user defined function may/not be volatile as Moonwolf mentions, but it may/not also be deterministic relative to the rows in the database, and may/not be able to modify the database. A query planner optimizes differently in each case.

The "best" case is a function that always gives the same result for equivalent inputs, such as a mathematical function. To understand this, look at EXPLAIN. Different engines use different query caching mechanisms, so lookups in temporary tables and caching will differ, along with actual performance relative to the current database configuration.

It also depends on table statistics. Look at these:
code:
explain select * from movies m join viewing v on ((m.id*m.id)%500=v.movieid);
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Hash Join  (cost=192.83..665.69 rows=7051 width=82)
   Hash Cond: ((((m.id * m.id))::integer % 500) = v.movieid)
   ->  Seq Scan on movies m  (cost=0.00..49.09 rows=2109 width=76)
   ->  Hash  (cost=103.48..103.48 rows=7148 width=6)
         ->  Seq Scan on viewing v  (cost=0.00..103.48 rows=7148 width=6)
(5 rows)
Versus
code:
explain select * from accounts ac join transactions tr on ((ac.id*ac.id)%10000=tr.id);
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Hash Join  (cost=5.77..575.09 rows=123 width=69)
   Hash Cond: (tr.id = (((ac.id * ac.id))::integer % 10000))
   ->  Seq Scan on transactions tr  (cost=0.00..424.91 rows=19091 width=48)
   ->  Hash  (cost=4.23..4.23 rows=123 width=21)
         ->  Seq Scan on accounts ac  (cost=0.00..4.23 rows=123 width=21)
(5 rows)
Note that the planner is smart enough to decide which table to use for the hash construction. Sadly, we don't get to see all the branches in the planner; it's subtle because it's the "least estimated cost" that's winning here, not the "smallest number of rows", though working memory limits may impact that decision in other circumstances. The plan also reveals "how" the results are built, for example in this latter case: A full scan on accounts is performed to construct the hash. Next (caveats) there's a full (non-indexed) scan on transactions. Finally there's a result filter that performs the hash join.

(amend) Contrast with this volatile:
code:
explain select * from accounts ac join transactions tr on ((floor(random()*ac.id*ac.id)::int)%10000=tr.id);
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..82616.20 rows=123 width=69)
   Join Filter: (((floor(((random() * (ac.id)::double precision) * (ac.id)::double precision)))::integer % 10000) = tr.id)
   ->  Seq Scan on transactions tr  (cost=0.00..424.91 rows=19091 width=48)
   ->  Materialize  (cost=0.00..4.85 rows=123 width=21)
         ->  Seq Scan on accounts ac  (cost=0.00..4.23 rows=123 width=21)
(5 rows)
(end amend)


Academic interest is appreciated, but I'm forced to ask, Why should you care? Rather, is this the right question? Assuming this is a continuation of the above, I see where you're coming from, but I suspect the important pieces are hiding behind the API. You might be able to give rough runtime comparisons between queries, but you need EXPLAIN results to really see what's happening, when the table is being scanned/hashjoined/cached/&c.

Nevertheless, you're on the right path. Stage One is to go through a bunch of approaches, record their runtime performance, pick the best. That gives you enough info to go to the hopefully-responsive database owners for direct assistance.

PhantomOfTheCopier fucked around with this message at 11:50 on Oct 25, 2019
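
On the "indexes on expressions" note in the title: a minimal Postgres sketch using the thread's hypothetical names (Postgres only accepts this if udfBlah is declared IMMUTABLE), so the join expression is precomputed and indexed instead of evaluated per row.

code:
CREATE INDEX idx_tablea_udfblah ON tableA (udfBlah(x));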

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

nielsm posted:

Just to get the obvious questions out of the way: You are doing the inserts inside a transaction, right? I'm pretty sure if you open a transaction the indexes will only be updated when you commit, instead of after every insert.

Yes, it is a single statement to load all the lines of a file in. I’m not looping over the file inserting one at a time.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Can someone check my thinking about partitioning?

Given a table with

[row_id][username][teamnumber][timestamp][data1][data2]

With primary key on [row_id] and a composite index over [username][teamnumber][timestamp]:

A good way to boost read performance for the most common type of query which is

SELECT * FROM [table]
WHERE [username] = $username
AND [teamnumber] = $teamnumber

would be to partition by a calculated field consisting of [username][teamnumber]. Right?

The principle here is that the retrieval of data only has to go to the specific partition and search that, rather than the entire dataset.

This partition scheme will do nothing to speed INSERTS if data comes in batches where every user/teamnumber combo gets an additional [timestamp] with associated data.


How did I do?
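
For what it's worth, a hedged sketch of what that scheme looks like as MySQL DDL (column types are guesses). One caveat baked in: MySQL requires every unique key, including the primary key, to contain all partitioning columns, so row_id can't remain the sole primary key.

code:
CREATE TABLE team_data (
    row_id     BIGINT NOT NULL AUTO_INCREMENT,
    username   VARCHAR(64) NOT NULL,
    teamnumber INT NOT NULL,
    ts         DATETIME NOT NULL,
    data1      INT,
    data2      INT,
    PRIMARY KEY (row_id, username, teamnumber),        -- partition columns must be in the PK
    KEY idx_user_team_ts (username, teamnumber, ts)
)
PARTITION BY KEY (username, teamnumber)
PARTITIONS 32;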

Kuule hain nussivan
Nov 27, 2008

Quick stupid question for something I haven't done before.

Let's say I have two servers running the same version of postgresql. Something happens to server A which means I can't access it properly anymore, but I need to take a dump of its database. Instead of setting up a new server to run the database from server A, could I back up the data folder from server B, copy the data folder from server A to server B and then restart server B? Will this crash spectacularly, or just give me a working server with the former contents of server A visible?

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
if server A is not running and messing with the data while you're copying it, that should work, since it's pretty much what pg_basebackup does. all the data necessary to make the db work is in the pgdata directory. however, if there's something wrong with the data or configuration in that directory that is rendering the db inaccessible, you'll have the same problem on server B

it also has to be the same version

Kuule hain nussivan
Nov 27, 2008

DELETE CASCADE posted:

if server A is not running and messing with the data while you're copying it, that should work, since it's pretty much what pg_basebackup does. all the data necessary to make the db work is in the pgdata directory. however, if there's something wrong with the data or configuration in that directory that is rendering the db inaccessible, you'll have the same problem on server B

it also has to be the same version
My use case should be okay then. To be exact, Server A is fine, but I need to access a copy of the database from a few days ago. We have a backup of the drive, so this means I can just grab a copy of pgdata instead of setting up a whole new server just to run the backup.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
How do I add columns dynamically to a table based on incoming data?

I'm envisioning a table that has columns [date][serviceAspend][serviceBspend]...[serviceXspend]

and every week I get a report that contains a date column and some combination of serviceSpend columns that I won't know ahead of time.

How can I build a script that loads the data from an incoming file and, if a new service comes in on a report that hasn't been "seen" before, adds the new service as a new column to the existing table before loading the data?

Star War Sex Parrot
Oct 2, 2003

I sure hope you're running a system that can efficiently perform schema changes.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
does your database support json or some other type of arbitrary key-value format as a column type? if so, put your dynamic stuff in there, as one column. alternatively if the values all have the same type (like just one number or something) then you can have two columns (type, data) where type points to a lookup table, and if you need a new type just add a new row in the lookup

if you do it the way you're suggesting, then you'll probably have lots of nulls in your sparse dynamic columns. and since a sql query can't refer to your new columns without knowing their names ahead of time, even if the data updates automatically, you'll still have to make manual changes all over the place to do anything with your new data
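
A hedged sketch of the JSON-column variant in MySQL (names hypothetical): the week's spends live in one column, so a never-before-seen service is just a new key rather than a schema change.

code:
CREATE TABLE weekly_report (
    report_date      DATE PRIMARY KEY,
    spend_by_service JSON NOT NULL
);

INSERT INTO weekly_report VALUES
    ('2019-11-04', '{"serviceA": 120.50, "serviceX": 3.99}');

-- pull one service's spend back out of the blob
SELECT report_date,
       JSON_EXTRACT(spend_by_service, '$.serviceA') AS serviceA_spend
FROM weekly_report;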

nielsm
Jun 1, 2009



Agrikk posted:

How do I add columns dynamically to a table based on incoming data?

I'm envisioning a table that has columns [date][serviceAspend][serviceBspend]...[serviceXspend]

and every week I get a report that contains a date column and some combination of serviceSpend columns that I won't know ahead of time.

How can I build a script that loads the data from an incoming file and, if a new service comes in on a report that hasn't been "seen" before, adds the new service as a new column to the existing table before loading the data?

Normalize that poo poo.

code:
Reports:
  ReportID (PK)
  ReportDate
  ReportReceivedTimestamp

ReportableServices:
  ServiceID (PK)
  ServiceName

ReportedSpends:
  ReportID (FK to Reports)
  ServiceID (FK to ReportableServices)
  SpendAmount

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

nielsm posted:

Normalize that poo poo.

code:
Reports:
  ReportID (PK)
  ReportDate
  ReportReceivedTimestamp

ReportableServices:
  ServiceID (PK)
  ServiceName

ReportedSpends:
  ReportID (FK to Reports)
  ServiceID (FK to ReportableServices)
  SpendAmount

This is beautiful. Thanks for this.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


nielsm posted:

Normalize that poo poo.

code:
Reports:
  ReportID (PK)
  ReportDate
  ReportReceivedTimestamp

ReportableServices:
  ServiceID (PK)
  ServiceName

ReportedSpends:
  ReportID (FK to Reports)
  ServiceID (FK to ReportableServices)
  SpendAmount

I know you already accepted this answer, but +1 for normalization.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Nth Doctor posted:

I know you already accepted this answer, but +1 for normalization.

It's funny. My head is all up in performance poo poo like disk layout, filegroups, indexes and partitions and I'd completely forgotten about database design 101.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
I have a table of data that I'm trying to extract occurrences of features out of. A data row can have one or more features in it. Features are represented by a features_table that contains a match_mask regex and a not_match_mask regex. As the names would suggest, a feature exists if the match_mask regex matches, and the not_match_mask either does not or is null.

This makes it a straightforward (for me) inner join between the data table and the feature table. The join criteria is where the regex matching happens.

The edge case where this falls down is when not_match_mask and match_mask are not completely disjoint - so let's say you have one feature "foo" with not_match_mask "foobar" and another feature that matches "foobar", and a data row can hypothetically contain "foo ... foobar". This should produce one feature foo and feature foobar for that data row.

That's not desirable of course but that's the reality of the feature I need to extract... "foo" is contextually different from "foobar" and there can potentially be both in a data row.

Handling this programmatically seems tricky so what I was thinking is to find rows where more than one match_mask fits and then handle them manually (hopefully there won't be more than a few dozen). So like, do the inner join, then GROUP BY back to the input rows, and then select rows having count > 1 ?

Am I completely barking up the wrong tree here? Is there some analytics thing that will help me out? I'm using Postgres 12.

Paul MaudDib fucked around with this message at 07:13 on Nov 6, 2019
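
A hedged Postgres sketch of that "flag the multi-match rows" pass, using my guess at the schema described above (~ and !~ are Postgres's regex match / no-match operators):

code:
SELECT d.id, count(*) AS feature_hits
FROM data_table d
JOIN features_table f
  ON  d.body ~ f.match_mask
  AND (f.not_match_mask IS NULL OR d.body !~ f.not_match_mask)
GROUP BY d.id
HAVING count(*) > 1;   -- rows matching more than one feature, for manual review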

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Kuule hain nussivan posted:

Quick stupid question for something I haven't done before.

Let's say I have two servers running the same version of postgresql. Something happens to server A which means I can't access it properly anymore, but I need to take a dump of its database. Instead of setting up a new server to run the database from server A, could I back up the data folder from server B, copy the data folder from server A to server B and then restart server B? Will this crash spectacularly, or just give me a working server with the former contents of server A visible?

if you have a baseline snapshot and a copy of the write-ahead-log (WAL) then you can load the snapshot and play back the WAL onto it, and restore the state up to where things came off the rails. I have never done this either but that is how PG is supposed to work.

Bear in mind that you have to have the entire WAL, of course... if they were ever purged or overwritten you're toast.

Good luck!

https://www.postgresql.org/docs/9.4/wal-configuration.html

edit:

quote:

There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files. Each segment file is normally 16 MB (though this size can be altered when building the server). You can use this to estimate space requirements for WAL. Ordinarily, when old log segment files are no longer needed, they are recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of log output rate, there are more than 3 * checkpoint_segments + 1 segment files, the unneeded segment files will be deleted instead of recycled until the system gets back under this limit.

and https://www.postgresql.org/docs/10/continuous-archiving.html https://www.opsdash.com/blog/postgresql-wal-archiving-backup.html

Paul MaudDib fucked around with this message at 07:17 on Nov 6, 2019

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Paul MaudDib posted:

I have a table of data that I'm trying to extract occurrences of features out of. A data row can have one or more features in it. Features are represented by a features_table that contains a match_mask regex and a not_match_mask regex. As the names would suggest, a feature exists if the match_mask regex matches, and the not_match_mask either does not or is null.

This makes it a straightforward (for me) inner join between the data table and the feature table. The join criteria is where the regex matching happens.

The edge case where this falls down is when not_match_mask and match_mask are not completely disjoint - so let's say you have one feature "foo" with not_match_mask "foobar" and another feature that matches "foobar", and a data row can hypothetically contain "foo ... foobar". This should produce one feature foo and feature foobar for that data row.

That's not desirable of course but that's the reality of the feature I need to extract... "foo" is contextually different from "foobar" and there can potentially be both in a data row.

Handling this programmatically seems tricky so what I was thinking is to find rows where more than one match_mask fits and then handle them manually (hopefully there won't be more than a few dozen). So like, do the inner join, then GROUP BY back to the input rows, and then select rows having count > 1 ?

Am I completely barking up the wrong tree here? Is there some analytics thing that will help me out? I'm using Postgres 12.

My first suggestion would be rewrite your match_mask regexes to remove the need for the not_match_mask. So 'foo' and not 'foobar' becomes 'foo(?!bar)'.

Also, what is the structure of a data_row? Is it just one huge undifferentiated varchar column? Is the data at least delimited?
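
Quick sanity check that Postgres's regex flavour accepts the negative lookahead (it does, as part of its "advanced regular expressions"):

code:
SELECT 'foobar'  ~ 'foo(?!bar)' AS should_be_false,
       'foo baz' ~ 'foo(?!bar)' AS should_be_true;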

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

I do mostly traditional sysadmin stuff for a fintech company and work is willing to pay for some SQL classes for me. We use MS SQL server in house and are looking at Aurora MySQL, although I'm at select * from foo level, so there shouldn't be anything system specific in whatever it is that I learn.

Can anyone recommend some places to look for training?

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Jethro posted:

My first suggestion would be rewrite your match_mask regexes to remove the need for the not_match_mask. So 'foo' and not 'foobar' becomes 'foo(?!bar)'.

Also, what is the structure of a data_row? Is it just one huge undifferentiated varchar column? Is the data at least delimited?

cool, that works great, thanks! I just learned a new thing you could do with regexes.

I'm still gonna keep not_match_mask in the script because it's already there and might be useful down the line, but if it's null then it's ignored anyway.

Yeah, more or less freeform user text, it's not like CSV or HL7 where you could key off the delimiter. Obviously it would be nicer not to have to clean the data up myself but is there some other reason it's bad?

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

Happiness Commando posted:

I do mostly traditional sysadmin stuff for a fintech company and work is willing to pay for some SQL classes for me. We use MS SQL server in house and are looking at Aurora MySQL, although I'm at select * from foo level, so there shouldn't be anything system specific in whatever it is that I learn.

Can anyone recommend some places to look for training?

For a newcomer to SQL, I think this is good to get a solid foundation: https://www.executeprogram.com/
(scroll down to the bottom, there is a SQL course)

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Paul MaudDib posted:

if you have a baseline snapshot and a copy of the write-ahead-log (WAL) then you can load the snapshot and play back the WAL onto it, and restore the state up to where things came off the rails. I have never done this either but that is how PG is supposed to work.
Yes, it does work. Back before pg_basebackup we'd rsync the entire data directory, drop all connections, (flush), do a final rsync, then reopen. Recovery works directly since the WAL contains any committed replay entries.

If the server dropped for unrelated issues, it should "just work". If there's corruption from a replay transaction, you may be able to restore via PITR (but it's not as straightforward as the intro docs suggest). If the data tables are corrupted, it could take a lot of work.

People using modern defaults of streaming slave servers without reconfiguring WAL slow backups are suffering the "MySQL replication fallacy" (eg, that this is "the way replication is done"). Dedicated base backups to a separate server can often be expensive to performance compared to WAL archiving (for emergency restore purposes).
