Sedro
Dec 31, 2008
If you were using postgres, a unique partial index would make your job really easy
SQL code:
create table books ( id serial, title text, allow_duplicates boolean );
create unique index on books ( title ) where not allow_duplicates;

insert into books ( title, allow_duplicates ) values ( 'a', true );
insert into books ( title, allow_duplicates ) values ( 'a', true );

insert into books ( title, allow_duplicates ) values ( 'b', false );
insert into books ( title, allow_duplicates ) values ( 'b', false ); -- error
Assuming you don't insert the same title with different values of allow_duplicates.

McGlockenshire
Dec 16, 2005

GOLLOCKS!

EAT THE EGGS RICOLA posted:

I have a postgres database that I need to replicate across a dozen servers.

Except half of the servers are airgapped from the other, no direct connections permitted. I can automate a file transfer but direct db connections will never be permitted.

Uhh... any suggestions?

If you're using the last few 9.x releases, look into pg_basebackup and pg_receivexlog and friends.

http://www.safaribooksonline.com/library/view/postgresql-9-high/9781849516969/ch06s07.html posted:

With this new utility, PostgreSQL has the ability to transmit transaction logs to a remote system without the need for a dedicated PostgreSQL server. This also means that we can avoid ad hoc tools such as rsync when maintaining an archive server to save old WAL files.

This allows us to set up any server to pull transaction logs directly from the primary PostgreSQL server. For highly available servers, PostgreSQL no longer needs to fork an external command to safeguard transaction logs into an archive location.
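
Roughly what the receiving side looks like, assuming PostgreSQL 9.2+ and a role allowed to replicate (the host, user, and paths here are placeholders, not anything from the thread):
code:
pg_basebackup -h primary.example.com -U replicator -D /srv/pg/base_copy
pg_receivexlog -h primary.example.com -U replicator -D /srv/pg/wal_stream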

fletcher
Jun 27, 2003

ken park is my favorite movie

Cybernetic Crumb

EAT THE EGGS RICOLA posted:

I have a postgres database that I need to replicate across a dozen servers.

Except half of the servers are airgapped from the other, no direct connections permitted. I can automate a file transfer but direct db connections will never be permitted.

Uhh... any suggestions?

Use a VPN or tunnel over SSH. Definitely don't open up the database port to the world!

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

EAT THE EGGS RICOLA posted:

I have a postgres database that I need to replicate across a dozen servers.

Except half of the servers are airgapped from the other, no direct connections permitted. I can automate a file transfer but direct db connections will never be permitted.

Uhh... any suggestions?

As the other posts have indicated, it depends entirely on what you're permitted to do. I read this as "nothing but whole file transfers", which means you use standard, version eight WAL shipping. That would affect your sync rate, of course.
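
If it really is whole-file transfers only, the classic archive_command setup is roughly this (a sketch assuming a 9.x primary; the paths and the copy command are placeholders for whatever moves files across the gap):
code:
# primary's postgresql.conf
wal_level = archive
archive_mode = on
archive_command = 'cp %p /mnt/outbound/%f'

# standby's recovery.conf
standby_mode = on
restore_command = 'cp /mnt/inbound/%f "%p"'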

Safe and Secure!
Jun 14, 2008

OFFICIAL SA THREAD RUINER
SPRING 2013

Sedro posted:

If you were using postgres, a unique partial index would make your job really easy
SQL code:
create table books ( id serial, title text, allow_duplicates boolean );
create unique index on books ( title ) where not allow_duplicates;

insert into books ( title, allow_duplicates ) values ( 'a', true );
insert into books ( title, allow_duplicates ) values ( 'a', true );

insert into books ( title, allow_duplicates ) values ( 'b', false );
insert into books ( title, allow_duplicates ) values ( 'b', false ); -- error
Assuming you don't insert the same title with different values of allow_duplicates.

Yeah. Postgres would be nice.

If we were actually using MariaDB 10.0 (but still InnoDB), I'm guessing that wouldn't make this any easier, since I can't find anything on unique partial indices in MariaDB, either.

Other ideas I had:
1. SELECT ... FOR UPDATE with a WHERE clause crafted to scan and lock every row in my whole table of almost 2 million records* (actually, writes only make up about 2% of our observed usage, so this isn't necessarily the worst thing ever, but it's spiky and I'm not sure how many of those writes hit this particular table...)

2. GET_LOCK in a stored procedure... but this is unsafe for replication.

3. Allow these transactions to make dirty reads via READ UNCOMMITTED, check that the record exists, then optimistically insert the record assuming that a similar transaction isn't occurring at the same time, then do another select on the title and rollback the transaction if multiple records exist with the title. But this opens us up to a different race condition, where our pre- and post-checking see the after-effects of a delete on a duplicate title and insert the new record, only for the delete to be rolled back, resulting in us still having undesired duplicate titles.

* It would be neat if I could just do a SELECT ... FOR UPDATE on a range of title strings that only includes the title I'm interested in, plus one title "greater" and one title "lesser", or something similar to that, to prevent reads on the gap between checking and inserting a new record. Any idea if I can get something close to this? Or do I misunderstand how gap locks work in InnoDB?
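
A rough sketch of that locking-read idea, assuming InnoDB at REPEATABLE READ and an index on title (table and column names are illustrative). Note that two sessions attempting this at once can deadlock on the gap lock, so the application still has to be prepared to retry:
SQL code:
start transaction;

-- with an index on title, this locking read takes next-key/gap locks around 'someTitle',
-- blocking concurrent inserts of that title until this transaction ends
select id from books
 where title = 'someTitle' and allow_duplicates = 0
 for update;

-- if the select returned nothing, it is now safe to insert
insert into books (title, allow_duplicates) values ('someTitle', 0);
commit;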

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Safe and Secure! posted:

So let's say I have Book records, where each record has

Id (int), Title (String), DuplicateTitleAllowed (boolean).

...

The problem I'm trying to prevent is a race condition where two transactions may concurrently SELECT * WHERE Title='someTitle' AND DuplicateTitleAllowed=false, see that such a record does not exist, then insert a new Book record, each with Title='someTitle' and DuplicateTitleAllowed=false.

Why not rename it to IsUnique and leave it NULL in the overlap case? A standard unique index permits multiple NULLs, no?

Edit: Sorry, that doesn't quite do it... seems like you'll need a coverage map of sorts, then.
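
For what it's worth, a sketch of that nullable-flag idea in MariaDB/InnoDB (names illustrative); as noted, it still leaves the overlap case open:
SQL code:
create table books (
    id        int auto_increment primary key,
    title     varchar(150) not null,
    is_unique tinyint null,                    -- 1 = title must be unique, NULL = duplicates allowed
    unique key uq_title_unique (title, is_unique)
);

insert into books (title, is_unique) values ('a', null), ('a', null); -- ok, NULLs never collide
insert into books (title, is_unique) values ('b', 1);
insert into books (title, is_unique) values ('b', 1);                 -- error: duplicate key
insert into books (title, is_unique) values ('b', null);              -- still allowed: the gap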

PhantomOfTheCopier fucked around with this message at 04:18 on Aug 28, 2014

Minus Pants
Jul 18, 2004
I'm working on a mostly key/value storage app in which ALL data access must be logged. I've set up a trigger-populated audit table that tracks data modifications (user, time, before/after), but I'm stumped on the best way to log selects. I'm more of a developer than a DBA. Database is Postgres.

I could write a function that captures and logs all select queries, but that won't scale with lots of users getting large result sets. Only logging result row keys won't work because I need to log the data returned as it was in the query, not its present, potentially modified-since-then state ("show me all the data Bob accessed from time X to Y"). My latest thought is to log the result row keys and, when queried for the data, "rollback" a copy of the table using the audit table (plpython probably). Am I crazy? Is there a better way to do this? Audit queries would be very rare, so performance isn't a major concern.

McGlockenshire
Dec 16, 2005

GOLLOCKS!
How many unique queries will you end up having? Would it be realistic to put all the auditable selects behind stored procedures?
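
As an illustration, a minimal PL/pgSQL sketch of one such wrapper; the items and select_audit tables and the function name are made up:
SQL code:
-- audited lookup: record who asked for what, then return the rows
create or replace function audited_get_item(p_key text)
returns setof items
language plpgsql
security definer
as $$
begin
    insert into select_audit (accessed_by, accessed_at, item_key)
    values (current_user, now(), p_key);
    return query select * from items where key = p_key;
end;
$$;

-- callers get execute on the function instead of select on the table:
-- grant execute on function audited_get_item(text) to app_role;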

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Minus Pants posted:

I'm working on a mostly key/value storage app in which ALL data access must be logged. I've set up a trigger-populated audit table that tracks data modifications (user, time, before/after), but I'm stumped on the best way to log selects. I'm more of a developer than a DBA. Database is Postgres.

I could write a function that captures and logs all select queries, but that won't scale with lots of users getting large result sets. Only logging result row keys won't work because I need to log the data returned as it was in the query, not its present, potentially modified-since-then state ("show me all the data Bob accessed from time X to Y"). My latest thought is to log the result row keys and, when queried for the data, "rollback" a copy of the table using the audit table (plpython probably). Am I crazy? Is there a better way to do this? Audit queries would be very rare, so performance isn't a major concern.
The answer depends distinctly on the actual requirements, which might be somewhat obscured by your interpretation of them based on a few of your comments here. You should start by reading the documentation, of course: http://www.postgresql.org/docs/9.1/static/runtime-config-logging.html The options you seek are the CSV log and log_statement=all for starters.
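
For reference, those settings look roughly like this in postgresql.conf (a sketch, not a tuned configuration):
code:
logging_collector = on
log_destination = 'csvlog'
log_statement = 'all'        # logs every statement; expect a lot of output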

You're not crazy to think that you shouldn't log all the results and should instead leverage your audit entries to construct the result sets when needed, but with a perfectly capable database sitting in front of you, you really ought not be using something external like plpython; you're likely to get it wrong. You'd be better off developing the proper stored rollback procedures now so that code can track your table structure: If anyone deems it necessary to add columns at some future date, they will be required to ensure the procedures for the triggers are correct, along with the rollback procedures. Don't be confusing matters by adding some external code that can desync from your database. That will only lead to utterly useless (i.e., corrupt) audit tables.

What do you do when someone comes in and says they don't know Python? Make them learn Python? Okay, good, go read the 2000 pages of PostgreSQL documentation. It's a great read. :pseudo:

Minus Pants
Jul 18, 2004

McGlockenshire posted:

How many unique queries will you end up having? Would it be realistic to put all the auditable selects behind stored procedures?

Yes, that would be doable. There would only be about a dozen selects.

PhantomOfTheCopier posted:

The answer depends distinctly on the actual requirements, which might be somewhat obscured by your interpretation of them based on a few of your comments here. You should start by reading the documentation, of course: http://www.postgresql.org/docs/9.1/static/runtime-config-logging.html The options you seek are the CSV log and log_statement=all for starters.

log_statement is very useful, but logging the results of selects (the data itself or references to it) is the part I'm struggling with.

PhantomOfTheCopier posted:

You're not crazy to think that you shouldn't log all the results and should instead leverage your audit entries to construct the result sets when needed, but with a perfectly capable database sitting in front of you, you really ought not be using something external like plpython; you're likely to get it wrong. You'd be better off developing the proper stored rollback procedures now so that code can track your table structure: If anyone deems it necessary to add columns at some future date, they will be required to ensure the procedures for the triggers are correct, along with the rollback procedures. Don't be confusing matters by adding some external code that can desync from your database. That will only lead to utterly useless (i.e., corrupt) audit tables.

Thanks, this post gave me an epiphany (use the db, stupid!). I'll make some coffee and crack open the pg manual :)

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Minus Pants posted:

log_statement is very useful, but logging the results of selects (the data itself or references to it) is the part I'm struggling with.
You keep coming back to this, but I caution you to review your actual goals. Ask very cautiously if this is the requirement, or merely your interpretation of it.

If you must have an instantly-accessible log of all requests to the database and the results from it, then, indeed, you need to create an entire architecture to restrict the database access methods to prevent users from executing anything that isn't expressly permitted. You'll want a thorough understanding of triggers and procedures, as well as roles and grants. In such a system, you'll want to keep as much as possible within the database so it survives backup and replication and cannot be easily circumvented by external software bugs.

If, on the other hand, you need the ability to retrieve some result sets after the fact, or otherwise intend to perform batch audits at some future point, you should consider another solution. Your audit tables would permit you to roll backward and forward again, but as you have noted, that takes a little bit of work. Depending on how far into history you need to reach, a single audit could be fast or slow, and you have to perform every query again after the data has been reconstructed. Presumably the trade-off is to reduce the storage size at the expense of audit speed.

If you're running batch audits, you might be better off coupling the query log with WAL shipping to an audit server. Determine a nominal period for a basebackup and ensure that you have all WAL segments during the intervening period. Operating on a separate system, you can then configure the PG server to start from the basebackup and replay to any point in time. You could couple this with audit replay to roll forward, and so forth.
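
The replay side of that is roughly a recovery.conf like this on the audit server (9.x; the path and target time are placeholders):
code:
restore_command = 'cp /mnt/wal_archive/%f "%p"'
recovery_target_time = '2014-09-01 12:00:00'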

Yeah, it sounds like you need to stay in the research phase a bit longer, then the design phase a bit more after that. Leave the agile python out of it or you're going to end up with users running their own queries with no one the wiser.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

This is another 'works in regex, trying to figure out how to work it in SQL' question (MSSQL 2008).

I've got these quasi variable SKU numbers that have information in them I need to extract. They're constantly changing and being added to, so I need to be able to do this in as 'real time' as possible (e.g. running a query once to store the values I need in another table isn't a good solution). They're structured like:
code:
XXX12345YYZZZ
^  1   ^2^3 ^
Sorry for the crappy diagram. What this breaks out as is:
1. A variable number of letters, followed by a variable number of digits; let's call it the 'base' SKU.
2. A finish code for the product, two letters.
3. A color code for the product, three letters.

The idea being, I want to be able to select the 'base' sku and do a GROUP BY so I can see all color/finish variations rolled up to a parent.

Easy peasy, right? Just do SELECT LEFT(SKU, LEN(SKU)-5). However, there is a problem: segment 2 (the finish code) is not always present, which means the -5 method will sometimes truncate the base SKU too far, and I get unrelated products in the GROUP BY. In RegEx this is very straightforward:
code:
([A-Z]+[0-9]+)([A-Z]+)
And $1 gets you the base sku and $2 gets you the finish/color codes.

Is there a way to construct a SELECT query that can grab the 'base' SKU accurately, hopefully without CTEs or installing code on a production server? The only distinctions I can rely on are:
- The first part ('base sku') will always be a range of 5-10 letters followed by 3-5 numbers.
- The finish code is always two letters, the color code is always three letters
- The color code is always present

I realize I can build a WHERE clause that will tell me which ones have finish codes or not (by cobbling together %A-Z0-9% etc.), but I can't actually SELECT that value.

Sedro
Dec 31, 2008
If I'm understanding your requirements, you can extract characters [0,n] where n is the index of the last numeric character.

You can use patindex to find the first occurrence of a number. Reverse the string to find the last occurrence. Then take the substring to extract that first segment. Something like this
SQL code:
select substring('XXX12345YYZZZ', 0, len('XXX12345YYZZZ') - patindex('%[1234567890]%', reverse('XXX12345YYZZZ')) + 2);

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Sedro posted:

If I'm understanding your requirements, you can extract characters [0,n] where n is the index of the last numeric character.

You can use patindex to find the first occurrence of a number. Reverse the string to find the last occurrence. Then take the substring to extract that first segment. Something like this
SQL code:
select substring('XXX12345YYZZZ', 0, len('XXX12345YYZZZ') - patindex('%[1234567890]%', reverse('XXX12345YYZZZ')) + 2);

Well gat dang that is clever. I was on the patindex to find the first position but couldn't think of anything to do with it; it's the reverse that's the sneaky part here. Works exactly as stated above, all I had to do was replace the string literals with my column name. Thank you kindly pard.

EDIT-ARRGH the fuckers don't even follow their own rules. Sometimes characters come after the digits, that aren't finish or color codes. FML.

Scaramouche fucked around with this message at 00:07 on Sep 3, 2014

Demonachizer
Aug 7, 2004
This might not be the right place but just wondering if there is a way to determine a hashing algorithm used for a user table field in a DB if you have access to the table. Like I have the username, password and salt and can change things to whatever I want. I know this isn't really a SQL specific question since the hashing isn't done by SQL but the client program but I figure people who work with SQL all the time are more likely to know this stuff. The software was coded in VB6 also.

An example:

username something
password Awful1!!
Salt icRpbUhwhgL9h5G
Hash 95D69CBBEC9B818642F2CB1BF232DAB7

If this is a dumb place to ask this, any guidance to a better spot would be appreciated.

Demonachizer fucked around with this message at 15:25 on Sep 3, 2014

Quebec Bagnet
Apr 28, 2009

mess with the honk
you get the bonk
Lipstick Apathy

Demonachizer posted:

This might not be the right place but just wondering if there is a way to determine a hashing algorithm used for a user table field in a DB if you have access to the table. Like I have the username, password and salt and can change things to whatever I want. I know this isn't really a SQL specific question since the hashing isn't done by SQL but the client program but I figure people who work with SQL all the time are more likely to know this stuff. The software was coded in VB6 also.

An example:

username something
password Awful1!!
Salt icRpbUhwhgL9h5G
Hash 95D69CBBEC9B818642F2CB1BF232DAB7

If this is a dumb place to ask this, any guidance to a better spot would be appreciated.

Given that there's no information about the hash in the DB, you have to brute force it. Try some common hash algorithms with some combinations of password and salt and user name until you can match the hash in the database. You know the length of the hashed value so you only need to try 128-bit algorithms like MD5. Given that the software was developed in VB6, it's either old (or designed by lazy programmers if new) so it probably isn't a complicated scheme.
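
Since the values are already sitting in a database, you can try the obvious combinations right there; a sketch in MySQL using the example values above (MD5 is only one guess among the 128-bit options, and none of these is claimed to be the actual scheme):
SQL code:
select md5(concat('Awful1!!', 'icRpbUhwhgL9h5G'))              as pass_salt,
       md5(concat('icRpbUhwhgL9h5G', 'Awful1!!'))              as salt_pass,
       md5(concat('something', 'Awful1!!', 'icRpbUhwhgL9h5G')) as user_pass_salt;
-- compare each result (case-insensitively) against 95D69CBBEC9B818642F2CB1BF232DAB7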

For future reference, there's a general programming questions thread.

Chill Callahan
Nov 14, 2012

Demonachizer posted:

This might not be the right place but just wondering if there is a way to determine a hashing algorithm used for a user table field in a DB if you have access to the table. Like I have the username, password and salt and can change things to whatever I want. I know this isn't really a SQL specific question since the hashing isn't done by SQL but the client program but I figure people who work with SQL all the time are more likely to know this stuff. The software was coded in VB6 also.

An example:

username something
password Awful1!!
Salt icRpbUhwhgL9h5G
Hash 95D69CBBEC9B818642F2CB1BF232DAB7

If this is a dumb place to ask this, any guidance to a better spot would be appreciated.

Nothing in here: http://www.insidepro.com/hashes.php, but it isn't truly comprehensive. If I were you I would try every permutation of username/password/salt with 128-bit hashing algorithms combined with toLower/toUpper on everything.

Nam Taf
Jun 25, 2005

I am Fat Man, hear me roar!

Architectural question time!

I have a whole bunch of machines that generate event recorder data. I'm looking at using Infobright or another columnar DB to store this, as it's highly (~300x) compressible. Most reads will be aggregate/sequential reads from segments defined by a datetime, which I use as the key. Multiple machines come together to describe 1 asset, so evaluations will have to sample a time period from multiple machines.

How am I best structuring the tables? I'd first leap to having 1 table per machine but the idea of having table name as a variable in queries strikes me as not best practice.

How does a real DBA get around this design problem?

Fuck them
Jan 21, 2011

and their bullshit
:yotj:
I've got a working around bad architecture problem.

Basically, I have to look up cases by "[year]-[two letter 'code']-[morenumbers]". There is no table to match the code against, nothing in the schema at all to support these two-letter codes, etc. The UI just has three text boxes that the user types "2004", "CG", and "butts" into; it concatenates them into "2004-CG-butts" and searches from there, using a LIKE for partial matches.

I was wondering if there's any non lovely way to look at the table and get the unique two letter 'codes', but it looks like I'd have to do a drat table scan.

I'm also not sure how to do the actual string manipulation in T-SQL before I just SELECT DISTINCT.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Does anyone know how to strip special characters out of a text file for importing into a database?

I have a list (1.5 million rows) of user names and a bunch of them include non-printable ASCII characters (characters 0 - 31) that cause importing to a Postgres database to fail. This list is coming from a SQL Server table export (where they load in just fine, strangely enough) into a text file.

FWIW: These special characters can be found throughout the name, not just at the beginning.

Anyone have a technique they use?

This is a windows-based environment.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

gently caress them posted:

I've got a working around bad architecture problem.

Basically, I have to look up cases by "[year]-[two letter 'code']-[morenumbers]". There is no table to match the code against, nothing in the schema at all to support these two-letter codes, etc. The UI just has three text boxes that the user types "2004", "CG", and "butts" into; it concatenates them into "2004-CG-butts" and searches from there, using a LIKE for partial matches.

I was wondering if there's any non lovely way to look at the table and get the unique two letter 'codes', but it looks like I'd have to do a drat table scan.

I'm also not sure how to do the actual string manipulation in T-SQL before I just SELECT DISTINCT.

Is this list of two letter codes static or is it getting updated?

I'm not sure of your load and read/write patterns, but maybe you can create an initial table that stores all of your two letter codes based on a SELECT DISTINCT, and then as new records come in on other tables you could perform an upsert into the two letter codes table?
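
A sketch of that upsert in T-SQL (MERGE is available from SQL Server 2008 on; the table and column names are made up, and position 6 assumes the "YYYY-CC-..." layout):
SQL code:
merge CaseCodes as target
using (select distinct substring(CaseNumber, 6, 2) as Code from Cases) as src
   on target.Code = src.Code
when not matched then
    insert (Code) values (src.Code);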

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Nam Taf posted:

Architectural question time!

I have a whole bunch of machines that generate event recorder data. I'm looking at using Infobright or another columnar DB to store this, as it's highly (~300x) compressible. Most reads will be aggregate/sequential reads from segments defined by a datetime, which I use as the key. Multiple machines come together to describe 1 asset, so evaluations will have to sample a time period from multiple machines.

How am I best structuring the tables? I'd first leap to having 1 table per machine but the idea of having table name as a variable in queries strikes me as not best practice.

How does a real DBA get around this design problem?

How about a single table with a primary key on TIMESTAMP, MACHINE_NAME?

I do the same thing to aggregate web logs from multiple web servers into a single table.

edit: Oh wait, you said columnar database. No idea if PKs work the same way on columns instead of rows.
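
A sketch of the single-table layout in plain SQL (names made up; Infobright in particular may not accept the key declaration, so treat it as a statement of intent):
SQL code:
create table machine_events (
    recorded_at  timestamp   not null,
    machine_name varchar(64) not null,
    reading      double,
    primary key (recorded_at, machine_name)
);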

Agrikk fucked around with this message at 19:41 on Sep 4, 2014

Fuck them
Jan 21, 2011

and their bullshit
:yotj:

Agrikk posted:

Is this list of two letter codes static or is it getting updated?

I'm not sure of your load and read/write patterns, but maybe you can create an initial table that stores all of your two letter codes based on a SELECT DISTINCT, and then as new records come in on other tables you could perform an upsert into the two letter codes table?

I believe it's static, but it's also a lot of stuff. It's every case that still has a physical copy somewhere within the court system. So, not a small number. Thankfully, all new stuff is "e-filed". My state sucks but is actually on the ball with regards to this!

How do I select distinct from a sub string? If the server could optimize that that would be great. I know how to do this in C#/whatever but not in SQL.

If new stuff was going on, I'd make a code table, but the thing is I don't know what the codes mean, and neither does anyone else. A few are known, but new ones were just made up ad-hoc. :smith:

Sedro
Dec 31, 2008

gently caress them posted:

How do I select distinct from a sub string? If the server could optimize that that would be great. I know how to do this in C#/whatever but not in SQL.
select distinct substring(NameOfYourColumn, 6, 2) from ...

If you need the query to perform well (ie. you will re-run the query often because the set of distinct codes change) then you need to use a functional index.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

gently caress them posted:

I believe it's static, but it's also a lot of stuff. It's every case that still has a physical copy somewhere within the court system. So, not a small number. Thankfully, all new stuff is "e-filed". My state sucks but is actually on the ball with regards to this!

How do I select distinct from a sub string? If the server could optimize that that would be great. I know how to do this in C#/whatever but not in SQL.

If new stuff was going on, I'd make a code table, but the thing is I don't know what the codes mean, and neither does anyone else. A few are known, but new ones were just made up ad-hoc. :smith:

Use SELECT DISTINCT SUBSTRING(CaseNumber, X, 2) FROM Tablename

Where X is the position in the string of the first letter of the two-letter code.


e;f;b

Fuck them
Jan 21, 2011

and their bullshit
:yotj:

Sedro posted:

select distinct substring(NameOfYourColumn, 6, 2) from ...

If you need the query to perform well (ie. you will re-run the query often because the set of distinct codes change) then you need to use a functional index.

On that note, an ongoing project is actually getting the tables indexed at all. Bit of a horror, but meh. Many of the indexed tables indexed the wrong column.

Since I have the time, how would I set up a function based index?

Sedro
Dec 31, 2008

gently caress them posted:

On that note, an ongoing project is actually getting the tables indexed at all. Bit of a horror, but meh. Many of the indexed tables indexed the wrong column.

Since I have the time, how would I set up a function based index?
In MSSQL it looks like you create an index over a "computed column"
SQL code:
alter table NameOfYourTable add Code as substring(NameOfYourColumn, 6, 2);
create index IX_NameOfYourTable_Code on NameOfYourTable (Code);

select distinct Code from NameOfYourTable; -- should use the index
In theory you could perform a loose index scan if you already had an index over the original column.

Nam Taf
Jun 25, 2005

I am Fat Man, hear me roar!

Agrikk posted:

How about a single table with a primary key on TIMESTAMP, MACHINE_NAME?

I do the same thing to aggregate web logs from multiple web servers into a single table.

edit: Oh wait, you said columnar database. No idea if PKs work the same way on columns instead of rows.

I could make it one table but then I'd have hundreds of machines in one giant table and that also seems not the best.
The problem is that machines aren't permanently grouped to assets - a group of machines comes together to be an asset but then can dissolve and move to different assets later on. Machines A, B and C may be asset 1 and D, E and F asset 2 one day, but then A, D and F may be 1 and B, C and E 2 the next.

Fuck them
Jan 21, 2011

and their bullshit
:yotj:
1.8m rows were scanned in 0.4 seconds. I take it 1.8m rows isn't that much for a production SQL server? I still have no sense of what "big" is, or isn't.

Also, there is an index on that column, apparently. If new data were coming in I'd build a (proper, useful) code table and add the column.

Databases are just a drat rabbit hole.

Xae
Jan 19, 2005

gently caress them posted:

1.8m rows were scanned in 0.4 seconds. I take it 1.8m rows isn't that much for a production SQL server? I still have no sense of what "big" is, or isn't.

Also, there is an index on that column, apparently. If new data were coming in I'd build a (proper, useful) code table and add the column.

Databases are just a drat rabbit hole.

1.8 Million rows is a decent size, but not that big.

Billions is "big" these days.



Ask me about having to table copy 10TB of Db2 tables because of poo poo design. Actually, never mind. I'm going to drink until I loving forget how terrible our Architects are.

:suicide:

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Nam Taf posted:

I could make it one table but then I'd have hundreds of machines in one giant table and that also seems not the best.
The problem is that machines aren't permanently grouped to assets - a group of machines comes together to be an asset but then can dissolve and move to different assets later on. Machines A, B and C may be asset 1 and D, E and F asset 2 one day, but then A, D and F may be 1 and B, C and E 2 the next.

What's wrong with hundreds of machine IDs in a table? If the logs have the same format, dump them all into a single table. It certainly beats joining across a hundred tables...

I have a pool of web servers in a VPC that get spun up and destroyed all the time by auto-scaling events that match demand. Each of these instances pushes web logs to a central S3 bucket that is polled by a process that snorts the logs and collects the machine names. The pool ranges from 6-20 web servers at any given moment and my log table contains over five hundred machine IDs.

Nam Taf
Jun 25, 2005

I am Fat Man, hear me roar!

Fair enough. I guess it just doesn't feel 'right' to have a database with 13 million rows (150 machines) a day for 1 table.

I mean it would work, but so would calling SQL in Python with table name as a variable input.

I parsed 24 million rows (1 machine's last year of data) in today and Infobright compressed the 1.3GB csv input to 4.1MB. That was pretty sexy. It had 0.00 second query times for aggregate functions too. Columnstore databases are amazing.

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Nam Taf posted:

Fair enough. I guess it just doesn't feel 'right' to have a database with 13 million rows (150 machines) a day for 1 table.

I mean it would work, but so would calling SQL in Python with table name as a variable input.
You're going to have the 13 million rows per day no matter what. Putting them all in the same table is definitely the way to go. Especially if your queries are always (or even sometimes) going to want the data from all the machines at once. Hardcoding 150 unions in your queries or looping through all your tables and then compiling the results in your app is wrong and backwards. Especially since my general understanding is that this is the sort of thing that columnar DBs are made for. That being said, it might make sense to move stuff to archive or summary tables every so often.
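
A sketch of the summary-table idea in generic SQL (names made up; run it on whatever schedule the reporting needs):
SQL code:
create table machine_events_daily as
select machine_name,
       date(recorded_at) as day,
       count(*)          as samples,
       avg(reading)      as avg_reading
from machine_events
group by machine_name, date(recorded_at);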

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Nam Taf posted:

Fair enough. I guess it just doesn't feel 'right' to have a database with 13 million rows (150 machines) a day for 1 table.

Welcome to Big Data. Around these parts in my company it's quite normal to have tables of billions and billions of rows increasing by 50-100 million rows a day.

...and yeah. ^^^ Summarize, summarize, summarize.

Gatac
Apr 22, 2008

Fifty Cent's next biopic.
Hey guys, got a newbie question regarding PL/SQL. Suppose I've got a procedure that's got a fairly complex SELECT to query the data I want to work on, and I'd like to store the results for the runtime of the procedure so I can then run further queries on that. I've done some googling about temporary tables and such, and supposedly on an MS system building temporary tables at runtime is no big deal, but Oracle doesn't support it. The consensus pretty much seems to be "It's not supposed to work that way, don't do it", followed by (what look to me like) horribly kludge-y ways to get it done anyway by explicitly creating, filling and then dropping a table.

I could just have the explicit query in there multiple times, but that seems inelegant when I really just need a way to say "Okay, remember this poo poo we did at the beginning? We're working with that again." Or wrap the whole thing up in a FOR loop so I can refer back to that outer query inside the loop however I want. But it feels like there's probably a better solution than what a newbie with Google can figure out. But even if there isn't, at this stage I'm mostly just genuinely curious: why isn't it done like this? Is this supposed to be one of those cases where the language wants to force you to clean up your thoughts (and program flow) to the point where you no longer need temp tables to get it done? Or am I just looking in the wrong place?

var1ety
Jul 26, 2004

Gatac posted:

Hey guys, got a newbie question regarding PL/SQL. Suppose I've got a procedure that's got a fairly complex SELECT to query the data I want to work on, and I'd like to store the results for the runtime of the procedure so I can then run further queries on that. I've done some googling about temporary tables and such, and supposedly on an MS system building temporary tables at runtime is no big deal, but Oracle doesn't support it. The consensus pretty much seems to be "It's not supposed to work that way, don't do it", followed by (what look to me like) horribly kludge-y ways to get it done anyway by explicitly creating, filling and then dropping a table.

I could just have the explicit query in there multiple times, but that seems inelegant when I really just need a way to say "Okay, remember this poo poo we did at the beginning? We're working with that again." Or wrap the whole thing up in a FOR loop so I can refer back to that outer query inside the loop however I want. But it feels like there's probably a better solution than what a newbie with Google can figure out. But even if there isn't, at this stage I'm mostly just genuinely curious: why isn't it done like this? Is this supposed to be one of those cases where the language wants to force you to clean up your thoughts (and program flow) to the point where you no longer need temp tables to get it done? Or am I just looking in the wrong place?

You can do this with a GLOBAL TEMPORARY TABLE. Depending on your result set size and default sampling level, you might need to hint CARDINALITY or OPTIMIZER_DYNAMIC_SAMPLING to avoid a bad plan (block size is used in the optimizer if you don't have stats, iirc).
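
To make that concrete, a sketch of a global temporary table (created once as a schema object, not at runtime; the column list is illustrative):
SQL code:
create global temporary table tmp_results (
    id  number,
    val varchar2(100)
) on commit preserve rows;  -- rows are private to each session and kept until the session ends

-- inside the procedure: run the complex query once into tmp_results,
-- then run the follow-up queries against tmp_results instead of repeating it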

Cast_No_Shadow
Jun 8, 2010

The Republic of Luna Equestria is a huge, socially progressive nation, notable for its punitive income tax rates. Its compassionate, cynical population of 714m are ruled with an iron fist by the dictatorship government, which ensures that no-one outside the party gets too rich.

I posted in this thread a while ago about being a dev that's ended up as a bit of a weird DBA/dev cross. Anyway, can anyone recommend any certs/training, both from a "this is actually useful" and a "this is bullshit but HR departments love it" angle? I don't know how long I will stay doing this vs. going back to pure dev work, but I'm enjoying it a lot at the moment, and if I can learn more or get some handy bits of paper on the company's dime I certainly wouldn't say no.

Gatac
Apr 22, 2008

Fifty Cent's next biopic.

var1ety posted:

You can do this with a GLOBAL TEMPORARY TABLE. Depending on your result set size and default sampling level, you might need to hint CARDINALITY or OPTIMIZER_DYNAMIC_SAMPLING to avoid a bad plan (block size is used in the optimizer if you don't have stats, iirc).

Hmm, okay. Different question, is it possible to construct an UPDATE that doesn't actually change a field? I know it's possible to just have it write back what's already in there, but is there a way to avoid any change at all? Basically, I'd like to update a record, but certain fields only need to be changed under certain circumstances. (Using CASE statements when assigning the values.) I could just externalize those into separate UPDATEs wrapped in IF blocks, of course, but is it possible to centralize this into one neat UPDATE?
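
The single-statement version usually ends up looking like this (a sketch with made-up names; each CASE just writes the column's current value back when its condition doesn't hold):
SQL code:
update orders
   set status     = case when status = 'NEW' then 'OPEN' else status     end,
       updated_by = case when status = 'NEW' then p_user else updated_by end
 where order_id = p_order_id;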

Anaxandrides
Jun 7, 2012

Every summer day in the desert is like a vacation in hell.

Gatac posted:

Hmm, okay. Different question, is it possible to construct an UPDATE that doesn't actually change a field? I know it's possible to just have it write back what's already in there, but is there a way to avoid any change at all? Basically, I'd like to update a record, but certain fields only need to be changed under certain circumstances. (Using CASE statements when assigning the values.) I could just externalize those into separate UPDATEs wrapped in IF blocks, of course, but is it possible to centralize this into one neat UPDATE?

I can say that this is 100% possible to write in MSSQL. It's possible to write many, many things that you should not write, though, and this is one of them. You're much better off having multiple UPDATE statements with differing WHERE clauses than having one massive UPDATE that selectively updates based on the value in the row -- isn't that what WHERE clauses are for?

SLOSifl
Aug 10, 2002


Anaxandrides posted:

I can say that this is 100% possible to write in MSSQL. It's possible to write many, many things that you should not write, though, and this is one of them. You're much better off having multiple UPDATE statements with differing WHERE clauses than having one massive UPDATE that selectively updates based on the value in the row -- isn't that what WHERE clauses are for?
Agreed. The complexity of the logic and the wide-net WHERE clause, if any, will be an indexing nightmare. You'll get better performance out of tight WHERE clauses and a few column updates.

What is the concern about updating columns that aren't changing? The only case where that could theoretically matter is if you update rows that don't need any updates at all, but even that can be offset by an efficient WHERE clause. Just update all the columns if any of them have changed...or do you have triggers on the table that care if certain columns were touched?

code:
UPDATE A SET Column1 = Val1, Column2 = Val2            --update everything
FROM TableA A WHERE Column1 <> Val1 OR Column2 <> Val2 --if anything changed
is probably the best option under most normal circumstances. I use the UPDATE X FROM Table X syntax under the assumption that you are joining on another table somewhere. There is no penalty to this pattern but it is easier to extend later.

SLOSifl fucked around with this message at 13:12 on Sep 9, 2014

  • Reply