Kuule hain nussivan
Nov 27, 2008

grate deceiver posted:

Basically for every one of those X's I need to keep track of a bunch of numbers, a separate spreadsheet file, and some .pdf attachments. There's 12 (probably 24) months and 200 people, doing it in Excel is starting to get annoying as hell.
Oh, no question about that. What I meant is, can you pivot the monthly data into rows and then import that into Access? Mind you, that'd probably be pretty trivial to do in VBA as well.

grate deceiver
Jul 10, 2009

Just a funny av. Not a redtext or an own ok.

Kuule hain nussivan posted:

Oh, no question about that. What I meant is, can you pivot the monthly data into rows and then import that into Access? Mind you, that'd probably be pretty trivial to do in VBA as well.

Wait, you mean like transpose and make employee Ids into columns? That sounds even worse to manage...

Kuule hain nussivan
Nov 27, 2008

grate deceiver posted:

Wait, you mean like transpose and make employee Ids into columns? That sounds even worse to manage...

Oh, no. You keep the employee IDs as a key, and then turn the month columns into single rows. For example, if you had Bob working in January and December, the pivoted table would have two rows with the ID Bob.

Employee | Month
Bob | January
Bob | December
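For the record, the unpivot itself is just a stack of per-month selects. A minimal generic-SQL sketch, assuming a hypothetical wide table called WideSchedule with one column per month (repeat the UNION ALL branch for each remaining month):
code:
-- WideSchedule and its column names are made up for illustration
SELECT Employee, 'January' AS WorkMonth, January AS Work
  FROM WideSchedule
UNION ALL
SELECT Employee, 'December' AS WorkMonth, December AS Work
  FROM WideSchedule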

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Yeah I'd do something like table Schedule:
code:
ID  Name  Month  Work
=====================
5   Bob   07     1
11  Mary  03     0
The boolean Work column means you can flip a person to not working without having to delete a row if that status changes, along the lines of what Kuule hain nussivan suggests.

Normalization would suggest not even having the Name in the table, instead having another table called Employee with
code:
ID  Name  DOB  DateHired
=====================
etc etc.

Then to see if someone is working it's as easy as
code:
SELECT e.Name,s.Month
FROM Schedule s INNER JOIN Employee e ON s.ID=e.ID
WHERE Work=1
ORDER BY s.Month
A bit simplistic, but you mentioned extra numeric data, and at least some of that probably belongs in a separate employee table. The PDFs are problematic, as BLOBs always are; you can keep them in the database or store URLs pointing to them on a site/network share.

grate deceiver
Jul 10, 2009

Just a funny av. Not a redtext or an own ok.
Ok, it seems I had a vaguely similar idea to you two, except the table I made had only ID / Employee_ID / Month fields, so I only had records where there was work. Doing it like Scaramouche said, with the additional Work field, is probably smarter. Also found out it's super easy to still display the whole thing as a grid by using a crosstab query. Cool, thanks, dudes!
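(For the curious, the crosstab ends up looking roughly like this in Access SQL, reusing Scaramouche's example Schedule and Employee tables; the exact aggregate to use depends on your data.)
code:
TRANSFORM First(s.Work)
SELECT e.Name
FROM Schedule AS s INNER JOIN Employee AS e ON s.ID = e.ID
GROUP BY e.Name
PIVOT s.Month;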

kiwid
Sep 30, 2013

Is there a way to add a "dynamic" column to a SELECT query based on another column?

For example, I have a simple table with a field called TicketNumber and a bunch of other fields that don't matter for the example.

The TicketNumber field has data like this:

L32342
M3232
N2132
L2132
A3232
L33265

The first letter of the ticket represents the branch it belongs to. No other field in the table relates to the branch; there is no branch_id or anything else.

I want to build a query that selects the Branch name and some other fields. Can I dynamically select a Branch field that looks at the first letter of the ticket number and sort of does a lookup in a key-value array or something?

I can do this in the software but I'd rather do it in the query.

edit: It's MySQL.

edit2: There is no branches table to join with which is why I'm resorting to this.

kiwid fucked around with this message at 16:15 on Jul 17, 2017

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
code:
SELECT T.TicketNumber,
       B.*
  FROM Ticket T
        LEFT OUTER JOIN Branch B ON B.BranchNum = SUBSTRING(T.TicketNumber,1,1)

quote:

edit2: There is no branches table to join with which is why I'm resorting to this.
In that case you could probably fake it with a CASE statement, but you'd have to hard-code it all into the query, in which case you aren't gaining anything, so either create the Branch table or don't bother.
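Roughly like this, to show the shape (the branch names here are invented for illustration):
code:
SELECT T.TicketNumber,
       CASE SUBSTRING(T.TicketNumber, 1, 1)
            WHEN 'L' THEN 'London'       -- made-up branch names
            WHEN 'M' THEN 'Manchester'
            WHEN 'N' THEN 'Newcastle'
            ELSE 'Unknown'
       END AS Branch
  FROM Ticket T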

E: I suppose you might be able to create a function, but you're still hard-coding stuff.

Jethro fucked around with this message at 16:19 on Jul 17, 2017

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Jethro posted:

code:
SELECT T.TicketNumber,
       B.*
  FROM Ticket T
        LEFT OUTER JOIN Branch B ON B.BranchNum = SUBSTRING(T.TicketNumber,1,1)
In that case you could probably fake it with a CASE statement, but you'd have to hard-code it all into the query, in which case you aren't gaining anything, so either create the Branch table or don't bother.

E: I suppose you might be able to create a function, but you're still hard-coding stuff.

Or you could load the letter -> branch name mapping into a temp table and join on the first character of the ticket, like Jethro posted.
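Something along these lines (MySQL syntax, branch names made up for illustration):
code:
CREATE TEMPORARY TABLE BranchMap (
    Letter     CHAR(1) PRIMARY KEY,
    BranchName VARCHAR(50)
);
INSERT INTO BranchMap VALUES ('L', 'London'), ('M', 'Manchester'), ('N', 'Newcastle');

SELECT T.TicketNumber, B.BranchName
  FROM Ticket T
       LEFT OUTER JOIN BranchMap B ON B.Letter = SUBSTRING(T.TicketNumber, 1, 1);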

kiwid
Sep 30, 2013

Thanks, I ended up just saying gently caress it and created a branch table. Overkill for this application but oh well.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
It sure would be nice if that field were split into ticketbranch and ticketnumber.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



I've got a bug about a stored procedure running on MSSQL that's erroring out because of data truncation, but only in production (where nobody has permissions to run a debugger) and only when run as a stored procedure. Meaning that if I just copy the interesting part out of the sp_helptext output and run it in an SSMS window, it runs and gives me a result set and two warnings about nulls being eliminated, probably by a SUM. Anyone have any ideas about what I could try to replicate the data truncation error outside of the exec black box? I'm guessing the warnings are a clue, but I didn't think they got upgraded to errors inside a sproc. Then again, I'm kind of mid-level in my SQL lore, so I could easily be very wrong.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Is the truncation happening on a parameter or a function variable? I know in the old days (pre-2012) it would just invisibly truncate varchars that were larger than their column definition; not sure if they ever got around to fixing that.

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Half of the problem is that I don't know where it's actually happening. The only opportunity I found for string truncation is one field of a temp table that shouldn't have had any data added to it longer than the 50 it's declared with, but it was at least possible, so I updated that and made changes to default nulls to zero in the SUMs that were being run. The line number in the error I get from exec puts it in the select list between SELECT and FROM where the temp table is getting filled initially, but it has changed back and forth between two lines (somehow?!) between runs while I've been farting around trying to reproduce it in an SSMS buffer, which is... confusing.

Dunno if those updates fixed anything and won't until we push the changes to prod or I can restore a full backup and try it, but that might take long enough that I'd have the restore run overnight.

plasticbugs
Dec 13, 2006

Special Batman and Robin
I have a production DB that is completely out of hand. My nice little web app with about 2,000 active users has been hit by spammers to the tune of 13M rows in my Heroku Postgres DB.

They're using some kind of random text generator and then putting in links to pharmaceuticals and poo poo. Somewhere in those 13M+ rows is the actual user data. I don't want to kill my website, and any query that does any kind of pattern matching (even when I use LIMIT) is taking far too long to come back with anything. Is there any way to select and delete thousands (millions) of rows at a time without going nuclear on my DB? I guess I could just look for any post that includes "<a href=" since my site strips links out by default anyway.

Any suggestions?
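One batched-delete shape for that idea - purely a sketch, assuming a hypothetical posts table with id and body columns (real names will differ) - is to run something like this repeatedly until it deletes zero rows:
code:
-- delete at most 10,000 matching rows per pass to keep each transaction small
DELETE FROM posts
 WHERE id IN (SELECT id
                FROM posts
               WHERE body LIKE '%<a href=%'
               LIMIT 10000);
Each pass is still a sequential scan, since a plain index can't help a LIKE '%...%' pattern, but at least it avoids one enormous transaction.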

Here's two pulled at random. The URLs and the things they link also seem incredibly random, so I'm not sure what kind of query will be most effective.

rear end in a top hat Spammer posted:

I'd like to order some foreign currency <a href="http://pacatuba.ce.gov.br/kamagra-now-review.pdf#hearth">kamagra shop erfahrungen</a> As the government's short-term borrowing costs have jumped,borrowing costs for companies, banks and Wall Street haveclimbed in the past 72 hours, signaling traders' worries theWashington impasse over the debt ceiling might drag on muchlonger than they had expected just a week ago.
<a href="http://www.nathan-outlaw.com/donde-comprar-cytotec-en-farmacias.pdf">donde puedo comprar pastillas cytotec en argentina</a> "How these people could have turned him and poisoned his mind is dreadful," the newspaper quoted her as saying. "He was an innocent, naive and simple man. I suppose he must have been an ideal candidate.
<a href="http://maltwhiskytrail.com/gabapentin-300-mg-3-times-a-day-side-effects.pdf">generic gabapentin side effects</a> The recession in Portugal and Spain was still affectingdemand for oil products, which fell 6 percent in the quarter,and the company said it is also expecting lower sales towholesale clients this quarter from a year earlier.
<a href="http://maltwhiskytrail.com/synthetic-thyroid-hormone-levothyroxine-synthroid-levothroid-levolet.pdf">levothyroxine 25 mcg tab side effects</a> Not enough shut-eye seems to trigger the nervous system to release high levels of "fight-or-flight" stress hormones that raise blood pressure, heart rate and blood sugar. Short sleepers also tend to be heavier, which can take a toll on the heart.

rear end in a top hat Spammer posted:

I'm at Liverpool University <a href="http://www.templatelite.com/buying-promethazine-online.pdf#flying">phenergan cream la thuoc gi</a> Stack the States is a big favorite for our daily commute in the car. My daughter reads out the questions and we race each other to see who can answer them first. Helps teach state capitals, shapes, geographic locations, flags and more.
<a href="http://safeer-e-aza.com/index.php/tadacip-cipla-review.pdf#poke">erfahrungen mit tadacip forum</a> Fosina says Clara turned to him and asked him if he thought Mariano was surprised. €œI told her, €˜When he sees the bill, he€™ll be stunned.€™ She had charged all of the relatives€™ flights on Mo€™s credit card,€ Fosina says. €œI think there was like, eight people there who weren€™t part of the family.€
<a href="http://maltwhiskytrail.com/cialis-20mg-tab-lilly.pdf">online generic cialis canada
</a> What those customers realize is that baskets aren't interchangeable commodities. Zoetis Pharmaceuticals, a recent spin-off from Pfizer, is the world's largest maker of animal medicines and vaccines (including the popular canine arthritis drug Rimadyl). The company was facing a nagging, expensive problem. It has pumps that feed medicine into a machine that fills ampules, and the pumps kept locking up. The pumps are small--10 inches long--but expensive. When one breaks, repairs cost $4,000, and the only place to get the work done is from a company in Germany. The lead time is two months.
<a href="http://blog.avianca.com.br/fda-approval-misoprostol-abortion.pdf#terminus">cytotec 200 mcg pfizer</a> Markets have become very sensitive to any clues on thetiming of a potential U.S. stimulus withdrawal, which ultimatelywill depend on incoming data. There will be plenty of that nextweek, with U.S. jobs the highlight.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Dump each one to a text file and then email it to a Bayesian analysis spam catcher, delete the ones it bounces ;)

plasticbugs
Dec 13, 2006

Special Batman and Robin

Scaramouche posted:

Dump each one to a text file and then email it to a Bayesian analysis spam catcher, delete the ones it bounces ;)

I know that's not feasible with 13M+ entries, but I'm guessing I can write a script that looks for a positive spam hit via Bayesian analysis and deletes the row. It'll be slow, but it's better than the alternative of upgrading my Heroku account just so I can host 13M rows' worth of spam. Does that sound like a reasonable solution? Guess I better start researching.

The saddest thing is that my site is not search-indexed by design, so I'm not sure what benefit the spammers get by loading my database with data no one will ever see.

Pardot
Jul 25, 2001




You could use `heroku pgbackups:transfer` to go to a huge DB instance, do the filter, then use it again to go back down to whatever you're on now. The cost will be prorated to the second.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

plasticbugs posted:

I know that's not feasible with 13M+ entries, but I'm guessing I can write a script that looks for a positive spam hit via Bayesian analysis and deletes the row. It'll be slow, but it's better than the alternative of upgrading my Heroku account just so I can host 13M rows' worth of spam. Does that sound like a reasonable solution? Guess I better start researching.

The saddest thing is that my site is not search-indexed by design, so I'm not sure what benefit the spammers get by loading my database with data no one will ever see.

I have a more serious answer now, though it's hacky as hell. Last time I had to do this I did an SQL dump of the table (ID and content columns only), zipped it up, and unzipped it onto a monster fast desktop machine. Then I used a variety of regex tools to find the "bad" posts and mark their IDs/delete those rows from the SQL dump. Then I made a new table, copied over the rows with "good" IDs, and renamed it to be the old table. This was faster than deleting 98% of the old table, and avoided index locks and so on.

For the regex I wrote a simple custom app in C#, but various tools like Notepad++, TextPad, etc. are pretty robust as well, though you'll need a lot of RAM to hold the entire file in memory at once. I found command-line tools more reliable.
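In Postgres terms, the copy-and-swap at the end is roughly this - a sketch only, with placeholder names, and any foreign keys, views, or sequences pointing at the old table need separate handling:
code:
BEGIN;
CREATE TABLE posts_clean (LIKE posts INCLUDING ALL);
INSERT INTO posts_clean
     SELECT * FROM posts WHERE body NOT LIKE '%<a href=%';  -- keep the "good" rows
ALTER TABLE posts       RENAME TO posts_spam;
ALTER TABLE posts_clean RENAME TO posts;
COMMIT;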

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Okay, I need some clarification on the subject of PostgreSQL indexes and primary keys. I've been reading docs, SO answers, and assorted blog posts, and I keep hearing contradictory ideas.

My scenario is as follows: I want an Events table (in the event-sourcing sense). It will have a unique UUID column, generated client-side (this is a requirement as it allows offline generation and guarantees idempotency) and truly random, not timestamp-based. It will also have a sequential column (BIGSERIAL) which keeps track of the order in which the events reached the database; this column is used by the event processor(s) to ensure that they've processed every event and that they do so in a correct and consistent order. Let's call them EventID and EventNumber respectively.

There will be several tables that inherit from Events for specific kinds of Events, adding a column with the reference to the object they belong to (e.g. CREATE TABLE User_Events (UserID UUID, LIKE Events INCLUDING ALL) INHERITS (Events)). These will exist strictly for the sake of conveniently summoning up the history of an entire UserID with a simple join, instead of searching through the payloads or maintaining dedicated history tables; the additional columns won't matter for business-logic purposes.

So the flow of data is: events get INSERTed into either the base Events table or, when applicable, one of the child tables like User_Events. They never get UPDATEd or DELETEd. Then, the event processor (projection function in ES lingo) SELECTs from the base Events table (which implicitly translates to a UNION ALL over the child tables), asking for all events with an EventNumber greater than the last one it processed. Occasionally, an application will query up the history of a particular item, and will do a SELECT on a specific child table over a specific UserID - it's not going to care about the EventID.

Now, as far as I understand it, the data in Postgres is always laid out in the same order in which it was inserted, so in that respect it doesn't matter which of EventID and EventNumber is the primary key, there's not going to be any data fragmentation. The only exception is if you manually invoke the CLUSTER command to physically rearrange a table according to a specific index, but it's a one-time thing.

Indexes are another matter - an index over a BIGSERIAL column will take up essentially no space at all, while a btree index over a UUID column takes up quite a lot of space, and may slow down INSERTs a bit as they need to be updated in a random position. But since UUID uniqueness will absolutely need to be encoded in the DB, and uniqueness can't be guaranteed without an index, there's going to be a UUID index whether or not it's the primary key.

So, to sum it up, am I correct in understanding that (contrary to what several blogs claim, and contrary to what is true in e.g. MS SQL Server) the following two arrangements are practically identical:

CREATE TABLE Events (EventID UUID PRIMARY KEY, EventNumber BIGSERIAL)

and

CREATE TABLE Events (EventID UUID NOT NULL UNIQUE, EventNumber BIGSERIAL PRIMARY KEY)

?

McGlockenshire
Dec 16, 2005

GOLLOCKS!
You are correct that Postgres doesn't order data on disk based on the content of the primary key. It's safe and correct to use your client-side generated UUID as your primary key here. While UUIDs take up twice the space of bigints, you shouldn't take any performance concerns seriously until you can prove it's an actual problem affecting production.

Be very careful with table inheritance and sequences. I personally only tried using inheritance once and it was a very unpleasant experience, but I'm about 90% sure that's because we had no idea what we were doing and our use case, copying class inheritance in a home-grown ORM, was stupid.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
"PRIMARY KEY enforces the same data constraints as a combination of UNIQUE and NOT NULL". :science:

('PRIMARY KEY' is kinda just "schema metadata sugar" with a tad of built-in enforcement on CREATE TABLE.)

NihilCredo posted:

It will also have a sequential column (BIGSERIAL) which keeps track of the order in which the events reached the database...

Now, as far as I understand it, the data in Postgres is always laid out in the same order in which it was inserted, so in that respect it doesn't matter which of EventID and EventNumber is the primary key, there's not going to be any data fragmentation....

Indexes are another matter - an index over a BIGSERIAL column will take up essentially no space at all, while a btree index over a UUID column takes up quite a lot of space, and may slow down INSERTs a bit as they need to be updated in a random position. But since UUID uniqueness will absolutely need to be encoded in the DB, and uniqueness can't be guaranteed without an index, there's going to be a UUID index whether or not it's the primary key.

First, sequences shouldn't be interpreted to establish a bijection between event timestamps and the desired monotonically increasing integer sequence. "[T]here may be "holes" or gaps in the sequence of values which appears in the column, even if no rows are ever deleted", so certainly don't expect to walk incrementally --- you indicated a 'valid' approach, just being thorough here --- however, "A value allocated from the sequence is still "used up" even if a row containing that value is never successfully inserted into the table column. This may happen, for example, if the inserting transaction rolls back".

You have an ordering issue and a potential race condition: TxA requests 10 values from the serial getting 555--564, TxB starts and requests 10 getting 565--574. TxA proceeds to "long insert" some transactions, but is slow because of... networking, processor, things. Meanwhile TxB inserts four and commits. TxA is open while your processor asks for everything past 554, gets 565--568 as committed in TxB, and processes ahead through to 568. TxA commits 555--560, leaving a gap of 561--564, 569--574. Processor expects to pick up from 569 on the next cycle and misses all of the TxA commits. The ordering issue comes from the difference between the start of the transaction, potential to request serials independently of inserts, and the actual "event time" for the insert versus the "event time" it encodes. If you truly need to track event time, you need a distributed timekeeping solution that adheres to your CAP requirements.

Next, the insertion order into pages is typically ordered, but it's not enforced unless you rerun CLUSTER (as you said). You are correct that the index order doesn't affect page order, but it's probably not wise to assume that everything will be ordered. For example, I would need to do a bit more looking through documentation before being convinced that table pages aren't reserved between transactions being prepared and committed, suggesting that a rollback may leave a gap in a page. While I'd like to envision that the WAL actually holds all the dynamic content, it seems possible that a page reservation could be made before the final COMMIT in some circumstances. In any case, that behavior should make no difference in your general performance.

Finally, you want an index on your sequence and the UUID, because logic dictates both shall be unique. The serial doesn't prevent code from inserting duplicates, it is merely a convenience to obtain a number that shouldn't have been used if everyone has always been doing things the same way. If you can guarantee no updates/deletes, you should read about 'fillfactor' during CREATE INDEX. Moreover, if the processor is only ever looking beyond a certain ID, you might get better performance from a routinely-rebuilt partial index, or possibly from a > index. (I recall a depesz post on > versus = performance, but can't find it right now).
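The partial-index version of that last idea looks something like this; the cutoff is a hypothetical high-water mark you'd bump whenever you rebuild it:
code:
CREATE INDEX CONCURRENTLY events_recent_idx
    ON Events (EventNumber)
 WHERE EventNumber > 1000000;   -- hypothetical high-water mark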

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Thank you for the very thorough critique.

PhantomOfTheCopier posted:

First, sequences shouldn't be interpreted to establish a bijection between event timestamps and the desired monotonically increasing integer sequence. "[T]here may be "holes" or gaps in the sequence of values which appears in the column, even if no rows are ever deleted", so certainly don't expect to walk incrementally --- you indicated a 'valid' approach, just being thorough here --- however, "A value allocated from the sequence is still "used up" even if a row containing that value is never successfully inserted into the table column. This may happen, for example, if the inserting transaction rolls back".

You have an ordering issue and a potential race condition: TxA requests 10 values from the serial getting 555--564, TxB starts and requests 10 getting 565--574. TxA proceeds to "long insert" some transactions, but is slow because of... networking, processor, things. Meanwhile TxB inserts four and commits. TxA is open while your processor asks for everything past 554, gets 565--568 as committed in TxB, and processes ahead through to 568. TxA commits 555--560, leaving a gap of 561--564, 569--574. Processor expects to pick up from 569 on the next cycle and misses all of the TxA commits. The ordering issue comes from the difference between the start of the transaction, potential to request serials independently of inserts, and the actual "event time" for the insert versus the "event time" it encodes. If you truly need to track event time, you need a distributed timekeeping solution that adheres to your CAP requirements.

Very good catch. Couldn't I brute-force this by having the event processor ask for an exclusive lock when requesting the list of new events? If this impacts performance too much, I could use LOCK TABLE only to select the relevant EventNumbers, and then use a regular non-locking query to actually fetch the event payloads.

Or even better, I could first query without locks, and only if there's a gap in the returned EventNumbers (meaning either a transaction in progress or a rolled-back one, but I can't know which) try again with an exclusive lock.

On the subject of ordering, I don't care what particular order two nearly-simultaneous events get stored in - it might as well have been randomized by Internet latency anyway, so first-to-begin, first-to-finish, or completely random all work fine. The only thing I care about is that, once they do get stored and committed, the event processor must always process them in a deterministic order. Wiping the business tables and reprocessing everything from the beginning of history (or from a known good state) should be a valid option in case of severe application bugs or schema redesigns. And if you did it without changing the event processor code, it should result in the exact same business tables, modulo any internally-generated UUIDs.

quote:

Next, the insertion order into pages is typically ordered, but it's not enforced unless you rerun CLUSTER (as you said). You are correct that the index order doesn't affect page order, but it's probably not wise to assume that everything will be ordered. For example, I would need to do a bit more looking through documentation before being convinced that table pages aren't reserved between transactions being prepared and committed, suggesting that a rollback may leave a gap in a page. While I'd like to envision that the WAL actually holds all the dynamic content, it seems possible that a page reservation could be made before the final COMMIT in some circumstances. In any case, that behavior should make no difference in your general performance.

Yeah I don't care if the physical layout is 100% neatly ordered, I just don't want massive performance degradation as every UUID gets inserted in the middle of the table (which is what SQL Server does if you have a UUID pkey without a separate sequential clustering index).

quote:

Finally, you want an index on your sequence and the UUID, because logic dictates both shall be unique. The serial doesn't prevent code from inserting duplicates, it is merely a convenience to obtain a number that shouldn't have been used if everyone has always been doing things the same way. If you can guarantee no updates/deletes, you should read about 'fillfactor' during CREATE INDEX. Moreover, if the processor is only ever looking beyond a certain ID, you might get better performance from a routinely-rebuilt partial index, or possibly from a > index. (I recall a depesz post on > versus = performance, but can't find it right now).

Okay, I need a hint about FILLFACTOR here. I knew about the option with that name in CREATE TABLE, but in that case the default value ( = 100 = pack everything without any empty space) is already optimal for append-only tables.

For CREATE INDEX, the documentation still says that the more you update a table the lower the fill factor should be, but I thought that the layout of index pages would reflect the structure of the values being indexed. So if you index sequential or mostly-sequential values, you'd want a high fillfactor (and better yet, a BRIN index). But if you're indexing random values - v4 UUIDs being an extreme case - every new insert could add an index reference to any page, so you would want to keep pages as big as you could afford (and periodically rebuild the index when too many pages fill up and need to be split).

If I'm supposed to use fillfactor=100 on a btree index on a UUID column of an append-only table, then I completely misunderstood how index pages work.

NihilCredo fucked around with this message at 23:03 on Aug 1, 2017

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Might not get through all of this before I have to run, and disclaimer that I'm a bit out of practice so check your references. :buddy:

NihilCredo posted:

Very good catch. Couldn't I brute-force this by having the event processor ask for an exclusive lock when requesting the list of new events? If this impacts performance too much, I could use LOCK TABLE only to select the relevant EventNumbers, and then use a regular non-locking query to actually fetch the event payloads.
You can't lock rows that don't exist. When you start the processor transaction, you're not going to see ID rows that haven't been committed. Locking only really helps for update/delete.

quote:

Or even better, I could first query without locks, and only if there's a gap in the returned EventNumbers (meaning either a transaction in progress or a rolled-back one, but I can't know which) try again with an exclusive lock.
Honestly, this is heading toward "audit entries" that permit the processor to guarantee nothing is missed, at which point every INSERT should trigger an entry into the RowsToProcess table, which _can_ be locked by the processor (to ensure that only a single processor is working that row at once, e.g., if you parallelize). Another approach would be to use a rule/trigger to ensure that the row is timestamped "after commit" (meaning "AFTER INSERT", that is when the commit is made) to permit ordering.
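A bare-bones sketch of that audit-entries shape (Postgres, all names invented; note that with inheritance a row-level trigger on the parent doesn't fire for inserts into the children, so each child table needs its own trigger):
code:
CREATE TABLE RowsToProcess (
    EventNumber BIGINT PRIMARY KEY,
    QueuedAt    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE FUNCTION queue_event() RETURNS trigger AS $$
BEGIN
    INSERT INTO RowsToProcess (EventNumber) VALUES (NEW.EventNumber);
    RETURN NULL;  -- the return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER queue_user_event
    AFTER INSERT ON User_Events
    FOR EACH ROW EXECUTE PROCEDURE queue_event();
The processor can then lock rows in RowsToProcess (SELECT ... FOR UPDATE) instead of locking the events table itself.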

More options possible, no time to think about it right now. Will ponder subconsciously. Not sure I fully appreciate what you're trying to do just yet.

quote:

On the subject of ordering, I don't care what particular order two nearly-simultaneous events get stored in - it might as well have been randomized by Internet latency anyway, so first-to-begin, first-to-finish, or completely random all work fine. The only thing I care about is that, once they do get stored and committed, the event processor must always process them in a deterministic order. Wiping the business tables and reprocessing everything from the beginning of history (or from a known good state) should be a valid option in case of severe application bugs or schema redesigns. And if you did it without changing the event processor code, it should result in the exact same business tables, modulo any internally-generated UUIDs.
I think the ID field will be insufficient, since the commit ordering issue previously mentioned permits smaller IDs to commit after larger values. Reprocessing will yield a different result.

quote:

Okay, I need a hint about FILLFACTOR here. I knew about the option with that name in CREATE TABLE, but in that case the default value ( = 100 = pack everything without any empty space) is already optimal for append-only tables.
Naup, just a miscommunication: "B-trees use a default fillfactor of 90", because they expect there to be updates/deletes. As far as I know, you indeed want fillfactor 100 for the ID index, which should always be extending "at the right" (larger values), hence will require splitting when the page is full; dropping the fillfactor does nothing in that situation. On the other hand, the UUID pages are likely to be filled relatively randomly, so I would expect a fillfactor near 50 to be optimal. Then again, see https://stackoverflow.com/questions/37148254/fillfactor-recommendation-for-index-on-uuid-column-in-postgresql-9-4 , though the responder posts no data, so it might not be trustworthy.
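For concreteness, the knob itself is just this (index names invented, and the percentages are illustrative guesses, not recommendations):
code:
CREATE UNIQUE INDEX events_number_idx ON Events (EventNumber) WITH (fillfactor = 100);  -- append-only, right-extending
CREATE UNIQUE INDEX events_uuid_idx   ON Events (EventID)     WITH (fillfactor = 70);   -- random insert positions, leave slack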


Edit: McGlockenshire made a statement that should not be ignored. Don't over-engineer; you can't establish there is a performance issue without some test data, and it probably isn't a concern until you can demonstrate that it will be one... unless you're talking about 1M inserts per hour or something, and then the concern is warranted.

PhantomOfTheCopier fucked around with this message at 01:46 on Aug 2, 2017

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

PhantomOfTheCopier posted:

You can't lock rows that don't exist. When you start the processor transaction, you're not going to see ID rows that haven't been committed. Locking only really helps for update/delete.

If I understood the docs correctly, that's the behaviour of the row locks acquired by INSERT/UPDATE/DELETE (or SELECT FOR UPDATE/DELETE), but the lock from LOCK TABLE or from a maintenance operation actually does cock-block everything. And, which is even more important for me, it waits for everyone else to be finished before accessing the table:

https://www.postgresql.org/docs/current/static/explicit-locking.html posted:

ACCESS EXCLUSIVE

Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE). This mode guarantees that the holder is the only transaction accessing the table in any way.

Acquired by the DROP TABLE, TRUNCATE, REINDEX, CLUSTER, VACUUM FULL, and REFRESH MATERIALIZED VIEW (without CONCURRENTLY) commands. Many forms of ALTER TABLE also acquire a lock at this level. This is also the default lock mode for LOCK TABLE statements that do not specify a mode explicitly.

quote:

Honestly, this is heading toward "audit entries" that permit the processor to guarantee nothing is missed, at which point every INSERT should trigger an entry into the RowsToProcess table, which _can_ be locked by the processor (to ensure that only a single processor is working that row at once, e.g., if you parallelize). Another approach would be to use a rule/trigger to ensure that the row is timestamped "after commit" (meaning "AFTER INSERT", that is when the commit is made) to permit ordering.

More options possible, no time to think about it right now. Will ponder subconsciously. Not sure I fully appreciate what you're trying to do just yet.

Ok, yeah, that's a little further than I wanted to go. Thankfully I don't plan on having multiple processors in the short or medium term. The application is an ERP management system, and the database is single-tenant, so potentially any event could affect the resolution of any other event (two webstores both attempting to sell the last item in storage, for example), meaning concurrency would be very tricky. The only way I might see myself considering parallel processors would be if I can divide events into categories that are known not to affect each other, and that rule would be the application's concern anyway.

quote:

Naup, just a miscommunication: "B-trees use a default fillfactor of 90", because they expect there to be updates/deletes. As far as I know, you indeed want fillfactor 100 for the ID index, which should always be extending "at the right" (larger values), hence will require splitting when the page is full; dropping the fillfactor does nothing in that situation. On the other hand, the UUID pages are likely to be filled relatively randomly, so I would expect a fillfactor near 50 to be optimal. Then again, see https://stackoverflow.com/questions/37148254/fillfactor-recommendation-for-index-on-uuid-column-in-postgresql-9-4 , though the responder posts no data, so it might not be trustworthy.

Edit: McGlockenshire made a statement that should not be ignored. Don't over-engineer; you can't establish there is a performance issue without some test data, and it probably isn't a concern until you can demonstrate that it will be one... unless you're talking about 1M inserts per hour or something, and then the concern is warranted.

My take is this: I don't want to overengineer in the sense of using less safe / less comfortable data structures (hence my "gently caress it, UUIDs everywhere" approach), or in the sense of making the schema more complicated than necessary (e.g. so far I've managed to avoid triggers altogether, and I'd love to keep it that way).

But setting table options, using the right type of indexes, etc. I don't consider overengineering, because they don't impose any technical cost later on, and the worst case scenario is that they have no perceivable effect.

BabyFur Denny
Mar 18, 2003
If you intend to use distributed systems, I can really, really recommend the book "Designing Data-Intensive Applications" by Martin Kleppmann. He talks about many of the challenges like locks, consistency, unique incremental IDs...
Have a look, it helped me a lot to understand things.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
I'm getting a mixed list of requirements, so I can't really provide any suggestions beyond those mentioned above. If a vendor came to me with an ERP solution that performed an exclusive full table lock just to process orders, I'd move on to the next potential vendor. If they told me it was built on the assumption that there would only ever be a single processor, again I'd go with a different vendor.

Processing establishes order here, not those IDs, so the processor must record that order to meet your stated requirement that a complete replay should be possible. There are lots of ways to do that, so while you say you don't really care which event arrives first, you do care that the information in the database reflects an immediately strongly consistent "comes before" relationship for processed events. There are lots of ways to do that other than locking the whole database.

Edit: We goons have to over-engineer because it's better to assume the worst and let you pick the best solution for your scenario. But you knew that; just trying to justify why I see "very serious timekeeping CAP excitement" here.

PhantomOfTheCopier fucked around with this message at 14:44 on Aug 2, 2017

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





BabyFur Denny posted:

If you intend to use distributed systems, I can really, really recommend the book "Designing Data-Intensive Applications" by Martin Kleppmann. He talks about many of the challenges like locks, consistency, unique incremental IDs...
Have a look, it helped me a lot to understand things.

this is the best introductory level book on dist sys i've ever read and it gets deep enough that even people who are already working in dist sys can probably benefit from it

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



the talent deficit posted:

this is the best introductory level book on dist sys i've ever read and it gets deep enough that even people who are already working in dist sys can probably benefit from it

How diagram heavy is it? I'm trying to decide between eBook and paperback

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





Munkeymon posted:

How diagram heavy is it? I'm trying to decide between eBook and paperback

not very. i bought and read the kindle version

CLAM DOWN
Feb 13, 2007




I'm taking over a SQL 2014 AlwaysOn AG that doesn't appear to have had a backup plan set, so the transaction logs have piled up to the point where there isn't enough free disk space to do a backup and truncate. What are my options? I don't think I can change the recovery model to simple and manually truncate with DBCC SHRINKFILE in an availability group, but my memory is hazy.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

CLAM DOWN posted:

I'm taking over a SQL 2014 AlwaysOn AG that doesn't appear to have had a backup plan set, so the transaction logs have piled up to the point where there isn't enough free disk space to do a backup and truncate. What are my options? I don't think I can change the recovery model to simple and manually truncate with DBCC SHRINKFILE in an availability group, but my memory is hazy.

You are correct, it's not that easy with an AG/mirror. I've been in this spot quite a few times, but have luckily always been able to clear enough space on the drive to eke out the log backups and then clear them up. I'm guessing that's not viable in this case.

This guy has some ideas, but I'm not sure it's the same use case as yours, since I don't know whether your logs will have status 2 (note: some dangerous stuff):
http://www.thecloudtechnologist.com/shrinking-sql-log-files-in-an-availability-group-cluster-or-database-mirror/

xenilk
Apr 17, 2004

ERRYDAY I BE SPLIT-TONING! Honestly, its the only skill I got other than shooting the back of women and calling it "Editorial".
Anyone familiar with optimizing huge tables in MySQL? We have this table with a setup as follows:

- id
- field1
- field2
- field3
- field4

Indexes are as follows:
- PK id
- Unique field1,field2,field3,field4
- index field4

The table itself is 260 GB, and the index size is 600+ GB (fun times!).

As you can imagine, the bigger the table gets the less fun it is to insert/update items in that list.

I'm looking to validate my strategy, and if you have input I'm all ears:

1 - Use gh-ost to create and populate field5, which would be a hash of the contents of all four unique fields
2 - Create a unique index on that new field
3 - Remove the old unique index

For now I'd rather not touch the id PK, but afterwards it might be removed and the PK would become the new field.

Am I overthinking this or does this sound about right?
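To make sure we're talking about the same end state, here's a sketch of steps 1-3 as plain MySQL 5.7+ DDL (table and index names invented; the point of gh-ost is to apply the equivalent change online instead of this blocking ALTER):
code:
ALTER TABLE big_table
    ADD COLUMN field5 BINARY(16)
        AS (UNHEX(MD5(CONCAT_WS('|', field1, field2, field3, field4)))) STORED,
    ADD UNIQUE KEY uq_field5 (field5);

-- once verified, drop the old wide unique index (hypothetical name):
ALTER TABLE big_table DROP INDEX uq_field1_field2_field3_field4;
Worth noting: hashing for uniqueness means accepting a (tiny) collision risk, and CONCAT_WS skips NULLs, so NULLable fields could produce identical hashes for different rows.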

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Haha. Just hahahaha. Come back when you get to terabytes.

If it's like every other glorified grep instance running out there, you probably need to read up on data normalization. What's SELECT COUNT(DISTINCT field1) compared to SELECT COUNT(id)? And the same for the other fields.

Is your unique index one composite on (field1, field2, field3, field4), or is it four separate indexes?

Roundboy
Oct 21, 2008
What storage engine is the table using? InnoDB allows adding indexes in place, which should not lock up the table.

Without seeing the data or your queries, I can only give general advice: match the column types to the data, store numbers as INT and not VARCHAR, etc.

MySQL generally uses only one index per table in a query, so a query that filters on three of those columns should have a composite index covering those three, with the columns you filter on leading. An index on (a,b,c) is different from one on (a,c,b), and you will not get the benefit of the index if the leading columns don't match your query.

If your data supports it, it might be worth looking into partitioning, especially if most of the changes only touch recent data.

There is not much more you can do beyond throwing hardware and memory at the problem, but at that table size, unless you have a MASSIVE memory footprint, you are most likely touching disk all the time.
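To make the column-order point concrete (reusing a, b, c from above; the table name is hypothetical):
code:
CREATE INDEX idx_abc ON big_table (a, b, c);
-- usable by queries filtering on (a), (a, b) or (a, b, c);
-- a query filtering only on b and c would instead want something like:
CREATE INDEX idx_bc ON big_table (b, c);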

Ranzear
Jul 25, 2013

Definitely more info is needed as to why you need a four-column unique constraint, or why you'd even cram all that into one table to begin with, but yes, your approach will work.

Anyone worked with MariaDB dynamic columns and have a one-liner for checking that the BLOB is valid before doing a COLUMN_ADD?

Scrapez
Feb 27, 2004

Efficiency question...

I'm writing a shell script that will take a list of telephone numbers each day and determine whether they already exist in the master list. If they do not exist, it will add them; if they do, it will discard them.

Currently, I'm accomplishing this with flat files using grep. What I want to do is move the master list of numbers to a MySQL database.

My question is whether it would be better to keep using the flat files to get the list of new unique numbers and then insert those numbers into the MySQL database, or to run a MySQL query against the master list to determine whether the new numbers exist and insert them if not.

It seems like MySQL would have some type of function that would make it possible to deliver a list of numbers and have it insert them if they don't already exist. Does such a thing exist? I am obviously not well versed with MySQL.

Edit: I was finally able to google the right thing. It appears I can do an INSERT IGNORE, which will skip any entries that already exist in the database.
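For reference, INSERT IGNORE only skips rows that would violate a unique or primary key, so the number column needs that constraint. A minimal sketch with made-up names:
code:
CREATE TABLE IF NOT EXISTS master_numbers (
    phone_number VARCHAR(20) PRIMARY KEY
);

INSERT IGNORE INTO master_numbers (phone_number)
VALUES ('5551234567'), ('5559876543');   -- duplicates are silently skipped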

Scrapez fucked around with this message at 16:48 on Aug 23, 2017

Lonely Wolf
Jan 20, 2003

Will hawk false idols for heaps and heaps of dough.
If you could use flat files but need a little SQL-y goodness, SQLite was made for you. Works pretty well with shell scripting, too. For your case you'd want to look up its "ON CONFLICT [method]" syntax.
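E.g., a minimal sketch (column and table names made up), run through the sqlite3 command-line tool from the script:
code:
CREATE TABLE IF NOT EXISTS master (phone_number TEXT PRIMARY KEY);

-- "OR IGNORE" is the conflict-resolution method: duplicate numbers are skipped
INSERT OR IGNORE INTO master (phone_number) VALUES ('5551234567');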

Scrapez
Feb 27, 2004

Lonely Wolf posted:

If you could use flat files but need a little SQL-y goodness, SQLite was made for you. Works pretty well with shell scripting, too. For your case you'd want to look up its "ON CONFLICT [method]" syntax.

I'll look into that. Thanks for the info.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Scrapez posted:

I'm writing a shell script that will take a list of telephone numbers each day and determine whether they already exist in the master list. If they do not exist, it will add them; if they do, it will discard them.

Currently, I'm accomplishing this with flat files using grep.
That is not a job for grep.

cat oldlist newlist | sort | uniq

anthonypants
May 6, 2007

by Nyc_Tattoo
Dinosaur Gum
I'm migrating a database from MySQL to PostgreSQL and I need to convert some of the data. The old MySQL database kept this one column in a longblob datatype, and stored text and images freely. The PostgreSQL schema says this should be a text column, with a second 'contentencoding' column marking whether the data is base64-encoded or not. I've also got a 'contenttype' column that has the MIME type: 'text/plain', 'image/png', etc. Is it possible to do something like encode(column::bytea, 'base64') that only hits certain rows where contenttype!='text/plain', and also updates the contentencoding field to base64? I asked about this project a few months ago, but I had to put it off at the time. I've gotten back into it and I feel like I'm really close, but I'm still really unfamiliar with SQL. I think this would be an UPDATE, but I don't think I can run functions through that?

e: looks like this'll work:
code:
UPDATE objectcustomfieldvalues SET largecontent=encode(largecontent::bytea, 'base64'), contentencoding='base64' WHERE (contenttype='image/png' AND contentencoding='none');

anthonypants fucked around with this message at 19:33 on Aug 24, 2017
