|
grate deceiver posted:Basically for every one of those X's I need to keep track of a bunch of numbers, a separate spreadsheet file, and some .pdf attachments. There's 12 (probably 24) months and 200 people, doing it in Excel is starting to get annoying as hell.
|
# ? Jul 11, 2017 14:02 |
|
|
|
Kuule hain nussivan posted:Oh, no question about that. What I meant is, can you pivot the monthly data into rows and then import that into Access? Mind you, that'd probably be pretty trivial to do in VBA as well. Wait, you mean like transpose and make employee IDs into columns? That sounds even worse to manage...
|
# ? Jul 11, 2017 14:45 |
|
grate deceiver posted:Wait, you mean like transpose and make employee IDs into columns? That sounds even worse to manage... Oh, no. You keep the employee IDs as a key, and then turn the month columns into single rows. For example, if you had Bob working in January and December, the pivoted table would have 2 rows with the id Bob.

Employee | Month
Bob      | January
Bob      | December
|
# ? Jul 11, 2017 14:50 |
|
Yeah I'd do something like table Schedule:code:
Normalization would suggest not even having the Name in the table, instead having another table called Employee with code:
Then to see if someone is working it's as easy as code:
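The three code: blocks above didn't survive, but based on the description they presumably looked something like this Access-flavored sketch (all names and types are guesses, not the actual posted code):

```sql
-- Guessed reconstruction of the tables described above.
CREATE TABLE Schedule (
    ID          AUTOINCREMENT PRIMARY KEY,
    Employee_ID LONG NOT NULL,
    [Month]     TEXT(20) NOT NULL,   -- or a proper date column
    Work        YESNO
);

CREATE TABLE Employee (
    Employee_ID AUTOINCREMENT PRIMARY KEY,
    Name        TEXT(100)
);

-- "Is Bob working in December?"
SELECT s.Work
FROM Schedule AS s
INNER JOIN Employee AS e ON e.Employee_ID = s.Employee_ID
WHERE e.Name = 'Bob' AND s.[Month] = 'December';
```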
|
# ? Jul 11, 2017 17:19 |
|
Ok, seems I vaguely had a similar idea to you two, except the table I made had only ID / Employee_ID / Month fields, so I only had records where there was work. Doing it like Scaramouche said, with the additional Work field, is probably smarter. Also found out it's super easy to still display the whole thing as a grid by using the crosstab query. Cool, thanks, dudes!
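For reference, the crosstab query over a Schedule table with Employee_ID / Month / Work columns (as discussed above) would look roughly like this in Access SQL — names are guesses:

```sql
-- Hypothetical crosstab: one row per employee, one column per month,
-- Work as the cell value, rebuilding the original grid view.
TRANSFORM First(s.Work)
SELECT s.Employee_ID
FROM Schedule AS s
GROUP BY s.Employee_ID
PIVOT s.[Month];
```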
|
# ? Jul 12, 2017 09:17 |
|
Is there a way to add a "dynamic" column to a SELECT query based on another column? For example, I have a simple table with a field called TicketNumber and a bunch of other fields that don't matter for the example. The TicketNumber field has data like this:

L32342
M3232
N2132
L2132
A3232
L33265

The first letter of the ticket represents the branch it belongs to. No other fields in the table relate to the branch; there is no branch_id or anything else. I want to build a query that selects the Branch name and some other fields. Can I dynamically select a Branch field that looks at the first letter of the ticket number and sort of does a lookup in a key-value array or something? I can do this in the software but I'd rather do it in the query. edit: It's MySQL. edit2: There is no branches table to join with, which is why I'm resorting to this. kiwid fucked around with this message at 16:15 on Jul 17, 2017 |
# ? Jul 17, 2017 16:05 |
|
code:
quote:edit2: There is no branches table to join with which is why I'm resorting to this. E: I suppose you might be able to create a function, but you're still hard-coding stuff. Jethro fucked around with this message at 16:19 on Jul 17, 2017 |
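The code block above was stripped, but the hard-coded approach being described is presumably a CASE expression along these lines (branch names are made up for illustration):

```sql
-- Hard-coded letter -> branch mapping as a CASE expression.
SELECT t.TicketNumber,
       CASE LEFT(t.TicketNumber, 1)
           WHEN 'L' THEN 'London'
           WHEN 'M' THEN 'Memphis'
           WHEN 'N' THEN 'Nairobi'
           ELSE 'Unknown'
       END AS Branch
FROM tickets AS t;
```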
# ? Jul 17, 2017 16:16 |
|
Jethro posted:
or you could load the letter->Branch name mapping into a temp table and join on the first character of the ticket like Jethro posted.
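A sketch of that temp-table variant — all names are made up:

```sql
-- Load the prefix -> branch mapping once, then join on the first character.
CREATE TEMPORARY TABLE branch_map (
    prefix CHAR(1) PRIMARY KEY,
    branch VARCHAR(50) NOT NULL
);
INSERT INTO branch_map VALUES ('L', 'London'), ('M', 'Memphis'), ('N', 'Nairobi');

SELECT t.TicketNumber, COALESCE(b.branch, 'Unknown') AS Branch
FROM tickets AS t
LEFT JOIN branch_map AS b ON b.prefix = LEFT(t.TicketNumber, 1);
```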
|
# ? Jul 17, 2017 18:57 |
|
Thanks, I ended up just saying gently caress it and created a branch table. Overkill for this application but oh well.
|
# ? Jul 18, 2017 00:58 |
|
It sure would be nice if that field was split to ticketbranch and ticketnumber.
|
# ? Jul 18, 2017 18:51 |
|
I've got a bug about a stored procedure running on MSSQL that's erroring out because of data truncation but only in production (where nobody has permissions to run a debugger) and only when run as a stored procedure. Meaning that, if I just copy the interesting part out of the sp_helptext output and run it in a SMS window, it runs and gives me a result set and two warnings about nulls being eliminated, probably by a SUM. Anyone have any ideas about what I could try to replicate the data truncation error outside of the exec black box? I'm guessing the warnings are a clue, but didn't think they got upgraded to errors inside a sproc, but I'm kind of mid-level in my SQL lore, so I could easily be very wrong.
|
# ? Jul 19, 2017 18:42 |
|
Is the trunc happening on a parameter or function variable? I know in the old days (pre-2012) it would just invisibly trunc varchars that would be larger than their column definition, not sure if they ever got around to fixing that.
|
# ? Jul 19, 2017 19:50 |
|
Half of the problem is that I don't know where it's actually happening. The only opportunity I found for string truncation is one field of a temp table that shouldn't have had any data added to it that was longer than the 50 it's declared with, but it was at least possible, so I updated that and made changes to default nulls to zero in the SUMs that were being run. The line number in the error I get from exec puts it in the parameter list between SELECT and FROM when the temp table is getting filled initially, but it has changed back and forth between two lines (somehow?!) between runs on me while I've been farting around trying to reproduce it in a SMS buffer which is... confusing. Dunno if those updates fixed anything and won't until we push the changes to prod or I can restore a full backup and try it, but that might take long enough that I'd have the restore run overnight.
|
# ? Jul 19, 2017 20:08 |
|
I have a production DB that is completely out of hand. My nice little web app that has about 2000 active users has been hit by spammers to the tune of 13M rows in my heroku postgres DB. They're using some kind of random text generator and then putting links to pharmaceuticals and poo poo. Somewhere in those 13M+ rows of data is actual user data. I don't want to kill my website, and any query that does any kind of pattern matching (even when I use LIMIT) is taking far too long to come back with anything.

Is there any way to select and delete thousands (millions) of rows of data at a time without going nuclear on my DB? I guess I could just look for any post that includes "<a href=" since my site strips links out by default anyway. Any suggestions?

Here's two pulled at random. The URLs and the things they link also seem incredibly random, so I'm not sure what kind of query will be most effective.

rear end in a top hat Spammer posted:I'd like to order some foreign currency <a href="http://pacatuba.ce.gov.br/kamagra-now-review.pdf#hearth">kamagra shop erfahrungen</a> As the government's short-term borrowing costs have jumped, borrowing costs for companies, banks and Wall Street have climbed in the past 72 hours, signaling traders' worries the Washington impasse over the debt ceiling might drag on much longer than they had expected just a week ago.

rear end in a top hat Spammer posted:I'm at Liverpool University <a href="http://www.templatelite.com/buying-promethazine-online.pdf#flying">phenergan cream la thuoc gi</a> Stack the States is a big favorite for our daily commute in the car. My daughter reads out the questions and we race each other to see who can answer them first. Helps teach state capitals, shapes, geographic locations, flags and more.
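For the delete itself, a common pattern is to chip away in bounded batches so no single transaction locks or bloats the table. A rough Postgres sketch, with made-up table/column names:

```sql
-- Delete in bounded batches; rerun until it reports DELETE 0.
-- "posts" and "body" are placeholders for the real table/column.
DELETE FROM posts
WHERE id IN (
    SELECT id
    FROM posts
    WHERE body LIKE '%<a href=%'
    LIMIT 10000
);
```

A leading-wildcard LIKE can't use a btree index, so each batch still scans, but keeping the transactions small avoids long locks and lets autovacuum keep up between runs.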
|
# ? Jul 28, 2017 22:28 |
|
Dump each one to a text file and then email it to a Bayesian analysis spam catcher, delete the ones it bounces
|
# ? Jul 28, 2017 23:05 |
|
Scaramouche posted:Dump each one to a text file and then email it to a Bayesian analysis spam catcher, delete the ones it bounces I know that's not feasible with 13M+ entries, but I'm guessing I can write a script that looks for a positive spam hit via bayesian analysis and deletes the row. It'll be slow, but it's better than the alternative of upgrading my Heroku account just so I can host 13M rows worth of spam. Does that sound like a reasonable solution? Guess I better start researching. The saddest thing is that my site is not search-indexed by design, so I'm not sure what benefit the spammers get by loading my database with data no one will ever see.
|
# ? Jul 29, 2017 00:28 |
|
You could use `heroku pgbackups:transfer` to go to a huge db instance, do the filter, then use it again to go back down to whatever you're on now. The cost will be prorated to the second.
|
# ? Jul 29, 2017 01:12 |
|
plasticbugs posted:I know that's not feasible with 13M+ entries, but I'm guessing I can write a script that looks for a positive spam hit via bayesian analysis and deletes the row. It'll be slow, but it's better than the alternative of upgrading my Heroku account just so I can host 13M rows worth of spam. Does that sound like a reasonable solution? Guess I better start researching. I have a more serious answer now, though it's hacky as hell. Last time I had to do this I did an SQL dump of the table (ID and content columns only), zipped it up, and unzipped it onto a monster fast desktop machine. Then I used a variety of regex tools to find the "bad" posts and mark their ids/delete those rows from the SQL dump. Then, I made a new table and copied over the "good" IDs, and renamed it to be the old table. This was faster than deleting 98% of the old table, and avoided index locks and so on. For the regex I wrote a custom simple app in C#, but various tools like Notepad++, Textpad, etc. are pretty robust as well, though you'll need a lot of RAM to have the entire file in memory all at once. I found command line tools more reliable.
|
# ? Jul 31, 2017 20:24 |
|
Okay, I need some clarifications on the subject of PostgreSQL indexes and primary keys. I've been reading docs, SO answers, and assorted blog posts, and I keep hearing contradictory ideas. My scenario is as follows:

I want an Events table (in the event-sourcing sense). It will have a unique UUID column, where the UUID is generated client-side (this is a requirement as it allows offline generation and guarantees idempotency) and is truly random, not timestamp-based. It will also have a sequential column (BIGSERIAL) which keeps track of the order in which the events reached the database; this column is used by the event processor(s) to ensure that they've processed every event and that they do so in a correct and consistent order. Let's call them EventID and EventNumber respectively.

There will be several tables that inherit from Events for specific kinds of Events, adding a column with the reference to the object they belong to (eg. CREATE TABLE User_Events (UserID UUID, LIKE Events INCLUDING ALL) INHERITS Events). These will exist strictly for the sake of conveniently summoning up the history of an entire UserID with a simple join, instead of searching through the payloads or maintaining dedicated history tables; the additional columns won't matter for business-logic purposes.

So the flow of data is: events get INSERTed into either the base Events table or, when applicable, one of the child tables like User_Events. They never get UPDATEd or DELETEd. Then, the event processor (projection function in ES lingo) SELECTs from the base Events table (which implicitly translates to a UNION ALL over the child tables) asking for all events with an EventNumber greater than the last one they processed. Occasionally, an application will query up the history of a particular item, and will do a SELECT on a specific child table over a specific UserID - it's not going to care about the EventID.
Now, as far as I understand it, the data in Postgres is always laid out in the same order in which it was inserted, so in that respect it doesn't matter which of EventID and EventNumber is the primary key; there's not going to be any data fragmentation. The only exception is if you manually invoke the CLUSTER command to physically rearrange a table according to a specific index, but it's a one-time thing.

Indexes are another matter - an index over a BIGSERIAL column will take up essentially no space at all, while a btree index over a UUID column takes up quite a lot of space, and may slow down INSERTs a bit as they need to be updated in a random position. But since UUID uniqueness will absolutely need to be encoded in the DB, and uniqueness can't be guaranteed without an index, there's going to be a UUID index whether or not it's the primary key.

So, to sum it up, am I correct in understanding that (contrary to what several blogs claim, and contrary to what is true in e.g. MS SQL Server) the following two arrangements are practically identical:

CREATE TABLE Events (EventID UUID PRIMARY KEY, EventNumber BIGSERIAL)

and

CREATE TABLE Events (EventID UUID NOT NULL UNIQUE, EventNumber BIGSERIAL PRIMARY KEY)

?
|
# ? Aug 1, 2017 12:17 |
|
You are correct that postgres doesn't do data ordering on disk based on the content of the primary key. It's safe and correct to use your client-side generated UUID as your primary key here. While UUIDs take up twice the space of bigints, you shouldn't take any performance concerns seriously until you can prove it's an actual problem affecting production. Be very careful with table inheritance and sequences. I personally only tried using inheritance once and it was a very unpleasant experience, but I'm about 90% sure that's because we had no idea what we were doing and our use case, copying class inheritance in a home-grown ORM, was stupid.
|
# ? Aug 1, 2017 18:15 |
|
"PRIMARY KEY enforces the same data constraints as a combination of UNIQUE and NOT NULL". ('PRIMARY KEY' is kinda just "schema metadata sugar" with a tad of built-in enforcement on CREATE TABLE.)

NihilCredo posted:It will also have a sequential column (BIGSERIAL) which keeps track of the order in which the events reached the database...

First, sequences shouldn't be interpreted to establish a bijection between event timestamps and the desired monotonically increasing integer sequence. "[T]here may be "holes" or gaps in the sequence of values which appears in the column, even if no rows are ever deleted", so certainly don't expect to walk incrementally --- you indicated a 'valid' approach, just being thorough here --- however, "A value allocated from the sequence is still "used up" even if a row containing that value is never successfully inserted into the table column. This may happen, for example, if the inserting transaction rolls back".

You have an ordering issue and a potential race condition: TxA requests 10 values from the serial getting 555--564, TxB starts and requests 10 getting 565--574. TxA proceeds to "long insert" some transactions, but is slow because of... networking, processor, things. Meanwhile TxB inserts four and commits. TxA is open while your processor asks for everything past 554, gets 565--568 as committed in TxB, and processes ahead through to 568. TxA commits 555--560, leaving a gap of 561--564, 569--574. Processor expects to pick up from 569 on the next cycle and misses all of the TxA commits.

The ordering issue comes from the difference between the start of the transaction, the potential to request serials independently of inserts, and the actual "event time" for the insert versus the "event time" it encodes. If you truly need to track event time, you need a distributed timekeeping solution that adheres to your CAP requirements.
Next, the insertion order into pages is typically ordered, but it's not enforced unless you rerun CLUSTER (as you said). You are correct that the index order doesn't affect page order, but it's probably not wise to assume that everything will be ordered. For example, I would need to do a bit more looking through documentation before being convinced that table pages aren't reserved between transactions being prepared and committed, suggesting that a rollback may leave a gap in a page. While I'd like to envision that the WAL actually holds all the dynamic content, it seems possible that a page reservation could be made before the final COMMIT in some circumstances. In any case, that behavior should make no difference in your general performance. Finally, you want an index on your sequence and the UUID, because logic dictates both shall be unique. The serial doesn't prevent code from inserting duplicates, it is merely a convenience to obtain a number that shouldn't have been used if everyone has always been doing things the same way. If you can guarantee no updates/deletes, you should read about 'fillfactor' during CREATE INDEX. Moreover, if the processor is only ever looking beyond a certain ID, you might get better performance from a routinely-rebuilt partial index, or possibly from a > index. (I recall a depesz post on > versus = performance, but can't find it right now).
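To make that last suggestion concrete, a hypothetical partial index over only the not-yet-processed tail of the table (the watermark literal is invented and would be refreshed at each rebuild):

```sql
-- Covers only rows past the last-processed EventNumber at rebuild time;
-- rebuild periodically as the watermark advances.
CREATE INDEX events_tail_idx
    ON events (eventnumber)
    WHERE eventnumber > 1000000;
```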
|
# ? Aug 1, 2017 21:54 |
|
Thank you for the very thorough critique.

PhantomOfTheCopier posted:First, sequences shouldn't be interpreted to establish a bijection between event timestamps and the desired monotonically increasing integer sequence. "[T]here may be "holes" or gaps in the sequence of values which appears in the column, even if no rows are ever deleted", so certainly don't expect to walk incrementally --- you indicated a 'valid' approach, just being thorough here --- however, "A value allocated from the sequence is still "used up" even if a row containing that value is never successfully inserted into the table column. This may happen, for example, if the inserting transaction rolls back".

Very good catch. Couldn't I brute-force this by having the event processor ask for an exclusive lock when requesting the list of new events? If this impacts performance too much, I could use LOCK TABLE only to select the relevant EventNumbers, and then use a regular non-locking query to actually fetch the event payloads. Or even better, I could first query without locks and, only if there's a gap in the returned EventNumbers (meaning either a transaction in progress or a rolled-back one, but I cannot know which), try again with an exclusive lock.

On the subject of ordering, I don't care what particular order two nearly-simultaneous events get stored in - it might as well have been randomized by Internet latency anyway, so first-to-begin, first-to-finish, or completely random all work fine. The only thing I care about is that, once they do get stored and committed, the event processor must always process them in a deterministic order. Wiping the business tables and reprocessing everything from the beginning of history (or from a known good state) should be a valid option in case of severe application bugs or schema redesigns. And if you did it without changing the event processor code, it should result in the exact same business tables, modulo any internally-generated UUIDs.
quote:Next, the insertion order into pages is typically ordered, but it's not enforced unless you rerun CLUSTER (as you said). You are correct that the index order doesn't affect page order, but it's probably not wise to assume that everything will be ordered. For example, I would need to do a bit more looking through documentation before being convinced that table pages aren't reserved between transactions being prepared and committed, suggesting that a rollback may leave a gap in a page. While I'd like to envision that the WAL actually holds all the dynamic content, it seems possible that a page reservation could be made before the final COMMIT in some circumstances. In any case, that behavior should make no difference in your general performance. Yeah I don't care if the physical layout is 100% neatly ordered, I just don't want massive performance degradation as every UUID gets inserted in the middle of the table (which is what SQL Server does if you have a UUID pkey without a separate sequential clustering index). quote:Finally, you want an index on your sequence and the UUID, because logic dictates both shall be unique. The serial doesn't prevent code from inserting duplicates, it is merely a convenience to obtain a number that shouldn't have been used if everyone has always been doing things the same way. If you can guarantee no updates/deletes, you should read about 'fillfactor' during CREATE INDEX. Moreover, if the processor is only ever looking beyond a certain ID, you might get better performance from a routinely-rebuilt partial index, or possibly from a > index. (I recall a depesz post on > versus = performance, but can't find it right now). Okay, I need a hint about FILLFACTOR here. I knew about the option with that name in CREATE TABLE, but in that case the default value ( = 100 = pack everything without any empty space) is already optimal for append-only tables. 
For CREATE INDEX, the documentation still says that the more you update a table the lower the fill factor should be, but I thought that the layout of index pages would reflect the structure of the values being indexed. So if you index sequential or mostly-sequential values, you'd want a high fillfactor (and better yet, a BRIN index). But if you're indexing random values - v4 UUIDs being an extreme case - every new insert could add an index reference to any page, so you would want to keep pages as big as you could afford (and periodically rebuild the index when too many pages fill up and need to be split). If I'm supposed to use fillfactor=100 on a btree index on a UUID column of an append-only table, then I completely misunderstood how index pages work. NihilCredo fucked around with this message at 23:03 on Aug 1, 2017 |
# ? Aug 1, 2017 22:59 |
|
Might not get through all of this before I have to run, and disclaimer that I'm a bit out of practice so check your references. NihilCredo posted:Very good catch. Couldn't I brute-force this by having the event processor ask for an exclusive lock when requesting the list of new events? If this impacts performance too much, I could use LOCK TABLE only to select the relevant EventNumbers, and then use a regular non-locking query to actually fetch the event payloads. quote:Or even better, I could first query without locks and only if there's a gap in the returned EventNumbers (meaning either a transaction in progress or a rollbacked one, but I cannot know whichj), only then try again with an exclusive lock. More options possible, no time to think about it right now. Will ponder subconsciously. Not sure I fully appreciate what you're trying to do just yet. quote:On the subject of ordering, I don't care what particular order two nearly-simultaneous events get stored in - it might as well have been randomized by Internet latency anyway, so first-to-begin, first-to-finish, or completely random all work fine. The only thing I care is that, once they do get stored and committed, the event processor must always process them in a deterministic order. Wiping the business tables and reprocessing everything from the beginning of history (or from a known good state) should be a valid option in case of severe application bugs or schema redesigns. And if you did it without changing the event processor code, it should result in the exact same business tables, modulo any internally-generated UUIDs. quote:Okay, I need a hint about FILLFACTOR here. I knew about the option with that name in CREATE TABLE, but in that case the default value ( = 100 = pack everything without any empty space) is already optimal for append-only tables. Edit: McGlockenshire made a statement that should not be ignored. 
Don't over-engineer; you can't establish there is a performance issue without some test data, and it probably isn't a concern until you can demonstrate that it will be one... unless you're talking about 1M inserts per hour or something, and then the concern is warranted. PhantomOfTheCopier fucked around with this message at 01:46 on Aug 2, 2017 |
# ? Aug 2, 2017 01:42 |
|
PhantomOfTheCopier posted:You can't lock rows that don't exist. When you start the processor transaction, you're not going to see ID rows that haven't been committed. Locking only really helps for update/delete. If I understood the docs correctly, that's the behaviour of the row locks acquired by INSERT/UPDATE/DELETE (or SELECT FOR UPDATE/DELETE), but the lock from LOCK TABLE or from maintenance operation actually does cock-block everything. And, which is even more important for me, waits for everyone else to be finished before accessing the table: https://www.postgresql.org/docs/current/static/explicit-locking.html posted:ACCESS EXCLUSIVE quote:Honestly, this is heading toward "audit entries" that permit the processor to guarantee nothing is missed, at which point every INSERT should trigger an entry into the RowsToProcess table, which _can_ be locked by the processor (to ensure that only a single processor is working that row at once, e.g., if you parallelize). Another approach would be to use a rule/trigger to ensure that the row is timestamped "after commit" (meaning "AFTER INSERT", that is when the commit is made) to permit ordering. Ok, yeah, that's a little further than I wanted to go. Thankfully I don't plan on having multiple processors in the short or medium terms. The application is an ERP management system, and the database is single-tenant, so potentially any event could affect the resolution of any other event (two webstores both attempting to sell the last item in storage, for example), meaning concurrency would be very tricky. The only way I might see myself considering parallel processors would be if I can divide events into categories that are known not to affect each other, and that rule would be the application's concern anyway. quote:Naup, just miscommunication confusion: "B-trees use a default fillfactor of 90", because they expect there to be updates/deletes. 
As far as I know, you indeed want fillfactor 100 for the ID index, which should always be extending "at the right" (larger values), hence will require splitting when the page is full. Dropping fillfactor does nothing in that situation. The UUID pages, on the other hand, are likely to be relatively randomly allocated, so I would expect fillfactor near 50 to be optimal. Then again, https://stackoverflow.com/questions/37148254/fillfactor-recommendation-for-index-on-uuid-column-in-postgresql-9-4 though the responder posts no data so it might not be trustworthy.

My take is this: I don't want to overengineer in the sense of using less safe / less comfortable data structures (hence my "gently caress it, UUIDs everywhere" approach), or in the sense of making the schema more complicated than necessary (e.g. so far I've managed to avoid triggers altogether, and I'd love to keep it that way). But setting table options, using the right type of indexes, etc. I don't consider overengineering, because they don't impose any technical cost later on, and the worst case scenario is that they have no perceivable effect.
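As a concrete sketch of those two fillfactor choices (the 70 for the UUID index is a starting point to benchmark, not a recommendation):

```sql
-- Append-only serial index: new entries always land rightmost, pack it tight.
CREATE UNIQUE INDEX events_eventnumber_idx
    ON events (eventnumber) WITH (fillfactor = 100);

-- Random v4 UUIDs land anywhere in the tree, so leave slack in each page.
CREATE UNIQUE INDEX events_eventid_idx
    ON events (eventid) WITH (fillfactor = 70);
```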
|
# ? Aug 2, 2017 09:25 |
|
If you intend to use distributed systems, I can really, really recommend the book "Designing Data Intensive Applications" by Martin Kleppmann. He talks about many of the challenges like locks, consistencies, unique incremental ids... Have a look, it helped me a lot to understand things.
|
# ? Aug 2, 2017 10:38 |
|
I'm getting a mixed list of requirements, so I can't really provide any suggestions beyond those mentioned above. If a vendor came to me with an ERP solution that performed an exclusive full table lock just to process orders, I'd move on to the next potential vendor. If they told me it was built on the assumption that there would only ever be a single processor, again I'd go with a different vendor. Processing establishes order here, not those IDs, so the processor must record that order to meet your stated requirement that a complete replay should be possible. There are lots of ways to do that, so while you say you don't really care which event arrives first, you do care that the information in the database reflects an immediately strongly consistent "comes before" relationship for processed events. There are lots of ways to do that other than locking the whole database. Edit: We goons have to over engineer because it's better to assume the worst and let you pick the best solution for your scenario. But you knew that, just trying to justify why I see "very serious timekeeping CAP excitement here". PhantomOfTheCopier fucked around with this message at 14:44 on Aug 2, 2017 |
# ? Aug 2, 2017 13:44 |
|
BabyFur Denny posted:If you intend to use distributed systems, I can really, really recommend the book "Designing Data Intensive Applications" by Martin Kleppmann. He talks about many of the challenges like locks, consistencies, unique incremental ids... this is the best introductory level book on dist sys i've ever read and it gets deep enough that even people who are already working in dist sys can probably benefit from it
|
# ? Aug 3, 2017 04:58 |
|
the talent deficit posted:this is the best introductory level book on dist sys i've ever read and it gets deep enough that even people who are already working in dist sys can probably benefit from it How diagram heavy is it? I'm trying to decide between eBook and paperback
|
# ? Aug 3, 2017 16:08 |
|
Munkeymon posted:How diagram heavy is it? I'm trying to decide between eBook and paperback not very. i bought and read the kindle version
|
# ? Aug 3, 2017 18:37 |
|
I'm taking over a SQL 2014 AlwaysOn AG that doesn't appear to have had a backup plan set, so the transaction logs have piled up to the point where there isn't enough free disk space to do a backup and truncate. What are my options? I don't think I can change the recovery model to simple and manually truncate with dbcc shrinkfile in an availability group, but my memory is hazy.
|
# ? Aug 4, 2017 18:34 |
|
CLAM DOWN posted:I'm taking over a SQL 2014 AlwaysOn AG that doesn't appear to have had a backup plan set, so the transaction logs have piled up to the point where there isn't enough free disk space to do a backup and truncate. What are my options? I don't think I can change the recovery model to simple and manually truncate with dbcc shrinkfile in an availability group but my memory is hazy. You are correct, it's not that easy with an AG/mirror. I've been in this spot quite a few times, but have luckily always been able to clear enough space on the drive to eke out the log backups and then clear them up. I'm guessing that's not viable in this case. This guy has some ideas, but I'm not sure if it's the same use case you have since I'm not sure if your logs will have status 2 (note, some dangerous stuff): http://www.thecloudtechnologist.com/shrinking-sql-log-files-in-an-availability-group-cluster-or-database-mirror/
|
# ? Aug 5, 2017 20:22 |
|
Anyone familiar with optimizing huge tables in MySQL? We have this table with a setup as follows:

- id
- field1
- field2
- field3
- field4

Indexes are as follows:

- PK id
- Unique field1,field2,field3,field4
- index field4

The table itself is 260 gig, the index size is 600+ gig (fun times!) As you can imagine, the bigger the table gets the less fun it is to insert/update items in that list. I'm looking to validate my strategy and if you have inputs I'm all ears:

1 - Use gh-ost to create and populate field5, which would become a hash value of the content of all 4 unique fields
2 - Create a unique index on that new field
3 - Remove the old unique index

For now I'd rather not touch the id PK but afterwards it might be removed and the PK would become the new field. Am I overthinking this or does this sound about right?
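For steps 1-3, if the server is MySQL 5.7+ a stored generated column can compute the hash for you (on older versions, backfill field5 via gh-ost as planned). Table name is a placeholder; note that CONCAT_WS skips NULL arguments, so rows differing only in which field is NULL would hash alike:

```sql
-- Step 1: hash of the four fields as a stored generated column (MySQL 5.7+).
ALTER TABLE big_table
    ADD COLUMN field5 CHAR(32)
        AS (MD5(CONCAT_WS('#', field1, field2, field3, field4))) STORED;

-- Step 2: the narrow replacement unique index.
CREATE UNIQUE INDEX ux_big_table_field5 ON big_table (field5);

-- Step 3, once the new index is verified:
-- ALTER TABLE big_table DROP INDEX <old_unique_index_name>;
```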
|
# ? Aug 18, 2017 02:54 |
|
Haha. Just hahahaha. Come back when you get to terabytes. If it's like every other glorified grep instance running out there, you probably need to read up on data normalization. Select count( distinct field1 ) / Select count( id )? And for the other fields. Is your index on (f1,2,3,4) or is it four indexes?
|
# ? Aug 18, 2017 03:41 |
|
What is your storage engine? InnoDB allows for adding indexes in place, which should not lock up the table. Without seeing the data or your queries, I can only give general advice: match the type to the data, store numbers as int and not varchar, etc. MySQL only uses one index at a time, so a query on 3 of those columns should have an index on those 3 IN THAT ORDER. Index a,b,c is different from a,c,b and you will not get the benefit of that index. If your data supports it, it might be worth it to look into partitions, especially if most of the changes are only on recent data. There is not much you can do once you've thrown hardware and memory at the problem, but the size of the tables could mean that unless you have a MASSIVE memory footprint, you are most likely touching disk all the time.
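To illustrate the column-order point with a hypothetical table t carrying a composite index on (a, b, c):

```sql
-- These can use the (a, b, c) index, because they match a leftmost prefix:
SELECT * FROM t WHERE a = 1;
SELECT * FROM t WHERE a = 1 AND b = 2;
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

-- This one skips the leading column, so the index can't be used:
SELECT * FROM t WHERE b = 2 AND c = 3;
```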
|
# ? Aug 18, 2017 11:35 |
|
Definitely more info needed as to why you need a four-column unique constraint or even cram all that into one table to begin with, but yes, your approach will work. Anyone worked with MariaDB Dynamic Columns and have a one-liner for checking the BLOB is valid before doing a column_add?
|
# ? Aug 18, 2017 18:33 |
|
Efficiency question... I'm writing a shell script that will take a list of telephone numbers each day; it will then check whether those numbers already exist in the master list. If they do not exist, it will add them; if they exist, it will discard them. Currently, I'm accomplishing this with flat files using grep. What I want to do is move the master list of numbers to a mysql database. My question is whether it would be better to continue to utilize the flat files to get the list of new unique numbers and then insert those numbers into the mysql database. Or, would it be better to run a mysql query of the master list to determine if the new numbers exist and insert them if not? It seems like MySQL would have some type of function that would make it possible to deliver a list of numbers and have it insert them if they don't already exist. Does such a thing exist? I am obviously not well versed with MySQL. Edit: I was finally able to google the right thing. It appears I can do an INSERT IGNORE which will skip any entries that already exist in the database. Scrapez fucked around with this message at 16:48 on Aug 23, 2017 |
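The INSERT IGNORE mentioned in the edit looks like this (table name invented). One caveat: IGNORE also downgrades other errors, such as data truncation, to warnings, so it's worth checking the warning count after a big load.

```sql
-- phone is the primary key, so IGNORE silently skips duplicate values.
CREATE TABLE master_numbers (
    phone VARCHAR(20) PRIMARY KEY
);

INSERT IGNORE INTO master_numbers (phone)
VALUES ('2125550100'), ('2125550101'), ('2125550100');  -- third row is a dupe, skipped
```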
# ? Aug 23, 2017 16:44 |
|
If you could use flat files but need a little sql-y goodness, sqlite was made for you. Works pretty well with shell scripting, too. For your case you'd want to look up its "on conflict [method]" syntax.
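The SQLite spelling of the same idea — OR IGNORE is the relevant conflict method here (table name is made up):

```sql
CREATE TABLE master_numbers (phone TEXT PRIMARY KEY);

INSERT OR IGNORE INTO master_numbers (phone) VALUES ('2125550100');
INSERT OR IGNORE INTO master_numbers (phone) VALUES ('2125550100');  -- duplicate, skipped
```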
|
# ? Aug 23, 2017 17:00 |
|
Lonely Wolf posted:If you could use flat files but need a little sql-y goodness, sqlite was made for you. Works pretty well with shell scripting, too. For your case you'd want to look up it's "on conflict [method]" syntax. I'll look into that. Thanks for the info.
|
# ? Aug 23, 2017 17:27 |
|
Scrapez posted:I'm writing a shell script that will take a list of telephone numbers each day, it will then take that list of numbers and determine if they already exist in the master list. If they do not exist, it will add them and if they exist, it will discard them. cat oldlist newlist | sort | uniq
|
# ? Aug 24, 2017 00:02 |
|
|
|
I'm migrating a database from MySQL to PostgreSQL and I need to convert some of the data. The old MySQL database kept this one column in a longblob datatype, and stored text and images freely. The PostgreSQL schema says this should be a text column, and with a second 'contentencoding' column marking whether the data is base64-encoded or not. I've also got a 'contenttype' column that has the MIME type; 'text/plain', 'image/png', etc. Is it possible to do something like encode(column::bytea, 'base64') that only hits certain rows where contenttype!='text/plain', and also updates the contentencoding field to base64? I asked about this project a few months ago, but I had to put it off at the time. I've gotten back into it and I feel like I'm really close, but I'm still really unfamiliar with SQL. I think this would be an UPDATE, but I don't think I can run functions through that? e: looks like this'll work: code:
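For what it's worth, an UPDATE can indeed call functions in SET and restrict rows with WHERE in one statement; the stripped code block was probably something along these lines ("attachments" is a made-up table name):

```sql
-- base64-encode every non-plain-text row and flag it in one pass.
UPDATE attachments
SET content         = encode(content::bytea, 'base64'),
    contentencoding = 'base64'
WHERE contenttype <> 'text/plain';
```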
anthonypants fucked around with this message at 19:33 on Aug 24, 2017 |
# ? Aug 24, 2017 01:06 |