Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Hughmoris posted:

Hmm. Yeah, I think I'm going to have to figure out how to use the second method you recommended. The outer apply query is taking about 7 minutes for 1 month.

What are the table names and relevant column names of the two tables?

anthonypants
May 6, 2007

by Nyc_Tattoo
Dinosaur Gum
Do any of you guys do version control with SQL code, and if so what do you use and how do you use it?

Kuule hain nussivan
Nov 27, 2008

anthonypants posted:

Do any of you guys do version control with SQL code, and if so what do you use and how do you use it?
I've just been pushing stuff to git. Individual .sql files for table creation, procedures, possible initialization data etc. I'm sure there's a better way, but I haven't had any major issues yet.

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy
i'm a fan of SVN but that's only because i mostly write ad hoc queries and have no reason to have the fancy deployment capabilities of git

NihilCredo
Jun 6, 2011

Suppress anger in every way you can:
one fit of it will defame you more than many virtues will commend you

anthonypants posted:

Do any of you guys do version control with SQL code, and if so what do you use and how do you use it?

Any version control system will handle sql files just fine. Even if you use a visual designer or something, just export the schema as a .sql file and VC that. There's no reason not to.

The more interesting question is whether to manage migrations manually or using some kind of declarative tool, which depends heavily on what your deployment story looks like.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Yeah, if you don't care about deployment, just about anything can handle versioning, even command-line Visual SourceSafe if you wanted (does that still exist? brrr)

anthonypants
May 6, 2007

by Nyc_Tattoo
Dinosaur Gum

NihilCredo posted:

Any version control system will handle sql files just fine. Even if you use a visual designer or something, just export the schema as a .sql file and VC that. There's no reason not to.

The more interesting question is whether to manage migrations manually or using some kind of declarative tool, which depends heavily on what your deployment story looks like.
Is "just export all your existing databases using 'script to create as' and then generate incremental .sql files for each change" really best practices

Kuule hain nussivan
Nov 27, 2008

anthonypants posted:

Is "just export all your existing databases using 'script to create as' and then generate incremental .sql files for each change" really best practices

Not trying to be snarky, but what are the major downsides of this simple method?

spiritual bypass
Feb 19, 2008

Grimey Drawer

anthonypants posted:

Is "just export all your existing databases using 'script to create as' and then generate incremental .sql files for each change" really best practices

It's better than nothing, which is the only alternative I've seen besides extremely expensive proprietary database versioning tools that don't work very well

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Definitely standard repository best practices, but it's more a matter of creating appropriate roll forward/backward SQL. Dumping the DDL is a start, but after that you'll want the specific ALTER TABLE script and corresponding rollback in the repo. The changes should obviously be paired with the test cases. ;)
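For example, a paired roll-forward/rollback migration could look something like this (table, column, and file names are made up; Postgres-flavored syntax):

code:
-- 0042_add_last_login_to_users.sql  (roll forward)
ALTER TABLE users ADD COLUMN last_login timestamp;

-- 0042_add_last_login_to_users_rollback.sql  (roll backward)
ALTER TABLE users DROP COLUMN last_login;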

thebigcow
Jan 3, 2001

Bully!
https://cloudblogs.microsoft.com/sqlserver/2017/11/15/announcing-sql-operations-studio-for-preview/

SQL Operations Studio has Git support built right in, if that interests you.

Cold on a Cob
Feb 6, 2006

i've seen so much, i'm going blind
and i'm brain dead virtually

College Slice

anthonypants posted:

Do any of you guys do version control with SQL code, and if so what do you use and how do you use it?

I use SQL Server Data Tools. It's a Visual Studio add-on, works for SQL Server and Azure databases, and since it's just a collection of files it works fine with any version control you want to use (we use TFS, but I've seen it used with Git and SVN). It lets you handle structure (tables, procs, schemas, etc.) and pre/post-deploy SQL scripts for seeding and manipulating data. It can also create your initial project by reading the existing schema of a database and reverse engineering it into a set of scripts for you (it obviously can't do seed data that way, though). It will also do some rudimentary validations on your procedures, functions, and views at "compile time".

If SSDT doesn't suit your needs or you want something that works with databases other than SQL Server, another option I've explored (but not yet used) is DbUp. It seems like a good choice if you want to write all the change scripts manually, whereas SSDT generates a lot of the scripts for you. Sometimes SSDT is way overkill for my needs or doesn't give me the fine-grained control I want, so I'm eager to try DbUp sometime.

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

rt4 posted:

It's better than nothing, which is the only alternative I've seen besides extremely expensive proprietary database versioning tools that don't work very well

Liquibase has worked extremely well for us. It was not a big hassle to export our existing DB; in fact, we used it as a first step to migrate from MySQL to Postgres while maintaining support for both in a single changelog file. I like that it keeps track of which changes have already been executed in each environment, which makes it safe to rerun all the changes on application start.
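For anyone curious, a minimal formatted-SQL changelog looks roughly like this (author tag and table name are made up); Liquibase records what has already run in its DATABASECHANGELOG table, which is what makes the rerun-on-start behaviour safe:

code:
--liquibase formatted sql

--changeset jp:1
CREATE TABLE widgets (
    id INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
--rollback DROP TABLE widgets;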

Rick
Feb 23, 2004
When I was 17, my father was so stupid, I didn't want to be seen with him in public. When I was 24, I was amazed at how much the old man had learned in just 7 years.
sl;dfjkl;jsdfakd;jf kdsf kljfskdfj kl

Database type: MariaDB running on Windows via XAMPP

I was trying to remove privileges to a table that I don't want to be updated via PHP anymore and accidentally removed root's global permission to update. IT TURNS OUT YOU CAN'T DO A WHOLE LOT LIKE THIS. I've spent roughly three hours trying various fixes from Google but can't seem to fix it. Like, I try to load MySQL with the skip-grant-tables thing on and it still denies me permission to update. I've tried logging in as other versions of root and can't seem to do that right either.

Can I do anything at all? Should I just export the databases and start over? This is just a personal thing I run on my PC; if this was development for anyone else I'd have a backup global admin account and probably wouldn't be using root to begin with.

E: Solved. In XAMPP you have to actually create the my.cnf so that you can include skip-grant-tables. Once you do that, GUIs may still throw a fit when you try to update user permissions, but if you just update them directly in the mysql database, it works.
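For anyone who hits this later, the direct fix looks roughly like this once the server has been restarted with skip-grant-tables under [mysqld] in my.cnf (column names as in the stock mysql.user grant table):

code:
-- with skip-grant-tables active, edit the grant table directly
UPDATE mysql.user SET Update_priv = 'Y' WHERE User = 'root';
FLUSH PRIVILEGES;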

Rick fucked around with this message at 01:04 on Sep 9, 2018

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
How do I partition based on the first character of a column in a MySQL 5.6 database?

My column contains special characters and whatnot so I want to partition based on the ASCII value of the first character in the column.

I have the ASCII ranges I want to use for the eight partitions, but I don't actually know how to set up the partition:
1-37
38-42
43-47
48-53
54-67
68-73
74-81
82-255

How do I define these partitions at table creation?

edit: Or is there a better way of doing this?

Agrikk fucked around with this message at 01:56 on Sep 23, 2018

McGlockenshire
Dec 16, 2005

GOLLOCKS!
It sounds like you want partitioned tables? If so, 5.6 only supports checking the entire value of the field. Consider pre-processing your data to add another column that contains the numeric ASCII value of the first character in the magic column, and partition using PARTITION BY RANGE on that. It might be possible to do this automatically using a trigger and the ASCII builtin... might. I don't know how triggers and partitioning interact.
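A rough sketch of what that could look like with the ranges above (table and column names are made up; on 5.6 the extra column has to be filled in by the application or a trigger, since generated columns only arrived in 5.7):

code:
CREATE TABLE items (
  id INT NOT NULL,
  raw_value VARCHAR(255) NOT NULL,
  first_char_ascii TINYINT UNSIGNED NOT NULL,  -- e.g. ASCII(raw_value), set on insert
  PRIMARY KEY (id, first_char_ascii)           -- the partition column must be in every unique key
)
PARTITION BY RANGE (first_char_ascii) (
  PARTITION p0 VALUES LESS THAN (38),
  PARTITION p1 VALUES LESS THAN (43),
  PARTITION p2 VALUES LESS THAN (48),
  PARTITION p3 VALUES LESS THAN (54),
  PARTITION p4 VALUES LESS THAN (68),
  PARTITION p5 VALUES LESS THAN (74),
  PARTITION p6 VALUES LESS THAN (82),
  PARTITION p7 VALUES LESS THAN MAXVALUE
);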

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
Thanks for this link.

It actually led me to KEY PARTITIONING, which ultimately does what I was trying to accomplish with the first-character grouping: assigning data evenly across partitions based on one or more columns.

Cheers!

Pollyanna
Mar 5, 2005

Milk's on them.


We have a Postgres table with roughly 10 million to 15 million rows, and some of the columns in that table contain PII. We want to blank out that PII in those columns and I tried to do so via an “update_all” in Rails equivalent to “update <table> set <the columns = fake data>”, but it seems to take a very very long time. Why would this happen? Is there a way to figure out why it would be so slow? Is the time it takes to update an entire column in a table just that long?

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Pollyanna posted:

We have a Postgres table with roughly 10 million to 15 million rows, and some of the columns in that table contain PII. We want to blank out that PII in those columns and I tried to do so via an “update_all” in Rails equivalent to “update <table> set <the columns = fake data>”, but it seems to take a very very long time. Why would this happen? Is there a way to figure out why it would be so slow? Is the time it takes to update an entire column in a table just that long?

There are a lot of questions about the structure to answer, I think.

One is: what criteria are you updating by? If it's an indexed column (e.g. ID), it's faster than a string comparison on an unindexed column (e.g. WHERE ColumnPII LIKE '%x%'). If there are no criteria at all (i.e. it applies to the entire table), then what columns are being modified and how? Are those columns indexed?
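A quick way to see what the statement will actually do before running it for real is to EXPLAIN it (made-up table and column names, Postgres):

code:
EXPLAIN UPDATE users SET phone_number = '000-000-0000' WHERE id = 12345;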

Pollyanna
Mar 5, 2005

Milk's on them.


Scaramouche posted:

There are a lot of questions about the structure to answer, I think.

One is: what criteria are you updating by? If it's an indexed column (e.g. ID), it's faster than a string comparison on an unindexed column (e.g. WHERE ColumnPII LIKE '%x%'). If there are no criteria at all (i.e. it applies to the entire table), then what columns are being modified and how? Are those columns indexed?

Apply to entire table, yeah. The table has about 16 btree indexes applied to it (things like created_at, updated_at DESC, uuid, email, etc). It’s supposed to be flat-out just this:

code:
UPDATE table
SET phone_number = '000-000-0000';
where there is no index on phone number. I want to expand this to email too, but that definitely does have an index on it.

EDIT: I tried updating just one of them and it also takes forever. What the hell.

Pollyanna fucked around with this message at 20:53 on Sep 27, 2018

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!

Pollyanna posted:

We have a Postgres table with roughly 10 million to 15 million rows, and some of the columns in that table contain PII. We want to blank out that PII in those columns and I tried to do so via an “update_all” in Rails equivalent to “update <table> set <the columns = fake data>”, but it seems to take a very very long time. Why would this happen? Is there a way to figure out why it would be so slow? Is the time it takes to update an entire column in a table just that long?
A few questions:
Are you testing what happens when you do the update query directly in the DB, or are you only testing through Rails? Are you sure Rails is doing what you are expecting and not something stupid?
Is there a lot of activity happening on that table while you're running the update? Are you sure there's nothing else blocking you?
Maybe my opinions are skewed from being old now, but 10 million rows isn't nothing (though it's certainly not Big Data). Is there some way you could break things up so that you can 1) maybe do things in a more parallel fashion and 2) not have a lot of resources devoted to one big ongoing uncommitted transaction? I don't have any experience with Postgres, so I don't know how much issue 2 applies, nor do I know if there is some sort of hint or setting that could enable or force more parallelism.
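A batched version of the scrub could look roughly like this (made-up table and column names), committing between chunks so no single transaction has to hold everything:

code:
-- repeat with the next id range, committing between batches
UPDATE users
SET phone_number = '000-000-0000'
WHERE id BETWEEN 1 AND 100000;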

Pollyanna
Mar 5, 2005

Milk's on them.


I’m going through Rails - I’m pretty sure we don’t get direct RW access to the staging database (or if we do, I don’t have it). I don’t know if it’s Rails or the DB, but the fact that it’s hanging on even updating one row where the primary key is supplied implies that it’s not Rails. Right?

I’m gonna have to find a way to get closer to the metal to figure out what’s happening.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Failing on one row implies it's in the Rails layer to me; the database itself would be very straightforward about whether something passed or failed. However, if Rails is causing the problem (e.g. permissions, connection string, etc.) it could hang around indefinitely. If you can, maybe check the triggers to see if something weird is happening on update.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Slow to update a single row also reeks of Rails to me, buuuuut if the application is really dumb, nothing prevents that row from being locked and waiting that entire time. E.g., if 10k processes are all updating that row, then of course it will be slow. Now, if you can guarantee that nothing is happening on the database and it's still slow, then it's probably not the database.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
When you have a series of SQL statements (outside of a transaction) like:

update table set col = newval where id = 1;
update table set col = newval where id = 2;
...
update table set col = newval where id = n;

Then sure the first one will complete, the change will be visible, then the second will execute. (It will also probably be super slow, don't do this.) But when you write SQL of the form:

update table set col = newval;

You are modifying every row in one statement, therefore implicitly in one transaction. You won't see any changes until the entire statement finishes, not even to the first row.

Also, because of the way the Postgres storage backend works, an update is basically a delete plus an insert. If you have a bunch of indexes on the table, they also have to be updated for every single row, even if the modification you are making doesn't touch the indexed columns, because the internal id of the row has changed and the index entry is now out of date. Constraints may need to be checked similarly per row, depending on how they are set up. For large tables with many rows, if you can handle the downtime, it is generally fastest to drop all indexes/constraints on or referring to the table, make your modifications, then create the indexes/constraints again. Equivalently, you can create a new table that is identical except for those indexes/constraints, perform your modifications there, add the indexes/constraints back to the new table, and only at the very end do a transactional delete-and-rename to replace the previous table. That's normally what I do for bulk table loads or wholesale modifications like this.
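A rough sketch of that copy-and-swap approach in Postgres (table, column, and index names are all made up):

code:
BEGIN;

-- copy structure only; indexes and constraints come later
CREATE TABLE users_scrubbed (LIKE users INCLUDING DEFAULTS);

INSERT INTO users_scrubbed (id, email, phone_number, created_at)
SELECT id, email, '000-000-0000', created_at
FROM users;

-- recreate indexes/constraints only after the bulk load
CREATE INDEX users_scrubbed_email_idx ON users_scrubbed (email);

-- swap the tables
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_scrubbed RENAME TO users;

COMMIT;

DROP TABLE users_old;  -- once you're happy with the result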

Can't speak to the Rails part, but ultimately all Rails could possibly be doing is SQL statements like the above.

Edit: If you can't even update one row separately, then maybe there is a lock preventing it; check your query's status in pg_stat_activity to see.
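Something along these lines will show what's running and what it's waiting on (standard pg_stat_activity columns on 9.6 and up):

code:
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle';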

DELETE CASCADE fucked around with this message at 01:52 on Sep 28, 2018

nexxai
Jul 17, 2002

quack quack bjork
Fun Shoe
I have a very small public (non-revenue-generating) website that uses MariaDB (Server version: 10.2.18-MariaDB-10.2.18+maria~xenial-log mariadb.org binary distribution) as the database backend. I currently back up the DB with an every-15-minute cron job running mysqldump and then sending the resulting .sql file to a cloud storage provider. I was wondering if anyone has any recommendations for better backup software that's easier to use and restore from, but also that is free (or at least very inexpensive).

McGlockenshire
Dec 16, 2005

GOLLOCKS!
Your current solution is the "easiest" one. At some point, you'll find that mysqldump can be kind of a drag on the live database. It's perfectly safe to run your backups from a read-only replica on another machine.

Make sure that you are actively testing restores. Consider automating restores to a separate environment on a continual basis and inspecting the data from time to time. A place I've worked before imported the previous evening's backup into each developer's version of the site every morning, for example.

The "best" way to do MySQL backups is to use something that talks to the built in InnoDB hot backup bits, like Percona Xtrabackup. There are some caveats when using it with MariaDB 10.2 or higher, but it looks like they have a fork that'll work for you, the link is near the bottom of the page. It is absolutely not an easy solution and operating it requires knowledge of a bunch of MySQL jargon, including replication things, even if you don't use replication. Both backups and restores require extra steps and a thorough understanding of the process. Restores also can't be done piecemeal - it's effectively a transaction-safe copy of /var/lib/mysql.

Stick with mysqldump as long as it suits your use case. It's a good enough tool.

McGlockenshire fucked around with this message at 08:22 on Sep 28, 2018

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

McGlockenshire posted:

Your current solution is the "easiest" one. At some point, you'll find that mysqldump can be kind of a drag on the live database. It's perfectly safe to run your backups from a read-only replica on another machine.
Given the circumstances, indeed the best solution, and a very thorough writeup.

I've seen exceptions to that last sentence, however, because backup load on a read-only replica results in flooding of the master after the backup concludes. I haven't seen this performance dip on PostgreSQL, but it used to be very common on MySQL.

Regarding Percona tools, they are generally required for any production environment, so definitely plan on using them. As above, however, I've seen cases where xtrabackup fails under live load (a while ago, and IIRC it couldn't keep up, as there's no locking to prevent the history from going away mid-backup).

Pollyanna
Mar 5, 2005

Milk's on them.


Took a look at pg_stat_activity and there was a whole-table update query that had been running since yesterday afternoon on that very column. I killed it and all the rest of the queries stopped waiting. Now a single update works fine :dance:

So is it because of the indexes? Or just the sheer size of the table, and I need to batch it?

Edit: Hmm, even via psql I can’t do a “select * from big_table limit 1;” without it taking forever. The only thing using that table is an autovacuum on it. Is that what’s preventing me from getting even one record?

Edit 2: it looks like the autovacuum is indeed what’s locking it. There’s a bunch of RowExclusiveLocks associated with its PID, as well as an ExclusiveLock and a ShareUpdateExclusiveLock. Am I just supposed to wait until the vacuum finishes? It’s been going for like 3 hours now.

Edit 3: we ended up narrowing it down to the indexes. 10~15 million records and an index on id desc means that just doing a blind select limit takes ages. :bang: Problem solved, though.

Pollyanna fucked around with this message at 21:36 on Sep 28, 2018

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Pollyanna posted:

Edit 3: we ended up narrowing it down to the indexes. 10~15 million records and an index on id desc means that just doing a blind select limit takes ages. :bang: Problem solved, though.
This I don't quite follow. An unqualified SELECT should not invoke an index at all, and simply return rows in page order.

Also auto vacuum shouldn't be doing so much unless you have a large number of updates and deletes or something.
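If it happens again, running the statement under EXPLAIN (using the big_table name from the earlier post) would show what plan it's actually using and how many pages it touches:

code:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM big_table LIMIT 1;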

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


PhantomOfTheCopier posted:

Also auto vacuum shouldn't be doing so much unless you have a large number of updates and deletes or something.

Like updating PII on 15 million rows?

Love Stole the Day
Nov 4, 2012
Please give me free quality professional advice so I can be a baby about it and insult you
Need some SQL help. Postgres. I'm using a table that contains id references to another table and I am making a query to replace the keys with their corresponding row data.

For the id keys themselves it's fine, but I need to use an array and form a similar array of matching row data as JSON objects... and that array of integer ids ( int[] ) is what's throwing me off. So far, I'm able to get the matching content itself, but something about how I'm forming my select clause is causing it to return many arrays of one element each rather than everything contained in a single array.

Here's the query:
code:

select ..., json_build_object(...) as director, array[json_build_object(...)] as cast
from
     (people p1
     inner join movies m1 on p1.id = m1.director_id
     left join ( 
          (select unnest(m2.cast_ids) cast_member from movies m2) t2 
          left join people p2 on p2.id = cast_member
     ) on p2.id = any(m2.cast_ids)

What this query returns is everything correctly, except that the "cast" array contains only one cast member's data as JSON, and instead of having all cast members for each movie, I have multiple copies of the same movie, each with a different cast member associated with it via the cast_ids array in the movies table.

I'm trying to figure out how to make all of the cast member objects go into that array so that we have one result for each movie with an array of all cast members based on the id references.

I hope this sufficiently explains my issue. I'm sure I'm probably doing this in a very stupid and inefficient way. Thanks for reading!

Love Stole the Day fucked around with this message at 22:59 on Sep 29, 2018

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!

Nth Doctor posted:

Like updating PII on 15 million rows?
Not really, since that's not much data, though it's possible we're not getting the whole story. If your rows are 32 8-byte fields, it's only 4GB of data to move. Sure there are indexes to consider, but that's a far cry from TBs, so you've got something else going on.

First, the single-transaction update is a concern, because you're required to keep an entire copy of the data in case of a crash or rollback. This is why they recommend dropping and rebuilding indexes, dumping and restoring, changing transaction isolation, shared table locking, creating temporary or secondary tables for the new data, adding new null columns and migrating, and so forth. Second, your nominal traffic is important. I remember the word 'staging' somewhere up there, but if you can't update a single field you should really be asking how you got in this situation in the first place and how you're going to avoid it on the master.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


PhantomOfTheCopier posted:

Not really, since that's not much data, though it's possible we're not getting the whole story. If your rows are 32 8-byte fields, it's only 4GB of data to move. Sure there are indexes to consider, but that's a far cry from TBs, so you've got something else going on.

First, the single-transaction update is a concern, because you're required to keep an entire copy of the data in case of a crash or rollback. This is why they recommend dropping and rebuilding indexes, dumping and restoring, changing transaction isolation, shared table locking, creating temporary or secondary tables for the new data, adding new null columns and migrating, and so forth. Second, your nominal traffic is important. I remember the word 'staging' somewhere up there, but if you can't update a single field you should really be asking how you got in this situation in the first place and how you're going to avoid it on the master.

I really appreciate a thorough answer in response to me being a dickish smartass, POTC. Cheers, buddy.

nem
Jan 4, 2003

panel.dev
apnscp: cPanel evolved
What are some recommended solutions for storing round-robin data in a database, preferably Postgres, indefinitely? I've got a crude setup that chunks bandwidth data into 5-minute buckets, then periodically averages it down to daily, weekly, and monthly as accounts age. It chokes as diversity scales, though, and I'd like to explore alternatives.

I've looked into InfluxDB. Pair it with Grafana, nice, and I can begin logging/displaying other server metrics, but it's another daemon to run. Alternatively, what about Timescale? It's a Postgres extension, nice, which partitions data automatically, double nice. With a little legwork it can also log server data too, so lend me some enlightenment. :unsmith:
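For what it's worth, the Timescale version of the 5-minute-bucket setup is pretty small (table and column names are made up; the extension has to be preloaded in postgresql.conf):

code:
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE bandwidth_samples (
  ts      timestamptz NOT NULL,
  account text        NOT NULL,
  bytes   bigint      NOT NULL
);

SELECT create_hypertable('bandwidth_samples', 'ts');

-- roll the 5-minute samples up into daily averages per account
SELECT time_bucket('1 day', ts) AS day, account, avg(bytes) AS avg_bytes
FROM bandwidth_samples
GROUP BY day, account
ORDER BY day;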

Pollyanna
Mar 5, 2005

Milk's on them.


PhantomOfTheCopier posted:

Not really, since that's not much data, though it's possible we're not getting the whole story. If your rows are 32 8-byte fields, it's only 4GB of data to move. Sure there are indexes to consider, but that's a far cry from TBs, so you've got something else going on.

First, the single-transaction update is a concern, because you're required to keep an entire copy of the data in case of a crash or rollback. This is why they recommend dropping and rebuilding indexes, dumping and restoring, changing transaction isolation, shared table locking, creating temporary or secondary tables for the new data, adding new null columns and migrating, and so forth. Second, your nominal traffic is important. I remember the word 'staging' somewhere up there, but if you can't update a single field you should really be asking how you got in this situation in the first place and how you're going to avoid it on the master.

They're pretty large rows, with a large number of columns, one of which is an hstore holding arbitrary JSON data. That may affect things. As for staging vs. master, we're looking to wipe PII data only on staging, so it's not a major concern - it's considered tech debt that has a deadline for us (demos n poo poo for customers).

Truncation is still probably going to be our best bet, and we also need to figure out what the gently caress is going on re: indexes. I suspect that a lot of our problems stem from those 16 indexes...at least one of which is redundant.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.
I have a Folding@Home table [user_hourly_summary] hosted on AWS Aurora using the MySQL 5.6 engine with the following columns:

-- username [varchar(100)] - team [int] - stamp [datetime] - score [bigint] - wu [bigint]


I am trying to write a query that will calculate the activity (net gain in work units and points) between a timestamp and the timestamp immediately prior for every timestamp for every user and insert it into a different table [user_hourly_activity]. A new set of data arrives every 60 minutes and inserts 1.8 million new rows into the database representing every user and their latest cumulative score and work units. The table user_hourly_summary has two indexes:

PRIMARY (UserName, team, stamp)
Index1 (username, team, stamp, score, wu)


This query will run immediately after the arrival of the latest batch of user data and needs to then calculate the latest user activity for the preceding 60 minutes.

This is what I have and it takes twenty minutes to run. I feel like there should be a more efficient way to run this than running three queries an hour over two million rows.

code:
INSERT INTO user_hourly_activity (username, team, stamp, pph, wuph)
	SELECT username, team, stamp, score - 
		(SELECT score FROM user_hourly_summary t2 
			WHERE t2.username = t1.username 
			AND t2.team = t1.team
			AND t2.stamp < t1.stamp
			ORDER BY t2.stamp DESC LIMIT 1) AS pph,
		wu-
		(SELECT wu from user_hourly_summary t3 
			WHERE t3.username = t1.username 
			AND t3.team = t1.team
			AND t3.stamp < t1.stamp
			ORDER BY t3.stamp DESC LIMIT 1) AS wuph
		FROM user_hourly_summary t1
Is there a better way of doing this?

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Agrikk posted:

I have a Folding@Home table [user_hourly_summary] hosted on AWS Aurora using the MySQL 5.6 engine with the following columns:

-- username [varchar(100)] - team [int] - stamp [datetime] - score [bigint] - wu [bigint]


I am trying to write a query that will calculate the activity (net gain in work units and points) between a timestamp and the timestamp immediately prior for every timestamp for every user and insert it into a different table [user_hourly_activity]. A new set of data arrives every 60 minutes and inserts 1.8 million new rows into the database representing every user and their latest cumulative score and work units. The table user_hourly_summary has two indexes:

PRIMARY (UserName, team, stamp)
Index1 (username, team, stamp, score, wu)


This query will run immediately after the arrival of the latest batch of user data and needs to then calculate the latest user activity for the preceding 60 minutes.

This is what I have and it takes twenty minutes to run. I feel like there should be a more efficient way to run this than running three queries an hour over two million rows.

code:
INSERT INTO user_hourly_activity (username, team, stamp, pph, wuph)
  SELECT username, team, stamp, score - 
    (SELECT score FROM user_hourly_summary t2 
      WHERE t2.username = t1.username 
      AND t2.team = t1.team
      AND t2.stamp < t1.stamp
      ORDER BY t2.stamp DESC LIMIT 1) AS pph,
    wu-
    (SELECT wu from user_hourly_summary t3 
      WHERE t3.username = t1.username 
      AND t3.team = t1.team
      AND t3.stamp < t1.stamp
      ORDER BY t3.stamp DESC LIMIT 1) AS wuph
    FROM user_hourly_summary t1
Is there a better way of doing this?

Subqueries aren't the greatest.

Here's a sketch just from noodling over it while my Outlook shits itself:
code:
SELECT
  t1.username,
  t1.team,
  t1.stamp,
  t1.score - t3.score AS pph,
  t1.wu - t3.wu AS wuph
FROM user_hourly_summary AS t1
LEFT OUTER JOIN (
  -- previous stamp per (username, team, stamp); a derived table can't be
  -- correlated with t1 in MySQL 5.6, so self-join and group instead
  SELECT
    a.username,
    a.team,
    a.stamp,
    MAX(b.stamp) AS prev_stamp
  FROM user_hourly_summary AS a
  JOIN user_hourly_summary AS b
    ON b.username = a.username
    AND b.team = a.team
    AND b.stamp < a.stamp
  GROUP BY a.username, a.team, a.stamp
) AS dt
  ON dt.username = t1.username
  AND dt.team = t1.team
  AND dt.stamp = t1.stamp
LEFT OUTER JOIN user_hourly_summary AS t3
  ON t3.username = dt.username
  AND t3.team = dt.team
  AND t3.stamp = dt.prev_stamp
 

I don't really like it, but MySQL didn't get window functions until version 8. What you really want here is the first lagging row from the latest for the user.

The derived table will hopefully be a little bit less stupid. If there is some way to know ahead of time which user/teams you are doing this query with, you can throw that list into the derived table query's WHERE clause (and the main query's WHERE clause too, for that matter) to limit the rows that you process.

That could also be a problem, come to think of it. Your query as presented may be recalculating the per-hour values EVERY TIME it runs for every row. Not just the latest ones.

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

Nth Doctor posted:

Subqueries aren't the greatest.

That could also be a problem, come to think of it. Your query as presented may be recalculating the per-hour values EVERY TIME it runs for every row. Not just the latest ones.

My plan is to perform the query once over the existing data and dump the results into an activity table and then run it hourly thereafter, limiting via a WHERE clause to the previous 1-2 hours.

But I'll see how your query performs. Thank you for it.


And yeah, I was bummed to find that MySQL didn't have a lag function like MS-SQL. :(

Jethro
Jun 1, 2000

I was raised on the dairy, Bitch!
Does the data that comes in each hour (potentially) contain multiple records per user, or is this just one record per user per hour? If it's just one record per user per hour, do all the new records that come in have the same timestamp, or are all the timestamps different?

Is user_hourly_summary truncated each hour before the data comes in, or does it have all the data that has ever come in? If all the data stays in there, is there a where clause that you didn't show us that prevents you from recalculating all the user activity that you calculated in the previous hours?

If you're seeing multiple timestamps per hour and you're already only calculating based on the new data that came in, then you're probably already as good as you're going to get.
