|
Jethro posted:Does the data that comes in each hour (potentially) contain multiple records per user, or is this just one record per user per hour? If it's just one record per user per hour, do all the new records that come in have the same timestamp, or are all the timestamps different? The data comes in a batch each hour. Each user appears only once and the timestamp is the same for all users. user_hourly_summary contains all the data for each user forever and yes, there will be a where clause that processes the prior two updates (for thoroughness: the two-hours-ago update should already be calculated, but sometimes it's missed due to patching or whatever): code:
[1] pull data from web site -> [2] push data into user_current_stats -> [3] push data into user_hourly_summary -> [4] calculate data for user_hourly_activity

1. Raw user data is pulled from the folding@home web site and some processing is done to clean up the data.
2. The cleaned data gets pushed into user_current_stats and is tagged with the timestamp of the download time. This table is truncated before every update. It contains only the latest data available, to speed up the summary page of the web site and various back-end calculations.
3. User data then gets pushed into user_hourly_summary, which contains all of the data for each user for each hour, forever.
4. user_hourly_activity gets calculated based on the most recent two timestamps in user_hourly_summary.

If there is a better workflow for this, I'm all ears. I'm more than happy to rebuild this database and reload the data however makes updates fast and the web page as responsive as possible. Also: https://folding.rainwalk.net/index.php is my web page currently running the data on a pair of SQL Server 2016 databases. These databases are independently processing data and are running behind a load balancer on pretty beefy AWS EC2 instances. I'm messing around with AWS Aurora (MySQL 5.6) and I want to see if I can make Aurora as responsive while adding resilience. Agrikk fucked around with this message at 17:26 on Oct 1, 2018 |
# ? Oct 1, 2018 17:12 |
|
|
If you have the option, why not go with PostgreSQL in RDS and get the benefits of window functions (and the query optimizer)?
|
# ? Oct 2, 2018 00:28 |
|
PhantomOfTheCopier posted:If you have the option, why not go with PostgreSQL in RDS and get the benefits of window functions (and the query optimizer)? That’s for another day. I’ve been doing a tour of AWS databases and Aurora MySQL 5.6 is current on my list. Aurora Postgres is next, followed by the RDS versions of MySQL, Postgres and Oracle.
|
# ? Oct 2, 2018 01:13 |
|
PhantomOfTheCopier posted:If you have the option, why not go with PostgreSQL in RDS and get the benefits of window functions (and the query optimizer)? Edit: derp not me
|
# ? Oct 2, 2018 01:14 |
|
Well, if this is the stupid question thread, this is probably where I belong. I'm learning SQL (and VBA... guess why) for my job, and I am basically having to teach myself. My current question is - is there a really good book or site or blog post that just explains what the gently caress joins are *for* and why/when you use them? I've read a bunch of MS support articles and I am not loving getting it. "This is how to use a join" I am drowning in, "This is why joins are cool and useful and why you should know and use them" I am absolutely not finding. Also... Is there a resource for Access 2013/16 that's better than the others? Book, site, whatever besides a real class? I don't get a choice about using Access, but I would like to be as good as I can be at it.
|
# ? Oct 3, 2018 04:46 |
|
Remora posted:Well, if this is the stupid question thread, this is probably where I belong. Basically, you have one table full of data, and you need to correlate that data with data in another table. Let's say you've got the following tables: code:
code:
SQL code:
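The quoted code blocks didn't survive the archive, so here's a stand-in sketch (SQLite via Python; the Threads/Posts schema is invented for illustration) of what a join buys you: each post row carries only a thread id, and the join pulls the thread's own data in to sit next to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Threads (ThreadId INTEGER PRIMARY KEY, Title TEXT)")
cur.execute("CREATE TABLE Posts (PostId INTEGER PRIMARY KEY, ParentThread INTEGER, Body TEXT)")
cur.execute("INSERT INTO Threads VALUES (1, 'SQL questions'), (2, 'VBA questions')")
cur.executemany("INSERT INTO Posts VALUES (?, ?, ?)",
                [(10, 1, 'first'), (11, 1, 'second'), (12, 2, 'third')])

# The title lives only in Threads; the join attaches it to every post row.
rows = cur.execute("""
    SELECT t.Title, p.Body
    FROM Posts p
    JOIN Threads t ON t.ThreadId = p.ParentThread
    ORDER BY p.PostId
""").fetchall()
```

No single table holds both the title and the post body, which is the whole point: the join recombines data that normalization split apart.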
anthonypants fucked around with this message at 07:25 on Oct 3, 2018 |
# ? Oct 3, 2018 07:21 |
|
I suppose what I'm not getting is - and I haven't watched the videos yet, phoneposting before I leave for work - that's structured very similarly to a lot of examples I've seen, and I don't see why you wouldn't just SELECT FROM [Posts] WHERE [ParentThread]=whatever. The threads table doesn't look like it has anything you even want.
|
# ? Oct 3, 2018 11:13 |
|
Well, what if you had a ReceiptHeader and a ReceiptRow table? The ReceiptHeader would tell you which store the receipt is from, the date of purchase, name of the seller etc. Each individual ReceiptRow would tell you which product you've bought and how much it cost, for a particular receipt. Does that make any more sense?
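A quick sketch of that header/row shape (SQLite via Python; table names as in the post, columns invented): the join is what reunites each line item with the store and date that live only on the header.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ReceiptHeader (ReceiptId INTEGER PRIMARY KEY, Store TEXT, PurchaseDate TEXT)")
cur.execute("CREATE TABLE ReceiptRow (RowId INTEGER PRIMARY KEY, ReceiptId INTEGER, Product TEXT, Price REAL)")
cur.execute("INSERT INTO ReceiptHeader VALUES (1, 'Downtown', '2018-10-03')")
cur.executemany("INSERT INTO ReceiptRow VALUES (?, ?, ?, ?)",
                [(1, 1, 'milk', 2.50), (2, 1, 'bread', 1.75)])

# One header fans out to many line items; the join stitches them back together.
rows = cur.execute("""
    SELECT h.Store, r.Product, r.Price
    FROM ReceiptHeader h
    JOIN ReceiptRow r ON r.ReceiptId = h.ReceiptId
    ORDER BY r.RowId
""").fetchall()
```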
Tax Oddity fucked around with this message at 13:26 on Oct 3, 2018 |
# ? Oct 3, 2018 13:24 |
|
Remora posted:I suppose what I'm not getting is - and I haven't watched the videos yet, phoneposting before I leave for work - that's structured very similarly to a lot of examples I've seen, and I don't see why you wouldn't just SELECT FROM [Posts] WHERE [ParentThread]=whatever. The threads table doesn't look like it has anything you even want. Let's say you were selecting all posts in all threads by a particular author. In this case, you need to join Thread table to Post table to accomplish this because you're working with data in two separate tables. SQL code:
Cold on a Cob fucked around with this message at 13:31 on Oct 3, 2018 |
# ? Oct 3, 2018 13:29 |
|
Remora posted:I suppose what I'm not getting is - and I haven't watched the videos yet, phoneposting before I leave for work - that's structured very similarly to a lot of examples I've seen, and I don't see why you wouldn't just SELECT FROM [Posts] WHERE [ParentThread]=whatever. The threads table doesn't look like it has anything you even want. SQL code:
|
# ? Oct 3, 2018 14:49 |
|
Oh, cool, okay, that makes a lot more sense. Sorry for being dense.
|
# ? Oct 3, 2018 16:58 |
|
Agrikk posted:The data comes in a batch each hour. Each user appears only once and the timestamp is the same for all users. user_hourly_summary contains all the data for each user forever and yes, there will be a where clause that processes the prior two updates (for thoroughness: the two-hours-ago update should already be calculated but sometimes it's missed due to patching or whatever): I ended up changing the workflow slightly to get a better-performing query that represents a lag function: I push the data into current stats, then do a join using MAX(stamp) against the hourly_summary table. THEN I dump the current stats into the summary table. The processing of data end to end takes four minutes instead of twenty. Thanks all for your help!
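For anyone following along, a minimal sketch of that reworked step (SQLite via Python; the real schema has more columns): join the fresh batch against MAX(stamp) in the summary table to get the per-hour delta, and only then append the batch to the summary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user_current_stats (UserName TEXT, Score INTEGER, stamp TEXT)")
cur.execute("CREATE TABLE user_hourly_summary (UserName TEXT, Score INTEGER, stamp TEXT)")

# Previous hour is already in the summary table...
cur.execute("INSERT INTO user_hourly_summary VALUES ('alice', 100, '2018-10-03 21:00'), ('bob', 40, '2018-10-03 21:00')")
# ...and the fresh batch sits in current stats.
cur.execute("INSERT INTO user_current_stats VALUES ('alice', 130, '2018-10-03 22:00'), ('bob', 55, '2018-10-03 22:00')")

# Lag-style delta: compare the batch against MAX(stamp) in the summary
# table BEFORE the batch is appended to it.
rows = cur.execute("""
    SELECT c.UserName, c.Score - s.Score AS PointsPerHour
    FROM user_current_stats c
    JOIN user_hourly_summary s
      ON s.UserName = c.UserName
     AND s.stamp = (SELECT MAX(stamp) FROM user_hourly_summary)
    ORDER BY c.UserName
""").fetchall()

# Only now dump current stats into the summary table.
cur.execute("INSERT INTO user_hourly_summary SELECT * FROM user_current_stats")
```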
|
# ? Oct 3, 2018 22:39 |
|
We seem to be getting much slower performance using an inner join than a theta join using Oracle. i.e. "SELECT a.uid from TableA a, TableB b WHERE b.parentFk = a.uid AND b.someVal = '1234' order by a.uid asc" runs much faster than "SELECT a.uid from TableA a INNER JOIN TableB b ON b.parentFk = a.uid WHERE b.someVal = '1234' group by a.uid". Is there some intuitive reason why theta joins should be faster? You would think they would be slower, given that theta joins can include arbitrary comparisons...
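For what it's worth, the two syntaxes describe the same inner join, so any speed gap is a plan/optimizer artifact (the stray GROUP BY aside). A sketch showing the equivalence on toy data (SQLite via Python; table contents invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE TableA (uid INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE TableB (id INTEGER PRIMARY KEY, parentFk INTEGER, someVal TEXT)")
cur.execute("INSERT INTO TableA VALUES (1), (2), (3)")
cur.execute("INSERT INTO TableB VALUES (1, 1, '1234'), (2, 2, '999'), (3, 3, '1234')")

# Old-style comma ("theta") join with the predicate in WHERE...
old_style = cur.execute("""
    SELECT a.uid FROM TableA a, TableB b
    WHERE b.parentFk = a.uid AND b.someVal = '1234'
    ORDER BY a.uid
""").fetchall()

# ...and the ANSI INNER JOIN spelling of the identical query.
ansi_style = cur.execute("""
    SELECT a.uid FROM TableA a
    INNER JOIN TableB b ON b.parentFk = a.uid
    WHERE b.someVal = '1234'
    ORDER BY a.uid
""").fetchall()
```

Both return the same rows; if Oracle picks different plans for them, the comparison tooling (tkprof, EXPLAIN PLAN) is where to look, not the join syntax.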
|
# ? Oct 4, 2018 00:31 |
|
Paul MaudDib posted:We seem to be getting much slower performance using an inner join than a theta join using Oracle. i.e. The... gently caress?
|
# ? Oct 4, 2018 01:41 |
|
Nth Doctor posted:The... gently caress? there's a reason why oracle is named after priestesses who huffed fumes in a temple before giving predictions. don't think about it too hard, if oracle does weird poo poo just run tkprof and figure out if something is indexed that shouldn't be or vice versa.
|
# ? Oct 4, 2018 01:57 |
|
Paul MaudDib posted:We seem to be getting much slower performance using an inner join than a theta join using Oracle. i.e. This may be just a typo, but you have an ORDER BY in the first and a GROUP BY in the second.
|
# ? Oct 4, 2018 07:42 |
|
I assume Oracle just McGlockenshire fucked around with this message at 08:56 on Oct 4, 2018 |
# ? Oct 4, 2018 08:40 |
|
Paul MaudDib posted:We seem to be getting much slower performance using an inner join than a theta join using Oracle. i.e. My guess would be that the engine is giving more deference to your table order in the explicit join query than in the other one, and that is causing the difference. See what happens with SQL code:
Jethro fucked around with this message at 14:36 on Oct 4, 2018 |
# ? Oct 4, 2018 14:31 |
|
hi all, I have a weird Microsoft SQL question. I have a view where, when I do a SELECT TOP (1000), or even just TOP 1, it will immediately spike the server CPU usage to 100% and never finish. However, if I right-click and ALTER TO, but then get rid of the ALTER portion of the query and just run it open, the view returns 65000 rows in about 45 seconds. Any obvious reason why that would occur? Is there something I can do to troubleshoot the process a bit? I'm pretty new to SQL so hopefully I'm just missing something obvious
|
# ? Oct 5, 2018 18:29 |
|
What are you using for sort criteria? e.g. SELECT TOP 1000 * FROM TABLE ORDER BY (indexed int) will operate very differently than ORDER BY (nvarchar). If you're not providing criteria, I think it'll use the first column
|
# ? Oct 5, 2018 18:40 |
|
I have a partitioning question: how to partition? I am loading data from a CSV into an Aurora MySQL 5.6 table set up like so:

user_daily_summary
- stamp (DATETIME)
- UserName (VARCHAR(100))
- Score (BIGINT)
- WU (BIGINT)
- team (BIGINT)

with a primary key on (UserName, Team, Stamp). Every day, two million rows are added to the table from a CSV. The CSV contains one row per user/team-number combo (a user name must be unique within a team, but the same username can exist in multiple teams) and the whole batch is timestamped with the time the file is loaded. A second table is then calculated from the first such that the values for the previous stamp are subtracted from the values of the most current stamp to get the activity for the last day:

user_daily_activity
- stamp (DATETIME)
- UserName (VARCHAR(100))
- PointsPerDay (BIGINT)
- WUperDay (BIGINT)
- team (BIGINT)

with a primary key on (UserName, Team, Stamp). With all that said: I am doing an initial load of data consisting of roughly 1900 files representing the previous 1900 days of data. Each file contains roughly 1.6 million lines of data. I'm iterating over the list of files and loading the data one file at a time, and the job is bogging down like crazy: it kicks off loading 20 days per hour, but quickly bogs down to 1-2 days loaded per hour. Loading 1900 files at 2 files per hour is unacceptable, because ongoing data loads will take too long. So now all of this circles back to the basic question: how do I speed up the data load? Is partitioning the way to go here? Since users will be queried by username (and pulling all data by user), partitioning on username seems to make sense. But what method? HASH or LINEAR HASH on UserName? KEY on UserName/Team? KEY on UserName/Team/Stamp?
|
# ? Oct 6, 2018 17:03 |
|
How are you loading the data? Is it batched into transactions? For reading, the best thing to do is usually to put an index on those two selection criteria if you haven't done it yet.
|
# ? Oct 6, 2018 17:54 |
|
rt4 posted:How are you loading the data? Is it batched into transactions? I'm loading using INSERT INTO... SELECT FROM... from a temp table on which I do some initial processing. I'm doing the initial load into the temp table by the Aurora LOAD FROM S3: code:
|
# ? Oct 6, 2018 19:52 |
|
For bulk loading, in almost everything you're best off stripping all the indexing and rebuilding it afterwards: building 2 giant indices is vastly faster than keeping 2 b-trees balanced while you keep writing. The cost of constantly rebalancing them is what's causing that fall-off in speed; early on they're still small and simple. What problem is your second index meant to solve? An index that holds a subset of columns that completely fulfils a known query can be useful to stop you having to read the row after the index lookup, but it will be murder on your writes. B-tree indices are normally there to let you get to the set of rows you want efficiently, so what are your expected access paths onto the data once it's all done? Most recent row for a customer? Today's data for all customers? All data for a customer? This should set what your leading keys are.
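The load-then-index pattern looks like this in miniature (SQLite via Python; table and index names echo the thread but are otherwise made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user_daily_summary (stamp TEXT, UserName TEXT, Score INTEGER, team INTEGER)")

# Bulk load with no secondary indexes in place, so no b-tree maintenance per row.
cur.executemany("INSERT INTO user_daily_summary VALUES (?, ?, ?, ?)",
                [("2018-10-06", f"user{i}", i, i % 5) for i in range(1000)])

# Build the index once, after the load, in a single pass over sorted data.
cur.execute("CREATE INDEX ix_user_team ON user_daily_summary (UserName, team)")

count = cur.execute("SELECT COUNT(*) FROM user_daily_summary").fetchone()[0]
index_names = [row[1] for row in cur.execute("PRAGMA index_list('user_daily_summary')")]
```

For the ongoing hourly/daily loads you'd keep the index, of course; the drop-and-rebuild trick pays off on the 1900-file backfill.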
|
# ? Oct 6, 2018 21:58 |
|
(This is Postgres) I am using an array of foreign keys and creating a json object based on its corresponding rows to the reference table.code:
Whenever I try to do this the intuitive way, I run into endless "undefined value" problems due to context. I've been trying to untangle this for several days now and am having trouble getting it done. I wasn't able to find a way to make use of a `left join lateral` and I wasn't able to find a way to shoe-horn in the necessary condition checks anywhere. So I'm thinking I might need to just scrap all of this and start over with a new approach. I'm hoping someone here can point out my blind spot for me. As you can see where I marked the issue, I am able to manually control which movie's cast data is returned, but I can't seem to figure out how to make it work generally, so that each row will yield based on the given `movie` table's `cast_ids`. If I remove the line that I labeled with the above arrow, then for every movie I will get the entire `people` table, as if everybody in the entire table were involved in each movie. So naturally I'm trying to figure out how to make this WHERE clause filter out the non-relevant things. Love Stole the Day fucked around with this message at 04:43 on Oct 8, 2018 |
# ? Oct 8, 2018 04:40 |
|
Love Stole the Day posted:(This is Postgres) I am using an array of foreign keys and creating a json object based on its corresponding rows to the reference table. If all the data you are interested in is from the people table, why is that the outer table in the join? I guess in theory every t1.c_id should have a corresponding p1.id, so it shouldn't make that much of a difference, but it certainly looks funny. You filter out null cast_ids, but are you sure none of the entries in the array are null (I have never used Postgres or array columns, so I don't know if that's even a sensible question)? If I want "records from one table where the id is in another table", I write a query using an IN SQL code:
|
# ? Oct 8, 2018 14:36 |
|
There's no movie id in the output rows, nor any grouping by the movie id, so without a constraint you'd get a list of people ids that appear in any movie. Start by getting (movieid,peopleid) rows from your cast unnest, join that to people names to get (movieid, peopleid, name) duplicates and nulls allowed. jsonify.
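That pipeline -- (movieid, peopleid) pairs, then names, then JSON -- looks like this in SQLite (via Python; SQLite has no array columns, so a normalized movie_cast table stands in for the unnested cast_ids, and json_group_array stands in for Postgres's JSON aggregation; all names and ids invented):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
# Stands in for unnest(cast_ids): one row per (movie, cast member) pair.
cur.execute("CREATE TABLE movie_cast (movie_id INTEGER, person_id INTEGER)")
cur.execute("INSERT INTO movie VALUES (1, 'Alien'), (2, 'Heat')")
cur.execute("INSERT INTO people VALUES (10, 'Weaver'), (11, 'Pacino'), (12, 'De Niro')")
cur.executemany("INSERT INTO movie_cast VALUES (?, ?)", [(1, 10), (2, 11), (2, 12)])

# Pairs -> join names -> aggregate to one JSON array per movie.
rows = cur.execute("""
    SELECT m.title, json_group_array(p.name)
    FROM movie m
    JOIN movie_cast c ON c.movie_id = m.id
    JOIN people p ON p.id = c.person_id
    GROUP BY m.id
    ORDER BY m.id
""").fetchall()
cast_lists = [(title, sorted(json.loads(names))) for title, names in rows]
```

The GROUP BY on the movie id is the piece the original query was missing: it scopes the aggregation so each movie only collects its own cast.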
|
# ? Oct 8, 2018 23:32 |
|
This is the source table: This is what I want: Basically, for each row of Thing ID, I want a list of Group IDs that the Thing ID is not in. I'm having trouble figuring out how to make this happen. Any ideas?
|
# ? Oct 9, 2018 07:31 |
|
How large is the data set? Something like this will work but it won't be fast and it won't scale well. You can replace the inner SELECT DISTINCTs if you have other canonical sources of your Things and Groups, but it's the string concat of all of the data that will really kill you here. Cloudflare has decided that posting derived tables in the FROM clause is forbidden, so sorry about the SEL ECTs. code:
|
# ? Oct 9, 2018 08:56 |
|
Has anyone used TeamSQL here (for MSSQL)? I grabbed it on Friday but it asked for JDBC, which I grudgingly installed Java for, and now it can't connect even to my local instance of MSSQL, let alone the real one I want to work with. Good app, undercooked (my impression), total trash?
|
# ? Oct 9, 2018 16:50 |
|
If you want something lighter than SSMS, Microsoft also makes SQL Operations Studio, which they recently renamed to Azure Data Studio. It's based on VS Code, but it's got some neat SQL gimmicks in it.
|
# ? Oct 9, 2018 18:20 |
Lighter in what way? SSMS is very quick. What are people missing from it to try other clients?
|
|
# ? Oct 9, 2018 18:28 |
|
I was hoping to try the share/collaboration stuff since we have a bunch of one off scripts to find things that aren't in the interface/presented to end users
|
# ? Oct 9, 2018 18:34 |
|
Continuing my theme of embarrassing questions. I have two tables, WpnSalesGenStorage and CourierGenStorage. code:
Sorry if the formatting sucks, I am phoneposting. If it matters, there's 48 possible values for the warehouse fields (they both were randomly generated from the same table), and FullNames are all guaranteed to be unique values. There are other fields in each table but I don't think they matter. What I want to do is check the AssignedWarehouse of each row of WpnSalesGenStorage, pull a list of FullName values that have that AssignedWarehouse value as Warehouse, randomly choose one of those FullNames, and update the AssignedCourier field with that randomly selected FullName value. I kind of assume if I can figure out how to do this for one row, I can figure out how to loop it until it does all rows where AssignedCourier=NULL. I've done that before. I think I have the UPDATE logic to select the single row, but I'm unsure how to do the rest of the query. I can do pieces in isolation - for example, I have a customized GetRandomInteger(LowLimit,HighLimit) function that will return a decent random number and I've used it to randomly select records before in this DB. Not sure how that's going to help here, though. code:
After this, I'm guessing at best. I'm guessing it's going to be a fairly involved subquery (at least by my lovely baby standards) and it'll run like poo poo, but I should only have to do it once so performance is not a huge deal. Do I want to do some kind of join on the tables where p.AssignedWarehouse = t.Warehouse, or something? If this thread is not really for crap this simple, I can try Stack Exchange or something, but I wanted to try to keep my shame to a single location. TIA for any advice or proverbs you can offer.
|
# ? Oct 9, 2018 19:11 |
|
Remora posted:words this is stupid and clunky but it should work code:
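Since the code block above got eaten by the archive, here's one way to express the whole thing as a single set-based UPDATE instead of a loop (SQLite via Python; a correlated subquery with ORDER BY RANDOM() picks a courier from the matching warehouse -- sample data invented, and in SQL Server you'd use NEWID() instead of RANDOM()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE WpnSalesGenStorage (Id INTEGER PRIMARY KEY, AssignedWarehouse TEXT, AssignedCourier TEXT)")
cur.execute("CREATE TABLE CourierGenStorage (FullName TEXT, Warehouse TEXT)")
cur.executemany("INSERT INTO CourierGenStorage VALUES (?, ?)",
                [("Ann Adams", "Alpha"), ("Al Ames", "Alpha"), ("Bea Brown", "Beta")])
cur.executemany("INSERT INTO WpnSalesGenStorage VALUES (?, ?, NULL)",
                [(1, "Alpha"), (2, "Beta")])

# For each unassigned sale, pick one random courier from the matching warehouse.
cur.execute("""
    UPDATE WpnSalesGenStorage
    SET AssignedCourier = (
        SELECT c.FullName FROM CourierGenStorage c
        WHERE c.Warehouse = WpnSalesGenStorage.AssignedWarehouse
        ORDER BY RANDOM() LIMIT 1
    )
    WHERE AssignedCourier IS NULL
""")

# Every sale should now have a courier stationed at its own warehouse.
matched = cur.execute("""
    SELECT COUNT(*) FROM WpnSalesGenStorage w
    JOIN CourierGenStorage c
      ON c.FullName = w.AssignedCourier AND c.Warehouse = w.AssignedWarehouse
""").fetchone()[0]
```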
|
# ? Oct 9, 2018 20:02 |
|
Vegetable posted:This is the source table:
|
# ? Oct 10, 2018 00:43 |
|
kumba posted:this is stupid and clunky but it should work Thanks a million. Dissecting that was very educational.
|
# ? Oct 11, 2018 12:30 |
|
Vegetable posted:This is the source table: I hope your list of IDs isn't just stored in your table. Assuming you're normalized appropriately...

SELECT T.ThingId, G.GroupId
FROM Things T
CROSS JOIN Groups G
LEFT JOIN ThingGroups TG
  ON TG.GroupId = G.GroupId
 AND TG.ThingId = T.ThingId
WHERE TG.ThingId IS NULL

Then aggregate or whatever. The cross join gets you every combination of thing and group possible, and the left join and where will filter out all existing matches.
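A runnable toy version of that anti-join (SQLite via Python; the Groups table is renamed GroupList here since GROUPS is a reserved word in some engines, and all ids are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Things (ThingId INTEGER)")
cur.execute("CREATE TABLE GroupList (GroupId INTEGER)")
cur.execute("CREATE TABLE ThingGroups (ThingId INTEGER, GroupId INTEGER)")
cur.execute("INSERT INTO Things VALUES (1), (2)")
cur.execute("INSERT INTO GroupList VALUES (101), (102), (103)")
cur.executemany("INSERT INTO ThingGroups VALUES (?, ?)", [(1, 101), (1, 102), (2, 101)])

# CROSS JOIN builds every (thing, group) combination; the LEFT JOIN +
# IS NULL filter keeps only the combinations with no existing membership.
rows = cur.execute("""
    SELECT T.ThingId, G.GroupId
    FROM Things T
    CROSS JOIN GroupList G
    LEFT JOIN ThingGroups TG
      ON TG.GroupId = G.GroupId AND TG.ThingId = T.ThingId
    WHERE TG.ThingId IS NULL
    ORDER BY T.ThingId, G.GroupId
""").fetchall()
```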
|
# ? Oct 11, 2018 14:25 |
|
What steps should I take to try to optimize this query? The execution plan shows that everything is either an index seek or a primary key lookup except for a single sort triggered by the one order by clause. The sort takes up 1/3 of the total query cost so that seems like a good place to start, but neither I nor anyone else at work knows a good way to speed it up.code:
I'm pretty new at this so any advice is appreciated. Edit: GoogleDrive seems to think the execution plan is an audio file but if you download it and open it, it behaves correctly. Here's a jpg of the same thing. LLSix fucked around with this message at 15:21 on Oct 11, 2018 |
# ? Oct 11, 2018 15:05 |
|
|
An index covering all the fields of XLink you're using might help a bit by eliminating that key lookup.
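The same trick is easy to see in SQLite (via Python; XLink's columns here are invented stand-ins): once an index covers every column the query touches, the plan reports a covering index and the per-row lookup back to the base table disappears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE XLink (Id INTEGER PRIMARY KEY, ParentId INTEGER, ChildId INTEGER)")
cur.execute("INSERT INTO XLink VALUES (1, 1, 10), (2, 1, 11), (3, 2, 12)")

# Index covers both the filter column and the selected column...
cur.execute("CREATE INDEX ix_xlink_cover ON XLink (ParentId, ChildId)")

# ...so the query never has to touch the base table at all.
plan = cur.execute("EXPLAIN QUERY PLAN SELECT ChildId FROM XLink WHERE ParentId = 1").fetchall()
covering = any("COVERING INDEX" in str(row[-1]) for row in plan)
```

SQL Server's equivalent is an index on the seek column with the remaining queried columns as key or INCLUDE columns, which removes the key lookup from the plan.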
|
# ? Oct 11, 2018 18:03 |