|
Paul MaudDib posted:cool, that works great, thanks! I just learned a new thing you could do with regexes. As a formal methods, theory, and regex nerd: one of us! One of us! One of us! We accept you! We accept you!
|
# ? Nov 7, 2019 18:31 |
|
|
Now, don't be greedy
|
|
# ? Nov 8, 2019 01:14 |
|
Nth Doctor posted:As a formal methods, theory, and regex nerd: one of us! One of us! One of us! We accept you! We accept you! I did stack machines and regular languages and all that stuff back in college, but we never really got into some of the more esoteric features of regexes. I’m under the impression that some of those actually take regexes past the power of a regular language into context sensitive languages, if I’m remembering the terms right.
|
# ? Nov 8, 2019 06:20 |
Paul MaudDib posted:I did stack machines and regular languages and all that stuff back in college, but we never really got into some of the more esoteric features of regexes. I’m under the impression that some of those actually take regexes past the power of a regular language into context sensitive languages, if I’m remembering the terms right. Yes, basically any feature for lookahead or lookbehind assertions that doesn't consume input is outside regular-language power, at least as a rule of thumb. Some regex implementations may degrade from an FSM implementation with O(n) time to match, where n is the length of the haystack, to a much slower backtracking algorithm that can be quadratic or worse in the length of the input plus the complexity of the expression. All of this is from memory, might have some details wrong.
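A quick sketch of what a zero-width lookahead does, in Python's re (the strings here are just made-up examples):

```python
import re

# Zero-width lookahead: match "foo" only when "bar" comes next,
# without consuming "bar" -- the match itself is still just "foo".
hit = re.search(r"foo(?=bar)", "xfoobarx")
miss = re.search(r"foo(?=bar)", "xfoobazx")
```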
|
|
# ? Nov 8, 2019 07:42 |
|
nielsm posted:Yes basically any features for look ahead or look behind assertions that don't consume input are outside regular language powers. At least as rule of thumb. Some regex implementations might degrade from a FSM implementation with O(n) time to match, where n is length of the haystack, to a much slower backtracking algorithm that can be quadratic in length of input plus complexity of the expression. I think those features are fine. They're basically easier to express with an NFA but NFAs can be transformed into DFAs with only a little bit of brain-melting math and acceptance classes. Some of the weirder stuff you can do with advanced capturing and rematching seems to be where you jump the boundary of finite automata into a higher order machine.
|
# ? Nov 8, 2019 14:15 |
|
Backreferences and recursion are the excitements you seek.
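For instance, a single backreference in Python's re already matches the classic non-regular language:

```python
import re

# (\w+)\1 demands the second half equal the first capture. The set of
# doubled strings ("ww") is the textbook non-regular language, so this
# one feature already needs more than a finite automaton.
doubled = re.fullmatch(r"(\w+)\1", "abcabc")
not_doubled = re.fullmatch(r"(\w+)\1", "abcabd")
```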
|
# ? Nov 8, 2019 15:27 |
|
I've got a table that looks something like this. code:
START_DATE: the date the loan began
TERM: the length of the loan in months
MONTHLY: the amount paid each month for the duration of the term
Note: The first payment for each loan is due the following month. I need help generating a report that calculates the total monthly income for each month for the past 12 months. I believe for the sample table above, the results would look something like this: code:
|
# ? Nov 12, 2019 20:37 |
|
The precise details depend on your flavor of SQL, but I'd say you do a view that extrapolates the monthly payments from your first table (probably by joining with something that acts as an integer series generator rejiggered to create rows for individual months), then do a second view based on that to create the sums and filter for the months you want.
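Something like this, sketched against SQLite from Python (a recursive CTE stands in for the integer series generator; the table and column names are invented, not the poster's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE loans (start_date TEXT, term INTEGER, monthly REAL);
INSERT INTO loans VALUES ('2019-01-15', 3, 100.0), ('2019-02-01', 2, 50.0);
""")

# Expand each loan into one row per payment month (the first payment
# lands the month *after* start_date), then sum income per month. The
# recursive CTE plays the role of the integer series generator.
rows = con.execute("""
    WITH RECURSIVE n(i) AS (
        SELECT 1
        UNION ALL
        SELECT i + 1 FROM n WHERE i < (SELECT MAX(term) FROM loans)
    )
    SELECT strftime('%Y-%m', start_date, '+' || i || ' months') AS pay_month,
           SUM(monthly) AS income
    FROM loans JOIN n ON n.i <= loans.term
    GROUP BY pay_month
    ORDER BY pay_month
""").fetchall()
```

One caveat: naive month arithmetic misbehaves for start dates on the 29th-31st, so a real version would want to normalize to the first of the month before adding.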
|
# ? Nov 12, 2019 20:54 |
|
Batterypowered7 posted:I've got a table that looks something like this. It's a hack, but trying to generate a sequence of months between a start and end date kinda sucks in SQL. Here's a fiddle where I messed around a bunch: http://sqlfiddle.com/#!18/02fb2/3
|
# ? Nov 12, 2019 21:45 |
|
Nth Doctor posted:trying to generate a sequence of months between a start and end date kinda sucks in SQL. Not when you got generate_series and range types code:
|
# ? Nov 12, 2019 22:38 |
|
Nth Doctor posted:It's a hack, but trying to generate a sequence of months between a start and end date kinda sucks in SQL. Pardot posted:Not when you got generate_series and range types Thank you very much, you two. I'm using Postgresql, so I was able to modify the second fiddle ever so slightly so that the series start date wasn't hard coded. You guys were a great help!
|
# ? Nov 12, 2019 23:08 |
|
Pardot posted:Not when you got generate_series and range types Batterypowered7 posted:Thank you very much, you two. I'm using Postgresql, so I was able to modify the second fiddle ever so slightly so that the series start date wasn't hard coded. You guys were a great help! Glad you got it!
|
# ? Nov 12, 2019 23:16 |
|
Happiness Commando posted:I do mostly traditional sysadmin stuff for a fintech company and work is willing to pay for some SQL classes for me. We use MS SQL server in house and are looking at Aurora MySQL, although I'm at select * from foo level, so there shouldn't be anything system specific in whatever it is that I learn. Check out Brent Ozar's Black Friday sale and his First Responder Kit (free)
|
# ? Nov 16, 2019 00:45 |
|
Hey SQL gang, I'm running up against something that I feel should be simple but I can't figure it out and I don't even really know where to start. Here's my sqlfiddle: http://sqlfiddle.com/#!9/622103/4 Given my data set here, and the query, what I'd like is to get back a row for each user, their "coins_count" at the first date, their coins_count for the second date, and ideally the dates of the two records it chose. So for example given my query there I'd like to get back something like code:
Can anyone point me down the right road here? I feel like what I basically want is to somehow select/join the first and last coin_count in the given date range for each user, but I can't figure out how to do that. So then I'm thinking I probably want a subquery, but I'm not sure how to link back the coin_counts with the users. This is a pretty simplified version of my actual data model but I think if I can figure out how to do this I can solve my larger problem. Really appreciate any advice here! edit: lopped the first line off my post with the link to the fiddle prom candy fucked around with this message at 22:58 on Dec 12, 2019 |
# ? Dec 12, 2019 22:35 |
|
prom candy posted:stuff Probably a better way to do it, but quick and dirty and works code:
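If your database has window functions (SQLite 3.25+, Postgres, SQL Server; the MySQL 5.x on that fiddle doesn't, which is why the subquery route is needed there), the first-and-last-row-per-user shape can also be done in one pass. A sketch with invented names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE coin_counts (user_id INTEGER, counted_on TEXT, coins_count INTEGER);
INSERT INTO coin_counts VALUES
    (1, '2019-12-01', 10), (1, '2019-12-05', 25), (1, '2019-12-10', 40),
    (2, '2019-12-02', 5),  (2, '2019-12-09', 8);
""")

# Rank each user's rows by date in both directions, then stitch the
# first and last observation per user onto one output row.
rows = con.execute("""
    WITH ranked AS (
        SELECT user_id, counted_on, coins_count,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY counted_on ASC)  AS rn_first,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY counted_on DESC) AS rn_last
        FROM coin_counts
    )
    SELECT f.user_id, f.counted_on, f.coins_count, l.counted_on, l.coins_count
    FROM ranked AS f
    JOIN ranked AS l ON l.user_id = f.user_id AND l.rn_last = 1
    WHERE f.rn_first = 1
    ORDER BY f.user_id
""").fetchall()
```

A date-range filter would go in a WHERE clause inside the CTE, before the ranking.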
|
# ? Dec 13, 2019 00:20 |
|
if you are using a flash-based SSD (performs better at high queue depth) then will using Postgres partitions increase performance by giving it the ability to fire off multiple full-table scans in parallel?
|
# ? Dec 13, 2019 00:46 |
|
hm well i don't know for sure but i'd guess that it won't, if a parallel scan would improve performance then postgres will use parallel workers (if enabled) even without the table being partitioned
|
# ? Dec 13, 2019 01:05 |
|
If anything I’d expect this to be detrimental, since you incur the cost of session overhead and planning for a multi-partition scan, but can’t prune (assuming your full-scan has no predicates whatsoever).
|
# ? Dec 13, 2019 02:03 |
|
Just-In-Timeberlake posted:Probably a better way to do it, but quick and dirty and works Thank you! Unfortunately this didn't work when I added the data for the second user, but I was able to tweak it a bit and I think this should work. http://sqlfiddle.com/#!9/ad2b6e/11 Thanks for pointing me in the right direction! Like I said my actual data model is way more complex than this so sorry in advance if I keep wandering back in here
|
# ? Dec 13, 2019 04:05 |
|
Actually, here's maybe a really dumb question that's only tangentially related. I'm doing this to build a new report for our system that has to select and filter data from lots and lots of tables, some containing millions of rows. At a certain point, is it ever a good idea to build out cached versions of this data in either an SQL table built specifically for this report, or even a separate db like Mongo or DynamoDB or something? Basically imagine this schema I posted except there's several other tables similar to "coin_counts" and also multiple other join tables between "users" and "coin_counts." Like am I still in the realm of things that SQL was built to handle well or would it be smarter to offload some of this to something else entirely? It feels silly to duplicate the data but at the same time some of the pathways you have to take through the DB to get the right data are pretty daunting, especially because the end user is also supposed to be able to filter this report. i.e. coin_counts don't actually belong directly to users, they belong to purses, and purses are joined in a users_purses table because multiple users can share multiple purses. And the end users of the report want to see totals, but also see those totals broken down by the coin_counts in the users' Gucci purse vs. their Chanel purse, or opt to filter to only get totals from Gucci purses in the first place. Also just to be clear this isn't some crypto app or dealing with actual money in any way I'm just trying to come up with analogy that works. And maybe the real answer here is hire a DBA.
|
# ? Dec 13, 2019 16:05 |
|
prom candy posted:Actually, here's maybe a really dumb question that's only tangentially related. I'm doing this to build a new report for our system that has to select and filter data from lots and lots of tables, some containing millions of rows. At a certain point, is it ever a good idea to build out cached versions of this data in either an SQL table built specifically for this report, or even a separate db like Mongo or DynamoDB or something? Basically imagine this schema I posted except there's several other tables similar to "coin_counts" and also multiple other join tables between "users" and "coin_counts." I'm not a DBA by education, just by being the guy that got hired and learned (and is still learning) a bunch of poo poo along the way. When I got hired most of our reporting was dogshit slow, and the solutions getting suggested were all similar: faster hardware, dedicated hardware for reporting, etc. Basically throw money at the problem. Instead I went through all the problem report queries with the query analyzer, got everything indexed properly, rewrote most of the queries to make them more performant, and now reports that used to take 10 minutes to run take under 10 seconds. So if you gotta start somewhere, I'd start with the query analyzer if things aren't fast.
|
# ? Dec 13, 2019 16:32 |
|
prom candy posted:At a certain point, is it ever a good idea to build out cached versions of this data in either an SQL table built specifically for this report, or even a separate db like Mongo or DynamoDB or something? This is a core difference between OLTP and OLAP: OLTP (a transaction database) is designed for large numbers of small transactions, while OLAP (a data warehouse) is designed to gather all the data in one place for analysis. Whereas OLTP requires extensive normalization for speed of data access and upserts, OLAP data sets are designed for reporting and analysis and have lots of data redundancy. For example, in a transaction database, you might have tables for customerInfo, productInfo, supplierInfo, orderInfo and transactionInfo, normalizing this data so every time a customer buys something only the transactionInfo table gets updated. This allows for quick in/out of data to/from tables. On the other hand, if I wanted to run a query giving me flow of inventory, I'd have to run a bunch of joins for each item bought or sold. And this query would be run over every single item bought or sold for millions or billions of transactions. Instead, I prepare the data in my data warehouse so a single line in a single table contains all the data I need for a select statement. And these tables are organized so that the most common queries are accounted for. In the above example you might create an Inventory OLAP cube that includes purchase date, purchase price, supplier information, sale date, sell price, and customer information all in a single row so that I can aggregate on it in various ways. This is hugely inefficient for your shopping cart application, but incredibly efficient for your monthly or yearly reporting requirements. If I had a different reporting requirement, say data allowed to be viewed in my China office, I would create a brand new cube with lines of data optimized for whatever my reporting needs are in China.
Note that this China cube can and does have duplicate data to my Inventory cube from above. Here I’ve made a design decision to trade disk space for reporting speed.
|
# ? Dec 13, 2019 18:09 |
|
That’s more comprehensive than the post I was writing on my phone, but the general principle is to remember that everything in the database is a balancing act. If you’re trying to use the same schema and the same system to satisfy both transactional and analytical workloads, you optimize one use-case (generally) at the expense of the other; every covering-index for your analytical use-case means more write-overhead in the transactional workload. If you lift and shift to a new server with identical schema? Yeah, you’re just throwing money at it without addressing the problem, while adding complexity. If you offload historical data from OLTP to an OLAP structure, maybe propagate via CDC from the transactional source, and properly optimize for the reporting workloads? Baby you got a stew going.
|
# ? Dec 13, 2019 18:37 |
|
I think the stock approach is more or less to have an OLTP and an OLAP server (transactional and analytical) and to accept a certain amount of latency between when things show up in production and when they show up in OLAP. Typically nightly or weekly depending on the quantity of data and how bad your server is. To go back to the original question, a materialized view is a perfectly acceptable way of implementing that. Materialized views are tables that are generated on command from live data, which obviously fits the use-case of sucking data from a transactional format into an analytical format. A middle ground is that you can mark rows as exported or not exported and then come through hourly or something and suck the changed rows over to the new format, so you don't have to rematerialize a whole table, just the changed rows.
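The mark-rows-as-exported middle ground might look something like this (SQLite via Python, invented schema, no concurrency handling):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, exported INTEGER DEFAULT 0);
CREATE TABLE orders_olap (id INTEGER PRIMARY KEY, amount REAL);
INSERT INTO orders (amount) VALUES (10.0), (20.0);
""")

def sync():
    # Copy only rows that haven't been shipped to the analytical copy
    # yet, then flag them so the next pass skips them.
    with con:
        con.execute("""
            INSERT OR REPLACE INTO orders_olap (id, amount)
            SELECT id, amount FROM orders WHERE exported = 0
        """)
        con.execute("UPDATE orders SET exported = 1 WHERE exported = 0")

sync()                                             # initial load: 2 rows
con.execute("INSERT INTO orders (amount) VALUES (30.0)")
sync()                                             # incremental pass: 1 new row
olap = con.execute("SELECT COUNT(*), SUM(amount) FROM orders_olap").fetchone()
```

A real pipeline would also reset the flag on UPDATE (a trigger or a last_modified timestamp), but the shape is the same.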
|
# ? Dec 13, 2019 18:51 |
|
Thank you for the really thorough explanations! I'm glad to hear I'm not entirely off-base here. I think as our product moves towards requiring more and more of this reporting we may need to start looking at replicating or entirely keeping some of our data in OLAP storage. I guess one of the other considerations is then dealing with stale data. For example in my ridiculous coin and purse example, if an existing purse was assigned to an existing user then we would want to update the user's historical coin counts to take all of that purse's historical data into account as well, and if a purse is removed from that user then same deal, when you went to pull data from the user you wouldn't want to see any of the data from the no-longer-assigned purse. A similar situation might be in the inventory reporting cube, if a supplier changed their address then we would want to update the rows in that inventory report to contain the new address. Is this something that people deal with when using OLAP solutions or would that kind of requirement mean that you'd need to make OLTP work for you because it's excessive and way too many transactions to keep analytical data "up to date?"
|
# ? Dec 13, 2019 19:44 |
|
In general I wouldn't worry too much about scalability issues until you actually have them. You can scale plain vanilla postgres astonishingly far just by throwing hardware at it, and hardware is really cheap these days. Hundreds of gigabytes of RAM and 64 CPU cores in a single socket are available even in consumer hardware now, and you can run a whole lot of business with something like that. The main database we use at work is barely more powerful than my gaming PC, and we're not a super small company or anything.
TheFluff fucked around with this message at 17:49 on Dec 14, 2019 |
# ? Dec 14, 2019 17:45 |
|
TheFluff posted:In general I wouldn't worry too much about scalability issues until you actually have them. You can scale plain vanilla postgres astonishingly far these days just by throwing hardware at it, and hardware is really cheap these days. Hundreds of gigabytes of RAM and 64 CPU cores in a single socket are available even in consumer hardware now, and you can run a whole lot of business with something like that. The main database we use at work is barely more powerful than my gaming PC, and we're not a super small company or anything. This. You can do an awful lot on an 8-core computer, database-wise, as long as you have speedy storage. A bunch of SSD and a bunch of RAM will hide all manner of sins. My current data warehouse lives on an 8-core, 128gb box with 3TB of SSD, and I could build a desktop PC with those specs. And it contains almost 8 billion rows of data.
|
# ? Dec 15, 2019 18:22 |
|
I have a table that has information for all distinct transactions that have gone through our system. It includes the country code (1, 2, 3), order date (MM-DD-YY H:MM), order amount, and order response (approved, declined). What I want to do is take all the data from the beginning of the year until now and group it like the picture that is attached. I want to see the rate of the order amount of declined transactions divided by the order amount of approved transactions based on the month. How should I go about this?
|
# ? Dec 17, 2019 14:21 |
|
Busy Bee posted:I have a table that has information for all distinct transactions that have gone through our system. It includes the country code (1, 2, 3), order date (MM-DD-YY H:MM), order amount, and order response (approved, declined). I'd do a subquery to roll up the transactions (GROUP BY) into month-countrycode pairs and produce sums of approved/declined for each month+countrycode, and then a second query to do your math. Which database are you using? PG has some nice syntactic sugar to generate dateranges.
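Roughly this shape, sketched against SQLite from Python with made-up names (and assuming the dates are stored ISO-style; those MM-DD-YY strings would need converting first):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE txns (country_code INTEGER, order_date TEXT, amount REAL, response TEXT);
INSERT INTO txns VALUES
    (1, '2019-01-05 09:30', 100.0, 'approved'),
    (1, '2019-01-20 08:00', 25.0,  'declined'),
    (2, '2019-02-11 04:15', 80.0,  'approved'),
    (2, '2019-02-12 05:45', 40.0,  'declined');
""")

# Conditional aggregation: one GROUP BY yields both sums per
# month+country, and the declined/approved ratio falls out of that row.
rows = con.execute("""
    SELECT strftime('%Y-%m', order_date) AS month,
           country_code,
           SUM(CASE WHEN response = 'declined' THEN amount ELSE 0 END)
             / SUM(CASE WHEN response = 'approved' THEN amount ELSE 0 END) AS decline_rate
    FROM txns
    GROUP BY month, country_code
    ORDER BY month, country_code
""").fetchall()
```

A month with zero approved dollars would divide by zero (NULL in SQLite), so a real query would guard the denominator with NULLIF or a CASE.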
|
# ? Dec 17, 2019 17:25 |
|
Paul MaudDib posted:Which database are you using? PG has some nice syntactic sugar to generate dateranges. Not sure to be honest. My SQL skills are very amateur and I wouldn't even know where to look :\
|
# ? Dec 17, 2019 17:35 |
|
Busy Bee posted:Not sure to be honest. My SQL skills are very amateur and I wouldn't even know where to look :\ Well, you say “I have a table” — which database are you doing this in? It generally doesn't make sense to export and re-import your data just to run a query. If you don't have anything at all currently, you probably want Postgres 12 or SQLite.
|
# ? Dec 17, 2019 17:49 |
|
Why does everyone who has similar questions to me on StackOverflow use the most incomprehensible table and column names?
|
# ? Dec 18, 2019 20:22 |
|
On a related note, how do you guys name your subqueries and CTEs? I work mostly with people who just use letters in the alphabet (a, b, c) in lieu of more descriptive and verbose labels. It’s kind of a giant pain when I have to read their queries.
|
# ? Dec 19, 2019 01:24 |
|
if i have a subquery then it's probably just one level of extra nesting so i can do another group by or whatever, and i only call it anything because postgres will give me an error if i don't, so i call it q. otherwise i use a cte and i name it like i would name a normal table. i can always abbreviate it with a letter or two for joins when i select it
|
# ? Dec 19, 2019 01:28 |
|
prom candy posted:Why does everyone who has similar questions to me on StackOverflow use the most incomprehensible table and column names? Generally my ability to grok databases from intuition is inversely proportional to how freely the devs were allowed to do whatever the gently caress they felt like at whatever company I’m integrating. And generally people are bad at governance so naming standards (if even defined) are rarely enforced, which is how you end up with vw_users2_tmp as a core component of some auth scheme.
|
# ? Dec 19, 2019 03:48 |
|
My company doesn’t do code reviews so everything is terrible all the time. It seriously takes thirty seconds to do an IDE-assisted variable rename so that we know what “tempvar” is guys, leave a few occasional comments that describe why not just what, give a descriptive git message, don’t spam the log and stop using console logging when we decided we’re using SLF4J, etc. It’s little stuff, I fix it when I can, but I don’t know how people even work like that. Someone (well, specifically, Andy) just checked in a file named “AndyServlet”. I don’t know what it does and at this point I’m afraid to look.
|
# ? Dec 19, 2019 04:01 |
I'm getting a T-SQL task dumped on me that's (probably) too complex/weird to be solved by a straight sequence of insert/update/delete statements, and I have almost zero knowledge of the procedural language. Can anyone recommend some books/other resources for learning procedural T-SQL?
|
|
# ? Dec 19, 2019 08:34 |
|
nielsm posted:I'm getting a T-SQL task dumped on me that's (probably) too complex/weird to be solved by a straight sequence of insert/update/delete statements, and I have almost zero knowledge of the procedural language. The docs are good.
|
# ? Dec 19, 2019 14:48 |
|
nielsm posted:I'm getting a T-SQL task dumped on me that's (probably) too complex/weird to be solved by a straight sequence of insert/update/delete statements, and I have almost zero knowledge of the procedural language. Try us - what’s the complicated problem
|
# ? Dec 19, 2019 17:14 |
|
|
|
Alright so I've been working on this query for the past week or so and I'm making a ton of progress. It's super gnarly but all my specs are passing which is really exciting after so much wheel-spinning. Then I ran it against some production data and ran into "MySQL can only use 61 tables in a join." I'm creating the query dynamically because a parent account needs to see data broken down by all their child accounts, so for a parent that has lots of children, I'm basically creating another join + several columns for each one like "child_1_stat_a", "child_1_stat_b", "child_2_stat_a", "child_2_stat_b", etc. I guess when we get to a parent that has a lot of children this has the effect of creating too many joins. Each of the child account's joins use a sub-query that also has 3-4 joins of its own (basically querying a generic table again like that "coins" table and then doing joins to limit it to just the ones that are owned by that child account.) Is this when I should be making temporary tables? Or doing a join where I get all of the data I need in one subquery and then doing subsequent joins off of that table? Sorry if this isn't specific enough, I can try to come up with a fiddle that's similarly structured if the problem isn't very clear. I just feel like if I'm hitting MySQL's join limit I'm probably doing something dumb (besides running MySQL, which I'm hoping to replace with Postgres in 2020)
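For comparison, here's the kind of single-scan conditional aggregation I'm wondering about instead of a join per child: the child data gets joined once and each child/stat pair becomes a CASE column, so the column list grows but the join count stays flat. A toy sketch (SQLite via Python, invented schema; a real version would generate the CASE columns dynamically):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE coins (child_id INTEGER, stat TEXT, value INTEGER);
INSERT INTO coins VALUES (1, 'a', 10), (1, 'b', 2), (2, 'a', 7), (2, 'b', 5);
""")

# Pivot via conditional aggregation: one scan of the child data, one
# CASE column per child/stat pair -- no per-child join, so the 61-table
# limit never comes into play.
rows = con.execute("""
    SELECT SUM(CASE WHEN child_id = 1 AND stat = 'a' THEN value END) AS child_1_stat_a,
           SUM(CASE WHEN child_id = 1 AND stat = 'b' THEN value END) AS child_1_stat_b,
           SUM(CASE WHEN child_id = 2 AND stat = 'a' THEN value END) AS child_2_stat_a,
           SUM(CASE WHEN child_id = 2 AND stat = 'b' THEN value END) AS child_2_stat_b
    FROM coins
""").fetchall()
```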
|
# ? Dec 20, 2019 21:50 |