Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Paul MaudDib posted:

cool, that works great, thanks! I just learned a new thing you could do with regexes.

I'm still gonna keep not_match_mask in the script because it's already there and might be useful down the line, but if it's null then it's ignored anyway.

Yeah, it's more or less freeform user text, not like CSV or HL7 where you could key off the delimiter. Obviously it would be nicer not to have to clean the data up myself, but is there some other reason it's bad?

As a formal methods, theory, and regex nerd: one of us! One of us! One of us! We accept you! We accept you!

Data Graham
Dec 28, 2009

📈📊🍪😋



Now, don't be greedy

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Nth Doctor posted:

As a formal methods, theory, and regex nerd: one of us! One of us! One of us! We accept you! We accept you!

I did stack machines and regular languages and all that stuff back in college, but we never really got into some of the more esoteric features of regexes. I’m under the impression that some of those actually take regexes past the power of a regular language into context sensitive languages, if I’m remembering the terms right.

nielsm
Jun 1, 2009



Paul MaudDib posted:

I did stack machines and regular languages and all that stuff back in college, but we never really got into some of the more esoteric features of regexes. I’m under the impression that some of those actually take regexes past the power of a regular language into context sensitive languages, if I’m remembering the terms right.

Yes, basically any features like lookahead or lookbehind assertions that don't consume input are outside regular language powers, at least as a rule of thumb. Some regex implementations might degrade from an FSM implementation with O(n) time to match, where n is the length of the haystack, to a much slower backtracking algorithm that can be quadratic in the length of the input plus the complexity of the expression.
All of this is from memory, might have some details wrong.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


nielsm posted:

Yes, basically any features like lookahead or lookbehind assertions that don't consume input are outside regular language powers, at least as a rule of thumb. Some regex implementations might degrade from an FSM implementation with O(n) time to match, where n is the length of the haystack, to a much slower backtracking algorithm that can be quadratic in the length of the input plus the complexity of the expression.
All of this is from memory, might have some details wrong.

I think those features are fine. They're basically easier to express with an NFA, but NFAs can be transformed into DFAs with only a little bit of brain-melting math and acceptance classes.

Some of the weirder stuff you can do with advanced capturing and rematching seems to be where you jump the boundary of finite automata into a higher-order machine.

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Backreferences and recursion are the excitements you seek.
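
A quick sketch of why backreferences are the interesting part: the pattern below matches any string that is some chunk repeated twice back to back, which no finite automaton can recognize in general. (Postgres-flavored, and the \1 syntax assumes its ARE regexes.)

code:
-- the \1 backreference is the bit a plain regular language can't express
SELECT 'abcabc' ~ '^(.+)\1$' AS doubled;   -- true: "abc" followed by itself
SELECT 'abcabd' ~ '^(.+)\1$' AS doubled;   -- false: no prefix repeats exactly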

Batterypowered7
Aug 8, 2009

The mist that chills you keeps me warm.

I've got a table that looks something like this.

code:
LOAN
+----------------+------------+------+---------+----------+--------------+
|       ID       | START_DATE | TERM | MONTHLY | INTEREST | DOWN_PAYMENT |
+----------------+------------+------+---------+----------+--------------+
| 12345678912345 | 2019-6-15  |   13 |  404.00 |      3.5 |         5000 |
| 255697128      | 2019-01-01 |    7 |  375.00 |      4.0 |         7500 |
| S530460864050  | 2019-01-10 |    5 |  200.00 |      2.5 |        10000 |
| 748698418      | 2018-10-31 |   12 |  350.00 |      3.2 |         3000 |
+----------------+------------+------+---------+----------+--------------+
ID: the ID of the person the loan belongs to
START_DATE: the date the loan began
TERM: the length of the loan in months
MONTHLY: the amount paid each month for the duration of the term

Note: The first payment for each loan is due the following month.

I need help generating a report that calculates the total monthly income for each month for the past 12 months. I believe for the sample table above, the results would look something like this:

code:
+------+-------+----------------+
| YEAR | MONTH | MONTHLY_INCOME |
+------+-------+----------------+
| 2019 |    11 |         404.00 | June13
| 2019 |    10 |         754.00 | Oct12, June13
| 2019 |    09 |         754.00 | Oct12, June13
| 2019 |    08 |        1129.00 | Oct12,  Jan7, June13
| 2019 |    07 |        1129.00 | Oct12,  Jan7, June13
| 2019 |    06 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    05 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    04 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    03 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    02 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    01 |         350.00 | Oct12
| 2018 |    12 |         350.00 | Oct12
| 2018 |    11 |         350.00 | Oct12
+------+-------+----------------+
I've written which loans I believe should contribute to the total monthly income next to each row for convenience. It is not necessary for the report. Would someone be able to point me in the right direction?

Gatac
Apr 22, 2008

Fifty Cent's next biopic.
The precise details depend on your flavor of SQL, but I'd say you do a view that extrapolates the monthly payments from your first table (probably by joining with something that acts as an integer series generator rejiggered to create rows for individual months), then do a second view based on that to create the sums and filter for the months you want.
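
A rough sketch of what that could look like in SQL Server syntax: the derived table of row numbers plays the "integer series generator" role, the column names are the ones from the LOAN table above, and the 1000 is just an arbitrary cap on term length.

code:
-- one row per loan per payment month; first payment lands the month after START_DATE
SELECT l.ID,
       DATEADD(month, n.n, l.START_DATE) AS payment_month,
       l.MONTHLY
FROM LOAN AS l
JOIN (SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects) AS n
  ON n.n BETWEEN 1 AND l.TERM;
Summing MONTHLY from that, grouped by year and month, gives the report.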

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Batterypowered7 posted:

I've got a table that looks something like this.

code:
LOAN
+----------------+------------+------+---------+----------+--------------+
|       ID       | START_DATE | TERM | MONTHLY | INTEREST | DOWN_PAYMENT |
+----------------+------------+------+---------+----------+--------------+
| 12345678912345 | 2019-6-15  |   13 |  404.00 |      3.5 |         5000 |
| 255697128      | 2019-01-01 |    7 |  375.00 |      4.0 |         7500 |
| S530460864050  | 2019-01-10 |    5 |  200.00 |      2.5 |        10000 |
| 748698418      | 2018-10-31 |   12 |  350.00 |      3.2 |         3000 |
+----------------+------------+------+---------+----------+--------------+
ID: the ID of the person the loan belongs to
START_DATE: the date the loan began
TERM: the length of the loan in months
MONTHLY: the amount paid each month for the duration of the term

Note: The first payment for each loan is due the following month.

I need help generating a report that calculates the total monthly income for each month for the past 12 months. I believe for the sample table above, the results would look something like this:

code:
+------+-------+----------------+
| YEAR | MONTH | MONTHLY_INCOME |
+------+-------+----------------+
| 2019 |    11 |         404.00 | June13
| 2019 |    10 |         754.00 | Oct12, June13
| 2019 |    09 |         754.00 | Oct12, June13
| 2019 |    08 |        1129.00 | Oct12,  Jan7, June13
| 2019 |    07 |        1129.00 | Oct12,  Jan7, June13
| 2019 |    06 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    05 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    04 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    03 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    02 |         925.00 | Oct12, Jan5, Jan7
| 2019 |    01 |         350.00 | Oct12
| 2018 |    12 |         350.00 | Oct12
| 2018 |    11 |         350.00 | Oct12
+------+-------+----------------+
I've written which loans I believe should contribute to the total monthly income next to each row for convenience. It is not necessary for the report. Would someone be able to point me in the right direction?

It's a hack, but trying to generate a sequence of months between a start and end date kinda sucks in SQL.
Here's a fiddle where I messed around a bunch: http://sqlfiddle.com/#!18/02fb2/3

Pardot
Jul 25, 2001




Nth Doctor posted:

trying to generate a sequence of months between a start and end date kinda sucks in SQL.

Not when you got generate_series and range types :c00lbert:


code:
select month, sum(monthly)
from generate_series('2018-11-01'::timestamptz, now(), '1 month') month
join (select *, tstzrange(start_date, start_date + term*'1 month'::interval, '[]') from loans) f
  on month <@f.tstzrange
group by month
order by month desc;
         month          |   sum
------------------------+---------
 2019-11-01 00:00:00-07 |  404.00
 2019-10-01 00:00:00-07 |  754.00
 2019-09-01 00:00:00-07 |  754.00
 2019-08-01 00:00:00-07 | 1129.00
 2019-07-01 00:00:00-07 | 1129.00
 2019-06-01 00:00:00-07 |  930.00
 2019-05-01 00:00:00-07 |  930.00
 2019-04-01 00:00:00-07 |  930.00
 2019-03-01 00:00:00-08 |  930.00
 2019-02-01 00:00:00-08 |  930.00
 2019-01-01 00:00:00-08 |  725.00
 2018-12-01 00:00:00-08 |  350.00
 2018-11-01 00:00:00-07 |  350.00
(13 rows)
http://sqlfiddle.com/#!17/f4216/2

Batterypowered7
Aug 8, 2009

The mist that chills you keeps me warm.

Nth Doctor posted:

It's a hack, but trying to generate a sequence of months between a start and end date kinda sucks in SQL.
Here's a fiddle where I messed around a bunch: http://sqlfiddle.com/#!18/02fb2/3

Pardot posted:

Not when you got generate_series and range types :c00lbert:


code:
select month, sum(monthly)
from generate_series('2018-11-01'::timestamptz, now(), '1 month') month
join (select *, tstzrange(start_date, start_date + term*'1 month'::interval, '[]') from loans) f
  on month <@f.tstzrange
group by month
order by month desc;
         month          |   sum
------------------------+---------
 2019-11-01 00:00:00-07 |  404.00
 2019-10-01 00:00:00-07 |  754.00
 2019-09-01 00:00:00-07 |  754.00
 2019-08-01 00:00:00-07 | 1129.00
 2019-07-01 00:00:00-07 | 1129.00
 2019-06-01 00:00:00-07 |  930.00
 2019-05-01 00:00:00-07 |  930.00
 2019-04-01 00:00:00-07 |  930.00
 2019-03-01 00:00:00-08 |  930.00
 2019-02-01 00:00:00-08 |  930.00
 2019-01-01 00:00:00-08 |  725.00
 2018-12-01 00:00:00-08 |  350.00
 2018-11-01 00:00:00-07 |  350.00
(13 rows)
http://sqlfiddle.com/#!17/f4216/2

Thank you very much, you two. I'm using PostgreSQL, so I was able to modify the second fiddle ever so slightly so that the series start date wasn't hard-coded. You guys were a great help!
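
For reference, one way to derive the series start instead of hard-coding it (just a guess at the tweak; this anchors the window to twelve full months before the current month, and is otherwise Pardot's query unchanged):

code:
select month, sum(monthly)
from generate_series(date_trunc('month', now()) - interval '12 months',
                     now(), '1 month') month
join (select *, tstzrange(start_date, start_date + term*'1 month'::interval, '[]') from loans) f
  on month <@f.tstzrange
group by month
order by month desc;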

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Pardot posted:

Not when you got generate_series and range types :c00lbert:


code:
select month, sum(monthly)
from generate_series('2018-11-01'::timestamptz, now(), '1 month') month
join (select *, tstzrange(start_date, start_date + term*'1 month'::interval, '[]') from loans) f
  on month <@f.tstzrange
group by month
order by month desc;
         month          |   sum
------------------------+---------
 2019-11-01 00:00:00-07 |  404.00
 2019-10-01 00:00:00-07 |  754.00
 2019-09-01 00:00:00-07 |  754.00
 2019-08-01 00:00:00-07 | 1129.00
 2019-07-01 00:00:00-07 | 1129.00
 2019-06-01 00:00:00-07 |  930.00
 2019-05-01 00:00:00-07 |  930.00
 2019-04-01 00:00:00-07 |  930.00
 2019-03-01 00:00:00-08 |  930.00
 2019-02-01 00:00:00-08 |  930.00
 2019-01-01 00:00:00-08 |  725.00
 2018-12-01 00:00:00-08 |  350.00
 2018-11-01 00:00:00-07 |  350.00
(13 rows)
http://sqlfiddle.com/#!17/f4216/2

:hfive:


Batterypowered7 posted:

Thank you very much, you two. I'm using PostgreSQL, so I was able to modify the second fiddle ever so slightly so that the series start date wasn't hard-coded. You guys were a great help!

Glad you got it!

Phummus
Aug 4, 2006

If I get ten spare bucks, it's going for a 30-pack of Schlitz.

Happiness Commando posted:

I do mostly traditional sysadmin stuff for a fintech company and work is willing to pay for some SQL classes for me. We use MS SQL Server in-house and are looking at Aurora MySQL, although I'm at select * from foo level, so there shouldn't be anything system-specific in whatever it is that I learn.

Can anyone recommend some places to look for training?

Check out Brent Ozar's Black Friday sale and his First Responder Kit (free).

prom candy
Dec 16, 2005

Only I may dance
Hey SQL gang, I'm running up against something that I feel should be simple but I can't figure it out and I don't even really know where to start. Here's my sqlfiddle: http://sqlfiddle.com/#!9/622103/4

Given my data set here, and the query, what I'd like is to get back a row for each user, their "coins_count" at the first date, their coins_count for the second date, and ideally the dates of the two records it chose. So for example given my query there I'd like to get back something like

code:
id | name | starting_balance | ending_balance | starting_on | ending_on   | change  
----------------------------------------------------------------------------------  
1  | Jack | 15               | 25             | 2019-04-01  | 2019-06-01  | 10
I'd also love to be able to get the sum of ending balance, and the sum of change.

Can anyone point me down the right road here? I feel like what I basically want is to somehow select/join the first and last coin_count in the given date range for each user, but I can't figure out how to do that. So then I'm thinking I probably want a subquery, but I'm not sure how to link the coin_counts back to the users.

This is a pretty simplified version of my actual data model but I think if I can figure out how to do this I can solve my larger problem. Really appreciate any advice here!

edit: lopped the first line off my post with the link to the fiddle

prom candy fucked around with this message at 22:58 on Dec 12, 2019

Just-In-Timeberlake
Aug 18, 2003

Probably a better way to do it, but quick and dirty and works

code:
  Select users.*
        , startTbl.coins `start coins`
        , endTbl.coins `end coins`
        , startTbl.created_on `start date`
        , endTbl.created_on `end date`
        , endTbl.coins - startTbl.coins `change`
    from users
      inner join
         (SELECT *
         from coin_counts
         where coin_counts.created_on <= '2020-01-01'
         order by created_on desc
         limit 1) endTbl on users.id = endTbl.user_id
      inner join
         (SELECT *
         from coin_counts
         where coin_counts.created_on >= '2019-04-01'
         order by created_on asc
         limit 1) startTbl on users.id = startTbl.user_id
  

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
if you are using a flash-based SSD (performs better at high queue depth) then will using Postgres partitions increase performance by giving it the ability to fire off multiple full-table scans in parallel?

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
hm well i don't know for sure but i'd guess that it won't, if a parallel scan would improve performance then postgres will use parallel workers (if enabled) even without the table being partitioned
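
For anyone who wants to check what the planner actually does, parallelism is gated by a couple of settings and shows up right in the plan. A minimal sketch (the loans table is just a stand-in):

code:
SHOW max_parallel_workers_per_gather;   -- 0 means parallel query is effectively off
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM loans;             -- look for Gather / Parallel Seq Scan nodes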

ElehemEare
May 20, 2001
I am an omnipotent penguin.

If anything I’d expect this to be detrimental, since you incur the cost of session overhead and planning for a multi-partition scan, but can’t prune (assuming your full-scan has no predicates whatsoever).

prom candy
Dec 16, 2005

Only I may dance

Just-In-Timeberlake posted:

Probably a better way to do it, but quick and dirty and works

code:
  Select users.*
        , startTbl.coins `start coins`
        , endTbl.coins `end coins`
        , startTbl.created_on `start date`
        , endTbl.created_on `end date`
        , endTbl.coins - startTbl.coins `change`
    from users
      inner join
         (SELECT *
         from coin_counts
         where coin_counts.created_on <= '2020-01-01'
         order by created_on desc
         limit 1) endTbl on users.id = endTbl.user_id
      inner join
         (SELECT *
         from coin_counts
         where coin_counts.created_on >= '2019-04-01'
         order by created_on asc
         limit 1) startTbl on users.id = startTbl.user_id
  

Thank you! Unfortunately this didn't work when I added the data for the second user, but I was able to tweak it a bit and I think this should work. http://sqlfiddle.com/#!9/ad2b6e/11

Thanks for pointing me in the right direction! Like I said my actual data model is way more complex than this so sorry in advance if I keep wandering back in here
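
One way to make the earlier query per-user is to correlate the start/end lookups on user_id instead of taking a single LIMIT 1 row overall; a rough MySQL-flavored sketch with the same table names (not necessarily what the linked fiddle does):

code:
select u.*
      , s.coins `start coins`
      , e.coins `end coins`
      , s.created_on `start date`
      , e.created_on `end date`
      , e.coins - s.coins `change`
  from users u
    inner join coin_counts s
      on s.user_id = u.id
     and s.created_on = (select min(c.created_on) from coin_counts c
                         where c.user_id = u.id and c.created_on >= '2019-04-01')
    inner join coin_counts e
      on e.user_id = u.id
     and e.created_on = (select max(c.created_on) from coin_counts c
                         where c.user_id = u.id and c.created_on <= '2020-01-01')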

prom candy
Dec 16, 2005

Only I may dance
Actually, here's maybe a really dumb question that's only tangentially related. I'm doing this to build a new report for our system that has to select and filter data from lots and lots of tables, some containing millions of rows. At a certain point, is it ever a good idea to build out cached versions of this data in either an SQL table built specifically for this report, or even a separate db like Mongo or DynamoDB or something? Basically imagine this schema I posted except there's several other tables similar to "coin_counts" and also multiple other join tables between "users" and "coin_counts."

Like am I still in the realm of things that SQL was built to handle well or would it be smarter to offload some of this to something else entirely? It feels silly to duplicate the data but at the same time some of the pathways you have to take through the DB to get the right data are pretty daunting, especially because the end user is also supposed to be able to filter this report. i.e. coin_counts don't actually belong directly to users, they belong to purses, and purses are joined in a users_purses table because multiple users can share multiple purses. And the end users of the report want to see totals, but also see those totals broken down by the coin_counts in the users' Gucci purse vs. their Chanel purse, or opt to filter to only get totals from Gucci purses in the first place.

Also just to be clear this isn't some crypto app or dealing with actual money in any way I'm just trying to come up with analogy that works. And maybe the real answer here is hire a DBA.

Just-In-Timeberlake
Aug 18, 2003

prom candy posted:

Actually, here's maybe a really dumb question that's only tangentially related. I'm doing this to build a new report for our system that has to select and filter data from lots and lots of tables, some containing millions of rows. At a certain point, is it ever a good idea to build out cached versions of this data in either an SQL table built specifically for this report, or even a separate db like Mongo or DynamoDB or something? Basically imagine this schema I posted except there's several other tables similar to "coin_counts" and also multiple other join tables between "users" and "coin_counts."

Like am I still in the realm of things that SQL was built to handle well or would it be smarter to offload some of this to something else entirely? It feels silly to duplicate the data but at the same time some of the pathways you have to take through the DB to get the right data are pretty daunting, especially because the end user is also supposed to be able to filter this report. i.e. coin_counts don't actually belong directly to users, they belong to purses, and purses are joined in a users_purses table because multiple users can share multiple purses. And the end users of the report want to see totals, but also see those totals broken down by the coin_counts in the users' Gucci purse vs. their Chanel purse, or opt to filter to only get totals from Gucci purses in the first place.

Also just to be clear this isn't some crypto app or dealing with actual money in any way I'm just trying to come up with analogy that works. And maybe the real answer here is hire a DBA.

I'm not a DBA by education, just by being the guy that got hired and learned (and still learning) a bunch of poo poo along the way.

When I got hired most of our reporting was dogshit slow, and the solutions getting suggested were all similar: faster hardware, dedicated hardware for reporting, etc. Basically throw money at the problem. Instead I went through all the problem report queries with the query analyzer, got everything indexed properly, rewrote most of the queries to make them more performant, and now reports that used to take 10 minutes to run take under 10 seconds.

So if you gotta start somewhere, I'd start with the query analyzer if things aren't fast.
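
A minimal SQL Server sketch of that loop, with made-up table and index names: measure, read the actual plan, index or rewrite, measure again.

code:
SET STATISTICS IO ON;    -- logical reads per table
SET STATISTICS TIME ON;  -- CPU and elapsed time per statement

SELECT user_id, SUM(amount) AS total    -- stand-in for one of the slow report queries
FROM dbo.report_rows
GROUP BY user_id;

-- if the actual plan shows a scan on a big table, a covering index often fixes it
CREATE NONCLUSTERED INDEX IX_report_rows_user_id
    ON dbo.report_rows (user_id)
    INCLUDE (amount);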

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

prom candy posted:

At a certain point, is it ever a good idea to build out cached versions of this data in either an SQL table built specifically for this report, or even a separate db like Mongo or DynamoDB or something?

It feels silly to duplicate the data but at the same time some of the pathways you have to take through the DB to get the right data are pretty daunting, especially because the end user is also supposed to be able to filter this report.

This is a core difference between OLTP and OLAP: OLTP (a transaction database) is designed for large numbers of small transactions, while OLAP (a data warehouse) is designed to gather all the data in one place for analysis.

Whereas OLTP requires extensive normalization for speed of data access and upserts, OLAP data sets are designed for reporting and analysis and have lots of data redundancy.

For example, in a transaction database, you might have tables for customerInfo, productInfo, supplierInfo, orderInfo and transactionInfo, normalizing this data so every time a customer buys something only the transactionInfo table gets updated. This allows for quick in/out of data to/from tables.

On the other hand, if I wanted to run a query giving me flow of inventory, I’d have to run a bunch of joins for each item bought or sold. And this query would be run over every single item bought or sold for millions or billions of transactions.

Instead, I prepare the data in my data warehouse so a single line in a single table contains all the data I need for a select statement. And these tables are organized so that the most common queries are accounted for.

In the above example you might create an Inventory OLAP cube that includes purchase date, purchase price, supplier information, sale date, sell price, and customer information all in a single row so that I can aggregate on it in various ways. This is hugely inefficient for your shopping cart application, but incredibly efficient for your monthly or yearly reporting requirements.


If I had a different reporting requirement, say data allowed to be viewed in my China office, I would create a brand new cube with lines of data optimized for whatever my reporting needs are in China. Note that this China cube can and does have duplicate data to my Inventory cube from above. Here I’ve made a design decision to trade disk space for reporting speed.
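
As a concrete (entirely made-up) sketch of that inventory cube, in Postgres flavor: one wide table, so the report is a single scan with no joins at query time.

code:
CREATE TABLE inventory_fact (
    item_id         bigint,
    purchase_date   date,
    purchase_price  numeric(12,2),
    supplier_name   text,
    sale_date       date,
    sell_price      numeric(12,2),
    customer_region text
);

-- monthly margin by supplier, straight off the one table
SELECT date_trunc('month', sale_date) AS sale_month,
       supplier_name,
       sum(sell_price - purchase_price) AS gross_margin
FROM inventory_fact
GROUP BY 1, 2
ORDER BY 1, 2;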

ElehemEare
May 20, 2001
I am an omnipotent penguin.

That’s more comprehensive than the post I was writing on my phone, but the general principle is to remember that everything in the database is a balancing act.

If you’re trying to use the same schema and the same system to satisfy both transactional and analytical workloads, you optimize one use-case (generally) at the expense of the other; every covering-index for your analytical use-case means more write-overhead in the transactional workload.

If you lift and shift to a new server with identical schema? Yeah, you’re just throwing money at it without addressing the problem, while adding complexity.

If you offload historical data from OLTP to an OLAP structure, maybe propagate via CDC from the transactional source, and properly optimize for the reporting workloads? Baby you got a stew going.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
I think the stock approach is more or less to have an OLTP and an OLAP server (transactional and analytical) and to accept a certain amount of latency between when things show up in production and when they show up in OLAP. Typically nightly or weekly depending on the quantity of data and how bad your server is.

To go back to the original question, a materialized view is a perfectly acceptable way of implementing that. Materialized views are tables that are regenerated on command from live data, which obviously fits the use case of sucking data from a transactional format into an analytical format.

A middle ground is that you can mark rows as exported or not exported and then come through hourly or something and suck the changed rows over to the new format, so you don't have to rematerialize a whole table, just the changed rows.
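
A minimal Postgres sketch of that, reusing the earlier loans query as the analytical shape; the refresh just reruns on whatever schedule the latency budget allows.

code:
CREATE MATERIALIZED VIEW monthly_income AS
SELECT month, sum(monthly) AS monthly_income
FROM generate_series('2018-11-01'::timestamptz, now(), '1 month') AS month
JOIN (SELECT *, tstzrange(start_date, start_date + term * '1 month'::interval, '[]') AS active
      FROM loans) AS l
  ON month <@ l.active
GROUP BY month;

-- nightly/hourly/whatever the reporting latency allows:
REFRESH MATERIALIZED VIEW monthly_income;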

prom candy
Dec 16, 2005

Only I may dance
Thank you for the really thorough explanations! I'm glad to hear I'm not entirely off-base here. I think as our product moves towards requiring more and more of this reporting we may need to start looking at replicating or entirely keeping some of our data in OLAP storage. I guess one of the other considerations is then dealing with stale data. For example in my ridiculous coin and purse example, if an existing purse was assigned to an existing user then we would want to update the user's historical coin counts to take all of that purse's historical data into account as well, and if a purse is removed from that user then same deal, when you went to pull data from the user you wouldn't want to see any of the data from the no-longer-assigned purse. A similar situation might be in the inventory reporting cube, if a supplier changed their address then we would want to update the rows in that inventory report to contain the new address.

Is this something that people deal with when using OLAP solutions, or would that kind of requirement mean that you'd need to make OLTP work for you, because it would take way too many transactions to keep the analytical data "up to date"?

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
In general I wouldn't worry too much about scalability issues until you actually have them. You can scale plain vanilla postgres astonishingly far these days just by throwing hardware at it, and hardware is really cheap these days. Hundreds of gigabytes of RAM and 64 CPU cores in a single socket are available even in consumer hardware now, and you can run a whole lot of business with something like that. The main database we use at work is barely more powerful than my gaming PC, and we're not a super small company or anything.

TheFluff fucked around with this message at 17:49 on Dec 14, 2019

Agrikk
Oct 17, 2003

Take care with that! We have not fully ascertained its function, and the ticking is accelerating.

TheFluff posted:

In general I wouldn't worry too much about scalability issues until you actually have them. You can scale plain vanilla postgres astonishingly far these days just by throwing hardware at it, and hardware is really cheap these days. Hundreds of gigabytes of RAM and 64 CPU cores in a single socket are available even in consumer hardware now, and you can run a whole lot of business with something like that. The main database we use at work is barely more powerful than my gaming PC, and we're not a super small company or anything.

This.

You can do an awful lot on an 8-core computer, database-wise, as long as you have speedy storage. A bunch of SSD and a bunch of RAM will hide all manner of sins.

My current data warehouse lives on an 8-core, 128GB box with 3TB of SSD, and I could build a desktop PC with those specs. :D And it contains almost 8 billion rows of data.

Busy Bee
Jul 13, 2004
I have a table that has information for all distinct transactions that have gone through our system. It includes the country code (1, 2, 3), order date (MM-DD-YY H:MM), order amount, and order response (approved, declined).

What I want to do is take all the data from the beginning of the year until now and group it like the picture that is attached.

I want to see the rate of the order amount of declined transactions divided by the order amount of approved transactions based on the month.

How should I go about this?

[attached image not available]

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Busy Bee posted:

I have a table that has information for all distinct transactions that have gone through our system. It includes the country code (1, 2, 3), order date (MM-DD-YY H:MM), order amount, and order response (approved, declined).

What I want to do is take all the data from the beginning of the year until now and group it like the picture that is attached.

I want to see the rate of the order amount of declined transactions divided by the order amount of approved transactions based on the month.

How should I go about this?



I'd do a subquery to roll up the transactions (GROUP BY month-countrycode pairs) and produce counts of approved/declined for each month+countrycode, and then a second query on top to do your math.

Which database are you using? PG has some nice syntactic sugar to generate date ranges.
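
A rough sketch of that shape, using the order amounts the question asks for, assuming Postgres, assuming order_date is a real timestamp column, and guessing at the table/column names (transactions, country_code, order_date, order_amount, order_response):

code:
SELECT yr, mo, country_code,
       declined_amount / NULLIF(approved_amount, 0) AS decline_rate
FROM (
    SELECT extract(year  FROM order_date) AS yr,
           extract(month FROM order_date) AS mo,
           country_code,
           sum(order_amount) FILTER (WHERE order_response = 'declined') AS declined_amount,
           sum(order_amount) FILTER (WHERE order_response = 'approved') AS approved_amount
    FROM transactions
    WHERE order_date >= date_trunc('year', now())   -- beginning of the year until now
    GROUP BY 1, 2, 3
) AS monthly
ORDER BY yr, mo, country_code;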

Busy Bee
Jul 13, 2004

Paul MaudDib posted:

Which database are you using? PG has some nice syntactic sugar to generate date ranges.

Not sure to be honest. My SQL skills are very amateur and I wouldn't even know where to look :\

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE

Busy Bee posted:

Not sure to be honest. My SQL skills are very amateur and I wouldn't even know where to look :\

Well, you say “we have a table”, which database are you doing this in? It generally doesn’t make sense to export and re-import your data just to run a query.

If you don’t have anything at all currently, you probably want Postgres 12 or SQLite.

prom candy
Dec 16, 2005

Only I may dance
Why does everyone who has similar questions to me on StackOverflow use the most incomprehensible table and column names?

Vegetable
Oct 22, 2010

On a related note, how do you guys name your subqueries and CTEs? I work mostly with people who just use letters in the alphabet (a, b, c) in lieu of more descriptive and verbose labels. It’s kind of a giant pain when I have to read their queries.

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
if i have a subquery then it's probably just one level of extra nesting so i can do another group by or whatever, and i only call it anything because postgres will give me an error if i don't, so i call it q. otherwise i use a cte and i name it like i would name a normal table. i can always abbreviate it with a letter or two for joins when i select it
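
A tiny illustration of that convention (tables made up): the CTE gets a real name, and the one-letter alias only shows up at the join.

code:
with declined_orders as (
    select country_code, sum(order_amount) as declined_amount
    from transactions
    where order_response = 'declined'
    group by country_code
)
select c.name, d.declined_amount
from declined_orders d            -- full name above, short alias here
join countries c on c.code = d.country_code;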

ElehemEare
May 20, 2001
I am an omnipotent penguin.

prom candy posted:

Why does everyone who has similar questions to me on StackOverflow use the most incomprehensible table and column names?

Generally my ability to grok databases from intuition is inversely proportional to how freely the devs were allowed to do whatever the gently caress they felt like at whatever company I’m integrating.

And generally people are bad at governance so naming standards (if even defined) are rarely enforced, which is how you end up with vw_users2_tmp as a core component of some auth scheme.

Paul MaudDib
May 3, 2006

TEAM NVIDIA:
FORUM POLICE
My company doesn’t do code reviews so everything is terrible all the time.

It seriously takes thirty seconds to do an IDE-assisted variable rename so that we know what “tempvar” is, guys. Leave a few occasional comments that describe why, not just what, give a descriptive git commit message, don’t spam the log, and stop using console logging when we decided we’re using SLF4J, etc. It’s little stuff, I fix it when I can, but I don’t know how people even work like that.

Someone (well, specifically, Andy) just checked in a file named “AndyServlet”. I don’t know what it does and at this point I’m afraid to look.

nielsm
Jun 1, 2009



I'm getting a T-SQL task dumped on me that's (probably) too complex/weird to be solved by a straight sequence of insert/update/delete statements, and I have almost zero knowledge of the procedural language.
Can anyone recommend some books/other resources for learning procedural T-SQL?

ElehemEare
May 20, 2001
I am an omnipotent penguin.

nielsm posted:

I'm getting a T-SQL task dumped on me that's (probably) too complex/weird to be solved by a straight sequence of insert/update/delete statements, and I have almost zero knowledge of the procedural language.
Can anyone recommend some books/other resources for learning procedural T-SQL?

The docs are good.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


nielsm posted:

I'm getting a T-SQL task dumped on me that's (probably) too complex/weird to be solved by a straight sequence of insert/update/delete statements, and I have almost zero knowledge of the procedural language.
Can anyone recommend some books/other resources for learning procedural T-SQL?

Try us - what’s the complicated problem?

prom candy
Dec 16, 2005

Only I may dance
Alright so I've been working on this query for the past week or so and I'm making a ton of progress. It's super gnarly but all my specs are passing, which is really exciting after so much wheel-spinning. Then I ran it against some production data and ran into "MySQL can only use 61 tables in a join." I'm creating the query dynamically because a parent account needs to see data broken down by all their child accounts, so for a parent that has lots of children, I'm basically creating another join + several columns for each one like "child_1_stat_a", "child_1_stat_b", "child_2_stat_a", "child_2_stat_b", etc. I guess when we get to a parent that has a lot of children this has the effect of creating too many joins. Each of the child accounts' joins uses a sub-query that also has 3-4 joins of its own (basically querying a generic table again like that "coins" table and then doing joins to limit it to just the ones that are owned by that child account).

Is this when I should be making temporary tables? Or doing a join where I get all of the data I need in one subquery and then doing subsequent joins off of that table? Sorry if this isn't specific enough, I can try to come up with a fiddle that's similarly structured if the problem isn't very clear. I just feel like if I'm hitting MySQL's join limit I'm probably doing something dumb (besides running MySQL, which I'm hoping to replace with Postgres in 2020)

  • Reply