Seventh Arrow
Jan 26, 2005

mortarr posted:

A lot of Data Engineering jobs mention python - where should I start? I've done a lot of data related dev stuff, almost all on c#/ms SQL if that helps? Looking at one of the Azure certs mentioned above too.

Our humble SA thread

https://forums.somethingawful.com/showthread.php?threadid=3812541

Reddit

https://www.reddit.com/r/inventwithpython/
https://www.reddit.com/r/learnpython/
https://www.reddit.com/r/Python/
https://www.reddit.com/r/PythonNoobs/

Books

Python Crash Course
Automate the Boring Stuff with Python
Think Python

Youtube

Indently
Tech With Tim
mCoding
ArjanCodes
Python Simplified

Online Learning

Codecademy
Real Python
The Official Guide
DataCamp

Discussion

Official Python IRC
Discord

Brain-busters

LeetCode
Practice Python
HackerRank
Edabit

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Seventh Arrow posted:

Great resources...

This is some good stuff and should definitely be pinned in the OP, and the new data thread (if someone makes it).

Hughmoris
Apr 21, 2007
Let's go to the abyss!
Anyone here work with Spark on a routine basis? If so, what's the nature of the work and how do you like it?

I'm studying up for the databricks Spark Developer cert but unfortunately don't have much excuse to use it in my current role.

abelwingnut
Dec 23, 2002


theory question.

let's say you have two scripts to get you the same result set. one scripts the conditions via predicates in the WHERE clause, and the JOINs are simply ON a.x = b.x. the other scripts the conditions via the ON clauses in its JOINs, and has no WHERE clause.

does it matter at all? is one more efficient?

personally i've always done everything in the WHERE clause. it's just so much more readable. but maybe that comes at an expense to efficiency?

abelwingnut fucked around with this message at 19:10 on Aug 11, 2023

nielsm
Jun 1, 2009



If you were doing left (or right) joins instead of inner joins, then they would not necessarily be equivalent.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

abelwingnut posted:

let's say you have two scripts to get you the same result set. one scripts the conditions via predicates in the WHERE clause, and the JOINs are simply ON a.x = b.x. the other scripts the conditions via the ON clauses in its JOINs, and has no WHERE clause.

does it matter at all? is one more efficient?

the usual positions people strike on this are (1) they like to do everything in the WHERE clause or (2) they like to put join logic in ON clauses and filtering logic in the WHERE clause. Afaik it's uncommon to advocate for (3) putting filtering logic in ON clauses and avoiding having a WHERE clause at all. With good reason, since filtering unrelated to the JOIN logic is expressly what the WHERE clause is for and to hide it in ON clauses is to obfuscate what your query is doing. I would find writing SQL like that extremely weird.

I feel like your question is pretty abstract so it amounts to asking, does there exist a pair of queries - one putting filtering logic in the WHERE clause and one putting it in ON clauses - where the queries would give identical results but with a difference in efficiency? I feel like the answer morally ought to be "no there does not" but I couldn't begin to prove it and as I said, it's very abstract so maybe there is a way you could come up with one. It's a totally academic question anyway because as I said, I don't think people generally advocate for putting filtering logic in the ON clause. Any improvement in efficiency you somehow find would be offset by the increased difficulty of understanding the code.

FWIW I'm in camp (2)

LightRailTycoon
Mar 24, 2017
Run some EXPLAINs, see how the optimizer handles it, then choose the most readable.
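
For instance, a minimal sketch of what that comparison looks like (EXPLAIN syntax varies a little by engine; a/b come from the question above, and b.status is just a placeholder filter column):

SQL code:
-- variant 1: filter in the WHERE clause
EXPLAIN
SELECT a.*
FROM a
JOIN b ON a.x = b.x
WHERE b.status = 'active';

-- variant 2: filter folded into the ON clause
EXPLAIN
SELECT a.*
FROM a
JOIN b ON a.x = b.x AND b.status = 'active';
For an inner join you'd expect the two plans to come out identical.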

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE

abelwingnut posted:

theory question.

let's say you have two scripts to get you the same result set. one scripts the conditions via predicates in the WHERE clause, and the JOINs are simply ON a.x = b.x. the other scripts the conditions via the ON clauses in its JOINs, and has no WHERE clause.

does it matter at all? is one more efficient?

personally i've always done everything in the WHERE clause. it's just so much more readable. but maybe that comes at an expense to efficiency?

As nielsm says, for an inner join the ON and the WHERE clauses are logically equivalent. It is possible some wacky query planner might treat them differently but they have exactly the same effect.

For an outer join however they have completely different meanings and which one to use depends on what you want. In an outer join, a condition placed in the ON clause is applied before the join, while a condition in the WHERE clause is applied after the join. Probably easiest to explain what this means by example. Consider these two queries:

SQL code:
select *
from
    orders
    left join order_items on (order_items.order_id = orders.id)
where
    order_items.product_name = 'gizmo';
    
select *
from
    orders
    left join order_items on (
      order_items.order_id = orders.id
      and order_items.product_name = 'gizmo'
    );
The first one will find only orders that have an order_item with the product name 'gizmo', and only return order_items rows with that product name. The join is done first, then rows that don't match the condition are filtered out. The second one will find all orders, but only join in order_items rows with the product name 'gizmo'; orders without a gizmo order item will have null for the order_items columns, since there were no matching rows for the left join.

Basically use the first version when you want to filter out rows from the resultset entirely, use the second version when you want nulls instead. Fiddle for extra clarity: https://www.db-fiddle.com/f/7cFSepFbUuTjuA1jMiicd8/0

If you don't have pressing reasons to do this type of outer join shenanigans, please do as Hammerite and put the join condition in the ON clause and anything else in the WHERE clause. Putting join conditions in the WHERE clause is noisy and putting stuff in the ON clause that isn't a join condition and doesn't need to be there is mostly just confusing.

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy
i have a weird question based on some troubleshooting i was doing this morning with a coworker: say i have the following query

SELECT 1
FROM BILLING_SEGMENT bs
JOIN BILL b on bs.BillKey = b.BillKey
WHERE CAST(b.Bill_CompleteDttm as DATE) >= '2023-08-01'
AND bs.Month = 202308

Important detail: bs.Month is a char(6) column

The query, as written above, executes totally fine and returns 357,447 rows. However, when I change the date to 2023-07-01 instead of 2023-08-01, I mysteriously get:

"Conversion failed when converting the varchar value '0.0000' to data type int."

This is fixed by surrounding the 202308 in quotes, so the following query:

SELECT 1
FROM BILLING_SEGMENT bs
JOIN BILL b on bs.BillKey = b.BillKey
WHERE CAST(b.Bill_CompleteDttm as DATE) >= '2023-07-01'
AND bs.Month = '202308'

...runs totally fine, and returns 723,393 rows!

I was surprised to find that the very first query even ran at all without that month being explicitly made into a string, but what in the world is happening to cause it to suddenly stop working? Something about the number of rows returned causing the interpreter to misfire? The fix is easy but I'm mostly wondering from an academic perspective why it works sometimes but not others!

kumba fucked around with this message at 15:56 on Aug 16, 2023

monochromagic
Jun 17, 2023

Sounds to me like the data set returned for >'2023-07-01' has a bs.Month column with bad data, i.e. 0.0000 which as it rightly states cannot be cast to an integer. In any case it is more correct to compare char to char.

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

monochromagic posted:

Sounds to me like the data set returned for >'2023-07-01' has a bs.Month column with bad data, i.e. 0.0000 which as it rightly states cannot be cast to an integer. In any case it is more correct to compare char to char.

That was my first guess, but it does not! i can select distinct bs.Month for the timeframe and there's nothing weird at all - just a list of 6 character values that I would expect to see!

nielsm
Jun 1, 2009



I don't think SQL has any values (besides NULL) that must be written unquoted. Always quote your literal values, even ints and floats.

But also, make sure to mentally distinguish between numbers and numeric strings. A number is a value that has a single magnitude and you can do meaningful arithmetic on, such as use the AVG() function. Trying to use the AVG() function on your bs.Month column wouldn't produce a logically meaningful value, because it's not a proper number with a single magnitude, it's a composite value consisting of two magnitudes (year and month of year) that just happen to be formatted as a character string resembling a number.

So quote that literal value regardless, and if you're writing a program that does a prepared statement make sure the parameter type for comparing to bs.Month is a character type and not integer type.
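
A minimal sketch of both points, assuming SQL Server / T-SQL and the table names from the posts above (untested; @Month is just an illustrative parameter name):

SQL code:
-- char literal: the column is never implicitly converted to int
SELECT 1
FROM BILLING_SEGMENT bs
JOIN BILL b ON bs.BillKey = b.BillKey
WHERE CAST(b.Bill_CompleteDttm AS DATE) >= '2023-07-01'
  AND bs.Month = '202308';

-- parameterized variant: declare the parameter as a character type, not int
DECLARE @Month char(6) = '202308';
SELECT 1
FROM BILLING_SEGMENT bs
JOIN BILL b ON bs.BillKey = b.BillKey
WHERE CAST(b.Bill_CompleteDttm AS DATE) >= '2023-07-01'
  AND bs.Month = @Month;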

Lib and let die
Aug 26, 2004

kumba posted:

That was my first guess, but it does not! i can select distinct bs.Month for the timeframe and there's nothing weird at all - just a list of 6 character values that I would expect to see!

Does including a

code:
WHERE TRY_CAST(bs.Month AS Date) IS NULL AND bs.Month IS NOT NULL
return any specific results?

is the schema using something weird like SMALLDATETIME?

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

nielsm posted:

I don't think SQL has any values (besides NULL) that must be written unquoted. Always quote your literal values, even ints and floats.

But also, make sure to mentally distinguish between numbers and numeric strings. A number is a value that has a single magnitude and you can do meaningful arithmetic on, such as use the AVG() function. Trying to use the AVG() function on your bs.Month column wouldn't produce a logically meaningful value, because it's not a proper number with a single magnitude, it's a composite value consisting of two magnitudes (year and month of year) that just happen to be formatted as a character string resembling a number.

So quote that literal value regardless, and if you're writing a program that does a prepared statement make sure the parameter type for comparing to bs.Month is a character type and not integer type.

Oh yeah I 100% agree; this is mostly an academic question that piqued my curiosity. The original query was written by a co-worker who is very unfamiliar with SQL and didn't realize he needed to quote it, because the code ran fine in most circumstances without it. I was entirely surprised it ran at all as written.


Lib and let die posted:

Does including a

code:
WHERE TRY_CAST(bs.Month AS Date) IS NULL AND bs.Month IS NOT NULL
return any specific results?

Adding that condition didn't change anything, I still get the conversion failed error

Lib and let die posted:

is the schema using something weird like SMALLDATETIME?

I didn't notice this before, but the Bill_CompleteDttm field is actually a datetime2(0) field. Maybe this is somehow the culprit? I don't understand how it could be but that's why I'm here :D

(side note: I have no clue why this field is a datetime2(0) vs just a standard date field, nor do I understand really why we're casting it in the first place, but i did not create this database so who knows)

Hed
Mar 31, 2004

Fun Shoe
I'm coming from a SQL ORM (Django) but diving into the actual SQL because things aren't working and I need to get better at this layer (Postgres).

Here are my relationships:


Here's what I'm trying to do:

  • For a given commoditycontractseries
  • find all the commodityfuturescontracts we have a position on, and return the quantity we have on
  • where a position on a day is the summation of all the quantities for a given contract and reference_date

The SQL looks like this:

https://pastebin.com/hFUZ6DCZ (apologies for the pastebins, cloudflare really doesn't like me)

And it works, I get <id, year, month, series_id, quantity> tuples where the quantity is correct.

So now I want to do:

  • Same as above, but enrich each contract for which we have a quantity with
  • average_volume, an average of the last N entries of market_data_futuresdataentries. In other words, if I wanted the position reference_date for July 19, 2023, I'd also want to average the volume for those same contracts from reference_date <= '2023-07-19' LIMIT 5. Basically, the last 5 days of data.

So I can do just the average, and it works: https://pastebin.com/JEnxqiqA

When I combine these two I end up getting results, but the quantity is some multiple of the correct value, while the averages are correct. Changing the SUM to SUM (DISTINCT ...) gets different but also wrong numbers. I feel like my SUM ... AS "quantity" is considering way more than the reference_date in the WHERE clause but unsure how.

Here's my pastebin for this: https://pastebin.com/LZDDwWcQ

Where should I be looking for the issue?

Hed fucked around with this message at 22:04 on Aug 18, 2023

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
Yeah that's some Django generated stuff alright. Been there, been annoyed, wrote the raw SQL instead because it was so hard to get it to do what I wanted.

The first query does what you want but in an extremely dumb way. It's doing the same calculation twice. First it's got the main query that finds the commodityfuturescontracts rows, then it's got a subquery that calculates the quantity for each reconciliation_positionfilefuturesentry associated with that contract. The thing is though that the subquery is returning exactly the same resultset as the outer query. Here's that first query with some comments:

SQL code:
select distinct -- distinct presumably necessary because one contract can have many positionfilefuturesentry
    core_commodityfuturescontract.id,
    core_commodityfuturescontract.year,
    core_commodityfuturescontract.month,
    core_commodityfuturescontract.series_id,
    (
        select SUM(U1.quantity) AS quantity
        from
            core_commodityfuturescontract U0 -- same table as in the outer query, but with an alias
            inner join reconciliation_positionfilefuturesentry U1 
                ON (U0.id = U1.contract_id) -- same join as in the outer query
        where (
            U1.reference_date = date '2023-07-19' -- same predicate as in the outer query
            AND U0.id = (core_commodityfuturescontract.id) -- ensures we find the same rows as in the outer query
    )) AS quantity
from
    core_commodityfuturescontract
    inner join reconciliation_positionfilefuturesentry
        ON (core_commodityfuturescontract.id = reconciliation_positionfilefuturesentry.contract_id)
where (
    core_commodityfuturescontract.series_id = 2
    AND reconciliation_positionfilefuturesentry.reference_date = date '2023-07-19')
If we remove the duplication we can reduce it to just this:

SQL code:
select
    core_commodityfuturescontract.id,
    core_commodityfuturescontract.year,
    core_commodityfuturescontract.month,
    core_commodityfuturescontract.series_id,
    sum(reconciliation_positionfilefuturesentry.quantity) as quantity
from
    core_commodityfuturescontract
    inner join reconciliation_positionfilefuturesentry
        on (core_commodityfuturescontract.id = reconciliation_positionfilefuturesentry.contract_id)
where
    core_commodityfuturescontract.series_id = 2
    and reconciliation_positionfilefuturesentry.reference_date = date '2023-07-19'
group by
    core_commodityfuturescontract.id
Side note: in many databases, if you have a GROUP BY and an aggregate, all projected columns must either be aggregated or named in the GROUP BY clause or you get an error, but postgres is apparently smart enough these days to figure out that if you group by a primary key then the other columns selected from that table are implicitly grouped too.

Moving on, the following two queries are mostly nonsense. I'm not going to attempt to correct them, they're kinda in the "not even wrong" territory, they're just not doing anything like what you actually want. You can't get Django to spit out a query that does what you want with any reasonable amount of effort. You'll basically want the above query plus some sort of subquery to calculate the average_volume of each contract, but Django isn't even remotely close to doing what you want. Doing a partial aggregate like the one you describe (for each row in the main resultset, find a limited number of correlated rows in a different table ordered by some criteria and aggregate over them) requires going deeper into the SQL weeds than Django really supports.

Basically the problem you have here is that you have two aggregation operations in the same query but they're aggregating over completely different resultsets. The quantity comes from aggregating over the full set of reconciliation_positionfilefuturesentry rows related to a contract, but the average_volume comes from aggregating over a limited number of market_data_futuresdataentry rows. If you drag both of these two correlated tables into the same resultset you'll get a lot more rows to aggregate over than you actually want. There are a couple of ways to tackle this problem, but all of them involve splitting the query into parts and aggregating the parts separately. Here's a variant using a CTE:

SQL code:
with contracts as (
    select
        core_commodityfuturescontract.id,
        core_commodityfuturescontract.year,
        core_commodityfuturescontract.month,
        core_commodityfuturescontract.series_id,
        sum(reconciliation_positionfilefuturesentry.quantity) as quantity
    from
        core_commodityfuturescontract
        inner join reconciliation_positionfilefuturesentry
            on (core_commodityfuturescontract.id = reconciliation_positionfilefuturesentry.contract_id)
    where
        core_commodityfuturescontract.series_id = 2
        and reconciliation_positionfilefuturesentry.reference_date = date '2023-07-19'
    group by
        core_commodityfuturescontract.id
)
select
    contracts.id,
    contracts.year,
    contracts.month,
    contracts.series_id,
    contracts.quantity,
    avg(market_data_futuresdataentry.quantity) as average_volume
from
    market_data_futuresdataentry
    inner join contracts on 
        (market_data_futuresdataentry.contract_id = contracts.id)
where
    market_data_futuresdataentry.reference_date between (date '2023-07-19' - interval '5 days') and date '2023-07-19'
group by
    contracts.id,
    contracts.year,
    contracts.month,
    contracts.series_id,
    contracts.quantity
(not tested, can't be bothered to set up a fiddle with fake data right now)

A possibly more elegant way is to remove the aggregation from the main query altogether and just rely on lateral subqueries for the summing instead. Need to be lateral subqueries because we're referencing columns from the outer query in the subqueries; can't do that without the LATERAL keyword. Would look something like this:

SQL code:
select
    core_commodityfuturescontract.id,
    core_commodityfuturescontract.year,
    core_commodityfuturescontract.month,
    core_commodityfuturescontract.series_id,
    sum_of_positions_on_given_date.quantity as quantity,
    volume_last_5_days.average_volume as average_volume
from
    core_commodityfuturescontract
    cross join lateral (
        select sum(reconciliation_positionfilefuturesentry.quantity) as quantity
        from reconciliation_positionfilefuturesentry
        where
            core_commodityfuturescontract.id = reconciliation_positionfilefuturesentry.contract_id
            and reconciliation_positionfilefuturesentry.reference_date = date '2023-07-19'
    ) as sum_of_positions_on_given_date
    cross join lateral (
        select avg(market_data_futuresdataentry.quantity) as average_volume
        from market_data_futuresdataentry
        where
            core_commodityfuturescontract.id = market_data_futuresdataentry.contract_id
            and market_data_futuresdataentry.reference_date between 
                (date '2023-07-19' - interval '5 days')
                and date '2023-07-19'
    ) as volume_last_5_days
(probably needs a bit more polish to actually work as intended; might need a semi-join on reconciliation_positionfilefuturesentry to make sure the outer query is returning only the appropriate contracts)

I doubt you can get Django to spit out either of these two above, but perhaps you could - with sufficient .Annotate() shenanigans - get it to generate this:

SQL code:
select
    core_commodityfuturescontract.id,
    core_commodityfuturescontract.year,
    core_commodityfuturescontract.month,
    core_commodityfuturescontract.series_id,
    sum(reconciliation_positionfilefuturesentry.quantity) as quantity,
    (
        select
            avg(market_data_futuresdataentry.quantity)
        from market_data_futuresdataentry
        where
            core_commodityfuturescontract.id = market_data_futuresdataentry.contract_id
            and market_data_futuresdataentry.reference_date between 
                (date '2023-07-19' - interval '5 days')
                and date '2023-07-19'
    ) as average_volume
from
    core_commodityfuturescontract
    inner join reconciliation_positionfilefuturesentry
        on (core_commodityfuturescontract.id = reconciliation_positionfilefuturesentry.contract_id)
where
    core_commodityfuturescontract.series_id = 2
    and reconciliation_positionfilefuturesentry.reference_date = date '2023-07-19'
group by
    core_commodityfuturescontract.id
You might have to put average_volume in the GROUP BY clause though, I'm genuinely not sure if postgres will complain about it or not.

Maybe you could do it with some window function nonsense too but I'll have to stop here. Doesn't help you with the Django part either though.

BTW the trick to bypassing the cloudflare nonsense seems to be to make a newline between the select and the projection list. In the last query it complained about select avg(... until I made it select<newline>avg(.... I had to basically bisect the post to find the issue.

Oh, and one last note: your post sort of implies "last 5 entries" from market_data_futuresdataentry is equivalent to "last five days of data for this contract", so I exploited that in all of the queries above to avoid the LIMIT 5 and ordering stuff. If that isn't actually the case and you really want exactly five entries ordered by timestamp before a given date, then things get a bit more annoying and the lateral join variant is probably your best bet, except it won't quite work as written.
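
For reference, a rough sketch of what that "exactly five entries" variant might look like, replacing the second cross join lateral in the lateral-join query above (untested, same column names assumed):

SQL code:
    cross join lateral (
        select avg(last5.quantity) as average_volume
        from (
            select market_data_futuresdataentry.quantity
            from market_data_futuresdataentry
            where
                market_data_futuresdataentry.contract_id = core_commodityfuturescontract.id
                and market_data_futuresdataentry.reference_date <= date '2023-07-19'
            order by market_data_futuresdataentry.reference_date desc
            limit 5
        ) as last5
    ) as volume_last_5_days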

e: sorry, I just suck at explaining. I might try to straighten some of this out tomorrow

TheFluff fucked around with this message at 01:23 on Aug 20, 2023

Lib and let die
Aug 26, 2004

I will occasionally tap into ChatGPT when I need my statements to bridge the gap between "clever" and "fancy", and did so this morning when working with some data exported from Salesforce. My goal was to take each Account with a type of 'Household' from the Accounts file and populate, in-line, the primary household member (indicated in the one2oneContact field as per usual SF) and all associated members of the household identified by the [AccountId] in the Contacts file. The insert script for this particular item in our database is built around the expectation that we are getting data in the form of [HouseholdID][Member1][Member2][Member3] and so on - suffice to say it needs some serious reworking, but that's neither here nor there.

Anyway, chatGPT gave me a fine output. It might not be the most efficient way to do it, but it gives me the results I want. I've got to export the results to excel and do a text to columns then re-import it, but it's still way less annoying than it could be.

My question, ultimately, is with ChatGPT's followup 'notes' or whatever.

quote:

In this query, the ContactConcatenation CTE uses the STUFF and FOR XML PATH('') technique to concatenate the secondary contact IDs into a single column.

What exactly is the "STUFF and FOR XML PATH('') technique" and what is it doing?

I can provide the full output here, but the relevant code bit is this:

code:
SELECT
        pc.AccountID,
        STUFF((SELECT ', ' + CAST(sc.ContactID AS VARCHAR(500))
               FROM SecondaryContacts sc
               WHERE pc.AccountID = sc.AccountID
               FOR XML PATH('')), 1, 2, '') AS SecondaryContactIDs
    FROM PrimaryContacts pc
pc and sc are aliases assigned to CTEs for primary and secondary contacts defined a bit further up in the output, for reference

Hed
Mar 31, 2004

Fun Shoe

TheFluff posted:

Yeah that's some Django generated stuff alright. Been there, been annoyed, wrote the raw SQL instead because it was so hard to get it to do what I wanted...

holy crap, thank you! The lateral cross joins will take me a bit to fully understand, I haven't seen them. The reduced queries (first couple in the reply) are straightforward and run faster, though. I'm at the point where I could do 2 queries and join it in Python if needed. But, I will take another look.

The only adaptation I need to make now is that I do, in fact, need "the last 5 entries of data". Apologies for implying that I could do a date range thing; I wouldn't have data for dates the market is closed for weekends/holidays, which is why I was doing LIMIT 5. However, the Python that calls this could pretty easily have the market calendars in memory, guaranteeing I could do a BETWEEN with two actual dates that we know span 5 entries. That's probably the easier way to do it, since the "get the 5 entries with reference date less-than-or-equal, sorted by time" approach involves a lot more join machinery.

Generic Monk
Oct 31, 2011

Lib and let die posted:

My question, ultimately, is with ChatGPT's followup 'notes' or whatever.

What exactly is the "STUFF and FOR XML PATH('') technique" and what is it doing?

I can provide the full output here, but the relevant code bit is this:

code:
SELECT
        pc.AccountID,
        STUFF((SELECT ', ' + CAST(sc.ContactID AS VARCHAR(500))
               FROM SecondaryContacts sc
               WHERE pc.AccountID = sc.AccountID
               FOR XML PATH('')), 1, 2, '') AS SecondaryContactIDs
    FROM PrimaryContacts pc
pc and sc are aliases assigned to CTEs for primary and secondary contacts defined a bit further up in the output, for reference

It's a way to work around the lack of a GROUP_CONCAT function (returning the rows of a query result stitched together in one row) in older versions of MS SQL Server.

FOR XML in SQL Server allows you to return the results of a query as XML elements, i.e. text that is marked up with XML tags. Adding the PATH('') modifier replaces the tags with nothing, so you end up with all the rows concatted together. You want a comma-separated list of values, so each value in the subquery begins with a comma and a space (the SELECT ', ' + part). But you don't want the result to begin with a comma and a space, so the STUFF function takes 2 characters of the string, starting from position 1, and replaces them with nothing (the 1, 2, '' at the end are the start position, the number of characters, and a blank string (nothing), respectively).

It's just an ugly hack to work around the lack of a purpose-built function that does that in SQL Server. SQL Server 2017 and later have the STRING_AGG function, which does what you want in a more intuitive and readable way. You may or may not be able to use it, but the way ChatGPT has done it is the 'classic' way that will be all over the internet, hence why it recommended it :)
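
For comparison, a rough sketch of the same thing with STRING_AGG (assuming SQL Server 2017+ and the same CTE aliases as the snippet above; untested):

SQL code:
SELECT
        pc.AccountID,
        (SELECT STRING_AGG(CAST(sc.ContactID AS VARCHAR(500)), ', ')
         FROM SecondaryContacts sc
         WHERE pc.AccountID = sc.AccountID) AS SecondaryContactIDs
    FROM PrimaryContacts pc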

Lib and let die
Aug 26, 2004

And there's the problem with ChatGPT: I can use STRING_AGG lmfao

Generic Monk
Oct 31, 2011

Lib and let die posted:

And there's the problem with ChatGPT: I can use STRING_AGG lmfao

It's a great resource, it just behoves you to sanity check what it tells you. Which is what you're doing, so sounds like you're using it correctly :)

Just-In-Timeberlake
Aug 18, 2003

Lib and let die posted:

And there's the problem with ChatGPT: I can use STRING_AGG lmfao

then you tell it that and to use STRING_AGG and it will

Lib and let die
Aug 26, 2004

Yeah I could have really improved my prompt there in retrospect.

Generic Monk
Oct 31, 2011

e: I'm an idiot and overcomplicating a simple problem LOL

Generic Monk fucked around with this message at 17:35 on Aug 24, 2023

credburn
Jun 22, 2016
President, Founder of the Brent Spiner Fan Club
Hey, gang!

I've been smashing my head against this code for the last week and am finally at the end of it. I'm taking an online course and unfortunately, if I need any help, I have to either figure it out myself, e-mail my course instructor and wait 12 days for a response, or come begging on these forums for some guidance.

The only thing not working at this point is there is a trigger that needs to fire that does not seem to be.

The goals in their entirety (so as to perhaps explain some parts of the code) are:

1. Find the top ten most popular districts in a DVD rental database
2. In those districts, find the most popular genre of films being rented
3. Create a detail table that lists the district, genre and genre count
4. Create a summary table that lists the district, genre, and a ranking of the top three genres in each district
5. Create a trigger that will update summary_table whenever detail_table is updated

Everything works but Step number 5. I've run this through Chat-GPT and it says everything looks fine. I can see the trigger is there in postgres, and the code produces no errors. But when I update the details_table (by changing the number of rented action films to 0), the summary_table does not change.

Here is my code...

https://pastebin.com/3JpQ5RYh

Thank you for any advice :|

credburn fucked around with this message at 02:01 on Aug 26, 2023

redleader
Aug 18, 2005

Engage according to operational parameters

credburn posted:

Everything works but Step number 5. I've run this through Chat-GPT and it says everything looks fine. I can see the trigger is there in postgres, and the code produces no errors. But when I update the details_table (by changing the number of rented action films to 0), the summary_table does not change.

Here is my code...

https://pastebin.com/DmQptSpy

Thank you for any advice :|

i don't speak postgres, but from a quick google id expect there to be a CREATE TRIGGER statement somewhere. that is, a statement to wire up that function to an actual trigger on the table

credburn
Jun 22, 2016
President, Founder of the Brent Spiner Fan Club

redleader posted:

i don't speak postgres, but from a quick google id expect there to be a CREATE TRIGGER statement somewhere. that is, a statement to wire up that function to an actual trigger on the table

Whoops; yes, you're right. I accidentally omitted that in the example but it's been updated. It still doesn't appear to update the summary_table :(

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

credburn posted:

Whoops; yes, you're right. I accidentally omitted that in the example but it's been updated. It still doesn't appear to update the summary_table :(

To be sure, you did this right?
code:
-- Creating the trigger that will update the summary_table every time detail_table is updated:
CREATE TRIGGER tr_UpdateSummaryTable
AFTER UPDATE ON detail_table
FOR EACH ROW
EXECUTE FUNCTION UpdateSummaryTableTriggerFunction();
Might want to add the following to your trigger function to clean the summary as well...
code:
...
BEGIN
  DELETE FROM summary_table;

  INSERT INTO summary_table (district, genre_rank, genre)
...

kiwid
Sep 30, 2013

Two questions regarding SQL Server.

First, is Ola Hallengren's "IndexOptimize" job the same thing as rebuilding indexes and updating statistics scripts or do I still need to run these?

Second, I'm migrating from one SQL Server to another. Actually, I've finished migrating everything, settings, databases, logins, etc. with the exception of SSIS.

Does anyone know the best way to migrate the SSIS catalog and all the jobs and poo poo that go with it? Do I need to manually open up the SSIS packages to update connections with the new server name?

MrMoo
Sep 14, 2000

Stupid question ahoy. If I have a simple soft delete query, i.e.
SQL code:
UPDATE assets SET is_deleted = 1 WHERE id = ?;
I can perform the basic "zero-write optimisation" as such,
SQL code:
UPDATE assets SET is_deleted = 1 WHERE id = ? AND is_deleted = 0;
However if I want the caller to be able to distinguish between whether the row exists and whether it was already soft deleted, is there a recommended form for that? In pseudo SQL that could be:
SQL code:
UPDATE assets SET is_deleted = 1 WHERE id = ? AND is_deleted = 0;
CASE WHEN (changes() = 0) THEN
  SELECT COUNT(*) AS is_exists, is_deleted FROM assets WHERE id = ?
ELSE
  SELECT 1 AS is_exists, 1 AS is_deleted
END;
I'm basically writing some unit tests and seeing whether I can raise HTTP/404 and HTTP/410 as separate responses. The non-transaction safe form would be to query on the asset ID if the UPDATE statement returned zero changes.

Currently targeting SQLite, open to learn for other vendor dialects.

The obvious answer is to not bother making the distinction, but it's nice to see if it is possible and practical.

MrMoo fucked around with this message at 04:11 on Oct 7, 2023

redleader
Aug 18, 2005

Engage according to operational parameters
does your sqlite wrapper expose one of the sqlite3_changes functions?

looks like another option might be the count_changes pragma, although this is deprecated - by what, i'm not sure. it definitely seems like a nicer way to get the affected row count than sqlite3_changes
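
if it helps, there's also a SQL-level equivalent, the changes() function. A minimal sketch against the assets table from above:

SQL code:
UPDATE assets SET is_deleted = 1 WHERE id = 1 AND is_deleted = 0;
SELECT changes();  -- 1 if a row was just soft-deleted, 0 if it was missing or already deleted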

MrMoo
Sep 14, 2000

The problem appears to be more about reading a column value before an UPDATE is applied. With Postgres this is possible, but Sqlite likes to only return the updated value.

https://stackoverflow.com/questions/7923237/return-pre-update-column-values-using-sql-only

i.e. the following is fine in PGSQL:
SQL code:
UPDATE assets SET is_deleted = 1 WHERE id = 1
RETURNING (SELECT is_deleted FROM assets WHERE id = 1);

redleader
Aug 18, 2005

Engage according to operational parameters
oooh, yeah, i get it now. have you tried the solution in the accepted answer on that question? i think that could work for you, if sqlite allows it. the rowset will be empty if the id doesn't exist, and you'd get the old value of is_deleted to see if it had been previously deleted or not

code:

UPDATE assets x
SET is_deleted = 1
FROM assets y
WHERE x.id = y.id
    AND x.id = 3
RETURNING y.is_deleted AS old_is_deleted;

you'll no doubt need to tweak the syntax, because sql is a vibe and not a standard

MrMoo
Sep 14, 2000

Alas, sqlite doesn't like that for some reason, need to check why, https://www.sqlite.org/lang_update.html

SQL code:
UPDATE demo AS x
SET name = "frog"
FROM demo AS y
WHERE x.id = y.id
    AND x.id = 3
RETURNING y.name AS old_name;
https://sqliteonline.com/

Raises "no such column: y.name" 🤷‍♀️ Doesn't seem to understand table alias in the RETURNING clause.

There are three conditions:
  • If id matches and is not already soft-deleted. This is detectable via changes() = 1 if there is the additional WHERE is_deleted = 0 clause.
  • If id matches and is already soft-deleted. The basic logic would be: changes() = 0, then verify that is_deleted = 1. Otherwise, the PGSQL syntax above for detecting the value of is_deleted before the UPDATE would achieve the same goal.
  • If id does not match any rows, i.e. changes() = 0 and a SELECT COUNT(*) FROM assets WHERE id = 1 would be zero.

I think that just leaves this:
SQL code:
BEGIN TRANSACTION;
SELECT name FROM demo WHERE id = 3;
UPDATE demo SET name = "frog" WHERE id = 3 AND name <> "frog";
COMMIT;
Remembering that this version still returns "frog":
SQL code:
UPDATE demo
SET name = 'frog'
WHERE id = 3 AND name <> 'frog'
RETURNING name;
Which I guess is explained by this:

quote:

The first prototype of the RETURNING clause returned values as they were generated. That approach used less memory, but it had other problems:

If the calls to sqlite3_step() for two or more DML statements were interleaved and if one of the statements hit a constraint failure and aborted, reverting its changes, then that could disrupt the operation of the other DML statement. This could not corrupt the database file, but it could yield surprising and difficult-to-explain results in the database.

If an application failed to call sqlite3_step() repeatedly until it received SQLITE_DONE, then some of the database changes might never occur.

The order of operations was different from client/server database engines like PostgreSQL, which might cause portability issues for some applications.

For these reasons, the current implementation was modified so that all database changes happen before any RETURNING output is emitted.
https://www.sqlite.org/lang_returning.html

MrMoo fucked around with this message at 18:28 on Oct 7, 2023

nielsm
Jun 1, 2009



The page you're quoting has one more thing at the very end:

quote:

7. The RETURNING clause may only reference the table being modified. In an UPDATE FROM statement, the auxiliary tables named in the FROM clause may not participate in the RETURNING clause.


Edit: In MS SQL it's easier, with their proprietary OUTPUT clause. You can use the pseudo-table names "deleted" and "inserted" in it to return values from the pre-update and post-update rows, respectively.
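
A minimal sketch of that, using the assets table from earlier (untested; @id just stands in for whatever parameter mechanism you use). No row returned means the id doesn't exist (404), old value 1 means it was already soft-deleted (410):

SQL code:
UPDATE assets
SET is_deleted = 1
OUTPUT deleted.id,
       deleted.is_deleted AS was_already_deleted,
       inserted.is_deleted AS is_deleted_now
WHERE id = @id;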

nielsm fucked around with this message at 18:37 on Oct 7, 2023

Roadie
Jun 30, 2013
God, I wish MERGE in Postgres supported RETURNING. I've got this awful mess of like 5 INSERT/UPDATE/SELECT CTEs (because of two separate unique indexes that I can't change for unrelated reasons, so I can't just ON CONFLICT) that I could have replaced with a single MERGE, except that I need the returned ID in the insert/upsert case.

Roadie fucked around with this message at 18:11 on Oct 24, 2023

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Roadie posted:

God, I wish MERGE in Postgres supported RETURNING. I've got this awful mess of like 5 INSERT/UPDATE/SELECT CTEs (because of two separate unique indexes that I can't change for unrelated reasons, so I can't just ON CONFLICT) that I could have replaced with a single MERGE, except that I need the returned ID in the insert/upsert case.

*I can't read, ignore me.

Mind_Taker
May 7, 2007



A few design/best practices questions:

1. Do you name your tables using singular or plural nouns? E.g. Customer or Customers.

2. Do you use snake case, camel case, or another case to name columns? E.g. first_name or FirstName?

3. Do you prefix a table's id column name with the table name or just leave it as id? E.g. id or customer_id


I've always done plural, snake case, and id respectively but apparently the new team I have joined has used the alternatives.

If it matters the DB is the backend to an API written in C# with Entity Framework in a DB first workflow. They have said this drives some of their opinions but these opinions are all very weird to me.

I guess I'm wondering if, aside from personal preference, there is any reason to go with one vs. the other in the above items.

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
None of those things matters as long as you are consistent

DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
in postgres it is common to use snake_case instead of camelCase, because postgres does automatic case folding of all identifiers to lowercase. you declare the column fooBar, the actual column created is foobar. feel free to call it fooBar in all your queries, postgres will just silently convert that to foobar and it works. if you really want to name the column fooBar with the caps, you can use double quotes to disable case folding: "fooBar". great, you've created a camelCase column. but now you must refer to it as "fooBar" each time, with the quotes, because plain fooBar gets case folded to foobar, and that's not the name of the column. and since in postgres we don't want to sprinkle quotes all over the place, we use snake_case. (microsofties seem to have no problem with the [mssql].[nonsense]...)
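
a quick sketch of the folding behaviour (postgres, hypothetical table names):

SQL code:
CREATE TABLE demo (fooBar int);              -- actually creates a column named foobar
SELECT fooBar, FooBar, "foobar" FROM demo;   -- all fine: unquoted names fold to foobar
-- SELECT "fooBar" FROM demo;                -- error: column "fooBar" does not exist

CREATE TABLE demo2 ("fooBar" int);           -- quoted: the column really is fooBar
SELECT "fooBar" FROM demo2;                  -- works, but you must quote it every time
-- SELECT fooBar FROM demo2;                 -- error: folds to foobar, which doesn't exist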
