|
mortarr posted:A lot of Data Engineering jobs mention python - where should I start? I've done a lot of data related dev stuff, almost all on c#/ms SQL if that helps? Looking at one of the Azure certs mentioned above too.

Our humble SA thread: https://forums.somethingawful.com/showthread.php?threadid=3812541

- https://www.reddit.com/r/inventwithpython/
- https://www.reddit.com/r/learnpython/
- https://www.reddit.com/r/Python/
- https://www.reddit.com/r/PythonNoobs/

Books
- Python Crash Course
- Automate the Boring Stuff with Python
- Think Python

Youtube
- Indently
- Tech With Tim
- mCoding
- ArjanCodes
- Python Simplified

Online Learning
- Codecademy
- Real Python
- The Official Guide
- DataCamp

Discussion
- Official Python IRC
- Discord

Brain-busters
- LeetCode
- Practice Python
- HackerRank
- Edabit
|
# ? Aug 1, 2023 22:37 |
|
|
|
Seventh Arrow posted:Great resources... This is some good stuff and should definitely be pinned in the OP, and the new data thread (if someone makes it).
|
# ? Aug 2, 2023 02:58 |
|
Anyone here work with Spark on a routine basis? If so, what's the nature of the work and how do you like it? I'm studying up for the Databricks Spark Developer cert but unfortunately don't have much excuse to use it in my current role.
|
# ? Aug 3, 2023 03:10 |
|
theory question. let's say you have two scripts that get you the same result set. one scripts the conditions via predicates in the WHERE clause, and the JOINs are simply ON a.x = b.x. the other scripts the conditions via the ON statements in its JOINs, and has no WHERE clause. does it matter at all? is one more efficient? personally i've always done everything in the WHERE clause. it's just so much more readable. but maybe that comes at an expense to efficiency?

abelwingnut fucked around with this message at 19:10 on Aug 11, 2023 |
# ? Aug 11, 2023 19:04 |
If you were doing left (or right) joins instead of inner joins, then they would not necessarily be equivalent.
|
|
# ? Aug 11, 2023 19:24 |
|
abelwingnut posted:let's say you have two scripts to get you the same result set. one scripts the conditions via predicates in the WHERE clause, and the JOIN are simply ON a.x = b.x. the other scripts the conditions via the ON statements in its JOINs, and has no WHERE clause.

the usual positions people strike on this are (1) they like to do everything in the WHERE clause or (2) they like to put join logic in ON clauses and filtering logic in the WHERE clause. Afaik it's uncommon to advocate for (3) putting filtering logic in ON clauses and avoiding having a WHERE clause at all. With good reason, since filtering unrelated to the JOIN logic is expressly what the WHERE clause is for, and to hide it in ON clauses is to obfuscate what your query is doing. I would find writing SQL like that extremely weird.

I feel like your question is pretty abstract, so it amounts to asking: does there exist a pair of queries - one putting filtering logic in the WHERE clause and one putting it in ON clauses - where the queries would give identical results but with a difference in efficiency? I feel like the answer morally ought to be "no there does not", but I couldn't begin to prove it, and as I said, it's very abstract so maybe there is a way you could come up with one.

It's a totally academic question anyway because, as I said, I don't think people generally advocate for putting filtering logic in the ON clause. Any improvement in efficiency you somehow find would be offset by the increased difficulty of understanding the code. FWIW I'm in camp (2)
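For inner joins at least, it's easy to check that the two placements return identical rows. A quick sqlite3 sketch (all table and column names invented for illustration):

```python
import sqlite3

# Toy schema to compare the two predicate placements (names are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, val TEXT);
    CREATE TABLE b (x INTEGER, flag INTEGER);
    INSERT INTO a VALUES (1, 'keep'), (2, 'drop'), (3, 'keep');
    INSERT INTO b VALUES (1, 1), (2, 1), (3, 0);
""")

# Style 1: join condition in ON, filtering in WHERE.
where_style = conn.execute(
    "SELECT a.x FROM a JOIN b ON a.x = b.x WHERE b.flag = 1 ORDER BY a.x"
).fetchall()

# Style 2: everything in the ON clause, no WHERE.
on_style = conn.execute(
    "SELECT a.x FROM a JOIN b ON a.x = b.x AND b.flag = 1 ORDER BY a.x"
).fetchall()

print(where_style == on_style)  # True -- identical result sets
```

For an INNER join the two predicates are logically interchangeable, so any reasonable planner is free to treat the queries identically.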
|
# ? Aug 11, 2023 19:25 |
|
Run some EXPLAINs, see how the optimizer handles it, then choose the most readable.
|
# ? Aug 12, 2023 00:13 |
|
abelwingnut posted:theory question.

As nielsm says, for an inner join the ON and the WHERE clauses are logically equivalent. It is possible some wacky query planner might treat them differently, but they have exactly the same effect.

For an outer join, however, they have completely different meanings, and which one to use depends on what you want. In an outer join, a condition placed in the ON clause is applied before the join, while a condition in the WHERE clause is applied after the join. Probably easiest to explain what this means by example. Consider these two queries: SQL code:
Basically use the first version when you want to filter out rows from the resultset entirely, use the second version when you want nulls instead. Fiddle for extra clarity: https://www.db-fiddle.com/f/7cFSepFbUuTjuA1jMiicd8/0 If you don't have pressing reasons to do this type of outer join shenanigans, please do as Hammerite and put the join condition in the ON clause and anything else in the WHERE clause. Putting join conditions in the WHERE clause is noisy and putting stuff in the ON clause that isn't a join condition and doesn't need to be there is mostly just confusing.
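The two queries referenced above didn't survive the scrape, but the ON-versus-WHERE distinction for outer joins is easy to reproduce. A hedged sqlite3 sketch with an invented customer/orders schema (not the original example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 100);  -- bob has no orders
""")

# Condition in WHERE: applied AFTER the join, so bob's NULL row is filtered out.
in_where = conn.execute("""
    SELECT c.name, o.amount FROM customer c
    LEFT JOIN orders o ON c.id = o.customer_id
    WHERE o.amount > 50
    ORDER BY c.id
""").fetchall()

# Condition in ON: applied BEFORE the join, so bob survives with NULLs.
in_on = conn.execute("""
    SELECT c.name, o.amount FROM customer c
    LEFT JOIN orders o ON c.id = o.customer_id AND o.amount > 50
    ORDER BY c.id
""").fetchall()

print(in_where)  # [('alice', 100)]
print(in_on)     # [('alice', 100), ('bob', None)]
```

The WHERE version drops the unmatched row entirely; the ON version keeps it with NULLs, exactly the "filter out rows vs. get nulls instead" distinction described above.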
|
# ? Aug 12, 2023 04:23 |
|
i have a weird question based on some troubleshooting i was doing this morning with a coworker: say i have the following query

SELECT 1
FROM BILLING_SEGMENT bs
JOIN BILL b ON bs.BillKey = b.BillKey
WHERE CAST(b.Bill_CompleteDttm AS DATE) >= '2023-08-01'
AND bs.Month = 202308

Important detail: bs.Month is a char(6) column.

The query, as written above, executes totally fine and returns 357,447 rows. However, when I change the date to 2023-07-01 instead of 2023-08-01, I mysteriously get:

"Conversion failed when converting the varchar value '0.0000' to data type int."

This is fixed by surrounding the 202308 in quotes, so the following query:

SELECT 1
FROM BILLING_SEGMENT bs
JOIN BILL b ON bs.BillKey = b.BillKey
WHERE CAST(b.Bill_CompleteDttm AS DATE) >= '2023-07-01'
AND bs.Month = '202308'

...runs totally fine, and returns 723,393 rows! I was surprised to find that the very first query even ran at all without that month being explicitly made into a string, but what in the world is happening to cause it to suddenly stop working? Something with the amount of rows returned causing the interpreter to misfire? The fix is easy but I'm mostly wondering from an academic perspective why it works sometimes but not others!

kumba fucked around with this message at 15:56 on Aug 16, 2023 |
# ? Aug 16, 2023 15:00 |
|
Sounds to me like the data set returned for >'2023-07-01' has a bs.Month column with bad data, i.e. 0.0000, which, as it rightly states, cannot be cast to an integer. In any case it is more correct to compare char to char.
|
# ? Aug 16, 2023 15:33 |
|
monochromagic posted:Sounds to me like the data set returned for >'2023-07-01' has a bs.Month column with bad data, i.e. 0.0000 which as it rightly states cannot be cast to an integer. In any case it is more correct to compare char to char. That was my first guess, but it does not! i can select distinct bs.Month for the timeframe and there's nothing weird at all - just a list of 6 character values that I would expect to see!
|
# ? Aug 16, 2023 15:55 |
I don't think SQL has any values (besides NULL) that must be written unquoted. Always quote your literal values, even ints and floats. But also, make sure to mentally distinguish between numbers and numeric strings. A number is a value that has a single magnitude and you can do meaningful arithmetic on, such as use the AVG() function. Trying to use the AVG() function on your bs.Month column wouldn't produce a logically meaningful value, because it's not a proper number with a single magnitude, it's a composite value consisting of two magnitudes (year and month of year) that just happen to be formatted as a character string resembling a number. So quote that literal value regardless, and if you're writing a program that does a prepared statement make sure the parameter type for comparing to bs.Month is a character type and not integer type.
|
|
# ? Aug 16, 2023 16:11 |
|
kumba posted:That was my first guess, but it does not! i can select distinct bs.Month for the timeframe and there's nothing weird at all - just a list of 6 character values that I would expect to see! Does including a code:
is the schema using something weird like SMALLDATETIME?
|
# ? Aug 16, 2023 16:16 |
|
nielsm posted:I don't think SQL has any values (besides NULL) that must be written unquoted. Always quote your literal values, even ints and floats.

Oh yeah I 100% agree; this is mostly an academic question that piqued my curiosity, as I have a co-worker who is very unfamiliar with SQL that wrote the original query and didn't realize he needed to quote it, because the code ran fine in most circumstances without it. I was entirely surprised it ran at all as written.

Lib and let die posted:Does including a

Adding that condition didn't change anything, I still get the conversion failed error.

Lib and let die posted:is the schema using something weird like SMALLDATETIME?

I didn't notice this before, but the Bill_CompleteDttm field is actually a datetime2(0) field. Maybe this is somehow the culprit? I don't understand how it could be, but that's why I'm here.

(side note: I have no clue why this field is a datetime2(0) vs just a standard date field, nor do I understand really why we're casting it in the first place, but i did not create this database so who knows)
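One plausible mechanism for the now-you-see-it behaviour, sketched in plain Python (the data and the evaluation order are invented; SQL Server's actual plans will differ): the optimizer is free to evaluate predicates in any order, so the implicit char-to-int conversion only blows up if a bad row survives the other filters long enough to reach it.

```python
# Each row: (complete_date, month_value). One row has garbage in the
# char column -- exactly the kind of value that breaks an int conversion.
rows = [
    ("2023-08-02", "202308"),
    ("2023-07-15", "0.0000"),  # bad data, only reachable with the wider date filter
]

def query(rows, min_date):
    # Hypothetical plan: date filter first, then the implicit int conversion.
    # With min_date='2023-08-01' the bad row is discarded before int() ever runs.
    out = []
    for complete_date, month in rows:
        if complete_date >= min_date:
            if int(month) == 202308:  # implicit conversion, like bs.Month = 202308
                out.append((complete_date, month))
    return out

print(query(rows, "2023-08-01"))  # fine: bad row filtered out before conversion
try:
    query(rows, "2023-07-01")     # bad row now reaches int() -> error
except ValueError as e:
    print("conversion failed:", e)
```

Quoting the literal sidesteps the whole question because a char-to-char comparison never converts anything.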
|
# ? Aug 16, 2023 17:54 |
|
I'm coming from a SQL ORM (Django) but diving into the actual SQL because things aren't working and I need to get better at this layer (Postgres). Here's my relationships: Here's what I'm trying to do:
The SQL looks like this: https://pastebin.com/hFUZ6DCZ (apologies for the pastebins, cloudflare really doesn't like me) And it works, I get <id, year, month, series_id, quantity> tuples where the quantity is correct. So now I want to do:
So I can do just the average, and it works: https://pastebin.com/JEnxqiqA When I combine these two I end up getting results, but the quantity there is some multiple of the correct value, while the averages are correct. Changing the SUM to SUM(DISTINCT ...) gets different but also wrong numbers. I feel like my SUM ... AS "quantity" is considering way more than the reference_date in the WHERE clause, but I'm unsure how. Here's my pastebin for this: https://pastebin.com/LZDDwWcQ Where should I be looking for the issue?

Hed fucked around with this message at 22:04 on Aug 18, 2023 |
# ? Aug 18, 2023 21:48 |
|
Yeah that's some Django generated stuff alright. Been there, been annoyed, wrote the raw SQL instead because it was so hard to get it to do what I wanted. The first query does what you want but in an extremely dumb way. It's doing the same calculation twice. First it's got the main query that finds the commodityfuturescontracts rows, then it's got a subquery that calculates the quantity for each reconciliation_positionfilefuturesentry associated with that contract. The thing is though that the subquery is returning exactly the same resultset as the outer query. Here's that first query with some comments: SQL code:
SQL code:
Moving on, the following two queries are mostly nonsense. I'm not going to attempt to correct them; they're kinda in the "not even wrong" territory, they're just not doing anything like what you actually want. You can't get Django to spit out a query that does what you want with any reasonable amount of effort. You'll basically want the above query plus some sort of subquery to calculate the average_volume of each contract, but Django isn't even remotely close to doing what you want. Doing a partial aggregate like the one you describe (for each row in the main resultset, find a limited number of correlated rows in a different table ordered by some criteria and aggregate over them) requires going deeper into the SQL weeds than Django really supports.

Basically the problem you have here is that you have two aggregation operations in the same query, but they're aggregating over completely different resultsets. The quantity comes from aggregating over the full set of reconciliation_positionfilefuturesentry rows related to a contract, but the average_volume comes from aggregating over a limited number of market_data_futuresdataentry rows. If you drag both of these correlated tables into the same resultset you'll get a lot more rows to aggregate over than you actually want.

There are a couple of ways to tackle this problem, but all of them involve splitting the query into parts and aggregating the parts separately. Here's a variant using a CTE: SQL code:
A possibly more elegant way is to remove the aggregation from the main query altogether and just rely on lateral subqueries for the summing instead. Need to be lateral subqueries because we're referencing columns from the outer query in the subqueries; can't do that without the LATERAL keyword. Would look something like this: SQL code:
I doubt you can get Django to spit out either of these two above, but perhaps you could - with sufficient .Annotate() shenanigans - get it to generate this: SQL code:
Maybe you could do it with some window function nonsense too but I'll have to stop here. Doesn't help you with the Django part either though.

BTW the trick to bypassing the cloudflare nonsense seems to be to make a newline between the select and the projection list. In the last query it complained about select avg(... until I made it select<newline>avg(.... I had to basically bisect the post to find the issue.

Oh, and one last note: your post sort of implies "last 5 entries" from market_data_futuresdataentry is equivalent to "last five days of data for this contract", so I exploited that in all of the queries above to avoid the LIMIT 5 and ordering stuff. If that isn't actually the case and you really want exactly five entries ordered by timestamp before a given date, then things get a bit more annoying and the lateral join variant is probably your best bet, except it won't quite work as written.

e: sorry, I just suck at explaining. I might try to straighten some of this out tomorrow

TheFluff fucked around with this message at 01:23 on Aug 20, 2023 |
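The SQL blocks above were lost in transit, but the core fan-out trap TheFluff describes - two aggregates over different child tables multiplying each other - can be reproduced in a few lines of sqlite3 (schema and numbers invented, not the original Django tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contract (id INTEGER);
    CREATE TABLE position (contract_id INTEGER, quantity INTEGER);
    CREATE TABLE market_data (contract_id INTEGER, volume INTEGER);
    INSERT INTO contract VALUES (1);
    INSERT INTO position VALUES (1, 10), (1, 20);                 -- true SUM = 30
    INSERT INTO market_data VALUES (1, 100), (1, 200), (1, 300);  -- true AVG = 200
""")

# Naive: joining both child tables fans out to 2*3 = 6 rows,
# so SUM(quantity) counts each position row three times.
naive = conn.execute("""
    SELECT SUM(p.quantity), AVG(m.volume)
    FROM contract c
    JOIN position p ON p.contract_id = c.id
    JOIN market_data m ON m.contract_id = c.id
    GROUP BY c.id
""").fetchone()
print(naive)  # (90, 200.0) -- quantity is 3x too big, average happens to survive

# Fix: aggregate each child table separately before joining.
fixed = conn.execute("""
    SELECT p.q, m.v
    FROM contract c
    JOIN (SELECT contract_id, SUM(quantity) AS q FROM position GROUP BY contract_id) p
      ON p.contract_id = c.id
    JOIN (SELECT contract_id, AVG(volume) AS v FROM market_data GROUP BY contract_id) m
      ON m.contract_id = c.id
""").fetchone()
print(fixed)  # (30, 200.0)
```

This is the "some multiple of the correct value" symptom from the question: the multiple is exactly the row count of the other joined child table.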
# ? Aug 19, 2023 23:44 |
|
I will occasionally tap into ChatGPT when I need my statements to bridge the gap between "clever" and "fancy", and did so this morning when working with some data exported from Salesforce. (My goal was to take each Account with a type of 'Household' from the Accounts file and populate, in-line, the primary household member (indicated in the one2oneContact field as per usual SF) and all associated members of the household identified by the [AccountId] in the Contacts file. The insert script for this particular item in our database is built around the expectation that we are getting data in the form of [HouseholdID][Member1][Member2][Member3] and so on - suffice to say it needs some serious reworking, but that's neither here nor there.)

Anyway, ChatGPT gave me a fine output. It might not be the most efficient way to do it, but it gives me the results I want. I've got to export the results to Excel and do a text-to-columns then re-import it, but it's still way less annoying than it could be. My question, ultimately, is with ChatGPT's followup 'notes' or whatever.

quote:In this query, the ContactConcatenation CTE uses the STUFF and FOR XML PATH('') technique to concatenate the secondary contact IDs into a single column.

What exactly is the "STUFF and FOR XML PATH('') technique" and what is it doing? I can provide the full output here, but the relevant code bit is this: code:
|
# ? Aug 21, 2023 21:45 |
|
TheFluff posted:Yeah that's some Django generated stuff alright. Been there, been annoyed, wrote the raw SQL instead because it was so hard to get it to do what I wanted.

holy crap, thank you! The lateral cross joins will take me a bit to fully understand, I haven't seen them before. The reduced queries (first couple in the reply) are straightforward and run faster, though. I'm at the point where I could do 2 queries and join it in Python if needed. But, I will take another look.

The only adaptation I need to make now is that it is the case I need "the last 5 entries of data". Apologies for the implication that I could do a date range thing, because I wouldn't have data for dates the market is closed for weekends/holidays. That's why I was doing LIMIT 5. However, the Python that calls this could pretty easily have the market calendars in memory, guaranteeing I could do a BETWEEN with two actual dates that we know should cover 5 entries. That's probably the easier way to do it, since the "get the 5 entries with reference day less-than-or-equal, sorted by time" involves a lot more join machinery.
|
# ? Aug 21, 2023 22:45 |
|
Lib and let die posted:I will occasionally tap into ChatGPT when I need my statements to bridge the gap between "clever" and "fancy"...

It's a way to work around the lack of a GROUP_CONCAT function (returning the rows of a query result stitched together in one row) in older versions of MS SQL Server. FOR XML in SQL Server allows you to return the results of a query as XML elements, i.e. text that is marked up with XML tags. Adding the PATH('') modifier removes the tags and replaces them with nothing, so you have all the rows concatted together.

You want a comma-separated list of values, so each value in the subquery begins with a comma and a space (the SELECT ', ' + part). But you don't want the result to begin with a comma and a space, so the STUFF function is just taking 2 characters of the string, starting from position 1, and replacing them with nothing (the 1, 2, '' at the end are the first position, 2 characters, and a blank string (nothing) respectively).

It's just an ugly hack to work around the lack of a purpose-built function that does that in SQL Server. 2017 and later have the STRING_AGG function, which does what you want in a more intuitive and readable way. You may or may not be using this, but the way ChatGPT has done it is the 'classic' way that will be all over the internet, hence why it recommended it
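The string mechanics of that technique can be mimicked in a few lines of plain Python, no SQL Server required (values invented):

```python
# FOR XML PATH('') effectively concatenates every row's ', ' + value:
values = ["alice", "bob", "carol"]
concatenated = "".join(", " + v for v in values)
print(repr(concatenated))  # ', alice, bob, carol' -- note the leading ', '

# STUFF(s, 1, 2, '') replaces 2 characters starting at position 1 with nothing:
def stuff(s, start, length, replacement):
    # SQL Server string positions are 1-based, hence the -1.
    return s[:start - 1] + replacement + s[start - 1 + length:]

print(stuff(concatenated, 1, 2, ""))  # 'alice, bob, carol'

# STRING_AGG (SQL Server 2017+) is the sane one-call equivalent,
# much like Python's own join:
print(", ".join(values))  # 'alice, bob, carol'
```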
|
# ? Aug 21, 2023 23:44 |
|
And there's the problem with ChatGPT: I can use STRING_AGG lmfao
|
# ? Aug 21, 2023 23:47 |
|
Lib and let die posted:And there's the problem with ChatGPT: I can use STRING_AGG lmfao It's a great resource, it just behoves you to sanity check what it tells you. Which is what you're doing, so sounds like you're using it correctly
|
# ? Aug 21, 2023 23:58 |
|
Lib and let die posted:And there's the problem with ChatGPT: I can use STRING_AGG lmfao then you tell it that and to use STRING_AGG and it will
|
# ? Aug 22, 2023 00:04 |
|
Yeah I could have really improved my prompt there in retrospect.
|
# ? Aug 22, 2023 00:07 |
|
e: I'm an idiot and overcomplicating a simple problem LOL
Generic Monk fucked around with this message at 17:35 on Aug 24, 2023 |
# ? Aug 24, 2023 17:31 |
|
Hey, gang! I've been smashing my head against this code for the last week and am finally at the end of it. I'm taking an online course and unfortunately, if I need any help, I have to either figure it out myself, e-mail my course instructor and wait 12 days for a response, or come begging on these forums for some guidance. The only thing not working at this point is a trigger that needs to fire but does not seem to be. The goals in entirety (so as to perhaps explain some parts of the code) are:

1. Find the top ten most popular districts in a DVD rental database
2. In those districts, find the most popular genre of films being rented
3. Create a detail table that lists the district, genre and genre count
4. Create a summary table that lists the district, genre, and a ranking of the top three genres in each district
5. Create a trigger that will update summary_table whenever detail_table is updated

Everything works but step number 5. I've run this through ChatGPT and it says everything looks fine. I can see the trigger is there in postgres, and the code produces no errors. But when I update the detail_table (by changing the number of rented action films to 0), the summary_table does not change. Here is my code... https://pastebin.com/3JpQ5RYh Thank you for any advice :|

credburn fucked around with this message at 02:01 on Aug 26, 2023 |
# ? Aug 26, 2023 00:44 |
|
credburn posted:Hey, gang! i don't speak postgres, but from a quick google id expect there to be a CREATE TRIGGER statement somewhere. that is, a statement to wire up that function to an actual trigger on the table
|
# ? Aug 26, 2023 01:01 |
|
redleader posted:i don't speak postgres, but from a quick google id expect there to be a CREATE TRIGGER statement somewhere. that is, a statement to wire up that function to an actual trigger on the table Whoops; yes, you're right. I accidentally omitted that in the example but it's been updated. It still doesn't appear to update the summary_table
|
# ? Aug 26, 2023 02:02 |
|
credburn posted:Whoops; yes, you're right. I accidentally omitted that in the example but it's been updated. It still doesn't appear to update the summary_table To be sure, you did this right? code:
code:
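For comparison, here is a minimal, self-contained trigger that does fire on UPDATE, using sqlite3 for brevity (Postgres needs the extra step of a trigger function declared RETURNS TRIGGER, plus a CREATE TRIGGER ... EXECUTE FUNCTION statement wiring it to the table; all names here are invented, not taken from the course code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE detail_table (district TEXT, genre TEXT, genre_count INTEGER);
    CREATE TABLE summary_table (district TEXT, genre TEXT, genre_count INTEGER);

    INSERT INTO detail_table VALUES ('Alberta', 'Action', 50);

    -- Rebuild the summary whenever a detail row changes.
    CREATE TRIGGER refresh_summary AFTER UPDATE ON detail_table
    BEGIN
        DELETE FROM summary_table;
        INSERT INTO summary_table SELECT * FROM detail_table;
    END;
""")

conn.execute("UPDATE detail_table SET genre_count = 0 WHERE genre = 'Action'")
print(conn.execute("SELECT * FROM summary_table").fetchall())
# [('Alberta', 'Action', 0)] -- the trigger fired and refreshed the summary
```

If a setup like this works but the real one doesn't, the usual suspects are the trigger being attached to a different table than the one actually updated, or the CREATE TRIGGER never having been run in that database.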
|
# ? Aug 26, 2023 04:04 |
|
Two questions regarding SQL Server. First, is Ola Hallengren's "IndexOptimize" job the same thing as rebuilding indexes and updating statistics scripts or do I still need to run these? Second, I'm migrating from one SQL Server to another. Actually, I've finished migrating everything, settings, databases, logins, etc. with the exception of SSIS. Does anyone know the best way to migrate the SSIS catalog and all the jobs and poo poo that go with it? Do I need to manually open up the SSIS packages to update connections with the new server name?
|
# ? Sep 12, 2023 16:12 |
|
Stupid question ahoy. If I have a simple soft delete query, i.e.SQL code:
SQL code:
SQL code:
Currently targeting SQLite, open to learn for other vendor dialects. The obvious answer is to not bother making the distinction, but it's nice to see if it is possible and practical. MrMoo fucked around with this message at 04:11 on Oct 7, 2023 |
# ? Oct 7, 2023 03:59 |
|
does your sqlite wrapper expose one of the sqlite3_changes functions? looks like another option might be the count_changes pragma, although this is deprecated - by what, i'm not sure. it definitely seems like a nicer way to get the affected row count than sqlite3_changes
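Python's sqlite3 wrapper does expose this: Cursor.rowcount after an UPDATE reflects the sqlite3_changes() count. A small sketch of the guarded soft delete (schema invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, is_deleted INTEGER DEFAULT 0)")
conn.execute("INSERT INTO t (id) VALUES (1)")

def soft_delete(conn, row_id):
    # Guard so an already-deleted row is not touched again.
    cur = conn.execute(
        "UPDATE t SET is_deleted = 1 WHERE id = ? AND is_deleted = 0", (row_id,)
    )
    return cur.rowcount  # affected-row count, i.e. sqlite3_changes()

print(soft_delete(conn, 1))   # 1 -- the row was just soft-deleted
print(soft_delete(conn, 1))   # 0 -- already deleted
print(soft_delete(conn, 99))  # 0 -- no such row (same count as above, alas)
```

As the last two calls show, the count alone can't distinguish "already deleted" from "never existed", which is the crux of the original question.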
|
# ? Oct 7, 2023 04:37 |
|
The problem appears to be more about reading a column value before an UPDATE is applied. With Postgres this is possible, but Sqlite likes to only return the updated value. https://stackoverflow.com/questions/7923237/return-pre-update-column-values-using-sql-only i.e. the following is fine in PGSQL: SQL code:
|
# ? Oct 7, 2023 04:45 |
|
oooh, yeah, i get it now. have you tried the solution in the accepted answer on that question? i think that could work for you, if sqlite allows it. the rowset will be empty if the id doesn't exist, and you'd get the old value of is_deleted to see if it had been previously deleted or notcode:
you'll no doubt need to tweak the syntax, because sql is a vibe and not a standard
|
# ? Oct 7, 2023 05:14 |
|
Alas, sqlite doesn't like that for some reason, need to check why: https://www.sqlite.org/lang_update.html SQL code:
Raises "no such column: y.name" 🤷‍♀️ Doesn't seem to understand table alias in the RETURNING clause. There are three conditions:
- the row does not exist
- the row exists but was already soft-deleted
- the row exists and has just been soft-deleted
I think that just leaves this: SQL code:
SQL code:
quote:The first prototype of the RETURNING clause returned values as they were generated. That approach used less memory, but it had other problems: MrMoo fucked around with this message at 18:28 on Oct 7, 2023 |
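Given that restriction, one fallback is to give up on the single-statement approach and distinguish the three cases with a short read-then-write sequence. A sqlite3 sketch (helper name invented; in a concurrent setting you'd want this wrapped in BEGIN IMMEDIATE so the read and write are atomic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, is_deleted INTEGER DEFAULT 0)")
conn.execute("INSERT INTO t (id) VALUES (1)")

def soft_delete(conn, row_id):
    """Return 'missing', 'already_deleted', or 'deleted'."""
    row = conn.execute("SELECT is_deleted FROM t WHERE id = ?", (row_id,)).fetchone()
    if row is None:
        return "missing"
    if row[0]:
        return "already_deleted"
    conn.execute("UPDATE t SET is_deleted = 1 WHERE id = ?", (row_id,))
    return "deleted"

print(soft_delete(conn, 1))   # 'deleted'
print(soft_delete(conn, 1))   # 'already_deleted'
print(soft_delete(conn, 42))  # 'missing'
```

Two statements instead of one, but it cleanly separates all three outcomes without needing RETURNING to see the pre-update value.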
# ? Oct 7, 2023 17:50 |
The page you're quoting has one more thing at the very end:

quote:7. The RETURNING clause may only reference the table being modified. In an UPDATE FROM statement, the auxiliary tables named in the FROM clause may not participate in the RETURNING clause.

Edit: In MS SQL it's easier, with their proprietary OUTPUT clause. You can use the pseudo-table names "deleted" and "inserted" in that to return values from the pre-update respectively post-update row.

nielsm fucked around with this message at 18:37 on Oct 7, 2023 |
|
# ? Oct 7, 2023 18:32 |
|
God, I wish MERGE in Postgres supported RETURNING. I've got this awful mess of like 5 INSERT/UPDATE/SELECT CTEs (because of two separate unique indexes that I can't change for unrelated reasons, so I can't just ON CONFLICT) that I could have replaced with a single MERGE, except that I need the returned ID in the insert/upsert case.
Roadie fucked around with this message at 18:11 on Oct 24, 2023 |
# ? Oct 23, 2023 23:13 |
|
Roadie posted:God, I wish MERGE in Postgres supported RETURNING. I've got this awful mess of like 5 INSERT/UPDATE/SELECT CTEs (because of two separate unique indexes that I can't change for unrelated reasons, so I can't just ON CONFLICT) that I could have replaced with a single MERGE, except that I need the returned ID in the insert/upsert case.

*I can't read, ignore me.
|
# ? Oct 24, 2023 01:22 |
|
A few design/best practices questions:

1. Do you name your tables using singular or plural nouns? E.g. Customer or Customers.
2. Do you use snake case, camel case, or another case to name columns? E.g. first_name or FirstName?
3. Do you prefix a table's id column name with the table name or just leave it as id? E.g. id or customer_id

I've always done plural, snake case, and id respectively, but apparently the new team I have joined has used the alternatives. If it matters, the DB is the backend to an API written in C# with Entity Framework in a DB-first workflow. They have said this drives some of their opinions, but these opinions are all very weird to me. I guess I'm wondering if, aside from personal preference, there is any reason to go with one vs. the other in the above items.
|
# ? Oct 25, 2023 23:59 |
|
None of those things matters as long as you are consistent
|
# ? Oct 26, 2023 00:01 |
|
|
|
in postgres it is common to use snake_case instead of camelCase, because postgres does automatic case folding of all identifiers to lowercase. you declare the column fooBar, the actual column created is foobar. feel free to call it fooBar in all your queries, postgres will just silently convert that to foobar and it works.

if you really want to name the column fooBar with the caps, you can use double quotes to disable case folding: "fooBar". great, you've created a camelCase column. but now you must refer to it as "fooBar" each time, with the quotes, because plain fooBar gets case folded to foobar, and that's not the name of the column.

and since in postgres we don't want to sprinkle quotes all over the place, we use snake_case. (microsofties seem to have no problem with the [mssql].[nonsense]...)
|
# ? Oct 26, 2023 02:45 |