Vegetable
Oct 22, 2010

Quick question:

orders has columns buyer_id, seller_id, order_id, total_buyer_orders.

total_buyer_orders is the number of orders ever created by the buyer.

I’m trying to get a list of sellers whose orders comprise at least 80% of any buyer’s total_buyer_orders.

SELECT buyer_id, seller_id
FROM bookings
HAVING count(booking_id)/total_buyer_orders >= 0.8

Does this work? Specifically, is it okay to combine two columns in a HAVING clause like that?


Moonwolf
Jun 29, 2004

Flee from th' terrifyin' evil of "NHS"!


Vegetable posted:

Quick question:

orders has columns buyer_id, seller_id, order_id, total_buyer_orders.

total_buyer_orders is the number of orders ever created by the buyer.

I’m trying to get a list of sellers whose orders comprise at least 80% of any buyer’s total_buyer_orders.

SELECT buyer_id, seller_id
FROM bookings
HAVING count(booking_id)/total_buyer_orders >= 0.8

Does this work? Specifically, is it okay to combine two columns in a HAVING clause like that?

That should work, HAVING is just like WHERE but you can also interrogate the output of aggregates, like you are doing. You could easily make that as:
SELECT * FROM (
SELECT buyer_id, seller_id, count(booking_id)/total_buyer_orders perc
FROM bookings
)
WHERE perc >= 0.8

which is what HAVING lets you avoid in cleaner syntax.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


You are missing a group by expression
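
For what it's worth, a minimal sketch of the full query with the GROUP BY added (keeping the bookings/booking_id names from the posted query even though the prose calls the table orders; the * 1.0 is there because on some engines the division would otherwise happen in integers):
code:
SELECT   buyer_id,
         seller_id
FROM     bookings
GROUP BY buyer_id, seller_id, total_buyer_orders    -- constant per buyer, so grouping on it is safe
HAVING   COUNT(booking_id) * 1.0 / total_buyer_orders >= 0.8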

D34THROW
Jan 29, 2012

RETAIL RETAIL LISTEN TO ME BITCH ABOUT RETAIL
:rant:

redleader posted:

microsoft dynamics crm? that's backed by a sql server db, almost certainly on-prem. i happen to know this because i used ms dynamics for task tracking and found it vastly easier to query the db (once i'd reverse-engineered the schema and how we used it) directly rather than using the front-end

i wouldn't suggest updating that db directly though, because there's a fair chance doing that could lead to your support contract or w/e being voided

I figured it was SQL-based, and I know it's on-prem because the network shares all point to the same IP under different server names. In this instance, however, I need tables that the backend will never have, so I'm forced to build those in Access; I'd just rather use SQL tables than Access tables.

How did you reverse engineer it? I've been trying to figure that out and I can't find anything worth a poo poo on Google for querying the DB. Specifically, what I want to be able to do is put in a given job number and have it backtrack through the DB and pull out the fields I need, rather than having to dump to Excel and extract the data from those dumps. I want to be able to do this from both Excel and Access. It would make so many things so much easier.

Also, does it REALLY duplicate data as much as it seems to, rather than doing something similar to a DLookup() in Access?

Hughmoris
Apr 21, 2007
Let's go to the abyss!
I might have an opportunity to take a BI/data warehouse job. I can write basic SQL and I've created data viz dashboards for leadership, both of which I enjoy. The data warehouse part will be new to me: stuff like ETL, SSIS, etc.

For those that do this type of work, what are your thoughts? Do you enjoy it? Is it a snooze fest? Does being an administrator for a data warehouse and ETL stuff have a lot of variability and things to learn?

Vegetable
Oct 22, 2010

Moonwolf posted:

That should work, HAVING is just like WHERE but you can also interrogate the output of aggregates, like you are doing. You could easily make that as:
SELECT * FROM (
SELECT buyer_id, seller_id, count(booking_id)/total_buyer_orders perc
FROM bookings
)
WHERE perc >= 0.8

which is what HAVING lets you avoid in cleaner syntax.

Ruggan posted:

You are missing a group by expression
Thanks guys, the query worked out nicely!

I also have a similar question as the above poster. Currently I work more with the business side of things as a data analyst. But there’s an opportunity for me to explore more ETL work and become more of a data engineer/administrator. Is it a lot of just making sure things don’t break? What’s exciting? How much room is there for growth? What do exit opportunities look like?

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Hughmoris posted:

I might have an opportunity to take a BI/data warehouse job. I can write basic SQL and I've created data viz dashboards for leadership, both of which I enjoy. The data warehouse part will be new to me: stuff like ETL, SSIS, etc.

For those that do this type of work, what are your thoughts? Do you enjoy it? Is it a snooze fest? Does being an administrator for a data warehouse and ETL stuff have a lot of variability and things to learn?

You’re gonna need more than basic SQL skills, although you can probably pick that up in the role. It’s more technical - data viz is report writer territory, which isn’t compensated nearly as well. There’s a lot of opportunity to design graceful, performant queries and there’s a fair bit to learn about writing SQL to get there. If you do it well, your job will be relatively low stress, as long as your nightly loads succeed or you build error-resistant pipelines.

I do a mix of ETL, data viz, full stack dev, and process improvement work at my job. I like ETL more than the reporting by a long shot.

In terms of advancement opportunities, hard to say. You’re definitely less visible to leadership in that type of role because by nature it’s more behind the scenes. There are plenty of Data Warehouse jobs out there at other companies though and last time I looked they compensated in the 70-120k range depending on experience and skill set. So even if you can’t move up you can definitely move laterally to another job or to consulting pretty easily - and I have a few ex coworkers who have done exactly that.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Tired of fielding stupid report writing requests? Become a data warehouse admin!

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Ruggan posted:

You’re gonna need more than basic SQL skills, although you can probably pick that up in the role. It’s more technical - data viz is report writer territory, which isn’t compensated nearly as well. There’s a lot of opportunity to design graceful, performant queries and there’s a fair bit to learn about writing SQL to get there. If you do it well, your job will be relatively low stress, as long as your nightly loads succeed or you build error-resistant pipelines.

I do a mix of ETL, data viz, full stack dev, and process improvement work at my job. I like ETL more than the reporting by a long shot.

In terms of advancement opportunities, hard to say. You’re definitely less visible to leadership in that type of role because by nature it’s more behind the scenes. There are plenty of Data Warehouse jobs out there at other companies though and last time I looked they compensated in the 70-120k range depending on experience and skill set. So even if you can’t move up you can definitely move laterally to another job or to consulting pretty easily - and I have a few ex coworkers who have done exactly that.

Thanks. I'd be joining a good team that knows my current skill set, and are willing to teach the things that I'm lacking.

And your current role sounds like where I'd like to get to. I currently work in clinical informatics (focusing more on the clinical aspect), and this would be an opportunity to learn more technical skills and tools.


*I know this varies by company but a BI/Data Warehouse admin seems like a job that could be well suited to some WFH days or completely remote, right? I'm tired of being in the office 5 days a week.

Hughmoris fucked around with this message at 17:08 on Mar 4, 2020

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Perfectly suited to remote and/or atypical schedule.

As long as your org is supportive of that, culturally speaking.

Trabant
Nov 26, 2011

All systems nominal.
Hello thread. I know just enough SQL to write basic statements (learned by cannibalizing other people's code) and this is beyond me. Hoping you can help me out, although I fully acknowledge this is less "quick question" and more "actual problem". I just don't have a lot of confidence in the contractors who are trying to solve this...

I have a table organized like this:

code:
Person      Person's Manager      Person's Manager's Manager
John        Bob                   Jane
Bob         Jane                  Frank
Frank       Frank                 Frank 
....
In other words, it's a reporting structure, but it goes only two managers up from each employee. So John reports to Bob who reports to Jane who reports to Frank. Once you reach Frank (the CEO) it just repeats - the CEO reports to himself reports to himself etc. Just like in real life!

Note that you have different levels of hierarchy going from lowest to highest: John - Bob - Jane - Frank is one that has 4 levels, but you also have something like:

code:
Person      Person's Manager      Person's Manager's Manager
Mary        Amy                   Frank
Amy         Frank                 Frank
....
Mary - Amy - Frank is a hierarchy of different "depth" and it has 3 levels, even though Mary is an individual contributor who doesn't have any reports. In my actual data, we can have up to 12 levels :downs:

Now...

What I'd like to do is get a table that looks like this:

code:
Person       L1 Manager     L2 Manager    L3 Manager
John         Frank          Jane          Bob
Bob          Frank          Jane
Jane         Frank                 
Mary         Frank          Amy
What this does is normalize everyone to the same management levels -- rather than going up from an individual contributor to the CEO, the labels go from the CEO down.

In order to do this, I'd have to:

1) Create a WHILE loop that takes each employee and builds out a chain from them until the chain reaches Frank,
2) Recognize that the chain has reached Frank (because the "manager" starts repeating), and
3) Relabel accordingly, using the Lx convention starting with Frank as L1.

Does that sound right? Or is it better to nix the WHILE loop and instead use sub-queries (that go 12 deep...) and build out the dataset that way?

I've built out the resulting hierarchy manually before (using mostly JMP and Excel), so I know it can be done. Refreshing it is a huge pain though, which is why I want this in code instead.

Just-In-Timeberlake
Aug 18, 2003

At first glance this looks like a job for a CTE and recursion

cybertier
May 2, 2013

Just-In-Timeberlake posted:

At first glance this looks like a job for a CTE and recursion

Agreed. Though if you would actually want those column headers you'd need a form of dynamic pivot, which not all DBMS can do.

Trabant
Nov 26, 2011

All systems nominal.
Thank you for the search terms! I did a quick search and yes, it looks like that's what is needed.

It's also even further beyond my SQL skills, so I'm either going to have to find better contractors (lol @ doing that in under a month) or... gently caress, figure it out myself.

hardstyle
Mar 12, 2004

"...Neither seen nor heard"


SQL Server has a hierarchyid type that would help with this if you have it available.
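
If you do have SQL Server and can add a column, a rough sketch of what that might look like (table and column names invented, not Trabant's actual schema):
code:
-- hierarchyid stores the path from the root, so levels come for free
CREATE TABLE dbo.EmployeeTree (
    person  varchar(50)  NOT NULL,
    node    hierarchyid  NOT NULL,        -- e.g. '/1/3/2/' = CEO -> Jane -> Bob
    lvl     AS node.GetLevel()            -- depth, usable for the L1/L2/L3 labels
);

-- everyone at or under Jane, at any depth:
DECLARE @boss hierarchyid = (SELECT node FROM dbo.EmployeeTree WHERE person = 'Jane');
SELECT person, lvl
FROM   dbo.EmployeeTree
WHERE  node.IsDescendantOf(@boss) = 1;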

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
Not exactly an answer to the problem at hand, but if you want to do tree structures in SQL without recursive CTEs, you can use the Modified Preorder Tree Traversal (MPTT) algorithm, which makes querying really fast at the expense of making inserts and node rearrangements potentially quite expensive. Implementing it yourself might seem scary, but if you're using an ORM or a query builder there's a decent chance it has an extension or plugin that implements it.

On the other hand, these days it's also possible your database has something built in that can help with tree structures - SQL Server's hierarchyid has already been mentioned, and Postgres has ltree.

TheFluff fucked around with this message at 18:03 on Mar 6, 2020
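
For the Postgres side of that, a minimal ltree sketch (made-up table, not the thread's data):
code:
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE org (
    person text PRIMARY KEY,
    path   ltree NOT NULL          -- e.g. 'frank.jane.bob.john' = chain from the CEO down
);
CREATE INDEX org_path_gist ON org USING gist (path);

-- everyone at or under Jane, at any depth (no recursion needed):
SELECT person FROM org WHERE path <@ 'frank.jane';

-- management level, with the CEO as level 1:
SELECT person, nlevel(path) AS level FROM org;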

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


code:
-- lets seed a sample table with some hierarchical garbo
drop table if exists #peeps
go
;

create table #peeps (person varchar(50), mgr varchar(50), mgrmgr varchar(50))
insert into #peeps values
('b', 'h', 'o'),
('h', 'o', 'l'),
('o', 'l', 'e'),
('l', 'e', 'e'),
('e', 'e', 'e'),
('f', 'e', 'e'),
('y', 'b', 'h'),
('m', 'y', 'b')
go
;

-- we dont give a gently caress about mgrmgr, that is a stupid datastructure
alter table #peeps drop column mgrmgr
go
;

-- ok, let's write a cte or whatever that recurses
-- lets... uhh... stuff it into a single path of people so we don't need dynamic columns... sound good?
-- maybe do something clever with it in the front end
with ctePeeps as (
    --top level
    select      p.person, p.mgr, 1 as level, cast(p.person as varchar(max)) as path
    from        #peeps p
    where       p.person = p.mgr
    UNION ALL   
    --recurse down
    select      p.person, p.mgr, p2.level + 1 as level, cast(concat(p.person, ' - ', path) as varchar(max)) as path
    FROM        #peeps p
    inner join  ctePeeps p2
    on          p.mgr = p2.person
    and         p.person != p.mgr
)

--heres ur result
select * from ctePeeps
result set

code:
person  mgr    level   path
------- ------ ------- ----------------------------
e       e      1       e
l       e      2       l - e
f       e      2       f - e
o       l      3       o - l - e
h       o      4       h - o - l - e
b       h      5       b - h - o - l - e
y       b      6       y - b - h - o - l - e
m       y      7       m - y - b - h - o - l - e
im a professional

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Ruggan posted:

im a professional

:golfclap:

kumba
Nov 8, 2003

I posted my food for USPOL Thanksgiving!

enjoy the ride

Lipstick Apathy

Ruggan posted:

im a professional poet

ftfy

Trabant
Nov 26, 2011

All systems nominal.

Ruggan posted:

im a professional

:aaaaa:

What you are is an amazing person -- thank you! I ended up putting another CTE around your structure (to parse the path into columns, because STRING_SPLIT isn't available on our 12.0 server) and did a number of other hacked-together, probably inadvisable things.

But the code runs! And it's accurate! And I don't have to rely on the contractors!

Thank you all for your suggestions and help. I kinda-sorta learned how to do recursive CTE things and XML parsing today thanks to Ruggan and Google. It's the most accomplished I felt at work in a long time.
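
In case anyone wants the gist of that XML parsing trick, a sketch (assumes the path was built CEO-first, i.e. concat(path, ' - ', p.person) in the recursive member, so position 1 is the top of the chain; it breaks if names contain characters like & or <):
code:
-- sketch only: split the delimited path into level columns on a server without STRING_SPLIT
-- (#chains here stands for the output of the recursive CTE, dumped into a temp table first)
WITH split AS (
    SELECT person,
           CAST('<x>' + REPLACE(path, ' - ', '</x><x>') + '</x>' AS xml) AS px
    FROM   #chains
)
SELECT person,
       px.value('(/x)[1]', 'varchar(50)') AS L1Manager,
       px.value('(/x)[2]', 'varchar(50)') AS L2Manager,
       px.value('(/x)[3]', 'varchar(50)') AS L3Manager
FROM   split;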

Gatac
Apr 22, 2008

Fifty Cent's next biopic.
This may not be precisely the thread for it, but I know you guys are smart and cool people so I figured I'd ask here first. :)

I've been (very freshly) brought onto a product where we have an Oracle-based DB that has to somehow communicate with a variety of other products via web services. The rough outline of what we import and export is similar (receive specialized technical tasks, report results), so the actual data is about 70% the same, but each product has its own message spec. The current implementation is basically canonical How Not To PL/SQL, both in terms of the overall process and how it's coded, so one way or another we're gonna have to replace it. What I'm struggling with is how to replace it intelligently. The current process for importing, f'rex, goes something like this:

1) Scheduled job makes a new entry in a "task" table containing all the various attempts to get data for that particular interface and sends an HTTP call to a local web service to get crackin'. (Yes, we basically hacked together our own scheduler.)
2) Local web service communicates with remote web service, gets the raw data of new task(s) (as XML/JSON/whatever good idea the externals had), decodes it, then attempts to write that data directly into local "buffer" tables mirroring the structure of the incoming data. This is the part that really squicks me out, btw: direct SQL execution on the DB with write access from an external user is no bueno imo, but it is what it is right now. Each interface has its own set of buffer tables, naturally.
3) Another "scheduler" task checks for new data in those buffer tables, denormalizes the whole shebang (again, by methods specific to each set of buffer tables) and writes it into one enormous, flat "staging" table on our product's internal schema.
4) Yet another scheduled task then takes these staging entries and transforms/normalizes them into the actual internal data model used by our product, at which point it can do whatever it needs to do with the task and I wash my hands of it.

So, I didn't want to be one of Those Guys who develops strong opinions on How To Redo Everything in week 1, but as far as I can tell, nobody's happy with the product as it currently stands and everyone is looking to me for ideas on how to do it better. Well, sure, I guess I can look into proposing a different way. Now, here are my main issues as I conceptualize:

1) Error handling. What if a web service call errors out? What if the data that comes back is garbage? An argument for the current method is that as data worms its way through, there are tables for storing it at every step so it's possible to figure out what went wrong in intermediate steps. My personal observation is that tracing any errors in this thing is a mess and we'd do better to have a strong validation method and some sort of error/reject feedback mechanism to the external products, but I can see the point that we may not be able to automatically determine and intelligently describe the nature of the error if we can't get a look at the raw-ish incoming data after the task errors out, which would require some sort of intermediate storage. Or that we might be forced to accept all incoming data and then beat it into shape rather than be able to just reject a malformed task.

2) Data transformation. Currently, basically every interface does this its own way because each process just got hacked together by whoever happened to be working on it when it was needed. I'd like to replace this with *one* process that is then configurable for each external product. Ideally, it'd be a solution that supports both importing and exporting data. To that end, I've begun to look into ETL tools. My boss thinks Talend could do the trick, f'rex, but I'm worried about a) how complex of a transformation is possible there and b) how difficult it is to set up and also grok the configuration there. My prime concern is readability/maintainability, as we're largely in this mess because people came and went on this project and left behind code/table structures that nobody else felt comfortable working on. Documentation is, as I'm sure you've guessed, nonexistent. We're not quite at Adeptus Mechanicus levels of blind prayer to the machine spirit, but it's a serious issue and anything I do, I'd like it to leave the project in better shape than I found it.

3) Change control. This is more of a general problem with how things are done, but when I say there's no documentation, I'm also saying there are no tickets documenting changes, there's no SVN use for the PL/SQL, there's no automated testing, there's no testing/staging system, there's not even a clear change request process other than people popping into the office and dropping half-baked "It needs to do X by date Y" poo poo into our laps, which is then coded directly on the production DB, after which we continue to play whack-a-mole with process corrections and bug fixes until the user complaints stop. I'm very aware that this is an organizational problem at its core, but basically, where the gently caress do I *start* with making it better? Not looking for concrete "Do A, B,C" advice so much as any experience you guys may have had dealing with similar issues and what you did to improve them.

nem
Jan 4, 2003

panel.dev
apnscp: cPanel evolved

Ruggan posted:

im a professional

Nicely done. Working this out on my own would've given me CTE. Saving this snippet for recursion down the road.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
If I land this Data Warehouse BI job (will hopefully hear this week), I'll definitely need to improve my SQL chops. One way I was thinking of doing that was by reviewing other people's queries. My thought is to find a way to identify the most commonly executed queries, find those that eat up the most resources or time, and then review them to see if I can rewrite them for better performance.

Maybe I've been watching too many Brent Ozar YT videos but it seems like it could be a useful way to increase my SQL knowledge and possibly add value. Either way I'm excited to get started.
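
One standard way to surface those on SQL Server is the plan cache DMVs - a rough sketch (nothing shop-specific, just sys.dm_exec_query_stats plus the statement text):
code:
-- top cached statements by total CPU
SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time / qs.execution_count  AS avg_cpu_microseconds,
       qs.total_elapsed_time / qs.execution_count AS avg_duration_microseconds,
       SUBSTRING(st.text,
                 (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset END
                   - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM   sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;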

PhantomOfTheCopier
Aug 13, 2008

Pikabooze!
Don't ask why they've used SELECT * everywhere instead of proper joins. They'll just get angry if you point out that transferring all that data to ultimately filter and join it in the app is a waste of time and resources.

Vegetable
Oct 22, 2010

In my interview today for a data analyst position, I was asked how I improve the efficiency of my queries.

I actually make no conscious attempt to improve efficiency, but I enumerated these things:

- Subqueries before joining: Avoid joining big tables to each other
- Grouping sets: Using functions like ROLLUP and GROUPING SETS instead of UNION-ing a bunch of different queries
- Window functions: I just assume these are more efficient than the alternative
- Integers rather than strings where possible: If you ever need to create a column to tag rows based on certain conditions, use integers rather than strings cuz they’re like smaller.

Not sure if these are universally true. But I’ve faced these efficiency questions a couple of times.

Would be super cool if someone could share some general tips on efficiency. I’ve read about how using temporary tables can help, but I don’t recall any of it.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Vegetable posted:

In my interview today for a data analyst position, I was asked how I improve the efficiency of my queries.

I actually make no conscious attempt to improve efficiency, but I enumerated these things:

- Subqueries before joining: Avoid joining big tables to each other
- Grouping sets: Using functions like ROLLUP and GROUPING SETS instead of UNION-ing a bunch of different queries
- Window functions: I just assume these are more efficient than the alternative
- Integers rather than strings where possible: If you ever need to create a column to tag rows based on certain conditions, use integers rather than strings cuz they’re like smaller.

Not sure if these are universally true. But I’ve faced these efficiency questions a couple of times.

Would be super cool if someone could share some general tips on efficiency. I’ve read about how using temporary tables can help, but I don’t recall any of it.

You should always write your code with an eye to efficiency, but preemptive over-optimization is bad.

Regarding your notes, I guess it depends on the db you're on, but for MS SQL:
- Subqueries before joins doesn't make a difference if it compiles to the same query plan. This only makes sense to me when you're doing some complex logic in the subquery pre-join.
- Window functions are actually less efficient than a simple group-by in many situations (e.g. FIRST_VALUE vs join to self-subquery).

Temp tables are good in many situations because they are essentially a way to pre-cache a partial dataset. This gives the optimizer the benefit of improved estimates and table statistics for following code. I often use temp tables in ETL processes in lieu of a staging table, when I know my dataset isn't huge. Do not use temp tables in a high-volume transactional database - it won't work anywhere near as well (I learned this the hard way).
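
For example, a temp-table pre-cache looks something like this (table names invented):
code:
-- materialize the small filtered set once so later joins get real row counts and statistics
SELECT  o.order_id, o.customer_id, o.total
INTO    #recent_orders
FROM    dbo.Orders o
WHERE   o.order_date >= DATEADD(DAY, -7, GETDATE());

CREATE CLUSTERED INDEX cx_recent_orders ON #recent_orders (customer_id);

SELECT  c.customer_name, r.order_id, r.total
FROM    #recent_orders r
JOIN    dbo.Customers c ON c.customer_id = r.customer_id;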

In general the typical advice applies: don't fetch more data than you need, and take advantage of indexes for joins and where clauses wherever possible.

Some other tips
- don't use calculated values in joins and where clauses unless you can't avoid it as you lose the benefit of indexes in many cases (e.g. ON COALESCE(a, b) = c)
- avoid function calls in joins and where clauses in favor of the longer but more efficient form (e.g. use WHERE (a = 2 OR a IS NULL) instead of WHERE COALESCE(a, 2) = 2; see the sketch below)
- if you're using a cursor or doing any programmatic logic (IF ... ELSE) you're probably doing something wrong unless you really absolutely know what you're doing
- don't APPLY functions unless you absolutely know what you're doing, as these run on every row of your dataset (they aren't set based operations) - there is often a better way to approach whatever problem you have
- avoid RIGHT JOINs for readability
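
The sketch for that COALESCE point (hypothetical table and column; both return the same rows, but only the second can seek an index on a):
code:
-- non-sargable: the function wrapped around the column defeats an index on a
SELECT order_id FROM dbo.Orders WHERE COALESCE(a, 2) = 2;

-- sargable rewrite: the optimizer can seek an index on a for each branch of the OR
SELECT order_id FROM dbo.Orders WHERE (a = 2 OR a IS NULL);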

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


Vegetable posted:

In my interview today for a data analyst position, I was asked how I improve the efficiency of my queries.

I actually make no conscious attempt to improve efficiency, but I enumerated these things:

- Subqueries before joining: Avoid joining big tables to each other
- Grouping sets: Using functions like ROLLUP and GROUPING SETS instead of UNION-ing a bunch of different queries
- Window functions: I just assume these are more efficient than the alternative
- Integers rather than strings where possible: If you ever need to create a column to tag rows based on certain conditions, use integers rather than strings cuz they’re like smaller.

Not sure if these are universally true. But I’ve faced these efficiency questions a couple of times.

Would be super cool if someone could share some general tips on efficiency. I’ve read about how using temporary tables can help, but I don’t recall any of it.

Normalize your data.
When necessary, use supporting indexes on foreign key columns.
Use integer keys.
Have your PK and clustered index be defined over the same set of columns.
Actually maintain your indexes.
Use parameterized stored procedures and, if there are wide variances in performance based on the parameters coming in, use OPTIMIZE FOR UNKNOWN to nudge the query optimizer toward a generally good plan.
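
A sketch of that last hint in context (proc and table names made up):
code:
CREATE PROCEDURE dbo.GetOrdersForCustomer
    @CustomerId int
AS
BEGIN
    SELECT order_id, order_date, total
    FROM   dbo.Orders
    WHERE  customer_id = @CustomerId
    OPTION (OPTIMIZE FOR UNKNOWN);  -- plan costed from average density instead of the first sniffed parameter value
END;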

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Nth Doctor posted:

Normalize your data.
When necessary, use supporting indexes on foreign key columns.
Use integer keys.
Have your PK and clustered index be defined over the same set of columns.
Actually maintain your indexes.
Use parameterized stored procedures and, if there are wide variances in performance based on the parameters coming in, use OPTIMIZE FOR UNKNOWN to nudge the query optimizer toward a generally good plan.

Yeah, completely different answers depending on whether you're actually in charge of the data structure or just writing queries against it. If you actually own the data structure there is a boatload of stuff you can do to help performance. My post was mostly targeted at query writers with no control over the schema.

ElehemEare
May 20, 2001
I am an omnipotent penguin.

Ruggan posted:

Temp tables are good in many situations because they are essentially a way to pre-cache a partial dataset. This gives the optimizer the benefit of improved estimates and table statistics for following code. I often use temp tables in ETL processes in lieu of a staging table, when I know my dataset isn't huge. Do not use temp tables in a high-volume transactional database - it won't work anywhere near as well (I learned this the hard way).

Worked with someone who made the blanket assertion that temp tables are bad because recomps.

He’d then proceed to load up 100k records into a table variable and couldn’t understand why the plans were so bad.

Nth Doctor
Sep 7, 2010

Darkrai used Dream Eater!
It's super effective!


ElehemEare posted:

Worked with someone who made the blanket assertion that temp tables are bad because recomps.

He’d then proceed to load up 100k records into a table variable and couldn’t understand why the plans were so bad.

Temp tables can trigger lots of recomps, but table variables with more than ~25 records are way worse for it, last time I did some experimental verification.
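
For anyone following along, the two constructs being compared (a sketch; the one-row estimate applies before SQL Server 2019's deferred compilation):
code:
-- temp table: lives in tempdb with real statistics, so big row counts still get sane plans,
-- but statistics changes can trigger recompiles
CREATE TABLE #staging (id int PRIMARY KEY, val varchar(50));

-- table variable: no statistics; the optimizer historically assumes ~1 row,
-- which is why stuffing 100k rows into one produces terrible plans
DECLARE @staging TABLE (id int PRIMARY KEY, val varchar(50));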

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

I've seen similar things myself, but it seems odd that temp tables and table variables should have different runtime implementations. They feel like they should be just two language-level constructs that compile down to the same code.

amethystdragon
Sep 14, 2019

Nth Doctor posted:

I think the proliferation of ORMs and NoSQL implies a general discomfort with SQL.

I think that's a different set of people: one group who are confused about what a relational database is other than a large Excel spreadsheet, and the other who just wanna store everything and worry about figuring out how to read it later.

Paper With Lines
Aug 21, 2013

The snozzberries taste like snozzberries!

Ruggan posted:

You should always write your code with an eye to efficiency, but preemptive over-optimization is bad.

Regarding your notes, I guess it depends on the db you're on, but for MS SQL:
- Subqueries before joins doesn't make a difference if it compiles to the same query plan. This only makes sense to me when you're doing some complex logic in the subquery pre-join.
- Window functions are actually less efficient than a simple group-by in many situations (e.g. FIRST_VALUE vs join to self-subquery).

Temp tables are good in many situations because they are essentially a way to pre-cache a partial dataset. This gives the optimizer the benefit of improved estimates and table statistics for following code. I often use temp tables in ETL processes in lieu of a staging table, when I know my dataset isn't huge. Do not use temp tables in a high-volume transactional database - it won't work anywhere near as well (I learned this the hard way).

In general the typical advice applies: don't fetch more data than you need, and take advantage of indexes for joins and where clauses wherever possible.

Some other tips
- don't use calculated values in joins and where clauses unless you can't avoid it as you lose the benefit of indexes in many cases (e.g. ON COALESCE(a, b) = c)
- avoid function calls in joins and where clauses in favor of the longer but more efficient form (e.g. use WHERE (a = 2 OR a IS NULL) instead of WHERE COALESCE(a, 2) = 2)
- if you're using a cursor or doing any programmatic logic (IF ... ELSE) you're probably doing something wrong unless you really absolutely know what you're doing
- don't APPLY functions unless you absolutely know what you're doing, as these run on every row of your dataset (they aren't set based operations) - there is often a better way to approach whatever problem you have
- avoid RIGHT JOINs for readability

This is really helpful. I write SQL as a means to an end, as in, get data out and sometimes write ETL code to put data in. Thanks.

On the right join thing, I will usually prototype all my joins as left joins and then flip it to a right join to check things out. Is this dumb af? I guess I could do a full join for this and get the best of both worlds and then switch it out at the end.

Hughmoris
Apr 21, 2007
Let's go to the abyss!
Brent Ozar, a SQL trainer/consultant guy, has made some of his course videos free during this madness. I have zero affiliation with this guy, but I find his teaching style helpful for breaking down concepts and I wanted to share the free videos in case they're helpful for others. The free videos go away April 1st.

How to Think Like The Engine (3 videos): https://www.brentozar.com/training/think-like-sql-server-engine/
- I walked through this last night and it helped me better understand clustered vs. non-clustered indexes.

Fundamentals of Index Tuning: https://www.brentozar.com/archive/2020/02/free-fundamentals-of-index-tuning-week-part-1-indexing-for-the-where-clause/
- Just started on this

Fundamentals of Query Tuning: https://www.brentozar.com/archive/2020/03/free-fundamentals-of-query-tuning-week-part-1-building-a-query-plan/
- Will hopefully start soon

TheFluff
Dec 13, 2006

FRIENDS, LISTEN TO ME
I AM A SEAGULL
OF WEALTH AND TASTE
TIL the hard way that in postgres, attempting to drop an empty table with foreign key constraints on it will in fact result in an exclusive lock on all of the referenced tables, and so it can and will deadlock on you.

thankfully it didn't cause any major problems. lesson learned I guess, better to drop the constraints one by one first
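
A sketch of that workaround (constraint names invented); doing each drop as its own short statement keeps the window where everything is locked at once small:
code:
-- drop the foreign keys first, one statement at a time...
ALTER TABLE stale_table DROP CONSTRAINT stale_table_user_id_fkey;
ALTER TABLE stale_table DROP CONSTRAINT stale_table_order_id_fkey;

-- ...then the table drop no longer needs to touch the referenced tables
DROP TABLE stale_table;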

kloa
Feb 14, 2007


Brent Ozar is p legit. I've learned some nuggets of wisdom from their blog posts over the years.

e: guess the oldie thread is probably more apt for career choices

kloa fucked around with this message at 21:40 on Mar 19, 2020

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

I've got a large table in Postgres with a UUID primary key.

I'd like to pick a record at random, without doing a full table scan (so not using the naive ORDER BY RANDOM()).

If I had a sequential indexed column in there, I could just pick a random number between 1 and MAX(index_column). But with UUIDs that doesn't work, they're not sequential or ordered to begin with.

Can anyone think of a solution that doesn't involve adding that extra pointless column? It feels like, given that I don't care about the random key being reliable or consistent, there might be some kind of native row ID I should be able to query.

edit: TABLESAMPLE SYSTEM is very useful and very close to what I want, but it can only be executed before any WHERE filter (and doesn't seem to work with subqueries), which is a problem.

edit2: beaten :)

NihilCredo fucked around with this message at 10:14 on Mar 20, 2020

redleader
Aug 18, 2005

Engage according to operational parameters
give TABLESAMPLE a shot. you should be able to use the postgres equivalent of SELECT TOP 1 to get a single random row
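
something like this, roughly (assumes the table is called foo; SYSTEM takes a percentage of pages, and the bundled tsm_system_rows extension lets you ask for a row count instead):
code:
-- ~1% of pages picked at random, then keep one row
SELECT * FROM foo TABLESAMPLE SYSTEM (1) LIMIT 1;

-- or, with tsm_system_rows, exactly one block-sampled row
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
SELECT * FROM foo TABLESAMPLE SYSTEM_ROWS(1);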

Jabor
Jul 16, 2010

#1 Loser at SpaceChem

NihilCredo posted:

but it can only be executed before any WHERE filter

Uh, how would your proposed synthetic incremental id work in this case? As soon as you filter, you have holes, so there's no guarantee that whatever random number you've selected is still in the set.


NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

Jabor posted:

Uh, how would your proposed synthetic incremental id work in this case? As soon as you filter, you have holes, so there's no guarantee that whatever random number you've selected is still in the set.

This is what I'm thinking:

a) SELECT * FROM foo WHERE predicate LIMIT 1

Scans the whole table from the start, upon meeting a matching record stops early and returns it.

b) SELECT * FROM foo WHERE predicate ORDER BY RANDOM() LIMIT 1

Scans the whole table from the start, gathers up all matching records, random sorts them and returns the first

c) <query that I want>

Scans the whole table starting from a random initial point (and rolling over), upon meeting a matching record stops early and returns it.

NihilCredo fucked around with this message at 10:57 on Mar 20, 2020
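
One way to approximate (c) with a plain UUID primary key is to start the index scan at a random point in UUID space and walk forward. A sketch only: gen_random_uuid() is built in from Postgres 13 (pgcrypto before that), the result isn't perfectly uniform over matching rows, and you need a second "wrap around" query when the random starting point lands past the last match; "predicate" is the placeholder from the post above.
code:
-- first pass: from a random point forward
SELECT *
FROM   foo
WHERE  id >= gen_random_uuid()
AND    predicate
ORDER  BY id
LIMIT  1;

-- if that returns nothing, wrap around and take the first match from the start
SELECT *
FROM   foo
WHERE  predicate
ORDER  BY id
LIMIT  1;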
